splito.simpd
splito.simpd.SIMPDSplitter
Bases: BaseShuffleSplit
SIMPD (SImulated Medicinal chemistry Project Data) uses a multi-objective genetic algorithm (MOGA) to split a set of compounds with bioactivity data into one or more training and test sets. These sets differ from each other in ways that resemble the differences between the temporal training/test splits observed in medicinal chemistry projects.
This class implements the method proposed in "SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches", available at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189.
The source code is largely inspired by the original authors' implementation, available at https://github.com/rinikerlab/molecular_time_series/tree/55eb420ab0319fbb18cc00fe62a872ac568ad7f5.
__init__
__init__(
    n_splits: int = 1,
    pop_size: int = 500,
    ngens: int = 100,
    swap_fraction: float = 0.1,
    simpd_descriptors: Optional[pd.DataFrame] = None,
    target_train_frac_active: float = -1,
    target_test_frac_active: float = -1,
    target_test_set_frac: float = 0.2,
    target_delta_test_frac_active: Optional[float] = None,
    target_GF_delta_window: Tuple[int, int] = (10, 30),
    target_G_val: int = 70,
    max_population_cluster_entropy: float = 0.9,
    pareto_weight_GF_delta: float = 10,
    pareto_weight_G: float = 5,
    num_threads: int = 1,
    random_seed: Optional[int] = 19,
    verbose: bool = True,
    verbose_pymoo: bool = True,
    progress: bool = True,
    progress_leave: bool = False,
)
Creates the splitter object.
Refer to the original paper for more details on the parameters.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| n_splits | int | Number of splits to generate. | 1 | 
| pop_size | int | The population size for the GA. | 500 | 
| ngens | int | The number of generations for the GA. | 100 | 
| swap_fraction | float | The swap fraction for the GA. Swap N% of the bits in each mutation. | 0.1 | 
| simpd_descriptors | Optional[DataFrame] | The descriptors to use for the GA. If None, the default descriptors from the paper are used; see splito.simpd.DEFAULT_SIMPD_DESCRIPTORS. | None | 
| target_train_frac_active | float | The target fraction of active compounds in the training set. Set to -1 to disable. | -1 | 
| target_test_frac_active | float | The target fraction of active compounds in the test set. Set to -1 to disable. | -1 | 
| target_test_set_frac | float | The target fraction of the data assigned to the test set. | 0.2 | 
| target_delta_test_frac_active | Optional[float] | The target difference in the fraction of active compounds between the test and training sets. | None | 
| target_GF_delta_window | Tuple[int, int] | The target window for the GF delta. | (10, 30) | 
| target_G_val | int | The target G value. | 70 | 
| max_population_cluster_entropy | float | The maximum cluster entropy. | 0.9 | 
| pareto_weight_GF_delta | float | The weight for the GF delta. | 10 | 
| pareto_weight_G | float | The weight for the G value. | 5 | 
| num_threads | int | The number of threads to use for the GA. | 1 | 
| random_seed | Optional[int] | The random seed to use for the GA. | 19 | 
| verbose | bool | Whether to print information about the splitter. | True | 
| verbose_pymoo | bool | Whether to print information about the GA. | True | 
| progress | bool | Whether to display a progress bar. | True | 
| progress_leave | bool | Whether to leave the progress bar after completion. | False | 
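As a rough illustration of what the fraction-based targets in the table refer to, the sketch below computes the corresponding quantities for a given binary train/test assignment. It is not the library's objective code: the helper name `split_fractions` is hypothetical, and the sign convention for the delta (test minus train) is an assumption made for this example.

```python
from typing import Dict, Sequence


def split_fractions(active: Sequence[bool], in_test: Sequence[bool]) -> Dict[str, float]:
    """Summarize a train/test assignment (hypothetical helper, not part of splito)."""
    if len(active) != len(in_test):
        raise ValueError("active and in_test must have the same length")

    # Partition the activity labels according to the test mask.
    test = [a for a, t in zip(active, in_test) if t]
    train = [a for a, t in zip(active, in_test) if not t]
    if not test or not train:
        raise ValueError("both the training and the test set must be non-empty")

    train_frac_active = sum(train) / len(train)
    test_frac_active = sum(test) / len(test)
    return {
        "test_set_frac": len(test) / len(active),
        "train_frac_active": train_frac_active,
        "test_frac_active": test_frac_active,
        # Assumed sign convention: test minus train.
        "delta_test_frac_active": test_frac_active - train_frac_active,
    }


# Ten compounds, the last two assigned to the test set.
active = [True, True, False, False, True, False, False, False, True, False]
in_test = [False] * 8 + [True] * 2
stats = split_fractions(active, in_test)
print(stats)
```

Comparing such statistics against the requested targets is one way to sanity-check a split after it has been generated.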
fit
Fit the splitter against a given dataset.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| X | ndarray | An array of molecules. | required | 
| y | ndarray | An array of activities (only 1D is supported). | required | 
| groups | Optional[ndarray] | An array of groups. | None | 
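A minimal usage sketch follows. It assumes that splito and NumPy are installed and that the scikit-learn style `split` generator inherited from BaseShuffleSplit yields `(train, test)` index arrays once the splitter has been fitted; neither is confirmed on this page. The wrapper name `simpd_train_test_indices` is hypothetical, and the reduced `pop_size` and `ngens` values are chosen only to keep a trial run short.

```python
from typing import Any, Sequence, Tuple


def simpd_train_test_indices(
    molecules: Sequence[Any], activities: Sequence[int]
) -> Tuple[Any, Any]:
    """Fit a SIMPDSplitter and return the first (train, test) index pair.

    Hypothetical wrapper for illustration; not part of splito.
    """
    # Imported lazily so this module can be loaded without splito installed.
    import numpy as np
    from splito.simpd import SIMPDSplitter

    X = np.asarray(molecules, dtype=object)
    y = np.asarray(activities)
    if y.ndim != 1:
        # The documentation states that only 1D activities are supported.
        raise ValueError("activities must be one-dimensional")

    splitter = SIMPDSplitter(
        n_splits=1,
        pop_size=100,  # smaller than the default of 500, for a quick trial
        ngens=10,  # fewer than the default of 100, for a quick trial
        random_seed=19,
        verbose=False,
        verbose_pymoo=False,
        progress=False,
    )
    splitter.fit(X, y)

    # Assumption: split() follows the scikit-learn splitter interface.
    train_idx, test_idx = next(splitter.split(X, y))
    return train_idx, test_idx
```

For real use, the default `pop_size` and `ngens` are likely a better starting point than the shortened values shown here.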
splito.simpd.run_SIMPD
run_SIMPD(
    data: pd.DataFrame,
    mol_column: str = "mol",
    activity_column: str = "active",
    pop_size: int = 500,
    ngens: int = 100,
    swap_fraction: float = 0.1,
    simpd_descriptors: Optional[pd.DataFrame] = None,
    target_train_frac_active: float = -1,
    target_test_frac_active: float = -1,
    target_test_set_frac: float = 0.2,
    target_delta_test_frac_active: Optional[float] = None,
    target_GF_delta_window: Tuple[int, int] = (10, 30),
    target_G_val: int = 70,
    max_population_cluster_entropy: float = 0.9,
    pareto_weight_GF_delta: float = 10,
    pareto_weight_G: float = 5,
    num_threads: int = 1,
    random_seed: Optional[int] = 19,
    verbose: bool = True,
    verbose_pymoo: bool = True,
    progress: bool = True,
    progress_leave: bool = False,
)
splito.simpd.DEFAULT_SIMPD_DESCRIPTORS
module-attribute
DEFAULT_SIMPD_DESCRIPTORS = pd.DataFrame(
    [
        {
            "name": "SA_Score",
            "function": "datamol.descriptors.sas",
            "target_delta_value": 0.1 * 2.8,
        },
        {
            "name": "HeavyAtomCount",
            "function": "datamol.descriptors.n_heavy_atoms",
            "target_delta_value": 0.1 * 31,
        },
        {
            "name": "TPSA",
            "function": "datamol.descriptors.tpsa",
            "target_delta_value": 0.15 * 88.0,
        },
        {
            "name": "fr_benzene/1000 HeavyAtoms",
            "function": "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
            "target_delta_value": -0.2 * 0.44,
        },
    ]
)
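Each `target_delta_value` above is written as a product rather than as a literal. The sketch below evaluates those products in plain Python so the numeric targets are visible. The list `descriptor_targets` is a hypothetical stand-in for the DataFrame rows, with the two factors copied from the definition above; what each factor represents is not stated on this page.

```python
# (name, first factor, second factor) copied from DEFAULT_SIMPD_DESCRIPTORS above.
descriptor_targets = [
    ("SA_Score", 0.1, 2.8),
    ("HeavyAtomCount", 0.1, 31),
    ("TPSA", 0.15, 88.0),
    ("fr_benzene/1000 HeavyAtoms", -0.2, 0.44),
]

# Evaluate each product to obtain the numeric target delta per descriptor.
target_delta_values = {name: a * b for name, a, b in descriptor_targets}

for name, value in target_delta_values.items():
    print(f"{name}: {value:.3f}")
```

Note that the benzene-ring descriptor is the only one with a negative target delta.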