`splito.simpd`

splito.simpd.SIMPDSplitter

Bases: BaseShuffleSplit

SIMPD (SImulated Medicinal chemistry Project Data) uses a multi-objective genetic algorithm (MOGA) to split a set of compounds with bioactivity data into one or more training and test sets. These sets differ from each other in ways that resemble the differences between the temporal training/test splits observed in medicinal chemistry projects.

This is an implementation of the algorithm proposed in "SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches", available at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189.

The source code is largely inspired by the original authors' implementation, available at https://github.com/rinikerlab/molecular_time_series/tree/55eb420ab0319fbb18cc00fe62a872ac568ad7f5.

__init__

__init__(
    n_splits: int = 1,
    pop_size: int = 500,
    ngens: int = 100,
    swap_fraction: float = 0.1,
    simpd_descriptors: Optional[pd.DataFrame] = None,
    target_train_frac_active: float = -1,
    target_test_frac_active: float = -1,
    target_test_set_frac: float = 0.2,
    target_delta_test_frac_active: Optional[float] = None,
    target_GF_delta_window: Tuple[int, int] = (10, 30),
    target_G_val: int = 70,
    max_population_cluster_entropy: float = 0.9,
    pareto_weight_GF_delta: float = 10,
    pareto_weight_G: float = 5,
    num_threads: int = 1,
    random_seed: Optional[int] = 19,
    verbose: bool = True,
    verbose_pymoo: bool = True,
    progress: bool = True,
    progress_leave: bool = False,
)

Creates the splitter object.

Refer to the original paper for more details on the parameters.

Parameters:

Name	Type	Description	Default
`n_splits`	`int`	Number of splits to generate.	`1`
`pop_size`	`int`	The population size for the GA.	`500`
`ngens`	`int`	The number of generations for the GA.	`100`
`swap_fraction`	`float`	The swap fraction for the GA. Swap N% of the bits in each mutation.	`0.1`
`simpd_descriptors`	`Optional[DataFrame]`	The descriptors to use for the GA. If None, the default descriptors from the paper are used; these are available as `splito.simpd.DEFAULT_SIMPD_DESCRIPTORS`.	`None`
`target_train_frac_active`	`float`	The target fraction of active compounds in the training set. Set to -1 to disable.	`-1`
`target_test_frac_active`	`float`	The target fraction of active compounds in the test set. Set to -1 to disable.	`-1`
`target_test_set_frac`	`float`	The target fraction of the dataset to assign to the test set.	`0.2`
`target_delta_test_frac_active`	`Optional[float]`	The target difference in the fraction of active compounds between the test and training sets.	`None`
`target_GF_delta_window`	`Tuple[int, int]`	The target window for the GF delta.	`(10, 30)`
`target_G_val`	`int`	The target G value.	`70`
`max_population_cluster_entropy`	`float`	The maximum cluster entropy.	`0.9`
`pareto_weight_GF_delta`	`float`	The weight for the GF delta.	`10`
`pareto_weight_G`	`float`	The weight for the G value.	`5`
`num_threads`	`int`	The number of threads to use for the GA.	`1`
`random_seed`	`Optional[int]`	The random seed to use for the GA.	`19`
`verbose`	`bool`	Whether to print information about the splitter.	`True`
`verbose_pymoo`	`bool`	Whether to print information about the GA.	`True`
`progress`	`bool`	Whether to display a progress bar.	`True`
`progress_leave`	`bool`	Whether to leave the progress bar after completion.	`False`
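
To make `target_test_set_frac`, the active-fraction targets, and `swap_fraction` more concrete, here is a minimal standard-library sketch. It is not the operator used by splito or pymoo. The helper names `frac_active` and `swap_mutation` and the toy activities are hypothetical, and the sketch assumes that the swap fraction is taken relative to the test-set size.

```python
import random


def frac_active(activities, mask, in_test):
    """Fraction of active compounds among those whose mask bit equals in_test."""
    selected = [a for a, m in zip(activities, mask) if m == in_test]
    return sum(selected) / len(selected)


def swap_mutation(mask, swap_fraction, rng):
    """Move a fraction of test compounds to training and as many back.

    Illustrative only: this is an assumed reading of "swap N% of the bits",
    not the mutation implemented by splito or pymoo.
    """
    test_idx = [i for i, m in enumerate(mask) if m]
    train_idx = [i for i, m in enumerate(mask) if not m]
    # Swap at least one pair, and never more than either side can provide.
    n_swap = max(1, int(swap_fraction * len(test_idx)))
    n_swap = min(n_swap, len(test_idx), len(train_idx))
    mutated = list(mask)
    for i in rng.sample(test_idx, n_swap):
        mutated[i] = False
    for i in rng.sample(train_idx, n_swap):
        mutated[i] = True
    return mutated


# Toy data (made up): 20 compounds, 1 = active, 0 = inactive.
activities = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]

# target_test_set_frac = 0.2 corresponds to 4 of the 20 compounds.
n_test = round(0.2 * len(activities))

# A candidate split is a boolean mask: True = test set, False = training set.
rng = random.Random(19)
mask = [False] * len(activities)
for i in rng.sample(range(len(activities)), n_test):
    mask[i] = True

mutated = swap_mutation(mask, swap_fraction=0.1, rng=rng)

# The quantity that target_delta_test_frac_active constrains, as read here:
# fraction of actives in the test set minus that in the training set.
delta = frac_active(activities, mutated, True) - frac_active(activities, mutated, False)
print(f"test size: {sum(mutated)}, delta in fraction active: {delta:+.3f}")
```

Because each mutation swaps equal numbers of bits in both directions, the test-set size stays fixed while its composition changes.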

fit

fit(X: np.ndarray, y: np.ndarray, groups: Optional[np.ndarray] = None)

Fit the splitter against a given dataset.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	An array of molecules.	required
`y`	`ndarray`	An array of activities (only 1D is supported).	required
`groups`	`Optional[ndarray]`	An array of groups.	`None`

splito.simpd.run_SIMPD

run_SIMPD(
    data: pd.DataFrame,
    mol_column: str = "mol",
    activity_column: str = "active",
    pop_size: int = 500,
    ngens: int = 100,
    swap_fraction: float = 0.1,
    simpd_descriptors: Optional[pd.DataFrame] = None,
    target_train_frac_active: float = -1,
    target_test_frac_active: float = -1,
    target_test_set_frac: float = 0.2,
    target_delta_test_frac_active: Optional[float] = None,
    target_GF_delta_window: Tuple[int, int] = (10, 30),
    target_G_val: int = 70,
    max_population_cluster_entropy: float = 0.9,
    pareto_weight_GF_delta: float = 10,
    pareto_weight_G: float = 5,
    num_threads: int = 1,
    random_seed: Optional[int] = 19,
    verbose: bool = True,
    verbose_pymoo: bool = True,
    progress: bool = True,
    progress_leave: bool = False,
)

splito.simpd.DEFAULT_SIMPD_DESCRIPTORS `module-attribute`

DEFAULT_SIMPD_DESCRIPTORS = pd.DataFrame(
    [
        {
            "name": "SA_Score",
            "function": "datamol.descriptors.sas",
            "target_delta_value": 0.1 * 2.8,
        },
        {
            "name": "HeavyAtomCount",
            "function": "datamol.descriptors.n_heavy_atoms",
            "target_delta_value": 0.1 * 31,
        },
        {
            "name": "TPSA",
            "function": "datamol.descriptors.tpsa",
            "target_delta_value": 0.15 * 88.0,
        },
        {
            "name": "fr_benzene/1000 HeavyAtoms",
            "function": "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
            "target_delta_value": -0.2 * 0.44,
        },
    ]
)
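
The following sketch evaluates the `target_delta_value` expressions above and shows one way a dotted `function` path could be turned into a callable. It uses a plain list of dicts instead of a DataFrame so that it runs with only the standard library. The `resolve` helper is hypothetical, and whether splito resolves these paths this way is an assumption.

```python
import importlib
import math

# Same rows as the default descriptor table, as plain dicts.
descriptors = [
    {"name": "SA_Score",
     "function": "datamol.descriptors.sas",
     "target_delta_value": 0.1 * 2.8},
    {"name": "HeavyAtomCount",
     "function": "datamol.descriptors.n_heavy_atoms",
     "target_delta_value": 0.1 * 31},
    {"name": "TPSA",
     "function": "datamol.descriptors.tpsa",
     "target_delta_value": 0.15 * 88.0},
    {"name": "fr_benzene/1000 HeavyAtoms",
     "function": "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
     "target_delta_value": -0.2 * 0.44},
]


def resolve(dotted_path):
    """Import 'package.module.attr' and return the attribute (hypothetical helper)."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# Print each target delta, rounded to three decimals.
for d in descriptors:
    print(f"{d['name']}: {d['target_delta_value']:+.3f}")
# SA_Score: +0.280
# HeavyAtomCount: +3.100
# TPSA: +13.200
# fr_benzene/1000 HeavyAtoms: -0.088

# Demonstrated on a standard-library path so the sketch runs without datamol.
sqrt = resolve("math.sqrt")
print(sqrt(16.0))  # 4.0
```

A custom table passed as `simpd_descriptors` would presumably follow the same three-column layout (`name`, `function`, `target_delta_value`).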
