splito.simpd

splito.simpd.SIMPDSplitter

Bases: BaseShuffleSplit

SIMPD (SImulated Medicinal chemistry Project Data) uses a multi-objective genetic algorithm (MOGA) to split a set of compounds with bioactivity data into one or more training and test sets. These sets differ from each other in ways that resemble the differences between the temporal training/test splits observed in medicinal chemistry projects.

This class implements the algorithm proposed in "SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches", available at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189.

The source code was largely inspired by the original authors' implementation, available at https://github.com/rinikerlab/molecular_time_series/tree/55eb420ab0319fbb18cc00fe62a872ac568ad7f5.

fit

fit(X: np.ndarray, y: np.ndarray, groups: Optional[np.ndarray] = None)

Fit the splitter against a given dataset.

Parameters:

Name    Type               Description                                     Default
X       ndarray            An array of molecules.                          required
y       ndarray            An array of activities (only 1D is supported).  required
groups  Optional[ndarray]  An array of groups.                             None

splito.simpd.run_SIMPD

run_SIMPD(
    data: pd.DataFrame,
    mol_column: str = "mol",
    activity_column: str = "active",
    pop_size: int = 500,
    ngens: int = 100,
    swap_fraction: float = 0.1,
    simpd_descriptors: Optional[pd.DataFrame] = None,
    target_train_frac_active: float = -1,
    target_test_frac_active: float = -1,
    target_test_set_frac: float = 0.2,
    target_delta_test_frac_active: Optional[float] = None,
    target_GF_delta_window: Tuple[int, int] = (10, 30),
    target_G_val: int = 70,
    max_population_cluster_entropy: float = 0.9,
    pareto_weight_GF_delta: float = 10,
    pareto_weight_G: float = 5,
    num_threads: int = 1,
    random_seed: Optional[int] = 19,
    verbose: bool = True,
    verbose_pymoo: bool = True,
    progress: bool = True,
    progress_leave: bool = False,
)
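
Several parameters above (target_test_set_frac, target_train_frac_active, target_test_frac_active, target_delta_test_frac_active) are named after quantities that can be measured on any candidate train/test split. The sketch below computes those quantities for a toy split, using only the standard library. It is an illustration under assumptions, not splito's code: the helper split_summary and its dictionary keys are hypothetical, and how run_SIMPD actually scores or combines these values is not described on this page.

```python
from typing import Dict, Sequence


def split_summary(active: Sequence[bool], is_test: Sequence[bool]) -> Dict[str, float]:
    """Summarise a candidate train/test split of binary activity labels.

    Hypothetical helper for illustration only; it is not part of splito.
    """
    if len(active) != len(is_test):
        raise ValueError("active and is_test must have the same length")

    # Partition the activity labels according to the test-set mask.
    test = [a for a, t in zip(active, is_test) if t]
    train = [a for a, t in zip(active, is_test) if not t]
    if not test or not train:
        raise ValueError("both the training and the test set must be non-empty")

    train_frac_active = sum(train) / len(train)
    test_frac_active = sum(test) / len(test)
    return {
        "test_set_frac": len(test) / len(active),
        "train_frac_active": train_frac_active,
        "test_frac_active": test_frac_active,
        "delta_test_frac_active": test_frac_active - train_frac_active,
    }


# Toy data: 10 compounds, 4 of them active; the last 2 form the test set.
active = [True, False, False, True, False, False, True, False, True, False]
is_test = [False] * 8 + [True] * 2
summary = split_summary(active, is_test)
```

For this toy split the test set holds 2 of 10 compounds, which matches the default target_test_set_frac of 0.2. A search procedure could compare such measured values against the requested targets, but the exact objective functions are an assumption here.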

splito.simpd.DEFAULT_SIMPD_DESCRIPTORS module-attribute

DEFAULT_SIMPD_DESCRIPTORS = DataFrame(
    [
        {
            "name": "SA_Score",
            "function": "datamol.descriptors.sas",
            "target_delta_value": 0.1 * 2.8,
        },
        {
            "name": "HeavyAtomCount",
            "function": "datamol.descriptors.n_heavy_atoms",
            "target_delta_value": 0.1 * 31,
        },
        {
            "name": "TPSA",
            "function": "datamol.descriptors.tpsa",
            "target_delta_value": 0.15 * 88.0,
        },
        {
            "name": "fr_benzene/1000 HeavyAtoms",
            "function": "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
            "target_delta_value": -0.2 * 0.44,
        },
    ]
)
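
Each row of the table stores its descriptor as a dotted-path string in the "function" column, and each target_delta_value is written as a product of two literals. The sketch below shows one common way such a dotted path can be turned into a callable with importlib, and evaluates the four products. Whether splito resolves the strings this way is an assumption; resolve_dotted_path is a hypothetical helper, and math.sqrt stands in for the real descriptor functions so the example runs without datamol or splito installed.

```python
import importlib
from typing import Any, Callable, Dict


def resolve_dotted_path(path: str) -> Callable[..., Any]:
    """Import the module part of ``path`` and return the named attribute.

    Hypothetical helper for illustration only; it is not part of splito.
    """
    module_name, _, attr_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Stand-in path from the standard library; the real table uses paths
# such as "datamol.descriptors.tpsa".
sqrt = resolve_dotted_path("math.sqrt")
root = sqrt(16.0)

# The target_delta_value products from the table, rounded to three
# decimals to hide floating-point noise.
target_deltas: Dict[str, float] = {
    "SA_Score": round(0.1 * 2.8, 3),
    "HeavyAtomCount": round(0.1 * 31, 3),
    "TPSA": round(0.15 * 88.0, 3),
    "fr_benzene/1000 HeavyAtoms": round(-0.2 * 0.44, 3),
}
```

Note that the last target is negative, while the other three are positive.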