splito.simpd
splito.simpd.SIMPDSplitter
Bases: BaseShuffleSplit
SIMPD (SImulated Medicinal chemistry Project Data) uses a multi-objective genetic algorithm (MOGA) to split a set of compounds with bioactivity data into one or more training and test sets. The resulting sets differ from each other in ways that resemble the differences between temporal training/test splits observed in medicinal chemistry projects.
This splitter implements the algorithm proposed in "SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches", available at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189.
The source code is largely inspired by the original authors' implementation, available at https://github.com/rinikerlab/molecular_time_series/tree/55eb420ab0319fbb18cc00fe62a872ac568ad7f5.
fit
Fit the splitter against a given dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | An array of molecules. | required |
y | ndarray | An array of activities (only 1D is supported). | required |
groups | Optional[ndarray] | An array of groups. | None |
splito.simpd.run_SIMPD
run_SIMPD(
    data: pd.DataFrame,
    mol_column: str = "mol",
    activity_column: str = "active",
    pop_size: int = 500,
    ngens: int = 100,
    swap_fraction: float = 0.1,
    simpd_descriptors: Optional[pd.DataFrame] = None,
    target_train_frac_active: float = -1,
    target_test_frac_active: float = -1,
    target_test_set_frac: float = 0.2,
    target_delta_test_frac_active: Optional[float] = None,
    target_GF_delta_window: Tuple[int, int] = (10, 30),
    target_G_val: int = 70,
    max_population_cluster_entropy: float = 0.9,
    pareto_weight_GF_delta: float = 10,
    pareto_weight_G: float = 5,
    num_threads: int = 1,
    random_seed: Optional[int] = 19,
    verbose: bool = True,
    verbose_pymoo: bool = True,
    progress: bool = True,
    progress_leave: bool = False,
)
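Several parameters in the signature above are named after properties of a train/test split (`target_test_set_frac`, `target_train_frac_active`, `target_test_frac_active`, `target_delta_test_frac_active`). The following is a minimal sketch of how such quantities could be computed for one candidate split of binary activity labels. It does not call splito: `split_summary` is a hypothetical helper written for illustration, and the sign convention for the delta (test minus train) is an assumption, since this page does not state it.

```python
from typing import Dict, Sequence


def split_summary(active: Sequence[bool], test_idx: Sequence[int]) -> Dict[str, float]:
    """Summarise one candidate train/test split of binary activity labels.

    Hypothetical helper for illustration only; not part of splito.
    """
    test = set(test_idx)
    test_labels = [bool(a) for i, a in enumerate(active) if i in test]
    train_labels = [bool(a) for i, a in enumerate(active) if i not in test]
    if not test_labels or not train_labels:
        raise ValueError("Both the training and the test set must be non-empty.")

    train_frac_active = sum(train_labels) / len(train_labels)
    test_frac_active = sum(test_labels) / len(test_labels)
    return {
        # Share of all compounds that ends up in the test set.
        "test_set_frac": len(test_labels) / len(active),
        # Share of active compounds within each set.
        "train_frac_active": train_frac_active,
        "test_frac_active": test_frac_active,
        # Assumption: the delta is taken as test minus train.
        "delta_test_frac_active": test_frac_active - train_frac_active,
    }


# Ten compounds, four of them active; the first two form the test set.
labels = [True, False, True, False, False, True, False, False, True, False]
summary = split_summary(labels, test_idx=[0, 1])
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

In this toy split the test set holds 20% of the compounds, which matches the default `target_test_set_frac` of 0.2 shown in the signature.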
splito.simpd.DEFAULT_SIMPD_DESCRIPTORS
module-attribute
DEFAULT_SIMPD_DESCRIPTORS = DataFrame(
    [
        {
            "name": "SA_Score",
            "function": "datamol.descriptors.sas",
            "target_delta_value": 0.1 * 2.8,
        },
        {
            "name": "HeavyAtomCount",
            "function": "datamol.descriptors.n_heavy_atoms",
            "target_delta_value": 0.1 * 31,
        },
        {
            "name": "TPSA",
            "function": "datamol.descriptors.tpsa",
            "target_delta_value": 0.15 * 88.0,
        },
        {
            "name": "fr_benzene/1000 HeavyAtoms",
            "function": "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
            "target_delta_value": -0.2 * 0.44,
        },
    ]
)
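Each `target_delta_value` above is written as a product of two factors rather than as a single number. As a quick arithmetic check, the sketch below evaluates those products in plain Python. The `target_deltas` dictionary is a hypothetical restatement of the values listed above and is not read from splito itself.

```python
# Hypothetical restatement of the target_delta_value expressions listed above.
target_deltas = {
    "SA_Score": 0.1 * 2.8,
    "HeavyAtomCount": 0.1 * 31,
    "TPSA": 0.15 * 88.0,
    "fr_benzene/1000 HeavyAtoms": -0.2 * 0.44,
}

# Print each product with an explicit sign so the direction of the target is visible.
for name, value in target_deltas.items():
    print(f"{name}: {value:+.3f}")
```

The products evaluate to roughly 0.28, 3.1, 13.2 and -0.088. Only the last descriptor has a negative target delta.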