splito.simpd
splito.simpd.SIMPDSplitter
Bases: BaseShuffleSplit
SIMPD (SImulated Medicinal chemistry Project Data) uses a multi-objective genetic algorithm (MOGA) to split a set of compounds with bioactivity data into one or more training and test sets. These sets differ from each other in ways that resemble the differences between the temporal training/test splits observed in medicinal chemistry projects.
This class implements the algorithm proposed in "SIMPD: an Algorithm for Generating Simulated Time Splits for Validating Machine Learning Approaches", available at https://chemrxiv.org/engage/chemrxiv/article-details/6406049e6642bf8c8f10e189.
The source code is largely inspired by the original authors' implementation, available at https://github.com/rinikerlab/molecular_time_series/tree/55eb420ab0319fbb18cc00fe62a872ac568ad7f5.
__init__
__init__(
n_splits: int = 1,
pop_size: int = 500,
ngens: int = 100,
swap_fraction: float = 0.1,
simpd_descriptors: Optional[pd.DataFrame] = None,
target_train_frac_active: float = -1,
target_test_frac_active: float = -1,
target_test_set_frac: float = 0.2,
target_delta_test_frac_active: Optional[float] = None,
target_GF_delta_window: Tuple[int, int] = (10, 30),
target_G_val: int = 70,
max_population_cluster_entropy: float = 0.9,
pareto_weight_GF_delta: float = 10,
pareto_weight_G: float = 5,
num_threads: int = 1,
random_seed: Optional[int] = 19,
verbose: bool = True,
verbose_pymoo: bool = True,
progress: bool = True,
progress_leave: bool = False,
)
Creates the splitter object.
Refer to the original paper for more details on the parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`n_splits` | `int` | Number of splits to generate. | `1` |
`pop_size` | `int` | The population size for the GA. | `500` |
`ngens` | `int` | The number of generations for the GA. | `100` |
`swap_fraction` | `float` | The swap fraction for the GA. Swap N% of the bits in each mutation. | `0.1` |
`simpd_descriptors` | `Optional[DataFrame]` | The descriptors to use for the GA. If None, the default descriptors from the paper will be used. Load them from | `None` |
`target_train_frac_active` | `float` | The target fraction of active compounds in the training set. Set to -1 to disable. | `-1` |
`target_test_frac_active` | `float` | The target fraction of active compounds in the test set. Set to -1 to disable. | `-1` |
`target_test_set_frac` | `float` | The target fraction of the test set. | `0.2` |
`target_delta_test_frac_active` | `Optional[float]` | The target delta of active between the test and training set. | `None` |
`target_GF_delta_window` | `Tuple[int, int]` | The target window for the GF delta. | `(10, 30)` |
`target_G_val` | `int` | The target G value. | `70` |
`max_population_cluster_entropy` | `float` | The maximum cluster entropy. | `0.9` |
`pareto_weight_GF_delta` | `float` | The weight for the GF delta. | `10` |
`pareto_weight_G` | `float` | The weight for the G value. | `5` |
`num_threads` | `int` | The number of threads to use for the GA. | `1` |
`random_seed` | `Optional[int]` | The random seed to use for the GA. | `19` |
`verbose` | `bool` | Whether to print information about the splitter. | `True` |
`verbose_pymoo` | `bool` | Whether to print information about the GA. | `True` |
`progress` | `bool` | Whether to display a progress bar. | `True` |
`progress_leave` | `bool` | Whether to leave the progress bar after completion. | `False` |
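As a rough illustration of how these constructor arguments might be passed, the sketch below keeps a subset of them in a plain dictionary and builds the splitter on demand. `SIMPD_KWARGS`, `merged_kwargs` and `make_splitter` are hypothetical helper names, not part of splito; the values are the documented defaults, except that console output is turned off.

```python
from typing import Any, Dict

# Subset of the documented constructor arguments.
# Values are the documented defaults, with the two verbosity flags switched off.
SIMPD_KWARGS: Dict[str, Any] = {
    "n_splits": 1,
    "pop_size": 500,
    "ngens": 100,
    "swap_fraction": 0.1,
    "target_test_set_frac": 0.2,
    "num_threads": 1,
    "random_seed": 19,
    "verbose": False,
    "verbose_pymoo": False,
}


def merged_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Return a copy of SIMPD_KWARGS with selected entries overridden."""
    return {**SIMPD_KWARGS, **overrides}


def make_splitter(**overrides: Any):
    """Build a SIMPDSplitter from the merged keyword arguments."""
    # Imported lazily so the configuration can be inspected without splito installed.
    from splito.simpd import SIMPDSplitter

    return SIMPDSplitter(**merged_kwargs(**overrides))
```

A shorter run for quick experiments could then be requested with something like `make_splitter(pop_size=100, ngens=20)`; those two numbers are arbitrary and only meant to show the override mechanism.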
fit
Fit the splitter against a given dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`X` | `ndarray` | An array of molecules. | required |
`y` | `ndarray` | An array of activities (only 1D is supported). | required |
`groups` | `Optional[ndarray]` | An array of groups. | `None` |
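A minimal sketch of preparing inputs for `fit` might look like the following. It rests on several assumptions that this page does not confirm: that `X` should hold RDKit molecules (built here with `datamol.to_mol`), that `y` holds binary active/inactive labels, and that a hypothetical `binarize` helper with a user-chosen threshold is an acceptable way to obtain those labels from measured values.

```python
from typing import List, Sequence


def binarize(values: Sequence[float], threshold: float) -> List[int]:
    """Label a compound active (1) when its measured value reaches the threshold."""
    return [1 if v >= threshold else 0 for v in values]


def fit_simpd(smiles: Sequence[str], activities: Sequence[int]):
    """Fit a SIMPDSplitter on molecules and a 1D activity vector (sketch)."""
    # Third-party imports stay local so binarize() has no dependencies.
    import datamol as dm
    import numpy as np
    from splito.simpd import SIMPDSplitter

    if len(smiles) != len(activities):
        raise ValueError("X and y must have the same length")

    # Assumption: X is an object array of RDKit molecules.
    X = np.empty(len(smiles), dtype=object)
    for i, smi in enumerate(smiles):
        X[i] = dm.to_mol(smi)

    y = np.asarray(activities)
    if y.ndim != 1:
        # The parameter table states that only 1D activities are supported.
        raise ValueError("y must be one-dimensional")

    splitter = SIMPDSplitter(n_splits=1, random_seed=19)
    splitter.fit(X, y)
    return splitter
```

Note that the genetic algorithm presumably needs far more than a handful of compounds to produce a meaningful split, so toy inputs are only useful for checking that the shapes line up.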
splito.simpd.run_SIMPD
run_SIMPD(
data: pd.DataFrame,
mol_column: str = "mol",
activity_column: str = "active",
pop_size: int = 500,
ngens: int = 100,
swap_fraction: float = 0.1,
simpd_descriptors: Optional[pd.DataFrame] = None,
target_train_frac_active: float = -1,
target_test_frac_active: float = -1,
target_test_set_frac: float = 0.2,
target_delta_test_frac_active: Optional[float] = None,
target_GF_delta_window: Tuple[int, int] = (10, 30),
target_G_val: int = 70,
max_population_cluster_entropy: float = 0.9,
pareto_weight_GF_delta: float = 10,
pareto_weight_G: float = 5,
num_threads: int = 1,
random_seed: Optional[int] = 19,
verbose: bool = True,
verbose_pymoo: bool = True,
progress: bool = True,
progress_leave: bool = False,
)
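The signature suggests that `run_SIMPD` works directly on a DataFrame whose molecule and activity columns are named by `mol_column` and `activity_column`. The sketch below shows one way such a frame might be assembled. `build_records` and `run_on_records` are hypothetical helpers, converting SMILES to RDKit molecules with `datamol.to_mol` is an assumption about what the molecule column should contain, and the return value is passed through untouched because this page does not describe it.

```python
from typing import Dict, List, Sequence


def build_records(
    smiles: Sequence[str],
    active: Sequence[int],
    mol_column: str = "mol",
    activity_column: str = "active",
) -> List[Dict[str, object]]:
    """Pair each molecule with its activity label under the expected column names."""
    if len(smiles) != len(active):
        raise ValueError("smiles and active must have the same length")
    return [{mol_column: s, activity_column: a} for s, a in zip(smiles, active)]


def run_on_records(records: List[Dict[str, object]]):
    """Call run_SIMPD on the records, using the default column names (sketch)."""
    # Lazy imports: only needed when the GA is actually run.
    import datamol as dm
    import pandas as pd
    from splito.simpd import run_SIMPD

    data = pd.DataFrame(records)
    # Assumption: the molecule column should hold RDKit molecules, not SMILES.
    data["mol"] = data["mol"].apply(dm.to_mol)

    # The remaining keyword arguments keep their documented defaults.
    return run_SIMPD(
        data,
        mol_column="mol",
        activity_column="active",
        target_test_set_frac=0.2,
        random_seed=19,
    )
```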
splito.simpd.DEFAULT_SIMPD_DESCRIPTORS
module-attribute
DEFAULT_SIMPD_DESCRIPTORS = pd.DataFrame(
[
{
"name": "SA_Score",
"function": "datamol.descriptors.sas",
"target_delta_value": 0.1 * 2.8,
},
{
"name": "HeavyAtomCount",
"function": "datamol.descriptors.n_heavy_atoms",
"target_delta_value": 0.1 * 31,
},
{
"name": "TPSA",
"function": "datamol.descriptors.tpsa",
"target_delta_value": 0.15 * 88.0,
},
{
"name": "fr_benzene/1000 HeavyAtoms",
"function": "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
"target_delta_value": -0.2 * 0.44,
},
]
)
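Each default `target_delta_value` is written as a product of two factors. The sketch below reproduces those products and shows how a custom table with the same three columns might be assembled for `simpd_descriptors`. Treating the first factor as a relative shift and the second as a reference value is an interpretation, not something this page states. `DEFAULT_ROWS`, `to_records`, `select` and `to_frame` are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (name, function path, first factor, second factor), copied from the defaults above.
DEFAULT_ROWS: List[Tuple[str, str, float, float]] = [
    ("SA_Score", "datamol.descriptors.sas", 0.1, 2.8),
    ("HeavyAtomCount", "datamol.descriptors.n_heavy_atoms", 0.1, 31),
    ("TPSA", "datamol.descriptors.tpsa", 0.15, 88.0),
    (
        "fr_benzene/1000 HeavyAtoms",
        "splito.simpd.descriptors.fr_benzene_1000_heavy_atoms_count",
        -0.2,
        0.44,
    ),
]


def to_records(rows: Iterable[Tuple[str, str, float, float]]) -> List[Dict[str, object]]:
    """Expand each row into the three columns used by the default table."""
    return [
        {"name": name, "function": func, "target_delta_value": a * b}
        for name, func, a, b in rows
    ]


def select(records: List[Dict[str, object]], names: Iterable[str]) -> List[Dict[str, object]]:
    """Keep only the descriptors whose name is listed."""
    keep = set(names)
    return [r for r in records if r["name"] in keep]


def to_frame(records: List[Dict[str, object]]):
    """Turn the records into a DataFrame suitable for simpd_descriptors."""
    import pandas as pd  # lazy import, only needed when building the frame

    return pd.DataFrame(records)


records = to_records(DEFAULT_ROWS)
for r in records:
    # Print the evaluated target delta for each default descriptor.
    print(f"{r['name']}: {r['target_delta_value']:.3f}")
```

A reduced table, for example `to_frame(select(records, ["TPSA", "HeavyAtomCount"]))`, could then be passed as `simpd_descriptors`; whether dropping descriptors is advisable for a given dataset is a modelling choice outside the scope of this sketch.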