splito.lohi

splito.lohi.LoSplitter

__init__

__init__(
    threshold: float = 0.4,
    min_cluster_size: int = 5,
    max_clusters: int = 50,
    std_threshold: float = 0.6,
)

A splitter that prepares data for training ML models for lead optimization or for guiding molecular generative models. Such models must be sensitive to minor modifications of molecules, so this splitter constructs a test set that evaluates a model's ability to distinguish those modifications.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | float | ECFP4 1024-bit Tanimoto similarity threshold. Molecules more similar than this threshold are considered too similar and can be grouped together in one cluster. | 0.4 |
| min_cluster_size | int | Minimum number of molecules per cluster. | 5 |
| max_clusters | int | Maximum number of selected clusters. The remaining molecules go to the training set. Use it to limit the size of the test set and keep more molecules in the training set. | 50 |
| std_threshold | float | Lower bound of the acceptable standard deviation of a cluster's values. It should be greater than the measurement noise. For ChEMBL-like data, set it to 0.60 for logKi and 0.70 for logIC50. Set it lower if you have a high-quality dataset. | 0.6 |

For more information, see a tutorial in the docs and Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark.
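
The cluster criteria described by `threshold`, `min_cluster_size`, and `std_threshold` can be illustrated with a minimal pure-Python sketch. It is not the splitter's implementation: fingerprints are represented as hypothetical sets of on-bit indices rather than real ECFP4 vectors, the helper names (`tanimoto`, `neighbors`, `cluster_is_acceptable`) are invented for illustration, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def neighbors(center: int, fps: list[set[int]], threshold: float = 0.4) -> list[int]:
    """Indices of molecules whose similarity to fps[center] exceeds the threshold."""
    return [i for i, fp in enumerate(fps) if tanimoto(fps[center], fp) > threshold]


def cluster_is_acceptable(
    idx: list[int],
    values: list[float],
    min_cluster_size: int = 5,
    std_threshold: float = 0.6,
) -> bool:
    """Check the size and activity-spread criteria for one candidate cluster."""
    if len(idx) < min_cluster_size:
        return False
    # Assumption: population standard deviation; the library may compute it differently.
    return pstdev([values[i] for i in idx]) >= std_threshold


# Toy data: hypothetical fingerprints and activity values.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11, 12}, {1, 2, 4, 7}]
values = [5.0, 6.0, 7.0, 5.5, 8.0]

cluster = neighbors(0, fps)  # molecule 3 shares no bits with molecule 0
print(cluster, cluster_is_acceptable(cluster, values, min_cluster_size=4))
```

A cluster whose values vary less than `std_threshold` is rejected in this sketch because differences within it could not be told apart from measurement noise.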

split

split(
    smiles: list[str], values: list[float], n_jobs: int = -1, verbose: int = 1
) -> tuple[list[int], list[list[int]]]

Split the dataset into a training set and a list of test clusters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| smiles | list[str] | List of SMILES strings. | required |
| values | list[float] | List of the corresponding continuous activity values. | required |
| verbose | int | Set to 0 to turn off the progress bar. | 1 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| train_idx | list[int] | Indices of the training molecules. |
| clusters_idx | list[list[int]] | One list of molecule indices per test cluster. |
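
The returned indices can be turned back into data with plain list indexing. The sketch below does not call the splitter. It uses a hard-coded, hypothetical `train_idx` and `clusters_idx` of the documented shapes, and assumes the indices refer to positions in the input `smiles` and `values` lists. The helper names `materialize_split` and `indices_are_disjoint` are invented for illustration.

```python
def materialize_split(
    smiles: list[str],
    values: list[float],
    train_idx: list[int],
    clusters_idx: list[list[int]],
) -> tuple[list[tuple[str, float]], list[list[tuple[str, float]]]]:
    """Build (SMILES, value) pairs for the training set and for each test cluster."""
    train = [(smiles[i], values[i]) for i in train_idx]
    test_clusters = [[(smiles[i], values[i]) for i in cluster] for cluster in clusters_idx]
    return train, test_clusters


def indices_are_disjoint(train_idx: list[int], clusters_idx: list[list[int]]) -> bool:
    """Sanity check: no index appears in both train and test, or in two clusters."""
    seen = set(train_idx)
    for cluster in clusters_idx:
        for i in cluster:
            if i in seen:
                return False
            seen.add(i)
    return True


# Hypothetical inputs and split output, for illustration only.
smiles = ["CCO", "CCN", "CCC", "c1ccccc1", "CCCl", "CCBr"]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_idx = [0, 3]
clusters_idx = [[1, 2], [4, 5]]

train, test_clusters = materialize_split(smiles, values, train_idx, clusters_idx)
print(len(train), [len(c) for c in test_clusters])
print(indices_are_disjoint(train_idx, clusters_idx))
```

Keeping the test data grouped by cluster, rather than flattened into one list, makes it possible to score a model within each cluster separately.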