Lo Splitter
Sometimes, we are interested in the properties of molecules that are similar to molecules with known properties. For instance, during the Lead Optimization stage of the drug discovery process, we have already identified a lead -- a potential drug candidate -- and our goal is to enhance certain of its properties. Medicinal chemists frequently make minor modifications to improve the lead's activity, selectivity, and other characteristics.
Another example is generative molecular models. These models use a scorer to guide the generation of new molecules, aiming to maximize a predicted score through step-by-step adjustments. At each step, the generator decides how to navigate the latent space, relying on a predictive model that is sensitive to minor changes to direct the generation.
In both scenarios, when applying machine learning, we want to choose models that can detect these minor modifications.
The Lo splitter addresses this need by creating a split whose test set consists of clusters of similar molecules. This allows us to assess a model's ability to detect minor modifications, helping us select the most suitable model for lead optimization or as a scorer for molecular generation. Unlike other splitters, the Lo splitter reveals whether a model can only distinguish scaffolds or whether it "understands" how different residues influence properties.
The goal of the Lead Optimization task is to predict how minor modifications of a molecule affect its properties. The Lo splitter divides the dataset into training and test sets in such a way that:
- The test set consists of clusters of similar molecules.
- The property variation within each cluster is larger than the experimental noise.
- From each test cluster, exactly one molecule is moved to the training set to imitate a known lead (see the sketch after this list).
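To make these criteria concrete, here is a minimal sketch that prints the size of each test cluster and how many of its molecules also appear in the train set. It uses the same FreeSolv toy dataset and LoSplitter defaults as the Quickstart below, and it is only an illustration of the split's structure, not part of the splitter's API.

# Sanity-check the structure of a Lo split (illustration only)
import datamol as dm
from splito.lohi import LoSplitter

data = dm.data.freesolv()
train_idx, cluster_idx = LoSplitter().split(data['smiles'], data['calc'])

train_set = set(train_idx)
for i, cluster in enumerate(cluster_idx):
    overlap = sum(idx in train_set for idx in cluster)
    print(f"cluster {i}: size={len(cluster)}, molecules also in train={overlap}")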
Quickstart¶
Imagine you want to improve solubility during lead optimization. In this case, you would want to select an ML model that is highly sensitive to minor modifications to guide your optimization cycles.
Note that we are using a toy dataset (FreeSolv, experimental and calculated hydration free energies in kcal/mol) for convenience. In real life, you should give more thought to dataset selection.
import datamol as dm
from splito.lohi import LoSplitter
from rdkit import Chem
from rdkit.Chem import Draw
import numpy as np
from scipy.stats import spearmanr
data = dm.data.freesolv()
data
| | iupac | smiles | expt | calc |
|---|---|---|---|---|
| 0 | 4-methoxy-N,N-dimethyl-benzamide | CN(C)C(=O)c1ccc(cc1)OC | -11.01 | -9.625 |
| 1 | methanesulfonyl chloride | CS(=O)(=O)Cl | -4.87 | -6.219 |
| 2 | 3-methylbut-1-ene | CC(C)C=C | 1.83 | 2.452 |
| 3 | 2-ethylpyrazine | CCc1cnccn1 | -5.45 | -5.809 |
| 4 | heptan-1-ol | CCCCCCCO | -4.21 | -2.917 |
| ... | ... | ... | ... | ... |
| 637 | methyl octanoate | CCCCCCCC(=O)OC | -2.04 | -3.035 |
| 638 | pyrrolidine | C1CCNC1 | -5.48 | -4.278 |
| 639 | 4-hydroxybenzaldehyde | c1cc(ccc1C=O)O | -8.83 | -10.050 |
| 640 | 1-chloroheptane | CCCCCCCCl | 0.29 | 1.467 |
| 641 | 1,4-dioxane | C1COCCO1 | -5.06 | -4.269 |

642 rows × 4 columns
lo_splitter = LoSplitter()
train_idx, cluster_idx = lo_splitter.split(data['smiles'], data['calc'])
Collecting clusters: 30%|███ | 15/50 [00:01<00:03, 9.88it/s]
2024-04-06 13:40:17.820 | INFO | splito.lohi._lo:_select_distinct_clusters:99 - Found 15 clusters.
Now train_idx contains the indices of the molecules in the train set, and cluster_idx contains a list of clusters, each consisting of the indices of the molecules in that cluster. Let's visualize a few of them.
# Visualize a cluster
cluster_smiles = data.iloc[cluster_idx[1]]['smiles']
molecules = [Chem.MolFromSmiles(smiles) for smiles in cluster_smiles]
img = Draw.MolsToGridImage(molecules, molsPerRow=5, subImgSize=(300, 300))
img
# Visualize another cluster
cluster_smiles = data.iloc[cluster_idx[3]]['smiles']
molecules = [Chem.MolFromSmiles(smiles) for smiles in cluster_smiles]
img = Draw.MolsToGridImage(molecules, molsPerRow=5, subImgSize=(300, 300))
img
Now let's see how to evaluate ML models using this split. Our goal is to determine whether a model can make better predictions than simply assuming "the modified molecule is as active as the original one". We can choose Spearman's correlation coefficient as our metric, calculated within each cluster and averaged across clusters. This metric does not depend on the magnitude of the intracluster variation, only on the ranking of the molecules within the cluster, and it is normalized between minus one (perfectly inverted ranking), zero (random), and one (perfect ranking), making it easy to interpret.
Let's skip model training and calculate Spearman's coefficient for a cluster.
# These are placeholders to show the workflow
def train_model(train_data):
    # Add your model training here
    model = None
    return model

def predict(model, test_data):
    # Add your inference code here; random values stand in for real predictions
    predictions = np.random.normal(size=len(test_data))
    return predictions
train_data = data.iloc[train_idx]
model = train_model(train_data)
cluster = data.iloc[cluster_idx[0]] # the first cluster
cluster_preds = predict(model, cluster)
cluster_true = cluster['calc']
spearmanr(cluster_true, cluster_preds)[0]
0.19999999999999998
Values near zero indicate poor quality of predictions, which is expected for random predictions.
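To turn this into a single model-selection number, we can average the per-cluster Spearman coefficients over all test clusters, as described above. A minimal sketch, reusing the data, the split, and the placeholder model and predict from the cells above:

# Average Spearman's correlation across all test clusters
cluster_scores = []
for indices in cluster_idx:
    cluster = data.iloc[indices]
    preds = predict(model, cluster)
    cluster_scores.append(spearmanr(cluster['calc'], preds)[0])
print(f"Mean Spearman over {len(cluster_scores)} clusters: {np.mean(cluster_scores):.3f}")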
Advanced Lo splitting¶
Each use case is different, so the Lo splitter has several parameters that allow for flexible configuration. Let's examine them.
import datamol as dm
from splito.lohi import LoSplitter
from rdkit import Chem
from rdkit.Chem import Draw
import numpy as np
from scipy.stats import spearmanr
data = dm.data.freesolv()
data
| | iupac | smiles | expt | calc |
|---|---|---|---|---|
| 0 | 4-methoxy-N,N-dimethyl-benzamide | CN(C)C(=O)c1ccc(cc1)OC | -11.01 | -9.625 |
| 1 | methanesulfonyl chloride | CS(=O)(=O)Cl | -4.87 | -6.219 |
| 2 | 3-methylbut-1-ene | CC(C)C=C | 1.83 | 2.452 |
| 3 | 2-ethylpyrazine | CCc1cnccn1 | -5.45 | -5.809 |
| 4 | heptan-1-ol | CCCCCCCO | -4.21 | -2.917 |
| ... | ... | ... | ... | ... |
| 637 | methyl octanoate | CCCCCCCC(=O)OC | -2.04 | -3.035 |
| 638 | pyrrolidine | C1CCNC1 | -5.48 | -4.278 |
| 639 | 4-hydroxybenzaldehyde | c1cc(ccc1C=O)O | -8.83 | -10.050 |
| 640 | 1-chloroheptane | CCCCCCCCl | 0.29 | 1.467 |
| 641 | 1,4-dioxane | C1COCCO1 | -5.06 | -4.269 |

642 rows × 4 columns
threshold -- This is the ECFP4 1024-bit Tanimoto similarity threshold. Molecules more similar than this threshold are considered too similar and can be grouped together in one cluster. The default value is 0.4.
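As an illustration of what this threshold means (this is not the splitter's internal code), you can compute the ECFP4 (Morgan radius 2, 1024-bit) Tanimoto similarity of two molecules with RDKit; pairs whose similarity exceeds the threshold are eligible to end up in the same cluster.

# Illustration: ECFP4 Tanimoto similarity of two similar molecules
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol_a = Chem.MolFromSmiles('CCCCCCCO')   # heptan-1-ol
mol_b = Chem.MolFromSmiles('CCCCCCCCO')  # octan-1-ol
fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=1024)
fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=1024)
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))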
min_cluster_size -- Sets a minimum number of molecules per cluster. The default is 5. This is why in the previous section you saw clusters of five molecules.
max_clusters -- The algorithm stops either when there are no more suitable clusters or when the max_clusters limit is reached. This can be useful for limiting your test set to get more molecules in the train set. The default is 50.
std_threshold -- We aim to evaluate models on their ability to rank very similar molecules by activity. What if the molecules actually have the same activity and all the variability in the data is explained by experimental noise? This is quite possible, since the molecules are very similar. Therefore, the algorithm only selects clusters whose intracluster variation exceeds the noise variation. This behavior is controlled by the std_threshold parameter, which is why we pass the activity values to the split() method. A larger std_threshold results in the selection of more variable clusters. Choose it based on the quality of your data. The default is 0.60, which is the standard deviation of the noise for logKi data in ChEMBL, a quite noisy dataset. If you use in-house data, a threshold of 0.2 might be more suitable. For more details, see Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark, Appendix B: Lo is not just a noise.
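To relate std_threshold to your own data, you can inspect the standard deviation of the property inside each selected cluster; every cluster returned by the splitter should exceed the chosen threshold. A quick sketch, assuming the data and cluster_idx from the Quickstart split are still in scope:

# Property standard deviation inside each selected cluster
for i, indices in enumerate(cluster_idx):
    print(f"cluster {i}: std = {data.iloc[indices]['calc'].std():.2f}")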
# An example of different parameters
lo_splitter = LoSplitter(
threshold=0.6,
min_cluster_size=10,
max_clusters=50,
std_threshold=0.2
)
train_idx, cluster_idx = lo_splitter.split(data['smiles'], data['calc'])
Collecting clusters: 2%|▏ | 1/50 [00:00<00:07, 6.87it/s] 2024-04-06 13:40:18.116 | INFO | splito.lohi._lo:_select_distinct_clusters:99 - Found 1 clusters.
data.iloc[cluster_idx[0]]
| | iupac | smiles | expt | calc |
|---|---|---|---|---|
| 37 | decan-1-ol | CCCCCCCCCCO | -3.64 | -2.446 |
| 39 | nonane | CCCCCCCCC | 3.13 | 3.221 |
| 88 | heptane | CCCCCCC | 2.67 | 2.925 |
| 204 | hexan-1-ol | CCCCCCO | -4.40 | -3.000 |
| 217 | butan-1-ol | CCCCO | -4.72 | -3.232 |
| 263 | decane | CCCCCCCCCC | 3.16 | 3.335 |
| 368 | nonan-1-ol | CCCCCCCCCO | -3.88 | -2.564 |
| 437 | octan-1-ol | CCCCCCCCO | -4.09 | -2.690 |
| 494 | octane | CCCCCCCC | 2.88 | 3.088 |
| 572 | pentan-1-ol | CCCCCO | -4.57 | -3.054 |
# Visualize the cluster
cluster_smiles = data.iloc[cluster_idx[0]]['smiles']
molecules = [Chem.MolFromSmiles(smiles) for smiles in cluster_smiles]
img = Draw.MolsToGridImage(molecules, molsPerRow=5, subImgSize=(300, 300))
img
Further reading¶
We explored how to split your data to train a model for lead optimization or to guide molecular generation. You can discover which models are better suited for this Lo task, and find more details about the design of Lo splitting, in the related NeurIPS 2023 Datasets and Benchmarks paper: Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark.