
Basic usage

splito

train_test_split

train_test_split(
    X: np.ndarray,
    y: np.ndarray,
    molecules: Optional[Sequence[Union[str, dm.Mol]]] = None,
    method: Union[str, SimpleSplittingMethod] = "random",
    test_size: float = 0.2,
    seed: int = None,
    n_jobs: Optional[int] = None,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Splits a set of molecules into a train and test set.

Inspired by sklearn.model_selection.train_test_split, this convenience function provides a less verbose way of using the different splitters.

Examples:

Let's first create a toy dataset

import datamol as dm
import numpy as np

from splito import train_test_split

data = dm.data.freesolv()
smiles = data["smiles"].values
X = np.array([dm.to_fp(dm.to_mol(smi)) for smi in smiles])
y = data["expt"].values

Now we can split our data.

X_train, X_test, y_train, y_test = train_test_split(X, y, method="random")

With more parameters (note that the seed is set via seed, not random_state):

X_train, X_test, y_train, y_test = train_test_split(X, y, method="random", test_size=0.1, seed=42)

Scaffold split (note that you need to pass the molecules, e.g. as SMILES):

X_train, X_test, y_train, y_test = train_test_split(X, y, molecules=smiles, method="scaffold")

Distance-based split:

X_train, X_test, y_train, y_test = train_test_split(X, y, method="kmeans")

Parameters:

- X (np.ndarray): The feature matrix. Required.
- y (np.ndarray): The target values. Required.
- molecules (Optional[Sequence[Union[str, dm.Mol]]]): A list of molecules to be used for the split. Required for some splitting methods. Default: None.
- method (Union[str, SimpleSplittingMethod]): The splitting method to use. Default: "random".
- test_size (float): The proportion of the dataset to include in the test split. Default: 0.2.
- seed (int): The seed to use for the random number generator. Default: None.
- n_jobs (Optional[int]): The number of jobs to run in parallel. Default: None.

train_test_split_indices

train_test_split_indices(
    X: np.ndarray,
    y: np.ndarray,
    molecules: Optional[Sequence[Union[str, dm.Mol]]] = None,
    method: Union[str, SimpleSplittingMethod] = "random",
    test_size: float = 0.2,
    seed: int = None,
    n_jobs: Optional[int] = None,
) -> tuple[np.ndarray, np.ndarray]

Returns the indices of the train and test sets.

Different from scikit-learn's API, we assume some data types are not represented as NumPy arrays and so cannot be directly indexed as in train_test_split. This function instead returns just the indices, leaving you to apply the split manually.

See train_test_split for more information.
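
For example, reusing the toy dataset from above, the returned indices can be applied to a plain Python list of SMILES, which NumPy-style fancy indexing would not handle directly:

from splito import train_test_split_indices

train_ind, test_ind = train_test_split_indices(X, y, method="random", test_size=0.2, seed=42)

# Apply the indices manually, e.g. to a plain Python list.
smiles_list = list(smiles)
smiles_train = [smiles_list[i] for i in train_ind]
smiles_test = [smiles_list[i] for i in test_ind]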


Advanced usage

splito

StratifiedDistributionSplit

Bases: GroupShuffleSplit

Split a dataset using the values of a readout so that the train, test, and valid sets all have the same distribution of values. Instead of binning with some kind of interval (rolling_windows), we use a 1D clustering of the readout.
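
A minimal usage sketch, assuming the scikit-learn splitter API inherited from the GroupShuffleSplit base class (the constructor arguments shown are those of the base class):

from splito import StratifiedDistributionSplit

splitter = StratifiedDistributionSplit(n_splits=1, test_size=0.2, random_state=42)
train_ind, test_ind = next(splitter.split(X, y))  # y drives the 1D clustering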

KMeansSplit

Bases: GroupShuffleSplit

Group-based split that uses k-means clustering in the input space for splitting.
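
A sketch along the same lines; the n_clusters argument is an assumption about the constructor, while the remaining arguments come from the GroupShuffleSplit base class:

from splito import KMeansSplit

# n_clusters is assumed; verify against the class signature.
splitter = KMeansSplit(n_clusters=10, n_splits=1, test_size=0.2, random_state=42)
train_ind, test_ind = next(splitter.split(X))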

MaxDissimilaritySplit

Bases: KMeansReducedDistanceSplitBase

Splits the data such that the train and test set are maximally dissimilar.

get_split_from_distance_matrix
get_split_from_distance_matrix(
    mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)

The Maximum Dissimilarity Split splits the data by trying to maximize the distance between train and test.

This is done as follows:

1. As the initial test sample, take the data point that is, on average, furthest from all other samples.
2. As the initial train sample, take the data point that is furthest from the initial test sample.
3. Iteratively add to the train set the sample that is closest to the initial train sample.
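
These three steps can be sketched in plain NumPy. This is an illustrative re-implementation of the logic described above, not splito's actual code:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def max_dissimilarity_split(X: np.ndarray, n_test: int):
    mat = squareform(pdist(X))  # full pairwise distance matrix

    # (1) Test seed: the point that is, on average, furthest from all others.
    test = [int(mat.mean(axis=1).argmax())]
    # (2) Train seed: the point furthest from the test seed.
    train = [int(mat[test[0]].argmax())]

    # (3) Grow the train set with the points closest to the train seed;
    # whatever remains joins the test set.
    rest = sorted(
        (i for i in range(len(X)) if i not in (test[0], train[0])),
        key=lambda i: mat[train[0], i],
    )
    n_train = len(X) - n_test
    train += rest[: n_train - 1]
    test += rest[n_train - 1:]
    return np.array(train), np.array(test)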

MolecularMinMaxSplit

Bases: BaseShuffleSplit

Uses the Min-Max Diversity picker from RDKit and Datamol to have a diverse set of molecules in the train set.
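
A hedged usage sketch; whether the SMILES are passed to split() directly (as assumed here) or at construction may differ from the real signature:

from splito import MolecularMinMaxSplit

splitter = MolecularMinMaxSplit(n_splits=1, test_size=0.2, random_state=42)
train_ind, test_ind = next(splitter.split(smiles))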

MolecularWeightSplit

Bases: BaseShuffleSplit

Splits the dataset by sorting the molecules by their molecular weight and then finding an appropriate cutoff to split the molecules into two sets.
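
A sketch under the same assumption that the SMILES are passed to split(); this splitter effectively tests generalization across molecular size:

from splito import MolecularWeightSplit

splitter = MolecularWeightSplit(n_splits=1, test_size=0.2, random_state=42)
train_ind, test_ind = next(splitter.split(smiles))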

MOODSplitter

Bases: BaseShuffleSplit

The MOOD splitter takes multiple candidate splitters and a set of deployment datapoints you plan to use a model on, and prescribes the splitting method that creates the test set most representative of the deployment set.

prescribed_splitter_label property
prescribed_splitter_label

Textual identifier of the splitting method that was deemed most representative.

visualize
visualize(
    downstream_distances: np.ndarray,
    splits: Optional[List[_SplitCharacterization]] = None,
    ax: Optional[plt.Axes] = None,
)

Visualizes the results of the splitting protocol by visualizing the test-to-train distance distributions resulting from each of the candidate splitters and coloring them based on their representativeness.

score_representativeness staticmethod
score_representativeness(
    downstream_distances, distances, num_samples: int = 100
)

Scores a candidate split by comparing the test-to-train and deployment-to-dataset distributions. A higher score should be interpreted as more representative.

get_prescribed_splitter
get_prescribed_splitter() -> BaseShuffleSplit

Returns the prescribed scikit-learn splitter object that is most representative.

get_protocol_visualization
get_protocol_visualization() -> plt.Axes

Visualizes the results of the splitting protocol.

get_protocol_results
get_protocol_results() -> pd.DataFrame

Returns the results of the splitting protocol in tabular form.

fit
fit(
    X: np.ndarray,
    y: Optional[np.ndarray] = None,
    groups: Optional[np.ndarray] = None,
    X_deployment: Optional[np.ndarray] = None,
    deployment_distances: Optional[np.ndarray] = None,
    progress: bool = False,
)

Follows the MOOD specification protocol to prescribe a train-test split that is most representative of the deployment setting and as such closes the testing-deployment gap.

The k-NN distance in the representation space is used as a proxy for difficulty: the further a datapoint is from the training set, the lower the model's expected performance. Using that observation, we select the train-test split that best replicates the distance distribution (i.e. the "difficulty") of the deployment set.

Parameters:

- X (np.ndarray): An array of (n_samples, n_features). Required.
- y (Optional[np.ndarray]): An array of (n_samples, 1) targets, passed to each candidate splitter's split() method. Default: None.
- groups (Optional[np.ndarray]): An array of (n_samples,) groups, passed to each candidate splitter's split() method. Default: None.
- X_deployment (Optional[np.ndarray]): An array of (n_deployment_samples, n_features). Default: None.
- deployment_distances (Optional[np.ndarray]): An array of (n_deployment_samples, 1) precomputed distances. Default: None.
- progress (bool): Whether to show a progress bar. Default: False.
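
Putting the protocol together, a hedged sketch: that MOODSplitter accepts a dict mapping labels to scikit-learn-style splitters is an assumption based on the description and on prescribed_splitter_label above, and X_deploy is a hypothetical array of deployment features:

from sklearn.model_selection import ShuffleSplit
from splito import MOODSplitter, KMeansSplit, PerimeterSplit

candidates = {
    "random": ShuffleSplit(n_splits=1, test_size=0.2, random_state=42),
    "kmeans": KMeansSplit(n_splits=1, test_size=0.2, random_state=42),
    "perimeter": PerimeterSplit(n_splits=1, test_size=0.2, random_state=42),
}

splitter = MOODSplitter(candidates)
splitter.fit(X, X_deployment=X_deploy)  # X_deploy: deployment feature matrix (hypothetical)

print(splitter.prescribed_splitter_label)      # label of the most representative candidate
train_ind, test_ind = next(splitter.split(X))  # delegates to the prescribed splitter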

PerimeterSplit

Bases: KMeansReducedDistanceSplitBase

Places the pairs of data points with maximal pairwise distance in the test set. This was originally called the extrapolation-oriented split, introduced in Szántai-Kis et al., 2003.

get_split_from_distance_matrix
get_split_from_distance_matrix(
    mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)

Iteratively places the pairs of data points with maximal pairwise distance in the test set. Anything that remains is added to the train set.

Intuitively, this leads to a test set where all the datapoints are on the "perimeter" of the high-dimensional data cloud.
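
Usage follows the same scikit-learn splitter pattern as the other distance-based splitters (a sketch, assuming the inherited constructor arguments):

from splito import PerimeterSplit

splitter = PerimeterSplit(n_splits=1, test_size=0.2, random_state=42)
train_ind, test_ind = next(splitter.split(X))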

ScaffoldSplit

Bases: GroupShuffleSplit

The default scaffold split popular in the molecular modeling literature.
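
A usage sketch; the smiles constructor argument is an assumption, based on the splitter needing the molecules to compute the scaffolds used for grouping:

from splito import ScaffoldSplit

splitter = ScaffoldSplit(smiles=smiles, n_splits=1, test_size=0.2, random_state=42)
train_ind, test_ind = next(splitter.split(X))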