splito
Basic usage
splito
train_test_split
train_test_split(
X: np.ndarray,
y: np.ndarray,
molecules: Optional[Sequence[Union[str, dm.Mol]]] = None,
method: Union[str, SimpleSplittingMethod] = "random",
test_size: float = 0.2,
seed: int = None,
n_jobs: Optional[int] = None,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
Splits a set of molecules into a train and test set.
Inspired by sklearn.model_selection.train_test_split, this convenience function provides a less verbose way of using the different splitters.
Examples:
Let's first create a toy dataset
import datamol as dm
import numpy as np
data = dm.data.freesolv()
smiles = data["smiles"].values
X = np.array([dm.to_fp(dm.to_mol(smi)) for smi in smiles])
y = data["expt"].values
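Now we can split our data.
from splito import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, method="random")
More parameters:
X_train, X_test, y_train, y_test = train_test_split(X, y, method="random", test_size=0.1, seed=42)
Scaffold split (note that you need to specify molecules):
X_train, X_test, y_train, y_test = train_test_split(X, y, molecules=smiles, method="scaffold")
Distance-based split:
X_train, X_test, y_train, y_test = train_test_split(X, y, method="kmeans")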
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | The feature matrix. | required |
y | ndarray | The target values. | required |
molecules | Optional[Sequence[Union[str, Mol]]] | A list of molecules to be used for the split. Required for some splitting methods. | None |
method | Union[str, SimpleSplittingMethod] | The splitting method to use. Defaults to "random". | 'random' |
test_size | float | The proportion of the dataset to include in the test split. | 0.2 |
seed | int | The seed to use for the random number generator. | None |
n_jobs | Optional[int] | The number of jobs to run in parallel. | None |
train_test_split_indices
train_test_split_indices(
X: np.ndarray,
y: np.ndarray,
molecules: Optional[Sequence[Union[str, dm.Mol]]] = None,
method: Union[str, SimpleSplittingMethod] = "random",
test_size: float = 0.2,
seed: int = None,
n_jobs: Optional[int] = None,
) -> tuple[np.ndarray, np.ndarray]
Returns the indices of the train and test sets.
Unlike scikit-learn's API, we assume that some data types are not represented as NumPy arrays and cannot be indexed directly as is done in train_test_split. This function offers a way to return just the indices so that you can perform the split manually.
See train_test_split for more information.
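As a minimal sketch of what performing the split manually could look like, the example below applies a pair of index lists to a plain Python list of SMILES strings. The index values are hard-coded placeholders standing in for the output of train_test_split_indices, and the helper name take is hypothetical, not part of splito.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take(items: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Select elements of a non-array container by integer position."""
    return [items[int(i)] for i in indices]


# A plain list of SMILES strings cannot be indexed with an array of positions.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1"]

# Placeholder indices: in practice these would come from train_test_split_indices.
train_ind, test_ind = [0, 2, 3, 4], [1]

smiles_train = take(smiles, train_ind)
smiles_test = take(smiles, test_ind)
```

The same helper works for any sequence type, such as a list of molecule objects or file paths.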
Advanced usage
splito
StratifiedDistributionSplit
Bases: GroupShuffleSplit
Split a dataset using the values of a readout, so that the train, test and validation sets all have the same distribution of values. Instead of binning the values into intervals (rolling_windows), we use a 1D clustering of the readout.
KMeansSplit
Bases: GroupShuffleSplit
Group-based split that uses k-means clustering in the input space for splitting.
MaxDissimilaritySplit
Bases: KMeansReducedDistanceSplitBase
Splits the data such that the train and test set are maximally dissimilar.
get_split_from_distance_matrix
get_split_from_distance_matrix(
mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)
The Maximum Dissimilarity Split splits the data by trying to maximize the distance between train and test.
This is done as follows:
(1) As initial test sample, take the data point that on average is furthest from all other samples.
(2) As initial train sample, take the data point that is furthest from the initial test sample.
(3) Iteratively add the train sample that is closest to the initial train sample.
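The three steps can be sketched in plain Python on a full pairwise distance matrix. This is an illustrative toy, not the library's implementation: the function name max_dissimilarity_sketch is hypothetical, it ignores group_indices and n_test, and it assumes that every point not added to the train set ends up in the test set.

```python
from typing import List, Tuple


def max_dissimilarity_sketch(
    dist: List[List[float]], n_train: int
) -> Tuple[List[int], List[int]]:
    """Toy version of steps (1)-(3) on a full pairwise distance matrix."""
    n = len(dist)
    # (1) Initial test sample: the point with the largest mean distance to the others.
    test_seed = max(range(n), key=lambda i: sum(dist[i]) / (n - 1))
    # (2) Initial train sample: the point furthest from the initial test sample.
    train_seed = max(range(n), key=lambda j: dist[test_seed][j])
    # (3) Add the points closest to the initial train sample until n_train is reached.
    others = [i for i in range(n) if i not in (test_seed, train_seed)]
    others.sort(key=lambda i: dist[train_seed][i])
    train = [train_seed] + others[: n_train - 1]
    # Assumption: whatever is left over joins the initial test sample.
    test = [test_seed] + others[n_train - 1 :]
    return train, test


# Four points on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
train, test = max_dissimilarity_sketch(dist, n_train=2)
```

In this toy example the outlier at position 10 seeds the test set, and the train set grows from the opposite end of the line.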
MolecularMinMaxSplit
Bases: BaseShuffleSplit
Uses the Min-Max Diversity picker from RDKit and Datamol to have a diverse set of molecules in the train set.
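The general idea behind this kind of diversity picking can be illustrated with a greedy max-min selection over a distance matrix. The sketch below is not RDKit's or Datamol's implementation. The function name maxmin_pick_sketch and the choice of the first picked index are assumptions made for illustration.

```python
from typing import List


def maxmin_pick_sketch(dist: List[List[float]], n_pick: int, first: int = 0) -> List[int]:
    """Greedily pick points so that each new pick is far from all previous picks."""
    n_pick = min(n_pick, len(dist))
    picked = [first]
    while len(picked) < n_pick:
        remaining = [i for i in range(len(dist)) if i not in picked]
        # Choose the candidate whose nearest already-picked point is furthest away.
        best = max(remaining, key=lambda i: min(dist[i][j] for j in picked))
        picked.append(best)
    return picked


# Five points on a line; distances are absolute differences of positions.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in positions] for a in positions]
picked = maxmin_pick_sketch(dist, n_pick=3)
```

The picked indices spread across the line rather than clustering around the starting point, which is the property that makes the resulting set diverse.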
MolecularWeightSplit
Bases: BaseShuffleSplit
Splits the dataset by sorting the molecules by their molecular weight and then finding an appropriate cutoff to split the molecules in two sets.
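A minimal sketch of the sort-and-cut idea, assuming molecular weights have already been computed and that the heaviest molecules go to the test set. The function name weight_split_sketch, the rounding rule for the cutoff and the direction of the split are assumptions, and the weights are illustrative values.

```python
from typing import List, Sequence, Tuple


def weight_split_sketch(
    weights: Sequence[float], test_size: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Sort by weight and cut so that the heaviest fraction forms the test set."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    n_test = max(1, round(test_size * len(weights)))
    cutoff = len(order) - n_test
    return order[:cutoff], order[cutoff:]


# Illustrative molecular weights in g/mol, one per molecule.
weights = [46.07, 78.11, 60.05, 45.08, 84.16]
train_idx, test_idx = weight_split_sketch(weights, test_size=0.4)
```

Every test molecule is heavier than every train molecule, so the evaluation probes how well a model extrapolates to larger compounds.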
MOODSplitter
Bases: BaseShuffleSplit
The MOOD splitter takes in multiple candidate splitters and a set of deployment datapoints you plan to use a model on and prescribes one splitting method that creates the test set that is most representative of the deployment set.
prescribed_splitter_label
property
Textual identifier of the splitting method that was deemed most representative.
visualize
visualize(
downstream_distances: np.ndarray,
splits: Optional[List[_SplitCharacterization]] = None,
ax: Optional[plt.Axes] = None,
)
Visualizes the results of the splitting protocol by plotting the test-to-train distance distributions resulting from each of the candidate splitters and coloring them by representativeness.
score_representativeness
staticmethod
Scores a candidate split by comparing the test-to-train and deployment-to-dataset distributions. A higher score should be interpreted as more representative.
get_prescribed_splitter
Returns the prescribed scikit-learn Splitter object that is most representative
get_protocol_visualization
Visualizes the results of the splitting protocol
get_protocol_results
Returns the results of the splitting protocol in tabular form
fit
fit(
X: np.ndarray,
y: Optional[np.ndarray] = None,
groups: Optional[np.ndarray] = None,
X_deployment: Optional[np.ndarray] = None,
deployment_distances: Optional[np.ndarray] = None,
progress: bool = False,
)
Follows the MOOD specification protocol to prescribe a train-test split that is most representative of the deployment setting and as such closes the testing-deployment gap.
The k-NN distance in the representation space is used as a proxy for difficulty. The further a datapoint is from the training set, the lower the model's expected performance. Using that observation, we select the train-test split that best replicates the distance distribution (i.e. "the difficulty") of the deployment set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | An array of (n_samples, n_features) | required |
y | Optional[ndarray] | An array of (n_samples, 1) targets, passed to candidate splitter's split() method | None |
groups | Optional[ndarray] | An array of (n_samples,) groups, passed to candidate splitter's split() method | None |
X_deployment | Optional[ndarray] | An array of (n_deployment_samples, n_features) | None |
deployment_distances | Optional[ndarray] | An array of (n_deployment_samples, 1) precomputed distances. | None |
progress | bool | Whether to show a progress bar | False |
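The selection idea described for fit can be sketched as follows: compute test-to-train nearest-neighbor distances for each candidate split, then keep the candidate whose distances look most like the deployment distances. This toy uses k = 1 and compares only the mean distance, a deliberately crude stand-in for a proper comparison of distributions. The function names and the candidate labels are hypothetical and do not reflect the MOODSplitter internals.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, test indices)


def nn_distances(
    dist: Sequence[Sequence[float]], train: List[int], test: List[int]
) -> List[float]:
    """Distance from each test point to its nearest training point (k = 1)."""
    return [min(dist[i][j] for j in train) for i in test]


def pick_most_representative(
    dist: Sequence[Sequence[float]],
    candidates: Dict[str, Split],
    deployment_distances: Sequence[float],
) -> str:
    """Return the label of the split whose mean distance is closest to deployment."""
    target = mean(deployment_distances)
    gaps = {
        label: abs(mean(nn_distances(dist, train, test)) - target)
        for label, (train, test) in candidates.items()
    }
    return min(gaps, key=lambda label: gaps[label])


# Four points on a line at positions 0, 1, 2 and 10.
positions = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
candidates = {
    "interpolation": ([0, 2, 3], [1]),  # test point sits between train points
    "extrapolation": ([0, 1, 2], [3]),  # test point is far from all train points
}
# Deployment points are assumed to lie far from the dataset.
best = pick_most_representative(dist, candidates, deployment_distances=[7.0, 9.0])
```

Because the deployment points are far from the dataset, the split that holds out the distant point is the one prescribed in this toy example.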
PerimeterSplit
Bases: KMeansReducedDistanceSplitBase
Places the pairs of data points with maximal pairwise distance in the test set. This was originally called the extrapolation-oriented split, introduced in Szántai-Kis et al., 2003.
get_split_from_distance_matrix
get_split_from_distance_matrix(
mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)
Iteratively places the pairs of data points with maximal pairwise distance in the test set. Anything that remains is added to the train set.
Intuitively, this leads to a test set where all the datapoints are on the "perimeter" of the high-dimensional data cloud.
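The iterative pairing can be sketched on a full pairwise distance matrix. This toy is an illustration under assumptions, not the library's implementation: the function name perimeter_split_sketch is hypothetical, it ignores group_indices and n_train, and it stops as soon as n_test points have been placed.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def perimeter_split_sketch(
    dist: Sequence[Sequence[float]], n_test: int
) -> Tuple[List[int], List[int]]:
    """Move the most distant remaining pair into the test set until it is full."""
    remaining = set(range(len(dist)))
    test: List[int] = []
    while len(test) < n_test and len(remaining) >= 2:
        # Find the remaining pair of points that is furthest apart.
        i, j = max(combinations(sorted(remaining), 2), key=lambda p: dist[p[0]][p[1]])
        for k in (i, j):
            if len(test) < n_test:
                test.append(k)
                remaining.discard(k)
    # Anything that was not picked stays in the train set.
    return sorted(remaining), test


# Five points on a line at positions 0, 4, 5, 6 and 10.
positions = [0.0, 4.0, 5.0, 6.0, 10.0]
dist = [[abs(a - b) for b in positions] for a in positions]
train, test = perimeter_split_sketch(dist, n_test=2)
```

The two extreme points end up in the test set while the interior points remain for training, which matches the "perimeter" intuition.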
ScaffoldSplit
Bases: GroupShuffleSplit
The default scaffold split popular in molecular modeling literature.
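Since ScaffoldSplit is group-based, its key property is that all molecules sharing a scaffold land on the same side of the split. The sketch below shows that property using precomputed scaffold identifiers (for example, Murcko scaffold SMILES). The function name group_split_sketch and the way groups are shuffled and counted are assumptions for illustration, not the behavior of GroupShuffleSplit or splito.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_split_sketch(
    groups: Sequence[str], test_size: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assign whole groups to train or test so that no group straddles both."""
    by_group: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    labels = sorted(by_group)
    random.Random(seed).shuffle(labels)
    # Here test_size is taken as a proportion of groups, not of samples.
    n_test_groups = max(1, round(test_size * len(labels)))
    train: List[int] = []
    test: List[int] = []
    for position, label in enumerate(labels):
        target = test if position < n_test_groups else train
        target.extend(by_group[label])
    return train, test


# Hypothetical scaffold identifiers, one per molecule.
scaffolds = ["c1ccccc1", "c1ccccc1", "C1CCCCC1", "c1ccncc1", "C1CCCCC1", "c1ccncc1"]
train, test = group_split_sketch(scaffolds, test_size=0.34)
```

Which scaffold is held out depends on the seed, but the train and test sets never share a scaffold.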