splito

MOODSplitter

Bases: BaseShuffleSplit

The MOOD splitter takes in multiple candidate splitters and a set of deployment datapoints you plan to use a model on, and prescribes the splitting method that creates the test set most representative of that deployment set.

prescribed_splitter_label property
prescribed_splitter_label

Textual identifier of the splitting method that was deemed most representative.

__init__
__init__(
    candidate_splitters: Dict[str, BaseShuffleSplit],
    metric: Union[str, Callable] = "minkowski",
    p: int = 2,
    k: int = 5,
    n_jobs: Optional[int] = None,
)

Creates the splitter object.

Parameters:

    candidate_splitters (Dict[str, BaseShuffleSplit], required):
        A dictionary of the candidate splitter methods you are considering.
    metric (Union[str, Callable], default 'minkowski'):
        The distance metric to use. Needs to be supported by sklearn.neighbors.NearestNeighbors.
    p (int, default 2):
        If the metric is the Minkowski distance, this is the p in that distance.
    k (int, default 5):
        The number of nearest neighbors used to compute the distance.
    n_jobs (Optional[int], default None):
        The number of parallel jobs to run.
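
A minimal construction sketch for illustration (it assumes splito exports MOODSplitter at the package top level; the candidate name and settings are arbitrary):

import numpy as np
from sklearn.model_selection import ShuffleSplit
from splito import MOODSplitter  # assumed top-level export

# Candidates are passed as a name -> splitter mapping.
splitter = MOODSplitter(
    candidate_splitters={"random": ShuffleSplit(n_splits=1, test_size=0.2)},
    metric="euclidean",  # must be supported by sklearn.neighbors.NearestNeighbors
    k=5,                 # number of nearest neighbors for the distance proxy
)
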
visualize
visualize(
    downstream_distances: np.ndarray,
    splits: Optional[List[_SplitCharacterization]] = None,
    ax: Optional[plt.Axes] = None,
)

Visualizes the results of the splitting protocol by plotting the test-to-train distance distribution produced by each candidate splitter, colored by its representativeness.

score_representativeness staticmethod
score_representativeness(
    downstream_distances, distances, num_samples: int = 100
)

Scores a candidate split by comparing the test-to-train and deployment-to-dataset distance distributions. A higher score indicates a more representative split.
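
Since this is a static method, it can be called directly on the class. A toy sketch with synthetic distance samples (the gamma draws are placeholders, not part of the API):

import numpy as np
from splito import MOODSplitter

rng = np.random.default_rng(0)
test_to_train = rng.gamma(2.0, 1.0, size=200)   # toy test-to-train distances
deploy_to_data = rng.gamma(2.5, 1.0, size=150)  # toy deployment-to-dataset distances

score = MOODSplitter.score_representativeness(deploy_to_data, test_to_train)
print(score)  # higher means the candidate split better mimics deployment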

get_prescribed_splitter
get_prescribed_splitter() -> BaseShuffleSplit

Returns the prescribed scikit-learn splitter object that was deemed most representative.

get_protocol_visualization
get_protocol_visualization() -> plt.Axes

Visualizes the results of the splitting protocol

get_protocol_results
get_protocol_results() -> pd.DataFrame

Returns the results of the splitting protocol in tabular form

fit
fit(
    X: np.ndarray,
    y: Optional[np.ndarray] = None,
    groups: Optional[np.ndarray] = None,
    X_deployment: Optional[np.ndarray] = None,
    deployment_distances: Optional[np.ndarray] = None,
    progress: bool = False,
)

Follows the MOOD specification protocol to prescribe the train-test split that is most representative of the deployment setting, thereby closing the testing-deployment gap.

The k-NN distance in the representation space is used as a proxy for difficulty: the further a datapoint is from the training set, the lower the model's expected performance. Using that observation, we select the train-test split that best replicates the distance distribution (i.e. "the difficulty") of the deployment set.
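
The distance proxy itself is easy to emulate with scikit-learn (a sketch of the concept; splito's internals may differ):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 16))
X_query = rng.normal(size=(50, 16))

# Mean distance to the k nearest training samples, used as a difficulty proxy.
knn = NearestNeighbors(n_neighbors=5).fit(X_train)
dist, _ = knn.kneighbors(X_query)
difficulty = dist.mean(axis=1)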

Parameters:

    X (ndarray, required):
        An array of shape (n_samples, n_features).
    y (Optional[ndarray], default None):
        An array of (n_samples, 1) targets, passed to each candidate splitter's split() method.
    groups (Optional[ndarray], default None):
        An array of (n_samples,) groups, passed to each candidate splitter's split() method.
    X_deployment (Optional[ndarray], default None):
        An array of shape (n_deployment_samples, n_features).
    deployment_distances (Optional[ndarray], default None):
        An array of (n_deployment_samples, 1) precomputed deployment-to-dataset distances.
    progress (bool, default False):
        Whether to show a progress bar.
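
A hedged end-to-end sketch of the protocol, using random feature matrices in place of real molecular descriptors (candidate names and shapes are arbitrary; the KMeansSplit constructor arguments are assumed to follow the usual scikit-learn conventions):

import numpy as np
from sklearn.model_selection import ShuffleSplit
from splito import MOODSplitter, KMeansSplit  # assumed top-level exports

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                  # dataset features
X_deploy = rng.normal(loc=0.5, size=(100, 16))  # deployment features

candidates = {
    "random": ShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
    "kmeans": KMeansSplit(n_splits=1, test_size=0.2, random_state=0),
}

mood = MOODSplitter(candidate_splitters=candidates, k=5)
mood.fit(X, X_deployment=X_deploy)

print(mood.prescribed_splitter_label)  # name of the most representative candidate
print(mood.get_protocol_results())     # tabular comparison of all candidates

# Use the prescribed splitter (assuming it can derive its groups from X alone).
best = mood.get_prescribed_splitter()
train_idx, test_idx = next(best.split(X))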

KMeansSplit

Bases: GroupShuffleSplit

Group-based split that uses k-Means clustering in the input space for splitting.
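
The underlying idea can be sketched with plain scikit-learn (an illustration of the concept, not splito's implementation):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))

# Cluster the inputs, then keep whole clusters on one side of the split.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=clusters))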

PerimeterSplit

Bases: KMeansReducedDistanceSplitBase

Places pairs of data points with maximal pairwise distance in the test set. This was originally called the extrapolation-oriented split, introduced in Szántai-Kis et al., 2003.

get_split_from_distance_matrix
get_split_from_distance_matrix(
    mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)

Iteratively places the pairs of data points with maximal pairwise distance in the test set. Anything that remains is added to the train set.

Intuitively, this leads to a test set where all the datapoints are on the "perimeter" of the high-dimensional data cloud.
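
A simplified sketch of that selection loop over a precomputed distance matrix (an illustration of the idea, not splito's implementation; perimeter_split is a hypothetical helper name):

import numpy as np

def perimeter_split(mat: np.ndarray, n_test: int):
    n = mat.shape[0]
    dist = mat.astype(float).copy()
    np.fill_diagonal(dist, -np.inf)
    test = []
    # Repeatedly move the most distant remaining pair into the test set.
    while len(test) < n_test:
        i, j = np.unravel_index(np.argmax(dist), dist.shape)
        test.extend([int(i), int(j)])
        dist[[i, j], :] = -np.inf  # exclude the picked pair from further rounds
        dist[:, [i, j]] = -np.inf
    test = test[:n_test]  # trim the overshoot when n_test is odd
    train = sorted(set(range(n)) - set(test))
    return train, sorted(test)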

MaxDissimilaritySplit

Bases: KMeansReducedDistanceSplitBase

Splits the data such that the train and test set are maximally dissimilar.

get_split_from_distance_matrix
get_split_from_distance_matrix(
    mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)

The Maximum Dissimilarity Split splits the data by trying to maximize the distance between train and test.

This is done as follows (see the sketch after these steps):

(1) As the initial test sample, take the data point that is, on average, furthest from all other samples.
(2) As the initial train sample, take the data point that is furthest from the initial test sample.
(3) Iteratively add to the train set the sample that is closest to the initial train sample.
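
A compact sketch of those three steps over a precomputed distance matrix (illustrative only, not splito's implementation; max_dissimilarity_split is a hypothetical helper name):

import numpy as np

def max_dissimilarity_split(mat: np.ndarray, n_test: int):
    n = mat.shape[0]
    # (1) Initial test sample: the point furthest from all others on average.
    test_seed = int(mat.mean(axis=1).argmax())
    # (2) Initial train sample: the point furthest from the test seed.
    train_seed = int(mat[test_seed].argmax())
    # (3) Grow the train set with the samples closest to the train seed;
    #     everything that remains (including the test seed) becomes the test set.
    order = [i for i in np.argsort(mat[train_seed]) if i != test_seed]
    train = sorted(int(i) for i in order[: n - n_test])
    test = sorted(set(range(n)) - set(train))
    return train, test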

ScaffoldSplit

Bases: GroupShuffleSplit

The default scaffold split popular in the molecular modeling literature.
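
The idea behind a scaffold split can be sketched directly with RDKit and scikit-learn (a conceptual illustration, not splito's implementation):

from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupShuffleSplit

smiles = ["c1ccccc1O", "c1ccccc1N", "c1ccncc1", "C1CCCCC1C", "CC(=O)OC1=CC=CC=C1C(=O)O"]
# Group molecules by their Bemis-Murcko scaffold ...
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(s) for s in smiles]
# ... and keep whole scaffold groups on one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(smiles, groups=scaffolds))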

MolecularMinMaxSplit

Bases: BaseShuffleSplit

Uses the Min-Max Diversity picker from RDKit and Datamol to have a diverse set of molecules in the train set.
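
The same idea can be sketched with RDKit's MaxMinPicker directly (a conceptual illustration, not splito's implementation):

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O", "CCN(CC)CC", "c1ccncc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]

# Pick a maximally diverse subset and use it as the train set.
picker = MaxMinPicker()
train_idx = list(picker.LazyBitVectorPick(fps, len(fps), 4))
test_idx = [i for i in range(len(smiles)) if i not in set(train_idx)]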

MolecularWeightSplit

Bases: BaseShuffleSplit

Splits the dataset by sorting the molecules by molecular weight and then finding an appropriate cutoff to split the molecules into two sets.
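
Conceptually (an illustration only, not splito's implementation):

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O", "CCN(CC)CC"]
weights = np.array([Descriptors.MolWt(Chem.MolFromSmiles(s)) for s in smiles])

# Sort by molecular weight and cut, so train and test cover different weight ranges.
order = np.argsort(weights)
cutoff = int(0.75 * len(order))
train_idx, test_idx = order[:cutoff], order[cutoff:]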

StratifiedDistributionSplit

Bases: GroupShuffleSplit

Split a dataset using the values of a readout so that the train, test, and valid sets all have the same distribution of values. Instead of binning with some fixed interval (rolling windows), we use a 1D clustering of the readout.
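
The idea can be sketched with plain scikit-learn: cluster the readout in 1D, then stratify on the cluster labels so each set covers the full range of values. Note that splito's class derives from GroupShuffleSplit; the sketch below uses StratifiedShuffleSplit purely to convey the stratification idea, not splito's implementation:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # inputs
y = rng.normal(size=300)       # readout values

# 1D clustering of the readout, then stratified sampling on the cluster labels.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, clusters))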