splito

MOODSplitter

Bases: BaseShuffleSplit

The MOOD splitter takes in multiple candidate splitters and a set of deployment datapoints you plan to use a model on, and prescribes the splitting method that creates the test set most representative of that deployment set.

prescribed_splitter_label property
prescribed_splitter_label

Textual identifier of the splitting method that was deemed most representative.

__init__
__init__(
    candidate_splitters: Dict[str, BaseShuffleSplit],
    metric: Union[str, Callable] = "minkowski",
    p: int = 2,
    k: int = 5,
    n_jobs: Optional[int] = None,
)

Creates the splitter object.

Parameters:

    candidate_splitters (Dict[str, BaseShuffleSplit], required):
        A dictionary of the candidate splitter methods you are considering.
    metric (Union[str, Callable], default 'minkowski'):
        The distance metric to use. Needs to be supported by sklearn.neighbors.NearestNeighbors.
    p (int, default 2):
        If the metric is the Minkowski distance, this is the p in that distance.
    k (int, default 5):
        The number of nearest neighbors used to compute the distance.
    n_jobs (Optional[int], default None):
        The number of parallel jobs to run.
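
A minimal construction sketch for illustration (it assumes splito exports MOODSplitter at the package top level; the candidate name and settings are arbitrary):

import numpy as np
from sklearn.model_selection import ShuffleSplit
from splito import MOODSplitter  # assumed top-level export

# Candidates are passed as a name -> splitter mapping.
splitter = MOODSplitter(
    candidate_splitters={"random": ShuffleSplit(n_splits=1, test_size=0.2)},
    metric="euclidean",  # must be supported by sklearn.neighbors.NearestNeighbors
    k=5,                 # number of nearest neighbors for the distance proxy
)
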
visualize
visualize(
    downstream_distances: np.ndarray,
    splits: Optional[List[_SplitCharacterization]] = None,
    ax: Optional[plt.Axes] = None,
)

Visualizes the results of the splitting protocol by plotting the test-to-train distance distribution produced by each candidate splitter, colored by its representativeness.

score_representativeness staticmethod
score_representativeness(
    downstream_distances, distances, num_samples: int = 100
)

Scores a candidate split by comparing the test-to-train and deployment-to-dataset distance distributions. A higher score indicates a more representative split.
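
Since this is a static method, it can be called directly on the class. A toy sketch with synthetic distance samples (the gamma draws are placeholders, not part of the API):

import numpy as np
from splito import MOODSplitter

rng = np.random.default_rng(0)
test_to_train = rng.gamma(2.0, 1.0, size=200)   # toy test-to-train distances
deploy_to_data = rng.gamma(2.5, 1.0, size=150)  # toy deployment-to-dataset distances

score = MOODSplitter.score_representativeness(deploy_to_data, test_to_train)
print(score)  # higher means the candidate split better mimics deployment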

get_prescribed_splitter
get_prescribed_splitter() -> BaseShuffleSplit

Returns the prescribed scikit-learn splitter object that was deemed most representative.

get_protocol_visualization
get_protocol_visualization() -> plt.Axes

Visualizes the results of the splitting protocol

get_protocol_results
get_protocol_results() -> pd.DataFrame

Returns the results of the splitting protocol in tabular form

fit
fit(
    X: np.ndarray,
    y: Optional[np.ndarray] = None,
    groups: Optional[np.ndarray] = None,
    X_deployment: Optional[np.ndarray] = None,
    deployment_distances: Optional[np.ndarray] = None,
    progress: bool = False,
)

Follows the MOOD specification protocol to prescribe the train-test split that is most representative of the deployment setting, thereby closing the testing-deployment gap.

The k-NN distance in the representation space is used as a proxy for difficulty: the further a datapoint is from the training set, the lower the model's expected performance. Using that observation, we select the train-test split that best replicates the distance distribution (i.e. "the difficulty") of the deployment set.
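
The distance proxy itself is easy to emulate with scikit-learn (a sketch of the concept; splito's internals may differ):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 16))
X_query = rng.normal(size=(50, 16))

# Mean distance to the k nearest training samples, used as a difficulty proxy.
knn = NearestNeighbors(n_neighbors=5).fit(X_train)
dist, _ = knn.kneighbors(X_query)
difficulty = dist.mean(axis=1)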

Parameters:

    X (ndarray, required):
        An array of shape (n_samples, n_features).
    y (Optional[ndarray], default None):
        An array of (n_samples, 1) targets, passed to each candidate splitter's split() method.
    groups (Optional[ndarray], default None):
        An array of (n_samples,) groups, passed to each candidate splitter's split() method.
    X_deployment (Optional[ndarray], default None):
        An array of shape (n_deployment_samples, n_features).
    deployment_distances (Optional[ndarray], default None):
        An array of (n_deployment_samples, 1) precomputed deployment-to-dataset distances.
    progress (bool, default False):
        Whether to show a progress bar.
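
A hedged end-to-end sketch of the protocol, using random feature matrices in place of real molecular descriptors (candidate names and shapes are arbitrary; the KMeansSplit constructor arguments are assumed to follow the usual scikit-learn conventions):

import numpy as np
from sklearn.model_selection import ShuffleSplit
from splito import MOODSplitter, KMeansSplit  # assumed top-level exports

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                  # dataset features
X_deploy = rng.normal(loc=0.5, size=(100, 16))  # deployment features

candidates = {
    "random": ShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
    "kmeans": KMeansSplit(n_splits=1, test_size=0.2, random_state=0),
}

mood = MOODSplitter(candidate_splitters=candidates, k=5)
mood.fit(X, X_deployment=X_deploy)

print(mood.prescribed_splitter_label)  # name of the most representative candidate
print(mood.get_protocol_results())     # tabular comparison of all candidates

# Use the prescribed splitter (assuming it can derive its groups from X alone).
best = mood.get_prescribed_splitter()
train_idx, test_idx = next(best.split(X))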

KMeansSplit

Bases: GroupShuffleSplit

Group-based split that uses k-Means clustering in the input space for splitting.
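
The underlying idea can be sketched with plain scikit-learn (an illustration of the concept, not splito's implementation):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))

# Cluster the inputs, then keep whole clusters on one side of the split.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=clusters))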

PerimeterSplit

Bases: KMeansReducedDistanceSplitBase

Places pairs of data points with maximal pairwise distance in the test set. This was originally called the extrapolation-oriented split, introduced in Szántai-Kis et al., 2003.

get_split_from_distance_matrix
get_split_from_distance_matrix(
    mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)

Iteratively places the pairs of data points with maximal pairwise distance in the test set. Anything that remains is added to the train set.

Intuitively, this leads to a test set where all the datapoints are on the "perimeter" of the high-dimensional data cloud.
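
A simplified sketch of that selection loop over a precomputed distance matrix (an illustration of the idea, not splito's implementation; perimeter_split is a hypothetical helper name):

import numpy as np

def perimeter_split(mat: np.ndarray, n_test: int):
    n = mat.shape[0]
    dist = mat.astype(float).copy()
    np.fill_diagonal(dist, -np.inf)
    test = []
    # Repeatedly move the most distant remaining pair into the test set.
    while len(test) < n_test:
        i, j = np.unravel_index(np.argmax(dist), dist.shape)
        test.extend([int(i), int(j)])
        dist[[i, j], :] = -np.inf  # exclude the picked pair from further rounds
        dist[:, [i, j]] = -np.inf
    test = test[:n_test]  # trim the overshoot when n_test is odd
    train = sorted(set(range(n)) - set(test))
    return train, sorted(test)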

MaxDissimilaritySplit

Bases: KMeansReducedDistanceSplitBase

Splits the data such that the train and test set are maximally dissimilar.

get_split_from_distance_matrix
get_split_from_distance_matrix(
    mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)

The Maximum Dissimilarity Split splits the data by trying to maximize the distance between train and test.

This is done as follows (see the sketch after these steps):

(1) As the initial test sample, take the data point that is, on average, furthest from all other samples.
(2) As the initial train sample, take the data point that is furthest from the initial test sample.
(3) Iteratively add to the train set the sample that is closest to the initial train sample.
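
A compact sketch of those three steps over a precomputed distance matrix (illustrative only, not splito's implementation; max_dissimilarity_split is a hypothetical helper name):

import numpy as np

def max_dissimilarity_split(mat: np.ndarray, n_test: int):
    n = mat.shape[0]
    # (1) Initial test sample: the point furthest from all others on average.
    test_seed = int(mat.mean(axis=1).argmax())
    # (2) Initial train sample: the point furthest from the test seed.
    train_seed = int(mat[test_seed].argmax())
    # (3) Grow the train set with the samples closest to the train seed;
    #     everything that remains (including the test seed) becomes the test set.
    order = [i for i in np.argsort(mat[train_seed]) if i != test_seed]
    train = sorted(int(i) for i in order[: n - n_test])
    test = sorted(set(range(n)) - set(train))
    return train, test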

ScaffoldSplit

Bases: GroupShuffleSplit

The default scaffold split popular in the molecular modeling literature.
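
The idea behind a scaffold split can be sketched directly with RDKit and scikit-learn (a conceptual illustration, not splito's implementation):

from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupShuffleSplit

smiles = ["c1ccccc1O", "c1ccccc1N", "c1ccncc1", "C1CCCCC1C", "CC(=O)OC1=CC=CC=C1C(=O)O"]
# Group molecules by their Bemis-Murcko scaffold ...
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(s) for s in smiles]
# ... and keep whole scaffold groups on one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(smiles, groups=scaffolds))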

MolecularMinMaxSplit

Bases: BaseShuffleSplit

Uses the Min-Max Diversity picker from RDKit and Datamol to have a diverse set of molecules in the train set.
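
The same idea can be sketched with RDKit's MaxMinPicker directly (a conceptual illustration, not splito's implementation):

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O", "CCN(CC)CC", "c1ccncc1"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]

# Pick a maximally diverse subset and use it as the train set.
picker = MaxMinPicker()
train_idx = list(picker.LazyBitVectorPick(fps, len(fps), 4))
test_idx = [i for i in range(len(smiles)) if i not in set(train_idx)]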

MolecularWeightSplit

Bases: BaseShuffleSplit

Splits the dataset by sorting the molecules by molecular weight and then finding an appropriate cutoff to split the molecules into two sets.
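
Conceptually (an illustration only, not splito's implementation):

import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O", "CCN(CC)CC"]
weights = np.array([Descriptors.MolWt(Chem.MolFromSmiles(s)) for s in smiles])

# Sort by molecular weight and cut, so train and test cover different weight ranges.
order = np.argsort(weights)
cutoff = int(0.75 * len(order))
train_idx, test_idx = order[:cutoff], order[cutoff:]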

StratifiedDistributionSplit

Bases: GroupShuffleSplit

Split a dataset using the values of a readout so that the train, test, and valid sets all have the same distribution of values. Instead of binning with some fixed interval (rolling windows), we use a 1D clustering of the readout.
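
The idea can be sketched with plain scikit-learn: cluster the readout in 1D, then stratify on the cluster labels so each set covers the full range of values. Note that splito's class derives from GroupShuffleSplit; the sketch below uses StratifiedShuffleSplit purely to convey the stratification idea, not splito's implementation:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # inputs
y = rng.normal(size=300)       # readout values

# 1D clustering of the readout, then stratified sampling on the cluster labels.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, clusters))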