splito
MOODSplitter
Bases: BaseShuffleSplit
The MOOD splitter takes multiple candidate splitters and a set of deployment datapoints you plan to use a model on, and prescribes the splitting method whose test set is most representative of that deployment set.
prescribed_splitter_label
property
Textual identifier of the splitting method that was deemed most representative.
__init__
__init__(
candidate_splitters: Dict[str, BaseShuffleSplit],
metric: Union[str, Callable] = "minkowski",
p: int = 2,
k: int = 5,
n_jobs: Optional[int] = None,
)
Creates the splitter object.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`candidate_splitters` | `Dict[str, BaseShuffleSplit]` | A dictionary of the named splitting methods you are considering. | required |
`metric` | `Union[str, Callable]` | The distance metric to use. Needs to be supported by the underlying nearest-neighbor implementation. | `'minkowski'` |
`p` | `int` | If the metric is the Minkowski distance, this is the p in that distance. | `2` |
`k` | `int` | The number of nearest neighbors to use to compute the distance. | `5` |
`n_jobs` | `Optional[int]` | The number of jobs to run in parallel. | `None` |
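For illustration, a minimal construction sketch (hedged: the candidate splitters shown are arbitrary choices, and the top-level `splito` import path is assumed):

```python
from sklearn.model_selection import ShuffleSplit, GroupShuffleSplit
from splito import MOODSplitter  # assumed import path

# Hypothetical candidates: any BaseShuffleSplit-compatible splitters work.
candidates = {
    "random": ShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
    "grouped": GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0),
}

splitter = MOODSplitter(candidate_splitters=candidates, metric="minkowski", p=2, k=5)
```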
visualize
visualize(
downstream_distances: np.ndarray,
splits: Optional[List[_SplitCharacterization]] = None,
ax: Optional[plt.Axes] = None,
)
Visualizes the results of the splitting protocol by plotting the test-to-train distance distribution produced by each candidate splitter, colored by its representativeness.
score_representativeness
staticmethod
Scores a candidate split by comparing the test-to-train and deployment-to-dataset distance distributions. A higher score should be interpreted as more representative.
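The statistic used for this comparison is not spelled out on this page. Purely as a hedged sketch, one natural choice is a negated Wasserstein distance between the two distance distributions, so a higher value means a closer match (`representativeness_score` is a hypothetical stand-in, not the library's implementation):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def representativeness_score(test_to_train: np.ndarray, deployment: np.ndarray) -> float:
    # Higher = the test-to-train distances look more like the deployment distances.
    return -wasserstein_distance(test_to_train, deployment)
```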
get_prescribed_splitter
Returns the prescribed scikit-learn splitter object, i.e. the candidate deemed most representative.
get_protocol_visualization
Visualizes the results of the splitting protocol.
get_protocol_results
Returns the results of the splitting protocol in tabular form.
fit
fit(
X: np.ndarray,
y: Optional[np.ndarray] = None,
groups: Optional[np.ndarray] = None,
X_deployment: Optional[np.ndarray] = None,
deployment_distances: Optional[np.ndarray] = None,
progress: bool = False,
)
Follows the MOOD specification protocol to prescribe a train-test split that is most representative of the deployment setting and thereby closes the testing-deployment gap.
The k-NN distance in the representation space is used as a proxy for difficulty: the further a datapoint is from the training set, the lower the model's expected performance. Based on that observation, we select the train-test split that best replicates the distance distribution (i.e. "the difficulty") of the deployment set.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`X` | `ndarray` | An array of (n_samples, n_features). | required |
`y` | `Optional[ndarray]` | An array of (n_samples, 1) targets, passed to the candidate splitters' `split()` method. | `None` |
`groups` | `Optional[ndarray]` | An array of (n_samples,) groups, passed to the candidate splitters' `split()` method. | `None` |
`X_deployment` | `Optional[ndarray]` | An array of (n_deployment_samples, n_features). | `None` |
`deployment_distances` | `Optional[ndarray]` | An array of (n_deployment_samples, 1) precomputed distances. | `None` |
`progress` | `bool` | Whether to show a progress bar. | `False` |
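Putting it together, a hedged end-to-end sketch of the protocol on toy data (the exact call pattern is inferred from the signatures above):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from splito import MOODSplitter  # assumed import path

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                  # modeling dataset
X_deployment = rng.normal(0.5, size=(100, 16))  # where the model will be used

splitter = MOODSplitter({"random": ShuffleSplit(n_splits=1, test_size=0.2)})
splitter.fit(X, X_deployment=X_deployment)

print(splitter.prescribed_splitter_label)       # name of the winning method
train_idx, test_idx = next(splitter.get_prescribed_splitter().split(X))
```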
KMeansSplit
Bases: GroupShuffleSplit
Group-based split that uses k-means clustering in the input space for splitting.
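A sketch of the underlying idea, assuming the k-means cluster labels serve as the groups of the parent GroupShuffleSplit, so whole clusters land in either train or test:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GroupShuffleSplit

X = np.random.default_rng(0).normal(size=(200, 8))

# Each cluster becomes one group; splitting by group keeps clusters intact.
groups = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
train_idx, test_idx = next(
    GroupShuffleSplit(test_size=0.2, random_state=0).split(X, groups=groups)
)
```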
PerimeterSplit
Bases: KMeansReducedDistanceSplitBase
Places the pairs of data points with maximal pairwise distance in the test set. This was originally called the extrapolation-oriented split, introduced in Szántai-Kis et al., 2003.
get_split_from_distance_matrix
get_split_from_distance_matrix(
mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)
Iteratively places the pairs of data points with maximal pairwise distance in the test set. Anything that remains is added to the train set.
Intuitively, this leads to a test set where all the datapoints are on the "perimeter" of the high-dimensional data cloud.
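A hedged sketch of that greedy selection on a dense distance matrix (groups are ignored here for brevity; the actual method works on group-reduced distances):

```python
import numpy as np

def perimeter_test_indices(mat: np.ndarray, n_test: int) -> np.ndarray:
    # Repeatedly pick the remaining pair with maximal pairwise distance.
    mat = mat.astype(float).copy()
    np.fill_diagonal(mat, -np.inf)
    picked = []
    while len(picked) < n_test:
        i, j = np.unravel_index(np.argmax(mat), mat.shape)
        picked += [i, j]
        mat[[i, j], :] = -np.inf  # exclude picked points from future pairs
        mat[:, [i, j]] = -np.inf
    return np.asarray(picked[:n_test])  # everything else goes to train
```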
MaxDissimilaritySplit
Bases: KMeansReducedDistanceSplitBase
Splits the data such that the train and test set are maximally dissimilar.
get_split_from_distance_matrix
get_split_from_distance_matrix(
mat: np.ndarray, group_indices: np.ndarray, n_train: int, n_test: int
)
The Maximum Dissimilarity Split splits the data by trying to maximize the distance between train and test.
This is done as follows:
1. As the initial test sample, take the data point that is, on average, furthest from all other samples.
2. As the initial train sample, take the data point that is furthest from the initial test sample.
3. Iteratively add the train sample that is closest to the initial train sample.
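A hedged sketch of those three steps for a dense distance matrix (the real method operates on k-means-reduced group distances):

```python
import numpy as np

def max_dissimilarity_split(mat: np.ndarray, n_train: int):
    test_seed = int(np.argmax(mat.mean(axis=1)))  # (1) furthest on average
    train_seed = int(np.argmax(mat[test_seed]))   # (2) furthest from test seed
    train = [train_seed]
    remaining = set(range(len(mat))) - {test_seed, train_seed}
    while len(train) < n_train and remaining:     # (3) grow train towards its seed
        nxt = min(remaining, key=lambda i: mat[train_seed, i])
        train.append(nxt)
        remaining.remove(nxt)
    return np.asarray(train), np.asarray([test_seed, *sorted(remaining)])
```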
ScaffoldSplit
Bases: GroupShuffleSplit
The default scaffold split popular in the molecular modeling literature.
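A sketch of the core idea, assuming Bemis-Murcko scaffolds (via RDKit) define the groups; the exact scaffold definition ScaffoldSplit uses may differ:

```python
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupShuffleSplit

smiles = ["c1ccccc1O", "c1ccccc1CC", "C1CCNCC1", "CC1CCNCC1"]

# Molecules sharing a scaffold must end up on the same side of the split.
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(smiles=s) for s in smiles]
train_idx, test_idx = next(
    GroupShuffleSplit(test_size=0.5, random_state=0).split(smiles, groups=scaffolds)
)
```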
MolecularMinMaxSplit
Bases: BaseShuffleSplit
Uses the Min-Max Diversity picker from RDKit and Datamol to obtain a diverse set of molecules in the train set.
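A hedged sketch using RDKit's MaxMin picker on Morgan fingerprints (the fingerprint choice and the picked-set-becomes-train convention are assumptions):

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCCC", "c1ccncc1"]
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 2048)
       for s in smiles]

# Pick a maximally diverse subset for the train set; the rest is the test set.
train_idx = list(MaxMinPicker().LazyBitVectorPick(fps, len(fps), 4, seed=0))
test_idx = [i for i in range(len(fps)) if i not in set(train_idx)]
```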
MolecularWeightSplit
Bases: BaseShuffleSplit
Splits the dataset by sorting the molecules by their molecular weight and then finding an appropriate cutoff to split the molecules into two sets.
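A hedged sketch of that idea; which side of the cutoff becomes the test set is an assumption (extrapolating to heavier molecules here):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "CCCCCCCC", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
weights = np.array([Descriptors.MolWt(Chem.MolFromSmiles(s)) for s in smiles])

order = np.argsort(weights)        # sort light -> heavy
n_train = int(0.75 * len(order))   # cutoff at the desired train fraction
train_idx, test_idx = order[:n_train], order[n_train:]
```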
StratifiedDistributionSplit
Bases: GroupShuffleSplit
Splits a dataset using the values of a readout, so that the train, test, and validation sets all have the same distribution of values. Instead of binning with some kind of interval (rolling_windows), this uses a 1D clustering of the readout.
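A hedged sketch of the clustering-as-binning idea: cluster the 1D readout, then stratify on the cluster labels so each set mirrors the overall distribution (the class itself derives from GroupShuffleSplit; this only illustrates the binning step):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedShuffleSplit

y = np.random.default_rng(0).lognormal(size=500)

# 1D k-means replaces fixed-interval binning of the readout.
bins = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))
train_idx, test_idx = next(
    StratifiedShuffleSplit(test_size=0.2, random_state=0).split(y.reshape(-1, 1), bins)
)
```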