2.4.12 Validation Methods

Validation methods check the robustness and accuracy of a model and help diagnose whether a model is overfitting or underfitting.

Stratified K-Fold Cross-Validation

A variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set. In other words, for a data set consisting of 100 samples in total, with 40 samples from class 1 and 60 samples from class 2, under a stratified 2-fold scheme each fold will consist of 50 samples in total, with 20 samples from class 1 and 30 samples from class 2.

Parameters
  • number_of_folds (int) – the number of stratified folds to produce

  • test_size (float) – the percentage of data to hold out as a final test set

  • shuffle (bool) – whether to shuffle the data before performing the cross-validation splits
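The stratification behavior can be illustrated with scikit-learn's StratifiedKFold (a conceptual sketch only; the pipeline configures this method internally rather than exposing it as a direct call):

```python
# Illustration of stratified folding with scikit-learn's StratifiedKFold.
from collections import Counter
from sklearn.model_selection import StratifiedKFold

# 100 samples: 40 from class 1, 60 from class 2
X = [[i] for i in range(100)]
y = [1] * 40 + [2] * 60

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    counts = Counter(y[i] for i in test_idx)
    print(f"fold {fold}: {dict(counts)}")
# Each fold's held-out split contains 20 class-1 and 30 class-2 samples,
# preserving the 40/60 class ratio of the full data set.
```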

Leave-One-Subject-Out

A cross-validation scheme which holds out the samples of one subject for testing in each fold, training on the remaining subjects. In other words, for a data set consisting of 10 subjects, each fold will consist of a training set from 9 subjects and a test set from 1 subject; thus, in all, there will be 10 folds, one for each left-out test subject.

Parameters

group_columns (list [ str ]) – list of column names that define the groups (subjects)
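Conceptually, this scheme matches scikit-learn's LeaveOneGroupOut with the subject as the group. The sketch below assumes a hypothetical data set of 10 subjects with 3 samples each:

```python
# Leave-one-subject-out illustrated with scikit-learn's LeaveOneGroupOut.
from sklearn.model_selection import LeaveOneGroupOut

subjects = [s for s in range(10) for _ in range(3)]  # 10 subjects, 3 samples each
X = [[i] for i in range(30)]
y = [0, 1, 0] * 10

logo = LeaveOneGroupOut()
folds = list(logo.split(X, y, groups=subjects))
print(len(folds))  # 10 folds, one per held-out subject

train_idx, test_idx = folds[0]
print(len(train_idx), len(test_idx))  # 27 training samples (9 subjects), 3 test samples (1 subject)
```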

Stratified Metadata k-fold

K-fold iterator variant with non-overlapping metadata/group and label combinations which also attempts to distribute the number of each class evenly across the folds. This is similar to GroupKFold, where you cannot have the same group in multiple folds; in this case, you cannot have the same group and label combination across multiple folds.

The main use case is time series data where you have a Subject group and each subject performs several activities. If you build a model using a sliding window to segment data, you will end up with “Subject A” performing “action 1” many times. If you use a validation method that splits “Subject A” performing “action 1” across different folds, it can often result in data leakage and overfitting. If, however, you build your validation set such that “Subject A” performing “action 1” appears in only a single fold, you can be more confident that your model is generalizing. This validation method will also attempt to ensure you have a similar number of “action 1” segments across your folds.

Parameters
  • number_of_folds (int) – the number of stratified folds to produce

  • metadata_name (str) – the metadata to group on for splitting data into folds.

Metadata k-fold

K-fold iterator variant with non-overlapping metadata groups. The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds). The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

Parameters
  • number_of_folds (int) – the number of stratified folds to produce

  • metadata_name (str) – the metadata to group on for splitting data into folds.
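This behaves like scikit-learn's GroupKFold, which the following sketch uses to illustrate the non-overlapping group guarantee (a conceptual illustration, not the pipeline's own code):

```python
# Metadata k-fold illustrated with scikit-learn's GroupKFold.
from sklearn.model_selection import GroupKFold

metadata = ["A", "A", "B", "B", "C", "C", "D", "D"]  # hypothetical group column
X = [[i] for i in range(8)]
y = [0, 1] * 4

gkf = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf.split(X, y, groups=metadata):
    train_groups = {metadata[i] for i in train_idx}
    test_groups = {metadata[i] for i in test_idx}
    # A group appears in exactly one side of each split.
    assert train_groups.isdisjoint(test_groups)
```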

Recall

The simplest validation method, wherein the training set itself is used as the test set. In other words, for a data set consisting of 100 samples in total, both the training set and the test set consist of the same set of 100 samples.
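The effect is easy to demonstrate with a small scikit-learn sketch: scoring a flexible model on its own training data yields an optimistic estimate, since the model can simply memorize the samples:

```python
# Recall validation sketch: the test set IS the training set.
from sklearn.tree import DecisionTreeClassifier

X = [[i] for i in range(100)]
y = [0] * 50 + [1] * 50  # perfectly separable toy data

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # 1.0 -- an unconstrained tree memorizes the training set
```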

Set Sample Validation

A validation scheme wherein the data set is divided into training and test sets based on two statistical parameters, mean and standard deviation. The user specifies the number of events in each category and may optionally set the target subset mean, standard deviation, the number of samples in the validation set, and the acceptable limit on the number of retries of random selection from the original data set.

Example

samples = {"Class 1": 2500, "Class 2": 2500}
validation = {"Class 1": 2000, "Class 2": 2000}

client.pipeline.set_validation_method(
    {
        "name": "Set Sample Validation",
        "inputs": {
            "samples_per_class": samples,
            "validation_samples_per_class": validation,
        },
    }
)

Parameters
  • data_set_mean (numpy.array [ floats ]) – mean value of each feature in dataset

  • data_set_stdev (numpy.array [ floats ]) – standard deviation of each feature in dataset

  • samples_per_class (dict) – Number of members in subset for training, validation, and testing

  • validation_samples_per_class (dict) – Overrides the number of members in subset for validation if not empty

  • mean_limit (numpy.array [ floats ]) – maximum acceptable difference between the mean of the subset and of the data for any feature

  • stdev_limit (numpy.array [ floats ]) – maximum acceptable difference between the standard deviation of the subset and of the data for any feature

  • retries (int) – Number of attempts to find a subset with similar statistics

  • norm (list [ str ]) – ['Lsup', 'L1'] Distance norm for determining whether the subset is within user-defined limits

  • optimize_mean_std (list [ str ]) – ['both', 'mean'] Logic to use for optimizing the subset. If 'mean', only the mean distance must improve. If 'both', both mean and stdev must improve.

  • binary_class1 (str) – Category name that will be the working class in set composition

Split by Metadata Value

A validation scheme wherein the data set is divided into training and test sets based on the metadata value. In other words, for a data set consisting of 100 samples with the metadata column set to ‘train’ for 60 samples, and ‘test’ for 40 samples, the training set will consist of 60 samples for which the metadata value is ‘train’ and the test set will consist of 40 samples for which the metadata value is ‘test’.

Parameters
  • metadata_name (str) – name of the metadata column to use for splitting

  • training_values (list [ str ]) – list of values of the named column to select samples for training

  • validation_values (list [ str ]) – list of values of the named column to select samples for validation
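A minimal pure-Python sketch of this split, assuming a hypothetical metadata column named set and matching the 60/40 example above:

```python
# Split samples by the value of a metadata column (hypothetical column "set").
samples = [{"id": i, "set": "train" if i < 60 else "test"} for i in range(100)]

training_values = ["train"]
validation_values = ["test"]

train = [s for s in samples if s["set"] in training_values]
validation = [s for s in samples if s["set"] in validation_values]
print(len(train), len(validation))  # 60 40
```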

Stratified Shuffle Split

A validation scheme which splits the data set into training, validation, and (optionally) test sets based on the parameters provided, with similar distribution of labels (hence stratified).

In other words, for a data set consisting of 100 samples in total with 40 samples from class 1 and 60 samples from class 2, for stratified shuffle split with validation_size = 0.4, the validation set will consist of 40 samples with 16 samples from class 1 and 24 samples from class 2, and the training set will consist of 60 samples with 24 samples from class 1 and 36 samples from class 2.

For each fold, the training and validation data are re-shuffled and split.

Parameters
  • test_size (float) – target percent of total size to use for testing

  • validation_size (float) – target percent of total size to use for validation

  • number_of_folds (int) – the number of stratified folds (iterations) to produce
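The 40/60 example above can be reproduced with scikit-learn's StratifiedShuffleSplit, where sklearn's test_size parameter plays the role of validation_size (a conceptual sketch only):

```python
# Stratified shuffle split illustrated with scikit-learn.
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit

X = [[i] for i in range(100)]
y = [1] * 40 + [2] * 60

# test_size=0.4 here corresponds to validation_size=0.4 in the text.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.4, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    held_out = Counter(y[i] for i in test_idx)
    print(dict(held_out))
# Each held-out split contains 16 class-1 and 24 class-2 samples (40 total),
# and each training split contains 24 class-1 and 36 class-2 samples (60 total).
```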