2.4.14 Samplers

Samplers remove outliers and noisy data before classification, which helps improve the robustness of the model.

Isolation Forest Filtering

Computes an anomaly score for each sample using the Isolation Forest algorithm. The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Samples that can be isolated with fewer splits are more likely to be outliers.

Parameters
  • input_data – DataFrame, feature set that is the result of generator_set or feature_selector

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes. If it is not defined, all classes are used.

  • feature_columns – List<String>, list of features. If it is not defined, all features are used.

  • outliers_fraction (float) – Define the ratio of outliers.

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Isolation Forest Filtering",
                   params={ "outliers_fraction": 0.01})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
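
The isolation idea described above can be illustrated without the pipeline. The following is a minimal standard-library sketch, not the transform's actual implementation: the helper names isolation_depth and mean_isolation_depth, the toy data, and the max_depth cap are all assumptions made for illustration. It counts how many random splits are needed to isolate a point and averages that count over many trials.

```python
import random


def isolation_depth(point, data, rng, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    depth = 0
    subset = list(data)
    while len(subset) > 1 and depth < max_depth:
        feature = rng.randrange(len(point))  # randomly select a feature
        lo = min(row[feature] for row in subset)
        hi = max(row[feature] for row in subset)
        depth += 1
        if lo == hi:
            continue  # this feature cannot split the subset; try again
        split = rng.uniform(lo, hi)  # random split between min and max
        # Keep only the side of the split that contains `point`.
        if point[feature] < split:
            subset = [row for row in subset if row[feature] < split]
        else:
            subset = [row for row in subset if row[feature] >= split]
    return depth


def mean_isolation_depth(point, data, n_trials=200, seed=0):
    """Average isolation depth over many random trials (hypothetical helper)."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trials))
    return total / n_trials


# Toy data: a tight 5x5 grid of inliers plus one distant point.
cluster = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

outlier_depth = mean_isolation_depth(outlier, data)
inlier_depth = mean_isolation_depth(cluster[12], data)
print("outlier:", outlier_depth, "inlier:", inlier_depth)
```

The distant point is isolated after far fewer splits on average than a point inside the cluster, which is the property the anomaly score is built on.
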
Local Outlier Factor Filtering

Uses the local outlier factor (LOF) to measure the local deviation of a given data point with respect to its neighbors by comparing their local densities.

The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. Samples that have a substantially lower density than their neighbors are considered outliers.

Parameters
  • input_data – DataFrame, feature set that is the result of generator_set or feature_selector

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes. If it is not defined, all classes are used.

  • feature_columns – List<String>, list of features. If it is not defined, all features are used.

  • outliers_fraction (float) – Define the ratio of outliers.

  • number_of_neighbors (int) – Number of neighbors for a vector.

  • norm (string) – Metric that will be used for the distance computation.

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Local Outlier Factor Filtering",
                   params={"outliers_fraction": 0.05,
                            "number_of_neighbors": 5})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
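
To make the density comparison concrete, here is a small standard-library sketch of the LOF score. It is an illustration under stated assumptions rather than the transform's implementation: the function name local_outlier_factors is hypothetical, the distance is fixed to Euclidean, and ties at the k-th neighbor are broken by index instead of being included as in the full definition.

```python
import math


def local_outlier_factors(points, k):
    """Return a simplified LOF score for each point (Euclidean distance)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data: a 3x3 grid of inliers plus one distant point (last entry).
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, 10.0))

scores = local_outlier_factors(points, k=3)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

Scores close to 1 indicate a density similar to that of the neighbors, while the distant point receives a score well above 1.
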
One Class SVM filtering

Unsupervised outlier detection that estimates the support of a high-dimensional distribution. The implementation is based on libsvm.

Parameters
  • input_data – DataFrame, feature set that is the result of generator_set or feature_selector

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes. If it is not defined, all classes are used.

  • feature_columns – List<String>, list of features. If it is not defined, all features are used.

  • outliers_fraction (float) – Define the ratio of outliers.

  • kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’.

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("One Class SVM filtering",
                   params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
Robust Covariance Filtering

Unsupervised outlier detection. Detects outliers in a Gaussian-distributed data set.

Parameters
  • input_data – DataFrame, feature set that is the result of generator_set or feature_selector

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes. If it is not defined, all classes are used.

  • feature_columns – List<String>, list of features. If it is not defined, all features are used.

  • outliers_fraction (float) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Robust Covariance Filtering",
                   params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
Sample By Metadata

Select rows from the input DataFrame based on a metadata column. Rows that have a metadata value that is in the values list will be returned.

Parameters
  • input_data (DataFrame) – input DataFrame

  • metadata_name (str) – name of the metadata column to use for sampling

  • metadata_values (list[str]) – list of values of the named column for which to select rows of the input data.

Returns

DataFrame containing only the rows for which the metadata value is in the accepted list
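
The selection rule can be sketched in a few lines. This is a standard-library illustration only: rows are modeled as a list of dictionaries instead of a DataFrame, and the function name sample_by_metadata and the sample data are assumptions.

```python
def sample_by_metadata(rows, metadata_name, metadata_values):
    """Keep only the rows whose metadata value is in the accepted list."""
    accepted = set(metadata_values)
    return [row for row in rows if row[metadata_name] in accepted]


rows = [
    {"Subject": "s01", "Class": "walking", "accelx": 12},
    {"Subject": "s02", "Class": "running", "accelx": 40},
    {"Subject": "s03", "Class": "walking", "accelx": 15},
]

selected = sample_by_metadata(rows, "Subject", ["s01", "s03"])
print([row["Subject"] for row in selected])
```
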

Combine Labels

Combine several labels into a single group label, according to a mapping from each new group name to the list of existing labels it replaces.

Syntax: combine_labels = {'group1': ['label1', 'label2'], 'group2': ['label3', 'label4'], 'group3': ['group5']}

Parameters
  • input_data (DataFrame) – input DataFrame

  • label_column (str) – label column name

  • combine_labels (dict) – map from each new group name to the list of labels to combine into it

Returns

DataFrame with the labels in label_column combined according to combine_labels
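
The combine_labels syntax maps each new group name to the labels it absorbs. A minimal standard-library sketch of that relabeling follows. Rows are modeled as dictionaries, the function name is hypothetical, and leaving unlisted labels unchanged is an assumption, since the text does not say how the transform treats them.

```python
def combine_labels(rows, label_column, combine_labels_map):
    """Replace each label with the name of the group that lists it."""
    # Invert {'group': ['label', ...]} into {'label': 'group'}.
    label_to_group = {
        label: group
        for group, labels in combine_labels_map.items()
        for label in labels
    }
    # Assumption: labels that appear in no group are kept as they are.
    return [
        {**row, label_column: label_to_group.get(row[label_column], row[label_column])}
        for row in rows
    ]


rows = [
    {"Class": "label1"},
    {"Class": "label2"},
    {"Class": "label3"},
    {"Class": "other"},
]
mapping = {"group1": ["label1", "label2"], "group2": ["label3", "label4"]}

combined = combine_labels(rows, "Class", mapping)
print([row["Class"] for row in combined])
```
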

Zscore Filter

Filter out feature vectors that have features outside of a cutoff threshold

Parameters
  • input_data (DataFrame) – Input DataFrame

  • label_column (str) – Label column name

  • zscore_cutoff (int) – Z-score cutoff above which a feature value is treated as an outlier

  • feature_threshold (int) – The number of features in a feature vector that can be outside of the zscore_cutoff without removing the feature vector

  • features (list) – List of features to filter by; if None, all features are used (default: None)

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing only the feature vectors that pass the z-score filter.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Zscore Filter",
                   params={"zscore_cutoff": 3, "feature_threshold": 1})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
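
The interaction between zscore_cutoff and feature_threshold can be sketched with the standard library. This is an illustration, not the transform's implementation: rows are dictionaries, the function name is hypothetical, and using the population standard deviation over all rows is an assumption.

```python
from statistics import mean, pstdev


def zscore_filter(rows, features, zscore_cutoff, feature_threshold):
    """Drop rows with more than `feature_threshold` features beyond the cutoff."""
    # Per-feature mean and population standard deviation (assumption).
    stats = {}
    for name in features:
        values = [row[name] for row in rows]
        stats[name] = (mean(values), pstdev(values))

    kept = []
    for row in rows:
        outside = 0
        for name in features:
            mu, sigma = stats[name]
            z = 0.0 if sigma == 0 else abs(row[name] - mu) / sigma
            if z > zscore_cutoff:
                outside += 1
        if outside <= feature_threshold:
            kept.append(row)
    return kept


# Nine ordinary feature vectors and one with an extreme value in feature "a".
rows = [{"a": 1.0, "b": 2.0} for _ in range(9)] + [{"a": 100.0, "b": 2.0}]

strict = zscore_filter(rows, ["a", "b"], zscore_cutoff=2.0, feature_threshold=0)
lenient = zscore_filter(rows, ["a", "b"], zscore_cutoff=2.0, feature_threshold=1)
print(len(strict), len(lenient))
```

With feature_threshold set to 0 the extreme vector is removed; with a value of 1 it is kept, because only one of its features lies beyond the cutoff.
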
Sigma Outliers Filtering

Sigma outliers filtering is an unsupervised outlier detection method that detects outliers in a Gaussian-distributed data set based on a given sigma value.

Parameters
  • input_data – DataFrame, feature set that is the result of generator_set or feature_selector

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes that will be filtered. If it is not defined, all classes will be filtered.

  • feature_columns – List<String>, list of features. If it is not defined, all features are used.

  • sigma_threshold (float) – Define the sigma threshold beyond which samples are treated as outliers.

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

client.pipeline.reset(delete_cache=False)
df = client.datasets.load_activity_raw()
client.pipeline.set_input_data('test_data', df, force=True,
                data_columns = ['accelx', 'accely', 'accelz'],
                group_columns = ['Subject','Class'],
                label_column = 'Class')
client.pipeline.add_feature_generator([{'name':'Downsample',
                        'params':{"columns": ['accelx','accely','accelz'],
                                "new_length": 5 }}])
results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5, 6, 7, 8]

client.pipeline.add_transform("Sigma Outliers Filtering",
            params={ "sigma_threshold": 1.0 })

results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5]

Sampling Techniques for Handling Imbalanced Data

Undersample Majority Classes

Create a balanced data set by undersampling the majority classes using random sampling without replacement.

Parameters
  • input_data (DataFrame) – input DataFrame

  • label_column (str) – The column to split against

  • target_class_size (int) – Specifies the size of the minimum class to use. If None, the min class size is used; if size is greater than min class size, the min class size is used (default: None).

  • seed (int) – Specifies a random seed to use for sampling

  • maximum_samples_size_per_class (int) – Specifies the maximum number of samples to use per class.

Returns

DataFrame containing undersampled classes
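
Random undersampling without replacement can be sketched as follows. This standard-library illustration models rows as dictionaries, uses a hypothetical function name, and leaves out maximum_samples_size_per_class; it is not the transform's actual implementation.

```python
import random
from collections import Counter, defaultdict


def undersample_majority_classes(rows, label_column, target_class_size=None, seed=None):
    """Sample every class down to a common size, without replacement."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_column]].append(row)

    # The target size is never allowed to exceed the smallest class.
    min_size = min(len(members) for members in by_class.values())
    size = min_size if target_class_size is None else min(target_class_size, min_size)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_class):
        sampled.extend(rng.sample(by_class[label], size))  # no replacement
    return sampled


rows = [{"id": i, "Class": "walking"} for i in range(6)]
rows += [{"id": i, "Class": "running"} for i in range(6, 9)]

balanced = undersample_majority_classes(rows, "Class", seed=42)
class_counts = Counter(row["Class"] for row in balanced)
print(class_counts)
```

Passing the same seed reproduces the same selection, which keeps pipeline runs comparable.
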

Sampling Techniques for Augmenting Data Sets

Pad Segment

Pad a segment so that its length is equal to a specific sequence length

Parameters
  • input_data (DataFrame) – input DataFrame

  • group_columns (str) – The column to group by (should be SegmentID)

  • sequence_length (int) – Specifies the length that each padded segment should have.

  • noise_level (int) – max amount of noise to add to augmentation

Returns

DataFrame containing padded segments

Resampling by Majority Vote

For each group, perform max pooling on the specified metadata_name column and set the value of that metadata column to the most frequently occurring value.

Parameters
  • input_data (DataFrame) – input DataFrame

  • group_columns (list) – Columns to group over

  • metadata_name (str) – name of the metadata column to use for sampling

Returns

DataFrame with metadata_name column being modified by max pooling
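
The per-group vote can be sketched with the standard library. This is an illustration under stated assumptions: rows are dictionaries, the function name is hypothetical, and ties go to the value seen first in the group, which the text does not specify.

```python
from collections import Counter, defaultdict


def resample_by_majority_vote(rows, group_columns, metadata_name):
    """Set metadata_name in every row to the most frequent value of its group."""
    values_by_group = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in group_columns)
        values_by_group[key].append(row[metadata_name])

    # most_common(1) returns the first-seen value when counts are tied.
    majority = {
        key: Counter(values).most_common(1)[0][0]
        for key, values in values_by_group.items()
    }
    return [
        {**row, metadata_name: majority[tuple(row[column] for column in group_columns)]}
        for row in rows
    ]


rows = [
    {"SegmentID": 0, "Class": "walking"},
    {"SegmentID": 0, "Class": "walking"},
    {"SegmentID": 0, "Class": "running"},
    {"SegmentID": 1, "Class": "running"},
    {"SegmentID": 1, "Class": "running"},
]

voted = resample_by_majority_vote(rows, ["SegmentID"], "Class")
print([(row["SegmentID"], row["Class"]) for row in voted])
```
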