2.4.14 Samplers

Isolation Forest Filtering

Isolation Forest Filtering computes an anomaly score for each sample using the Isolation Forest algorithm. The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Outliers tend to be isolated in fewer splits than inliers, so a short average path length indicates an anomaly.

Parameters

input_data – DataFrame, feature set produced by generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Isolation Forest Filtering",
                   params={ "outliers_fraction": 0.01})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
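
To build intuition for why random splits separate outliers quickly, the following is a minimal one-dimensional sketch of the isolation idea. It is not the transform's implementation: the helper names `isolation_depth` and `mean_depth` are hypothetical, and the real algorithm also picks a random feature at every split.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed to isolate `target` within `values`."""
    depth = 0
    current = list(values)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # only duplicates remain, nothing left to split
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_depth(values, target, n_trees=200, seed=0):
    """Average isolation depth of `target` over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees

# Seven tightly clustered values and one obvious outlier.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.02, 0.98, 10.0]
outlier_depth = mean_depth(data, 10.0)
inlier_depth = mean_depth(data, 1.0)
print(outlier_depth, inlier_depth)
```

The outlier needs fewer splits on average than the inlier, which is the signal an isolation forest turns into an anomaly score.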

Local Outlier Factor Filtering

The local outlier factor (LOF) measures the local deviation of a given data point with respect to its neighbors by comparing their local densities.

The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It treats samples that have a substantially lower density than their neighbors as outliers.

Parameters

input_data – DataFrame, feature set produced by generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
number_of_neighbors (int) – Number of neighbors for a vector.
norm (string) – Metric that will be used for the distance computation.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Local Outlier Factor Filtering",
                   params={"outliers_fraction": 0.05,
                            "number_of_neighbors": 5})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
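
The density comparison behind LOF can be sketched in plain Python. This is an illustrative simplification under stated assumptions, not the transform's code: it uses Euclidean distance, ignores ties at the k-th neighbor, and assumes no duplicate points. The function names are hypothetical.

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return dists[:k]

def local_outlier_factor(points, k):
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # Distance from each point to its k-th nearest neighbor.
    k_distance = [nb[-1][0] for nb in neighbors]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = local_outlier_factor(points, k=3)
print([round(s, 2) for s in scores])
```

Scores near 1 mean a point is about as dense as its neighbors; the isolated point at (8, 8) gets a score far above 1.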

One Class SVM filtering

Unsupervised outlier detection that estimates the support of a high-dimensional distribution. The implementation is based on libsvm.

Parameters

input_data – DataFrame, feature set produced by generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("One Class SVM filtering",
                   params={"outliers_fraction": 0.05})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]

Robust Covariance Filtering

Unsupervised outlier detection for datasets that follow a Gaussian distribution.

Parameters

input_data – DataFrame, feature set produced by generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Robust Covariance Filtering",
                   params={"outliers_fraction": 0.05})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]

Sample By Metadata

Select rows from the input DataFrame based on a metadata column. Rows that have a metadata value that is in the values list will be returned.

Parameters

input_data (DataFrame) – input DataFrame
metadata_name (str) – name of the metadata column to use for sampling
metadata_values (list [ str ]) – list of values of the named column for which to select rows of the input data.

Returns

DataFrame containing only the rows for which the metadata value is in the accepted list
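
The selection rule can be illustrated with a small sketch. Plain lists of dictionaries stand in for the DataFrame here, and the function below is a hypothetical stand-in that mirrors the parameter names, not the transform itself.

```python
def sample_by_metadata(rows, metadata_name, metadata_values):
    """Keep only rows whose metadata value is in the accepted list."""
    accepted = set(metadata_values)
    return [row for row in rows if row[metadata_name] in accepted]

rows = [
    {"Subject": "s01", "Class": "Walking"},
    {"Subject": "s01", "Class": "Running"},
    {"Subject": "s02", "Class": "Walking"},
    {"Subject": "s03", "Class": "Sitting"},
]

# Keep only the rows recorded for subjects s01 and s03.
selected = sample_by_metadata(rows, "Subject", ["s01", "s03"])
print(selected)
```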

Combine Labels

Combine existing labels into new groups. Each key of combine_labels is a new label, and its value is the list of existing labels that are replaced by that new label.

syntax combine_labels = {'group1': ['label1', 'label2'], 'group2': ['label3', 'label4'], 'group3': ['group5']}

Parameters

input_data (DataFrame) – input DataFrame
label_column (str) – label column name
combine_labels (dict) – map of new label names to the lists of existing labels to combine

Returns

DataFrame with the labels in label_column replaced by their combined group names
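
A minimal sketch of how such a mapping could be applied, using plain dictionaries in place of DataFrame rows. The helper `apply_combine_labels` is hypothetical, and leaving unlisted labels unchanged is an assumption; the transform may handle them differently.

```python
def apply_combine_labels(rows, label_column, combine_labels):
    """Replace each label with the group that lists it."""
    # Invert {'group': ['label', ...]} into {'label': 'group'}.
    lookup = {
        old: new for new, olds in combine_labels.items() for old in olds
    }
    return [
        # Assumption: labels missing from the map are kept as they are.
        {**row, label_column: lookup.get(row[label_column], row[label_column])}
        for row in rows
    ]

rows = [
    {"Class": "Walking"},
    {"Class": "Running"},
    {"Class": "Sitting"},
    {"Class": "Unknown"},
]
mapping = {"Moving": ["Walking", "Running"], "Still": ["Sitting"]}
combined = apply_combine_labels(rows, "Class", mapping)
print([row["Class"] for row in combined])
```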

Zscore Filter

Filter out feature vectors that have features outside of a cutoff threshold

Parameters

input_data (DataFrame) – Input DataFrame
label_column (str) – Label column name
zscore_cutoff (int) – Z-score cutoff; a feature whose z-score is above this value counts as outside the cutoff
feature_threshold (int) – The number of features in a feature vector that can be outside of the zscore_cutoff without removing the feature vector
features (list) – List of features to filter by; if None, all features are used (default None)
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing only the feature vectors that pass the z-score filter.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Zscore Filter",
                   params={"zscore_cutoff": 3, "feature_threshold": 1})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
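
How zscore_cutoff and feature_threshold interact can be sketched as follows. This is an assumption-laden illustration, not the transform's code: it uses the population standard deviation, compares the absolute z-score against the cutoff, and removes a vector only when more than feature_threshold of its features exceed it.

```python
from statistics import mean, pstdev

def zscore_filter(rows, features, zscore_cutoff, feature_threshold):
    """Drop rows with more than `feature_threshold` features beyond the cutoff."""
    stats = {}
    for f in features:
        column = [row[f] for row in rows]
        stats[f] = (mean(column), pstdev(column))
    kept = []
    for row in rows:
        outside = 0
        for f in features:
            mu, sigma = stats[f]
            # A constant feature has no spread, so it cannot be an outlier.
            if sigma > 0 and abs(row[f] - mu) / sigma > zscore_cutoff:
                outside += 1
        if outside <= feature_threshold:
            kept.append(row)
    return kept

# Ten ordinary vectors, one with a single extreme feature, one with two.
rows = [{"id": i, "a": 1.0, "b": 1.0} for i in range(10)]
rows.append({"id": 10, "a": 20.0, "b": 1.0})
rows.append({"id": 11, "a": 20.0, "b": 20.0})

lenient = zscore_filter(rows, ["a", "b"], zscore_cutoff=2.0, feature_threshold=1)
strict = zscore_filter(rows, ["a", "b"], zscore_cutoff=2.0, feature_threshold=0)
print(len(lenient), len(strict))
```

With feature_threshold=1 only the vector with two extreme features is removed; with feature_threshold=0 both extreme vectors are removed.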

Sigma Outliers Filtering

Sigma outliers filtering is an unsupervised outlier detection method. It detects outliers in a Gaussian distributed dataset based on a given sigma value.

Parameters

input_data – DataFrame, feature set produced by generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes that will be filtered. If it is not defined, all classes will be filtered.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
sigma_threshold (float) – Define the sigma threshold, in standard deviations, used to identify outliers.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

client.pipeline.reset(delete_cache=False)
df = client.datasets.load_activity_raw()
client.pipeline.set_input_data('test_data', df, force=True,
                data_columns = ['accelx', 'accely', 'accelz'],
                group_columns = ['Subject','Class'],
                label_column = 'Class')
client.pipeline.add_feature_generator([{'name':'Downsample',
                        'params':{"columns": ['accelx','accely','accelz'],
                                "new_length": 5 }}])
results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5, 6, 7, 8]

client.pipeline.add_transform("Sigma Outliers Filtering",
            params={ "sigma_threshold": 1.0 })

results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5]

Undersample Majority Classes

Create a balanced data set by undersampling the majority classes using random sampling without replacement.

Parameters

input_data (DataFrame) – input DataFrame
label_column (str) – The column to split against
target_class_size (int) – Specifies the size of the minimum class to use. If None, the min class size is used; if size is greater than min class size, the min class size is used (default: None).
seed (int) – Specifies a random seed to use for sampling
maximum_samples_size_per_class (int) – Specifies the maximum number of samples to use per class.

Returns

DataFrame containing undersampled classes
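
The balancing rule described by target_class_size can be sketched in plain Python. The function below is a hypothetical stand-in that works on lists of dictionaries instead of a DataFrame, and it leaves out maximum_samples_size_per_class.

```python
import random
from collections import Counter, defaultdict

def undersample_majority_classes(rows, label_column, target_class_size=None, seed=0):
    """Randomly sample every class down to the same size, without replacement."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)
    min_size = min(len(group) for group in by_label.values())
    # A target larger than the smallest class falls back to the smallest class.
    if target_class_size is None:
        size = min_size
    else:
        size = min(target_class_size, min_size)
    rng = random.Random(seed)
    sampled = []
    for group in by_label.values():
        sampled.extend(rng.sample(group, size))  # sample() never repeats a row
    return sampled

rows = (
    [{"Class": "A", "id": i} for i in range(5)]
    + [{"Class": "B", "id": i} for i in range(3)]
    + [{"Class": "C", "id": i} for i in range(2)]
)
balanced = undersample_majority_classes(rows, "Class")
counts = Counter(row["Class"] for row in balanced)
print(counts)
```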

Pad Segment

Pad a segment so that its length is equal to a specific sequence length

Parameters

input_data (DataFrame) – input DataFrame
group_columns (str) – The column to group by (should be SegmentID)
sequence_length (int) – Specifies the sequence length that each segment is padded to.
noise_level (int) – max amount of noise to add to augmentation

Returns

DataFrame containing padded segments

Resampling by Majority Vote

For each group, perform max pooling on the specified metadata_name column and set the value of that metadata column to the most frequently occurring value.

Parameters

input_data (DataFrame) – input DataFrame
group_columns (list) – Columns to group over
metadata_name (str) – name of the metadata column to use for sampling

Returns

DataFrame with metadata_name column being modified by max pooling
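
A minimal sketch of the majority vote, with lists of dictionaries standing in for the DataFrame. The function name is hypothetical, and the tie-breaking rule (the value seen first wins) is an assumption rather than documented behavior.

```python
from collections import Counter, defaultdict

def resample_by_majority_vote(rows, group_columns, metadata_name):
    """Set metadata_name to the most common value within each group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in group_columns)
        groups[key].append(row)
    result = []
    for members in groups.values():
        votes = Counter(row[metadata_name] for row in members)
        winner = votes.most_common(1)[0][0]
        # Copy each row so the input rows are left untouched.
        result.extend({**row, metadata_name: winner} for row in members)
    return result

rows = [
    {"SegmentID": 0, "Label": "Walking"},
    {"SegmentID": 0, "Label": "Walking"},
    {"SegmentID": 0, "Label": "Running"},
    {"SegmentID": 1, "Label": "Sitting"},
    {"SegmentID": 1, "Label": "Sitting"},
]
voted = resample_by_majority_vote(rows, ["SegmentID"], "Label")
print([row["Label"] for row in voted])
```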