2.4.14 Samplers
Samplers are used to remove outliers and noisy data before classification, which improves the robustness of the model.
- Isolation Forest Filtering

Returns the anomaly score of each sample using the IsolationForest algorithm. The “Isolation Forest” isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Parameters
- input_data – DataFrame, feature set that is the result of generator_set or feature_selector
- label_column (str) – Label column name.
- filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
- feature_columns – List<String>, list of features. If it is not defined, all features are used.
- outliers_fraction (float) – Defines the ratio of outliers.
- assign_unknown (bool) – Assign the unknown label to outliers.

Returns
- DataFrame containing features without outliers and noise.

Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Isolation Forest Filtering", params={"outliers_fraction": 0.01})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
- Local Outlier Factor Filtering

The local outlier factor (LOF) measures the local deviation of a given data point with respect to its neighbors by comparing their local densities.

The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.

Parameters
- input_data – DataFrame, feature set that is the result of generator_set or feature_selector
- label_column (str) – Label column name.
- filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
- feature_columns – List<String>, list of features. If it is not defined, all features are used.
- outliers_fraction (float) – Defines the ratio of outliers.
- number_of_neighbors (int) – Number of neighbors for a vector.
- norm (string) – Metric that will be used for the distance computation.
- assign_unknown (bool) – Assign the unknown label to outliers.

Returns
- DataFrame containing features without outliers and noise.

Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Local Outlier Factor Filtering", params={"outliers_fraction": 0.05,
                                                                            "number_of_neighbors": 5})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
- One Class SVM filtering

Unsupervised outlier detection. Estimates the support of a high-dimensional distribution. The implementation is based on libsvm.

Parameters
- input_data – DataFrame, feature set that is the result of generator_set or feature_selector
- label_column (str) – Label column name.
- filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
- feature_columns – List<String>, list of features. If it is not defined, all features are used.
- outliers_fraction (float) – Defines the ratio of outliers.
- kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, or ‘sigmoid’.
- assign_unknown (bool) – Assign the unknown label to outliers.

Returns
- DataFrame containing features without outliers and noise.

Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("One Class SVM filtering", params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
- Robust Covariance Filtering

Unsupervised outlier detection. An object for detecting outliers in a Gaussian distributed dataset.

Parameters
- input_data – DataFrame, feature set that is the result of generator_set or feature_selector
- label_column (str) – Label column name.
- filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
- feature_columns – List<String>, list of features. If it is not defined, all features are used.
- outliers_fraction (float) – An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Should be in the interval (0, 1]. By default, 0.5 is used.
- assign_unknown (bool) – Assign the unknown label to outliers.

Returns
- DataFrame containing features without outliers and noise.

Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Robust Covariance Filtering", params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
- Sample By Metadata

Select rows from the input DataFrame based on a metadata column. Rows whose metadata value is in the values list are returned. A usage sketch is shown after the parameter list below.

Parameters
- input_data (DataFrame) – Input DataFrame.
- metadata_name (str) – Name of the metadata column to use for sampling.
- metadata_values (list[str]) – List of values of the named column for which to select rows of the input data.

Returns
- DataFrame containing only the rows for which the metadata value is in the accepted list.
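The following is a minimal usage sketch rather than an example from the original documentation: it assumes the activity dataset and pipeline set up as in the filtering examples above and uses the 'Subject' group column as the metadata column; the subject values shown are hypothetical.

>>> client.pipeline.add_transform("Sample By Metadata",
                                  params={"metadata_name": "Subject",
                                          "metadata_values": ["U01", "U02"]})
>>> results, stats = client.pipeline.execute()
>>> # Only the rows whose Subject value is 'U01' or 'U02' remain (hypothetical values)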
- Combine Labels

Combine class labels in the label column into groups. Each key of combine_labels defines a new group label, and rows whose label appears in that group's list are assigned the group label. A usage sketch is shown after the parameter list below.

Syntax: combine_labels = {'group1': ['label1', 'label2'], 'group2': ['label3', 'label4'], 'group3': ['group5']}

Parameters
- input_data (DataFrame) – Input DataFrame.
- label_column (str) – Label column name.
- combine_labels (dict) – Map of new group labels to the lists of labels to combine.

Returns
- DataFrame with the labels in label_column combined according to the combine_labels mapping.
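A minimal sketch of Combine Labels, assuming the pipeline conventions of the earlier examples; the group names ('Moving', 'Still') and class labels are hypothetical and should be replaced with labels that actually exist in your label column.

>>> client.pipeline.add_transform("Combine Labels",
                                  params={"label_column": "Class",
                                          "combine_labels": {"Moving": ["Walking", "Running"],
                                                             "Still": ["Sitting", "Standing"]}})
>>> results, stats = client.pipeline.execute()
>>> # Rows labeled 'Walking' or 'Running' become 'Moving'; 'Sitting' or 'Standing' become 'Still'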
- Zscore Filter

Filter out feature vectors that have features outside of a cutoff threshold.

Parameters
- input_data (DataFrame) – Input DataFrame.
- label_column (str) – Label column name.
- zscore_cutoff (int) – Z-score cutoff; features with a z-score above this value are treated as outliers.
- feature_threshold (int) – The number of features in a feature vector that can be outside of the zscore_cutoff without removing the feature vector.
- features (list) – List of features to filter by; if None, all features are used (default: None).
- assign_unknown (bool) – Assign the unknown label to outliers.

Returns
- DataFrame containing the feature vectors that pass the z-score filter.

Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Zscore Filter", params={"zscore_cutoff": 3, "feature_threshold": 1})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
- Sigma Outliers Filtering

Sigma Outliers Filtering is an unsupervised outlier detection method for a Gaussian distributed dataset, based on a given sigma threshold.

Parameters
- input_data – DataFrame, feature set that is the result of generator_set or feature_selector
- label_column (str) – Label column name.
- filtering_label – List<String>, list of classes that will be filtered. If it is not defined, all classes are filtered.
- feature_columns – List<String>, list of features. If it is not defined, all features are used.
- sigma_threshold (float) – Defines the sigma threshold used to identify outliers.
- assign_unknown (bool) – Assign the unknown label to outliers.

Returns
- DataFrame containing features without outliers and noise.

Examples
client.pipeline.reset(delete_cache=False)
df = client.datasets.load_activity_raw()
client.pipeline.set_input_data('test_data', df, force=True,
                               data_columns=['accelx', 'accely', 'accelz'],
                               group_columns=['Subject', 'Class'],
                               label_column='Class')
client.pipeline.add_feature_generator([{'name': 'Downsample',
                                        'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                   "new_length": 5}}])
results, stats = client.pipeline.execute()

# List of all data indices before the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5, 6, 7, 8]

client.pipeline.add_transform("Sigma Outliers Filtering", params={"sigma_threshold": 1.0})
results, stats = client.pipeline.execute()

# List of all data indices after the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5]
Sampling Techniques for Handling Imbalanced Data
- Undersample Majority Classes

Create a balanced data set by undersampling the majority classes using random sampling without replacement. A usage sketch is shown after the parameter list below.

Parameters
- input_data (DataFrame) – Input DataFrame.
- label_column (str) – The column to split against.
- target_class_size (int) – Specifies the number of samples to keep per class. If None, the size of the smallest class is used; if the value is greater than the smallest class size, the smallest class size is used (default: None).
- seed (int) – Specifies a random seed to use for sampling.
- maximum_samples_size_per_class (int) – Specifies the maximum number of samples to keep per class.

Returns
- DataFrame containing the undersampled classes.
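A minimal sketch, assuming the same pipeline setup as the earlier examples; the parameter values are illustrative only.

>>> client.pipeline.add_transform("Undersample Majority Classes",
                                  params={"label_column": "Class",
                                          "target_class_size": 50,
                                          "seed": 42})
>>> results, stats = client.pipeline.execute()
>>> # Each class is reduced to min(50, size of the smallest class) randomly chosen samples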
Sampling Techniques for Augmenting Data Sets
- Pad Segment

Pad a segment so that its length is equal to a specific sequence length. A usage sketch is shown after the parameter list below.

Parameters
- input_data (DataFrame) – Input DataFrame.
- group_columns (str) – The columns to group by (should include SegmentID).
- sequence_length (int) – The target length to pad each segment to.
- noise_level (int) – Maximum amount of noise to add to the padded values for augmentation.

Returns
- DataFrame containing the padded segments.
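A sketch under stated assumptions: it follows the pipeline conventions of the earlier examples, the SegmentID group column is hypothetical, and the sequence length and noise level are arbitrary illustrative values.

>>> client.pipeline.add_transform("Pad Segment",
                                  params={"group_columns": ["Subject", "Class", "SegmentID"],
                                          "sequence_length": 128,
                                          "noise_level": 5})
>>> results, stats = client.pipeline.execute()
>>> # Each segment is padded with low-level noise until its length equals 128 samples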
- Resampling by Majority Vote

For each group, perform max pooling on the specified metadata_name column and set the value of that metadata column to the maximum occurring value. A usage sketch is shown after the parameter list below.

Parameters
- input_data (DataFrame) – Input DataFrame.
- group_columns (list) – Columns to group over.
- metadata_name (str) – Name of the metadata column to use for sampling.

Returns
- DataFrame with the metadata_name column modified by max pooling.
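A hedged sketch following the conventions above; the group columns (including the hypothetical SegmentID) and the use of 'Class' as the metadata column are assumptions for illustration, not values from the original documentation.

>>> client.pipeline.add_transform("Resampling by Majority Vote",
                                  params={"group_columns": ["Subject", "SegmentID"],
                                          "metadata_name": "Class"})
>>> results, stats = client.pipeline.execute()
>>> # Within each (Subject, SegmentID) group, the Class value is replaced by the max-pooled value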