2.4.10 Feature Selectors
Feature selectors are used to select an optimal subset of features before training a classifier.
- Correlation Threshold
This is an unsupervised feature selection algorithm that eliminates features based on absolute pair-wise correlation (similar to backward feature selection). It first calculates the pair-wise correlation matrix of all features. It then identifies a candidate feature for removal: the feature whose correlation coefficient exceeds the threshold with the largest number of other features. This step is repeated until no pair of features has a correlation coefficient above the threshold, or until no features are left.
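The elimination loop can be sketched in a few lines of pandas/NumPy. This is a minimal illustration of the algorithm as described above, not the library's implementation, and the helper name is hypothetical:

import numpy as np
import pandas as pd

def correlation_threshold(features, threshold=0.85):
    """Illustrative sketch: greedily drop the feature that exceeds the
    correlation threshold against the largest number of other features."""
    remaining = features.copy()
    while remaining.shape[1] > 1:
        corr = remaining.corr().abs().to_numpy()   # pair-wise absolute correlations
        np.fill_diagonal(corr, 0.0)                # ignore self-correlation
        counts = (corr > threshold).sum(axis=0)    # partners above threshold, per feature
        if counts.max() == 0:                      # nothing left above the threshold
            break
        remaining = remaining.drop(columns=[remaining.columns[counts.argmax()]])
    return remaining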
- Parameters
threshold – float; default = 0.85. Correlation threshold above which features are eliminated (0 to 1)
passthrough_columns – list of column names; The set of columns the selector should ignore
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject',
 u'gen_0001_accelx_0', u'gen_0001_accelx_1', u'gen_0001_accelx_2',
 u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4',
 u'gen_0003_accelz_0', u'gen_0003_accelz_1', u'gen_0003_accelz_2',
 u'gen_0003_accelz_3', u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Correlation Threshold',
                                           'params': {"threshold": 0.85}}])
>>> results, stats = client.pipeline.execute()
>>> print results
Out:
[u'Class', u'Subject', u'gen_0001_accelx_2', u'gen_0001_accelx_4',
 u'gen_0002_accely_0']
- Custom Feature Selection
This feature selection method allows custom feature selection. It takes a list of strings, where each value is the name of a feature to keep, as shown in the example below.
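Example

A hypothetical invocation that follows the same pattern as the By-Index variant below; the feature names here are placeholders:

client.pipeline.add_feature_selector([{'name': 'Custom Feature Selection',
        'params': {"custom_feature_selection":
                   ['gen_0001_accelx_0', 'gen_0002_accely_3']}}])
# would keep only the two named features plus the passthrough columns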
- Parameters
input_data (DataFrame) – Input data
custom_feature_selection (list) – names of the features to keep
- Returns
- tuple containing:
selected_features (DataFrame): the selected features plus the passthrough columns.
unselected_features (list): the features that were not selected.
- Return type
tuple
- Custom Feature Selection
This feature selection method allows custom feature selection. It takes a dictionary where each key is a feature generator number and the value is an array of the features to keep for that generator. All feature generators that are not added as keys in the dictionary will be dropped.
Example
client.pipeline.add_feature_selector([{'name': 'Custom Feature Selection By Index',
        'params': {"custom_feature_selection":
                   {1: [0], 2: [0], 3: [1, 2, 3, 4]}}}])
# would select feature 0 from feature generators 1 and 2, and
# features 1, 2, 3, 4 from feature generator 3
- Parameters
input_data (DataFrame) – Input data
custom_feature_selection (dict) – maps each feature generator number to the array of its features to keep
- Returns
- tuple containing:
selected_features (DataFrame): the selected features plus the passthrough columns.
unselected_features (list): the features that were not selected.
- Return type
tuple
- Information Gain
This is a supervised feature selection algorithm that selects features based on Information Gain, using a one-class-versus-the-rest approach.
First, it calculates the Information Gain (IG) of every feature for each class separately, then sorts the features by IG score and by the differences in their standard deviations and means. A feature with a higher IG is better at differentiating that class from the others. The result is a separate ranked feature list for each class.
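The one-vs-rest ranking can be sketched as follows. This is a minimal illustration that uses scikit-learn's mutual_info_classif as the IG estimate; the library's actual scoring, which also weighs the std and mean differences, may differ:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def per_class_information_gain(X, y, feature_number):
    """Illustrative sketch: rank features for each class (one-vs-rest)
    and keep the top `feature_number` per class."""
    selected = {}
    for label in y.unique():
        binary = (y == label).astype(int)               # this class vs. all others
        ig = mutual_info_classif(X, binary, random_state=0)
        ranked = pd.Series(ig, index=X.columns).sort_values(ascending=False)
        selected[label] = ranked.index[:feature_number].tolist()
    return selected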
- Parameters
feature_number – int; Number of features to select for each class.
- Returns
DataFrame which includes selected features for each class.
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject',
 u'gen_0001_accelx_0', u'gen_0001_accelx_1', u'gen_0001_accelx_2',
 u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4',
 u'gen_0003_accelz_0', u'gen_0003_accelz_1', u'gen_0003_accelz_2',
 u'gen_0003_accelz_3', u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Information Gain',
                                           'params': {"feature_number": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print results
Out:
      Class Subject  gen_0001_accelx_0  gen_0001_accelx_1  gen_0001_accelx_2
0  Crawling     s01         347.881775         372.258789         208.341858
1  Crawling     s02         347.713013         224.231735          91.971481
2  Crawling     s03         545.664429         503.276642         200.263031
3   Running     s01         -21.588972         -23.511278         -16.322056
4   Running     s02         422.405182         453.950897         431.893585
5   Running     s03         350.105774         366.373627         360.777466
6   Walking     s01         -10.362945         -46.967007           0.492386
7   Walking     s02         375.751343         413.259460         374.443237
8   Walking     s03         353.421906         317.618164         283.627502
- Recursive Feature Elimination
This is a supervised method of feature selection. The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator (method: 'Log R' or 'Linear SVC') is trained on the initial set of features and a weight is assigned to each one. Then, the features whose absolute weights are smallest are pruned from the current set. The procedure is repeated recursively on the pruned set until the desired number of features, number_of_features, is reached.
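The procedure can be sketched with scikit-learn's RFE, assuming estimators configured with the defaults listed in the Parameters below (a sketch, not the library's implementation):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

def recursive_feature_elimination(X, y, method="Log R", number_of_features=3):
    """Illustrative sketch: recursively prune the smallest-weight features."""
    if method == "Log R":
        # solver="liblinear" is an added assumption; it is required for the l1 penalty
        estimator = LogisticRegression(C=1.0, penalty="l1", solver="liblinear")
    else:  # "Linear SVC"
        estimator = LinearSVC(C=0.01, penalty="l1", dual=False)
    selector = RFE(estimator, n_features_to_select=number_of_features).fit(X, y)
    return X.loc[:, selector.support_]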
- Parameters
method – str; The type of selection method. Two options are available: 1) Log R and 2) Linear SVC. For Log R, the inverse of regularization strength C defaults to 1.0 and penalty defaults to l1. For Linear SVC, C defaults to 0.01, penalty to l1, and dual to False.
number_of_features – int; The number of features you would like the selector to reduce to.
- Returns
DataFrame which includes selected features and the passthrough columns.
- Return type
DataFrame
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject',
 u'gen_0001_accelx_0', u'gen_0001_accelx_1', u'gen_0001_accelx_2',
 u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4',
 u'gen_0003_accelz_0', u'gen_0003_accelz_1', u'gen_0003_accelz_2',
 u'gen_0003_accelz_3', u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Recursive Feature Elimination',
                                           'params': {"method": "Log R",
                                                      "number_of_features": 3}}],
                                         params={'number_of_features': 3})
>>> results, stats = client.pipeline.execute()
>>> print results
Out:
      Class Subject  gen_0001_accelx_2  gen_0003_accelz_1  gen_0003_accelz_4
0  Crawling     s01         208.341858        3881.038330        3900.734863
1  Crawling     s02          91.971481        3821.513428        3896.376221
2  Crawling     s03         200.263031        3896.349121        3889.297119
3   Running     s01         -16.322056         641.164185         605.192993
4   Running     s02         431.893585         870.608459         846.671204
5   Running     s03         360.777466         263.184052         234.177200
6   Walking     s01           0.492386         559.139587         558.538086
7   Walking     s02         374.443237         658.902710         669.394592
8   Walking     s03         283.627502         -87.612816         -98.735649
Note: For more information on the defaults for Log R and Linear SVC, see the scikit-learn documentation for LogisticRegression and LinearSVC.
- Tree-Based Selection
Select features using a supervised tree-based algorithm. This class implements a meta-estimator that fits a number of randomized decision trees (Extra Trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The number of trees in the forest defaults to 250 and random_state defaults to 0. See the Notes for more information.
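A minimal sketch, assuming scikit-learn's ExtraTreesClassifier with the stated defaults (250 trees, random_state=0); the top features are taken from the averaged importances:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def tree_based_selection(X, y, number_of_features):
    """Illustrative sketch: keep the features with the highest forest importances."""
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    top = importances.sort_values(ascending=False).index[:number_of_features]
    return X[list(top)]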
- Parameters
number_of_features – int; The number of features you would like the selector to reduce to.
- Returns
DataFrame which includes selected features and the passthrough columns for each class.
- Return type
DataFrame
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject',
 u'gen_0001_accelx_0', u'gen_0001_accelx_1', u'gen_0001_accelx_2',
 u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4',
 u'gen_0003_accelz_0', u'gen_0003_accelz_1', u'gen_0003_accelz_2',
 u'gen_0003_accelz_3', u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Tree-based Selection',
                                           'params': {"number_of_features": 4}}])
>>> results, stats = client.pipeline.execute()
>>> print results
Out:
      Class Subject  gen_0002_accely_0  gen_0002_accely_1  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4
0  Crawling     s01           1.669203           1.559860           1.526786           1.414068           1.413625
1  Crawling     s02           1.486925           1.418474           1.377726           1.414068           1.413625
2  Crawling     s03           1.035519           1.252789           1.332684           1.328587           1.324469
3   Running     s01          -0.700995          -0.678448          -0.706631          -0.674960          -0.713493
4   Running     s02          -0.659030          -0.709012          -0.678594          -0.688869          -0.700753
5   Running     s03          -0.712790          -0.713026          -0.740177          -0.728651          -0.733076
6   Walking     s01          -0.701450          -0.714677          -0.692671          -0.716556          -0.696635
7   Walking     s02          -0.698335          -0.689857          -0.696807          -0.702233          -0.682212
8   Walking     s03          -0.719046          -0.726102          -0.722315          -0.727506          -0.712461

   gen_0003_accelz_0  gen_0003_accelz_1  gen_0003_accelz_2  gen_0003_accelz_3  gen_0003_accelz_4
0           1.360500           1.368615           1.413445           1.426949           1.400083
1           1.360500           1.368615           1.388456           1.408576           1.397417
2           1.410274           1.414961           1.384032           1.345107           1.393088
3          -0.572269          -0.600986          -0.582678          -0.560071          -0.615270
4          -0.494247          -0.458891          -0.471897          -0.475010          -0.467597
5          -0.836257          -0.835071          -0.868028          -0.855081          -0.842161
6          -0.652326          -0.651784          -0.640956          -0.655958          -0.643802
7          -0.551928          -0.590001          -0.570077          -0.558563          -0.576008
8          -1.077342          -1.052320          -1.052297          -1.075949          -1.045750
Notes
For more information, see the scikit-learn documentation for ExtraTreesClassifier.
- t-Test Feature Selector
This is a supervised feature selection algorithm that selects features based on a two-tailed t-test. It computes the p-values and selects the top-performing number of features for each class, as defined by feature_number. It returns a single combined list of the selected features across all classes.
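A minimal sketch of the per-class selection using SciPy's independent t-test (two-tailed by default); the helper name is hypothetical:

import pandas as pd
from scipy.stats import ttest_ind

def t_test_select(X, y, feature_number):
    """Illustrative sketch: union of each class's lowest-p-value features."""
    keep = set()
    for label in y.unique():
        in_class, rest = X[y == label], X[y != label]
        pvals = ttest_ind(in_class, rest).pvalue        # two-tailed p-values per feature
        ranked = pd.Series(pvals, index=X.columns).sort_values()
        keep.update(ranked.index[:feature_number])
    return sorted(keep)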
- Parameters
input_data – DataFrame
label_column (str) – Class label
feature_number (int) – The number of features to select for each class
passthrough_columns – list of columns the selector should ignore
- Univariate Selection
Select the features with the highest univariate (ANOVA) F-values. This is a supervised feature selection method and requires both input features and labels.
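A minimal sketch, assuming scikit-learn's SelectKBest with the ANOVA F-test scorer f_classif:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def univariate_selection(X, y, number_of_features):
    """Illustrative sketch: keep the features with the highest ANOVA F-values."""
    selector = SelectKBest(score_func=f_classif, k=number_of_features).fit(X, y)
    return X.loc[:, selector.get_support()]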
- Parameters
number_of_features – int; The number of features you would like the selector to reduce to.
- Returns
DataFrame which includes selected features and the passthrough columns.
- Return type
DataFrame
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject',
 u'gen_0001_accelx_0', u'gen_0001_accelx_1', u'gen_0001_accelx_2',
 u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4',
 u'gen_0003_accelz_0', u'gen_0003_accelz_1', u'gen_0003_accelz_2',
 u'gen_0003_accelz_3', u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Univariate Selection',
                                           'params': {"number_of_features": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print results
Out:
      Class Subject  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4
0  Crawling     s01           1.526786           1.496120           1.500535
1  Crawling     s02           1.377726           1.414068           1.413625
2  Crawling     s03           1.332684           1.328587           1.324469
3   Running     s01          -0.706631          -0.674960          -0.713493
4   Running     s02          -0.678594          -0.688869          -0.700753
5   Running     s03          -0.740177          -0.728651          -0.733076
6   Walking     s01          -0.692671          -0.716556          -0.696635
7   Walking     s02          -0.696807          -0.702233          -0.682212
8   Walking     s03          -0.722315          -0.727506          -0.712461
Notes
For more information, see the scikit-learn documentation on univariate feature selection (f_classif and SelectKBest).
- Variance Threshold
Feature selector that removes all low-variance features. This is an unsupervised feature selection algorithm: it looks only at the input features (X), not at the labels or outputs (y). It selects the features whose variance exceeds the given threshold. It must be applied prior to standardization.
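The behaviour maps directly onto scikit-learn's VarianceThreshold; a sketch under the assumption that this is the underlying mechanism:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def variance_threshold(X, threshold=0.01):
    """Illustrative sketch: drop features whose variance does not exceed `threshold`."""
    selector = VarianceThreshold(threshold=threshold).fit(X)
    return X.loc[:, selector.get_support()]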
- Parameters
threshold – float; default = 0.01. Features whose variance does not exceed this threshold are eliminated.
- Returns
DataFrame which includes selected features and the passthrough columns.
- Return type
DataFrame
Examples
>>> client.pipeline.reset()
>>> df = client.datasets.load_activity_raw_toy()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject',
 u'gen_0001_accelx_0', u'gen_0001_accelx_1', u'gen_0001_accelx_2',
 u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4',
 u'gen_0003_accelz_0', u'gen_0003_accelz_1', u'gen_0003_accelz_2',
 u'gen_0003_accelz_3', u'gen_0003_accelz_4']
>>> client.pipeline.add_feature_selector([{'name':'Variance Threshold', 'params':{"threshold": 4513492.05}}])
>>> results, stats = client.pipeline.execute()
>>> print results
Out:
[u'Class', u'Subject', u'gen_0002_accely_0', u'gen_0002_accely_1',
 u'gen_0002_accely_2', u'gen_0002_accely_3', u'gen_0002_accely_4']