2.3.6 Getting Started with the ML SDK

This tutorial is a continuation of the “Getting Started” tutorial at the beginning of this document. Before using the MPLAB ML SDK, it is recommended that you thoroughly review that tutorial. This tutorial is particularly valuable in two scenarios:

  • You are experienced with machine learning and want to create your own Knowledge Pack with customized algorithms.

  • You already generated a Knowledge Pack using the MPLAB® Machine Learning Development Suite and want to find out how you can tweak the underlying features of the Knowledge Pack even further.

Prerequisites: You must have already created a project called Slide Demo using the Getting Started Guide.

The goal of this tutorial is to give insight into the more advanced features in building a custom algorithm for a Knowledge Pack.

There are three main steps to building a Knowledge Pack:

- Query your data
- Transform the data into a feature vector
- Build the model to fit on the sensor device

Loading Your Project

First load your project. In this example, it is called Slide Demo.

[15]:
%matplotlib inline

from mplabml import Client
client = Client()
[16]:
client.project = 'Slide Demo'

The next step is to initialize a pipeline space to work in. Pipelines store the preprocessing, feature extraction, and model building steps. When training a model, these steps are executed on the server. Once the model is trained, the pipeline is converted to firmware code that runs on your target embedded device. Add a pipeline to the project using the following code snippet:

client.pipeline = "Name of your pipeline"
[17]:
client.pipeline = "Slide Demo Pipeline"

Query Your Data

To select the data you want to use in your pipeline, you need to add a query step. Queries provide a way to select and filter the data used in your pipeline.

To create a query for all labeled sensor data in your project:

  1. It is recommended that you use the Prepare Data page in the Microchip ML Model Builder to create your query. Alternatively, you can use the create_query API by running the cell below.

    [ ]:
    
    client.create_query(name="My Query",
                     segmenter="My Training Session",
                     label_column="Label",
                     metadata_columns=["Subject"],
                     columns=["AccelerometerX", "AccelerometerY","AccelerometerZ","GyroscopeX","GyroscopeY","GyroscopeZ"]
                    )
    

Building a Pipeline

Throughout this notebook, you will add multiple steps to transform the data in a pipeline.

Note: No work is done on the data until you execute the pipeline, i.e., client.pipeline.execute().

The main steps of a pipeline include:

- Query
- Feature Engineering
- Model Generation

It is important that you add the steps in the right order. If you accidentally add them in the wrong order or want to restart, enter the command:

client.pipeline.reset()

Adding Your Query Step

Add the query step that you created above. Use the command below:

[ ]:
client.pipeline.reset()
client.pipeline.set_input_query('My Query')

Pipeline Progress

To see the current steps in your pipeline, enter the command:

[ ]:
client.pipeline.describe()

ML Core Functions

The MPLAB ML SDK provides a way to define a pipeline for feature vector generation and model building. The feature vector generation part of the pipeline includes over 100 core functions that can be split into a few different types:

  • Sensor transforms – These are applied to the data directly as it comes off the sensor; they can be smoothing functions, magnitude of sensor columns, etc.

  • Segmentation – The segmenter selects regions of interest from the streaming data. This can be an event if you are using an event detection segmenter or simply a sliding window, which buffers a segment of data and sends it to the next step.

  • Segment transforms – These operate on a segment of data, typically normalizing it in some way, such as demeaning, to prepare for feature vector generation.

  • Feature generators – Algorithms to extract relevant feature vectors from the data streams in preparation for model building.

  • Feature transforms – These normalize all of the features in the feature vector to values between 0 and 255.

  • Feature selectors – These functions remove features that do not help discriminate between different classes.

The MPLAB ML SDK allows you to string together a pipeline composed of these individual steps. The pipeline is sent to Microchip's servers to take advantage of cloud-based optimization in order to speed up the pipeline processing.

The segmentation and feature engineering part of the pipeline involves transforming data streams into feature vectors that are used to train a model (Knowledge Pack). This is where the more advanced machine learning part of the Microchip ML Model Builder comes in. It is okay if you do not understand everything right away; this guide walks through some examples of good features for the periodic event use case and gives you the tools to explore more features.

The features in the feature vector must be integers between 0 and 255. The feature vector can be any length, but, in practice, you will be limited by the space on the device.
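
To see how these pieces fit together, the outline below strings the step types into a single pipeline in the order they are typically applied. It is only a preview: the transforms and parameter values mirror the ones used later in this tutorial, so you do not need to run it at this point.

# Outline of a full pipeline (parameters mirror the steps built later in this tutorial)
client.pipeline.reset()
client.pipeline.set_input_query('My Query')                        # query step
client.pipeline.add_transform("Magnitude",                         # sensor transform
                              params={"input_columns": ['GyroscopeX', 'GyroscopeY', 'GyroscopeZ']})
client.pipeline.add_transform("Windowing",                         # segmentation
                              params={"window_size": 100, "delta": 100})
client.pipeline.add_feature_generator(["Mean", "Standard Deviation"],   # feature generators
                                      function_defaults={"columns": ["Magnitude_ST_0000"]})
client.pipeline.add_transform("Min Max Scale")                     # feature transform
# Use client.list_functions() to browse segment transforms, feature selectors,
# and the other available core functions.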

Adding a Basic Core Function

Next, this guide explains how to add one core function and how to work with other core functions.

A core function that is often useful for normalizing data is the magnitude sensor transform. Add a Magnitude sensor transform using the command below:

[ ]:
client.pipeline.add_transform("Magnitude", params={"input_columns": ['GyroscopeX','GyroscopeY', 'GyroscopeZ']})
client.pipeline.describe()

If you want to see specific documentation about any of the MPLAB® Machine Learning Development Suite commands, add a ? to the end of the command:

client.pipeline.add_transform?

Exploring Core Functions

The magnitude sensor transform is just one of over 100 core functions that the Microchip ML Model Builder provides. To see a list of the available core functions, use the following command:

[ ]:
client.list_functions()

To get the documentation for any of the functions, use the command:

[6]:
client.function_description('Magnitude')


    Computes the magnitude (square sum) of a signal across the input_columns
    streams.

    Args:
        input_columns (list[str]): sensor streams to use in computing the magnitude

    Returns:
        The input DataFrame with an additional column containing the per-sample
        magnitude of the desired input_columns


Inputs
----------
  input_data: DataFrame
  input_columns: list

Usage
----------
For DataFrame inputs, provide a string reference to the
DataFrame output of a previous step in the pipeline.
For Dataframe output, provide a string name that subsequent
operations can refer to.

To get the function parameters, use the following command:

[7]:
client.function_help('Magnitude')

client.pipeline.add_transform("Magnitude", params={"input_columns": <list>,
                                })

Function Snippets

The ML SDK includes function snippets that will auto-generate the function parameters for you. To use a function snippet, execute the following command:

client.snippets.Sensor_Transform.Magnitude()

To see snippets in action, execute the cell below:

[ ]:
client.snippets.Sensor_Transform.Magnitude()

Pipeline Execution

When executing the pipeline, there will always be two results returned. Take a look at the next cell. The first variable magnitude_data will be the actual data. The second variable stats will contain information about the pipeline execution on the server.

[ ]:
magnitude_data, stats = client.pipeline.execute()

Explore the returned magnitude_data using the command below.

[ ]:
magnitude_data.head()

Notice that an additional column Magnitude_ST_0000 is added to the dataframe. The subscripts refer to this being a sensor transform (ST) and being the first one added (0000). If you were to add another sensor transform, for example, taking the magnitude of the accelerometer data as well, you would get another column, Magnitude_ST_0001.
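
To see this naming behavior for yourself, the optional snippet below adds a second Magnitude sensor transform over the accelerometer columns from your query. Note that the rest of this tutorial assumes only the gyroscope magnitude, so if you try this, reset and rebuild the pipeline before continuing.

# Optional: magnitude of the accelerometer axes; produces a Magnitude_ST_0001 column
client.pipeline.add_transform("Magnitude", params={"input_columns": ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ']})
magnitude_data, stats = client.pipeline.execute()
magnitude_data.head()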

Performing Segmentation

The next step is to segment the data into windows on which recognition can be performed. For periodic events, use the Windowing transform. In the cell below, delta is the sliding window overlap; setting delta to the same value as the window size means that there is no overlap between segmented windows.

[ ]:
client.pipeline.add_transform("Windowing", params={"window_size": 300,
                                                "delta": 300,})
client.pipeline.describe(show_params=True)

Different window sizes can lead to better models. For this project, reduce the window_size and delta to 100. With data recorded at 100 Hz, a window size of 100 samples corresponds to 1 second of data. Change the values in the Windowing segmenter and re-execute. You will see the parameters for the windowing segmenter change, but a new step will not be added.

[ ]:
client.pipeline.add_transform("Windowing", params={"window_size": 100,
                                                "delta": 100,})
client.pipeline.describe(show_params=True)
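
If you want to confirm the windowing behaves as expected, you can execute the pipeline at this point and inspect the segmented output. This is an optional check; the exact column names, such as the segment identifier, may vary with your SDK version.

# Optional check: run the pipeline up to the windowing step and inspect the segments
segment_data, stats = client.pipeline.execute()
segment_data.head()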

Feature Vector Generation

At this point, you are ready to generate a feature vector from your segments. Feature generators are algorithms used to extract relevant feature vectors from the data streams in preparation for model building. They can range from simple features, such as the mean, to more complex features, such as the Fourier transform.

Feature generators are all added into a single step and run in parallel against the same input data. Add two feature generators now:

[ ]:
client.pipeline.add_feature_generator(["Mean", 'Standard Deviation'],
                                   function_defaults = {"columns":[u'Magnitude_ST_0000']})

Two feature generators from the subtype Statistical were added. The more features you have, the better your chance of building a successful model. Try adding a few more feature generators of the same subtype; call client.list_functions() to find more feature generators of that subtype.

[ ]:
client.pipeline.add_feature_generator(["Mean", 'Standard Deviation', 'Sum', '25th Percentile'],
                                   function_defaults = {"columns":[u'Magnitude_ST_0000']})

The classifiers are optimized for performance and memory usage to fit on resource-constrained devices. Because of this, each feature in the feature vector is scaled to a single byte, so you need to add the Min Max Scale transform to the pipeline. This function scales the features in the feature vector to have values between 0 and 255.

[ ]:
client.pipeline.add_transform('Min Max Scale')
[ ]:
feature_vectors, stats = client.pipeline.execute()
feature_vectors.head()

Naming Convention

The column header represents the name of the feature generator and can be used to identify which feature generator and which inputs were used. The prefix gen identifies that this was a feature generator. The number that follows identifies the index of the feature generator. After that comes the name of the input column, Magnitude_ST_0000, combined with the name of the feature generator, Mean, i.e., Magnitude_ST_0000Mean.
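
To check these names against your own output, list the columns of the feature_vectors DataFrame returned earlier; alongside the generated feature columns, the DataFrame typically also contains the label and metadata columns from your query.

# Inspect the generated feature names plus the label/metadata columns
list(feature_vectors.columns)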

Visualizing Feature Vectors

Next, take a look at the feature vectors that you generated. Plot the average of all feature vectors grouped by Activity. Ideally, you are looking for feature vectors that are separable in space. How do the ones you generated look?

[ ]:
client.pipeline.visualize_features(feature_vectors)

Training a Model

  • Train Validate Optimize (TVO): This step defines the model validation, the classifier and the training algorithm to build the model with. The model is first trained using the selected training algorithm, then loaded into the hardware simulator and tested using the specified validation method.

This pipeline uses the validation method “Stratified K-Fold Cross-Validation”. This is a standard validation method used to test the performance of a model by splitting the data into k folds, training on k-1 folds and testing against the excluded fold. Then, it switches which fold is tested on and repeats until all of the folds are used as a test set. The average of the metrics for each model provides a good estimate of how a model trained on the full data set will perform.
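
If stratified k-fold validation is new to you, the short, purely local sketch below illustrates the idea on the feature vectors you generated earlier. It is only a conceptual illustration, not how the server performs validation, and it assumes scikit-learn is installed and that feature_vectors still contains the Label column from your query, with generated features prefixed gen_.

# Conceptual illustration only: stratified 3-fold split of the local feature vectors
from sklearn.model_selection import StratifiedKFold

labels = feature_vectors['Label']                 # assumes the query's label column is present
features = feature_vectors.filter(regex='^gen_')  # assumes generated features use a 'gen_' prefix

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
for fold, (train_idx, test_idx) in enumerate(skf.split(features, labels)):
    # Each fold trains on roughly two thirds of the data and tests on the held-out third.
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(test_idx)} test rows")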

This pipeline uses the “Hierarchical Clustering with Neuron Optimization” training algorithm, which uses a clustering algorithm to optimize neuron placement in feature space. The training algorithm attempts to optimize the number of neurons and their locations to create the best model.

The classifier is the Pattern Matching Engine (PME), which has two classification modes, RBF and KNN, and two distance calculation modes, L1 and LSUP. See the documentation for further descriptions of the classifier.

[ ]:
client.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':3,})

client.pipeline.set_classifier('PME', params={"classification_mode":'RBF','distance_mode':'L1'})

client.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization',
                                    params = {'number_of_neurons':5})

client.pipeline.set_tvo({'validation_seed':2})

Execute the full pipeline now.

[ ]:
model_results, stats = client.pipeline.execute()

The model_results object returned after a TVO step contains a wealth of information about the models that were generated and their performance. For a simple view, use the summarize function to see the performance of your model.

[ ]:
model_results.summarize()

Grab the fold with the best performing model to compare with your features.

[ ]:
model = model_results.configurations[0].models[0]

The neurons are contained in model.neurons. Plot these over the feature_vector plot that you created earlier. This step is often useful for debugging.

[ ]:
import pandas as pd
client.pipeline.visualize_neuron_array(model, model_results.feature_vectors,
                                   pd.DataFrame(model.knowledgepack.feature_summary).Feature.values[-1],
                                   pd.DataFrame(model.knowledgepack.feature_summary).Feature.values[0])

Save the best model as a Knowledge Pack. Models that are not saved will be lost if you run the pipeline again.

[ ]:
model.knowledgepack.save('MyFirstModel_KP')
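
To confirm the model was saved, you may be able to list the Knowledge Packs stored for your account; the list_knowledgepacks() helper shown here is an assumption and may differ by SDK version, so skip this step if it is not available in your client.

# Assumed helper (may vary by SDK version): list saved Knowledge Packs
client.list_knowledgepacks()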

Downloading the Knowledge Pack

This completes the model training portion of the tutorial.

  1. It is recommended that you use the Download Model page within the MPLAB ML Model Builder to download the Knowledge Pack model firmware.

  2. Alternatively, see instructions for setting up the Knowledge Pack API in the API Methods section of the documentation.