Welcome back to our series on creating machine-learning (ML) models for embedded systems using Edge Impulse. In the first four blogs, we walked through the process of creating training pipelines tailored to different data types, including images, audio, and sensor data such as accelerometer readings. Thus far, we have demonstrated how to collect raw training data from sensors and how to create the training pipeline. Running the raw training data through the pipeline generates our first attempt at a custom neural network model that can now be used in real-world conditions with real-time data, a process often referred to as inferencing.
But for various reasons, our initial model might have issues that negatively affect its performance. As good engineering practice dictates, testing, validation, and tuning are crucial steps in developing a robust and efficient machine-learning model. It should be no surprise that Edge Impulse provides all the tools needed to do just that. In the next few blogs, we will explore the process and tools for analyzing an ML model's performance and tuning it to achieve peak results.
First, let’s examine some of the issues that can arise when training an ML model. In a nutshell, these issues arise because a finite amount of data is used to train models, and it would be prohibitively expensive, if not outright impossible, to present every possible input scenario to our training environment. Thus, a fielded model will almost certainly encounter inputs during inferencing that confuse it (remember, the outputs are probabilistic). The two broad categories of issues are overfitting and underfitting:
Figure 1: Overfitting happens when a model performs exceedingly well on the training data but fails to generalize to new, unseen data. (Source: Green Shoe Garage)
Figure 2: Underfitting occurs when a model is too simplistic to capture the underlying patterns in the training data. (Source: Green Shoe Garage)
Collectively, these form the basis of a concept referred to as the bias-variance trade-off. A high-bias model (underfitting) oversimplifies the problem, while a high-variance model (overfitting) overcomplicates it. The goal is to strike a balance by finding a model that generalizes well without being too simplistic or too complex.
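To make the trade-off concrete, here is a minimal, hypothetical sketch in plain NumPy (not Edge Impulse code) that fits noisy data with polynomials of increasing degree. The data, degrees, and noise level are all invented for illustration; the point is that the low-degree fit underfits while the very-high-degree fit memorizes the training points and generalizes poorly.

```python
# Illustrative sketch (not Edge Impulse code): fit noisy sine data with
# polynomials of different degrees to show underfitting vs. overfitting.
import numpy as np

rng = np.random.default_rng(seed=0)

# Small, noisy training set and a separate validation set
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_val = np.sort(rng.uniform(0, 1, 20))
y_val = np.sin(2 * np.pi * x_val) + rng.normal(0, 0.2, x_val.size)

for degree in (1, 4, 12):  # too simple, reasonable, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, validation MSE {val_mse:.3f}")

# Typical pattern: degree 1 has high error on both sets (high bias, underfitting);
# degree 12 drives training error toward zero but does worse on validation data
# (high variance, overfitting); degree 4 strikes the balance.
```

The same intuition carries over to neural networks, where model size and training time play the role that polynomial degree plays here.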
Edge Impulse provides multiple tools to help examine the performance of your ML model throughout the development cycle. For example, when developing a classifier, the model's performance is evaluated in several ways. First, there are scores for accuracy and loss. Accuracy is the percentage of the test data that is classified correctly. For example, if your test set consists of 50 images of cats and 50 images of dogs, and the model classifies 97 of the 100 images correctly, then the accuracy is 97 percent. Loss is a related, though slightly more nebulous, measure tied to the confidence of the predictions. If two models both get 95 percent of their predictions correct, the one with the lower loss (meaning it is more confident in its guesses) is considered the better model. The lower the loss, the more confident the model is for the inputs given. Both accuracy and loss can vary depending on the test dataset presented to the model, so a large, representative dataset helps provide a reliable analysis of both.
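To show how these two numbers are typically computed, here is a short, hypothetical NumPy sketch. The class names, probabilities, and labels are made up, and Edge Impulse calculates these metrics for you; this is only meant to illustrate the arithmetic behind accuracy and a common loss measure (categorical cross-entropy).

```python
# Minimal sketch of how accuracy and cross-entropy loss are computed for a
# classifier; the probabilities and labels below are invented for illustration.
import numpy as np

classes = ["cat", "dog"]

# Model outputs: one row of class probabilities per test sample
predicted_probs = np.array([
    [0.92, 0.08],   # confidently "cat"
    [0.55, 0.45],   # barely "cat"
    [0.10, 0.90],   # confidently "dog"
    [0.30, 0.70],   # "dog"
])
true_labels = np.array([0, 0, 1, 0])  # indices into `classes`; last sample is misclassified

# Accuracy: fraction of samples whose highest-probability class matches the label
predictions = predicted_probs.argmax(axis=1)
accuracy = (predictions == true_labels).mean()

# Loss (categorical cross-entropy): penalizes low confidence in the correct class,
# so two models with the same accuracy can still have different losses
loss = -np.mean(np.log(predicted_probs[np.arange(len(true_labels)), true_labels]))

print(f"accuracy = {accuracy:.2f}, loss = {loss:.3f}")
```

Note how the second sample counts toward accuracy even though the model was barely confident in it; that lukewarm confidence is exactly what the loss term captures.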
Next is a confusion matrix, which shows, for each of the classes established during training, how often test samples are classified as each of the possible classes (Figure 3). Ideally, all signals would be classified correctly 100 percent of the time and incorrectly 0 percent of the time, but that can be difficult to achieve when several possible classifications differ only in small, nuanced ways. The confusion matrix view also reports a figure referred to as the F1 score, a performance metric commonly used in machine learning and classification tasks. It is a measure of a model's accuracy that takes into account both precision and recall; specifically, it is the harmonic mean of the two. Precision is the proportion of true positive predictions (correctly identified positive samples) out of all positive predictions made by the model. Recall is the proportion of true positive predictions out of all actual positive samples. An F1 score of 1.00 is considered ideal.
Figure 3: Edge Impulse offers multiple analysis tools, such as a confusion matrix, to help determine your model's performance. (Source: Green Shoe Garage)
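For readers who want to see the arithmetic behind these metrics, the following sketch builds a confusion matrix and computes per-class precision, recall, and F1 for an invented two-class example. Again, the labels and predictions are hypothetical and this is not Edge Impulse code.

```python
# Sketch of the metrics behind a confusion matrix; the labels and predictions
# are invented for a two-class ("cat" vs. "dog") example, not from a real model.
import numpy as np

classes = ["cat", "dog"]
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # actual classes
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])  # model's predictions

# Confusion matrix: rows = actual class, columns = predicted class
num_classes = len(classes)
confusion = np.zeros((num_classes, num_classes), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    confusion[actual, predicted] += 1
print(confusion)

# Per-class precision, recall, and F1 (the harmonic mean of precision and recall)
for c, name in enumerate(classes):
    true_pos = confusion[c, c]
    precision = true_pos / confusion[:, c].sum()   # of everything predicted as this class
    recall = true_pos / confusion[c, :].sum()      # of everything actually in this class
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")
```

Reading the matrix row by row tells you which classes the model tends to mix up, which is often the first clue about where more or better training data is needed.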
We will wrap things up here for now. In the next blog, we will finish exploring the tools Edge Impulse provides to assess the quality of your self-generated ML model. We will also investigate ways to mitigate the effects of overfitting and underfitting.
Michael Parks, P.E. is the co-founder of Green Shoe Garage, a custom electronics design studio and embedded security research firm located in Western Maryland. He produces the Gears of Resistance Podcast to help raise public awareness of technical and scientific matters. Michael is also a licensed Professional Engineer in the state of Maryland and holds a Master’s degree in systems engineering from Johns Hopkins University.