
Bench Talk for Design Engineers | The Official Blog of Mouser Electronics


Data Criteria for ML POC Success

By Becks Simpson

Moving Your ML Proof of Concept to Production Part 2: Getting the Data in Order

(Source: FAMILY STOCK - stock.adobe.com)

No one can start to build a machine learning (ML) proof of concept (POC) without a discussion about the data needed to make the magic happen. While much of the debate centers on the amount of data needed (and, to a lesser extent, what types), data quality is also critical for success. So far, this blog series has covered establishing business objectives and then translating them to ML metrics. Now, I’ll detail what that means for building a relevant, POC-specific dataset. Information from the previous stage is crucial for making good data decisions, since it helps determine the inputs the model needs in order to learn, as well as the outputs required for the model to be useful to the business. From there, the last piece of the puzzle is to establish the required data quantity and, if needed, construct data acquisition pipelines.

Data Input Selection and Outcome Label Criteria

The first step in preparing a dataset for an ML POC project is to define the required input data points, the information or features they include, and the outcomes or labels tied to them, which are what the model must predict. Linking the desired outputs back to the inputs typically involves labeling or annotating the input data in some fashion and is also an important consideration when starting to build a dataset. Concretely defining and selecting data inputs and outputs based on the use case may seem like an obvious choice, but doing this with an eye toward what is actually available or acquirable is equally helpful. This is especially pertinent for internally sourced data with no open-source equivalents, a likely scenario for bespoke use cases or those that rely heavily on proprietary data. For example, if the desired prediction for the project is “increase or decrease in satisfaction” after a customer support interaction, the level of satisfaction needs to be collected both before and after the event, which often is not the case. Pausing the project to establish data acquisition pipelines for dataset construction is feasible; however, if the chosen inputs or outputs cannot be collected effectively, even an acquisition phase will not suffice.
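As a concrete illustration, here is a minimal availability check, sketched under the assumption of a pandas DataFrame of customer-support interactions; the column names are illustrative, not from any real schema.

```python
import pandas as pd

# Hypothetical inputs and outcome label for the satisfaction use case.
REQUIRED_INPUTS = ["ticket_text", "channel", "satisfaction_before"]
OUTCOME_LABEL = "satisfaction_after"

def check_availability(df: pd.DataFrame, min_coverage: float = 0.5) -> None:
    """Fail fast if chosen inputs/labels were never collected or are too sparse."""
    wanted = REQUIRED_INPUTS + [OUTCOME_LABEL]
    never_collected = [c for c in wanted if c not in df.columns]
    if never_collected:
        raise ValueError(f"Columns never collected: {never_collected}")
    # Columns that exist but are filled in too rarely to be usable.
    coverage = df[wanted].notna().mean()
    sparse = coverage[coverage < min_coverage]
    if not sparse.empty:
        print("Too sparsely collected to rely on:")
        print(sparse)
```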

Ascertaining Data Quality

Whether data is readily available or an acquisition phase is needed, establishing good data quality through exploration and transformation is the next important step. This will ensure that the data produces robust results that can be trusted and reproduced. Typical dimensions of data quality include completeness, consistency, accuracy, and validity. The first considers whether any features of the input data are missing and to what extent. If the use case involves supervised learning, completeness also likely includes whether enough labels exist across all desired predictions.
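A quick completeness audit along these lines might look like the following sketch, again assuming a pandas DataFrame; the minimum-examples threshold is an arbitrary assumption.

```python
import pandas as pd

def completeness_report(df: pd.DataFrame, label_col: str,
                        min_per_class: int = 100) -> None:
    """Report missing-value rates per feature and label coverage per class."""
    print("Missing fraction per column:")
    print(df.isna().mean().sort_values(ascending=False))
    # For supervised learning: are there enough labels across all predictions?
    counts = df[label_col].value_counts(dropna=False)
    print("\nLabeled examples per class:")
    print(counts)
    thin = counts[counts < min_per_class]
    if not thin.empty:
        print(f"\nClasses with fewer than {min_per_class} examples:")
        print(thin)
```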

Consistency addresses questions like whether information is encoded the same way across all data points or varies in some way. For example, inconsistencies could appear between time periods if the team producing the data or the procedure for capturing it changed, which may mean that some pieces of information are no longer collected or were not acquired before. Inconsistencies could also be introduced by interobserver bias—in other words, people recording or annotating things differently. One should also consider outliers and other noise affecting predictive accuracy when assessing consistency.
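Two simple consistency probes, sketched under the assumption of a DataFrame with a time-period column plus categorical and numeric features (all names hypothetical): one checks whether categorical encodings drift between periods, the other flags gross outliers.

```python
import pandas as pd

def encoding_drift(df: pd.DataFrame, cat_col: str, period_col: str) -> None:
    """Report category values that some time periods never record."""
    seen = df.groupby(period_col)[cat_col].apply(lambda s: set(s.dropna()))
    all_values = set().union(*seen)
    for period, values in seen.items():
        missing = all_values - values
        if missing:
            print(f"{period}: never records {sorted(missing)}")

def flag_outliers(s: pd.Series, z: float = 4.0) -> pd.Series:
    """Flag points more than z standard deviations from the mean."""
    return (s - s.mean()).abs() > z * s.std()
```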

Accuracy and validity go hand in hand since invalid data is often inaccurate too. For example, recording someone’s height as 6m instead of 6ft is both invalid—since 6m is not reasonable—and incorrect. Other inaccuracies are difficult to trace, however, like misclassifying a tumor type in a medical image. The type may be valid but incorrect for that data point, so care should be taken to verify the data as much as possible before using them. All of these data quality checks will inform the types of transformations needed to ensure that good data is collected and to ameliorate quality issues in existing data.
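In the same spirit as the height example, a minimal validity check can declare a plausible range per field and surface the rows that fall outside it; the ranges below are illustrative assumptions.

```python
import pandas as pd

VALID_RANGES = {"height_m": (0.5, 2.5), "age_years": (0, 120)}  # assumed bounds

def validity_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows violating any declared range (missing values are ignored)."""
    bad = pd.Series(False, index=df.index)
    for col, (lo, hi) in VALID_RANGES.items():
        if col in df.columns:
            bad |= df[col].notna() & ~df[col].between(lo, hi)
    return df[bad]
```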

Establishing Data Quantity Needed

A question routinely asked in ML projects is: how much is enough? It relates not only to how much input data is needed but also to how much data labeling is needed for sufficient distribution across the range of outputs to be predicted. The answer will depend on the intersection of what methods will be used during experimentation and how simple the input data is. Increasing the complexity of either means increasing data needs. On the data complexity axis, a model that needs to learn from whole images instead of extracted portions or heuristics will need more examples to predict effectively. Equally, on the experimentation side, off-the-shelf models or pre-trained foundation models require fewer data inputs because they have already been trained to perform tasks on an enormous dataset. This is also the case with modeling methods that involve feature extraction coupled with statistical model use; these approaches reduce the complexity of the data needed by removing some of the variability that makes learning difficult. The most data-intensive approach is training a deep learning model from scratch, though this is less common now, given all that’s available in the modern ML ecosystem.
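One empirical way to see where a given method sits on this spectrum (an addition to the methods named here, not one from the article) is a learning curve: train on growing subsets and watch validation performance. If it is still climbing at the largest size, more data will likely help. A sketch using scikit-learn on a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)  # stand-in for the POC's own features/labels
sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> mean CV accuracy {score:.3f}")
```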

While the data’s complexity and the experimentation direction will give a rough idea of the amount of data needed, other methods give a more useful, concrete figure. For example, the rule of ten states that the model needs ten times as many data points as it has parameters, whereas statistical approaches like power analysis calculate whether a given sample size is big enough for the result to be considered genuine. By using the output from one of these methods, together with the previous data quality stage, to drive any needed data acquisition, the dataset portion of the project is far more likely to be positioned for success.
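Both heuristics are easy to sketch in code; the parameter count and effect size below are illustrative assumptions, and the power analysis uses statsmodels.

```python
from statsmodels.stats.power import TTestIndPower

# Rule of ten: roughly ten data points per model parameter.
n_model_params = 5_000  # assumed parameter count for the candidate model
print(f"Rule of ten suggests ~{10 * n_model_params:,} examples")

# Power analysis: samples per group to detect a medium effect
# (Cohen's d = 0.5) at 80% power and alpha = 0.05.
n_required = TTestIndPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"Power analysis suggests ~{n_required:.0f} samples per group")
```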

Conclusion

Data can make or break an ML POC. Even with a vast quantity of data, if the quality is sufficiently lacking, the results will either fail to meet the necessary standards or, worse, produce inconsistent and untrustworthy predictions. Likewise, immaculately curated data will also fail if there is not enough of it to yield meaningful patterns to learn from. And before all of that, choosing inputs and outputs that are not readily available will halt the project before it begins, so having well-defined, standardized, and realistic data choices is critical.

So far, this series has covered two of the important first steps to success with an ML project—establishing goals and translating them to metrics, as well as preparing the dataset. Keep following along to learn the remaining equally critical stages involved in developing an ML POC in an agile but robust fashion and putting the resulting outputs into production. Future blogs will cover setting up experimentation tooling, developing resources and approaches for POC-building (including open-source models), creating guidelines and focus points when extending to a production-ready version, and considering what to anticipate and monitor post-deployment.





Becks Simpson is a Machine Learning Lead at AlleyCorp Nord, where developers, product designers, and ML specialists work alongside clients to bring their AI product dreams to life. In her spare time, she also works with Whale Seeker, another startup using AI to detect whales so that industry and these gentle giants can coexist profitably. She has worked across the spectrum of deep learning and machine learning, from investigating novel deep learning methods and applying research directly to real-world problems, to architecting pipelines and platforms for training and deploying AI models in the wild, to advising startups on their AI and data strategies.

