Every data-driven process cuts two ways. Whatever the application or use case, depending on data involves risks as well as potential gains. Reducing the former tends to increase the latter, as anyone intimately familiar with machine learning deployments of statistical Artificial Intelligence well knows.
Many people focus on the advantages of this technology’s advanced pattern recognition, anomaly detection, and prescriptive capabilities. These predictive models are regularly sought for everything from security analytics to edge processing in the Internet of Things, from real-time recommendations to low-latency decision-making for fraud detection.
Nonetheless, there’s a simple requisite for using supervised learning and unsupervised learning for these or any other purpose: organizations must have accurate, complete, recent, and de-duplicated data to train and build these models, or they simply won’t work.
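To make those criteria concrete, here is a minimal sketch of how completeness, recency, and duplication might be audited before data reaches a model. The field names, the one-year recency threshold, and the use of a normalized email as the duplicate key are all assumptions made for illustration, not anything prescribed by Boyd or Profisee. Accuracy is left out because it generally requires checking against a trusted reference source.

```python
from datetime import date, timedelta

# Hypothetical schema and threshold, chosen only for illustration.
REQUIRED_FIELDS = ("customer_id", "email", "updated_on")
MAX_AGE = timedelta(days=365)


def audit_records(records, today):
    """Count records that fail basic completeness, recency, and duplicate checks."""
    report = {"incomplete": 0, "stale": 0, "duplicate": 0}
    seen_keys = set()
    for rec in records:
        # Completeness: every required field must be present and non-empty.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            report["incomplete"] += 1
            continue
        # Recency: the record must have been updated within MAX_AGE.
        if today - rec["updated_on"] > MAX_AGE:
            report["stale"] += 1
        # De-duplication: a normalized email stands in for a real match rule.
        key = rec["email"].strip().lower()
        if key in seen_keys:
            report["duplicate"] += 1
        seen_keys.add(key)
    return report


sample = [
    {"customer_id": 1, "email": "ana@example.com", "updated_on": date(2024, 5, 1)},
    {"customer_id": 2, "email": "Ana@Example.com ", "updated_on": date(2024, 4, 1)},
    {"customer_id": 3, "email": "", "updated_on": date(2024, 5, 1)},
    {"customer_id": 4, "email": "bo@example.com", "updated_on": date(2020, 1, 1)},
]
report = audit_records(sample, today=date(2024, 6, 1))
```

In this sample, one record is incomplete, one is stale, and one duplicates an earlier email. A report like this could serve as a gate: data that fails too many checks is fixed before any training begins.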
“People really want to do this,” reflected Martin Boyd, Profisee VP of Marketing. “It’s hot and sexy and all that kind of good stuff. But without the right data foundation, all of it will not only not work, it might be misleading. It might cost you money.”
Securing the proper data foundation Boyd referenced requires ensuring the quality of data (which isn’t exactly the same as data quality) populating enterprise systems. Doing so efficiently is not the sole concern of data scientists and data engineers, but a much broader task involving longstanding, systemic solutions like Master Data Management.
Quality of Data
Quality data is essential for crafting machine learning models with the degree of accuracy required for enterprise investments. Data must meet the aforesaid quality standards to be trustworthy for any use, particularly one as risky and potentially lucrative as supplying inputs for statistical AI models. Boyd commented on the pivotal distinction between conventional notions of data quality and the precept of quality of data: “When you say data quality people immediately go to other type of solutions. But the quality of the data and the fact it’s well trusted requires more than just a data quality toolset; it requires Master Data Management.”
The reality is that accomplished MDM platforms have a comprehensive array of data quality capabilities for cleaning, transforming, de-duplicating, and readying data for machine learning models. They also have a plethora of capabilities for matching, merging, and managing records (for any domain) to deliver pristine data with which to populate sources for scaffolding or deploying models. Moreover, well-architected MDM provides the substratum all data management can depend on, with a bottom-up approach that’s difficult to beat. “Machine learning and AI are garbage in, garbage out on steroids,” Boyd said. “Everything in IT’s garbage in, garbage out, but they’re garbage in, garbage out on steroids.”
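As a rough illustration of what matching and merging mean in practice, the sketch below groups records that appear to describe the same entity and collapses each group into a single record. The match rule (same normalized name and postal code) and the survivorship rule (the newest non-empty value wins) are hypothetical simplifications; commercial MDM platforms use far richer, configurable logic, and nothing here represents a specific vendor’s implementation.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical match rule: same normalized name and postal code."""
    name = " ".join(record["name"].lower().split())
    return (name, record["postal_code"].replace(" ", ""))


def merge_group(group):
    """Hypothetical survivorship rule: the newest non-empty value wins per field."""
    golden = {}
    # ISO-8601 date strings sort chronologically, so later records overwrite earlier ones.
    for rec in sorted(group, key=lambda r: r["updated_on"]):
        for field, value in rec.items():
            if value not in (None, ""):
                golden[field] = value
    return golden


def build_golden_records(records):
    """Group records by match key, then merge each group into one record."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return [merge_group(group) for group in groups.values()]


source_records = [
    {"name": "Acme  Corp", "postal_code": "30301", "phone": "", "updated_on": "2023-01-10"},
    {"name": "acme corp", "postal_code": "30 301", "phone": "555-0100", "updated_on": "2022-06-01"},
    {"name": "Globex", "postal_code": "10001", "phone": "555-0199", "updated_on": "2023-03-05"},
]
golden = build_golden_records(source_records)
```

Here the two Acme rows collapse into one record that keeps the newer name and postal code but retains the phone number only the older row supplied, which is the kind of consolidated record a model would then train on.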
The necessity of implementing quality data to reduce the risk of employing machine learning models pertains to both the training and production stages. Again, there’s a compounding effect because the latter obviously relies on the former. However, organizations can utilize any of the pre-built libraries of machine learning models and algorithms available in the cloud or from analytics vendors and still incur difficulty in production if their quality of data is insufficient.
Boyd explained the dichotomy between these two aspects of the predictive model lifecycle: “There’s two parts. One is the training data. If you don’t have accurate, clean, good quality training data then you’ll build a faulty model, in which case it’ll give you faulty answers. And, if you feed faulty data even into a good model, you’ll get faulty answers.”
Master Quality of Data
It’s important to realize that MDM, data quality tools, or any other such solutions are far from a panacea in and of themselves. They’re no substitute for the data governance mainstays of metadata management, lifecycle management, data stewardship, and data modeling that form the core of this discipline, which is a substantial part of data management.
However, once organizations have devised enterprise-wide definitions and standards about how data should appear for relevant attributes, implementing quality data with credible platforms in this space is essential for reaping desirable machine learning outcomes, while considerably reducing the risk of this data-centric technology. Moreover, there’s no paucity of use cases delivering organization-wide value once the risk of deploying machine learning is minimized.
“If you want to do things like sentiment analysis, what are your customers thinking about, what have they bought in the past, what might they buy in the future, what custom offers might I put together for this customer or this group of customers based on their past behavior, that’s all stuff that people are trying to scale up to with AI and machine learning,” Boyd observed.
But they’ll never get there without de-risking their data to furnish the foundation for these and other cognitive computing approaches.
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance, and analytics.