Practical Applications of Supervised Learning: Governing Sensitive Data for Regulatory Compliance

With all the media attention focused on contemporary Artificial Intelligence applications of computer vision, conversational AI, and knowledge graphs, it’s easy to lose sight of the basic data management mainstays cognitive computing has practically perfected.

However, as mounting regulations for Personally Identifiable Information (PII), data privacy, and consumer rights indicate, the greater challenge is likely the ability to deploy machine learning and AI’s knowledge base for pragmatic data governance use cases—such as ensuring regulatory compliance.

“In the last few years, with regulations from privacy and the changing data landscape, there are costs associated with getting data,” acknowledged Privacera CEO Balaji Ganesan. “Aspects of how you want to manage data is completely different, and most of it’s driven by the newer regulations: GDPR, and the California [Consumer] Privacy Act. There’s forms of those regulations coming up in every country across the world.”

Consequently, to address these newfound demands, organizations are turning to data governance solutions for cataloging, profiling, discovering, and classifying the sensitive data those regulations cover. Supervised learning and AI’s knowledge base (in the form of human rules) play an integral role in supporting these tools so organizations can adhere to regulations.

These machine learning models are crucial for classifying and tagging sensitive data so organizations can restrict access to it in downstream activities and fulfill governance policies. Specifically, they underpin a form of tagging “to label our data,” Ganesan revealed. “Say this file has a social security number or this table has some personal information. We can use those tags to build controls and operationalize them.”

Supervised Learning

Once organizations discern sensitive data via data profiling and data discovery (which is largely automatable), they can catalog that data to implement the controls Ganesan mentioned. Mapping the findings of the profiling and data discovery phases to a catalog for sensitive data requires “rules and machine learning models: supervised learning,” Ganesan indicated. These techniques specifically apply to the classification or tagging process. For example, if data profiling results suggest there’s a social security number, these forms of AI “attach a confidence score to that,” Ganesan mentioned.

That information influences the classification step, where data can be automatically classified as sensitive and cataloged as such. Competitive options in this space not only enable users to implement their own rules, but are also equipped with pre-built models trained on example data “seen from other customers and our own training set of data so we can enrich those models,” Ganesan commented.
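The rule-based half of this step can be illustrated with a minimal sketch. It is not Privacera’s implementation: the pattern, the function names (ssn_confidence, tag_columns), the sample table, and the 0.8 threshold are all assumptions chosen for illustration. It covers only a hand-written rule rather than a trained model, scoring each column by the share of sampled values that look like a social security number and attaching that score to a tag.

```python
import re

# Hypothetical rule: US Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def ssn_confidence(values):
    """Return the fraction of non-empty sampled values matching the rule."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    matches = sum(1 for v in non_empty if SSN_PATTERN.match(v))
    return matches / len(non_empty)


def tag_columns(table, threshold=0.8):
    """Map each column name to a (tag, confidence) pair.

    `table` is a dict of column name -> list of sampled string values.
    The 0.8 threshold is an arbitrary illustrative choice.
    """
    tags = {}
    for name, values in table.items():
        score = ssn_confidence(values)
        tag = "SSN" if score >= threshold else "UNCLASSIFIED"
        tags[name] = (tag, score)
    return tags


# Made-up sample data standing in for the output of a profiling pass.
sample = {
    "customer_id": ["1001", "1002", "1003", "1004"],
    "tax_id": ["123-45-6789", "987-65-4321", "", "078-05-1120"],
}
print(tag_columns(sample))
```

In a fuller system, a supervised model trained on labeled examples would replace or supplement the regular expression, but the output would have the same shape: a tag plus a confidence score that the catalog can act on.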

Human in the Loop

The chief merit of employing supervised learning models in this pivotal data governance use case isn’t just automating classification based on the confidence score, which rules can also provide in some instances. As is true for most machine learning deployments, the cardinal value proposition is the ability of these models to learn over time and improve their results and their confidence. Some of that learning stems from the longstanding concept of human in the loop, which is critical when “there’s not enough context information available to accurately predict [a classification],” Ganesan remarked.

In that case, data stewards or subject matter experts are consulted to provide detailed answers that inform models’ ability to learn. Subsequently, that feedback is “built into the learning model so the more iteration happens, the more the model learns from it,” Ganesan noted. “The next time we see that data, we have inputs already from the previous time so we can more accurately predict it.” But if models generate high enough confidence scores, they can automate the classification of this information to prepare the requisite controls. The tenet of human in the loop enables models to generate such confidence over time, which is immensely useful for complying with data privacy regulations at enterprise scale.
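The routing logic described here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a vendor’s design: the ReviewLoop class, its method names, and the 0.9 threshold are hypothetical, and steward feedback is stored in a lookup table rather than used to retrain a model, which is what a production system would more likely do.

```python
AUTO_THRESHOLD = 0.9  # illustrative cut-off, not a vendor default


class ReviewLoop:
    """Route low-confidence predictions to a steward and reuse the answers."""

    def __init__(self, threshold=AUTO_THRESHOLD):
        self.threshold = threshold
        self.steward_labels = {}  # column name -> label confirmed by a steward
        self.review_queue = []    # (column, predicted_label, confidence)

    def classify(self, column, predicted_label, confidence):
        # Feedback from an earlier review takes precedence over the model.
        if column in self.steward_labels:
            return self.steward_labels[column], "steward"
        # High-confidence predictions are accepted automatically.
        if confidence >= self.threshold:
            return predicted_label, "auto"
        # Everything else waits for a data steward or subject matter expert.
        self.review_queue.append((column, predicted_label, confidence))
        return None, "needs_review"

    def record_feedback(self, column, label):
        # Store the steward's answer and clear the column from the queue.
        self.steward_labels[column] = label
        self.review_queue = [
            item for item in self.review_queue if item[0] != column
        ]


loop = ReviewLoop()
print(loop.classify("tax_id", "SSN", 0.97))    # accepted automatically
print(loop.classify("ref_code", "SSN", 0.55))  # queued for review
loop.record_feedback("ref_code", "NOT_SENSITIVE")
print(loop.classify("ref_code", "SSN", 0.55))  # steward's answer is reused
```

The third call shows the behavior Ganesan describes: the next time the same data appears, the earlier human input is already available, so no second review is needed.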

Access Control Implementation

Data profiling, data discovery, and supervised learning-assisted classification and tagging are imperative for cataloging sensitive data. Once those data are cataloged according to enterprise definitions for data governance and regulatory compliance, organizations can act on defined policies for restricting access to these data. There are numerous forms of access control, including obfuscation methods in which data are either tokenized or encrypted so the privacy of sensitive data is preserved. These approaches are effective because enterprise users with the requisite privileges can still use the obfuscated data for analytics, for example. Business Intelligence reports to determine the most valuable customers don’t necessarily require users to know those customers’ PII.

Best of all, when there are cases in which users with the requisite credentials need access to such data, the aforementioned obfuscation can be undone to provide these crucial details. “Say you have a legal team that needs access to the actual information, or they need that for fraud detection or Anti-Money Laundering or other purposes,” Ganesan posited. In that case, organizations can “dynamically reverse the data for certain users,” Ganesan said. “Usually a small subset of users will need access to the actual data. Most of the users will be working with the anonymized data.”
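One way to picture reversible obfuscation is a token vault with a role check. The sketch below is an assumption-laden illustration rather than a description of any product: the TokenVault class, the role names, and the "tok_" prefix are hypothetical, the vault lives in memory, and a real deployment would need durable, secured storage and proper authentication.

```python
import secrets


class TokenVault:
    """Replace sensitive values with random tokens; reverse for some roles."""

    def __init__(self, privileged_roles):
        self.privileged_roles = set(privileged_roles)
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the same token for a repeated value so that joins and
        # counts on the tokenized column still work for analytics.
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def read(self, token, role):
        # Only privileged roles see the original value; everyone else
        # keeps working with the token.
        if role in self.privileged_roles:
            return self._token_to_value[token]
        return token


vault = TokenVault(privileged_roles={"legal", "fraud_detection"})
token = vault.tokenize("123-45-6789")
print(vault.read(token, role="analyst"))  # prints the token
print(vault.read(token, role="legal"))    # prints the original value
```

Most users stay on the analyst path and never see the underlying value, while the small privileged subset gets it back on demand.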


Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance, and analytics.

Opinions expressed by contributors are their own.
