Shortening the Data Science Cycle: Training Machine Learning Models in Production

For all of machine learning’s massive pattern recognition capabilities and the vaunted accuracy of its predictions, this cognitive computing technology has been consistently impeded by one immutable factor: waiting.

Traditionally, it has taken a substantial amount of time to build models. Data scientists work through lengthy data engineering processes before model training even begins, then cycle through a number of steps to ensure models have the right training data, the proper amount of it, the requisite labels, and more.

However, were it possible to invert this process so that advanced machine learning models could be trained in production settings—dynamically, on-the-fly, while delivering enterprise value—organizations wouldn’t have to wait to reap the utility of this technology.

It’s precisely this concern that has spurred growing interest in the confluence of batch processing and real-time capabilities, which lets data scientists train models while simultaneously deploying them.

According to Cloudian CTO Gary Ogasawara, such an approach is “clearly superior to this batch model where we have to wait.” Coupling real-time and batch processing with object storage delivers these advantages and more for data scientists and the business end users impacted by timely deployments of machine learning.


The Traditional Data Science Cycle

Ogasawara referred to the relatively newfound capability of training sophisticated AI models in production as “real-time machine learning.” This paradigm is distinguished from conventional methods predicated on time-consuming measures involving almost exclusively historic data.

“Machine learning of the past and current time is really like a batch process,” Ogasawara observed. “You take all this data; you build these huge training sets; you take that off, and go train your model for a day or 10 hours. You do all this data cleansing and model optimization. Then, you finally come up with something and use that for your decision-making.”

This process may take several months, during which time data sources and business requirements could realistically change—negating these efforts and forcing data scientists to start again from scratch. By combining batch operations with real-time data operations, data scientists can expedite these measures and train models during deployment. “What the data analysts and the data scientists really want to do is have a single model and have it adjust in real-time to inputs,” Ogasawara mentioned.
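The single, continuously adjusting model Ogasawara describes can be illustrated with a minimal pure-Python sketch: instead of retraining offline on a full batch, the model takes one gradient step per incoming event. The class and names here are illustrative, not taken from any particular system.

```python
import random

class OnlineLinearModel:
    """A linear model updated one observation at a time (online SGD),
    rather than retrained offline on a full training set."""

    def __init__(self, n_features, lr=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        # One gradient step on the squared error for this single event.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

random.seed(0)
model = OnlineLinearModel(n_features=1)
# Simulated event stream: y = 3x + 1 plus noise, arriving one record at a time.
for _ in range(2000):
    x = random.uniform(-1.0, 1.0)
    y = 3.0 * x + 1.0 + random.gauss(0, 0.1)
    model.update([x], y)

# The weight and bias drift toward the true values (3 and 1) as more
# events stream in -- no offline training step required.
print(model.w[0], model.b)
```

The same adjust-per-event pattern underlies production libraries that expose incremental training (for example, scikit-learn’s `partial_fit` estimators), though real deployments would add monitoring and guardrails around the updates.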

The Abbreviated Cycle: Real-Time and Batch Processing

Data scientists can implement the ideal Ogasawara articulated with an emergent architecture in which all data is directed through a real-time streaming engine; Kafka has become one of the most popular. Such an engine can process real-time sensor information or event streams while also writing batch data to a data lake augmented by object storage (which typically involves the S3 protocol).

Thus, models can be trained in production settings with both real-time and historic inputs. “You don’t want to have to do these offline model training steps,” Ogasawara reflected. “You just want to have a model and it gets better and better over time as it sees more data in real-time.”

The tandem of the aforesaid streaming capabilities and S3 supplies these boons. “You need to have a Kafka streaming engine on the front end to be able to ingest all this data, and it needs an object storage system to keep this vast amount of scalable data to do its training against,” Ogasawara specified.
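The dual path Ogasawara specifies can be sketched as follows: every ingested event updates an online model immediately, and is also appended to a batch buffer that gets flushed to object storage for later historical training. This is a simplified, self-contained stand-in; a real deployment would consume events from Kafka and write batches via the S3 API (e.g., a `put_object` call), and the trivial running-mean "model" is purely illustrative.

```python
import json

class StreamPipeline:
    """Sketch of the dual-path architecture: a real-time path that
    updates a model per event, and a batch path that accumulates raw
    events and flushes them to an object store for historical training.
    In-memory structures stand in for Kafka and S3 here."""

    def __init__(self, flush_every=100):
        self.flush_every = flush_every
        self.buffer = []          # raw events awaiting a batch write
        self.object_store = {}    # stand-in for an S3 bucket: key -> bytes
        self.count = 0
        self.running_mean = 0.0   # trivial "model": running mean of a metric

    def ingest(self, event):
        # Real-time path: adjust the model on the spot.
        self.count += 1
        self.running_mean += (event["value"] - self.running_mean) / self.count
        # Batch path: buffer the raw event for the data lake.
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_every:
            self._flush()

    def _flush(self):
        # In production this would be an S3 PUT of the serialized batch.
        key = f"events/batch-{len(self.object_store):05d}.json"
        self.object_store[key] = json.dumps(self.buffer).encode()
        self.buffer.clear()

pipe = StreamPipeline(flush_every=100)
for i in range(250):
    pipe.ingest({"sensor": "s1", "value": float(i)})

# 250 events -> 2 flushed batch objects, 50 events still buffered,
# and a model that has already seen every event in real time.
print(len(pipe.object_store), len(pipe.buffer), pipe.running_mean)
```

The design point is that neither path blocks the other: the model never waits for a batch job, while the object store quietly accumulates the full history for later retraining.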

Undeniable Business Value

The worth of these capabilities to the business is manifold. Models get created and deployed much more quickly, impacting everything from financial services use cases like fraud detection, to retail deployments of recommendations based on customer activity. Conceivably, the latter could involve both e-commerce and retail in traditional physical locations. Another compelling use case for this development is anomaly detection for applications of robotics in the manufacturing industry.

According to Ogasawara, robots are “everywhere. In manufacturing robots are just required. Now, really it’s just who can build the monitoring and control systems on top of that, that can help them operate better.” With the influx of low-latency data readily available for training or fine-tuning models for such data science deployments, it may be easy to overlook the worth of historic data populating various batch processes.

Nonetheless, the latter is of extreme importance when building and tweaking models, since “you want all that historical data because there might be trends that are only visible, say, on a monthly basis,” Ogasawara indicated. “If you’re only responsive to your immediate inputs, that’s not a very smart dog. You want something that has a memory, and your memory is your object storage system.”
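Ogasawara’s point about monthly trends can be made concrete with a small sketch: a purely reactive detector only sees the latest readings, while one backed by stored history can learn a per-month baseline and flag deviations from the seasonal pattern. The `history` dict below stands in for batch data held in the object storage system he calls the “memory”; the numbers are invented for illustration.

```python
from collections import defaultdict

# month -> past readings for that month (a stand-in for historical
# batch data retrieved from object storage)
history = defaultdict(list)

def record(month, value):
    history[month].append(value)

def is_anomaly(month, value, tolerance=2.0):
    """Flag a reading that deviates from that month's historical baseline.
    A detector with no memory could not make this distinction."""
    past = history[month]
    if not past:
        return False  # no baseline yet for this month
    baseline = sum(past) / len(past)
    return abs(value - baseline) > tolerance

# Two years of readings with a strong seasonal swing:
# readings run high in December (month 12) and low in June (month 6).
seasonal_levels = {6: 10.0, 12: 50.0}
for _year in range(2):
    for month, level in seasonal_levels.items():
        record(month, level)

print(is_anomaly(12, 49.5))  # normal for December
print(is_anomaly(6, 49.5))   # far above June's baseline
```

The same reading is normal in one month and anomalous in another, a distinction only visible to a system that remembers its history.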



Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance, and analytics.

Opinions expressed by contributors are their own.
