Three less-technical tips for a more successful journey into machine learning
As people and companies venture into machine learning (ML), it is common for some to expect to dive right into building models and generating useful output. And while some parts of ML feel like this technical wizardry with magical predictions, there are other aspects that are less technical and arguably far more important. Taking sufficient time to define the right question, properly preprocess data, and consider the impact of using your model can greatly improve the success of your ML project.
My hope is that, whether you are a company, manager, or engineer looking to leverage ML, these tips will save you time up front and help you prioritize your future efforts.
It starts with a question
The first step, asking the right question, can often be the most difficult part of your machine learning adventure. The purpose of any ML project is to answer a question: “Who wrote this?”, “What is this?”, “What will the price be?”, “What patterns are there?” Having a concrete question to focus on allows you to clearly define your objective function and helps you identify the data you need to actually do the work. Without asking the right question, your team could sink countless hours into collecting, refining, and modeling data, only to end up with a useless product. Even worse, these sorts of missteps could diminish the perceived value of ML in your organization, leading to less support in the future.
To avoid this, before moving forward on any data-related project, pause and clearly state the question you seek to answer. Then define the objective function (e.g., maximize accuracy) you plan to use to measure your progress. This may seem like a simple step, and it is, but it is easily forgotten. And while your first question may not be the right one, at least you will have made progress towards a clear goal from which you can iterate.
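To make this concrete, here is a minimal sketch of writing the objective down as code before any modeling happens. It assumes the question is “Who wrote this?” and that accuracy is the measure you chose; the author labels and the `accuracy` helper are hypothetical illustrations, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for the question "Who wrote this?"
true_authors = ["austen", "dickens", "austen", "dickens"]
predicted_authors = ["austen", "austen", "austen", "dickens"]

# Three of the four predictions match, so the score is 0.75.
score = accuracy(true_authors, predicted_authors)
print(f"accuracy: {score:.2f}")
```

Having the measure in place first means every later experiment is judged against the same yardstick, even if you later decide a different question or metric fits better.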
Prepare to prepare your data
Collecting and preprocessing data is likely ~90% of the effort, with the remaining 10% going to testing, tuning, and operationalizing your model. For shallow-learning methods, this preparation includes exploring your data, engineering features, and normalizing data into a useful format. Even with deep-learning methods, data should be explored and vectorized to ensure reasonable performance. While many of these operations are easily done with a few lines of code from libraries such as Scikit-learn or Keras, exploring your data and validating that it is representative of the data your model will actually encounter is time-consuming.
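As a rough illustration of what normalizing and checking representativeness can look like, the sketch below standardizes a single feature using statistics learned from the training data only, then flags new values that fall far outside what the training data covered. It is kept to the Python standard library so it runs anywhere; in practice Scikit-learn's `StandardScaler` does the scaling part. The price values, the helper names, and the cut-off of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from training data only."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against a constant feature, which would divide by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def transform(values, mu, sigma):
    """Rescale values to zero mean and unit variance (z-scores)."""
    return [(v - mu) / sigma for v in values]


# Hypothetical training prices and prices seen later in production.
train_prices = [10.0, 20.0, 30.0, 40.0]
live_prices = [25.0, 400.0]

mu, sigma = fit_scaler(train_prices)
scaled_train = transform(train_prices, mu, sigma)
scaled_live = transform(live_prices, mu, sigma)

# Values far from anything in the training data deserve a closer look.
suspicious = [v for v, z in zip(live_prices, scaled_live) if abs(z) > 3]
print("values outside the training range:", suspicious)
```

The scaling itself is a few lines; deciding whether a flagged value is an error, a rare case, or a sign that your training data is not representative is where the time goes.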
Feature engineering requires a rather in-depth understanding of the business context, and having the right features can really improve your outcomes, so simply throwing a solo data scientist at the problem may not be fruitful. Those getting started with ML in their organization should build in plenty of time for this data exploration and preparation before expecting results. Be upfront about how much effort is likely to go into this step, and do not rush into just dumping out accuracy scores.
In a production setting, it’s also likely that you are not collecting all of the data points you might have hoped for, so go in expecting to iterate on data collection as well. Overall, plan for a longer phase of iterating on data preprocessing and manage expectations accordingly.
What impact will your model have?
In a general sense, the output of a machine learning model, and of a deep learning model in particular, is based on mathematical adjustments the model has made to itself in response to the training data.
These algorithms may identify some pattern in the training data that we humans have not recognized, but the model is not consciously thinking or making a decision. It is simply adjusting weights or values in order to maximize or minimize its objective function, a very singular focus. Just try to use your model on a completely different set of data and you’ll realize how specific your model’s “knowledge” really is.
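A toy sketch can show how narrow that focus is. The “model” below is deliberately trivial: it picks a single cut-off that maximizes accuracy on its training data and nothing more. The spam scenario, the message lengths, and the function names are hypothetical, chosen only to illustrate the point.

```python
def accuracy(threshold, xs, ys):
    """Share of examples where 'x >= threshold' agrees with the label."""
    return sum((x >= threshold) == y for x, y in zip(xs, ys)) / len(xs)


def fit_threshold(xs, ys):
    """Pick the cut-off that maximizes training accuracy (the objective)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = accuracy(t, xs, ys)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Hypothetical training data: message length -> is it spam?
train_x = [5, 8, 12, 40, 55, 60]
train_y = [False, False, False, True, True, True]

# Hypothetical data from a different source, where long messages are normal.
other_x = [45, 50, 70, 200, 220, 250]
other_y = [False, False, False, True, True, True]

threshold = fit_threshold(train_x, train_y)
train_acc = accuracy(threshold, train_x, train_y)
other_acc = accuracy(threshold, other_x, other_y)
print(f"training data: {train_acc:.2f}, different data: {other_acc:.2f}")
```

The cut-off that is perfect on the training data labels every message in the second set as spam, which is no better than a coin flip there. Nothing in the objective told the model that the meaning of “long” could change.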
I believe this is important to point out because it is often easy to take the output (i.e., the prediction) of a model and run with it, without considering the bias that may have leaked into it. We should take a moment to consider why our output is what it is, whether it will generalize well to live data, and whether there could be any unintended consequences from using it. As the model is not “thinking”, it will not adapt to your moral or ethical considerations unless its objective function and training data are aligned to do so.
In summary, the tools and techniques for machine learning are rapidly advancing, but there are a number of ancillary considerations that must be addressed in tandem. Focusing on the right goal, properly processing data, and questioning your output are all actions you should take when conducting any machine learning project. Just as we’re seeing exponential advancement in the technical capacity of machine learning, we should push to progress machine learning’s supporting activities at an even greater rate.
MBA Candidate @ Cornell Tech | Johnson Graduate School of Management