MBA Candidate @ Cornell Tech | Johnson Graduate School of Management
Some practical examples, tips, and thoughts on supervised ML
Earlier this year, through my MBA program at Cornell Tech, I took a great intro course on Machine Learning with a fantastic professor, Lutz Finger. Lutz's course inspired me to dig even deeper into ML and AI, so I recently started a hands-on Machine Learning course on Udacity, which I felt would build on my background well. So far, I've completed the sections on supervised learning and wanted to share my beginner's take on what I've learned and reflect on the potential implications of this technology.
You can find my Python files from the lessons/projects here. The relevant folders are choose_your_own, decision_tree, naive_bayes, and svm.
Supervised Learning overview
“Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.” (Stuart J. Russell, Peter Norvig (2010) Artificial Intelligence: A Modern Approach)
I find this definition rather intuitive; the algorithm learns from “supervising” historical actions/data points to produce an output. Within supervised machine learning there are dozens of different approaches, but my recent studies focused on Naive Bayes, Support Vector Machines (SVM), Decision Trees, and Random Forests, then further how to work with these algorithms in Python/scikit-learn.
I won’t regurgitate the details of each approach here, but if you’d like an overview, I suggest taking the course or browsing Google for details. I do, however, want to point out a few key learnings and how I see those impacting my perspective moving forward.
As with any data-driven method, the data quality and structure have a huge impact on the usefulness of the supervised model's output. This wasn't much of a surprise, but what was surprising was the number of ways the incoming data could be modified to improve the model or its performance. While I didn't go too deep into feature selection/engineering, simply playing with the amount of data and the ratio of testing to training data had drastic impacts on accuracy, performance, and potential overfitting.
Scikit-learn (sklearn) is scarily easy to use, at least from my beginner’s standpoint. For each of the algorithms, excluding the imports, it only took 4 lines of code to train, predict, and calculate the accuracy of a model. This obviously also ignores all of the data prep that happens before and the tuning which happens after your initial run, but with this ease, running many different models on the same data set is quick. This gives you the opportunity to try many different models on the same data, helping you find the right approach for your problem. Running many different models on the same data became even easier when I moved my work from PyCharm to a Jupyter Notebook.
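To illustrate, here's a minimal sketch of that four-line pattern. Since I can't link the course data directly, I'm using sklearn's bundled digits dataset as a stand-in, and Naive Bayes as the example classifier:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: sklearn's bundled handwritten-digits set
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The four lines: create, train, predict, score
clf = GaussianNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print(acc)
```

Swapping in a different model is often just a matter of changing the first of those four lines, which is exactly what makes trying many models on the same data so quick.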
Going through this exercise of creating and running many models on the same data set made the ability to easily tune models with a wide variety of parameters important. Luckily, the combination of Sklearn and Jupyter made this super easy to play with. I was also not working with massive data sets, so the iterations were rather quick. It was during this step that I began to see the artistic side of running these sorts of models. Getting just the right balance of accuracy, speed, and generalizability requires tweaking all sorts of model and data aspects. Each of the models I looked at had different weaknesses in terms of how prone it was to overfitting a data set, but each also has unique attributes to guard against that. For example, with SVMs you can adjust the C and gamma parameters to balance your classification boundary directly, improving accuracy and generalization.
Analyzing text recognition data
I wanted to test out what I had learned on a data set found in the wild, so I downloaded the letter recognition dataset from UC Irvine. The dataset comes as a text file with 16 normalized numerical features that represent the image of a scanned letter, along with the corresponding letter labels. The data set is rather straightforward, and classifying these data points into letters seemed like a great use case for it.
Code from this analysis can be found here
I started by reading the data into a Pandas dataframe, investigating a few of the data points, and then separating them into features and labels as NumPy arrays. Aside from it being super easy to pull arrays out of the dataframe, I found another really cool Sklearn module called preprocessing, which allowed me to scale the values of the features in one line. I found later that, since my data set was rather small (~20k rows), the preprocessing only ended up saving me a few seconds during training and predicting, but it seems like a useful step when working with much larger data sets. Finally, I used Sklearn's train_test_split to create my training and testing data sets, leveraging the default 75%-25% split.
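Those load → scale → split steps can be sketched roughly like this. Here I substitute a small synthetic dataframe that mimics the UCI file's layout (letter label in column 0, 16 numeric features after it); with the real download you'd swap the synthetic frame for a pd.read_csv call:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the UCI letter file: label in column 0,
# then 16 integer features (the real file would be read with pd.read_csv)
rng = np.random.default_rng(0)
df = pd.DataFrame(np.column_stack([
    rng.choice(list("ABC"), size=100),
    rng.integers(0, 16, size=(100, 16)),
]))

labels = df[0].to_numpy()                           # first column is the letter
features = df.drop(columns=0).to_numpy(dtype=float)  # remaining 16 features

# Scale every feature to zero mean / unit variance in one line
features = preprocessing.scale(features)

# Default split is 75% train / 25% test
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, random_state=42)
print(X_train.shape, X_test.shape)
```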
My next step was to consider the types of models I wanted to try given my problem/data. I knew I needed something that was good at classification, could separate data along a fair number of dimensions, didn't need to be super fast, and had parameters to control overfitting. To me, SVM and Random Forest fit the bill, so I went ahead with testing those.
Support Vector Machine
My first pass at an SVM resulted in 94.6% accuracy and took only 4.5 seconds total to train and predict. I thought this was pretty good given the minimal effort, but moved on to tuning a few of the parameters to see if I could bump the accuracy up. Experimenting with the C parameter, I was able to get the accuracy over 97%. Bumping the C value up to 1000 seemed to give me the biggest boost in accuracy, while the kernel didn't seem to have much of a positive impact (something I need to investigate more). Another peculiar thing with the kernels was that some of them, such as the sigmoid kernel, reduced accuracy to as little as 43.44%. I'm not well versed in how to apply the "kernel trick," so that's probably a bit more interesting to me than to someone who knows the ins and outs of kernels.
Training time: 2.0 s | Prediction time: 1.484 s | Accuracy: 0.9718
With other data sets, I found the SVM to be very slow, but with this smaller data set and without time as a constraint, the SVM turned out to be a great choice. With more time, I'd like to do some more validation to ensure that my model isn't overfit, and see if I can improve accuracy by adjusting gamma.
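The kind of C sweep I describe above looks roughly like this sketch (run here on sklearn's bundled digits data rather than the letter set, so the exact accuracies won't match the numbers I reported):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Sweep C while holding the kernel fixed; gamma is the other knob to try
scores = {}
for C in (1.0, 1000.0):
    clf = SVC(C=C, kernel="rbf", gamma="scale")
    clf.fit(X_train, y_train)
    scores[C] = clf.score(X_test, y_test)
    print(f"C={C}: accuracy {scores[C]:.4f}")
```

A larger C penalizes misclassified training points more heavily, so it trades a smoother decision boundary for a tighter fit, which is why pushing it too far risks overfitting.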
Random Forest
The other model I wanted to try was a Random Forest. I had used decision trees in the past to classify data into groups, so a robust Random Forest seemed like a great step up for this challenge. The default RF classifier returned an accuracy of 93.36% but did it in only 0.197 seconds, significantly faster than the SVM. I first tried setting min_samples_split to 25; however, this only reduced my accuracy. Unsure why, I moved on to the n_estimators value, which I first updated to 200. This boosted my accuracy to 96.4% but increased the total train/predict time to 3.5 seconds, still faster than the SVM. Last, I tried bumping n_estimators to 1000, which improved accuracy by 0.2% but added over 11 seconds. To improve the RF, I would have liked to spend time on adjusting or weighting features, as well as investigating how sample splits could improve the model.
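Here's a sketch of that n_estimators experiment, again substituting the bundled digits data for the letter set, so the timings and accuracies are illustrative only:

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# More trees usually buy a little accuracy at a linear cost in time
scores = {}
for n in (100, 1000):
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    t0 = time.time()
    clf.fit(X_train, y_train)
    scores[n] = clf.score(X_test, y_test)
    print(f"n_estimators={n}: {time.time() - t0:.2f}s, accuracy {scores[n]:.4f}")
```

This is the same diminishing-returns pattern I saw on the letter data: the jump from the default to a few hundred trees matters far more than the jump from hundreds to a thousand.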
Training time: 2.998 s | Prediction time: 0.275 s | Accuracy: 0.9692
Both of these models seemed to perform really well by my standards, but each run provoked different questions. These unexpected outcomes are what make ML such an interesting topic and fuel so many people's curiosity in the subject!
If we can, should we?
Being on a tech-focused university campus, I hear "we use machine learning to do x" many times per day. Generally, I think this is a good thing, but we need to think a little more critically before applying these tools. I mentioned earlier that Sklearn was "scarily easy" to use, and to a degree, I do think it's a little bit scary that "predicting" an outcome is as easy as it is. People latch on to the idea of using ML and its great power in some cases, but they forget to consider the implications of the decisions being made or the validity of the answers. Just because we can do something doesn't mean we should; we need to pause and think it through first.
Thanks for reading!