Mr. Abhinav Bhatnagar is a big data scientist with 6+ years of experience, interested in mining insights and building Machine Learning pipelines on big data. He has hands-on professional experience in Machine Learning, Deep Learning, Big Data technologies (Hive, Hadoop, Spark, MongoDB, Kafka), and programming languages. He works at Truecaller.
Q: How does one become a data scientist after completing a Bachelor of Engineering?
The role of the data scientist is still evolving, and Indian organizations are still trying to understand what else they can achieve through data science. Data scientists are not just machine learning personnel. I think this role is an overlap of five different roles: Data Analyst, Data Engineer, Machine Learning Expert, Statistician, and Hacker (basically an out-of-the-box thinker). Finding a fresher with all these qualities has been a challenge for us. Now, universities have introduced courses that build basic data science expertise. The right courses and projects can put you on the path to data science: Data Mining, Statistical Machine Learning, Applied Statistics, Operational Research, Big Data, and the list goes on.
Q: How did you choose the Data Science field?
Honestly, it was more that the role chose me. Even before pursuing a master’s in Computer Science with a specialization in Social Network Analysis at Arizona State University, I was working on Big Data and regression problems. Data Science was just a buzzword for the people who were already doing it. From time to time, industries reinvent these buzzwords (like Big Data, Data Science, Deep Learning, Cyber Security) to grow themselves.
Q: Please share your knowledge and experience of Data Science.
I have seen a variety of applications of data science. In fintech, I have done industry-level budget predictions. In social networks, I have dealt with virality prediction. In cybersecurity, I have worked on vulnerability prioritization and risk assessment. In the communication industry, I am working on monetization problems, audience creation, and spam detection. I have seen Data Science follow different trends in the USA and India. In India, organizations want you to deliver data science solutions end to end. In the USA, however, different teams help data scientists get their solutions to production. I know a handful of companies in India are now following the same practice. This enables Data Scientists to focus completely on the problem statement instead of worrying about how the solution will be deployed to production. Having said that, the journey so far has been really amazing for me, as I have seen a variety of industry problems.
Q: Which programming language are you using for Data Science? Why are you using that particular language?
Currently, I am using Python and Scala. My affinity for Python goes back a long way; I have been using it for the last 7 years. However, you would be amazed to hear that I started Data Science with Matlab. Python makes it easy to kick-start experiments, with all the necessary libraries at hand. I like it for its readability and low complexity. You name the problem and the libraries are out there for you.
Want to work with images — numpy, opencv, scikit
Want to work in text — nltk, numpy, scikit
Want to work in audio — librosa
Want to solve machine learning problem — pandas, scikit
Want to see the data clearly — matplotlib, Seaborn, scikit
Want to use deep learning — Tensorflow, pytorch
Want to do scientific computing — scipy
Want to integrate web applications — Django
Q: Which software do you prefer for data science?
I simply use Python, Spark, and Jupyter Notebook.
Q: If I want to become a data scientist, what do I have to do?
- Andrew Ng ML Course
- Big Data Course
- Statistics Course
- Analytics Vidhya
- Read case studies
I believe all of this will take about one full year. Dedicate a year to your career and you too can be a data scientist.
Q: What is Deep Learning? And how does it contrast with other Machine Learning algorithms?
Deep Learning is a part of AI/ML that tackles a given problem statement through neural networks. Neural networks have been around for quite a long time. As the hardware industry has grown exponentially, neural network computations have become a lot faster. But that’s not all: data has also grown at a much faster pace. Training a neural network was a problem in the past because data was scarce and compute was expensive. Both problems have now been solved, which is why you hear the term deep learning so often. Traditional Machine Learning is still a data scientist’s first choice unless the problem statement fits neural networks.
Q: In text mining, what steps are needed?
Text mining is a pretty broad term. Text data is generally messy and noisy. Most of a Data Scientist’s time goes into cleaning the data and getting it into the format you need. After cleaning, it is a step-by-step process that includes stemming, lemmatization, and standardization.
Then, based on your problem, you can take a statistical approach (TF, TF-IDF, Hashing TF), a parsing approach such as N-grams, or a word-embedding (shallow neural network) approach. There are other parsing approaches as well, such as POS (part-of-speech) tagging, phrase mining (TopMine), topic modeling, and NER (Named Entity Recognition).
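To make the statistical and N-gram approaches concrete, here is a minimal sketch in plain Python. It is only an illustration, not the pipeline used in practice: the tokenizer, the sample sentences, and the function names (tokenize, ngrams, tf_idf) are assumptions made for this example, and the IDF formula log(N / df) is just one common variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens (a crude cleaning step)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute TF-IDF weights for each tokenized document.

    TF is the term count divided by the document length.
    IDF is log(N / df), one common variant among several.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample messages, invented for illustration.
corpus = [
    "Free prize, call now!",
    "Call me when you are free",
    "Meeting moved to Monday",
]
tokenized = [tokenize(text) for text in corpus]
bigrams = ngrams(tokenized[0], 2)
scores = tf_idf(tokenized)
```

In this sketch a word that appears in only one message, such as "prize", gets a higher weight than a word shared across messages, such as "free". In a real project a library implementation would usually replace these hand-written functions.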
Q: What is the best way to learn Machine Learning?
I would recommend Andrew Ng’s course if you want to learn ML and kick-start projects. However, if you really want to understand what is happening under the hood, then you should take courses on statistical machine learning. That covers the theory. The rest, tweaking and adapting the model to your business problem (Applied Machine Learning), comes with experience and practice.
Q: How do you handle the corrupted data in a dataset?
Corrupted data is basically data you failed to capture during the data collection process. There is a set process for removing such outliers and treating the nulls. Outliers can either be trimmed or taken out of the sample (if you have a large enough sample). Nulls can be imputed, for example with mean or mode values.
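The two treatments mentioned above can be sketched with Python’s standard library. This is a minimal illustration under stated assumptions: the 1.5 × IQR rule for flagging outliers is a common convention chosen for the example and is not named in the interview, and the function names and sample values are hypothetical.

```python
import statistics


def impute_mean(values):
    """Replace None entries in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def impute_mode(values):
    """Replace None entries in a categorical column with the most frequent value."""
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]


def drop_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not from the source)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical sample columns, invented for illustration.
durations = [30, 32, 31, 29, 900, 33, 30]
trimmed = drop_outliers_iqr(durations)

filled_numeric = impute_mean([30, None, 32, None, 31])
filled_labels = impute_mode(["spam", None, "ham", "spam"])
```

Outliers are removed before any mean is computed here because a single extreme value such as 900 would otherwise pull the imputed value far from the typical range.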
Q: Can I use Spark or any other big data tool for Machine Learning?
Yes, Spark has turned out to be amazing for ML, though a couple of algorithms, such as tree-based algorithms, lack scalability. The industry is trying different ways to run Keras models on top of Spark. Amazon’s SageMaker and Microsoft’s Azure have definitely made an impact as well.
Q: How can we use our Machine Learning skills to generate revenue?
As a Data Scientist, you are making an impact on companies’ revenue everywhere. Whatever domain you are in, every problem you solve either adds to the value proposition or brings revenue to the product. Currently, I am applying my ML skills in Truecaller’s Ads and Monetization. Here, every model we build directly impacts the product’s revenue.
Q: In which field have you delivered the most Deep Learning products? Medical, or any other?
My forte in DL has been text. The key reason for choosing text is that LSTMs have completely turned around text-related problems, whether language translation, NER, text classification, text summarization, or topic modelling.
The major fields I have contributed to are cybersecurity and communications.