Creating a ChatBot using the basic ML Algorithms — Part 1



Whenever someone asks “How does a ChatBot work?”, most people think of some complicated sequence-to-sequence learning model that actually understands the question and forms the answer. But what if I told you that you don’t need any knowledge of deep neural networks to create a ChatBot? You can build one using basic Machine Learning techniques such as Text Classification and Text Similarity.

Here are the steps we follow to solve any Machine Learning problem.


Understanding the business problem and translating it into a technical approach — Our business problem is to create a ChatBot: a question-answering bot that replies based on the question we ask it. Now let’s think about how to approach this technically. Suppose our training set contains several questions and their answers, so the ChatBot can match a new question against it. First, we categorize similar questions in our training set into a single class. Then, when we get a new question, we classify it into one of the classes in our training data and reply with the text that is the usual answer for questions of that class.

Our second approach would be to match our new question against all the questions in the training set and find the most similar one; its answer should then work as our answer too.

Collecting Data — This is the most tedious part of building your model: collecting data from various sources and accumulating it. But this is what improves the predictive power of your ChatBot. The better the data you collect, the better your ChatBot will respond.

Defining the Objective Function — In our business problem, the objective is to classify each question into a class.


Visualising your data — Once you have the data in your Jupyter Notebook, the first thing to do is understand what type of data you have. In our case we are dealing with text, so we intuitively want to know which words appear most often in our questions. For that we would:

  • First, load the data into your Notebook and start exploring it

  • Generate WordClouds

  • Plot word frequencies using a Counter
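The word-frequency step above can be sketched with the standard library alone. The sample questions below are hypothetical placeholders for your own dataset:

```python
from collections import Counter

# Hypothetical sample of training questions; in practice, load your own dataset
questions = [
    "what is your name",
    "what is your age",
    "how old are you",
    "what do you do",
]

# Tokenise by whitespace and count word occurrences across all questions
word_counts = Counter(word for q in questions for word in q.split())

# The most common words hint at which terms dominate the corpus
print(word_counts.most_common(3))
```

The same `Counter` can be fed into a bar plot, or into the `wordcloud` package if you prefer the WordCloud view.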


Preprocessing your data — Data is always noisy, so it is very important to clean it and retain only the information needed to build the model. The preprocessing/cleaning steps we are going to do today are:

  • Punctuation Removal — We remove punctuation because, in our data, it does not convey any meaning.

  • Stop Word Removal — We remove stopwords because these words usually support the main words and convey little information by themselves.

  • Negation Handling — We change negative words like “not”, “haven’t”, “didn’t”, etc. by clubbing them with the next word and prefixing “not_”. For example, “I don’t like you” becomes “I not_like you”. This is important because when we tokenise (split the words) into features, the words “don’t” and “like” would otherwise be treated separately, whereas now tokenising treats “not_like” as a single word.

  • POS-based Preprocessing — The Nouns, Adjectives, Adverbs, and Verbs in a sentence usually carry its key terms: the subject, the action, or the intensity of the action. So in this preprocessing step we keep only Nouns, Verbs, Adjectives, and Adverbs and remove words belonging to other parts of speech.
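The first three cleaning steps above can be sketched in plain Python. The stopword list here is a tiny illustrative set (in practice you would use `nltk.corpus.stopwords`), and the negation regex only covers a few common forms:

```python
import re
import string

# A small illustrative stopword list; in practice use nltk.corpus.stopwords
STOPWORDS = {"i", "a", "an", "the", "is", "are", "to"}

def handle_negation(text):
    # Club a negation word with the word that follows it:
    # "don't like" -> "not_like"
    return re.sub(r"\b(?:not|don't|didn't|haven't|isn't|won't)\s+(\w+)",
                  r"not_\1", text)

def preprocess(text):
    text = text.lower()
    text = handle_negation(text)
    # Remove punctuation, but keep "_" so clubbed negations survive
    text = text.translate(
        str.maketrans("", "", string.punctuation.replace("_", "")))
    # Drop stopwords
    words = [w for w in text.split() if w not in STOPWORDS]
    # A POS-based filter (keeping only nouns, verbs, adjectives, adverbs)
    # could be added here with nltk.pos_tag; omitted to keep the sketch light.
    return " ".join(words)

print(preprocess("I don't like the rain!"))  # -> "not_like rain"
```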


Prepare data for the model — Our data is in sentence format, where every sentence contains a different number of words. But the input to a model must have a fixed length, so we convert our sentences into Bag of Words vectors. You can read more about Bag of Words here.

There is one more word representation known as Tf-idf (Term Frequency–Inverse Document Frequency) which can give better predictive power than the Bag of Words representation, because it gives more weight to important words and less weight to common words. You can read more about Tf-idf here.
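The same toy questions run through `TfidfVectorizer` show the reweighting in action: a word unique to one question (like “name” below) ends up with a higher weight than a word shared across questions (like “what”):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy questions, as before
questions = ["what is your name", "what is your age", "how old are you"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(questions)

# Compare the weights of a rare word and a common word in the first question
row = X[0].toarray()[0]
vocab = tfidf.vocabulary_  # maps each word to its column index
print(row[vocab["name"]], row[vocab["what"]])
```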

Scikit-learn comes with a wonderful Pipeline feature. A pipeline chains your processing steps and models together: you just tell it which step occurs after which one in the series, and it automatically passes the output of each step as the input to the next, so you don’t need to do that manually.
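A minimal pipeline sketch, chaining Tf-idf vectorisation into a classifier. The four labelled questions are hypothetical toy data; your real training set would be far larger:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: questions labelled with an intent class
questions = ["what is your name", "who are you",
             "how old are you", "what is your age"]
classes = ["name", "name", "age", "age"]

# The pipeline chains vectorisation and classification into one estimator
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(questions, classes)  # fit() runs every step in order

print(model.predict(["tell me your age"]))
```

Calling `fit` or `predict` on the pipeline transparently runs the vectoriser first and the classifier second.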

Text Classification — Now that we have our input ready, we can start training our model. The classification algorithms we will use here are:

  • Logistic Regression
  • Multinomial Naïve Bayes Classifier
  • Decision Tree
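Each of the three classifiers above can be trained through the same pipeline; only the final estimator changes. Again, the labelled questions are hypothetical stand-ins for the real training set:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data
questions = ["what is your name", "who are you",
             "how old are you", "what is your age"]
classes = ["name", "name", "age", "age"]

classifiers = {
    "logistic_regression": LogisticRegression(),
    "multinomial_nb": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Train one pipeline per classifier, reusing the same Tf-idf front end
models = {}
for name, clf in classifiers.items():
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    model.fit(questions, classes)
    models[name] = model
    print(name, model.predict(["what is your age"]))
```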

Ensemble Model — Ensembling is a technique where you take the outputs from several models and combine them into one. So now that we have created our models, let’s ensemble them: the ChatBot’s answer will be the one predicted by the largest number of models.
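A simple majority-vote sketch over the three classifiers, on the same hypothetical toy data. Scikit-learn’s `VotingClassifier` offers the same idea as a ready-made estimator; a hand-rolled `Counter` vote keeps the mechanics visible:

```python
from collections import Counter
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data
questions = ["what is your name", "who are you",
             "how old are you", "what is your age"]
classes = ["name", "name", "age", "age"]

# One Tf-idf pipeline per classifier
models = [
    Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    for clf in (LogisticRegression(), MultinomialNB(),
                DecisionTreeClassifier(random_state=0))
]
for m in models:
    m.fit(questions, classes)

def ensemble_predict(question):
    # Majority vote: the class predicted by the most models wins
    votes = [m.predict([question])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

print(ensemble_predict("how old are you"))
```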

Text Similarity Model — So how do we find the question in the training set most similar to the input question? After applying Tf-idf, each question (the input question and every question in the training set) is transformed into a 1-D vector. So how do we compare the similarity of two vectors? The dot product between two Tf-idf vectors gives a measure of their similarity (since Tf-idf vectors are length-normalized, this is the cosine similarity). We take the dot product between the input question’s vector and every question vector in the training set; the highest dot product identifies the most similar question.
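A sketch of the dot-product search, assuming a small hypothetical training set. `linear_kernel` computes exactly the dot products described above, and since `TfidfVectorizer` L2-normalises its rows these are cosine similarities:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical training questions
train_questions = ["what is your name", "how old are you", "where do you live"]

tfidf = TfidfVectorizer()
train_vectors = tfidf.fit_transform(train_questions)

def most_similar(question):
    q_vec = tfidf.transform([question])
    # Tf-idf rows are L2-normalised, so the dot product is cosine similarity
    scores = linear_kernel(q_vec, train_vectors)[0]
    return train_questions[scores.argmax()]

print(most_similar("what is your age"))  # -> "what is your name"
```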

Answer Generation — Once you have figured out which class your question belongs to, the next step is to pick a suitable answer. First, list out the possible answers for every class. Then, when an input question is classified into a class, we randomly return one of that class’s answers.
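The answer pools below are made-up examples; in practice you would curate one list per class from your training data:

```python
import random

# Hypothetical answer pools, one list per predicted class
answers = {
    "name": ["I am ChatBot.", "You can call me ChatBot."],
    "age": ["I was born yesterday.", "Age is just a number for a bot."],
}

def generate_answer(predicted_class):
    # Pick one of the stock answers for this class at random,
    # so the bot does not repeat itself verbatim every time
    return random.choice(answers[predicted_class])

print(generate_answer("name"))
```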


You never get the best model on your first attempt. You need to keep tuning your model’s parameters and keep validating it with cross validation to reach the best model.
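Cross validation can be sketched with `cross_val_score`, again on hypothetical toy data (two folds, since the sample is tiny):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, three questions per class
questions = ["what is your name", "who are you", "tell me your name",
             "how old are you", "what is your age", "when were you born"]
classes = ["name", "name", "name", "age", "age", "age"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Each fold is held out once while the model trains on the rest
scores = cross_val_score(model, questions, classes, cv=2)
print(scores.mean())
```

Parameter tuning can be layered on top with `GridSearchCV` over the pipeline’s steps.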


So once you are done with your model, how do you know how well it is performing? For a classifier, predictive power is checked by creating a Confusion Matrix and then calculating the F-score of the model. A confusion matrix is nothing but a cross table between your predicted classes and your actual classes. It looks like a simple table, but several predictive scores can be calculated from it, which makes it a very powerful table: Accuracy, Precision, Recall, Specificity, F-score, etc., all of which can be used to check the predictive power of your model.
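A small sketch of both metrics on made-up predicted-vs-actual labels:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical predicted vs actual classes on a small test set
actual    = ["name", "age", "name", "age", "name"]
predicted = ["name", "age", "age", "age", "name"]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted, labels=["name", "age"])
print(cm)

# F-score for the "name" class: harmonic mean of precision and recall
print(f1_score(actual, predicted, pos_label="name"))
```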

Confusion Matrix on sample Test data

Now let’s see the ChatBot in action

But in the real world, ChatBots cannot always give the same answer for similar questions. What you have just seen is only the first step of what a ChatBot does: classifying your question to understand what type of answer the user is expecting. The next step a ChatBot takes is to understand the intent and entities in your question and use them to generate an answer. Do look out for Part 2 of this article, where I’ll discuss how to improve the current version of the ChatBot.

You can get the code on GitHub.

Original. Reposted with permission.


About Priya Sarkar

Contributor Data Scientist JPMorgan Chase & Co. | AI and ML Blogger
