Machine Learning with Python: NLP and Text Recognition

In this article, I apply a series of natural language processing techniques to a dataset containing reviews about businesses. I then train a Logistic Regression model to predict whether a review is “positive” or “negative”.

The field of natural language processing offers a series of tools for extracting, labeling, and predicting information from raw text data. These techniques are mainly used for sentiment recognition, text tagging (for example, automating the process of sorting complaints from clients), chatbots, and voice assistants.

The dataset

A condensed version of the Yelp dataset will be used. This version contains 1000 observations, originally in JSON format and then converted to .csv.

The review dataset being used:
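A minimal sketch of loading the data with pandas (the file name yelp.csv is an assumption, not taken from the original code):

```python
import pandas as pd

# Load the condensed Yelp reviews dataset (file name is an assumption)
df = pd.read_csv("yelp.csv")
print(df.shape)  # expected: (1000, 9)
print(df.head())
```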

A glimpse of the dataset

Made up of 9 features (‘business_id’, ‘cool’, ‘date’, ‘funny’, ‘review_id’, ‘stars’, ‘text’, ‘useful’, ‘user_id’), this dataset contains a collection of reviews written by Yelp users; for each review, the user gave a score from 1 to 5 stars. To build an efficient model that predicts whether a review is “positive” or “negative”, we start from a model that takes the text variable as the predictor and the stars variable as the target.

An observation from the ‘text’ variable.

Data preprocessing and exploratory analysis

Once the dataset is reduced to 2 columns, it is possible to conduct a small exploratory analysis. It is important to know the distribution of the target variable (stars received): this reveals whether the dataset is imbalanced between positive and negative reviews. Such an imbalance influences the results of the model, giving it a propensity to predict the outcomes that are more frequent in the training set.
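A quick way to check this distribution is to count and plot the star ratings; a sketch, continuing from the loading snippet above:

```python
import matplotlib.pyplot as plt

# Keep only the predictor (text) and the target (stars)
df = df[["text", "stars"]]

# Count how many reviews received each star rating and plot the result
df["stars"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Stars")
plt.ylabel("Number of reviews")
plt.show()
```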

As we can see from the plot, positive reviews (5 stars) make up by far the largest share, which creates an imbalance, or bias.

In order to obtain useful results, it is necessary to reduce the complexity of the problem. An efficient way to do so is to divide the reviews into positive and negative, and to use this division as the dependent variable.
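One possible division (the exact thresholds are an assumption; the original article does not state them) maps 1–3 stars to “negative” and 4–5 stars to “positive”:

```python
# Binary sentiment label: 1 = positive (4-5 stars), 0 = negative (1-3 stars).
# The threshold is an assumption made for this sketch.
df["sentiment"] = (df["stars"] >= 4).astype(int)
```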

Before proceeding with any other visualization, it is mandatory to apply some preprocessing procedures very common in NLP:

  • Remove any non-useful characters (slashes, punctuation, HTML tags, question marks, etc.)
  • Convert the whole text to lowercase characters

Two short functions, shown below, take care of the preprocessing described above. From there it is possible to determine which single words and which combinations of words (bigrams and trigrams) are most common:
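Here is a sketch of those two functions, together with a simple n-gram count (the regular expressions and the use of CountVectorizer for counting are assumptions, not the original code):

```python
import re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

def remove_noise(text):
    """Strip HTML tags, slashes, punctuation, and other non-useful characters."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # anything that is not a letter
    return re.sub(r"\s+", " ", text).strip()

def to_lowercase(text):
    """Convert the whole text to lowercase characters."""
    return text.lower()

df["text"] = df["text"].apply(remove_noise).apply(to_lowercase)

# Count the most common bigrams and trigrams across all reviews
ngram_counter = CountVectorizer(ngram_range=(2, 3))
counts = ngram_counter.fit_transform(df["text"]).sum(axis=0).A1
ngrams = ngram_counter.get_feature_names_out()
print(Counter(dict(zip(ngrams, counts))).most_common(10))
```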

The most common combinations of 2 and 3 words (bigrams and trigrams).

After a small indexing adjustment we can create a bubble chart displaying the most common words in positive and negative reviews:
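A minimal sketch of such a chart for the negative reviews (the split by sentiment and the bubble scaling are my own choices):

```python
# Most common single words in negative reviews, with stopwords removed
word_counter = CountVectorizer(stop_words="english", max_features=20)
neg_texts = df[df["sentiment"] == 0]["text"]
neg_counts = word_counter.fit_transform(neg_texts).sum(axis=0).A1
words = word_counter.get_feature_names_out()

# Bubble size proportional to the word frequency
plt.scatter(range(len(words)), neg_counts, s=neg_counts * 5, alpha=0.5)
plt.xticks(range(len(words)), words, rotation=90)
plt.ylabel("Occurrences")
plt.show()
```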

Never trust the ‘burgers’.

And for the positive reviews:

The indie atmosphere is always appreciated.

After this short but interesting insight, we can proceed to the next phase: model creation.

The model

A very simple, fast-to-train, and efficient algorithm is Logistic Regression. The scikit-learn library provides a tool that helps to build this model, but before doing so, and before the classical split between train and test set, it is necessary to perform a few steps: stemming, removal of stopwords, and vectorization:

  • Stemming allows us to reduce every word to its root. This procedure avoids ‘dispersion’ in the text: for example, conjugated forms of the verb ‘to be’, such as ‘am’, ‘are’, and ‘is’, are all converted into the root form ‘be’. (Strictly speaking, mapping ‘am’ to ‘be’ is lemmatization rather than stemming, but the goal is the same: collapsing inflected forms into one.)
  • The removal of stopwords consists of filtering out very frequent words such as ‘the’, ‘that’, and ‘of’, which carry little meaning on their own and would otherwise decrease the model’s accuracy.
  • Vectorization transforms every observation (review) in the dataset into a numerical representation. This phase is mandatory: every machine learning algorithm we would like to train needs numerical data as input, and vectorization provides a way to translate text into numbers.

Let’s take a look at a review before and after applying stemming and stopwords removal:
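A sketch of this step using NLTK (the choice of the Porter stemmer and of NLTK’s English stopword list are assumptions):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def stem_and_filter(text):
    """Remove stopwords, then reduce every remaining word to its stem."""
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop_words)

sample = df["text"].iloc[0]
print(sample)                   # before
print(stem_and_filter(sample))  # after

df["text"] = df["text"].apply(stem_and_filter)
```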

After applying stemming, the result looks much more ‘raw’, but at the same time it is still as understandable as the original version.

Now it is possible to proceed with the text vectorization. The sklearn.feature_extraction.text.CountVectorizer class offers a tool that is very simple to use. It needs to be initialized with the max_features argument, which establishes the maximum length of the dictionary that will be created to represent the text. For example, after choosing 1500 as the number of features, the algorithm will create a dictionary based on the 1500 most frequent words, so each review in the dataset will be represented by a list of 1500 elements. Each element corresponds to one word in the dictionary, and the number assigned to it matches the number of times that word occurs in the observation (review).

Let’s check this example:

For each observation (Doc 1, Doc 2, … Doc n), a number represents the occurrences of a given feature (word) in that observation (review).
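A tiny, made-up illustration of what the vectorizer produces:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the food was good", "the service was bad, the food was ok"]
cv = CountVectorizer()
matrix = cv.fit_transform(docs)

print(cv.get_feature_names_out())
# ['bad' 'food' 'good' 'ok' 'service' 'the' 'was']
print(matrix.toarray())
# [[0 1 1 0 0 1 1]
#  [1 1 0 1 1 2 2]]
```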

To implement this technique in Python, only two lines of code are necessary:
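A sketch of those two lines (the variable names are my own; CountVectorizer was already imported above):

```python
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(df["text"]).toarray()
```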

It is now possible to split the dataset into a training set and test set:
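For example, with scikit-learn’s train_test_split (the 80/20 ratio and the random seed are assumptions):

```python
from sklearn.model_selection import train_test_split

y = df["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```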

Then to train the Logistic Regression model with 10 folds:
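A sketch of the training step: 10-fold cross-validation on the training set, then a classification report on the held-out test set (the exact setup is an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation on the training set
scores = cross_val_score(model, X_train, y_train, cv=10)
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# Fit on the full training set and evaluate on the test set
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```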

The report shows that the accuracy is 88.5%, and the bias toward positive reviews is quite evident: the accuracy in predicting positive reviews is much higher than the accuracy in predicting negative ones. In other words, given a block of text, we can predict whether it is ‘positive’ or ‘negative’ with an accuracy of 88.5%.

This difference is more evident in the confusion matrix:
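A minimal sketch for computing it:

```python
from sklearn.metrics import confusion_matrix

# Rows are the true classes, columns the predicted ones
print(confusion_matrix(y_test, model.predict(X_test)))
```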

Wrap Up

This model is not perfect, but it does its job. As mentioned before, the bias toward positive reviews is quite large. There are some possible ways to improve this model, for example:

  • Increase the number of observations (the golden rule)
  • Use a different algorithm, such as Naïve Bayes, decision trees, or a neural architecture (RNN, CNN, or HAN)
  • Use a different stemming technique
  • Use a different stopwords collection

After manually tuning some parameters, like the class_weight argument, it is possible to slightly improve the score. This practice is certainly not the best, but knowing that the model is biased toward positive reviews, changing the class weights by decreasing the weight of the positive class and increasing the weight of the negative class can lead to a (slightly) higher accuracy.
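For instance (the specific weights below are an assumption, chosen only to illustrate the idea):

```python
# Penalize mistakes on the negative class (0) more than on the positive class (1)
weighted_model = LogisticRegression(max_iter=1000, class_weight={0: 1.5, 1: 0.75})
weighted_model.fit(X_train, y_train)
print(classification_report(y_test, weighted_model.predict(X_test)))
```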


Original. Reposted with permission.


About Roberto Sannazzaro

Student and freelance AI / Big Data Developer with a passion for full stack.
