Large Scale Jobs Recommendation Engine using Implicit Data in pySpark

WHY implicit data?
Because we, as lazy users, barely submit ratings (explicit data) for anything we do on any platform, be it Netflix, Amazon or LinkedIn.
WHAT we do is simply watch a movie, view a product or click a job post on LinkedIn (implicit data) and move on.

Offering relevant recommendations is essential in almost every consumer-facing business to convert prospects into customers. Personalization is the key to building a Recommendation Engine (RE); we see it on Netflix, Amazon and LinkedIn, and it even exists in the online black market.
The complete code for this project is on my GitHub profile.

What inspired me to write this blog on RE is the use of implicit data. We have seen a lot of case studies on RE using explicit data but very few examples with implicit features. Let’s first understand the fundamental difference between these two types of data:

Explicit data: preferences stated explicitly by the user, such as ratings, likes or dislikes.
Implicit data: preferences expressed indirectly and less obviously by the user, such as views, clicks or adding an item to the cart.

So, with implicit data, you cannot really know whether a customer who clicked an item actually liked it.

Let’s solve the problem of dealing with implicit data when building a RE.
We will divide this blog primarily into 3 parts:

  1. Collaborative filtering algorithm for building Rec Engine
  2. Understanding how Collaborative Filtering works with implicit data
  3. Let’s CODE: Practice above two using pySpark (MLlib)

1. Collaborative filtering (CF)

If you already comprehend the concept of collaborative filtering, you can skip this section.

The basic idea behind CF is very simple. Let’s say you have a user ‘X’ and you want to recommend jobs to ‘X’, CF will help you in finding a group of users ‘N’ whose likes and dislikes are similar to user ‘X’. Once this set of N users is calculated, we can recommend jobs which are liked by this set ’N’ to user X.

Source: [2]

Now, the question is how CF helps you in finding a group of ’N’ similar users to ‘X’. There are primarily two ways to find the group of ’N’ similar users:

  1. Memory based approach: uses the notion of Cosine similarity and finds the k nearest neighbors to calculate the ratings for the items not rated by the user (let’s say user ‘X’)
  2. Model based approach: uses matrix factorization (e.g. SVD), clustering or deep learning models (e.g. Restricted Boltzmann machines) to predict the users’ ratings.
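To make the memory based approach concrete, here is a minimal sketch in plain Python. The tiny ratings dictionary, the user and job names, and the helper names (cosine_similarity, predict_rating) are all made up for illustration; a real system would work on sparse matrices and a proper neighbour search.

```python
import math

# Hypothetical explicit ratings: user -> {job_id: rating}; a missing key means "not rated".
ratings = {
    "X": {"job1": 5, "job2": 4},
    "A": {"job1": 5, "job2": 4, "job3": 5},
    "B": {"job1": 1, "job2": 2, "job3": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity of two users, computed over the items both of them rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_rating(user, item, k=2):
    """Similarity-weighted average of the ratings the k nearest neighbours gave to `item`."""
    neighbours = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    top_k = sorted(neighbours, reverse=True)[:k]
    total_similarity = sum(sim for sim, _ in top_k)
    if total_similarity == 0:
        return None  # no usable neighbours
    return sum(sim * rating for sim, rating in top_k) / total_similarity

print(predict_rating("X", "job3"))
```

In this toy data, user ‘A’ is closer to ‘X’ than user ‘B’, so the rating ‘A’ gave to job3 weighs more in the prediction for ‘X’.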

In this post, we will focus on the most widely used method for CF: matrix factorization.

Let’s say you have a matrix of movie ratings given by different users. In real-world examples, this matrix is very sparse.

Original user-movies(items) ratings matrix [4]

Matrix Factorization (MF):

Source: [3]

The intuition behind MF, illustrated in the picture above, is dimensionality reduction: the user-movies (items) matrix is approximated by the product of two smaller matrices with k latent factors (embeddings), namely the user embedding matrix and the movie embedding matrix.

Singular Value Decomposition (SVD) is one such dimensionality reduction technique that performs matrix factorization internally.

Factorized matrix [4]

Now, once the user and movie embedding matrices are generated, the embedding vector of any user (a single row of the user matrix, let’s say for ‘X’) is multiplied with the embedding vector of a movie, as a dot product, to get the predicted rating for that user-movie pair.

Model prediction of rating (3.25 in yellow cell) = Dot product of the 2 green vectors (embedding vectors) + 0.14 (Movie bias) + 0.87 (User bias) [4]
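The prediction above can be sketched in a few lines of Python. The two embedding vectors are hypothetical values picked so that the result matches the 3.25 from the example; they are not the vectors from the figure. Only the two bias values come from the example.

```python
# Hypothetical 2-factor embeddings (not the values shown in the figure).
user_embedding = [1.2, 0.8]
movie_embedding = [1.0, 1.3]
movie_bias = 0.14  # from the example above
user_bias = 0.87   # from the example above

def predict(user_vec, item_vec, user_b, item_b):
    """Predicted rating = dot product of the two embeddings + user bias + item bias."""
    dot = sum(u * m for u, m in zip(user_vec, item_vec))
    return dot + user_b + item_b

print(round(predict(user_embedding, movie_embedding, user_bias, movie_bias), 2))  # 3.25
```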

We can perform this for all the user-movie combinations to get the predicted values of the ratings. And we will use the actual ratings and corresponding predicted ratings to calculate the RMSE loss.

NOTE: When calculating RMSE, we ignore the predicted ratings for cells where the actual rating is missing.
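A small sketch of this masked RMSE in plain Python, using made-up actual and predicted ratings where None marks a rating the user never gave (the helper name masked_rmse is hypothetical):

```python
import math

# Hypothetical data: rows are users, columns are movies; None means "no actual rating".
actual = [[5, None, 3], [None, 4, 1]]
predicted = [[4.5, 2.0, 3.5], [3.0, 4.0, 2.0]]

def masked_rmse(actual, predicted):
    """RMSE over the cells where an actual rating exists; missing cells are skipped."""
    squared_errors = [
        (a - p) ** 2
        for actual_row, predicted_row in zip(actual, predicted)
        for a, p in zip(actual_row, predicted_row)
        if a is not None
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(masked_rmse(actual, predicted))
```

Only four of the six cells contribute here; the predictions 2.0 and 3.0 for the unrated cells never enter the loss.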

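Finally, one last thing, which we will also use later in pySpark, is the Alternating Least Squares (ALS) method for calculating the user and item embedding matrices. Rather than updating both matrices at once with gradient descent, ALS keeps one matrix fixed and solves a least-squares problem for the other, then alternates between the two until the loss stops improving.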

Optional: For understanding more about the CF algorithm, you can watch this video by Andrew Ng on Coursera: Link (Model based approach) and the other one explains Memory based approach from Stanford University data mining class: Link. For a hands on practice in excel: Link and different types of CF: Link

2. Collaborative filtering with implicit data

That was a heavy dose of algorithms, but it is essential to understand CF on ratings (explicit data) first. Once you understand the fundamentals of CF, it’s easy to adapt it to implicit data. Here we go:

Most of the methodologies used in this section are derived from this research paper: Link

The numerical values of implicit feedback describe the frequency of actions, e.g. how many times you click Data Science job posts, how long you watch a TV series on Netflix or the number of times you listen to a song. We leverage these implicit numbers to capture how confident we are about a user’s preference, or lack of preference, for an item. The preference (pui) of user ‘u’ for item ‘i’ is captured by the following equation:

$$p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{if } r_{ui} = 0 \end{cases}$$

where ‘rui’ is the recorded value for the item (e.g. the number of times you listen to a song). So, if the recorded value is greater than 0, the preference is 1, and if it is equal to 0, the preference is 0.

Now, if you listen to a song only once, it does not necessarily mean you liked it; it could have been on auto-play on YouTube. Similarly, a recorded value of zero does not mean you dislike the song; you may simply not know that it exists or have not discovered it yet. To capture this uncertainty about the preferences, we need the notion of confidence:

$$c_{ui} = 1 + \alpha \, r_{ui}$$

In general, as ‘rui’ grows, we have a stronger indication that the user indeed likes the item. ‘cui’ measures our confidence in observing ‘pui’. Notice that even a user-item pair with ‘rui’ equal to 0, i.e. no recorded interaction with the item, has a minimal confidence of 1. For ‘pui’ equal to 1, the confidence increases as the evidence for a positive preference grows.

The rate of increase is controlled by the constant tuning parameter α.
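The preference and confidence definitions can be sketched in a few lines of Python, assuming the linear confidence form c_ui = 1 + α · r_ui from the paper referenced above. The value of ALPHA and the click counts are illustrative only; α is something you tune.

```python
ALPHA = 40  # illustrative value only; alpha is a tuning parameter

def preference(r_ui):
    """p_ui = 1 if the user interacted with the item at all, otherwise 0."""
    return 1 if r_ui > 0 else 0

def confidence(r_ui, alpha=ALPHA):
    """c_ui = 1 + alpha * r_ui: starts at 1 and grows linearly with the evidence."""
    return 1 + alpha * r_ui

# Hypothetical click counts of one user on four job posts.
for clicks in [0, 1, 3, 10]:
    print(clicks, preference(clicks), confidence(clicks))
# 0 0 1
# 1 1 41
# 3 1 121
# 10 1 401
```

Note how a job that was never clicked still gets confidence 1, while ten clicks give a much higher confidence than a single click even though both have preference 1.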

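Preferences are assumed to be inner products of the user and item embedding vectors: $p_{ui} = U_u^{T} X_i$, where $U_u$ is the embedding vector of user ‘u’ and $X_i$ is the embedding vector of item ‘i’.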

Now, we will calculate the two matrices, the user embedding matrix (‘U’) and the item embedding matrix (‘X’), as we did in the collaborative filtering technique explained above, with only two distinctions from the explicit data case:

  1. We need to account for the varying confidence levels
  2. Optimization should account for all possible user-item pairs, rather than only those corresponding to observed data

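Embeddings (factors) are computed by minimizing the following cost function, where λ is the regularization parameter:

$$\min_{U, X} \sum_{u, i} c_{ui} \left( p_{ui} - U_u^{T} X_i \right)^2 + \lambda \left( \sum_u \lVert U_u \rVert^2 + \sum_i \lVert X_i \rVert^2 \right)$$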

This can be solved by using the ALS optimization process to approximate our preference matrix and serve the recommendations to the user based on the preferences.
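To show what the ALS process does with preferences and confidences, here is a small dense NumPy sketch. It assumes the confidence form c_ui = 1 + α · r_ui and the closed-form update from the paper referenced above; the function name implicit_als, the toy click matrix and all parameter values are made up for illustration. Building a full diagonal confidence matrix per row is only viable for toy data; Spark's implementation is distributed and far more efficient.

```python
import numpy as np

def implicit_als(R, rank=2, iterations=10, alpha=40.0, reg=0.1, seed=0):
    """Dense implicit-feedback ALS. R is a (users x items) array of interaction counts."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)                        # preferences p_ui
    C = 1.0 + alpha * R                              # confidences c_ui
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user embeddings
    X = rng.normal(scale=0.1, size=(n_items, rank))  # item embeddings
    identity = np.eye(rank)

    def solve(fixed, C_rows, P_rows):
        # Row k of the result is (F^T C_k F + reg * I)^-1 F^T C_k p_k,
        # where F is the matrix held fixed and C_k is diag(confidences of row k).
        out = np.empty((C_rows.shape[0], rank))
        for k in range(C_rows.shape[0]):
            Ck = np.diag(C_rows[k])
            A = fixed.T @ Ck @ fixed + reg * identity
            b = fixed.T @ Ck @ P_rows[k]
            out[k] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        U = solve(X, C, P)      # fix the items, solve for the users
        X = solve(U, C.T, P.T)  # fix the users, solve for the items
    return U, X

# Hypothetical click counts: rows are users, columns are job posts.
R = np.array([[3, 0, 1, 0],
              [2, 0, 0, 0],
              [0, 4, 0, 2],
              [0, 1, 0, 5]], dtype=float)
U, X = implicit_als(R)
scores = U @ X.T  # estimated preference for every user-job pair
print(np.round(scores, 2))
```

Unlike the explicit case, every user-item pair contributes to the loss, including the zeros; they simply carry the minimal confidence of 1.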

3. PySpark code to build large scale jobs Rec Engine

Finally, it’s time to code what we have learned so far about Collaborative filtering or Recommendation Engine.

Let’s cook up some data.

job posts clicks by different users (job_clicks.csv)
jobs category like IT, Finance, HR etc (jobs.csv)
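If you want to cook up similar data yourself, a minimal sketch with the Python standard library could look like this. The column layout (user_id, job_id, clicks for job_clicks.csv and job_id, category for jobs.csv) is an assumption made for illustration, as are the helper names and sizes.

```python
import csv
import random

CATEGORIES = ["IT", "Finance", "HR"]  # the categories mentioned above

def make_jobs(n_jobs, rng):
    """Rows of (job_id, category); assumed layout of jobs.csv."""
    return [(job_id, rng.choice(CATEGORIES)) for job_id in range(1, n_jobs + 1)]

def make_clicks(n_users, n_jobs, jobs_per_user, rng):
    """Rows of (user_id, job_id, clicks); assumed layout of job_clicks.csv."""
    rows = []
    for user_id in range(1, n_users + 1):
        # Each user clicks a random subset of jobs between 1 and 10 times.
        for job_id in rng.sample(range(1, n_jobs + 1), jobs_per_user):
            rows.append((user_id, job_id, rng.randint(1, 10)))
    return rows

def write_csv(path, header, rows):
    """Write a header line followed by the rows."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

rng = random.Random(42)
jobs = make_jobs(n_jobs=50, rng=rng)
clicks = make_clicks(n_users=100, n_jobs=50, jobs_per_user=5, rng=rng)
# Uncomment to write the files to the current directory:
# write_csv("jobs.csv", ["job_id", "category"], jobs)
# write_csv("job_clicks.csv", ["user_id", "job_id", "clicks"], clicks)
```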

So, the first step is to load the data shown above into an RDD and process it. Next, we need to split it into train, validation and test sets for modeling purposes.

Data load and splitting
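A minimal sketch of this step with the RDD-based API, assuming job_clicks.csv has a header line and three comma-separated columns (user_id, job_id, clicks). The file layout, the helper names and the 60/20/20 split are assumptions for illustration, not necessarily what the original project uses.

```python
def parse_click(line):
    """Turn a 'user_id,job_id,clicks' line into a (user, job, clicks) tuple of numbers."""
    user_id, job_id, clicks = line.split(",")
    return int(user_id), int(job_id), float(clicks)

def load_and_split(sc, path="job_clicks.csv", seed=42):
    """Load the clicks file into an RDD of Rating objects and split it 60/20/20."""
    # Imported here so parse_click can be used without a Spark installation.
    from pyspark.mllib.recommendation import Rating

    lines = sc.textFile(path)
    header = lines.first()  # assumes the first line is a header
    ratings = (
        lines.filter(lambda line: line != header)
        .map(parse_click)
        .map(lambda t: Rating(t[0], t[1], t[2]))
    )
    train, validation, test = ratings.randomSplit([0.6, 0.2, 0.2], seed=seed)
    return train.cache(), validation.cache(), test.cache()
```

With a running SparkContext `sc`, this would be called as `train, validation, test = load_and_split(sc)`.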

Once you have divided the data into three sets, use the train and validation sets to train the model and tune the hyper-parameters (or use cross-validation).
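A possible tuning loop is sketched below, assuming `train` and `validation` are RDDs of Rating objects. ALS.trainImplicit comes from pyspark.mllib.recommendation; the helper names, the grid values and the choice of metric are assumptions. RMSE against the observed preference of 1.0 is used only as a simple proxy here, and a ranking metric may suit implicit data better.

```python
import itertools
import math

def param_grid(ranks, iters, alphas, reguls):
    """All combinations of the four hyper-parameters."""
    return list(itertools.product(ranks, iters, alphas, reguls))

def validation_rmse(model, validation):
    """RMSE between the predicted preference and the observed preference (1.0) on held-out clicks."""
    pairs = validation.map(lambda r: (r.user, r.product))
    predictions = model.predictAll(pairs).map(lambda r: ((r.user, r.product), r.rating))
    observed = validation.map(lambda r: ((r.user, r.product), 1.0))
    squared_errors = observed.join(predictions).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
    return math.sqrt(squared_errors.mean())

def tune(train, validation, grid, seed=42):
    """Train one implicit ALS model per grid point and return the best (rmse, params)."""
    from pyspark.mllib.recommendation import ALS  # lazy import: needs a Spark installation

    best = None
    for rank, n_iter, alpha, regul in grid:
        model = ALS.trainImplicit(
            train, rank, iterations=n_iter, lambda_=regul, alpha=alpha, seed=seed
        )
        rmse = validation_rmse(model, validation)
        if best is None or rmse < best[0]:
            best = (rmse, (rank, n_iter, alpha, regul))
    return best

# Illustrative grid; the values are not recommendations.
grid = param_grid(ranks=[8, 10], iters=[10], alphas=[1.0, 40.0], reguls=[0.01, 0.1])
```

Calling `tune(train, validation, grid)` then returns the lowest validation RMSE together with the (rank, iters, alpha, reguls) combination that produced it.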


Observe the 4 hyper-parameters used for tuning in the Alternating Least Squares technique (ALS.trainImplicit function) in the code above:

  1. rank: denotes the width/columns of the embedding matrices
  2. iters (iterations): denotes how many times we alternate between the user and item matrices in the ALS process to generate the embedding matrices
  3. alpha: scales the recorded values (number of times a user performs an action, e.g. watching a movie) in the confidence equation
  4. reguls (lambda): is the regularization parameter to prevent over-fitting

Finally, you can train the model with the best set of hyper-parameters and use it for predicting the recommendations or confidence level of the user’s preferences to show top N recommendations.

Here is the code for loading & processing the data, model building and predicting the recommendations for all the users.
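As a rough sketch of the final step, assuming `train` is an RDD of Rating objects and `best_params` is the (rank, iters, alpha, reguls) tuple found during tuning, the top N jobs for every user could be produced like this. The helper names are hypothetical; ALS.trainImplicit and recommendProductsForUsers are part of pyspark.mllib.recommendation.

```python
def format_recommendations(recs):
    """Turn Rating-like objects into a list of (job_id, score), best score first."""
    ordered = sorted(recs, key=lambda r: r.rating, reverse=True)
    return [(r.product, round(r.rating, 4)) for r in ordered]

def top_n_for_all_users(train, best_params, n=5, seed=42):
    """Train the final implicit ALS model and return an RDD of (user, [(job_id, score), ...])."""
    from pyspark.mllib.recommendation import ALS  # lazy import: needs a Spark installation

    rank, n_iter, alpha, regul = best_params
    model = ALS.trainImplicit(
        train, rank, iterations=n_iter, lambda_=regul, alpha=alpha, seed=seed
    )
    # recommendProductsForUsers yields (user, (Rating, Rating, ...)) pairs.
    return model.recommendProductsForUsers(n).mapValues(format_recommendations)
```

For example, `top_n_for_all_users(train, best_params).take(3)` would show the recommendations for three users. The scores are estimated preferences, not click counts.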

For a complete demo of the project with the cooked up data-set used in this article, please check out the GitHub link.

Next steps could be to deploy this model to the cloud and serve recommendations in real time, depending on your requirements. Moreover, there are other cool libraries like ‘implicit’ for building Rec Engines, but the underlying concept is the same as what we discussed in the first two sections.

Also, please share your valuable inputs or feedback on making it better and more scalable. I have another blog on model building and deployment using Docker containers, so please check that out if you love scaling your models.


[1] RE Image: link

Original. Reposted with permission.

Opinions expressed by AI Time Journal contributors are their own.

About Akshay Arora

Contributor Lead Data Scientist | BlockChain
