Deploying a recommender system for the movie-lens dataset – Part 1

No Comments

Introduction

In this post I will discuss building a simple recommender system for a movie database which will be able to:
– suggest top N movies similar to a given movie title to users, and
– predict user votes for the movies they have not voted for.
In the next part of this article I will show how to deploy this model using a Rest API in Python Flask, in an attempt to make this recommendation system easily useable in production.

A recommender system for a movie database

Recommender systems are so prevalently used in the net these days that we all have come across them in one form or another. Have you ever received suggestions on Amazon on what to buy next? Or suggestions on what websites you may like on Facebook?
Aside from the natural disconcerting feeling of being chased and traced, they can sometimes be helpful in navigating us into the right direction. Let’s look at an appealing example of recommendation systems in the movie industry.
I will be using the data provided from Movie-lens 20M datasets to describe different methods and systems one could build. With a bit of fine tuning, the same algorithms should be applicable to other datasets as well.
recommender system
I find the above diagram the best way of categorising different methodologies for building a recommender system. I will briefly explain some of these entries in the context of movie-lens data with some code in python. Full scripts for this article are accessible on my GitHub page. Suppose someone has watched “Inception (2010)” and loved it! What can my recommender system suggest to them to watch next?
Well, I could suggest different movies on the basis of the content similarity to the selected movie such as genres, cast and crew names, keywords and any other metadata from the movie. In that case I would be using an item-content filtering. I could also compare the user metadata such as age and gender to the other users and suggest items to the user that similar users have liked. In that case I would be using a user-content filtering. The movie-lens dataset used here does not contain any user content data. So in a first step we will be building an item-content (here a movie-content) filter.

Memory-based content filtering

In memory-based methods we don’t have a model that learns from the data to predict, but rather we form a pre-computed matrix of similarities that can be predictive. Please read on and you’ll see what I mean!
The data sets I have used for an item content filtering are movies.csv and tags.csv.
I skip the data wrangling and filtering part which you can find in the well-commented in the scripts on my GitHub page. We collect all the tags given to each movie by various users, add the movie’s genre keywords and form a final data frame with a metadata column for each movie.

# create a mixed dataframe of movies title, genres 
# and all user tags given to each movie
mixed = pd.merge(movies, tags, on='movieId', how='left')
mixed.head(3)

movie data

# create metadata from tags and genres
mixed.fillna("", inplace=True)
mixed = pd.DataFrame(mixed.groupby('movieId')['tag'].apply(
                             lambda x: "%s" % ' '.join(x))
Final = pd.merge(movies, mixed, on='movieId', how='left')
Final ['metadata'] = Final[['tag', 'genres']].apply(
                             lambda x: ' '.join(x), axis = 1)
Final[['movieId','title','metadata']].head(3)

movie data

We then transform these metadata texts to vectors of features using Tf-idf transformer of scikit-learn package. Each movie will transform into a vector of the length ~ 23000! But we don’t really need such large feature vectors to describe movies. Truncated singular value decomposition (SVD) is a good tool to reduce dimensionality of our feature matrix especially when applied on Tf-idf vectors. As you can see from the explained variance graph below, with 200 latent components (reduction from ~23000) we can explain more than 50% of variance in the data which suffices for our purpose in this work. So we will keep a latent matrix of 200 components as opposed to 23704 which expedites our analysis greatly.
We name this latent matrix the content_latent and use this matrix a few steps later to find our top N similar movies to a given movie title. But let’s learn a bit about the ratings data.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(Final['metadata'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=Final.index.tolist())
print(tfidf_df.shape)
(26694, 23704)
# Compress with SVD
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=200)
latent_matrix = svd.fit_transform(tfidf_df)
 
# plot var expalined to see what latent dimensions to use
explained = svd.explained_variance_ratio_.cumsum()
plt.plot(explained, '.-', ms = 16, color='red')
plt.xlabel('Singular value components', fontsize= 12)
plt.ylabel('Cumulative percent of variance', fontsize=12)        
plt.show()

movies

Memory-based collaborative filtering

Aside from the movie metadata we have another valuable source of information at our exposure: the user rating data. Our recommender system can recommend a movie that is similar to “Inception (2010)” on the basis of user ratings. In other words, what other movies have received similar ratings by other users? This would be an example of item-item collaborative filtering. You might have heard of it as “The users who liked this item also liked these other ones.” The data set of interest would be ratings.csv and we manipulate it to form items as vectors of input rates by the users. As there are many missing votes by users, we have imputed Nan(s) by 0 which would suffice for the purpose of our collaborative filtering.
movie metadata
Here we have movies as vectors of length ~80000. Again as before we can apply a truncated SVD to this rating matrix and only keep the first 200 latent components which we will name the collab_latent matrix. The next step is to use a similarity measure and find the top N most similar movies to “Inception (2010)” on the basis of each of these filtering methods we introduced. Cosine similarity is one of the similarity measures we can use. To see a summary of other similarity criteria, read Ref [2]- page 93.
In the following, you will see how the similarity of an input movie title can be calculated with both content and collaborative latent matrices. I have also added a hybrid filter which is an average measure of similarity from both content and collaborative filtering standpoints. If I list the top 10 most similar movies to “Inception (2010)” on the basis of the hybrid measure, you will see the following list in the data frame. For me personally, the hybrid measure is predicting more reasonable titles than any of the other filters.

from sklearn.metrics.pairwise import cosine_similarity
# take the latent vectors for a selected movie from both content 
# and collaborative matrixes
a_1 = np.array(Content_df.loc['Inception (2010)']).reshape(1, -1)
a_2 = np.array(Collab_df.loc['Inception (2010)']).reshape(1, -1)
 
# calculate the similartity of this movie with the others in the list
score_1 = cosine_similarity(Content_df, a_1).reshape(-1)
score_2 = cosine_similarity(Collab_df, a_2).reshape(-1)
 
# an average measure of both content and collaborative 
hybrid = ((score_1 + score_2)/2.0)
 
# form a data frame of similar movies 
dictDf = {'content': score_1 , 'collaborative': score_2, 'hybrid': hybrid} 
similar = pd.DataFrame(dictDf, index = Content_df.index )
 
#sort it on the basis of either: content, collaborative or hybrid 
similar.sort_values('content', ascending=False, inplace=True)
similar[['content']][1:].head(11)

movie data filtering

We could use the similarity information we gained from item-item collaborative filtering to compute a rating prediction,
\(r_{ui}\), for an item \((i)\) by a user \((u)\) where the rating is missing. Namely by taking a weighted average on the rating values of the top K nearest neighbours of item \((i)\). The Ref [2] page 97 discusses the parameters that can refine this prediction.

As mentioned right at the beginning of this article, there are model-based methods that use statistical learning rather than ad hoc heuristics to predict the missing rates. In the next section, we show how one can use a matrix factorisation model for the predictions of a user’s unknown votes.

Model-based collaborative filtering

Previously we used truncated SVD as a means to reduce the dimensionality of our matrices. To that end, we imputed the missing rating data with zero to compute SVD of a sparse matrix. However, one could also compute an estimate to SVD in an iterative learning process. For this purpose we only use the known ratings and try to minimise the error of computing the known rates via gradient descent. This algorithm was popularised during the Netflix prize for the best recommender system. Here is a more mathematical description of what I mean for the more interested reader. Otherwise you can skip this part and jump to the implementation part.

Mathematical description

SVD factorizes our rating matrix \(M_{m \times n}\) with a rank of \(k\), according to equation (1a) to
3 matrices of \(U_{m \times k}\), \(\Sigma_{k \times k}\) and \(I^T_{n \times k}\):

\(M = U \Sigma_k I^T \tag{1a}\)
\(M \approx U \Sigma_{k\prime} I^T \tag{1b}\)

where \(U\) is the matrix of user preferences and \(I\) the item preferences and \(\Sigma\) the matrix of singular values. The beauty of SVD is in this simple notion that instead of a full \(k\) vector space, we can approximate \(M\) on a much smaller \(k\prime\) latent space as in (1b). This approximation will not only reduce the dimensions of the rating matrix, but it also takes into account only the most important singular values and leaves behind the smaller singular values which could otherwise result in noise. This concept was used for the dimensionality reduction above as well.
To approximate \(M\), we would like to find \(U\) and \(I\) matrices in \(k\prime\) space using all the known rates which would mean we will solve an optimisation problem. According to (2), every rate entry in \(M\), \(r_{ui}\) can be written as a dot product of \(p_u\) and \(q_i\):

\(r_{ui} = p_u \cdot q_i \tag{2}\)

where \(p_u\) makes up the rows of \(U\) and \(q_i\) the columns of \(I^T\). Here we disregard the diagonal \(\Sigma\) matrix for simplicity (as it provides only a scaling factor). Graphically it would look something like this:

Finding all \(p_u\) and \(q_i\)s for all users and items will be possible via the following minimisation:

\( \min_{p_u,q_i} = \sum_{r_{ui}\in M}(r_{ui} – p_u \cdot q_i)^2 \tag{3}\)

A gradient descent (GD) algorithm (or a variant of it such as stochastic gradient descent SGD) can be used to solve the minimisation problem and to compute all \(p_u\) and \(q_i\)s. I will not describe the minimisation procedure in more detail here. You can read more about it on this blog or in Ref [2]. After we have all the entries of \(U\) and \(I\), the unknown rating r_{ui} will be computed according to eq. (2).
The minimisation process in (3) can also be regularised and fine-tuned with biases.

Implementation

A SVD algorithm similar to the one described above has been implemented in Surprise library, which I will use here. Aside from SVD, deep neural networks have also been repeatedly used to calculate the rating predictions. This blog entry describes one such effort. SVD was chosen because it produces a comparable accuracy to neural nets with a simpler training procedure. In the following you can see the steps to train a SVD model in Surprise.
We gain a root-mean-squared error (RMSE) accuracy of 0.77 (the lower the better!) for our rating data, which does not sound bad at all. In fact, with a memory-based prediction from the item-item collaborative filtering described in the previous section, I could not get an RMSE lower that 1.0; that’s 23% improvement in prediction!
Next we use this trained model to predict ratings for the movies that a given user \(u\), here e.g. with the \(id\) = 7010, has not rated yet. The top 10 highly rated movies can be recommended to user 7010 as you can see below.

from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split
 
# instantiate a reader and read in our rating data
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_f[['userId','movieId','rating']], reader)
 
# train SVD on 75% of known rates
trainset, testset = train_test_split(data, test_size=.25)
algorithm = SVD()
algorithm.fit(trainset)
predictions = algorithm.test(testset)
 
# check the accuracy using Root Mean Square Error
accuracy.rmse(predictions)
RMSE: 0.7724
 
# check the preferences of a particular user
user_id = 7010
predicted_ratings = pred_user_rating(user_id)
pdf = pd.DataFrame(predicted_ratings, columns = ['movies','ratings'])
pdf.sort_values('ratings', ascending=False, inplace=True)  
pdf.set_index('movies', inplace=True)
pdf.head(10)

user preference

Conclusion

As you saw in this article, there are a handful of methods one could use to build a recommendation system. The data scientist is tasked with finding and fine-tuning the methods that match the data better.
In the next part of this article I will be showing how the methods and models introduced here can be rearranged and categorised differently to facilitate serving and deployment. We will serve our model as a REST-ful API in Flask-restful with multiple recommendation endpoints.

References

Ref [1] – IEEE Transactions on knowledge and data engineering, Vol. 17, No. 6, JUNE 2005, DOI: 10.1109/TKDE.2005.99.
Ref [2] – Foundations and Trends in Human–Computer Interaction Vol. 4, No. 2, DOI: 10.1561/1100000009.

Comment

Your email address will not be published. Required fields are marked *