Deploying a recommender system for the movie-lens dataset – Part 1

15.7.2019 | 10 minutes of reading time

Introduction

In this post I will discuss building a simple recommender system for a movie database which will be able to:
– suggest top N movies similar to a given movie title to users, and
– predict user votes for the movies they have not voted for.
In the next part of this article I will show how to deploy this model using a Rest API in Python Flask, in an attempt to make this recommendation system easily useable in production.

A recommender system for a movie database

Recommender systems are so prevalently used in the net these days that we all have come across them in one form or another. Have you ever received suggestions on Amazon on what to buy next? Or suggestions on what websites you may like on Facebook?
Aside from the natural disconcerting feeling of being chased and traced, they can sometimes be helpful in navigating us into the right direction. Let’s look at an appealing example of recommendation systems in the movie industry.
I will be using the data provided from Movie-lens 20M datasets to describe different methods and systems one could build. With a bit of fine tuning, the same algorithms should be applicable to other datasets as well.

I find the above diagram the best way of categorising different methodologies for building a recommender system. I will briefly explain some of these entries in the context of movie-lens data with some code in python. Full scripts for this article are accessible on my GitHub page . Suppose someone has watched “Inception (2010)” and loved it! What can my recommender system suggest to them to watch next?
Well, I could suggest different movies on the basis of the content similarity to the selected movie such as genres, cast and crew names, keywords and any other metadata from the movie. In that case I would be using an item-content filtering. I could also compare the user metadata such as age and gender to the other users and suggest items to the user that similar users have liked. In that case I would be using a user-content filtering. The movie-lens dataset used here does not contain any user content data. So in a first step we will be building an item-content (here a movie-content) filter.

Memory-based content filtering

In memory-based methods we don’t have a model that learns from the data to predict, but rather we form a pre-computed matrix of similarities that can be predictive. Please read on and you’ll see what I mean!
The data sets I have used for an item content filtering are movies.csv and tags.csv.
I skip the data wrangling and filtering part which you can find in the well-commented in the scripts on my GitHub page . We collect all the tags given to each movie by various users, add the movie’s genre keywords and form a final data frame with a metadata column for each movie.

1# create a mixed dataframe of movies title, genres 
2# and all user tags given to each movie
3mixed = pd.merge(movies, tags, on='movieId', how='left')
4mixed.head(3)

1# create metadata from tags and genres
2mixed.fillna("", inplace=True)
3mixed = pd.DataFrame(mixed.groupby('movieId')['tag'].apply(
4                             lambda x: "%s" % ' '.join(x))
5Final = pd.merge(movies, mixed, on='movieId', how='left')
6Final ['metadata'] = Final[['tag', 'genres']].apply(
7                             lambda x: ' '.join(x), axis = 1)
8Final[['movieId','title','metadata']].head(3)

We then transform these metadata texts to vectors of features using Tf-idf transformer of scikit-learn package. Each movie will transform into a vector of the length ~ 23000! But we don’t really need such large feature vectors to describe movies. Truncated singular value decomposition (SVD) is a good tool to reduce dimensionality of our feature matrix especially when applied on Tf-idf vectors . As you can see from the explained variance graph below, with 200 latent components (reduction from ~23000) we can explain more than 50% of variance in the data which suffices for our purpose in this work. So we will keep a latent matrix of 200 components as opposed to 23704 which expedites our analysis greatly.
We name this latent matrix the content_latent and use this matrix a few steps later to find our top N similar movies to a given movie title. But let’s learn a bit about the ratings data.

1from sklearn.feature_extraction.text import TfidfVectorizer
2tfidf = TfidfVectorizer(stop_words='english')
3tfidf_matrix = tfidf.fit_transform(Final['metadata'])
4tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), index=Final.index.tolist())
5print(tfidf_df.shape)

1(26694, 23704)

1# Compress with SVD
2from sklearn.decomposition import TruncatedSVD
3svd = TruncatedSVD(n_components=200)
4latent_matrix = svd.fit_transform(tfidf_df)
5 
6# plot var expalined to see what latent dimensions to use
7explained = svd.explained_variance_ratio_.cumsum()
8plt.plot(explained, '.-', ms = 16, color='red')
9plt.xlabel('Singular value components', fontsize= 12)
10plt.ylabel('Cumulative percent of variance', fontsize=12)        
11plt.show()

Memory-based collaborative filtering

Aside from the movie metadata we have another valuable source of information at our exposure: the user rating data. Our recommender system can recommend a movie that is similar to “Inception (2010)” on the basis of user ratings. In other words, what other movies have received similar ratings by other users? This would be an example of item-item collaborative filtering. You might have heard of it as “The users who liked this item also liked these other ones.” The data set of interest would be ratings.csv and we manipulate it to form items as vectors of input rates by the users. As there are many missing votes by users, we have imputed Nan(s) by 0 which would suffice for the purpose of our collaborative filtering.

Here we have movies as vectors of length ~80000. Again as before we can apply a truncated SVD to this rating matrix and only keep the first 200 latent components which we will name the collab_latent matrix. The next step is to use a similarity measure and find the top N most similar movies to “Inception (2010)” on the basis of each of these filtering methods we introduced. Cosine similarity is one of the similarity measures we can use. To see a summary of other similarity criteria, read Ref [2]- page 93.
In the following, you will see how the similarity of an input movie title can be calculated with both content and collaborative latent matrices. I have also added a hybrid filter which is an average measure of similarity from both content and collaborative filtering standpoints. If I list the top 10 most similar movies to “Inception (2010)” on the basis of the hybrid measure, you will see the following list in the data frame. For me personally, the hybrid measure is predicting more reasonable titles than any of the other filters.

1from sklearn.metrics.pairwise import cosine_similarity
2# take the latent vectors for a selected movie from both content 
3# and collaborative matrixes
4a_1 = np.array(Content_df.loc['Inception (2010)']).reshape(1, -1)
5a_2 = np.array(Collab_df.loc['Inception (2010)']).reshape(1, -1)
6 
7# calculate the similartity of this movie with the others in the list
8score_1 = cosine_similarity(Content_df, a_1).reshape(-1)
9score_2 = cosine_similarity(Collab_df, a_2).reshape(-1)
10 
11# an average measure of both content and collaborative 
12hybrid = ((score_1 + score_2)/2.0)
13 
14# form a data frame of similar movies 
15dictDf = {'content': score_1 , 'collaborative': score_2, 'hybrid': hybrid} 
16similar = pd.DataFrame(dictDf, index = Content_df.index )
17 
18#sort it on the basis of either: content, collaborative or hybrid 
19similar.sort_values('content', ascending=False, inplace=True)
20similar[['content']][1:].head(11)

We could use the similarity information we gained from item-item collaborative filtering to compute a rating prediction,
\(r_{ui}\), for an item \((i)\) by a user \((u)\) where the rating is missing. Namely by taking a weighted average on the rating values of the top K nearest neighbours of item \((i)\). The Ref [2] page 97 discusses the parameters that can refine this prediction.

As mentioned right at the beginning of this article, there are model-based methods that use statistical learning rather than ad hoc heuristics to predict the missing rates. In the next section, we show how one can use a matrix factorisation model for the predictions of a user’s unknown votes.

Model-based collaborative filtering

Previously we used truncated SVD as a means to reduce the dimensionality of our matrices. To that end, we imputed the missing rating data with zero to compute SVD of a sparse matrix. However, one could also compute an estimate to SVD in an iterative learning process. For this purpose we only use the known ratings and try to minimise the error of computing the known rates via gradient descent. This algorithm was popularised during the Netflix prize for the best recommender system. Here is a more mathematical description of what I mean for the more interested reader. Otherwise you can skip this part and jump to the implementation part.

Mathematical description

SVD factorizes our rating matrix \(M_{m \times n}\) with a rank of \(k\), according to equation (1a) to
3 matrices of \(U_{m \times k}\), \(\Sigma_{k \times k}\) and \(I^T_{n \times k}\):

\(M = U \Sigma_k I^T \tag{1a}\)
\(M \approx U \Sigma_{k\prime} I^T \tag{1b}\)

where \(U\) is the matrix of user preferences and \(I\) the item preferences and \(\Sigma\) the matrix of singular values. The beauty of SVD is in this simple notion that instead of a full \(k\) vector space, we can approximate \(M\) on a much smaller \(k\prime\) latent space as in (1b). This approximation will not only reduce the dimensions of the rating matrix, but it also takes into account only the most important singular values and leaves behind the smaller singular values which could otherwise result in noise. This concept was used for the dimensionality reduction above as well.
To approximate \(M\), we would like to find \(U\) and \(I\) matrices in \(k\prime\) space using all the known rates which would mean we will solve an optimisation problem. According to (2), every rate entry in \(M\), \(r_{ui}\) can be written as a dot product of \(p_u\) and \(q_i\):

\(r_{ui} = p_u \cdot q_i \tag{2}\)

where \(p_u\) makes up the rows of \(U\) and \(q_i\) the columns of \(I^T\). Here we disregard the diagonal \(\Sigma\) matrix for simplicity (as it provides only a scaling factor). Graphically it would look something like this:

Finding all \(p_u\) and \(q_i\)s for all users and items will be possible via the following minimisation:

\( \min_{p_u,q_i} = \sum_{r_{ui}\in M}(r_{ui} – p_u \cdot q_i)^2 \tag{3}\)

A gradient descent (GD) algorithm (or a variant of it such as stochastic gradient descent SGD) can be used to solve the minimisation problem and to compute all \(p_u\) and \(q_i\)s. I will not describe the minimisation procedure in more detail here. You can read more about it on this blog or in Ref [2]. After we have all the entries of \(U\) and \(I\), the unknown rating r_{ui} will be computed according to eq. (2).
The minimisation process in (3) can also be regularised and fine-tuned with biases.

Implementation

A SVD algorithm similar to the one described above has been implemented in Surprise library, which I will use here. Aside from SVD, deep neural networks have also been repeatedly used to calculate the rating predictions. This blog entry describes one such effort. SVD was chosen because it produces a comparable accuracy to neural nets with a simpler training procedure. In the following you can see the steps to train a SVD model in Surprise .
We gain a root-mean-squared error (RMSE) accuracy of 0.77 (the lower the better!) for our rating data, which does not sound bad at all. In fact, with a memory-based prediction from the item-item collaborative filtering described in the previous section, I could not get an RMSE lower that 1.0; that’s 23% improvement in prediction!
Next we use this trained model to predict ratings for the movies that a given user \(u\), here e.g. with the \(id\) = 7010, has not rated yet. The top 10 highly rated movies can be recommended to user 7010 as you can see below.

1from surprise import Dataset, Reader, SVD, accuracy
2from surprise.model_selection import train_test_split
3 
4# instantiate a reader and read in our rating data
5reader = Reader(rating_scale=(1, 5))
6data = Dataset.load_from_df(ratings_f[['userId','movieId','rating']], reader)
7 
8# train SVD on 75% of known rates
9trainset, testset = train_test_split(data, test_size=.25)
10algorithm = SVD()
11algorithm.fit(trainset)
12predictions = algorithm.test(testset)
13 
14# check the accuracy using Root Mean Square Error
15accuracy.rmse(predictions)
16RMSE: 0.7724
17 
18# check the preferences of a particular user
19user_id = 7010
20predicted_ratings = pred_user_rating(user_id)
21pdf = pd.DataFrame(predicted_ratings, columns = ['movies','ratings'])
22pdf.sort_values('ratings', ascending=False, inplace=True)  
23pdf.set_index('movies', inplace=True)
24pdf.head(10)

Conclusion

As you saw in this article, there are a handful of methods one could use to build a recommendation system. The data scientist is tasked with finding and fine-tuning the methods that match the data better.
In the next part of this article I will be showing how the methods and models introduced here can be rearranged and categorised differently to facilitate serving and deployment. We will serve our model as a REST-ful API in Flask-restful with multiple recommendation endpoints.

References

Ref [1] – IEEE Transactions on knowledge and data engineering, Vol. 17, No. 6, JUNE 2005, DOI: 10.1109/TKDE.2005.99 .
Ref [2] – Foundations and Trends in Human–Computer Interaction Vol. 4, No. 2, DOI: 10.1561/1100000009 .

Was this post helpful?

Likes

Blog author

Sherri Hadian

Do you still have questions? Just send me a message.

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Green Cloud: Daten und Emissionen sparen

Das Internet produziert jährlich 900 Millionen Tonnen CO₂ – das ist deutlich mehr als Deutschland insgesamt emittiert. Hauptverantwortlich ist der immer weiter steigende Stromverbrauch beim Transport und der Speicherung von Daten. Wenn ihr kurz darüber...

Cloud
Green IT
Softwarearchitektur
Data

11.3.2024 | 5 Minuten Lesezeit

Dennis

Charge your APIs Volume 23: REST vs. gRPC

APIs dienen als Verbindungsstück zwischen Daten und Verarbeitung und erlauben uns damit, Daten im richtigen Kontext als Informationen zu interpretieren. Passende fachliche Themen sind dabei präsenter denn je und erreichen bald auch den Endverbraucher...

Java
Softwareentwicklung
Spring
Softwarearchitektur
API
Data

11.2.2024 | 7 Minuten Lesezeit

Sebastian Tiemann

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Im Bereich des maschinellen Lernens wurde eine lange Zeit angenommen, dass die Eingabedaten von Modellen und Gewichten sicher sei und nicht extrahiert werden könnten. In den letzten Jahren veröffentlichte Forschung hat diese Annahme in Frage gestellt...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 8 Minuten Lesezeit

Ihsan Kisi

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Mithilfe von Daten können Unternehmen fundiertere Entscheidungen treffen, ihre Arbeitsabläufe optimieren und mit der Kraft des maschinellen Lernens (ML) einen Vorteil in der wettbewerbsintensiven Geschäftswelt erlangen. Allerdings ist der Umgang mit ...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 7 Minuten Lesezeit

Ihsan Kisi

Bessere SQL-Datenpipelines mit dbt

SQL ist weiterhin aus der Datenanalyse nicht wegzudenken – es ist vergleichsweise einfach zu lernen und Anwender können es ohne zusätzliche Werkzeuge auf einer Datenbank ausführen. Entsprechend ist es bei vielen Datenanalysten und Engineers beliebt. ...

Data

22.2.2023 | 2 Minuten Lesezeit

Matthias Niehoff

Data Governance: Wie können wir Daten demokratisieren?

“Data is the new oil” ist inzwischen ein alter Hut. Jedes Unternehmen versucht, Daten besser zu nutzen, sei es, um die eigenen Prozesse zu optimieren, die Kunden besser zu verstehen oder neue Produkte anzubieten. Dabei stellen fast alle fest: Wir haben...

Data Science

23.11.2022 | 2 Minuten Lesezeit

Matthias Niehoff

Streaming Wikipedia mit Apache Kafka

Apache Kafka ist in aller Munde und entwickelt sich im Kontext von verteilten Systemen zum De-facto-Standard als Plattform für Event Streaming. Im Rahmen unserer OffProject Time (Weiterbildungszeit) haben wir uns die Plattform auch näher angeschaut und...

Kotlin
Data
Java
Messaging
Spring

15.8.2022 | 10 Minuten Lesezeit

Christoph Metzger

Felix Rieß

Einführung in die Welt der Tourenoptimierung – Echte Routen und realistischere...

In diesem Artikel möchte ich euch mit einem Python Jupyter Notebook zeigen, wie ihr Anwendungsfälle der Tourenoptimierung inklusive Nebenbedingungen lösen und visualisieren könnt. Außerdem zeige ich euch, wie ihr mit OpenStreetMaps die Route zwischen...

Data

21.6.2022 | 7 Minuten Lesezeit

Lukas Heidemann

Einführung in die Welt der Tourenoptimierung – Visualisierung und Lösungsverfahren...

In diesem Artikel möchte ich euch zeigen, wie ihr Probleme der Tourenoptimierung in einem Python Jupyter Notebook lösen und visualisieren könnt. Am Beispiel eines Fahrradkurierdienst zeige ich außerdem, wie das Grundproblem um gängige Nebenbedingungen...

Data

16.6.2022 | 9 Minuten Lesezeit

Lukas Heidemann

Einführung in die Welt der Tourenoptimierung (1/3)

In vielen Unternehmen fallen täglich verschiedene Transportprozesse an. Klassische Beispiele sind die Optimierung von Warenein- und ausgängen, die Einsatzplanung von Servicetechnikern oder die optimale Reihenfolge der Auslieferung bei Lieferdiensten....

Data

12.6.2022 | 8 Minuten Lesezeit

Lukas Heidemann

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Die Qualität bzw. Nützlichkeit von Machine-Learning-Modellen lässt sich mit Hilfe von Testdaten und Metriken bewerten. Allerdings in welchem Umfang? Manuell, automatisiert, einmalig, regelmäßig? Manuell lassen sich die ersten Modelle als Ergebnis eines...

Data
Machine Learning
Softwareentwicklung
CI/CD

7.12.2021 | 7 Minuten Lesezeit

Berthold Schulte

Schnelles Training eines Recommendation-Modells durch BigQuery ML

Machine Learning (ML) kann nur durch Modelle in der Produktion Business Value erzeugen. Allerdings kann die Zeitspanne zwischen der Entwicklung der nächsten Iteration eines Modells und dessen Einsatz in einer Produktionsumgebung massiv sein. Dies gilt...

Accelerate
Cloud
Data
Google Cloud
Machine Learning

26.7.2021 | 11 Minuten Lesezeit

Niklas Haas

Timo Böhm

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Heutzutage steht fast alles, was mit den Labels „künstliche Intelligenz (KI)“ oder „Machine Learning (ML)“ versehen ist, für Fortschritt. Seltsamerweise schließt diese Assoziation jedoch häufig die Themen Daten und Dateninfrastruktur nicht ausreichend...

Kultur
Data
Machine Learning

21.6.2021 | 12 Minuten Lesezeit

Marcel Mikl

Schnelles KI-Prototyping mit Google Cloud AutoML Vision

Bei klassischen Machine-Learning-(ML-)Projekten beschäftigen sich Data Scientists häufig längere Zeit (mehrere Monate) mit der Entwicklung eines ML-Modells. Dabei werden hohe Kosten verursacht und die Zeit, bis ein erstes Modell zur Verfügung steht, ...

Cloud
Computer Vision
Data
Künstliche Intelligenz
Google Cloud
Machine Learning

17.5.2021 | 5 Minuten Lesezeit

Nils Bauroth

Sven Rediske

The Good, the Bad and the Ugly: Daten effektiv visualisieren und kommunizieren

Dieser Artikel begleitet meinen Vortrag The Good, the Bad and the Ugly: Daten effektiv visualisieren und kommunizieren, den ich am 20.10.2020 auf der data2day gehalten habe.Datenvisualisierung ist ausschlaggebend für Verständnis und KommunikationDatenvisualisierung...

Data
Data Science

19.10.2020 | 11 Minuten Lesezeit

Shirin Elsinghorst

KI in der Praxis: Fehlerhafte Bauteile mit Rekognition auf AWS identifizieren

Noch vor kurzer Zeit mussten für den Einsatz von künstlicher Intelligenz (KI) unter großem Aufwand eigene KI-Modelle erstellt werden. Heute ist für viele Anwendungsfälle die Einstiegshürde in die Welt der KI durch Cloud-Computing-Dienste stark gesunken...

Cloud
Computer Vision
Data
Künstliche Intelligenz
Machine Learning
Python

29.7.2020 | 11 Minuten Lesezeit

Marcel Mikl

Nico Axtmann

KI in der Praxis: Fehlerhafte Bauteile mit AutoML in der Google Cloud ...

Noch vor kurzer Zeit war der Einsatz von künstlicher Intelligenz (KI) nur mit großem Aufwand und Konstruktion eigener neuronaler Netze möglich. Heute ist die Einstiegshürde in die Welt der KI durch Cloud-Computing-Dienste stark gesunken. So kann man ...

Cloud
Computer Vision
Data
Python
Machine Learning
Google Cloud
Künstliche Intelligenz

8.7.2020 | 11 Minuten Lesezeit

Nico Axtmann

Marcel Mikl

KI für KMU: (Teil-)Automatisierung der Qualitätskontrolle von Bauteilen

Noch vor kurzer Zeit war der Einsatz von künstlicher Intelligenz (KI) nur mit großem Aufwand und ausreichend Spezialwissen möglich. Hauptsächlich große Internet-Konzerne wie Google, Apple und Facebook hatten das Geld, die Daten und die Expertise, um ...

Data
Machine Learning
Künstliche Intelligenz

6.7.2020 | 7 Minuten Lesezeit

Marcel Mikl

Nico Axtmann

Machine Learning in der Praxis. Eine Mate mit … Matthias Niehoff #EineMateMit

Machine Learning und künstliche Intelligenz sind aktuell in aller Munde und versprechen vielfältige Einsatzmöglichkeiten im Unternehmen. Trotzdem tun sich viele Unternehmen aktuell noch schwer, das Potential der Technologie zu nutzen. „Der Fokus liegt...

Künstliche Intelligenz
Data
Community
Machine Learning

27.5.2020 | 1 Minuten Lesezeit

Matthias Niehoff

Process Mining mit bupaR

Process Mining schafft Transparenz darüber, was wirklich in Unternehmen geschieht. Im Prozessmanagement werden die Idealvorstellungen eines Prozesses meist langwierig definiert. In der Praxis ist die Qualität dieser Beschreibungen jedoch oft nicht eindeutig...

Open Source
Data
Process Management

5.5.2020 | 9 Minuten Lesezeit

Anna Lukas

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Du stehst vor einer großen IT-Herausforderung? Wir sorgen für eine maßgeschneiderte Unterstützung. Informiere dich jetzt.

Hilf uns, noch besser zu werden.

Wir sind immer auf der Suche nach neuen Talenten. Auch für dich ist die passende Stelle dabei.

Contact

Send

Deploying a recommender system for the movie-lens dataset – Part 1

Introduction

A recommender system for a movie database

Memory-based content filtering

Memory-based collaborative filtering

Model-based collaborative filtering

Mathematical description

Implementation

Conclusion

References

Was this post helpful?

Ja

Blog author

Get in contact

Get in contact

Your job at codecentric?

Agile Developer und Consultant (w/d/m)

View Job

More articles in this subject area

Green Cloud: Daten und Emissionen sparen

Charge your APIs Volume 23: REST vs. gRPC

Eine Einführung in Federated Learning im industriellen Kontext: Fortgeschritten

Eine Einführung in Federated Learning im industriellen Kontext: Grundlagen

Bessere SQL-Datenpipelines mit dbt

Data Governance: Wie können wir Daten demokratisieren?

Streaming Wikipedia mit Apache Kafka

Einführung in die Welt der Tourenoptimierung – Echte Routen und realistischere...

Einführung in die Welt der Tourenoptimierung – Visualisierung und Lösungsverfahren...

Einführung in die Welt der Tourenoptimierung (1/3)

Machine-Learning-Modelle bewerten – Quality Gates etablieren

Schnelles Training eines Recommendation-Modells durch BigQuery ML

KI, Daten und Infrastruktur – ML-Systeme schnell Ende-zu-Ende verproben...

Schnelles KI-Prototyping mit Google Cloud AutoML Vision

The Good, the Bad and the Ugly: Daten effektiv visualisieren und kommunizieren

KI in der Praxis: Fehlerhafte Bauteile mit Rekognition auf AWS identifizieren

KI in der Praxis: Fehlerhafte Bauteile mit AutoML in der Google Cloud ...

KI für KMU: (Teil-)Automatisierung der Qualitätskontrolle von Bauteilen

Machine Learning in der Praxis. Eine Mate mit … Matthias Niehoff #EineMateMit

Process Mining mit bupaR

Gemeinsam bessere Projekte umsetzen.

Wir helfen deinem Unternehmen.

Unsere Leistungen

Hilf uns, noch besser zu werden.

Zu den Jobangeboten