Building a Movie Recommender System in Python

Prashant Jha
8 min readJun 4, 2021

What is a Recommendation System?

As it is clear by the name, Recommender System is a piece of software that recommends the relevant products or services. Recommender systems are everywhere. There can’t be a day when you don’t come through a recommender system on the web.

A recommender system is one of the most basic applications of machine learning. Some real-life examples of recommender system:

  • Amazon recommends products based on the products we checked out in the past.
  • Netflix recommends movies based on our watch history.
  • Google gives us suggestions when we type something in the search box based on our search history.
Netflix similar movie recommendation based on your watch history

Types of Recommender System

There are basically two main components of any recommendation system, Users and Items. Items are the entities that are recommended by the recommender system to the users. Let’s understand by taking some examples:

Netflix recommends movies to the people, hence movies are items and people are users, while Facebook recommends the people you may know to the people, here people are users and people are items too.

There are three types of recommender systems that are mostly used:

  • Popularity Based Recommender System

Popularity based recommender system recommends the most popular items to the users. The most popular items are the items that are used by the most number of users. For example, YouTube trending list recommends the most popular videos of the day.

  • Content-Based Recommender System

Content-based recommender systems recommend similar items used by the user in the past.

For example, Netflix recommends similar movies to the movie we recently watched. In the same way, Youtube also recommends similar videos to the videos in our watch history.

  • Collaborative Filtering based Recommender System

Collaborative Filtering based recommender system creates profiles of users based on the items the user likes. Then it recommends the items liked by a user to the user with a similar profile.

For example, Google creates our profile based on our browsing history and then shows us the relevant ads.

Now we’ll be building a Content-Based Bollywood movie recommender system in the Python programming language. First, we’ll dive deeper into the working of the Content-Based Recommender System.

Content-Based Recommender System Working

Content-Based Recommender System recommends items similar to the items the user likes. How does it decide which item is most similar to the item the user likes? Here we use the similarity scores.

Similarity Score: It is a numerical value that helps determine how much similar two items are to each other on a zero-to-one scale. This similarity score is obtained by measuring the similarity between the features of both of the items. So, the similarity score is the measure of similarity between given features of two items.

Here we’ll use cosine similarity between text details of items. In the example below it is shown how to get cosine similarity:

Step 1: Count the number of unique words in both texts.

Step 2: Count the frequency of each word in each text.

Step 3: Plot it by taking each word as an axis and frequency as a measure.

Step 4: Find the points of both texts and get the value of cosine distance between them.

Example:

Text 1 = “Money All Money”

Text 2 = “All Money All”

Step 1: There are only two unique words between both texts.

‘Money’ and ‘All’

Step 2: Counting the frequency.

Step 3 and Step 4: There are only two unique words, hence we’re making a 2D graph.

Now we need to find the cosine of angle ‘θ’, which is the value of the angle between both vectors. Here is the formula used for doing this:

In our case, cos(θ) = 0.8, Hence the similarity score between Text 1 and Text 2 is 0.8, so we can say that both the texts are 80% similar.

This was all the Math behind cosine similarity, which we are gonna use for building the Content-Based Recommender System. Now let’s see how to make a Content-Based Recommender System in Python.

Building it in Python

We are going to make a Bollywood movie recommender system in Python. Here we are using the Bollywood Movies Dataset from Kaggle. This dataset has the Bollywood movie details from 1920. You can download the dataset from this link.

After downloading the data, unzip it and there will be three .csv files. Check them out. We’ll use ‘bollywood.csv’ in our code.

We’ll be making it in Jupyter Notebook.

Let’s get started :

  • Import pandas and numpy
  • Read the .csv file with pandas.
  • We’ll be using the movies that released after 1990.
  • Here you can see now we have only movies released after 1990 in our data frame ‘df’, but the indexes are starting from 6653. We need to reset the index.
  • Dealing with the null values.
  • In the ‘Genre’ column, some values have two genres separated by a comma in a cell-like “action, comedy”. While calculating similarities, Python will consider it a single string while they are two different genres of the same movie. So we have to replace the comma with whitespace so they will be considered different like “action comedy”.
  • The same action should be performed on the ‘Cast’ column.
  • In the ‘Director’ columns, the name and surname is separated by whitespace. Here name and surname will be considered as two different words and this can lead to false similarity. For example, two directors “Sajid Khan” and “Salim Khan” will be considered 50% similar by Python, while they are two completely different persons. So converting them to “SajidKhan” and “SalimKhan” will help us.
  • Now converting the ‘Title’ columns values to lowercase for searching simplicity.
  • Now we’ll sum up the text of all the features we’re going to use for measuring similarity in one column. But first, we have to convert the data type of the ‘Year’ column from int to string.
  • Now after getting all the features summed up in one column in text form. We can make a similarity matrix. For making a similarity matrix, first, we have to create a count matrix of features.

After importing the libraries, now let’s make a count matrix of features and then a similarity Score matrix. The similarity score matrix contains the cosine similarity of all the movies in the dataset. We can get the similarity score of two movies by their index in the dataset.

Here is a small example:

In the above cosine similarity score matrix :

The similarity score of A and A is 1.0, ie; they are 100% similar.

The similarity score of A and B is 0.7, ie; they are 70% similar.

  • Now let’s name a movie user like. Convert it to the small case, cause we converted all the titles to small case. Then check if this movie exists in our dataset or not.
  • Now, get the index of the movie in the dataset. Then fetch the row on the same index from the similarity score matrix, which has the similarity scores of all the movies to the movie user like.
  • In the above step, we also enumerated the row of similarity score, because for the most similar movies we have to sort this list in descending order, but we don’t want to lose the original index of similar movies, hence enumerating is important.
  • Now, let’s sort the similarity score list on the basis of the similarity score.
  • Now we have the indexes of the most similar movies to the movie user likes in the list. We just need to iterate through the list and store movie names on the indexes in a new list and print it.
  • Yeah! We’ve got some good suggestions here.

Now let’s define a function that will take the movie you like as input and will perform all the steps at once and will print the name of the ten most similar movies to the movies you like.

Thanks for going through the blog and I hope it helped. Here is the GitHub repo where you can find all the code. Also, check out this Hollywood movie recommendation system that I deployed.

--

--