**Singular Value Decomposition in a Movie Recommender System**

Hello everyone! Have you ever wondered how Netflix figures out which movie you want to watch next? It is not just Netflix. Many media companies, such as Disney and HBO, now have their own recommender systems that predict how you would rate a movie that you have not seen before, and those predictions are surprisingly accurate. While the exact details of those recommender systems are the top secrets of the media companies, there are still some fundamental ideas in mathematics that will help us understand how those recommender systems work. One such idea is called the *Singular Value Decomposition* (SVD), and this is how it works: we can represent the movie ratings with a matrix, and the SVD on that matrix reveals the mathematical structure in people’s movie preferences. Using this decomposition, we can build a recommender system that can predict the movie ratings. The words like “matrix” or “decomposition” can sound scary if you are not a big fan of linear algebra, but do not worry! It is surprisingly easy to understand without too much mathematical detail and, at the end of this article, you are going to be amazed by how this seemingly unrelated idea in mathematics can help you decide which movie to watch next.

(If you want more mathematical detail on this idea, there are many textbooks available such as this one: https://mml-book.github.io/. The application of SVD in building a recommender system is a well-known idea.)

**Movie rating data as a matrix**

The first place we need to start is to represent the previous data of movie ratings in the form of a matrix, which will allow us to apply the ideas in linear algebra. Here is an example:

In this matrix, each row corresponds to each user, and each column to each movie. For example, “User 3” has given 4 stars to the movie “SF 2”. In this simple dataset, there are only two genres of movies: science fiction and romantic comedy, and the movies are rather boringly named to indicate their genres. For example, “Rom 2” is a romantic comedy movie. From this data, we can easily see that there are two easily distinguishable groups of users: the sci-fi fans and rom-com fans. For example, our “User 3” looks like a sci-fi fan. He or she has given high ratings to other sci-fi movies, but not to the romantic comedy movies. As highlighted in the matrix, the first four users gave high ratings to the sci-fi movies and low ratings to the rom-com movies. Conversely, the last two users gave high ratings to the rom-com movies and low ratings to the sci-fi movies.

**Singular Value Decomposition**

Now let’s have a quick review on SVD before getting into the details of its application on the movie ratings. The word *decomposition* means that it is expressing a given matrix as a product of other matrices of useful properties. In the case of SVD, the original matrix, *A*, of size *m*-by-*n* is expressed as *A = UΣVᵀ* where *U* and *V* are orthonormal matrices of sizes *m* and *n*, respectively, and *Σ* is a *m*-by-*n* matrix, of which diagonal entries are all non-negative and off-diagonal entries are all zero. Those non-negative diagonal entries are called the *singular values* of the matrix *A*, hence the name *singular value decomposition*.

However, what do all those technical terms mean to our not-so-mathy friends? Let’s visualize SVD so that it makes more sense. Remember, orthonormal means that every column vector of *U* or *V*, is orthogonal to one another and their norms are all equal to 1, so that *uᵢ·uⱼ = 0* for all *i ≠ j*, and *uᵢ·uᵢ = 1* (and of course *vᵢ·vⱼ = 0* for all *i ≠ j*, and *vᵢ·vᵢ = 1*). We can therefore interpret *u₁, u₂, …, uₘ* as the unit-length vectors that are perpendicular to one another in the *m*-dimensional space. In other words, we can imagine them as a new set of axes in the *m*-dimensional space. (In mathematics, they are called the *orthonormal bases*.) Similarly, we can imagine *v₁, v₂, …, vₙ* as a new set of axes in the *n*-dimensional space. Finally, we can imagine that each singular value *σᵢ *is amplifying the vectors in the *vᵢ* axis in the *n*-dimensional space and mapping them into the *uᵢ* axis in the *m*-dimensional space.

In SVD, the singular values are sorted in the descending order, so that *σ₁ ≥ σ₂ ≥ … ≥ σₙ*. This means that a vector in the *v₁ *direction will be amplified the most and mapped into the *u₁* direction. Then a vector in the *v₂ *direction will be amplified the second most and mapped into the *u₂* direction, and so on. This makes the *v₁ *direction the most important direction in the *n*-dimensional space, and the *v₂ *direction the second most, and so on. Similarly, in the *m*-dimensional space, the *u₁* direction is the most important direction and the *u₂* direction the second most, and so on. **Here are the key takeaways**: by using SVD, we can identify **the most important directions** in the *n*-dimensional and *m*-dimensional spaces, and we can also know about **the relative importance of each direction** from the corresponding singular value *σᵢ*.

**SVD on a movie rating matrix**

Now, let’s go back to the movie rating matrix and apply SVD on the example movie data above.

There are a lot going on, but here is the big picture: the movie rating matrix, *A*, maps movies to users, and by looking at the decomposed matrices *U*,* Σ, *and* V*, we can take a close look at the mapping process step by step. Just keep in mind that in our dataset, there are two groups of users, the sci-fi fans and the rom-com fans, and also two types of movies, the sci-fi movies and the rom-com movies. Thus, from the key takeaways of SVD, we expect to see the **sci-fi direction** and the **rom-com direction** in both the movie space and the user space. We also expect to see how the singular values shows the relative importance of each direction. Now let’s take a close look at each matrix.

First, take look at the *V* matrix, and you can see the first and second rows of *Vᵀ, *or the first and second columns of *V* finds the sci-fi movies and the rom-com movies. The first three movies are sci-fi movies, and you can see that the vector *v₁* is in the direction of those movies. Similarly, the last two movies are rom-com movies, and the vector *v₂ *is in the rom-com direction. In other words, when this matrix sees a sci-fi movie, it maps the movie to the sci-fi direction, and a rom-com movie to the rom-com direction. (Don’t worry too much about the plus and minus signs in the matrix and just focus on their absolute values. The plus and minus signs happen kind of randomly from how the computer code calculates the eigenvectors for us.)

Then, the movies in the sci-fi direction are multiplied by the sci-fi importance factor, so the first singular value. Subsequently, the movies in the rom-com direction are multiplied by the rom-com importance factor. We can see that the sci-fi importance is roughly twice larger than the rom-com importance factor, and that makes sense because, in this simple dataset, there are twice more sci-fi fans than there are rom-com fans. We can also see that the first two entries of the singular value matrix *Σ* are so much larger than any other entries, and this makes sense too. In this dataset, the direction that matter are the sci-fi direction and the rom-com direction. Other directions that SVD detects are probably from random variations among the fans and the movies of each genres, and those random variations do not matter in the overall structure of people’s movie preferences. What matters is the two groups of fans of the two genres of movies, and the sizes of the first two singular values tell us just that.

Finally, the *U* matrix receives the movies in the sci-fi and rom-com directions after they get amplified by their respective importance factors and send those movies to the sci-fi and rom-com fans. The first column *u₁ *shows where the sci-fi fans are. We know that the first four users are sci-fi fans, and the column *u₁ *really have large values in the first four entries. (Again, don’t worry about the plus and minus signs as they happen randomly from how the computer code calculates the eigenvectors). Similarly, the second column* u₂ *delivers the rom-com movies to the rom-com fans.

Now here is the summary. The movie ratings matrix maps movies to the users, and the SVD on that matrix tells us how it does that. First, the *V* matrix figures out which ones are the sci-fi movies and which ones are the rom-com movies. We can call *v₁* the sci-fi movie vector and *v₂* the rom-com movie vector. Then the *Σ* matrix multiplies the relative importance to each genres of movies. We see the sci-fi movies are the most important in this dataset, so they have the largest singular value *σ₁*. The rom-com movies are the second most important, so they have the second singular value *σ₂*. Finally, the *U* matrix delivers each genres of movies, multiplied by their importance factors, to the fans of the genres. The users pointed by the *u₁* vector receives the sci-fi movies, and the *u₂* users the rom-com movies.

The nice part about SVD is that **we did not have to manually label the movie genres and group the users by their favorite genres**. SVD automatically picked it up. In reality, there are more variety of movies and they can be clustered not only by their genres but also by other features of the movie: the movies that star a particular actor, the movies that are directed by a particular director, the movies that have happy endings, and so on, but it’s fine. SVD is capable of automatically grouping the movies into the “Tom Hanks movies”, “Steven Spielberg movies”, “happy-ending movies”, and so on, and delivering those movies to the “Tom Hanks fans”, “Steven Spielberg fans”, “those who want happy-ending movies”, and so on.

**Movie Ratings Prediction with SVD**

Now that we saw how SVD decomposes the movies and the users by their distinctive features, such as their genres, main actors, directors, and plots, it’s time to look at how we can use this structure in movie preferences to make predictions. The key point is to reconstruct the movie ratings matrix *A* from the key features that SVD has detected from the matrix. Let’s go back to the toy dataset. The key features are the sci-fi and rom-com genres. This means that we can reconstruct the matrix A with only using the fans of each genre, *u₁* and* u₂*, and the importance factors, *σ₁* and *σ₂*, and the movies of each genre, *v₁* and *v₂*.

In mathematics, this technique is called the *low rank approximation*. It captures the most important dimensions in a given matrix and recreates the matrix with that dimensions only. The resulting matrix is of lower rank, but it still retains the most important information from the original matrix. By applying it to our movie ratings matrix, we can confirm that it really does keep the sci-fi and rom-com structure. Look at how similar our reconstructed matrix *A₂* is to the original matrix *A*.

The benefit of this is that now SVD knows how to give us the prediction in a systematic way: 1. forget about the unimportant features because they only add noises from random variances in the fans and the movies and 2. make a guess based on the most important features. We can think of this matrix *A₂* as the prediction matrix.

We can see the effect of this when we add a new user. Our “User 7”’s data looks like this:

This new user has only rated two out of the five movies, and the missing values are indicated with the question mark “?” in the matrix. However, only with the two ratings, it looks obvious that this user is a sci-fi fan and not a rom-com fan. We only need to check if the prediction matrix *A₂* obtained from SVD will give high ratings for the unrated the sci-fi movies and a low rating for the unrated rom-com movie.

However, before we can obtain the prediction matrix, we need to apply SVD on the matrix *A* in the first place, but we can’t do that just yet, because some values are missing from that matrix. In other words, what do we do with the missing values in the data? This is a big question in constructing recommender systems, and there is no single correct answer. We will circle back on this question later, but for this simple dataset, let’s just fill in the empty ratings with 3 = (5 + 1)/2 as a placeholder, which is the average of all available ratings given by the user.

We can see that the prediction matrix indeed gives high ratings to the sci-fi movies that “User 7” has not seen before and a low rating to the unseen rom-com movie. What about the opposite case where the “User 7” is a potential rom-com fan?

We can repeat the steps above but with a different “User 7” who gave a low rating to a sci-fi movie and a high rating to a rom-com movie, and we see that SVD predicts low ratings to the remaining sci-fi movies and high rating to the remaining rom-com movie for this user. This is consistent with our guess that this user is a rom-com fan.

To sum up, we can use SVD to identify the most important structures in the users’ movie preferences. We can then use the decomposed matrices to construct a prediction matrix by eliminating random noises in the original matrix. This prediction matrix will even make a reasonable guess about the ratings for the movies that the users have not seen yet.

**Practical Challenges and Solutions**

So far, we have discussed how SVD can be used quite effectively in making movie recommendations. However, just like any other fundamental ideas in mathematics, there are some practical issues that we need to address before we can scale it up to a full-blown recommender system. With only six users and five movies, the toy dataset that we have been playing with is simple enough for us to understand, but it is also different from the real data in some ways. First, there are so many more movies and users, so the **movie ratings matrix will be so much larger**. Netflix provides thousands of movies and TV shows for hundreds of millions of users. Applying SVD on a matrix this huge is going to be a challenge in the computational point of view. Therefore, clever choice of algorithms and investment in the computational power of the hardware are needed.

Second, while we could make up the ratings for every movie and every user in the toy dataset, it is no longer going to be the case. In reality, **the vast majority of the values in the movie ratings matrix will be missing**. Even after spending so many hours watching the movies and TV shows on Netflix, the ones that you have watched so far will be of only tiny fraction of Netflix’s entire collection. In mathematics, we call a matrix *sparse* if its entries are predominantly zero. Sparse matrices are annoying to deal with in general, but with the movie ratings matrix, there is another nuance: a zero meaning that the movie was not even worth a single star is different from a zero meaning that the movie is not rated yet. We can try to bypass the problem by forcing the users to give at least one star to the movies, but that still does not solve the fundamental problem that we can’t interpret the number zero as the ratings. We have to come up with a strategy to deal with the missing values.

Then how do we address the issue of huge size and sparsity of the data? Well, it turns out there are many different solutions we can choose from. There are just so many that it is impossible to provide a comprehensive review in a single blog post. There is even a textbook entirely on this topic. One approach is to find the SVD through random iterations that asymptotically converges to the correct value. Even Facebook is interested in answering this very question, and we can also modify SVD process a little bit. This search for an efficient algorithm to calculate SVD, or recommender system in general, is an active area of research. Just think about the fact that many big IT and media companies have enormous incentives to improve their recommender systems, as even the smallest improvement can bring them immense profits.

**Summary**

Singular value decomposition is one of the major mathematical methods deployed in recommender systems. It decomposes the movie ratings matrix to find out the most important factors in people’s movie preference. It can even be used in predicting how a user would rate a movie that he or she has not watched before. When it is implemented in recommender systems, a number of carefully designed algorithms are employed to overcome the challenges presented by the large size and the sparsity of the real-life data. Now when you see movie recommendations on your screen, you know something about how those movies were chosen by the algorithm!