---
title: "MovieLens Recommendation System"
author: "Kaylee Robert Tejeda"
date: "9/13/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction and Overview
One common application of machine learning is the creation of recommendation systems. These are mathematical models that take as input the parameters associated with a particular observation and give as output a prediction for another parameter associated with that observation. The more accurate this prediction is, the more useful and valuable it becomes in certain industries. Such predictions are used to make recommendations to users based on their past behavior and the behavior of similar users under similar circumstances. For example, an online video distribution network might use a recommendation algorithm to create lists of suggested movies based on what a particular user is currently watching.
### Goal Summary
Our goal is to create a recommendation algorithm from a standard data set. The data set, provided by GroupLens, is the ["10M version of the MovieLens dataset"](https://grouplens.org/datasets/movielens/10m/), which yields six variables per observation once the ratings and movie information are joined. We will measure the accuracy of our model by comparing the ratings predicted for a subset of our data to the actual ratings for those observations, and calculate the root mean square error (RMSE) for the final set of predictions. An RMSE below 0.86490 will indicate a successful model for the purposes of this report.
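For the $N$ observations in a test set, with observed ratings $y_{u,i}$ and predicted ratings $\hat{y}_{u,i}$, the RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{u,i}\left(\hat{y}_{u,i} - y_{u,i}\right)^{2}}$$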
### Data
The MovieLens 10M data set contains 10,000,054 ratings of 10,677 movies by 69,878 users, with 797 distinct genre combinations. Each observation in the data set has the following variables associated with it:
* `userId`: Unique user identification number.
* `movieId`: Unique MovieLens movie identification number.
* `rating`: User-provided rating on a 5-point scale, in half-point increments starting at 0.5.
* `timestamp`: Time of the user-submitted review, in epoch time.
* `title`: Movie title, including year of release as identified in [IMDB](http://www.imdb.com).
* `genres`: A pipe-separated list of film genres.
### Preparation
The main data set is split into a working set called `edx` and a second set, called `validation`, containing 10% of the original observations. The `validation` set is used only to check the model once it has been created and plays no part in the model-building process.
```{r data preparation, message=FALSE, echo=FALSE, warning=FALSE}
##########################################################
# Create edx set, validation set (final hold-out test set)
##########################################################
# Note: this process could take a couple of minutes
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(caret)
library(data.table)
# MovieLens 10M dataset:
# https://grouplens.org/datasets/movielens/10m/
# http://files.grouplens.org/datasets/movielens/ml-10m.zip
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
                 col.names = c("userId", "movieId", "rating", "timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(movieId),
                                           title = as.character(title),
                                           genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")
# Validation set will be 10% of MovieLens data
set.seed(1, sample.kind="Rounding") # if using R 3.5 or earlier, use `set.seed(1)`
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]
# Make sure userId and movieId in validation set are also in edx set
validation <- temp %>%
  semi_join(edx, by = "movieId") %>%
  semi_join(edx, by = "userId")
# Add rows removed from validation set back into edx set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, temp, movielens, removed)
```
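As a quick sanity check (shown for reference, not evaluated as part of the analysis), the working set can be summarized directly. Since rows dropped from the validation set are returned to `edx`, every movie and user should appear in the working set:

```{r edx-overview, eval=FALSE}
# Sanity check on the working set (not run):
# edx should hold roughly 90% of the 10,000,054 ratings, and every movie and
# user, because rows removed from the validation set were added back to edx.
edx %>% summarize(ratings = n(),
                  users   = n_distinct(userId),
                  movies  = n_distinct(movieId))
```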
The resulting working set `edx` is further split into a training set and a test set, which are used to evaluate the various methods. The test set comprises 20% of the `edx` working set. No attempt is made to ensure that all of the users and movies in the test set also appear in the training set. This leads to a fortunate insight later, and has therefore been kept as part of the discovery process.
## Analysis
### Outline of Steps Taken
1) Baseline Naive-RMSE.
2) Movie Effect Model.
3) Notice `NA` values in the predictions and adjust for them (this leads to an insight worth mentioning).
4) Movie and User Effects Model with a fixed penalization factor lambda.
5) Penalized Movie & User Effect Model with lambda optimized to the hundredths place (overkill, but interesting).
6) Simplified Penalized Effect method with an integer lambda and manual outlier pruning.
7) Simplified Penalized Effect method with an integer lambda and statistical outlier pruning.
8) With the RMSE now in an acceptable range, test against the validation set.
### Exploration
#### Baseline Naive-RMSE
To establish a baseline, a first attempt is made by calculating the arithmetic mean of all movie ratings in the training set and using it as the prediction for every movie rating. This is not a very good approach, but it gives us a worst-case error, of sorts.
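In other words, the same prediction $\hat{\mu}$ is used for every user-movie pair, where the sum runs over all ratings in the training set:

$$\hat{y}_{u,i} = \hat{\mu} = \frac{1}{N}\sum_{u,i} y_{u,i}$$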
```{r exploratory-data-analysis, message=FALSE, warning=FALSE, echo=FALSE}
# Split the edx data set into a training set and a test set
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.2, list = FALSE)
train_set <- edx[-test_index,]
test_set <- edx[test_index,]
# Loss function that computes the RMSE for vectors of ratings and their corresponding predictors:
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
# Find the average of all ratings in the training set
mu_hat <- mean(train_set$rating)
# First attempt: Naive RMSE
naive_rmse <- RMSE(test_set$rating, mu_hat)
# Keep track of RMSE results in a tibble
method_string <- paste("Naive method using average rating of", as.character(mu_hat), sep=" ")
rmse_results <- tibble(method = method_string, RMSE = naive_rmse)
#rmse_results
```
#### Movie Effect Model
The next refinement to our approach is to account for a movie bias effect, since some movies naturally receive higher ratings than others. This is when `NA` values start showing up in the predicted ratings. To get past this error for now, the option `na.rm=TRUE` is added to the `mean()` call inside the loss function. This effectively throws out any test-set observation whose movie (or, later, user) does not appear in the training set. This is only a temporary fix and merits a closer look.
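Formally, the model adds a per-movie term $b_i$ to the overall mean, estimated as the average residual over the $n_i$ ratings of each movie:

$$y_{u,i} = \mu + b_i + \varepsilon_{u,i}, \qquad \hat{b}_i = \frac{1}{n_i}\sum_{u=1}^{n_i}\left(y_{u,i} - \hat{\mu}\right)$$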
```{r Notice NAs, echo=FALSE}
# Factor in movie bias the quick way
movie_avgs <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu_hat))
# Calculate RMSE after accounting for movie bias
predicted_ratings <- mu_hat + test_set %>%
  left_join(movie_avgs, by='movieId') %>%
  pull(b_i)
bi_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
                          tibble(method="Movie Effect Model",
                                 RMSE = bi_rmse))
# How many NA values do we need to get rid of?
nas <- sum(is.na(predicted_ratings))
paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
# Plot movie and user histograms to get an idea of what is going on
edx %>% group_by(userId) %>% summarize(reviews = n()) %>%
  ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() +
  ggtitle("Number of reviews per user") + xlab("Number of reviews (log scale)") +
  ylab("Count of users")
edx %>% group_by(movieId) %>% summarize(reviews = n()) %>%
  ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() +
  ggtitle("Number of reviews per movie") + xlab("Number of reviews (log scale)") +
  ylab("Count of movies")
```
Not very many `NA` values are being generated. They most likely come from movies and/or users that appear in the test set but not in the training set. This was avoided in the original `edx`/`validation` split by using `semi_join`. Here, instead, we attempt to remove any movies or users with extremely low rating counts. The assumption is that movies with only a few ratings affect the mean, and therefore the error, without contributing much to the overall effect; the same can be said for users who have rated only a few movies. Removing these low-frequency observations prevents `NA` values in the predicted ratings and allows the RMSEs to be calculated. A more rigorous approach still needs to be formulated, as this is only a temporary workaround.
```{r remove NAs, echo=FALSE, message=FALSE}
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2, na.rm=TRUE))
}
# Calculate movie effect method after throwing out NAs from predicted ratings
predicted_ratings <- mu_hat + test_set %>%
  left_join(movie_avgs, by='movieId') %>%
  pull(b_i)
bi_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
                          tibble(method="Movie Effect Model, na.rm=TRUE",
                                 RMSE = bi_rmse))
#rmse_results
```
#### Movie and User Effect Model
Just as with the Movie Effect model, user bias can also be accounted for: the naive average misses a per-user effect as well as a per-movie one, since certain users tend to rate higher or lower than the average user.
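The model now includes a per-user term $b_u$, estimated from the residuals left after removing the overall mean and the movie effect:

$$y_{u,i} = \mu + b_i + b_u + \varepsilon_{u,i}, \qquad \hat{b}_u = \frac{1}{n_u}\sum_{i=1}^{n_u}\left(y_{u,i} - \hat{\mu} - \hat{b}_i\right)$$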
```{r more effects, echo=FALSE, message=FALSE}
# Now account for user bias as well, quick method:
user_avgs <- train_set %>%
  left_join(movie_avgs, by='movieId') %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu_hat - b_i))
# Check the new RMSE
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by='movieId') %>%
  left_join(user_avgs, by='userId') %>%
  mutate(pred = mu_hat + b_i + b_u) %>%
  pull(pred)
bi_bu_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
                          tibble(method="Movie and User Effects Model",
                                 RMSE = bi_bu_rmse))
#rmse_results
```
#### Penalized Movie and User Effect Model
Now that we have a model that performs better than the naive approach we began with, we can regularize the effects by penalizing them with a parameter we will call lambda. We start by penalizing both the movie and user effects, with lambda arbitrarily set to 3.
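Regularization shrinks the estimates for movies and users with few ratings by adding $\lambda$ to the denominator, so that small sample sizes pull the effect toward zero:

$$\hat{b}_i(\lambda) = \frac{1}{n_i + \lambda}\sum_{u=1}^{n_i}\left(y_{u,i} - \hat{\mu}\right), \qquad \hat{b}_u(\lambda) = \frac{1}{n_u + \lambda}\sum_{i=1}^{n_u}\left(y_{u,i} - \hat{\mu} - \hat{b}_i\right)$$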
```{r penalized approach, echo=FALSE, message=FALSE}
# Now try a Penalized Movie and User Effects estimate with lambda = 3
lambda <- 3
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+lambda))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+lambda))
# Check RMSE
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
lambda_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
tibble(method="Penalized Movie Effect Model, lambda=3",
RMSE = lambda_rmse))
#rmse_results
```
#### Step-Wise Optimization of Lambda to Two Decimal Places
A step-wise grid search can be used to quickly optimize the penalization factor lambda to two decimal places of precision. First, lambda is optimized to the nearest integer, then to the nearest tenth, and finally to the nearest hundredth. This takes 52 evaluations (10 + 21 + 21), instead of the 901 that would be needed to check every value from 1 to 10 in 0.01 increments.
```{r optimized penalized approach, echo=FALSE, message=FALSE}
# Now optimize lambda using a step-wise grid search
# Optimize lambda to the nearest integer between 1 and 10
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambda_best <- lambdas[which.min(RMSEs)]
# Optimize lambda to the tenths place
lambdas <- seq(lambda_best - 1 , lambda_best + 1 , 0.1)
RMSEs <- sapply(lambdas, function(l){
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambda_best <- lambdas[which.min(RMSEs)]
# Optimize lambda to the hundredths place
lambdas <- seq(lambda_best - 0.1 , lambda_best + 0.1 , 0.01)
RMSEs <- sapply(lambdas, function(l){
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambdas_rmse <- min(RMSEs)
lambda_best <- lambdas[which.min(RMSEs)]
final_method <- paste("Optimized Penalized Movie & User Effect Model, lambda =", as.character(lambda_best), sep=" ")
rmse_results <- bind_rows(rmse_results, tibble(method=final_method, RMSE = lambdas_rmse))
# Optimize lambda to the integer level only
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
b_i <-
train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <-
train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <-
test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu_hat + b_i + b_u) %>%
pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambdas_rmse <- min(RMSEs)
lambda_best <- lambdas[which.min(RMSEs)]
final_method <- paste("OPM&UE Model, integer lambda =", as.character(lambda_best), sep=" ")
rmse_results <- bind_rows(rmse_results, tibble(method=final_method, RMSE = lambdas_rmse))
```
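The three nearly identical searches above could be collapsed into a single reusable helper. The sketch below (with a hypothetical function name `tune_lambda`, not evaluated here) shows one way to express the step-wise refinement:

```{r tune-lambda-sketch, eval=FALSE}
# Sketch only: evaluate the penalized model over a grid of lambdas and
# return the grid value with the lowest test-set RMSE.
tune_lambda <- function(lambdas, train_set, test_set, mu_hat){
  rmses <- sapply(lambdas, function(l){
    b_i <- train_set %>%
      group_by(movieId) %>%
      summarize(b_i = sum(rating - mu_hat)/(n() + l))
    b_u <- train_set %>%
      left_join(b_i, by = "movieId") %>%
      group_by(userId) %>%
      summarize(b_u = sum(rating - b_i - mu_hat)/(n() + l))
    preds <- test_set %>%
      left_join(b_i, by = "movieId") %>%
      left_join(b_u, by = "userId") %>%
      mutate(pred = mu_hat + b_i + b_u) %>%
      pull(pred)
    RMSE(preds, test_set$rating)
  })
  lambdas[which.min(rmses)]
}

# Step-wise refinement: integers, then tenths, then hundredths
l1 <- tune_lambda(seq(1, 10, 1), train_set, test_set, mu_hat)
l2 <- tune_lambda(seq(l1 - 1, l1 + 1, 0.1), train_set, test_set, mu_hat)
l3 <- tune_lambda(seq(l2 - 0.1, l2 + 0.1, 0.01), train_set, test_set, mu_hat)
```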
### Insights
* `NAs in predicted_ratings`: A failure to use the proper join function inadvertently led to `NA` values in the predictions, which prevent the proper calculation of an RMSE. Attempts to rectify this led to an investigation of outliers that helps reduce the overall RMSE.
* `lambda can be refined step-wise`: Lambda can be optimized to two decimal places using 52 evaluations instead of 901. This ultimately proved unnecessary, but is interesting: the final RMSE does not depend greatly on lambda beyond integer precision. Optimizing lambda to the nearest integer seems sufficient and requires roughly a fifth of the computation.
### Final Modeling Approach
Ultimately, what helps more than fine-tuning lambda is to throw away the outliers, which theoretically add more noise to the mean than they contribute to the predictive power of the algorithm. Hard-coded cutoffs are tested first: any movie with four or fewer ratings and any user who has reviewed twenty or fewer movies are removed.
```{r hard-coded outlier filter, message=FALSE, echo=FALSE, warning=FALSE}
edx2 <- edx %>% group_by(movieId) %>% filter(n()>4) %>% ungroup()
edx2 <- edx2 %>% group_by(userId) %>% filter(n()>20) %>% ungroup()
test_index <- createDataPartition(y = edx2$rating, times = 1, p = 0.2, list = FALSE)
train_set <- edx2[-test_index,]
test_set <- edx2[test_index,]
# Recompute the training-set mean on the pruned data
mu_hat <- mean(train_set$rating)
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
b_i <-
train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <-
train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <-
test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu_hat + b_i + b_u) %>%
pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambdas_rmse <- min(RMSEs)
lambda_best <- lambdas[which.min(RMSEs)]
final_method <- paste("Hard-Coded Pruned OPM&UE Model, lambda =", as.character(lambda_best), sep=" ")
rmse_results <- bind_rows(rmse_results, tibble(method=final_method, RMSE = lambdas_rmse))
```
```{r rmse results_2, echo=FALSE}
print.data.frame(rmse_results)
```
The slight reduction in RMSE is a pleasant surprise. The motivation for pruning out the movies and users with the fewest ratings was to eliminate `NA` values from the predictions, which arose when a movie or user appeared in the test set but not in the training set. The assumption was that if every movie had been rated at least five times, and every user had rated at least twenty-one movies, the chance of either being absent from the training set would be much lower. It was realized after the fact that this can be controlled directly during partition creation by using `semi_join`, as was done with the original data set. Nonetheless, this attempt at removing the `NA` values produces an unexpected reduction in RMSE and merits further investigation.
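For completeness, a sketch of how `semi_join` could be applied when splitting `edx` (variable names are illustrative, and the code is not run as part of this analysis):

```{r semi-join-sketch, eval=FALSE}
# Keep only test rows whose movie and user also appear in the training set,
# then return the dropped rows to the training set (mirroring the way the
# original edx/validation split was built).
temp <- edx[test_index,]
train_set <- edx[-test_index,]
test_set <- temp %>%
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")
train_set <- bind_rows(train_set, anti_join(temp, test_set))
```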
Rather than hard-code arbitrary values, a statistical approach can be used that ignores the lowest tenth percentile of both movies and users, in terms of the number of ratings associated with each unique value.
Ten percent seems like a reasonable starting point, although it is itself an arbitrary cutoff, not grounded in any deeper investigation or theory, and should be refined in the future.
## Results
The fine tuning of lambda beyond the integer level does not seem to provide enough reduction in error to justify the increased cost in execution time. Instead, the statistical approach to pruning lower outliers seems to provide the needed optimization. A simple cutoff of the 10th percentile and integer lambda yields the following:
```{r final model, message=FALSE, echo=FALSE, warning=FALSE}
# Calculate 10th percentile of movieId by rating
movieId_cutoff <-
edx %>%
group_by(movieId) %>%
summarize(n = n()) %>%
.$n %>%
quantile(.10)
paste("Movies below the 10th percentile with fewer than",
as.character(movieId_cutoff),
"ratings will be ignored.",
sep=" ")
#Calculate 10th percentile of userId by rating
userId_cutoff <-
edx %>%
group_by(userId) %>%
summarize(n = n()) %>%
.$n %>%
quantile(.10)
paste("Users below the 10th percentile with fewer than",
as.character(userId_cutoff),
"ratings will be ignored.",
sep=" ")
# Remove any movie below the 10th percentile in terms of ratings
edx2 <-
edx %>%
group_by(movieId) %>%
filter(n()>=movieId_cutoff) %>%
ungroup()
# Remove any user below the 10th percentile in terms of ratings
edx2 <-
edx2 %>%
group_by(userId) %>%
filter(n()>=userId_cutoff) %>%
ungroup()
# Create partition, reserving 20% of edx set for testing purposes
test_index <-
createDataPartition(y = edx2$rating,
times = 1, p = 0.2, list = FALSE)
train_set <- edx2[-test_index,]
test_set <- edx2[test_index,]
# Define loss function, throwing away any NA values that result from
# movies or users that appear in the test set but not the training set
RMSE <-
function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2, na.rm=TRUE))}
# Calculate the average rating for the training set.
mu_hat <- mean(train_set$rating)
paste("Penalized User & Movie Effect method using a mean rating of",
as.character(mu_hat),
".",
sep=" ")
# Using Movie and User Effects with Penalization on both, optimize lambda to the nearest integer.
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
b_i <-
train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <-
train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <-
test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu_hat + b_i + b_u) %>%
pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambda_best <- lambdas[which.min(RMSEs)]
paste("Optimized lambda =",
as.character(lambda_best),
"gives an RMSE of",
as.character(min(RMSEs)),
"on the test set.",
sep=" ")
```
### Performance
We can measure the performance of this final algorithm against the `validation` set (for the first time):
```{r validation set, echo=FALSE}
# Use this optimized lambda value to test against the validation set.
paste("Now testing this optimized lambda against the vaildation set.")
l <- lambda_best
b_i <- train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
final_predicted_ratings <-
validation %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu_hat + b_i + b_u) %>%
pull(pred)
final_rmse <-
RMSE(final_predicted_ratings, validation$rating)
paste("Optimized lambda =",
as.character(lambda_best),
"gives an RMSE of",
as.character(final_rmse),
"on the validation set.",
sep=" ")
```
## Conclusion
Our penalized movie and user effect model gives an error in the desired range when lower outlier pruning is applied at the 10th percentile of ratings for both users and movie titles.
### Summary
We began with linear estimates accounting for both movie and user effects, plus a penalization term for each. This got us close to our goal, and further exploration was warranted. The original mistake of not using `semi_join` when generating the training and test sets led to an investigation of low-count outliers.
Our model is based on the mean rating for each movie across users, which is NOT resistant to outliers. Hence, users and movies with few ratings add noise to the mean without adding much weight to the predictions. An attempt to remove these (to eliminate `NA` values in the final predictions) inadvertently led to a reduction in RMSE, which is what makes this method powerful.
### Limitations
This approach seems to depend on a large data set. With a smaller data set, eliminating everything below a given percentile might discard too much information and hurt the model; this needs to be explored.
### Future Work
Optimizing the percentile removed from each predictor (`movieId` and `userId`) would be interesting, as it is unlikely that the 10th percentile is always the ideal cutoff. It would be preferable to keep as much of the original data set as possible while still training a solid model.
Many predictors were not considered, such as timestamps or genres, and no analysis of factor importance or influence was performed. These could be included in the model to decrease the overall error further if needed.
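As an illustration only, a regularized genre effect $b_g$ could be layered on top of the existing movie and user effects in the same spirit. The sketch below assumes the pipe-separated `genres` string is treated as a single categorical predictor, and is not evaluated here:

```{r genre-effect-sketch, eval=FALSE}
# Sketch only: add a regularized genre effect on top of b_i and b_u,
# treating each pipe-separated genre combination as one category.
b_g <- train_set %>%
  left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  group_by(genres) %>%
  summarize(b_g = sum(rating - mu_hat - b_i - b_u)/(n() + lambda_best))

predicted_ratings <- test_set %>%
  left_join(b_i, by = "movieId") %>%
  left_join(b_u, by = "userId") %>%
  left_join(b_g, by = "genres") %>%
  mutate(pred = mu_hat + b_i + b_u + b_g) %>%
  pull(pred)
```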