Initial Commit

shelldweller 2021-09-13 01:14:03 -06:00
parent 659de1e6d7
commit c12e8a644f
2 changed files with 233 additions and 318 deletions


---
title: "MovieLens Recommendation System (Rough Draft)"
title: "MovieLens Recommendation System"
author: "Kaylee Robert Tejeda"
date: "9/11/2021"
date: "9/13/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction and Overview
(an introduction/overview/executive summary section that describes the dataset and summarizes the goal of the project and key steps that were performed)
One common application of machine learning techniques is the creation of what are known as "recommendation systems". These are mathematical models that take as input the parameters associated with a particular observation and predict the value of a specific unobserved variable generated by that observation. The more accurate this prediction is, the more useful and valuable it is to specific industries, which use such predictions to make recommendations to users based on past behavior and on the behavior of other similar users under similar circumstances. For example, a video distribution service might use a recommendation algorithm of this kind to suggest movies based on what a particular user is currently watching.
### Goal Summary
Our goal is to create such a recommendation algorithm. Our data set of movie ratings is provided by GroupLens and is known as the ["10M version of the MovieLens dataset"](https://grouplens.org/datasets/movielens/10m/); it contains 5 base variables per observation. We will measure the accuracy of our model by comparing the ratings predicted for a subset of our data to the actual ratings for those observations, and calculating the root mean square error (RMSE) for the final set of predictions. An RMSE less than 0.86490 will indicate a successful model for the purposes of this report.
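For reference, the RMSE over $N$ rating predictions is

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{u,i}\left(\hat{y}_{u,i} - y_{u,i}\right)^{2}},$$

where $y_{u,i}$ is the rating user $u$ gave to movie $i$ and $\hat{y}_{u,i}$ is the corresponding predicted rating.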
### Data
The MovieLens data set contains 10,000,054 rows, 10,677 movies, 797 genres, and 69,878 users. Each observation in the data set has the following variables associated with it:
"userId", "movieId", "rating", "timestamp", "title", "genres"
* `userId`: Unique user ID number.
* `movieId`: Unique movielens movie ID number.
* `rating`: User-provided ratings on a 5-star scale with half-star increments starting from 0.5
* `userId`: Unique user identification number.
* `movieId`: Unique movielens movie identification number.
* `rating`: User-provided ratings on a 5-star scale with half-star increments starting at 0.5.
* `timestamp`: Time of user-submitted review in epoch time.
* `title`: Movie titles including year of release as identified in [IMDB](http://www.imdb.com).
* `genres`: A pipe-separated list of film genres.
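As an illustration of these fields, the raw `timestamp` and `genres` values can be converted into more convenient forms with standard tidyverse tools. The snippet below is only a minimal sketch and is not part of the analysis; it assumes the `edx` set created in the next section and the lubridate package are available.

```{r variable-illustration, eval=FALSE}
library(tidyverse)
library(lubridate)

edx %>%
  mutate(review_date = as_datetime(timestamp)) %>%  # epoch seconds -> date-time
  separate_rows(genres, sep = "\\|") %>%            # one row per listed genre
  select(userId, movieId, rating, review_date, title, genres) %>%
  head()
```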
## Analysis
(a methods/analysis section that explains the process and techniques used, including data cleaning, data exploration and visualization, insights gained, and your modeling approach)
### Preparation
The main data set is split into a working set called "edx", containing about 90% of the original data (9,000,055 ratings), and a second set called "validation", containing the remaining 10% (999,999 ratings). The validation set is used only to check the model once it has been created, and plays no part in the model formation process.
```{r data preparation, message=FALSE, echo=FALSE, warning=FALSE}
##########################################################
# Create edx set, validation set (final hold-out test set)
if(!require(data.table)) install.packages("data.table", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(caret)
library(data.table)
if(!require(tictoc)) install.packages("tictoc", repos = "http://cran.us.r-project.org")
library(tictoc)
# MovieLens 10M dataset:
# https://grouplens.org/datasets/movielens/10m/
# http://files.grouplens.org/datasets/movielens/ml-10m.zip
tic("setup")
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, temp, movielens, removed)
toc()
```
The resulting working set "edx" is further split into a training set and a test set, which will be used to evaluate the various methods. The test set comprises 20% of the edx working set. No attempt was made to ensure that all the users and movies in the test set also appear in the training set; this leads to a fortunate discovery later.
### Outline of Steps Taken
1) Baseline naive RMSE.
2) Movie Effect Model.
3) Notice NAs in the predictions and adjust for them (this leads to an insight worth mentioning).
4) Movie and User Effects Model with manual outlier pruning.
5) Penalized Movie and User Effects Model with lambda optimized to the hundredths place (overkill, but interesting).
6) Simplified penalized-effects method with an integer lambda and statistical outlier pruning.
7) Once the RMSE is in range, test against the validation set.
### Exploration
#### Baseline Naive-RMSE
Just to get a baseline, a first attempt is made by calculating the arithmetic mean of all movie ratings in the training set and using that value as the prediction for every rating. This is not a very good approach, but it gives us a worst-case error, of sorts.
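In other words, the same value $\hat{\mu}$ is predicted for every user $u$ and movie $i$:

$$\hat{y}_{u,i} = \hat{\mu}.$$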
```{r exploratory-data-analysis, message=FALSE, warning=FALSE, echo=FALSE}
# Split the edx data set into a training set and a test set
train_set <- edx[-test_index,]
test_set <- edx[test_index,]
# Loss function that computes the RMSE for vectors of ratings and their corresponding predictions:
RMSE <- function(true_ratings, predicted_ratings){
  sqrt(mean((true_ratings - predicted_ratings)^2))
}
naive_rmse <- RMSE(test_set$rating, mu_hat)
# Keep track of RMSE results in a tibble
rmse_results <- tibble(method = "Just the average", RMSE = naive_rmse)
method_string <- paste("Naive method using average rating of", as.character(mu_hat), sep=" ")
rmse_results <- tibble(method = method_string, RMSE = naive_rmse)
rmse_results
```
#### Movie Effect Model
The next approach was to account for a movie bias effect, since some movies naturally receive higher ratings than others. This is when "NA" values started showing up in the predicted ratings. To get past the error for now, the option `na.rm=TRUE` was added to the `mean()` call inside the loss function. This effectively throws out any test-set observation whose user or movie does not appear in the training set. This was only a temporary fix and merited a closer look.
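In equation form, the movie effect model predicts

$$\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i, \qquad \hat{b}_i = \frac{1}{n_i}\sum_{u}\left(y_{u,i} - \hat{\mu}\right),$$

where $n_i$ is the number of ratings movie $i$ received in the training set.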
```{r Notice NAs, echo=FALSE}
# Factor in movie bias the quick way
movie_avgs <- train_set %>%
rmse_results <- bind_rows(rmse_results,
tibble(method="Movie Effect Model",
RMSE = bi_rmse))
# How many are we throwing out?
nas <- sum(is.na(predicted_ratings))
paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
edx %>% group_by(userId) %>% summarize(reviews = n()) %>%
  ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() +
  ggtitle("Number of reviews per user") + xlab("Number of reviews (log scale)") + ylab("Count of users")

edx %>% group_by(movieId) %>% summarize(reviews = n()) %>%
  ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() +
  ggtitle("Number of reviews per movie") + xlab("Number of reviews (log scale)") + ylab("Count of movies")
```
Not very many NA values are being generated. The assumption is that movies with only a few ratings affect the mean, and therefore the error, without contributing much to the overall effect. The same could be said for users who have rated only a few movies. Removing these NA values allowed the RMSEs to be calculated; however, a more rigorous approach needs to be formulated, as this is only a temporary workaround.
```{r remove NAs, echo=FALSE, message=FALSE}
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2, na.rm=TRUE))
}
# Calculate movie effect method after throwing out NAs from predicted ratings
predicted_ratings <- mu_hat + test_set %>%
left_join(movie_avgs, by='movieId') %>%
pull(b_i)
bi_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
tibble(method="Movie Effect Model, na.rm=TRUE",
RMSE = bi_rmse))
#rmse_results
```
#### Movie and User Effect Model
Similar to the Movie Effect model, user bias can also be accounted for.
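In equation form, a per-user term $\hat{b}_u$ is added on top of the movie effect:

$$\hat{y}_{u,i} = \hat{\mu} + \hat{b}_i + \hat{b}_u, \qquad \hat{b}_u = \frac{1}{n_u}\sum_{i}\left(y_{u,i} - \hat{\mu} - \hat{b}_i\right),$$

where $n_u$ is the number of ratings submitted by user $u$ in the training set.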
```{r more effects, echo=FALSE, message=FALSE}
# Now account for user bias as well, quick method:
user_avgs <- train_set %>%
rmse_results <- bind_rows(rmse_results,
tibble(method="Movie and User Effects Model",
RMSE = bi_bu_rmse))
#rmse_results
```
#### Penalized Movie and User Effect Model
Now that we have developed a model that performs better than the Naive RMSE approach that we began with, we can start to regularize the effects by penalizing with a parameter we will call lambda. We start by penalizing both the Movie and User effects with the penalization factor lambda arbitrarily fixed at 3.
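Concretely, the penalized estimates divide by $n + \lambda$ instead of $n$, which shrinks the effect of movies and users with few ratings toward zero:

$$\hat{b}_i(\lambda) = \frac{1}{n_i + \lambda}\sum_{u}\left(y_{u,i} - \hat{\mu}\right), \qquad
\hat{b}_u(\lambda) = \frac{1}{n_u + \lambda}\sum_{i}\left(y_{u,i} - \hat{\mu} - \hat{b}_i(\lambda)\right).$$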
```{r penalized approach, echo=FALSE, message=FALSE}
# Now try a Penalized Movie and User Effects estimate with lambda = 3
lambda <- 3
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+lambda))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+lambda))
# Check RMSE
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
lambda_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
tibble(method="Penalized LSE Model on movie bias, lambda=5",
tibble(method="Penalized Movie Effect Model, lambda=3",
RMSE = lambda_rmse))
#rmse_results
```
#### Step-Wise Cross-Validation with Lambda optimized to two decimal places.
Iterated cross-validation can be used to quickly optimize the penalization factor lambda to two decimal places of precision. First, lambda is optimized to the nearest integer, then it is optimized again to the nearest tenth, and then again to the nearest hundredth.
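The three nearly identical search loops in the next code chunk could also be factored into a small helper that is called once per precision level. The following is a minimal sketch of that idea (the `rmse_for_lambda` and `tune_lambda` helpers are illustrative and are not used in the actual analysis); it assumes `train_set`, `test_set`, `mu_hat`, and `RMSE` are already defined as above.

```{r lambda-helper-sketch, eval=FALSE}
# Illustrative only: RMSE on the test set for a single penalization value l
rmse_for_lambda <- function(l){
  b_i <- train_set %>% group_by(movieId) %>%
    summarize(b_i = sum(rating - mu_hat)/(n()+l))
  b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
  predicted_ratings <- test_set %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu_hat + b_i + b_u) %>%
    pull(pred)
  RMSE(predicted_ratings, test_set$rating)
}

# Search a grid of lambdas and return the one with the smallest RMSE
tune_lambda <- function(lambdas){
  lambdas[which.min(sapply(lambdas, rmse_for_lambda))]
}

best <- tune_lambda(seq(1, 10, 1))                       # nearest integer
best <- tune_lambda(seq(best - 1,   best + 1,   0.1))    # nearest tenth
best <- tune_lambda(seq(best - 0.1, best + 0.1, 0.01))   # nearest hundredth
```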
```{r optimized penalized approach, echo=FALSE, message=FALSE}
# Now optimize lambda using step-wise cross-validation
# Calculate RMSEs for each lambda
# Use penalized least squares on both user and movie bias
# (good method to try from experience/book lecture)
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambda_best <- lambdas[which.min(RMSEs)]
lambdas <- seq(lambda_best - 1 , lambda_best + 1 , 0.1)
RMSEs <- sapply(lambdas, function(l){
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambda_best <- lambdas[which.min(RMSEs)]
lambdas <- seq(lambda_best - 0.1 , lambda_best + 0.1 , 0.01)
RMSEs <- sapply(lambdas, function(l){
b_i <- train_set %>% group_by(movieId) %>% summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <- train_set %>% left_join(b_i, by="movieId") %>% group_by(userId) %>% summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <- test_set %>% left_join(b_i, by = "movieId") %>% left_join(b_u, by = "userId") %>% mutate(pred = mu_hat + b_i + b_u) %>% pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambdas_rmse <- min(RMSEs)
lambda_best <- lambdas[which.min(RMSEs)]
final_method <- paste("Optimized Penalized Movie & User Effect Model, lambda =", as.character(lambda_best), sep=" ")
rmse_results <- bind_rows(rmse_results, tibble(method=final_method, RMSE = lambdas_rmse))
# Optimize lambda to the integer level only
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
  b_i <-                # penalized movie bias from the training set
    train_set %>%
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu_hat)/(n()+l))
  b_u <-                # penalized user bias from the training set
    train_set %>%
    left_join(b_i, by="movieId") %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
  predicted_ratings <-  # predictions on the test set
    test_set %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu_hat + b_i + b_u) %>%
    pull(pred)
  return(RMSE(predicted_ratings, test_set$rating))
})
# Best RMSE is the smallest calculated, we want to minimize error
lambdas_rmse <- min(RMSEs)
# Best lambda is the one that generated the smallest error
lambda_best <- lambdas[which.min(RMSEs)]
final_method <- paste("OPM&UE Model, integer lambda =", as.character(lambda_best), sep=" ")
rmse_results <- bind_rows(rmse_results, tibble(method=final_method, RMSE = lambdas_rmse))
```
The final RMSE does not seem to depend greatly on lambda beyond integer precision. Optimizing lambda to the nearest integer seems sufficient.
```{r rmse results, echo=FALSE}
print.data.frame(rmse_results)
```
### Insights
* `NAs in predicted_ratings`: I was getting NAs in the predicted_ratings output. This led to an investigation of outliers that helped reduce the overall RMSE.
* `Lambda can be refined step-wise`: Lambda can be optimized to two decimal places using 30 tests instead of 100. This ultimately proved to be worthless, but interesting.
### Final Modeling Approach
Ultimately, what helped more than fine-tuning lambda was to throw away the outliers that were adding more noise than they were worth in the overall weighted model. I experimented with a few hard-coded values, starting by removing any movie with four or fewer ratings and any user who had reviewed twenty or fewer movies.
```{r hard-coded outlier filter, message=FALSE, echo=FALSE, warning=FALSE}
edx2 <- edx %>% group_by(movieId) %>% filter(n()>4) %>% ungroup()
edx2 <- edx2 %>% group_by(userId) %>% filter(n()>20) %>% ungroup()
test_index <- createDataPartition(y = edx2$rating, times = 1, p = 0.2, list = FALSE)
train_set <- edx2[-test_index,]
test_set <- edx2[test_index,]
lambdas <- seq(1, 10, 1)
RMSEs <- sapply(lambdas, function(l){
b_i <-
train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu_hat)/(n()+l))
b_u <-
train_set %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu_hat)/(n()+l))
predicted_ratings <-
test_set %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu_hat + b_i + b_u) %>%
pull(pred)
return(RMSE(predicted_ratings, test_set$rating))
})
lambdas_rmse <- min(RMSEs)
lambda_best <- lambdas[which.min(RMSEs)]
final_method <- paste("Hard-Coded Pruned OPM&UE Model, lambda =", as.character(lambda_best), sep=" ")
rmse_results <- bind_rows(rmse_results, tibble(method=final_method, RMSE = lambdas_rmse))
```
```{r rmse results_2, echo=FALSE}
print.data.frame(rmse_results)
```
The slight reduction in RMSE was a surprise. The reason for pruning out the movies and users with the lowest numbers of ratings was to no longer need to remove NA values from the predictions, since those arose when a movie or user with a low count wound up in the test set but not the training set. It was assumed that if every movie had been rated more than four times, and every user had rated more than twenty movies, then the chance of either being absent from the training set would be much lower. It was realized after the fact that this can be controlled during the partition creation by using semi_join, as was done with the original data set. Nonetheless, this attempt at removing the NA values resulted in an unexpected reduction of the RMSE, which merited further investigation.
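For reference, a sketch of that semi_join guard applied to the edx train/test split is shown below; it is illustrative only and was not used in this analysis.

```{r semi-join-sketch, eval=FALSE}
# Illustrative only: keep test rows whose movie and user also appear in the
# training set, then return the dropped rows to the training set.
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.2, list = FALSE)
train_set  <- edx[-test_index,]
temp       <- edx[test_index,]

test_set <- temp %>%
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")

train_set <- rbind(train_set, anti_join(temp, test_set))
rm(temp, test_index)
```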
Rather than hard-code arbitrary values, a statistical approach based on ignoring the lowest 10th percentile of both movies and users (in terms of numbers of ratings associated with each unique value) was developed.
```{r percentile-based outlier filter}
movieId_cutoff <- edx %>% group_by(movieId) %>% summarize(n = n()) %>% .$n %>% quantile(.10)
userId_cutoff <- edx %>% group_by(userId) %>% summarize(n = n()) %>% .$n %>% quantile(.10)
edx2 <- edx %>% group_by(movieId) %>% filter(n()>movieId_cutoff) %>% ungroup()
edx2 <- edx2 %>% group_by(userId) %>% filter(n()>userId_cutoff) %>% ungroup()
test_index <- createDataPartition(y = edx2$rating, times = 1, p = 0.2, list = FALSE)
train_set <- edx2[-test_index,]
test_set <- edx2[test_index,]
```
10% seemed like a good number to start with, although this could be refined more in the future, and is itself an arbitrary cutoff point not based on any deeper investigation.
## Results
(a results section that presents the modeling results and discusses the model performance)
The fine-tuning of lambda beyond the integer level does not seem to provide enough reduction in error to justify the increased cost in execution time. Instead, the statistical approach to pruning lower outliers seems to provide the needed optimization. A simple cutoff at the 10th percentile and an integer lambda yield the following:
```{r final model, message=FALSE, echo=FALSE, warning=FALSE}
# Calculate 10th percentile of movieId by rating
movieId_cutoff <-
function(true_ratings, predicted_ratings){
# Calculate the average rating for the training set.
mu_hat <- mean(train_set$rating)
paste("Penalized User and Movie Effect Least Squares approach with a sample mean rating of",
paste("Penalized User & Movie Effect method using a mean rating of",
as.character(mu_hat),
", optimizing lambda to the nearest integer.",
".",
sep=" ")
# Using Movie and User Effects with Penalization on both, optimize lambda to the nearest integer.
@ -602,7 +518,7 @@ return(RMSE(predicted_ratings, test_set$rating))
lambda_best <- lambdas[which.min(RMSEs)]
paste("Penalized LSE Model with lambda =",
paste("Optimized lambda =",
as.character(lambda_best),
"gives an RMSE of",
as.character(min(RMSEs)),
sep=" ")
```
We can measure the performance of this final algorithm against the validation set (for the first time):
```{r validation set, echo=FALSE}
# Use this optimized lambda value to test against the validation set.
paste("Now testing this optimized lambda against the vaildation set.")
pull(pred)
final_rmse <-
RMSE(final_predicted_ratings, validation$rating)
paste("Penalized LSE Model with optimized lambda =",
paste("Optimized lambda =",
as.character(lambda_best),
"gives an RMSE of",
as.character(final_rmse),
sep=" ")
```
## Conclusion
(a conclusion section that gives a brief summary of the report, its limitations and future work)
Our penalized movie and user effect model gives an error in the desired range when lower outlier pruning is applied at the 10th percentile of ratings for both users and movie titles.
### Summary
We began with a linear estimate accounting for both movie and user effects, as well as a penalized term for both effects. This got us close to our goal, and further exploration was warranted. The original mistake of not using semi_join when generating the training and test sets led to an investigation of low-value outliers.
(The histograms of reviews per user and reviews per movie in the Analysis section show how many users have rated only a few movies and how many movies have received only a few ratings.)
These add noise without contributing to predictability. Our model is based on the mean rating for each movie across userId, which is NOT resistant to outliers. Hence, the users and movies with low frequency add noise to the mean without adding much weight to the predictions. An attempt to remove these (to eliminate NAs in the final predictions) inadvertently led to a reduction in RMSE, which is what makes this method powerful.
### Limitations
This seems to depend on large data sets. Eliminating below a certain percentile might work against you with a smaller data set. This needs to be explored.
### Future Work
Optimizing the percentile removed from each predictor (movieId and userId) would be interesting, as it is unlikely that the 10th percentile is always the ideal cutoff. It would be nice to keep as much of the original data set as possible while still training a good solid model.
Many predictors were not considered, such as timestamps or genres. No matrix factorization was used to model the structure that remains after the movie and user effects are removed (for example, groups of similar movies or similar users). These could be incorporated into the model to decrease the overall error if desired.
