diff --git a/movielens-recommendation-system.Rmd b/movielens-recommendation-system.Rmd
index e611d2f..8601429 100644
--- a/movielens-recommendation-system.Rmd
+++ b/movielens-recommendation-system.Rmd
@@ -90,6 +90,7 @@ rm(dl, ratings, movies, test_index, temp, movielens, removed)
 
 The resulting working set `edx` is further split into a training set and a test set which will be used to evaluate various methods. The test set comprises of 20% of the `edx` working set. No attempt is made to ensure that all the users and movies in the test set are also in the training set. This leads to a fortunate insight later, and thus has been included as part of the discovery process.
 
+\newpage
 ## Analysis
 
 ### Outline of Steps Taken
@@ -171,10 +172,6 @@ rmse_results <- bind_rows(rmse_results,
 tibble(method="Movie Effect Model",
 RMSE = bi_rmse))
 
-# How many NA values do we need to get rid of?
-nas <- sum(is.na(predicted_ratings))
-paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
-
 # Plot movie and user histograms to get an idea of what is going on
 edx %>% group_by(userId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) +
 geom_histogram(color = "black", bins = 50) + scale_x_log10() + ggtitle("Number of reviews per user") + xlab("Number of reviews (log scale)") +
@@ -183,6 +180,10 @@ edx %>% group_by(userId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) +
 edx %>% group_by(movieId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) +
 geom_histogram(color = "black", bins = 50) + scale_x_log10() + ggtitle("Number of reviews per movie") + xlab("Number of reviews (log scale)") +
 ylab("Count of movies")
+# How many NA values do we need to get rid of?
+nas <- sum(is.na(predicted_ratings))
+paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
+
 ```
 
 Not very many `NA` values are being generated. This is most likely due to movies and/or users in the training set that are not present in the test set, or vice versa. This was avoided in the original data set by using the `inner_join` function. Instead, we attempt to remove any movies or users with extremely low rating counts. The assumption is that movies with only a few ratings affect the mean and therefore the error without contributing much to the overall effect. The same could be said for users who have only rated a few movies. Removing these low-frequency observations prevents `NA` values in the predicted ratings and allows the RMSEs to be calculated. A more rigorous approach needs to be formulated as this is only a temporary workaround.
diff --git a/movielens-recommendation-system.pdf b/movielens-recommendation-system.pdf
index 45236e6..6c86e91 100644
Binary files a/movielens-recommendation-system.pdf and b/movielens-recommendation-system.pdf differ
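
The `NA` behaviour described in the Rmd paragraph can be reproduced in miniature. The following is only a sketch under stated assumptions: the `train` and `test` tibbles are made-up toy data, not the MovieLens data, and `movie_avgs`, `test_filtered`, `predicted_filtered` and `nas_filtered` are illustrative names that do not appear in the report. It assumes `dplyr` is available, as it is elsewhere in the Rmd. It shows how a `left_join` against training-set movie effects yields `NA` for a movie absent from the training set, and how filtering the test set with `semi_join` might serve as one possible alternative to dropping low-count movies and users.

```r
library(dplyr)

# Toy data, invented for illustration only (not from the MovieLens set)
train <- tibble(userId  = c(1, 1, 2, 2),
                movieId = c(10, 20, 10, 20),
                rating  = c(4, 3, 5, 2))
test  <- tibble(userId  = c(1, 2, 3),
                movieId = c(10, 30, 20),
                rating  = c(4, 3, 5))

# Overall mean rating, estimated from the training set only
mu <- mean(train$rating)

# Movie effect b_i: average deviation of each movie from the overall mean
movie_avgs <- train %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))

# left_join keeps every test row, so movie 30 (absent from train) gets b_i = NA
predicted_ratings <- test %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  pull(pred)
nas <- sum(is.na(predicted_ratings))

# Possible alternative: keep only test rows whose movie AND user occur in train
test_filtered <- test %>%
  semi_join(train, by = "movieId") %>%
  semi_join(train, by = "userId")

predicted_filtered <- test_filtered %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  pull(pred)
nas_filtered <- sum(is.na(predicted_filtered))
```

With this toy data only the test row for movie 30 produces an `NA`, and the filtered test set produces none. The trade-off is that the filtered rows are no longer evaluated, so a real split might move them back into the training set instead of discarding them.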