added pagebreak and moved the graphs around to fit better.

2021-09-13 21:17:53 -06:00 · 96454bc3c6
parent 4abd1906a3
commit 96454bc3c6
2 changed files with 5 additions and 4 deletions
--- a/movielens-recommendation-system.Rmd
+++ b/movielens-recommendation-system.Rmd
@@ -90,6 +90,7 @@ rm(dl, ratings, movies, test_index, temp, movielens, removed)

 The resulting working set `edx` is further split into a training set and a test set, which will be used to evaluate various methods.  The test set comprises 20% of the `edx` working set.  No attempt is made to ensure that all the users and movies in the test set are also in the training set.  This leads to a fortunate insight later, and thus has been included as part of the discovery process.

+\newpage
 ## Analysis

 ### Outline of Steps Taken
@@ -171,10 +172,6 @@ rmse_results <- bind_rows(rmse_results,
                          tibble(method="Movie Effect Model",  
                                     RMSE = bi_rmse))

-# How many NA values do we need to get rid of? 
-nas <- sum(is.na(predicted_ratings))
-paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
-
 # Plot movie and user histograms to get an idea of what is going on

 edx %>% group_by(userId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() + ggtitle("Number of reviews per user") + xlab("Number of reviews (log scale)") +
@@ -183,6 +180,10 @@ edx %>% group_by(userId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) +
 edx %>% group_by(movieId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() + ggtitle("Number of reviews per movie") + xlab("Number of reviews (log scale)") +
  ylab("Count of movies") 

+# How many NA values do we need to get rid of? 
+nas <- sum(is.na(predicted_ratings))
+paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
+
 ```

 Not very many `NA` values are being generated.  They are most likely caused by movies and/or users in the test set that are not present in the training set, so no effect estimate exists for them.  This was avoided in the original data set by using the `inner_join` function.  Instead, we attempt to remove any movies or users with extremely low rating counts.  The assumption is that movies with only a few ratings affect the mean, and therefore the error, without contributing much to the overall effect.  The same could be said for users who have only rated a few movies.  Removing these low-frequency observations prevents `NA` values in the predicted ratings and allows the RMSEs to be calculated.  A more rigorous approach needs to be formulated, as this is only a temporary workaround.
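
One possible shape for the more rigorous approach mentioned above is sketched below. It restricts the test set to movies and users that also occur in the training set, so every lookup of a movie effect finds a match. This is a minimal illustration and not the report's own code: the toy data frames and the names `keep` and `test_filtered` are assumptions made for the example, and base R is used so that the block runs without extra packages (two `dplyr::semi_join` calls, by `movieId` and by `userId`, would do the same filtering).

```r
# Toy data standing in for the train/test split of `edx` (values are made up).
train_set <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(4, 3, 5, 2, 4)
)
test_set <- data.frame(
  userId  = c(1, 2, 4, 3),
  movieId = c(30, 20, 10, 40),
  rating  = c(3, 4, 5, 2)
)

# Keep only test rows whose movie AND user both occur in the training set.
keep <- test_set$movieId %in% train_set$movieId &
        test_set$userId  %in% train_set$userId
test_filtered <- test_set[keep, ]
removed       <- test_set[!keep, ]

# Movie effect b_i estimated from the training set only.
mu  <- mean(train_set$rating)
b_i <- tapply(train_set$rating - mu, train_set$movieId, mean)

# Look up each test movie's effect by name; every lookup now finds a match.
predicted_ratings <- mu + unname(b_i[as.character(test_filtered$movieId)])

# Count the NA predictions that remain after filtering.
sum(is.na(predicted_ratings))
```

The rows collected in `removed` need not be thrown away. One option, stated here as a suggestion rather than as what the report does, is to append them to the training set so that no ratings are lost.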
--- a/movielens-recommendation-system.pdf
+++ b/movielens-recommendation-system.pdf