added pagebreak and moved the graphs around to fit better.

shelldweller 2021-09-13 21:17:53 -06:00
parent 4abd1906a3
commit 96454bc3c6
2 changed files with 5 additions and 4 deletions

@ -90,6 +90,7 @@ rm(dl, ratings, movies, test_index, temp, movielens, removed)
The resulting working set `edx` is further split into a training set and a test set, which will be used to evaluate various methods. The test set comprises 20% of the `edx` working set. No attempt is made to ensure that all the users and movies in the test set are also in the training set. This leads to a fortunate insight later, and has therefore been included as part of the discovery process.
\newpage
## Analysis
### Outline of Steps Taken
@ -171,10 +172,6 @@ rmse_results <- bind_rows(rmse_results,
tibble(method="Movie Effect Model",
RMSE = bi_rmse))
# How many NA values do we need to get rid of?
nas <- sum(is.na(predicted_ratings))
paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
# Plot movie and user histograms to get an idea of what is going on
edx %>% group_by(userId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() + ggtitle("Number of reviews per user") + xlab("Number of reviews (log scale)") +
@ -183,6 +180,10 @@ edx %>% group_by(userId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) +
edx %>% group_by(movieId) %>% summarize(reviews = n()) %>% ggplot(aes(reviews)) + geom_histogram(color = "black", bins = 50) + scale_x_log10() + ggtitle("Number of reviews per movie") + xlab("Number of reviews (log scale)") +
ylab("Count of movies")
# How many NA values do we need to get rid of?
nas <- sum(is.na(predicted_ratings))
paste("There are", as.character(nas), "NA values in our predictions, which need to be removed.", sep= " ")
```
Only a few `NA` values are generated. They are most likely caused by movies and/or users that appear in the test set but not in the training set, so no effect can be estimated for them. This was avoided in the original data set by using the `inner_join` function. Here, we instead remove any movies or users with extremely low rating counts. The assumption is that movies with only a few ratings shift the mean, and therefore the error, without contributing much to the overall effect. The same can be said of users who have rated only a few movies. Removing these low-frequency observations prevents `NA` values in the predicted ratings and allows the RMSEs to be calculated. This is only a temporary workaround; a more rigorous approach still needs to be formulated.
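
To make the source of these `NA` values concrete, here is a minimal sketch on toy data. The `train_set`, `test_set`, `movie_avgs`, and `test_clean` objects and their values are hypothetical and are not taken from the report. It assumes predictions are built by joining the test set to per-movie effects estimated on the training set, and it shows `semi_join` as one possible alternative to dropping low-count movies and users, not as the method used in this report.

```r
library(dplyr)

# Toy data (hypothetical): movie 3 appears only in the test set
train_set <- tibble(userId  = c(1, 1, 2, 2),
                    movieId = c(1, 2, 1, 2),
                    rating  = c(4, 3, 5, 2))
test_set  <- tibble(userId  = c(1, 2, 2),
                    movieId = c(1, 2, 3),
                    rating  = c(5, 3, 4))

mu <- mean(train_set$rating)

# Movie effect b_i, estimated from the training set only
movie_avgs <- train_set %>%
  group_by(movieId) %>%
  summarize(b_i = mean(rating - mu))

# left_join keeps every test row; movie 3 has no b_i, so its prediction is NA
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  pull(pred)
na_count <- sum(is.na(predicted_ratings))

# semi_join keeps only test rows whose movie and user also occur in training
test_clean <- test_set %>%
  semi_join(train_set, by = "movieId") %>%
  semi_join(train_set, by = "userId")

clean_predictions <- test_clean %>%
  left_join(movie_avgs, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  pull(pred)
```

Filtering the test set this way removes the unmatched rows directly, regardless of how many ratings a movie or user has, whereas a rating-count threshold only makes such mismatches less likely.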

Binary file not shown.