#' ---
#' title: "What is vectorization in R?"
#' date: "2021-11-03"
#' author: "Jose https://ajuda.multifarm.top"
#' output:
#'   html_document:
#'     code_folding: show
#'     toc: yes
#'     toc_float:
#'       smooth_scroll: true
#'     df_print: paged
#'     highlight: zenburn
#' ---
#' Memory allocation is slow in R, and somewhat slow in every language. One of the slower ways to write a for loop is therefore to resize a vector inside it, so that R has to re-allocate memory repeatedly, like this:
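# Grow a length-1 vector one element at a time
j <- 1
system.time(for (i in 1:10) {
  j[i] <- 10
})

# Same loop, iterating over the indices of an existing vector
n <- 1:10
j <- 1
system.time(for (i in seq_along(n)) {
  j[i] <- 10
})

# Same loop wrapped in a function
fxn <- function(j) {
  for (i in 1:10) {
    j[i] <- 10
  }
  return(j)
}
j <- 1  # reset to length 1 so the vector is grown again inside the function
system.time(fxn(j))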
#' Here, in each repetition of the for loop, R has to resize the vector and re-allocate memory. It has to find the vector in memory, create a new vector that will fit more data, copy the old data over, insert the new data, and erase the old vector. This can get very slow as vectors get big.
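#'
#' A rough sketch of the steps described above, written out by hand. The helper grow_by_one() is hypothetical and only illustrates the idea; it is not how R implements vector growth internally.
grow_by_one <- function(old, value) {
  new <- vector(mode = typeof(old), length = length(old) + 1)  # create a new, larger vector
  new[seq_along(old)] <- old                                   # copy the old data over
  new[length(new)] <- value                                    # insert the new data
  new                                                          # the old vector is left to the garbage collector
}

v <- 1
for (i in 2:5) {
  v <- grow_by_one(v, 10)  # every iteration allocates and copies again
}
v
#'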
#' If one pre-allocates a vector that fits all the values, R doesn't have to re-allocate memory on each iteration, and the loop can be much faster. Here's how you'd do that for the case above:
j <- rep(NA, 10)
system.time(for (i in 1:10) {
  j[i] <- 10
})

j <- rep(NA, 10)
system.time(for (i in seq_along(j)) {
  j[i] <- 10
})
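#' With only 10 iterations, the timings above will all be close to zero. A sketch with a larger, arbitrarily chosen n makes the comparison easier to see. The function names grow() and prealloc() are made up for illustration, and the actual timings depend on the machine and the R version.
n <- 1e5

grow <- function(n) {
  out <- numeric(0)          # starts empty and is extended on every iteration
  for (i in seq_len(n)) {
    out[i] <- 10
  }
  out
}

prealloc <- function(n) {
  out <- numeric(n)          # allocated once, at full size
  for (i in seq_len(n)) {
    out[i] <- 10
  }
  out
}

system.time(res_grow <- grow(n))
system.time(res_prealloc <- prealloc(n))
system.time(res_vec <- rep(10, n))  # vectorized: no explicit loop at all

# All three approaches build the same vector
identical(res_grow, res_prealloc)
identical(res_prealloc, res_vec)
#' The sketch uses numeric(n) rather than rep(NA, n) so that the vector already has the type of the values stored in it; rep(NA, n) is a logical vector and is converted to numeric on the first assignment.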
#'
#' There are still situations where it may make sense to use for loops instead of vectorized functions, though. These include:
#'
#' - Using functions that don't take vector arguments
#' - Loops where each iteration depends on the results of previous iterations
#'
#' Note that the second case is tricky. In some cases where the obvious implementation of an algorithm uses a for loop, there's a vectorized way around it. For instance, a random walk can be implemented with vectorized code. In these cases, you often want to call functions that are essentially C/FORTRAN implementations of loop operations, to avoid the loop in R. Examples of such functions include cumsum (cumulative sums), rle (lengths of runs of repeated values), and ifelse (vectorized if…else statements).
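#'
#' As a minimal sketch of the random walk case (the step distribution and the names steps, walk_loop, and walk_vec are assumptions made for illustration): each position depends on the previous one, yet cumsum gives the same path without an R-level loop.
set.seed(42)
steps <- sample(c(-1, 1), size = 1000, replace = TRUE)

# Loop version: each position is the previous position plus the current step
walk_loop <- numeric(length(steps))
walk_loop[1] <- steps[1]
for (i in 2:length(steps)) {
  walk_loop[i] <- walk_loop[i - 1] + steps[i]
}

# Vectorized version: the running sum of the steps
walk_vec <- cumsum(steps)

all.equal(walk_loop, walk_vec)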
#' ## Using 'rle'
#'
#' Compute the lengths and values of runs of equal values in a vector, or the reverse operation.
#'
#' Build the example data:
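x <- c("952345172", "alju12", "amou79", "amou91", "baab81", NA)
code <- rep(x, c(5, 10, 10, 20, 2, 7))
# stringsAsFactors = FALSE keeps code as character; rle() does not accept factors
df <- data.frame(id = seq_along(code), code, stringsAsFactors = FALSE)

rle_code <- rle(df$code)
class(rle_code)       # "rle"
attributes(rle_code)  # names ("lengths", "values") and class
rle_code$values       # the value of each run
rle_code$lengths      # how many times each value repeats in a row
# Note: rle() treats every NA as its own run of length 1

rle_code$lengths > 6                   # which runs are longer than 6?
rle_code$values[rle_code$lengths > 6]  # values of the runs longer than 6
rle_code[[1]] > 6                      # same test: lengths is the first element
inverse.rle(rle_code)                  # rebuild the original vector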
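#' A smaller, self-contained sketch of the same idea, using a made-up logical vector rain: rle finds the longest streak of TRUE without a loop.
rain <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
runs <- rle(rain)
runs$lengths                                   # 2 1 3 2 1
runs$values                                    # TRUE FALSE TRUE FALSE TRUE
longest_wet <- max(runs$lengths[runs$values])  # longest run of TRUE
longest_wet                                    # 3
identical(inverse.rle(runs), rain)             # the round trip restores the input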
#' ## Using 'cumsum'
#'
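#' cumsum returns the running total of a vector. A minimal sketch with made-up data (sales and codes are illustrative names): a running total, a running mean, and run identifiers built from a change indicator.
sales <- c(5, 3, 8, 2)
cumsum(sales)                     # 5 8 16 18
cumsum(sales) / seq_along(sales)  # running mean

# Label consecutive runs of equal values: a new run starts wherever the value changes
codes <- c("a", "a", "b", "b", "b", "a")
changed <- c(TRUE, codes[-1] != codes[-length(codes)])
run_id <- cumsum(changed)         # 1 1 2 2 2 3
run_id
#'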
#' ## Using 'ifelse'
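#'
#' ifelse applies a test to every element of a vector at once. A minimal sketch with made-up data (temps is an illustrative name), next to the equivalent loop:
temps <- c(12, 25, 31, 18, NA)

# Loop version: one if/else decision per element
label_loop <- character(length(temps))
for (i in seq_along(temps)) {
  if (is.na(temps[i])) {
    label_loop[i] <- NA
  } else if (temps[i] > 20) {
    label_loop[i] <- "warm"
  } else {
    label_loop[i] <- "cool"
  }
}

# Vectorized version: an NA in the test gives an NA in the result
label_vec <- ifelse(temps > 20, "warm", "cool")

identical(label_loop, label_vec)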
?do.call