#' ---
#' title: "What is vectorization in R?"
#' date: "2021-11-03"
#' author: "Jose https://ajuda.multifarm.top"
#' output:
#'   html_document:
#'     code_folding: show
#'     toc: yes
#'     toc_float:
#'       smooth_scroll: true
#'     df_print: paged
#'     highlight: zenburn
#' ---

#' Memory allocation is slow in R, and it is not free in any language. One of the slower ways to write a for loop is therefore to grow a vector one element at a time, which forces R to re-allocate memory repeatedly, like this:

# Start from a length-1 vector and grow it inside the loop
j <- 1

system.time(for (i in 1:10) {
  j[i] <- 10
})

n <- 1:10

j <- 1

# Same loop, iterating over seq_along(n) instead of a hard-coded range
system.time(for (i in seq_along(n)) {
  j[i] <- 10
})

# Same loop wrapped in a function
fxn <- function(j) {
  for (i in 1:10) {
    j[i] <- 10
  }
  return(j)
}

system.time(fxn(j))

#' Here, in each iteration of the for loop, R has to resize the vector and re-allocate memory. It has to create a new vector large enough for the extra element, copy the old data over, insert the new value, and discard the old vector. This can get very slow as vectors get big.

#' If you pre-allocate a vector that fits all the values, R doesn’t have to re-allocate memory on each iteration, and the loop can be much faster. Here’s how you’d do that for the above case:

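# Pre-allocate a numeric vector of the final length.
# NA_real_ keeps the vector numeric from the start; a plain NA would create
# a logical vector that R must convert on the first assignment.
j <- rep(NA_real_, 10)

system.time(for (i in 1:10) {
  j[i] <- 10
})

j <- rep(NA_real_, 10)

system.time(for (i in seq_along(j)) {
  j[i] <- 10
})
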
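#' Loops of only ten iterations are too short to time reliably, so the difference is easier to see with a larger vector. The following is a sketch for illustration only: the size 'n_big' and the helper names 'grow_vector' and 'prealloc_vector' are made up for this example, and the timings will vary with the machine and the R version.

n_big <- 1e5

# Grow the result one element at a time
grow_vector <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) {
    out[i] <- i^2
  }
  out
}

# Allocate the full result up front, then fill it in
prealloc_vector <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- i^2
  }
  out
}

system.time(grown <- grow_vector(n_big))
system.time(prealloc <- prealloc_vector(n_big))

# Fully vectorized: no explicit loop in R at all
system.time(vectorized <- seq_len(n_big)^2)

# All three approaches should produce the same values
stopifnot(isTRUE(all.equal(grown, prealloc)),
          isTRUE(all.equal(prealloc, vectorized)))
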
#' There are still situations where it may make sense to use for loops instead of vectorized functions, though. These include:
#'
#' - Using functions that don’t take vector arguments
#' - Loops where each iteration depends on the results of previous iterations
#'
#' Note that the second case is tricky. In some cases where the obvious implementation of an algorithm uses a for loop, there’s a vectorized way around it; a random walk, for instance, can be implemented with vectorized code. In these cases, you often want to call functions that are essentially C/Fortran implementations of loop operations, so that the loop does not run in R. Examples of such functions include cumsum (cumulative sums), rle (lengths of runs of repeated values), and ifelse (vectorized if…else statements).

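#' As a sketch of the random-walk case: each position depends on the previous one, yet the whole path is simply the cumulative sum of the steps. The names 'steps', 'walk_loop' and 'walk_vec', the seed, and the walk length of 1000 are all made up for this illustration.

set.seed(42)

# Draw 1000 steps of -1 or +1
steps <- sample(c(-1, 1), size = 1000, replace = TRUE)

# Loop version: each position is the previous position plus the current step
walk_loop <- numeric(length(steps))
walk_loop[1] <- steps[1]
for (i in 2:length(steps)) {
  walk_loop[i] <- walk_loop[i - 1] + steps[i]
}

# Vectorized version: cumsum performs the same accumulation in compiled code
walk_vec <- cumsum(steps)

# Both versions should trace the same path
stopifnot(isTRUE(all.equal(walk_loop, walk_vec)))
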
#' ## Using 'rle'

#' Compute the lengths and values of runs of equal values in a vector
#' (or, with 'inverse.rle', perform the reverse operation).

#' Building the data

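x <- c("952345172", "alju12", "amou79", "amou91", "baab81", NA)

# Repeat each code a different number of times to create runs
code <- rep(x, c(5, 10, 10, 20, 2, 7))

df <- data.frame(id = seq_along(code), code, stringsAsFactors = FALSE)

rle_code <- rle(df$code)

class(rle_code)

attributes(rle_code)

rle_code$values

# Note: rle() treats every NA as its own run of length 1
rle_code$lengths

# Which runs are longer than 6?
rle_code$lengths > 6

# Values of the runs that are longer than 6
rle_code$values[rle_code$lengths > 6]

# rle_code[[1]] is the same component as rle_code$lengths
rle_code[[1]] > 6

# Rebuild the original vector from its run-length encoding
inverse.rle(rle_code)
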
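#' One way to make a run-length encoding easier to read is to turn it into a table with the start and end position of each run. This is only a sketch: the vector 'flips' and the column names are made up for the example.

flips <- c("H", "H", "T", "T", "T", "H", "T", "T")

runs <- rle(flips)

# The end position of each run is the cumulative sum of the run lengths
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

run_table <- data.frame(
  value = runs$values,
  length = runs$lengths,
  start = run_start,
  end = run_end
)

run_table

# Value and length of the longest run (the first one, if there are ties)
runs$values[which.max(runs$lengths)]
max(runs$lengths)
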
#' ## Using 'cumsum'

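#' 'cumsum' returns the running total of a vector, which can replace a loop that carries a sum from one iteration to the next. A minimal sketch, using a made-up 'deposits' vector:

deposits <- c(100, -20, 35, -50, 10)

# Loop version: carry the balance forward by hand
balance_loop <- numeric(length(deposits))
total <- 0
for (i in seq_along(deposits)) {
  total <- total + deposits[i]
  balance_loop[i] <- total
}

# Vectorized version: one call, no explicit loop
balance_vec <- cumsum(deposits)

balance_vec

# A running mean can be built from the running total
running_mean <- cumsum(deposits) / seq_along(deposits)

running_mean

stopifnot(isTRUE(all.equal(balance_loop, balance_vec)))
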
#' ## Using 'ifelse'

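#' 'ifelse' applies a test to every element of a vector at once and, for each element, picks the value from its 'yes' or 'no' argument. A minimal sketch, with a made-up 'scores' vector and a cutoff of 60 chosen only for illustration:

scores <- c(55, 72, 90, 48, NA, 60)

# Loop version: test one element at a time
result_loop <- character(length(scores))
for (i in seq_along(scores)) {
  if (is.na(scores[i])) {
    result_loop[i] <- NA
  } else if (scores[i] >= 60) {
    result_loop[i] <- "pass"
  } else {
    result_loop[i] <- "fail"
  }
}

# Vectorized version: an NA in the test gives an NA in the result
result_vec <- ifelse(scores >= 60, "pass", "fail")

result_vec

stopifnot(identical(result_loop, result_vec))
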
?do.call