diff --git a/README.org b/README.org
index c680466..599a8a6 100755
--- a/README.org
+++ b/README.org
@@ -17,6 +17,10 @@ data using the R programming environment
 
  * [[./script/extractpdf.R][Reading pdf files]]
 
+** Vectorization
+
+ * [[./script/vectorization.R][Vectorization]]
+
 ** Iteration
 
  * [[./script/iteration.R][Lapply, apply and for loop: brief introduction]]
diff --git a/script/vectorization.R b/script/vectorization.R
new file mode 100755
index 0000000..56d7208
--- /dev/null
+++ b/script/vectorization.R
@@ -0,0 +1,100 @@
+#' ---
+#' title: "What is vectorization in R?"
+#' date: "2021-11-03"
+#' author: "Jose https://ajuda.multifarm.top"
+#' output:
+#'   html_document:
+#'     code_folding: show
+#'     toc: yes
+#'     toc_float:
+#'       smooth_scroll: true
+#'     df_print: paged
+#'     highlight: zenburn
+#' ---
+
+
+#' Memory allocation is slow in R, and somewhat slow in every language. One of the slowest ways to write a for loop is therefore to grow a vector inside it, which forces R to re-allocate memory on each iteration, like this:
+
+j <- 1
+
+system.time(for (i in 1:10) {
+  j[i] <- 10
+})
+
+n <- 1:10
+
+j <- 1
+
+system.time(for (i in seq_along(n)) {
+  j[i] <- 10
+})
+
+fxn <- function(j){
+  for (i in 1:10) {
+    j[i] <- 10
+  }
+  return(j)
+}
+
+system.time(fxn(j))
+
+#' Here, in each repetition of the for loop, R has to re-size the vector and re-allocate memory. It has to find the vector in memory, create a new vector that will fit more data, copy the old data over, insert the new data, and erase the old vector. This can get very slow as vectors get big.
+
+#' If you pre-allocate a vector that fits all the values, R doesn’t have to re-allocate memory on each iteration, and the loop can run much faster. Here’s how you’d do that for the above case:
+
+j <- rep(NA, 10)
+
+system.time(for (i in 1:10) {
+  j[i] <- 10
+})
+
+j <- rep(NA, 10)
+
+system.time(for (i in seq_along(j)) {
+  j[i] <- 10
+})
+
+#' There are still situations where it may make sense to use for loops instead of vectorized functions, though. These include:
+
+#' * Using functions that don’t take vector arguments
+#' * Loops where each iteration depends on the results of previous iterations
+
+#' Note that the second case is tricky. In some cases where the obvious implementation of an algorithm uses a for loop, there’s a vectorized way around it. For instance, a random walk can be implemented with vectorized code. In these cases, you often want to call functions that are essentially C/FORTRAN implementations of loop operations to avoid the loop in R. Examples of such functions include cumsum (cumulative sums), rle (run-length encoding: the lengths of runs of repeated values), and ifelse (vectorized if…else statements).
+
+#' ## Using 'rle'
+#' Compute the lengths and values of runs of equal values in a vector
+#' (or the reverse operation, with inverse.rle).
+
+#' Building data
+
+x <- c("952345172", "alju12", "amou79", "amou91", "baab81", NA)
+
+code <- rep(x, c(5, 10, 10, 20, 2, 7))
+
+df <- data.frame(id = seq_along(code), code)
+
+rle_code <- rle(df$code)
+
+class(rle_code)
+
+attributes(rle_code)
+
+rle_code$values
+
+rle_code$lengths
+
+rle_code$lengths > 6  # which runs are longer than 6?
+
+rle_code$values[rle_code$lengths > 6]  # values of those runs
+
+rle_code[[1]] > 6  # same test: [[1]] is the lengths component
+
+inverse.rle(rle_code)
+
+#' ## Using 'cumsum'
+#'
+
+
+#' ## Using 'ifelse'
+
+?do.call
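The 'cumsum' and 'ifelse' sections of the script have headings but no code yet. As a minimal sketch of what they might cover, the random walk mentioned in the text can be written both as a pre-allocated loop and with cumsum, and ifelse can replace an element-wise if...else. The names `steps`, `walk_loop`, `walk_vec` and `sign_label` are hypothetical and do not come from the script.

```r
set.seed(42)                       # fixed seed so the run is reproducible
n <- 1000
steps <- sample(c(-1, 1), n, replace = TRUE)

# Loop version: each position depends on the previous one
walk_loop <- numeric(n)            # pre-allocated to its final length
walk_loop[1] <- steps[1]
for (i in 2:n) {
  walk_loop[i] <- walk_loop[i - 1] + steps[i]
}

# Vectorized version: cumsum performs the same accumulation in one call
walk_vec <- cumsum(steps)

# Both approaches should give the same walk
identical(walk_loop, walk_vec)

# ifelse applies the test to every element at once
sign_label <- ifelse(walk_vec >= 0, "non-negative", "negative")
table(sign_label)
```

The loop is kept only for comparison. Here each step is just added to a running total, so the dependence on the previous iteration collapses into a cumulative sum and the loop can be dropped.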