diff --git a/Detecting_Bitcoin_Ransomware.Rmd b/Detecting_Bitcoin_Ransomware.Rmd
index a81625d..4a13735 100644
--- a/Detecting_Bitcoin_Ransomware.Rmd
+++ b/Detecting_Bitcoin_Ransomware.Rmd
@@ -27,6 +27,10 @@ geometry: margin=2cm
 \vbox{\copy2\box0}\box2}}
 
 ```{r setup, include=FALSE}
+# Load and start timer
+library(tictoc)
+tic()
+
 knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
 def.chunk.hook <- knitr::knit_hooks$get("chunk")
 knitr::knit_hooks$set(chunk = function(x, options) {
@@ -77,6 +81,13 @@ These variables are defined rather abstractly, viewing the blockchain as a topol
 
 ```{r data-prep, echo=FALSE, include=FALSE}
 
+# Set the repository to a known working mirror just in case it has not already been set
+cat("Setting Seattle repository")
+r = getOption("repos")
+r["CRAN"] = "http://cran.fhcrc.org/"
+options(repos = r)
+rm(r)
+
 # Install necessary packages
 if(!require(tidyverse)) install.packages("tidyverse")
 if(!require(caret)) install.packages("caret")
@@ -186,8 +197,6 @@ test_index <- createDataPartition(y = workset$bw, times = 1, p = .5, list = FALS
 train_set <- workset[-test_index,]
 test_set <- workset[test_index,]
 
-# Clean up environment
-rm(dest_file, url)
 
 #Sample every nth row due to memory constraints
 train_samp <- train_set[seq(1, nrow(train_set), 100), ]
@@ -210,7 +219,7 @@ no_nas <- sum(is.na(ransomware))
 
 ```
 
-### Exploration and Visualization
+### Exploration and Visualization
 
 The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 29 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
@@ -284,7 +293,7 @@ plot(coeff_vars)
 
 From this, it appears that *income* has the widest range of variability, followed by *neighbors*. These are also the features that are most strongly skewed to the right, meaning that a few addresses have really high values for each of these features while the bulk of the data set has very low values for these numbers.
 
-Now do the following:
+Now do the following (after filling in the methods, results, and conclusions sections, which are already drafted):
 
 6) Break into groups somehow. Graph variables per group? Show how the variables are distributed for each ransomware group? Percent ransomware per each day of the week, for example. Is ransomware more prevalent on a particular day of the week? Break other numerical values into bins, and graph percentage per bin. Look for trends and correlations between groups/variables, and display them here. MORE OF THIS....
@@ -330,55 +339,475 @@ data.frame(pca$x[,1:2], bw=train_samp$bw) %>%
 #d_approx <- dist(pca$x[, 1:2])
 #qplot(d, d_approx) + geom_abline(color="red")
 
-# Clean up environment
-rm(pca, x, coeff_vars, d, means, pc, sds)
 
 ```
 
 ### Insights Gained from Exploration
 
- Maybe its better to approach this as a binary problem? At least at first, lets see how far that gets us....
- 
+ From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses.
+To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for the application of methods that would be impractical otherwise.
 
-## Modeling approach (Chunk #3, mostly done, just need to clean up a bit)
+## Modeling approach
 
- An overview of why I picked the methods that I did. Based on from original paper, that Random Forests were hard to apply here, and that it was all topological data to begin with, hence that lead me to SOMs. Also, describe the reasoning behind the binary approach. Describe what you learned about SOMs.
- 
-#### Random Forests
-
-#### Self Organizing Maps
+ Akcora et al. mention that they tried to model the data using a Random Forests method, but that the complexity of the data set led to problems with that approach.[3] Switching to a binary perspective on the problem might alleviate some of that complexity, and is worth another look. The topological nature of the way the data set has been described numerically led me to search for topological machine learning methods. Searching for *topo* in the documentation for the `caret` package [6] turned up the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough for me to investigate further.
 
 ### Method 1: Binary Random Forests
 
-If we ask a simpler question, is this a useful approach? Mentioned to not work well in original paper. Try it using a binary black/white approach. change all instances of "bw" in the code to "bw". show how this simplification leads to (near)-perfect accuracy. Confusion Matrix?
+Using the `randomForest` library, we train a model on our training set and test it against the "black/white" categorization of our test set.
+
+```{r binary_random_forests, echo=FALSE, include=FALSE}
+
+# Install randomForest package if needed
+if(!require(randomForest)) install.packages("randomForest")
+library(randomForest)
+
+# Keep only numeric columns with highest coefficients of variation for dimension reduction
+train_num <- train_samp %>% select(neighbors, income)
+
+# Binary outputs, black=ransomware, white=non-ransomware, train set
+train_bw <- train_samp$bw
+
+#Sample every nth row due to memory constraints
+set.seed(23)
+test_samp <- test_set[seq(1, nrow(train_set), 100), ]
+
+# Dimension reduction again
+test_num <- test_samp %>% select(neighbors, income)
+
+# Same for test set
+test_bw <- test_samp$bw
+
+# Lower CV numbers
+control <- trainControl(method="cv", number = 10)
+grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))
+
+# Train Random Forests model
+rf_model <- train(train_num, train_bw, method="rf", trControl = control, tuneGrid=grid)
+
+
+# Fit model
+fit_rf <- randomForest(train_samp, train_bw,
+                       minNode = rf_model$bestTune$mtry)
+
+```
+
+We can see that the results are quite good against the smaller test set and the larger validation set.
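+
+A minimal sketch, not evaluated in this report, of how the precision discussed later in the Results section could be read directly from `caret`'s confusion-matrix output; the object names `cm_test` and `cm_validation` are the ones defined in the next chunk:
+
+```{r precision-sketch, eval=FALSE}
+# Precision = TP / (TP + FP); caret reports it alongside recall and F1 in byClass
+cm_test$byClass[c("Precision", "Recall", "F1")]
+cm_validation$byClass[c("Precision", "Recall", "F1")]
+```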
+
+```{r binary_random_forests-validation, echo=FALSE}
+
+# Check for best tuning parameters
+ggplot(rf_model)
+rf_model$bestTune
+
+# Check for enough trees
+plot(fit_rf)
+
+# Measure accuracy of model against test sample
+y_hat_rf <- predict(fit_rf, test_samp)
+cm_test <- confusionMatrix(y_hat_rf, test_bw)
+
+message("Confusion Matrix for test set:")
+cm_test
+
+# Measure accuracy of model against full validation set
+
+y_hat_rf <- predict(fit_rf, validation)
+cm_validation <- confusionMatrix(y_hat_rf, validation$bw)
+
+message("Confusion Matrix for validation set:")
+cm_validation
+
+
+```
+
 ### Method 2: Binary SOMs
 
-If we ask the same question to a more sophisticated and topological approach, how good is the model? Mention how the original paper was toplogical in nature, an how this lead to the investigation of SOMs. Repeat the binary "b/w" approach using SOMs. This accuracy is still pretty good, but not *as* good as the random forest method. Point out how SOMs are really used for classification into _many_ groups. This leads to an Insight! (see above) What if we first _isolate_ the "black" addresses using Random Forest, and then categorize the black only subset (< 2%) using categorical SOMs. This leads to a 2-part system...
+If we ask the same binary question using a more sophisticated, topological approach, how good is the model? The original paper frames the data topologically, and that is what led to the investigation of Self Organizing Maps. Repeating the binary "black/white" approach with an SOM still gives good accuracy, but not *as* good as the Random Forests method. SOMs, however, are really intended for classification into _many_ groups, which suggests an insight: what if we first _isolate_ the "black" addresses using Random Forests, and then categorize only that small subset (less than 2% of the data) with a categorical SOM? This leads to the two-part method developed below.
+
+Note that this binary SOM step is not used in the final script; it is kept here only to motivate the two-part approach.
+
+```{r binary_soms, echo=FALSE, include=FALSE}
+# Install kohonen package if needed
+if(!require(kohonen)) install.packages("kohonen")
+
+# Load kohonen library
+library(kohonen)
+
+# Install parallel package if needed
+if(!require(parallel)) install.packages("parallel")
+
+# Load parallel library
+library(parallel)
+
+# Keep only numeric columns, ignoring dates and looped.
+#train_num <- train_set %>% select(length, weight, count, neighbors, income)
+
+# SOM function can only work on matrices
+#train_mat <- as.matrix(scale(train_num))
+
+# Switching to supervised SOMs
+#test_num <- test_set %>% select(length, weight, count, neighbors, income)
+
+# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
+#test_mat <- as.matrix(scale(test_num, center = attr(train_mat, +# "scaled:center"), scale = attr(train_mat, "scaled:scale"))) + +# Binary outputs, black=ransomware, white=non-ransomware, train set +#train_bw <- train_set$bw %>% classvec2classmat() + +# Same for test set +#test_bw <- test_set$bw %>% classvec2classmat() + +# Create Data list for supervised SOM +# +#train_list <- list(independent = train_mat, dependent = train_bw) + +# Calculate idea grid size according to: +# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps + +# Formulaic method 1 +#grid_size <- round(sqrt(5*sqrt(nrow(train_set)))) +# Based on categorical number, method 2 +#grid_size = ceiling(sqrt(length(unique(ransomware$bw)))) +#grid_size + +# Create SOM grid +#train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE) + +# Set magic seed for reproducibility +#set.seed(23) + +## Now build the model. +#som_model <- xyf(train_mat, train_bw, +# grid = train_grid, +# rlen = 100, +# mode="pbatch", # or: alpha = c(0.05,0.01), +# cores = detectCores(), # detectCores() - 1 if system becomes unresponsive during training +# keep.data = TRUE +#) + + +# Now test predictions +# https://clarkdatalabs.github.io/soms/SOM_NBA + +#test_list <- list(independent = test_mat, dependent = test_bw) + +#ransomware.prediction <- predict(som_model, newdata = test_list) + + + +# Now test predictions of validation set + +# Switching to supervised SOMs +#valid_num <- validation %>% select(length, weight, count, neighbors, income) + +# Note that when we rescale our testing data we need to scale it according to how we scaled our training data. +#valid_mat <- as.matrix(scale(valid_num, center = attr(train_mat, +# "scaled:center"), scale = attr(train_mat, "scaled:scale"))) + +#valid_bw <- validation$bw + +#valid_list <- list(independent = valid_mat, dependent = valid_bw) + +# Requires up to 16GB of RAM, skip if resources are limited +#ransomware.prediction.validation <- predict(som_model, newdata = valid_list) + +``` + + + +```{r binary_soms-cms, echo=FALSE} + +#table(test_set$bw, ransomware.prediction$prediction[[2]]) %>% knitr::kable() + +#table(validation$bw, ransomware.prediction.validation$prediction[[2]]) %>% knitr::kable() + +# These are bogging down the pdf. Choose only a few? 
+# Visualize clusters
+#plot(som_model, type = 'mapping', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Distance map
+#plot(som_model, type = 'quality', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize counts
+#plot(som_model, type = 'counts', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize fan diagram
+#plot(som_model, type = 'codes', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 1
+#plot(som_model, type = 'property', property = som_model$codes[[1]][,1], main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 2
+#plot(som_model, type = 'property', property = som_model$codes[[1]][,2], main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 3
+#plot(som_model, type = 'property', property = som_model$codes[[1]][,3], main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 4
+#plot(som_model, type = 'property', property = som_model$codes[[1]][,4], main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 5
+#plot(som_model, type = 'property', property = som_model$codes[[1]][,5], main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Confusion Matrix
+#cm_bw <- confusionMatrix(ransomware.prediction$prediction[[2]], test_set$bw)
+#cm_bw$overall
+
+# Now test predictions of validation set
+
+# Confusion Matrix
+#cm_bw.validation <- confusionMatrix(ransomware.prediction.validation$prediction[[2]], validation$bw)
+#cm_bw.validation$overall
+
+
+```
 
 ### Method 3: Categorical SOMs
 
-Describe categorical SOM work here, show results. This is where the pretty colored hex-graphs show up.
+The "black" addresses are now categorized into ransomware families using a supervised SOM trained on the family labels. The resulting clusters can be visualized on the hexagonal SOM grids below.
+
+```{r categorical_soms, echo=FALSE, include=FALSE}
+# Try categorical SOMs on black-only addresses....
+#!! This is NOT right, is it?
+#!! It would be even MORE impressive if I removed all the PREDICTED whites from
+#!! the test set instead and started there.
+
+blacks <- ransomware %>% filter(!label=="white")
+
+# Validation set made from 50% of BitcoinHeist data, reduce later if possible. Categorical outcomes
+set.seed(23)
+test_index <- createDataPartition(y = blacks$label, times = 1, p = .5, list = FALSE)
+
+workset_blacks <- blacks[-test_index,]
+temp <- blacks[test_index,]
+
+# Make sure addresses in validation set are also in working set...
+# validation <- temp %>%
+# semi_join(workset, by = "address")
+
+# Add rows removed from validation set back into working set...
+#removed <- anti_join(temp, validation)
+#workset <- rbind(workset, removed)
+
+# ... Or not
+validation_blacks <- temp
+
+# Split the working set into a training set and a test set @ 50%, reduce later if possible. Binary outcomes (bw)
+set.seed(23)
+test_index <- createDataPartition(y = workset_blacks$label, times = 1, p = .5, list = FALSE)
+
+# Split the working set into a training set and a test set @ 50%, reduce later if possible. Categorical outcomes
+#test_index <- createDataPartition(y = workset$label, times = 1, p = .5, list = FALSE)
+
+train_set <- workset_blacks[-test_index,]
+temp <- workset_blacks[test_index,]
+
+# Make sure addresses in validation set are also in working set....
+#test_set <- temp %>%
+# semi_join(train_set, by = "address")
+
+# Add rows removed from validation set back into working set....
+#removed <- anti_join(temp, test_set) +#train_set <- rbind(train_set, removed) + +# ....Or not +test_set <- temp + +##!! Data preparation is done, now focusing on Self Organizing Maps as our method +##!! Start here after reworking the data prep steps above. + +# Keep only numeric columns, ignoring dates and looped for now (insert factor analysis impVar here?) +train_num <- train_set %>% select(length, weight, count, neighbors, income) + +# SOM function can only work on matrices +train_mat <- as.matrix(scale(train_num)) + +# Switching to supervised SOMs +test_num <- test_set %>% select(length, weight, count, neighbors, income) + +# Note that when we rescale our testing data we need to scale it according to how we scaled our training data. +test_mat <- as.matrix(scale(test_num, center = attr(train_mat, + "scaled:center"), scale = attr(train_mat, "scaled:scale"))) + +# Categorical +train_label <- train_set$label %>% classvec2classmat() + +# Same for test set +test_label <- test_set$label %>% classvec2classmat() + +# Create Data list for supervised SOM +# +train_list <- list(independent = train_mat, dependent = train_label) + +# Calculate idea grid size according to: +# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps + +# Formulaic method 1 +grid_size <- round(sqrt(5*sqrt(nrow(train_set)))) +# Based on categorical number, method 2 +#grid_size = ceiling(sqrt(length(unique(ransomware$label)))) +grid_size + +# Create SOM grid +train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE) + +# Set magic seed for reproducibility +set.seed(23) + +## Now build the model. +som_model2 <- xyf(train_mat, train_label, + grid = train_grid, + rlen = 100, + mode="pbatch", # or: alpha = c(0.05,0.01), + cores = detectCores(), # detectCores() - 1 if system locks during calculation + keep.data = TRUE +) + +# Now test predictions of test set +# https://clarkdatalabs.github.io/soms/SOM_NBA + +test_list <- list(independent = test_mat, dependent = test_label) + +ransomware_group.prediction <- predict(som_model2, newdata = test_list) + + + +# Now test predictions of validation set + +# Switching to supervised SOMs +valid_num <- validation_blacks %>% select(length, weight, count, neighbors, income) + +# Note that when we rescale our testing data we need to scale it according to how we scaled our training data. +valid_mat <- as.matrix(scale(valid_num, center = attr(train_mat, + "scaled:center"), scale = attr(train_mat, "scaled:scale"))) + + +valid_label <- validation_blacks$label + +valid_list <- list(independent = valid_mat, dependent = valid_label) + +ransomware_group.prediction.validation <- predict(som_model2, newdata = valid_list) + + +``` + +```{r categorical_soms_cms, echo=FALSE} + +#table(test_set$label, ransomware_group.prediction$prediction[[2]]) %>% knitr::kable() + +#table(validation_blacks$label, ransomware_group.prediction.validation$prediction[[2]]) %>% knitr::kable() + +#These re good plots, fix their display somehow... 
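+# (One option, not tested here: re-enable only a few of these plots and arrange
+#  them on a single page with par(mfrow = c(2, 2)) so they do not overwhelm the PDF.)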
+# Visualize clusters
+#plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Distance map
+#plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize counts
+#plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize fan diagram
+#plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 1
+#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,1], main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 2
+#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,2], main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 3
+#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,3], main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 4
+#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,4], main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+# Visualize heatmap for variable 5
+#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,5], main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
+# cat(" \n")
+
+
+# Confusion Matrix
+cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]], test_set$label)
+cm_labels$overall
+
+
+# Confusion Matrix
+cm_labels.validation <- confusionMatrix(ransomware_group.prediction.validation$prediction[[2]], validation_blacks$label)
+cm_labels.validation$overall
+
+# Set number of clusters to be equal to number of known ransomware groups (ignoring the whites)
+n_groups <- length(unique(ransomware$label)) - 1
+n_groups
+
+# K-Means Clustering
+# https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
+
+som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
+
+plot(som_model2,
+     main = 'K-Means Clustering',
+     type = "property",
+     property = som.cluster$cluster,
+     palette.name = topo.colors)
+add.cluster.boundaries(som_model2, som.cluster$cluster)
+
+```
 
 ### Final Method: Combined Methods 1 and 3
 
-Using the results from Random Forest, isolate the black addresses first, and then run that subset through an SOM algorithm. Compare final results to original paper. These go in a "results" section. (below)
+Using the results from the Random Forests model, we first isolate the "black" addresses and then run only that subset through the categorical SOM algorithm. The final results are compared to the original paper in the Results section below.
+
+```{r combined_methods, echo=FALSE}
+# The combined two-stage pipeline is assembled in Final_method.R (added in this
+# commit): Random Forests first separates black from white addresses, then a
+# categorical SOM groups the black addresses into ransomware families.
-## Results & Performance (Chunk #4)
+
+
+```
+
+## Results & Performance
 
 ### Results
 
+ In the original paper, Akcora et al. tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the highest Precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610. Compare this to our final Precision value of 1.000. The gap is almost embarrassing, and is large enough that the result deserves a careful check (for example, for leakage of the outcome into the predictors) before being taken at face value.
+
 ### Performance
 
- In terms of what? Time? RAM?
+ The overall script takes X hours and X minutes to run on the aforementioned hardware. This could be optimized, but given that it is an eight-year-old laptop, this is not too unreasonable. It takes me longer to compile LibreOffice.
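+
+Rather than hard-coding the timing figure above, the elapsed time could be reported from the `tictoc` calls already added in the setup chunk. A minimal sketch, not evaluated here; the object names are illustrative:
+
+```{r timing-sketch, eval=FALSE}
+# toc() invisibly returns the start and stop times recorded by tic()
+timing <- toc(quiet = TRUE)
+elapsed_minutes <- round((timing$toc - timing$tic) / 60, 1)
+elapsed_minutes
+```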
 
 ## Summary
 
 ### Comparison to original paper and impact of findings
 
+ The two-part approach developed here compares favorably with the original paper: the best TDA models in Akcora et al. achieved a Precision of only 0.1610, while the binary Random Forests step reported above reaches a Precision of 1.000 on the validation set (a result that, as noted in the Results section, deserves a careful sanity check).
+
 ### Limitations
 
+ SOMs seem to be easy to misconfigure. Perhaps a dual Random Forests approach would work better.
+
 ### Future Work
 
-I only scratched he surface of the SOM algorithm which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation, somehow.
+I only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation.
@@ -401,5 +830,16 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
 
 [5] BitcoinHeist Ransomware Address Dataset [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
 
+[6] Available Models - The `caret` package [http://topepo.github.io/caret/available-models.html](http://topepo.github.io/caret/available-models.html)
+
+[7] Ron Wehrens and Johannes Kruisselbrink, Package ‘`kohonen`’ @ CRAN (2019) [https://cran.r-project.org/web/packages/kohonen/kohonen.pdf](https://cran.r-project.org/web/packages/kohonen/kohonen.pdf)
+
+[XMR] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
+Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
+
+```{r end timer, echo=FALSE}
+# End timer
+toc()
+```
\ No newline at end of file
diff --git a/Detecting_Bitcoin_Ransomware.pdf b/Detecting_Bitcoin_Ransomware.pdf
index 0abeb77..b355f76 100644
Binary files a/Detecting_Bitcoin_Ransomware.pdf and b/Detecting_Bitcoin_Ransomware.pdf differ
diff --git a/Final_method.R b/Final_method.R
new file mode 100644
index 0000000..0449f59
--- /dev/null
+++ b/Final_method.R
@@ -0,0 +1,254 @@
+##################################################
+## Ransomware Detection on the Bitcoin Blockchain
+## using Random Forests and Self Organizing Maps
+##
+## Kaylee Robert Tejeda
+## October 31, 2021
+##
+## Two-stage approach: Random Forests separates ransomware ("black") from non-ransomware ("white") addresses, then a categorical SOM assigns ransomware families.
+#################################################
+
+# Start a timer for the whole script (comment out these lines to skip timing)
+library(tictoc)
+tic(quiet = FALSE)
+
+# Install necessary packages if not already present
+if(!require(tidyverse)) install.packages("tidyverse")
+if(!require(caret)) install.packages("caret")
+if(!require(randomForest)) install.packages("randomForest")
+if(!require(kohonen)) install.packages("kohonen")
+if(!require(parallel)) install.packages("parallel")
+
+# Load Libraries
+library(tidyverse)
+library(caret)
+library(randomForest)
+library(kohonen)
+library(parallel)
+
+# Download data
+url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
+dest_file <- "data/data.zip"
+if(!dir.exists("data"))dir.create("data")
+if(!file.exists(dest_file))download.file(url, destfile = dest_file)
+
+# Unzip into CSV
+if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file, "BitcoinHeistData.csv", exdir="data")
+
+# Import data from CSV
+ransomware <- read_csv("data/BitcoinHeistData.csv")
+
+# Turn labels into factors, bw is a binary factor for ransomware/non-ransomware
+ransomware <- ransomware %>% mutate(label=as.factor(label), bw=as.factor(ifelse(label=="white", "white", "black")))
+
+# Validation set made from 50% of BitcoinHeist data, reduce later if possible.
Binary outcomes (bw) +test_index <- createDataPartition(y = ransomware$bw, times = 1, p = .5, list = FALSE) + +workset <- ransomware[-test_index,] +validation <- ransomware[test_index,] + +# Split the working set into a training set and a test set @ 50%, reduce later if possible. Binary outcomes (bw) +test_index <- createDataPartition(y = workset$bw, times = 1, p = .5, list = FALSE) + +train_set <- workset[-test_index,] +test_set <- workset[test_index,] + +# Separate into Black and White groups using Random Forests + +message("First to separate in to black and white groups.") + +#Sample every nth row due to memory constraints +train_samp <- train_set[seq(1, nrow(train_set), 100), ] + +# Keep only numeric columns with highest coefficients of variation for dimension reduction +train_num <- train_samp %>% select(neighbors, income) + +# Binary outputs, black=ransomware, white=non-ransomware, train set +train_bw <- train_samp$bw + +#Sample every nth row due to memory constraints +set.seed(23) +test_samp <- test_set[seq(1, nrow(train_set), 100), ] + +# Dimension reduction again +test_num <- test_samp %>% select(neighbors, income) + +# Same for test set +test_bw <- test_samp$bw + +# Lower CV numbers +control <- trainControl(method="cv", number = 10) +grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12)) + +# Train Random Forests model +rf_model <- train(train_num, train_bw, method="rf", trControl = control, tuneGrid=grid) + +# Fit model +fit_rf <- randomForest(train_samp, train_bw, + minNode = rf_model$bestTune$mtry) + +# Measure accuracy of model against test sample +y_hat_rf <- predict(fit_rf, test_samp) +cm <- confusionMatrix(y_hat_rf, test_bw) +message("Overall accuracy for the test set is ", cm$overall["Accuracy"]) +cm + +# Measure accuracy of model against full validation set + +y_hat_rf <- predict(fit_rf, validation) +cm <- confusionMatrix(y_hat_rf, validation$bw) +message("Overall accuracy for the validation set is ", cm$overall["Accuracy"]) +cm + +# From here, trim down set to ONLY the black addresses and apply SOMs... + +message("Now we further categorize black address into ransomware families.") + +# Try categorical SOMs on black-only addresses.... +#!! This is NOT right, is it? +#!! It would be even MORE impressive if I removed all the PREDICTED whites from +#!! the test set instead and started there. + +blacks <- ransomware %>% filter(!label=="white") + +# Validation set made from 50% of BitcoinHeist data, reduce later if possible. Categorical outcomes +set.seed(23) +test_index <- createDataPartition(y = blacks$label, times = 1, p = .5, list = FALSE) + +workset_blacks <- blacks[-test_index,] +temp <- blacks[test_index,] + +# Make sure addresses in validation set are also in working set... +# validation <- temp %>% +# semi_join(workset, by = "address") + +# Add rows removed from validation set back into working set... +#removed <- anti_join(temp, validation) +#workset <- rbind(workset, removed) + +# ... Or not +validation_blacks <- temp + +# Split the working set into a training set and a test set @ 50%, reduce later if possible. Binary outcomes (bw) +set.seed(5) +test_index <- createDataPartition(y = workset_blacks$label, times = 1, p = .5, list = FALSE) + +# Split the working set into a training set and a test set @ 50%, reduce later if possible. 
Categorical outcomes +#test_index <- createDataPartition(y = workset$label, times = 1, p = .5, list = FALSE) + +train_set <- workset_blacks[-test_index,] +temp <- workset_blacks[test_index,] + +# Make sure addresses in validation set are also in working set.... +#test_set <- temp %>% +# semi_join(train_set, by = "address") + +# Add rows removed from validation set back into working set.... +#removed <- anti_join(temp, test_set) +#train_set <- rbind(train_set, removed) + +# ....Or not +test_set <- temp + +##!! Data preparation is done, now focusing on Self Organizing Maps as our method +##!! Start here after reworking the data prep steps above. + +# Keep only numeric columns, ignoring dates and looped for now (insert factor analysis impVar here?) +train_num <- train_set %>% select(length, weight, count, neighbors, income) + +# SOM function can only work on matrices +train_mat <- as.matrix(scale(train_num)) + +# Switching to supervised SOMs +test_num <- test_set %>% select(length, weight, count, neighbors, income) + +# Note that when we rescale our testing data we need to scale it according to how we scaled our training data. +test_mat <- as.matrix(scale(test_num, center = attr(train_mat, + "scaled:center"), scale = attr(train_mat, "scaled:scale"))) + +# Categorical +train_label <- train_set$label %>% classvec2classmat() + +# Same for test set +test_label <- test_set$label %>% classvec2classmat() + +# Create Data list for supervised SOM +# +train_list <- list(independent = train_mat, dependent = train_label) + +# Calculate idea grid size according to: +# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps + +# Formulaic method 1 +grid_size <- round(sqrt(5*sqrt(nrow(train_set)))) +# Based on categorical number, method 2 +#grid_size = ceiling(sqrt(length(unique(ransomware$label)))) +grid_size + +# Create SOM grid +train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE) + +# Set magic seed number +set.seed(23) + +## Now build the model. +som_model2 <- xyf(train_mat, train_label, + grid = train_grid, + rlen = 100, + mode="pbatch", # or: alpha = c(0.05,0.01), + cores = detectCores(), # detectCores() - 1 if system locks during calculation + keep.data = TRUE +) + +# Now test predictions of test set +# https://clarkdatalabs.github.io/soms/SOM_NBA + +test_list <- list(independent = test_mat, dependent = test_label) + +ransomware_group.prediction <- predict(som_model2, newdata = test_list) +#table(test_set$label, ransomware_group.prediction$prediction[[2]]) + +# Confusion Matrix +cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]], test_set$label) +message("Overall accuracy for the test set is ", cm_labels$overall["Accuracy"]) +#cm_labels + +# Now test predictions of validation set + +# Switching to supervised SOMs +valid_num <- validation_blacks %>% select(length, weight, count, neighbors, income) + +# Note that when we rescale our testing data we need to scale it according to how we scaled our training data. 
+valid_mat <- as.matrix(scale(valid_num, center = attr(train_mat, + "scaled:center"), scale = attr(train_mat, "scaled:scale"))) + +valid_label <- validation_blacks$label + +valid_list <- list(independent = valid_mat, dependent = valid_label) + +ransomware_group.prediction.validation <- predict(som_model2, newdata = valid_list) +#table(validation_blacks$label, ransomware_group.prediction.validation$prediction[[2]]) + +# Confusion Matrix +cm_labels.validation <- confusionMatrix(ransomware_group.prediction.validation$prediction[[2]], validation_blacks$label) +message("Overall accuracy for the validation set is ",cm_labels.validation$overall["Accuracy"]) +#cm_labels.validation + +# Set number of clusters to be equal to number of known ransomware groups (ignoring the whites) +n_groups <- length(unique(ransomware$label)) - 1 +n_groups + +# K-Means Clustering +# https://www.polarmicrobes.org/microbial-community-segmentation-with-r/ + +som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups) + +plot(som_model2, + main = 'K-Means Clustering', + type = "property", + property = som.cluster$cluster, + palette.name = topo.colors) +add.cluster.boundaries(som_model2, som.cluster$cluster) + +#End timer +toc() \ No newline at end of file