Cut binary SOMs out of the paper (saves hours of compile time; might stick with that). Also wrote a first draft of the Final Method script. Need to use the first half to inform the set for the second half. Might work on that tonight....

This commit is contained in:
shelldweller 2021-10-18 20:39:23 -06:00
parent 76f6ffb677
commit be6bd62bbc
3 changed files with 712 additions and 18 deletions


@ -27,6 +27,10 @@ geometry: margin=2cm
\vbox{\copy2\box0}\box2}}
```{r setup, include=FALSE}
# Load and start timer
library(tictoc)
tic()
knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
@ -77,6 +81,13 @@ These variables are defined rather abstractly, viewing the blockchain as a topol
```{r data-prep, echo=FALSE, include=FALSE}
# Set the repository to a known working mirror just in case it has not already been set
cat("Setting Seattle repository")
r = getOption("repos")
r["CRAN"] = "http://cran.fhcrc.org/"
options(repos = r)
rm(r)
# Install necessary packages
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
@ -186,8 +197,6 @@ test_index <- createDataPartition(y = workset$bw, times = 1, p = .5, list = FALS
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
# Clean up environment
rm(dest_file, url)
#Sample every nth row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]
@ -210,7 +219,7 @@ no_nas <- sum(is.na(ransomware))
```
### Exploration and Visualization (do this part last....)
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 29 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
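As a quick sanity check, these proportions and group counts can be pulled straight from the prepared data. The following is a sketch (not evaluated when knitting), assuming the `ransomware` table with its `label` and `bw` columns from the data preparation chunk:
```{r class-balance-sketch, eval=FALSE}
# Proportion of addresses flagged as ransomware ("black")
mean(ransomware$bw == "black")
# Addresses per ransomware family, smallest families first
ransomware %>%
  filter(label != "white") %>%
  count(label) %>%
  arrange(n)
# Confirm there are no missing values
sum(is.na(ransomware))
```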
@ -284,7 +293,7 @@ plot(coeff_vars)
From this, it appears that *income* has the widest range of variability, followed by *neighbors*. These are also the features that are most strongly skewed to the right, meaning that a few addresses have really high values for each of these features while the bulk of the data set has very low values for these numbers.
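The coefficients of variation plotted above come from an earlier chunk; roughly, they amount to sd/mean for each numeric feature. A sketch, assuming the feature columns named in the data preparation chunk:
```{r coeff-var-sketch, eval=FALSE}
# Coefficient of variation (sd / mean) for each numeric feature
ransomware %>%
  summarize(across(c(length, weight, count, looped, neighbors, income),
                   ~ sd(.x) / mean(.x))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "coeff_var") %>%
  arrange(desc(coeff_var))
```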
Now do the following (after filling in methods, results, and conclusions, since those are done already):
6) Break into groups somehow. Graph variables per group? Show how the variables are distributed for each ransomware group? Percent ransomware per day of the week, for example: is ransomware more prevalent on a particular day of the week? Break other numerical values into bins and graph the percentage per bin. Look for trends and correlations between groups/variables, and display them here. MORE OF THIS....
@ -330,55 +339,475 @@ data.frame(pca$x[,1:2], bw=train_samp$bw) %>%
#d_approx <- dist(pca$x[, 1:2])
#qplot(d, d_approx) + geom_abline(color="red")
# Clean up environment
rm(pca, x, coeff_vars, d, means, pc, sds)
```
### Insights Gained from Exploration
Maybe it's better to approach this as a binary problem? At least at first; let's see how far that gets us....
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are impractical otherwise.
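Concretely, the binary view is just a re-labeling: the data preparation step (and the Final Method script) collapses every ransomware family into a single "black" class, as shown below.
```{r bw-relabel, eval=FALSE}
# Binary factor: "white" stays "white", every ransomware family becomes "black"
ransomware <- ransomware %>%
  mutate(label = as.factor(label),
         bw = as.factor(ifelse(label == "white", "white", "black")))
```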
## Modeling approach
An overview of why I picked the methods that I did: the original paper notes that Random Forests were hard to apply here, and the data are topological to begin with, which is what led me to SOMs. Also, describe the reasoning behind the binary approach and what I learned about SOMs.
#### Random Forests
#### Self Organizing Maps
Akcora et al. mention that they tried to model the data using a Random Forests method, but that the complexity of the data set led to problems with that approach.[3] Switching to a binary perspective on the problem might alleviate some of that complexity, and is worth another look. The topological nature of the way the data set has been described numerically led me to search for topological machine learning methods. Searching for *topo* in the documentation for the `caret` package [6] turned up the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough for me to investigate further.
### Method 1: Binary Random Forests
If we ask a simpler question, is this a useful approach? Random Forests were mentioned as not working well in the original paper; try again using a binary black/white approach and show how this simplification leads to (near-)perfect accuracy. Include a confusion matrix?
Using the `randomForest` library, we train a model on our training set and test against the "black/white" categorization on our test set.
```{r binary_random_forests, echo=FALSE, include=FALSE}
# Install randomForest package if needed
if(!require(randomForest)) install.packages("randomForest")
library(randomForest)
# Keep only numeric columns with highest coefficients of variation for dimension reduction
train_num <- train_samp %>% select(neighbors, income)
# Binary outputs, black=ransomware, white=non-ransomware, train set
train_bw <- train_samp$bw
#Sample every nth row due to memory constraints
set.seed(23)
test_samp <- test_set[seq(1, nrow(test_set), 100), ]
# Dimension reduction again
test_num <- test_samp %>% select(neighbors, income)
# Same for test set
test_bw <- test_samp$bw
# 10-fold cross-validation; lower the number of folds if resources are limited
control <- trainControl(method="cv", number = 10)
# Tuning grid for mtry (cannot exceed the number of predictor columns, here 2)
grid <- data.frame(mtry = c(1, 2))
# Train Random Forests model
rf_model <- train(train_num, train_bw, method="rf", trControl = control, tuneGrid=grid)
# Fit the final model on the numeric features using the tuned value of mtry
fit_rf <- randomForest(train_num, train_bw,
                       mtry = rf_model$bestTune$mtry)
```
We can see that the results are quite good against the smaller test set and the larger validation set.
```{r binary_random_forests-validation, echo=FALSE}
# Check for best tuning parameters
ggplot(rf_model)
rf_model$bestTune
# Check for enough trees
plot(fit_rf)
# Measure accuracy of model against test sample
y_hat_rf <- predict(fit_rf, test_samp)
cm_test <- confusionMatrix(y_hat_rf, test_bw)
message("Confusion Matrix for test set:")
cm_test
# Measure accuracy of model against full validation set
y_hat_rf <- predict(fit_rf, validation)
cm_validation <- confusionMatrix(y_hat_rf, validation$bw)
message("Confusion Matrix for validation set:")
cm_validation
```
### Method 2: Binary SOMs
If we ask the same question of a more sophisticated and topological approach, how good is the model? Mention how the original paper was topological in nature, and how this led to the investigation of SOMs. Repeat the binary "b/w" approach using SOMs. This accuracy is still pretty good, but not *as* good as the random forest method. Point out how SOMs are really meant for classification into _many_ groups. This leads to an insight (see above): what if we first _isolate_ the "black" addresses using Random Forests, and then categorize the black-only subset (< 2%) using categorical SOMs? This leads to a two-part system...
Note to self: I don't even use this part in the final script. Should I leave it out of the paper too?
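If this section does stay in, the note above could be addressed with something like the following (a hypothetical sketch, not part of the current script): filter on the Random Forests *predictions* rather than on the true labels before handing addresses to the SOM.
```{r predicted-blacks-sketch, eval=FALSE}
# Hypothetical: keep only the addresses predicted to be black by the Random Forests model
# (assumes fit_rf and test_samp from the Random Forests chunk above)
pred_bw <- predict(fit_rf, test_samp)
blacks_predicted <- test_samp[pred_bw == "black", ]
```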
```{r binary_soms, echo=FALSE, include=FALSE}
# Install kohonen package if needed
if(!require(kohonen)) install.packages("kohonen")
# Load kohonen library
library(kohonen)
# Install parallel package if needed
if(!require(parallel)) install.packages("parallel")
# Load parallel library
library(parallel)
# Keep only numeric columns, ignoring dates and looped.
#train_num <- train_set %>% select(length, weight, count, neighbors, income)
# SOM function can only work on matrices
#train_mat <- as.matrix(scale(train_num))
# Switching to supervised SOMs
#test_num <- test_set %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
#test_mat <- as.matrix(scale(test_num, center = attr(train_mat,
# "scaled:center"), scale = attr(train_mat, "scaled:scale")))
# Binary outputs, black=ransomware, white=non-ransomware, train set
#train_bw <- train_set$bw %>% classvec2classmat()
# Same for test set
#test_bw <- test_set$bw %>% classvec2classmat()
# Create Data list for supervised SOM
#
#train_list <- list(independent = train_mat, dependent = train_bw)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
# Formulaic method 1
#grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2
#grid_size = ceiling(sqrt(length(unique(ransomware$bw))))
#grid_size
# Create SOM grid
#train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
# Set magic seed for reproducibility
#set.seed(23)
## Now build the model.
#som_model <- xyf(train_mat, train_bw,
# grid = train_grid,
# rlen = 100,
# mode="pbatch", # or: alpha = c(0.05,0.01),
# cores = detectCores(), # detectCores() - 1 if system becomes unresponsive during training
# keep.data = TRUE
#)
# Now test predictions
# https://clarkdatalabs.github.io/soms/SOM_NBA
#test_list <- list(independent = test_mat, dependent = test_bw)
#ransomware.prediction <- predict(som_model, newdata = test_list)
# Now test predictions of validation set
# Switching to supervised SOMs
#valid_num <- validation %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
#valid_mat <- as.matrix(scale(valid_num, center = attr(train_mat,
# "scaled:center"), scale = attr(train_mat, "scaled:scale")))
#valid_bw <- validation$bw
#valid_list <- list(independent = valid_mat, dependent = valid_bw)
# Requires up to 16GB of RAM, skip if resources are limited
#ransomware.prediction.validation <- predict(som_model, newdata = valid_list)
```
```{r binary_soms-cms, echo=FALSE}
#table(test_set$bw, ransomware.prediction$prediction[[2]]) %>% knitr::kable()
#table(validation$bw, ransomware.prediction.validation$prediction[[2]]) %>% knitr::kable()
# These are bogging down the pdf. Choose only a few?
# Visualize clusters
#plot(som_model, type = 'mapping', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Distance map
#plot(som_model, type = 'quality', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize counts
#plot(som_model, type = 'counts', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize fan diagram
#plot(som_model, type = 'codes', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 1
#plot(som_model, type = 'property', property = som_model$codes[[1]][,1], main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 2
#plot(som_model, type = 'property', property = som_model$codes[[1]][,2], main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 3
#plot(som_model, type = 'property', property = som_model$codes[[1]][,3], main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 4
#plot(som_model, type = 'property', property = som_model$codes[[1]][,4], main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 5
#plot(som_model, type = 'property', property = som_model$codes[[1]][,5], main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Confusion Matrix
#cm_bw <- confusionMatrix(ransomware.prediction$prediction[[2]], test_set$bw)
#cm_bw$overall
# Now test predictions of validation set
# Confusion Matrix
#cm_bw.validation <- confusionMatrix(ransomware.prediction.validation$prediction[[2]], validation$bw)
#cm_bw.validation$overall
```
### Method 3: Categorical SOMs
Describe categorical SOM work here, show results. This is where the pretty colored hex-graphs show up.
```{r categorical_soms, echo=FALSE, include=FALSE}
# Do this here
# Try categorical SOMs on black-only addresses....
#!! This is NOT right, is it?
#!! It would be even MORE impressive if I removed all the PREDICTED whites from
#!! the test set instead and started there.
blacks <- ransomware %>% filter(!label=="white")
# Validation set made from 50% of the black (ransomware) addresses, reduce later if possible. Categorical outcomes
set.seed(23)
test_index <- createDataPartition(y = blacks$label, times = 1, p = .5, list = FALSE)
workset_blacks <- blacks[-test_index,]
temp <- blacks[test_index,]
# Make sure addresses in validation set are also in working set...
# validation <- temp %>%
# semi_join(workset, by = "address")
# Add rows removed from validation set back into working set...
#removed <- anti_join(temp, validation)
#workset <- rbind(workset, removed)
# ... Or not
validation_blacks <- temp
# Split the working set into a training set and a test set @ 50%, reduce later if possible. Categorical outcomes (label)
set.seed(23)
test_index <- createDataPartition(y = workset_blacks$label, times = 1, p = .5, list = FALSE)
# Split the working set into a training set and a test set @ 50%, reduce later if possible. Categorical outcomes
#test_index <- createDataPartition(y = workset$label, times = 1, p = .5, list = FALSE)
train_set <- workset_blacks[-test_index,]
temp <- workset_blacks[test_index,]
# Make sure addresses in validation set are also in working set....
#test_set <- temp %>%
# semi_join(train_set, by = "address")
# Add rows removed from validation set back into working set....
#removed <- anti_join(temp, test_set)
#train_set <- rbind(train_set, removed)
# ....Or not
test_set <- temp
##!! Data preparation is done, now focusing on Self Organizing Maps as our method
##!! Start here after reworking the data prep steps above.
# Keep only numeric columns, ignoring dates and looped for now (insert factor analysis impVar here?)
train_num <- train_set %>% select(length, weight, count, neighbors, income)
# SOM function can only work on matrices
train_mat <- as.matrix(scale(train_num))
# Switching to supervised SOMs
test_num <- test_set %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
test_mat <- as.matrix(scale(test_num, center = attr(train_mat,
"scaled:center"), scale = attr(train_mat, "scaled:scale")))
# Categorical
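# classvec2classmat() turns the factor into a one-hot class-membership matrix,
# which is the format xyf() expects for the dependent (supervised) layer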
train_label <- train_set$label %>% classvec2classmat()
# Same for test set
test_label <- test_set$label %>% classvec2classmat()
# Create Data list for supervised SOM
#
train_list <- list(independent = train_mat, dependent = train_label)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
# Formulaic method 1
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2
#grid_size = ceiling(sqrt(length(unique(ransomware$label))))
grid_size
# Create SOM grid
train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
# Set magic seed for reproducibility
set.seed(23)
## Now build the model.
som_model2 <- xyf(train_mat, train_label,
grid = train_grid,
rlen = 100,
mode="pbatch", # or: alpha = c(0.05,0.01),
cores = detectCores(), # detectCores() - 1 if system locks during calculation
keep.data = TRUE
)
# Now test predictions of test set
# https://clarkdatalabs.github.io/soms/SOM_NBA
test_list <- list(independent = test_mat, dependent = test_label)
ransomware_group.prediction <- predict(som_model2, newdata = test_list)
# Now test predictions of validation set
# Switching to supervised SOMs
valid_num <- validation_blacks %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
valid_mat <- as.matrix(scale(valid_num, center = attr(train_mat,
"scaled:center"), scale = attr(train_mat, "scaled:scale")))
valid_label <- validation_blacks$label
valid_list <- list(independent = valid_mat, dependent = valid_label)
ransomware_group.prediction.validation <- predict(som_model2, newdata = valid_list)
```
```{r categorical_soms_cms, echo=FALSE}
#table(test_set$label, ransomware_group.prediction$prediction[[2]]) %>% knitr::kable()
#table(validation_blacks$label, ransomware_group.prediction.validation$prediction[[2]]) %>% knitr::kable()
# These are good plots, fix their display somehow...
# Visualize clusters
#plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Distance map
#plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize counts
#plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize fan diagram
#plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 1
#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,1], main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 2
#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,2], main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 3
#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,3], main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 4
#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,4], main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Visualize heatmap for variable 5
#plot(som_model2, type = 'property', property = som_model2$codes[[1]][,5], main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
# cat(" \n")
# Confusion Matrix
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]], test_set$label)
cm_labels$overall
# Confusion Matrix
cm_labels.validation <- confusionMatrix(ransomware_group.prediction.validation$prediction[[2]], validation_blacks$label)
cm_labels.validation$overall
# Set number of clusters to be equal to number of known ransomware groups (ignoring the whites)
n_groups <- length(unique(ransomware$label)) - 1
n_groups
# K-Means Clustering
# https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
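# Cluster the SOM codebook vectors (one per map node) with k-means, using as many
# clusters as there are known ransomware families, so boundaries can be drawn on the map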
som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
plot(som_model2,
main = 'K-Means Clustering',
type = "property",
property = som.cluster$cluster,
palette.name = topo.colors)
add.cluster.boundaries(som_model2, som.cluster$cluster)
```
### Final Method: Combined Methods 1 and 3
Using the results from Random Forests, isolate the black addresses first, and then run that subset through an SOM algorithm. Compare the final results to the original paper. Those comparisons go in the "Results" section below.
```{r combined_methods, echo=FALSE}
# Do this here
## Results & Performance (Chunk #4)
# Still need to put it all into one script, and then reproduce the results here....
```
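Until that chunk is filled in, here is a minimal sketch of how the two trained models might be chained on the validation set (assuming `fit_rf`, `som_model2`, and `train_mat` from the chunks above; note that the current `Final_method.R` filters on the true labels rather than on the predictions):
```{r combined-sketch, eval=FALSE}
# Stage 1: Random Forests separates predicted black addresses from white ones
pred_bw <- predict(fit_rf, validation)
blacks_pred <- validation[pred_bw == "black", ]
# Stage 2: the categorical SOM assigns a ransomware family to each predicted black address
blacks_num <- blacks_pred %>% select(length, weight, count, neighbors, income)
blacks_mat <- as.matrix(scale(blacks_num,
                              center = attr(train_mat, "scaled:center"),
                              scale = attr(train_mat, "scaled:scale")))
family_pred <- predict(som_model2, newdata = blacks_mat, whatmap = 1)
# family_pred$prediction[[2]] then holds the predicted family for each address
```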
## Results & Performance
### Results
In the original paper, Akcora et al. tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the highest Precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610. Compare this to our final Precision value of 1.000? It is almost embarrassing... did I do something wrong here?
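For reference, the precision quoted here can be read directly off the confusion matrices computed earlier. A sketch, assuming `cm_test` and `cm_validation` from the Random Forests chunk, with "black" as the positive class:
```{r precision-sketch, eval=FALSE}
# Positive Predictive Value (precision), TP / (TP + FP), for the "black" class
cm_test$byClass["Pos Pred Value"]
cm_validation$byClass["Pos Pred Value"]
```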
### Performance
In terms of what? Time? RAM?
The overall script takes X hours and X minutes to run on the aforementioned hardware. This could be optimized, but given that it is running on an eight-year-old laptop, that is not too unreasonable. It takes me longer to compile LibreOffice.
## Summary
### Comparison to original paper and impact of findings
They suck, I rule, 'nuff said.
### Limitations
SOMs seem like they are easy to misconfigure. Perhaps a dual Random Forest approach would be better?
### Future Work
I have only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation, somehow.
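One possible shape for that tuning, sketched under the assumption that the training and test objects from the categorical SOM chunk are available (a simple held-out comparison rather than true cross-validation):
```{r som-tuning-sketch, eval=FALSE}
# Hypothetical grid-size search for the categorical SOM
candidate_sizes <- c(8, 12, 16, 20)
accuracies <- sapply(candidate_sizes, function(gs) {
  g <- somgrid(xdim = gs, ydim = gs, topo = "hexagonal", toroidal = TRUE)
  set.seed(23)
  m <- xyf(train_mat, train_label, grid = g, rlen = 100, mode = "pbatch",
           cores = detectCores(), keep.data = TRUE)
  p <- predict(m, newdata = list(independent = test_mat, dependent = test_label))
  confusionMatrix(p$prediction[[2]], test_set$label)$overall["Accuracy"]
})
data.frame(grid_size = candidate_sizes, accuracy = accuracies)
```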
@ -401,5 +830,16 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[5] BitcoinHeist Ransomware Address Dataset [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
[6] Available Models - The `caret` package [http://topepo.github.io/caret/available-models.html](http://topepo.github.io/caret/available-models.html)
[7] Ron Wehrens and Johannes Kruisselbrink, Package `kohonen` @ CRAN (2019) [https://cran.r-project.org/web/packages/kohonen/kohonen.pdf](https://cran.r-project.org/web/packages/kohonen/kohonen.pdf)
[XMR] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava, Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
```{r end timer, echo=FALSE}
# End timer
toc()
```

Binary file not shown.

Final_method.R (new file, 254 lines)

@ -0,0 +1,254 @@
##################################################
## Ransomware Detection on the Bitcoin Blockchain
## using Random Forests and Self Organizing Maps
##
## Kaylee Robert Tejeda
## October 31, 2021
##
## Make this header better!!!!
#################################################
# Start timer (comment out the following lines to skip timing the script)
library(tictoc)
tic(quiet = FALSE)
# Install necessary packages if not already present
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(randomForest)) install.packages("randomForest")
if(!require(kohonen)) install.packages("kohonen")
if(!require(parallel)) install.packages("parallel")
# Load Libraries
library(tidyverse)
library(caret)
library(randomForest)
library(kohonen)
library(parallel)
# Download data
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if(!dir.exists("data"))dir.create("data")
if(!file.exists(dest_file))download.file(url, destfile = dest_file)
# Unzip into CSV
if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file, "BitcoinHeistData.csv", exdir="data")
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
# Turn labels into factors, bw is a binary factor for ransomware/non-ransomware
ransomware <- ransomware %>% mutate(label=as.factor(label), bw=as.factor(ifelse(label=="white", "white", "black")))
# Validation set made from 50% of BitcoinHeist data, reduce later if possible. Binary outcomes (bw)
test_index <- createDataPartition(y = ransomware$bw, times = 1, p = .5, list = FALSE)
workset <- ransomware[-test_index,]
validation <- ransomware[test_index,]
# Split the working set into a training set and a test set @ 50%, reduce later if possible. Binary outcomes (bw)
test_index <- createDataPartition(y = workset$bw, times = 1, p = .5, list = FALSE)
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
# Separate into Black and White groups using Random Forests
message("First to separate in to black and white groups.")
#Sample every nth row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]
# Keep only numeric columns with highest coefficients of variation for dimension reduction
train_num <- train_samp %>% select(neighbors, income)
# Binary outputs, black=ransomware, white=non-ransomware, train set
train_bw <- train_samp$bw
#Sample every nth row due to memory constraints
set.seed(23)
test_samp <- test_set[seq(1, nrow(test_set), 100), ]
# Dimension reduction again
test_num <- test_samp %>% select(neighbors, income)
# Same for test set
test_bw <- test_samp$bw
# 10-fold cross-validation; lower the number of folds if resources are limited
control <- trainControl(method="cv", number = 10)
# Tuning grid for mtry (cannot exceed the number of predictor columns, here 2)
grid <- data.frame(mtry = c(1, 2))
# Train Random Forests model
rf_model <- train(train_num, train_bw, method="rf", trControl = control, tuneGrid=grid)
# Fit the final model on the numeric features using the tuned value of mtry
fit_rf <- randomForest(train_num, train_bw,
                       mtry = rf_model$bestTune$mtry)
# Measure accuracy of model against test sample
y_hat_rf <- predict(fit_rf, test_samp)
cm <- confusionMatrix(y_hat_rf, test_bw)
message("Overall accuracy for the test set is ", cm$overall["Accuracy"])
cm
# Measure accuracy of model against full validation set
y_hat_rf <- predict(fit_rf, validation)
cm <- confusionMatrix(y_hat_rf, validation$bw)
message("Overall accuracy for the validation set is ", cm$overall["Accuracy"])
cm
# From here, trim down set to ONLY the black addresses and apply SOMs...
message("Now we further categorize black address into ransomware families.")
# Try categorical SOMs on black-only addresses....
#!! This is NOT right, is it?
#!! It would be even MORE impressive if I removed all the PREDICTED whites from
#!! the test set instead and started there.
blacks <- ransomware %>% filter(!label=="white")
# Validation set made from 50% of the black (ransomware) addresses, reduce later if possible. Categorical outcomes
set.seed(23)
test_index <- createDataPartition(y = blacks$label, times = 1, p = .5, list = FALSE)
workset_blacks <- blacks[-test_index,]
temp <- blacks[test_index,]
# Make sure addresses in validation set are also in working set...
# validation <- temp %>%
# semi_join(workset, by = "address")
# Add rows removed from validation set back into working set...
#removed <- anti_join(temp, validation)
#workset <- rbind(workset, removed)
# ... Or not
validation_blacks <- temp
# Split the working set into a training set and a test set @ 50%, reduce later if possible. Categorical outcomes (label)
set.seed(5)
test_index <- createDataPartition(y = workset_blacks$label, times = 1, p = .5, list = FALSE)
# Split the working set into a training set and a test set @ 50%, reduce later if possible. Categorical outcomes
#test_index <- createDataPartition(y = workset$label, times = 1, p = .5, list = FALSE)
train_set <- workset_blacks[-test_index,]
temp <- workset_blacks[test_index,]
# Make sure addresses in validation set are also in working set....
#test_set <- temp %>%
# semi_join(train_set, by = "address")
# Add rows removed from validation set back into working set....
#removed <- anti_join(temp, test_set)
#train_set <- rbind(train_set, removed)
# ....Or not
test_set <- temp
##!! Data preparation is done, now focusing on Self Organizing Maps as our method
##!! Start here after reworking the data prep steps above.
# Keep only numeric columns, ignoring dates and looped for now (insert factor analysis impVar here?)
train_num <- train_set %>% select(length, weight, count, neighbors, income)
# SOM function can only work on matrices
train_mat <- as.matrix(scale(train_num))
# Switching to supervised SOMs
test_num <- test_set %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
test_mat <- as.matrix(scale(test_num, center = attr(train_mat,
"scaled:center"), scale = attr(train_mat, "scaled:scale")))
# Categorical
train_label <- train_set$label %>% classvec2classmat()
# Same for test set
test_label <- test_set$label %>% classvec2classmat()
# Create Data list for supervised SOM
#
train_list <- list(independent = train_mat, dependent = train_label)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
# Formulaic method 1
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2
#grid_size = ceiling(sqrt(length(unique(ransomware$label))))
grid_size
# Create SOM grid
train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
# Set magic seed number
set.seed(23)
## Now build the model.
som_model2 <- xyf(train_mat, train_label,
grid = train_grid,
rlen = 100,
mode="pbatch", # or: alpha = c(0.05,0.01),
cores = detectCores(), # detectCores() - 1 if system locks during calculation
keep.data = TRUE
)
# Now test predictions of test set
# https://clarkdatalabs.github.io/soms/SOM_NBA
test_list <- list(independent = test_mat, dependent = test_label)
ransomware_group.prediction <- predict(som_model2, newdata = test_list)
#table(test_set$label, ransomware_group.prediction$prediction[[2]])
# Confusion Matrix
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]], test_set$label)
message("Overall accuracy for the test set is ", cm_labels$overall["Accuracy"])
#cm_labels
# Now test predictions of validation set
# Switching to supervised SOMs
valid_num <- validation_blacks %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
valid_mat <- as.matrix(scale(valid_num, center = attr(train_mat,
"scaled:center"), scale = attr(train_mat, "scaled:scale")))
valid_label <- validation_blacks$label
valid_list <- list(independent = valid_mat, dependent = valid_label)
ransomware_group.prediction.validation <- predict(som_model2, newdata = valid_list)
#table(validation_blacks$label, ransomware_group.prediction.validation$prediction[[2]])
# Confusion Matrix
cm_labels.validation <- confusionMatrix(ransomware_group.prediction.validation$prediction[[2]], validation_blacks$label)
message("Overall accuracy for the validation set is ",cm_labels.validation$overall["Accuracy"])
#cm_labels.validation
# Set number of clusters to be equal to number of known ransomware groups (ignoring the whites)
n_groups <- length(unique(ransomware$label)) - 1
n_groups
# K-Means Clustering
# https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
plot(som_model2,
main = 'K-Means Clustering',
type = "property",
property = som.cluster$cluster,
palette.name = topo.colors)
add.cluster.boundaries(som_model2, som.cluster$cluster)
#End timer
toc()