Graphs are starting to look good. Now pick the best ones and leave the rest.

This commit is contained in:
shelldweller 2021-10-28 07:04:34 -06:00
parent 50f05fcd5c
commit 6801d0f326
3 changed files with 91 additions and 67 deletions


---
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain using Random Forest and Self Organizing Maps
subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
keywords:
- blockchain
- ransomware
- machine learning
- Random Forest
- Self Organizing Maps
- SOMs
- cryptocurrency
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
x <- def.chunk.hook(x, options)
ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x,
"\n\n \\normalsize"), x)
})
```
### Outline of Steps Taken
1. Analyze the data set numerically and visually; note any patterns and look for insights.
2. Binary separation using Self Organizing Maps.
3. Improved binary separation using Random Forest.
4. Categorical classification using Self Organizing Maps.
5. Visualize clustering to analyze results further.
6. Generate a confusion matrix to quantify results.
---
no_nas <- sum(is.na(ransomware))
```
---
### Exploration and Visualization
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
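To make that imbalance concrete, here is a minimal base-R sketch using a synthetic label vector (the real labels live in `ransomware$label`; the seed, sample size, and exact proportions here are made up for illustration):

```r
# Simulate labels with roughly the same ~2% "black" (ransomware) share
set.seed(5)
labels <- sample(c("white", "black"), size = 10000,
                 replace = TRUE, prob = c(0.98, 0.02))

# Overall proportion of ransomware-like labels
prop_black <- mean(labels == "black")
prop_black

# Split across 28 families, the average family is a sliver of the data
prop_black / 28
```

Even before subdividing by family, fewer than one address in fifty is a positive example, which is why the later methods separate "ransomware or not" before attempting any per-family classification.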
```
The proportion of ransomware addresses in the original data set is `r ransomprop`. The total number of NA or missing values in the original data set is `r no_nas`.
```{r data-sparsness, echo=FALSE}
labels <- ransomware$label %>% summary()
knitr::kable(
Let's take a look at the distribution of the different features. Note how skewed the non-temporal features are, some of them being bimodal:
```{r histograms, echo=FALSE, warning=FALSE}
########################################################
## Plot histograms for each column using facet wrap
########################################################
train_hist <- train_samp %>% select(-address, -label, -bw)
train_long <- train_hist %>% # Apply pivot_longer function
pivot_longer(colnames(train_hist)) %>%
as.data.frame()
# Histograms per column
ggp1 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
  geom_histogram(aes(y = ..density..), bins=20) +
  geom_density(col = "green", size = .5) +
  facet_wrap(~ name, scales = "free")

ggp1
```
Some of these look better on a log scale x-axis.
```{r log scale histograms, echo=FALSE, warning=FALSE}
# Log scale on value axis, does not make sense for temporal variables
ggp2 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_histogram(aes(y = ..density..), bins=20) +
  geom_density(col = "green", size = .5) +
  scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
ggp2
```
Now we can compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
```{r cv-results, echo=FALSE}
# Summarize results in a table and a plot
knitr::kable(coeff_vars)
plot(coeff_vars)
```
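The coefficient of variation computed above is simply a column's standard deviation divided by its mean. A minimal base-R sketch on toy columns (the names `income` and `neighbors` mirror the real features, but the numbers are invented):

```r
# Coefficient of variation per column: sd(x) / mean(x)
# Larger values indicate larger spread relative to the column's scale
toy <- data.frame(income    = c(1, 2, 100),  # heavily right-skewed
                  neighbors = c(1, 2, 3))    # evenly spread

coeff_var <- sapply(toy, function(x) sd(x) / mean(x))
sort(coeff_var, decreasing = TRUE)  # the skewed column dominates
```

Because the CV divides by the mean, it lets columns on wildly different scales be compared directly, which is why it is used here instead of the raw standard deviation.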
From this, it appears that `r selected_features[1]` has the widest relative variability, followed by `r selected_features[2]`. These are also the features most strongly skewed to the right, meaning that a few addresses have very high values for each of these features while the bulk of the data set has very low values.
```{r shrimp-percentage, echo=FALSE, include=FALSE}
shrimp <- train_samp %>% filter(income < 10^8)
```
The proportion of ransomware addresses among wallets holding less than one full bitcoin is `r mean(shrimp$bw == "black")`.
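The filter-then-average pattern behind that number can be sketched on a toy data frame (incomes in satoshi, where $10^8$ satoshi is one bitcoin; the values and labels here are made up):

```r
# Toy addresses: income in satoshi, bw label
df <- data.frame(income = c(5e7, 2e8, 1e6, 9e7),
                 bw     = c("black", "white", "black", "white"))

# Keep addresses holding less than one full bitcoin (10^8 satoshi)
shrimp <- subset(df, income < 1e8)

# Proportion of those small wallets that are ransomware ("black")
mean(shrimp$bw == "black")
```

The logical comparison yields `TRUE`/`FALSE`, so `mean()` of it is just the fraction of matches.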
---
### Insights Gained from Exploration
Let's see how well the SOM approach can model the data in a black/white fashion.
## used to compare with the better method. If, for some reason, you want to
## compile the report without this section, you can just comment it all out
## or remove it because nothing is needed from Method Part 0 for any of the
## other methods. In other words, it can be safely skipped if you are short on
## time or RAM.
##############################################################################
som1_train_mat <- as.matrix(scale(som1_train_num))
# Switching to supervised SOMs
som1_test_num <- test_set %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it
# according to how we scaled our training data.
som1_test_mat <-
as.matrix(scale(som1_test_num, center = attr(som1_train_mat, "scaled:center"),
scale = attr(som1_train_mat, "scaled:scale")))
# Binary outputs, black=ransomware, white=non-ransomware, train set
som1_train_bw <- train_set$bw %>% classvec2classmat()
som1_test_bw <- test_set$bw %>% classvec2classmat()
# Create Data list for supervised SOM
#
som1_train_list <-
list(independent = som1_train_mat, dependent = som1_train_bw)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
grid_size
# Create SOM grid
som1_train_grid <-
somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
## Now build the model.
som_model1 <- xyf(som1_train_mat, som1_train_bw,
ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)
# Confusion Matrix
som1_cm_bw <-
confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)
# Now test predictions of validation set
# Switching to supervised SOMs
valid_num <- validation %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it
# according to how we scaled our training data.
valid_mat <-
as.matrix(scale(valid_num, center = attr(som1_train_mat, "scaled:center"),
scale = attr(som1_train_mat, "scaled:scale")))
valid_bw <- validation$bw
valid_list <- list(independent = valid_mat, dependent = valid_bw)
ransomware.prediction1.validation <- predict(som_model1, newdata = valid_list)
# Confusion Matrix
cm_bw.validation <-
confusionMatrix(ransomware.prediction1.validation$prediction[[2]],
validation$bw)
```
Here are some graphs; be careful with these, as some are large and slow to produce.
```{r binary som graphs, echo=FALSE}
# Be careful with these, some are really large and take a long time to produce.
# Visualize clusters
#plot(som_model1, type = 'mapping', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 1
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,1],
#main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 2
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,2],
#main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 3
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,3],
#main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 4
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,4],
#main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 5
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,5],
#main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
#cat(" \n")
```
This winds up being the most resource-intensive and least accurate method of those tried. It was left out of the final version of the script and is not worth running except to compare against the next method.
### Method Part 1: Binary Random Forest to isolate ransomware addresses before categorization.
A Random Forest model is trained using ten-fold cross validation and a tuning grid over `mtry`, the number of variables randomly sampled as candidates at each split, with values $\{2, 4, 6, 8, 10, 12\}$, each one being checked for optimization.
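A minimal sketch of that tuning setup with the `caret` package (the `train()` call is shown commented out because it needs the prepared training data; `train_set` and the `bw` outcome column are the names used elsewhere in this report, and the formula is illustrative):

```r
# Tuning grid: candidate mtry values to be checked by cross validation
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8, 10, 12))
nrow(rf_grid)  # six candidate models

# Ten-fold cross validation control for caret::train
# rf_control <- caret::trainControl(method = "cv", number = 10)

# Fit: caret refits the forest for each mtry and keeps the value
# with the best resampled accuracy
# rf_model <- caret::train(bw ~ ., data = train_set, method = "rf",
#                          trControl = rf_control, tuneGrid = rf_grid)
```

With `method = "cv"` and `number = 10`, each candidate `mtry` is scored on ten held-out folds before the winner is refit on the full training set.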
bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[3] Cuneyt Gurcan Akcora, Yitao Li, Yulia R. Gel, Murat Kantarcioglu (June 19, 2019) [BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain](https://arxiv.org/abs/1906.07852)
[4] UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php
[5] BitcoinHeist Ransomware Address Dataset
https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset
[6] Available Models - The `caret` package http://topepo.github.io/caret/available-models.html
[7] Ron Wehrens and Johannes Kruisselbrink, Package `kohonen` @ CRAN (2019) https://cran.r-project.org/web/packages/kohonen/kohonen.pdf
[8] How many nodes for self-organizing maps? (Oct 22, 2021) https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
[9] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)


#d_approx <- dist(pca$x[, 1:2])
#qplot(d, d_approx) + geom_abline(color="red")
########################################################
## Histograms of each of the columns to show skewness
## Plot histograms for each column using facet wrap
########################################################
train_long <- train_num %>% # Apply pivot_longer function
pivot_longer(colnames(train_num)) %>%
as.data.frame()
# Histograms per column
ggp1 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
facet_wrap(~ name, scales = "free")
ggp1
# Log scale on value axis, does not make sense for temporal variables
ggp2 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
ggp2
# Clean up environment
rm(pca, x, coeff_vars, d, means, pc, sds)