fixed all tables and graphs, again. added some code blocks to the method section. all that is left now is text.

This commit is contained in:
shelldweller 2021-11-07 22:36:39 -07:00
parent f6d8ab1cb4
commit 7d4fa10a02
3 changed files with 134 additions and 86 deletions

View File

@ -10,10 +10,8 @@
##
###############################################################################
# Uncomment next line to time script
tic()
# Set the repository mirror to "1: 0-Cloud" for maximum availability
r = getOption("repos")
@ -75,6 +73,7 @@ test_set <- workset[test_index,]
## Data preparation is now done
## Separate into "black" and "white" groups using Random Forests predictions
###############################################################################
# Keep only numeric columns, ignoring temporal features
ransomware_num <- ransomware %>%
select(length, weight, count, looped, neighbors, income)
@ -97,7 +96,7 @@ message("The features with the highest coefficients of variation are ",
selected_features[1], " and ", selected_features[2],
", which will be used to train the binary model.")
# Sample every 100th row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]
# Keep only numeric columns with highest coefficients of variation
@ -106,7 +105,7 @@ train_num <- train_samp %>% select(selected_features[1], selected_features[2])
# Binary labels, black = ransomware, white = non-ransomware, train set
train_bw <- train_samp$bw
# Sample every 100th row due to memory constraints to make test sample same size
test_samp <- test_set[seq(1, nrow(train_set), 100), ]
# Dimension reduction again, selecting features with highest CVs
@ -247,5 +246,5 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
message("Overall accuracy is ", cm_labels$overall["Accuracy"])
# End timer
toc(quiet=FALSE)

View File

@ -72,11 +72,11 @@ acyclic directed path originating from any starter transaction and ending at the
6) *Looped* - the number of starter addresses connected to this address by more than one path
These variables are defined somewhat conceptually, by viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to facilitate quantification of specific transaction patterns. Akcora, et al.$^{[3]}$ give a thorough explanation of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that. Machine learning methods will be applied to the original data set from the same paper, and the new results will be compared to the original ones.
### Data
The data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining for them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
```{r install_load_libraries-and-download_data, echo=FALSE, include=FALSE}
@ -94,6 +94,8 @@ if(!require(kohonen)) install.packages("kohonen")
if(!require(parallel)) install.packages("parallel")
if(!require(matrixStats)) install.packages("matrixStats")
if(!require(xtable)) install.packages("xtable")
if(!require(tictoc)) install.packages("tictoc")
#if(!require(kableExtra)) install.packages("kableExtra")
# Load Libraries
library(tidyverse)
@ -103,6 +105,8 @@ library(kohonen)
library(parallel)
library(matrixStats)
library(xtable)
library(tictoc)
#library(kableExtra)
# Set # of cores, use detectCores() - 1 to leave one for the system
n_cores <- detectCores()
@ -122,39 +126,66 @@ if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file,
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
# Define custom tictoc messages
# tic() message
my.msg.tic <- function(tic, msg)
{
  if (is.null(msg) || is.na(msg) || length(msg) == 0)
  {
    # No elapsed time is available yet at tic(), so just note the start
    outmsg <- "Timer started"
  }
  else
  {
    outmsg <- paste("Starting ", msg, "...", sep="")
  }
  outmsg
}
# toc() message
my.msg.toc <- function(tic, toc, msg, info)
{
if (is.null(msg) || is.na(msg) || length(msg) == 0)
{
outmsg <- paste(round(toc - tic, 3), " seconds elapsed", sep="")
}
else
{
outmsg <- paste(info, ": ", msg, ": ",
round(toc - tic, 3), " seconds elapsed", sep="")
}
}
```
A summary of the data set shows the range of values and size of the sample. Some of the features, such as *weight* for example, already appear to be very skewed just from the quartiles. In the case of *weight*, the third quartile is only `r quantile(ransomware$weight, 0.75)`, meaning that 75% of the data is at or below this value for *weight* (with a minimum of `r min(ransomware$weight)`). The maximum *weight* value, however, is `r max(ransomware$weight)`. This means that nearly the entire range of values occurs in the upper 25%. In fact, many of the numerical features are similarly skewed, as you can see in the following summary.
```{r data_summary, echo=FALSE, size="tiny"}
# Summary
ransomware %>% summary() %>% knitr::kable(caption="Summary of data set")
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (day of the year as an integer from 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (i.e. not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$. A listing of the first ten rows provides a sample of the features associated with each observation.
```{r data_head, echo=FALSE, size="tiny"}
# Inspect data
ransomware %>% head() %>%
knitr::kable(caption="First ten entries of data set")
```
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. *White* Bitcoin addresses were capped at one thousand per day, whereas the entire network sees up to 800,000 addresses daily.$^{[5]}$
### Goal
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing an acceptable predictive model for categorizing ransomware addresses with an acceptable degree of accuracy. Increasing the precision, while not strictly necessary for the purposes of the project, would be a notable sign of success.
### Outline of Steps Taken
1. Analyze data set numerically and visually, look for insights in any patterns.
2. Binary separation using Self Organizing Maps.
3. Faster binary separation using Random Forest.
4. Categorical classification using Self Organizing Maps.
5. Visualize clustering to analyze results further.
6. Generate confusion matrix to quantify results.
@ -167,7 +198,7 @@ The original research team downloaded and parsed the entire Bitcoin transaction
All of the analysis in this report was conducted on a single laptop computer, a **Lenovo Yoga S1** from late 2013 with the following specifications.
- CPU: Intel i7-4600U @ 3.300GHz (4th Gen quad-core i7 `x86_64`)
- RAM: 8217MB DDR3L @ 1600 MHz (8 GB)
- OS: Slackware64-current (15.0 RC1) `x86_64-slackware-linux-gnu` (64-bit GNU/Linux)
- R version 4.0.0 (2020-04-24) -- "Arbor Day" (built from source using scripts from [slackbuilds.org](https://slackbuilds.org/))
@ -175,7 +206,7 @@ The original research team downloaded and parsed the entire Bitcoin transaction
### Data Preparation
It is immediately apparent that this is a rather large data set. The usual practice of partitioning out 80% to 90% of the data for training results in a training set that is too large to process given the hardware limitations. For reasons that are no longer relevant, the original data set was first split in half, with 50% reserved as the *validation set* and the other 50% used as the *working set*. This working set was again split in half to give a *training set* of a reasonable size. The resulting partitions were small enough to work with, so the partition size ratio was not refined further; a better partitioning scheme is a potential area for later optimization. Careful sampling was carried out to ensure that the ransomware groups were represented in each sample.
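The splits described above can be sketched as follows. This is a simplified illustration using `caret::createDataPartition` (the object names follow the script, but the seed and exact call details are assumptions, not a verbatim reproduction of the hidden chunk):

```r
library(caret)

set.seed(5)
# Split the full set 50/50 into validation and working sets, stratified on
# the label so every ransomware family is represented in both halves
val_index <- createDataPartition(ransomware$label, times = 1,
                                 p = 0.5, list = FALSE)
validation <- ransomware[val_index, ]
workset    <- ransomware[-val_index, ]

# Split the working set 50/50 again into training and test sets
test_index <- createDataPartition(workset$label, times = 1,
                                  p = 0.5, list = FALSE)
test_set  <- workset[test_index, ]
train_set <- workset[-test_index, ]
```

Stratifying on `label` rather than sampling rows uniformly is what guarantees that even the small ransomware families appear in each partition.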
```{r data_prep, echo=FALSE, include=FALSE}
@ -208,8 +239,6 @@ no_nas <- sum(is.na(ransomware))
### Exploration and Visualization
By graphing the values, we can get an idea of how the data is distributed across the various features.
```{r cv_calcs, echo=FALSE}
# Keep only numeric columns, ignoring temporal features
@ -248,24 +277,31 @@ test_num <- test_samp %>% select(selected_features[1], selected_features[2])
# Binary labels for test set
test_bw <- test_samp$bw
# Summarize ransomware family membership
labels <- ransomware$label %>% summary()
```
The proportion of ransomware addresses in the original data set is `r ransomprop`. Thus, they make up less than 2% of all observations. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then further divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task.
The total number of `NA` or missing values in the original data set is `r no_nas`, so at least there are no missing values to worry about; the data set is clean in that sense.
A listing of all ransomware families in the full original data set follows, with a member count for each family. As can be seen, `r length(unname(labels)[unname(labels)<10])` of the 28 families have fewer than 10 addresses associated with them. We shall keep this in mind for later.
```{r data_sparsness, echo=FALSE}
# Print ransomware family summary table
knitr::kable(
list(labels[1:10], labels[11:20], labels[21:29]),
caption = 'Ransomware group labels and frequency counts for full data set',
booktabs = TRUE)
#%>%kable_styling(latex_options = "HOLD_position")
```
We can take a look at the overall distribution of the different features. The temporal features have been left out, since those plots are basically flat. The skewed nature of the non-temporal features causes the plots to look better on a log$_2$ scale $x$-axis.
```{r histograms, echo=FALSE, warning=FALSE, fig.align="center"}
########################################################
@ -294,7 +330,6 @@ histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))
```
Now let us compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
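For a column $x$, the coefficient of variation is the ratio of the standard deviation to the mean, $c_v = \sigma / \mu$. A minimal sketch of this calculation (assuming `ransomware_num` holds the numeric columns, as in the script):

```r
# Coefficient of variation for each numeric column: sd(x) / mean(x)
coeff_vars <- sapply(ransomware_num, function(x) sd(x) / mean(x))

# Order features from largest to smallest relative spread and
# keep the names of the two most variable features
coeff_vars <- sort(coeff_vars, decreasing = TRUE)
selected_features <- names(coeff_vars)[1:2]
```

Because $c_v$ is scale-free, it allows features measured in very different units (satoshi, counts, path lengths) to be compared directly.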
```{r cv_results, echo=FALSE, fig.align="center"}
@ -303,13 +338,14 @@ Now let us compare the relative spread of each feature by calculating the coeffi
knitr::kable(
list(coeff_vars[1:2], coeff_vars[3:4], coeff_vars[5:6]),
caption = 'Coefficients of Variation for each feature',
booktabs = TRUE)
#%>%kable_styling(latex_options = "HOLD_position")
```
From this, it appears that `r selected_features[1]` has the widest relative spread, followed by `r selected_features[2]`. These are also the features most strongly skewed to the right, meaning that a few addresses have very high values for each of these features while the bulk of the data set has very low values.
Taking `r selected_features[1]`, the feature with the highest coefficient of variation, let us look at its distribution for the individual ransomware families; perhaps there is a similarity across families. This could be done for all the features, but we will focus on `r selected_features[1]` to save space and avoid repetition, and its distribution plots show the most variation anyway.
```{r variation_histograms, echo=FALSE, fig.height=2, fig.width=2.5, fig.show="hold", out.width='35%', warning=FALSE}
@ -323,6 +359,7 @@ ransomware_big_families <- ransomware %>%
# Note: Putting these graphs into a for loop breaks some of the formatting.
# Low membership makes some of the graphs not very informative
# Relatively meaningless graphs have been left out to save time and space.
# Label numbers correspond to the ransomware families listed previously.
# Label 1
ransomware_big_families %>%
@ -336,7 +373,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 4
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[4]) %>%
@ -361,7 +397,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 6
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[6]) %>%
@ -398,8 +433,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 10
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[10]) %>%
@ -436,7 +469,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 13
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[13]) %>%
@ -473,7 +505,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 16
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[16]) %>%
@ -486,7 +517,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 18
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[18]) %>%
@ -499,7 +529,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 20
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[20]) %>%
@ -512,7 +541,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 22
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[22]) %>%
@ -549,7 +577,6 @@ ransomware_big_families %>%
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 27
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[27]) %>%
@ -587,10 +614,9 @@ ransomware_big_families %>%
axis.title.x=element_blank(),
axis.title.y=element_blank())
```
It appears that, although the `r selected_features[1]` distribution for ransomware groups does differ from the distribution pattern for *white* addresses, it also varies from group to group. This makes it a good feature to use in training the models.
```{r shrimp-percentage, echo=FALSE, include=FALSE}
@ -599,15 +625,15 @@ shrimp <- ransomware %>% filter(income < 10^10 )
```
Among wallets with incomes below one hundred bitcoins, the proportion of ransomware-related addresses is `r mean(shrimp$bw == "black")`.
### Insights gained from exploration
After visually and statistically exploring the data, it becomes clear what the challenge is. Ransomware-related addresses are very sparse, comprising `r ransomprop*100`% of all addresses. This small percentage is also further classified into 28 groups. Perhaps the original paper was overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses. To simplify our approach, we will categorize the addresses in a binary way: as either *white* or *black*, where *black* signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that have been shown to be impractical otherwise.
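The binary recoding can be sketched in one step (the column name `bw` matches the one used later in the script; the exact placement of this step in the pipeline is an assumption):

```r
library(tidyverse)

# Collapse the 28 ransomware family labels into a single "black" class;
# any address not labelled "white" is ransomware-related
ransomware <- ransomware %>%
  mutate(bw = factor(ifelse(label == "white", "white", "black")))
```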
---
## Modeling approach
Akcora, et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3] Considering all ransomware addresses as belonging to a single group may improve the predictive power of such methods, making Random Forest worth another try.
@ -625,7 +651,7 @@ The percentage of wallets with less than one hundred bitcoins as their balance i
The first working model that ran to completion without exhausting computer resources ignored the ransomware family labels and instead used the two categories of *black* and *white*. The `kohonen` package provides algorithms for both supervised and unsupervised model building, using both Self Organizing Maps and Super Organizing Maps respectively.[11] A supervised approach was used since the data set includes information about the membership of ransomware families that can be used to train the model.
```{r binary_SOMs}
##############################################################################
## This is a first attempt using SOMs to model the data set as "black" and
## "white" addresses only.
@ -636,9 +662,12 @@ The first working model that ran to completion without exhausting computer resou
## compile the report without this section, you can just comment it all out
## or remove it because nothing is needed from Method Part 0 for any of the
## other methods. In other words, it can be safely skipped if you are short on
## time or RAM.
##############################################################################
# Start timer
tic("binary SOMs", quiet = FALSE, func.tic = my.msg.tic)
# Keep only numeric columns, ignoring dates and looped.
som1_train_num <- train_set %>% select(length, weight, count, neighbors, income)
@ -723,12 +752,17 @@ cm_bw.validation <-
confusionMatrix(ransomware.prediction1.validation$prediction[[2]],
validation$bw)
# End timer
toc(quiet = FALSE, func.toc = my.msg.toc, info = "INFO")
```
After training the model, we obtain the confusion matrices for the test set and the validation set, separately. As you can see, the results are very good in both cases.
```{r binary_SOM_results, echo=FALSE, results='asis' }
cm1_test_set <- som1_cm_bw %>% as.matrix() %>%
knitr::kable(format = "latex", booktabs = TRUE)
@ -753,18 +787,23 @@ cat(c("\\begin{table}[!htb]
```
This is a very intensive method compared to what follows.
It was left out of the final version of the script and has been included here only for model comparison and to track developmental evolution.
### Method Part 1: Binary Random Forest
A Random Forest model is trained using ten-fold cross validation and a tuning grid with the number of variables randomly sampled as candidates at each split (`mtry`) set to the values $\{2, 4, 6, 8, 10, 12\}$, each one being checked for optimization.
```{r random_forest, warning=FALSE}
##############################################################################
## This is a better attempt using Random Forest to model the data set as
## "black" and "white" addresses only.
##############################################################################
# Start timer
tic("Random Forest", quiet = FALSE, func.tic = my.msg.tic)
# Cross Validation, ten fold
control <- trainControl(method="cv", number = 10)
@ -788,9 +827,12 @@ cm_test <- confusionMatrix(y_hat_rf, test_bw)
ransomware_y_hat_rf <- predict(fit_rf, ransomware)
cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
# End timer
toc(quiet = FALSE, func.toc = my.msg.toc, info = "INFO")
```
The confusion matrix for the test set shows very good results, specifically in the areas of accuracy and precision. Although not as good as the SOM model used previously, the results are good enough to justify the time saved.
```{r random-forest-output_test, echo=FALSE}
@ -889,18 +931,18 @@ cat(c("\\begin{table}[!htb]
```
This is a much quicker way of removing most of the *white* addresses.
This method will be used in the final composite model to save time.
### Method Part 2: Categorical SOMs
Now we train a new model after throwing away all *white* addresses. The predictions from the Random Forest model are used to isolate all *black* addresses for further classification into ransomware addresses using SOMs. The reduced set is then categorized using a supervised SOM method with the 28 ransomware families as the target classification groups.
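The isolation step can be sketched as follows. This is a simplified illustration of the idea; the actual chunk may differ in details such as column names:

```r
library(tidyverse)

# Append the Random Forest predictions to the full data set,
# then keep only the addresses predicted to be "black"
ransomware$prediction <- predict(fit_rf, ransomware)
black_addresses <- ransomware %>% filter(prediction == "black")
```

Because the Random Forest filter is cheap, the expensive categorical SOM only ever sees the small *black* subset instead of all 2.9 million addresses.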
```{r soms-families, warning=FALSE}
##############################################################################
## Now we use the Random Forest model to classify the data set into "black"
## and "white" categories with better precision.
##############################################################################
# Start timer
tic("categorical SOMs", quiet = FALSE, func.tic = my.msg.tic)
# Now use this prediction to reduce the original set to only "black" addresses
# First append the full set of predictions to the original set.
@ -976,19 +1018,25 @@ ransomware_group.prediction <- predict(som_model2, newdata = test_list)
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
test_set$label)
# End timer
toc(quiet = FALSE, func.toc = my.msg.toc, info = "INFO")
```
When selecting the grid size for a Self Organizing Map, there are at least two different schools of thought. The two that were tried here are explained (with supporting documentation) on a Researchgate forum.[8] The first method is based on the size of the training set, and in this case results in a larger, more accurate map. The second method is based on the number of known categories to classify the data into, and in this case results in a smaller, less accurate map. For this script, a grid size of `r grid_size` has been selected.
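The two heuristics can be sketched as follows; the first follows the common rule of thumb of roughly $5\sqrt{N}$ map units for $N$ training samples, while the second simply covers the number of known classes (both are approximations, not exact reproductions of the script):

```r
# Heuristic 1: ~5 * sqrt(N) total units, arranged on a square grid
n <- nrow(train_num)
grid_size_1 <- ceiling(sqrt(5 * sqrt(n)))

# Heuristic 2: just enough units to cover the known ransomware families
n_classes <- length(levels(ransomware$label))
grid_size_2 <- ceiling(sqrt(n_classes))

# The larger, training-set-based grid proved more accurate here
som_grid <- kohonen::somgrid(xdim = grid_size_1, ydim = grid_size_1,
                             topo = "hexagonal", toroidal = TRUE)
```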
A summary of the results for the categorization of black addresses into ransomware families follows. For the full table of predictions and statistics, see the Appendix.
Here are the overall results of the final categorization.
```{r cm_overall, echo=FALSE}
# Overall section of the confusion matrix formatted through kable()
cm_labels$overall %>%
knitr::kable(caption="overall categorization results")
#%>% kable_styling(latex_options = "HOLD_position")
```
@ -997,7 +1045,9 @@ Here are the final results by class.
```{r soms-output-by-class, echo=FALSE, size="tiny"}
# By Class section of the confusion matrix formatted through kable()
cm_labels$byClass %>%
knitr::kable(caption="categorization results by class")
#%>% kable_styling(latex_options = "HOLD_position")
```
@ -1011,58 +1061,57 @@ Toroidal neural node maps are used to generate the models, and can be visualized
```{r categorical som graphs, echo=FALSE, fig.show="hold", out.width='35%'}
# Be careful with these, some are really large and take a long time to produce.
# SOM visualization plots
# Visualize neural network mapping
plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Distance map
plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize counts
plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize fan diagram
plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmaps for each of the six training features
for (i in 1:6) {
  plot(som_model2, type = 'property', property = som_model2$codes[[1]][, i],
       main = colnames(train_num)[i], pch = 19, palette.name = topo.colors)
}
```
The codes plots show how much of each feature is represented by each cell in the map. For large numbers of categories (such as the ransomware families here), the default behavior is to draw a line plot instead of a segment plot, which leads to the density-like patterns on the right. The left plot shows the codebook vectors of the features used in the model, which can be directly interpreted as an indication of how likely a given class is at a certain unit.
```{r fan diagrams graphs, echo=FALSE}
# Visualize fan diagram
plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors,
main = c("Codes for training features", "Codes for ransomware families"))
```
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
Clustering the codebook vectors groups similar map units together, and the resulting cluster boundaries can be drawn directly onto the SOM grid to show where one category region ends and the next begins.
### Results
The first attempt to isolate ransomware from *white* addresses using SOMs resulted in a model with an accuracy of `r toString(cm_bw.validation$overall["Accuracy"])` and precision `r toString(cm_bw.validation$byClass[3])`.
The second attempt to isolate ransomware from *white* addresses using Random Forest resulted in a model with an accuracy of `r toString(cm_ransomware$overall["Accuracy"])` and precision `r toString(cm_ransomware$byClass[3])`.
Classifying the ransomware predicted by the second attempt into 28 ransomware families using SOMs resulted in a model with an overall accuracy of `r toString(cm_labels$overall["Accuracy"])` and minimum nonzero precision of `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`.
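All of the accuracy and precision figures quoted above are read from caret's `confusionMatrix()` output; a minimal sketch of where they live (using made-up factor vectors for illustration, not the actual model predictions):

```r
library(caret)

# Hypothetical predictions vs. reference labels (illustration only)
pred <- factor(c("black", "white", "black", "white"), levels = c("black", "white"))
ref  <- factor(c("black", "white", "white", "white"), levels = c("black", "white"))

cm <- confusionMatrix(pred, ref, positive = "black")
cm$overall["Accuracy"]   # 0.75: 3 of 4 predictions match
cm$byClass["Precision"]  # 0.50: 1 true positive out of 2 predicted "black"
```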
### Performance
The script runs on the aforementioned hardware in 235 seconds and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, even with moderate computing resources. For comparison, the final script was also run on lower-powered machines with the following specifications:
#### ASUS Eee PC 1025C
- CPU: Intel Atom N2600 @ 1.6GHz (64-bit Intel Atom quad-core x86)
- RAM: 3911MB DDR3 @ 800 MT/s (4 GB)
This is a computer known for being slow and clunky. Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds. At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, although it does show that the script can be run on very modest hardware to completion.
#### Pine64 Quartz64 Model A
- CPU: Rockchip RK3566 SoC `aarch64` @1.8GHz (64-bit quad-core ARM)
- RAM: DDR4 8080MB (8 GB)
This is a single board computer / development board, which runs the same software as the others (ported to `aarch64`), except for RStudio. It is of personal interest to benchmark a modern 64-bit ARM processor in addition to the two Intel CPUs. The script runs in about 860 seconds on this platform, nearly half the time taken by the Atom processor above. Still not fast enough to analyze each block in real time, but a significant improvement given the low power usage of such processors.
---
### Comparison to results from original paper
In the original paper by Akcora et al., they tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each
true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the **highest** precision [a.k.a. Positive Predictive Value, defined as TP/(TP+FP), where TP = the number of true positives, and FP = the number of false positives] they achieved was only 0.1610. By comparison, although several of our predicted classes had zero or NA precision values due to low family membership in some cases, the **lowest** non-zero precision value is `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`, with many well above that, approaching one in a few cases.
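To restate the paper's headline ratio in precision terms (a quick check using only the numbers quoted above):

```r
# "16.59 false positives for each true positive" implies, by the definition
# precision = TP / (TP + FP) with TP normalized to 1, a precision of roughly:
fp_per_tp <- 16.59
precision <- 1 / (1 + fp_per_tp)
round(precision, 4)  # 0.0569
```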
One might say that we are comparing apples to oranges by benchmarking a single-method model against a two-method stack. Still, the two-model approach seems superior in this case, especially when measured in terms of total run time and its ability to avoid false positives.
### Limitations
SOMs have many parameters that seem easy to misconfigure, and they usually require significantly more computing resources than less sophisticated algorithms. Perhaps a dual Random Forest approach would be better, if the loss of accuracy or precision were worth the time gained.
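A rough sketch of what such a dual Random Forest stack might look like (simulated stand-in data; the real script would reuse its own feature matrices and labels, and `ntree = 100` is a placeholder rather than a tuned value):

```r
library(randomForest)
set.seed(5)

# Simulated stand-in for the address features and binary labels
n <- 300
feats <- data.frame(income = rexp(n), neighbors = rpois(n, 2))
bw <- factor(ifelse(feats$income > 1, "black", "white"))

# Stage 1: binary filter separating ransomware ("black") from "white"
rf_binary <- randomForest(x = feats, y = bw, ntree = 100)

# Stage 2: classify only the addresses flagged "black" into families
black <- feats[predict(rf_binary) == "black", ]
fams  <- factor(sample(c("famA", "famB"), nrow(black), replace = TRUE))
rf_family <- randomForest(x = black, y = fams, ntree = 100)
```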
### Future Work
I only scratched the surface of the SOM algorithm, which has many implementations and parameters that could be investigated further and possibly optimized via cross-validation. For example, the grid size used to train the SOM was calculated using an algorithm based on the size of the training set, and while this performed better than a grid size based on the number of categories, it may not be ideal. Optimization around grid size could still be carried out.
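For instance, one widely cited starting point is Vesanto's rule of thumb of roughly $5\sqrt{N}$ map units for $N$ training samples; a sketch of how that translates into a `kohonen` grid (the `10000` here is a stand-in for the real training-set size, not the value used in the script):

```r
library(kohonen)

# Vesanto's heuristic: total map units ~ 5 * sqrt(N)
n_obs <- 10000                 # stand-in for nrow(train_num)
units <- 5 * sqrt(n_obs)       # 500 units
side  <- ceiling(sqrt(units))  # 23, for a roughly square map
grid  <- somgrid(xdim = side, ydim = side, topo = "hexagonal", toroidal = TRUE)
```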
A dual Random Forest approach, used first to isolate the ransomware addresses and then to classify them, might be quick enough to run in under ten minutes on all the hardware listed. Conversely, a dual SOM method could be created for maximum precision if the necessary computing resources were available.
The script itself has a few areas that could be improved. The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized further.
### Conclusion
This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives by filtering them out with a binary method before classifying them further. It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy-oriented coins. Certain cryptocurrency networks, such as Monero, utilize privacy features that obfuscate transactions and prevent them from being analyzed in the way the Bitcoin network has been analyzed here. Some progress has been made toward analyzing these networks[9], but the developers of such networks continually evolve their code to complicate transaction tracking. This could be another good area for future research.
## References
