Graphs finally all look good. Now focus on the words! Words words words, write up the words and then put out a "first reading draft".

shelldweller 2021-11-04 05:12:12 -06:00
parent 1c4ef5bb15
commit 9e1a851e1f
3 changed files with 303 additions and 247 deletions


@ -3,7 +3,7 @@
## using Random Forests and Self Organizing Maps
##
## Kaylee Robert Tejeda
## October 31, 2021
## November 11, 2021
##
## Submitted as part of final CYO project for
## HarvardX PH125.9x Capstone Course
@ -15,7 +15,7 @@
library(tictoc)
tic(quiet = FALSE)
# Set the repository mirror to “0-Cloud” for maximum availability
# Set the repository mirror to “1: 0-Cloud” for maximum availability
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)
@ -71,7 +71,6 @@ test_index <- createDataPartition(y = workset$bw,
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
###############################################################################
## Data preparation is now done
## Separate into "black" and "white" groups using Random Forests predictions
@ -164,7 +163,6 @@ test_index <- createDataPartition(y = black_addresses$prediction,
train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]
# Keep only numeric columns, ignoring temporal variables.
train_num <- train_set %>%
select(length, weight, count, looped, neighbors, income)
@ -249,7 +247,5 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
message("Overall accuracy is ", cm_labels$overall["Accuracy"])
#End timer
toc()
toc()


@ -4,7 +4,7 @@ subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "11/11/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives and improve on previous results."
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
keywords:
- Bitcoin
- blockchain
@ -15,19 +15,10 @@ keywords:
- SOMs
- cryptocurrency
output: pdf_document
header-includes:
- \usepackage{booktabs}
geometry: margin=2cm
---
```{r tic, echo=FALSE, include=FALSE}
##############################################################################
## Uncomment these commands to time the compilation of the script.
## the tictoc library needs to be installed for this to work.
##############################################################################
library(tictoc)
tic(quiet = FALSE)
```
\def\bitcoinA{%
\leavevmode
\vtop{\offinterlineskip %\bfseries
@ -46,6 +37,7 @@ knitr::knit_hooks$set(chunk = function(x, options) {
"\n\n \\normalsize"), x)
})
```
\newpage
@ -57,15 +49,15 @@ knitr::knit_hooks$set(chunk = function(x, options) {
## Introduction
Ransomware attacks have gained the attention of security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) come to find that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address before a certain deadline to have their data decrypted before being deleted permanently.
Ransomware attacks are of interest to security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) come to learn that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address by a certain deadline to have the data decrypted or else it will be deleted automatically.
The legal and financial implications of ransomware attacks are not of concern for the purpose of this analysis. Many parties are interested in tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ A ransomware attack could be perpetrated on an illegal darknet market site, for example. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services.
The deeper legal and financial implications of ransomware attacks are inconsequential to the work in this report, as we are merely interested in being able to classify bitcoin addresses by their connection to ransomware transactions. Many researchers are already tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ For example, consider a ransomware attack conducted towards an illegal darknet market site. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services, if that is so desired.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora, et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as "white". The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora, et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as *white*. The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Addresses on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference. Speed is defined as the number of blocks the coin appears in during a 24-hour period and provides information on how quickly a coin moves through the network. Speed can be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a 24 hour period, and thus have lower speeds when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
Any given address on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference. This way, variables can be defined in a specific and meaningful way. For example, *speed* can be defined as the number of blocks the coin appears in during a 24-hour period, and provides information on how quickly a coin moves through the network. *Speed* may be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a given 24 hour period, and thus have lower speeds when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
With the graph defined as such, the following six numerical features$^{[2]}$ are associated with a given address:
With the graph specified as such, the following six numerical features$^{[2]}$ are associated with a given address:
1) *Income* - the total amount of coins sent to an address
@ -80,14 +72,13 @@ acyclic directed path originating from any starter transaction and ending at the
6) *Looped* - The number of starter addresses connected to this address by more than one path
These variables are defined rather abstractly, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions. Several machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the results will be compared.
These variables are defined rather conceptually, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that. Machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the new results will be compared to the original ones.
### Data
This data set was discovered while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
This data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
```{r install-load-libraries download-data, echo=FALSE, include=FALSE}
```{r install_load_libraries-and-download_data, echo=FALSE, include=FALSE}
# Set the repository mirror to “0-Cloud” for maximum availability
r = getOption("repos")
@ -102,6 +93,7 @@ if(!require(randomForest)) install.packages("randomForest")
if(!require(kohonen)) install.packages("kohonen")
if(!require(parallel)) install.packages("parallel")
if(!require(matrixStats)) install.packages("matrixStats")
if(!require(xtable)) install.packages("xtable")
# Load Libraries
library(tidyverse)
@ -110,6 +102,10 @@ library(randomForest)
library(kohonen)
library(parallel)
library(matrixStats)
library(xtable)
# Set # of cores, use detectCores() - 1 to leave one for the system
n_cores <- detectCores()
# Download data
url <-
@ -128,40 +124,40 @@ ransomware <- read_csv("data/BitcoinHeistData.csv")
```
A summary of the data set tells the range of values and size of the sample.
A summary of the data set shows the range of values and size of the sample.
```{r data-summary, echo=FALSE, size="tiny"}
```{r data_summary, echo=FALSE, size="tiny"}
# Summary
ransomware %>% summary() %>% knitr::kable()
ransomware %>% summary() %>% knitr::kable(caption="Summary of data set")
```
A listing of the first ten rows provides a sample of the features associated with each observation.
```{r data-head, echo=FALSE, size="tiny"}
```{r data_head, echo=FALSE, size="tiny"}
# Inspect data
ransomware %>% head(10) %>% knitr::kable()
ransomware %>% head(10) %>% knitr::kable(caption="First ten entries of data set")
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, looped*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$ .
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, looped*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (meaning not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$.
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day while the entire network has up to 800,000 addresses daily.$^{[5]}$
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. *White* Bitcoin addresses were capped at one thousand per day, whereas the entire network sees up to 800,000 addresses daily.$^{[5]}$
### Goal
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper to produce an acceptable predictive model for categorizing ransomware addresses correctly. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing an acceptable predictive model for categorizing ransomware addresses correctly. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
### Outline of Steps Taken
1. Analyze data set numerically and visually. Notice any pattern, look for insights.
1. Analyze the data set numerically and visually, looking for insights in any patterns.
2. Binary separation using Self Organizing Maps.
3. Improved Binary separation using Random Forest.
3. Fast binary separation using Random Forest.
4. Categorical classification using Self Organizing Maps.
5. Visualize clustering to analyze results further.
6. Generate Confusion Matrix to quantify results.
6. Generate confusion matrix to quantify results.
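
As a rough illustration of the last step, the sketch below builds a confusion matrix in base R and derives accuracy and precision from it. The labels are invented for the example, and base `table()` is used here as a stand-in; the report itself appears to read accuracy from a caret-style object (`cm_labels$overall["Accuracy"]`), so this is only a minimal sketch of the idea, not the report's code.

```r
# Hypothetical true and predicted labels for ten addresses (not real data).
truth <- factor(c("black", "black", "black", "white", "white",
                  "white", "white", "white", "white", "white"),
                levels = c("black", "white"))
pred  <- factor(c("black", "black", "white", "black", "white",
                  "white", "white", "white", "white", "white"),
                levels = c("black", "white"))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = pred, Actual = truth)
print(cm)

# Accuracy: share of all addresses that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Precision for "black": of all addresses flagged as ransomware,
# the share that really are ransomware (few false positives = high precision).
precision <- cm["black", "black"] / sum(cm["black", ])
```

Precision is the more relevant figure when false positives are the main concern, since a model can reach high accuracy on a sparse class simply by predicting *white* almost everywhere.
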
---
@ -169,7 +165,7 @@ The original research team downloaded and parsed the entire Bitcoin transaction
### Hardware Specification
All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specs:
All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specifications.
- CPU: Intel i7-4600U @ 3.300GHz (4th Gen dual-core i7 with four threads, x86_64)
- RAM: 8217MB DDR3L @ 1600 MHz (8 GB)
@ -179,9 +175,9 @@ The original research team downloaded and parsed the entire Bitcoin transaction
### Data Preparation
It is immediately apparent that this is a rather large data set. The usual practice of partitioning out eighty to ninety percent of the data for a training set results in a data set that is too large to process given the hardware available. For reasons that no longer apply, the original data set was first split in half with 50% reserved as "validation set" and the other 50% used as the "working set". This working set was again split in half, to give a "training set" that was of a reasonable size to deal with. At this point the partitions were small enough to work with, so the sample partitions were not further refined. This is a potential area for later optimization. Careful sampling was carried out to ensure that the ransomware groups were represented in each sample.
It is immediately apparent that this is a rather large data set. The usual practice of partitioning out 80% to 90% of the data for training results in a training set that is too large to process given the hardware limitations. For reasons that no longer apply, the original data set was first split in half with 50% reserved as *validation set* and the other 50% used as the *working set*. This working set was again split in half, to give a *training set* that was of a reasonable size to deal with. This produced partitions that were small enough to work with, so the partition size ratio was not further refined. This is a potential area for later optimization. Careful sampling was carried out to ensure that the ransomware groups were represented in each sample.
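
The two-stage split described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the toy data frame and the helper `stratified_index()` are hypothetical, and the helper is a simple stand-in for the `createDataPartition()` calls used in the actual script. It samples within each class so that the sparse class appears in every partition.

```r
# Toy stand-in for the real data: 1000 rows, 2% "black" (ransomware).
set.seed(1)
toy <- data.frame(
  id = 1:1000,
  bw = factor(c(rep("black", 20), rep("white", 980)))
)

# Hypothetical helper: pick a fraction p of the row indices within each class,
# so that every class is represented on both sides of the split.
stratified_index <- function(y, p = 0.5) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# First split: half is held out as the validation set,
# the other half is kept as the working set.
val_index  <- stratified_index(toy$bw, p = 0.5)
validation <- toy[val_index, ]
workset    <- toy[-val_index, ]

# Second split: the working set is halved again into test and training sets.
test_index <- stratified_index(workset$bw, p = 0.5)
test_set   <- workset[test_index, ]
train_set  <- workset[-test_index, ]

# Class counts in the training set.
table(train_set$bw)
```

With this layout the training set holds roughly a quarter of the original rows, which matches the 50% then 50% scheme in the text.
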
```{r data-prep, echo=FALSE, include=FALSE}
```{r data_prep, echo=FALSE, include=FALSE}
# Turn labels into factors, "bw" is binary factor for ransomware/non-ransomware
ransomware <- ransomware %>%
@ -210,13 +206,11 @@ no_nas <- sum(is.na(ransomware))
```
#########################################################################
### Exploration and Visualization
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
```{r cv-calcs, echo=FALSE}
By graphing the values, we can get an idea of how the data is distributed across the various features.
```{r cv_calcs, echo=FALSE}
# Keep only numeric columns, ignoring temporal features
ransomware_num <- ransomware %>%
@ -258,12 +252,14 @@ test_bw <- test_samp$bw
The proportion of ransomware addresses in the original data set is `r ransomprop`. The total number of NA or missing values in the original data set is `r no_nas`.
```{r data-sparsness, echo=FALSE}
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
```{r data_sparsness, echo=FALSE}
labels <- ransomware$label %>% summary()
knitr::kable(
list(labels[1:15], labels[16:29]),
list(labels[1:10], labels[11:20], labels[21:29]),
caption = 'Ransomware group labels and frequency counts for full data set',
booktabs = TRUE)
@ -299,15 +295,15 @@ histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))
```
Now we can compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
Now let us compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
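
For reference, the coefficient of variation of a column is its standard deviation divided by its mean. The sketch below computes it in base R on a toy data frame; the values and the helper name `coeff_var()` are made up for illustration, and the report's own `coeff_vars` object may be computed differently.

```r
# Toy numeric features (hypothetical values, not taken from the data set).
toy <- data.frame(
  income = c(1, 2, 3, 4, 100),
  length = c(10, 11, 9, 10, 10)
)

# Coefficient of variation: standard deviation divided by the mean.
coeff_var <- function(x) sd(x) / mean(x)

# Apply the helper to every column.
cvs <- sapply(toy, coeff_var)

# Sort so the feature with the largest relative spread comes first.
sort(cvs, decreasing = TRUE)
```

Because the measure is scaled by the mean, it allows columns with very different units and magnitudes to be compared on the same footing.
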
```{r cv-results, echo=FALSE, fig.align="center"}
```{r cv_results, echo=FALSE, fig.align="center"}
# Summarize results in a table
knitr::kable(coeff_vars)
# Scatterplot, not very interesting
# plot(coeff_vars)
knitr::kable(
list(coeff_vars[1:2], coeff_vars[3:4], coeff_vars[5:6]),
caption = 'Coefficients of Variation for each feature',
booktabs = TRUE)
```
@ -316,7 +312,7 @@ From this, it appears that `r selected_features[1]` has the widest range of vari
Taking the feature with the highest variation `r selected_features[1]`, let us take a look at the distribution for individual ransomware families. Perhaps there is a similarity across families.
```{r variation histograms, echo=FALSE, fig.show="hold", out.width='35%', warning=FALSE}
```{r variation_histograms, echo=FALSE, fig.height=2, fig.width=2.5, fig.show="hold", out.width='35%', warning=FALSE}
# Density plots of the feature with highest variation
selected_feature1 <- selected_features[1]
@ -326,8 +322,8 @@ ransomware_big_families <- ransomware %>%
# Note: Putting these graphs into a for loop breaks some of the formatting.
# Low membership makes some of the graphs not very informative
# Relatively boring graphs have been commented out to save time and space.
# These can be uncommented if one wishes.
# Relatively meaningless graphs have been left out to save time and space.
# Label 1
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[1]) %>%
@ -335,23 +331,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[1]) +
scale_x_continuous(trans='log2')
# Label 2
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[2]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green", size = .5)+
# ggtitle(levels(ransomware_big_families$label)[2]) +
# scale_x_continuous(trans='log2')
# Label 3
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[3]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green")+
# ggtitle(levels(ransomware_big_families$label)[3]) +
# scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 4
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[4]) %>%
@ -359,7 +344,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[4]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 5
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[5]) %>%
@ -367,7 +356,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[5]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 6
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[6]) %>%
@ -375,7 +369,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[6]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 7
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[7]) %>%
@ -383,7 +381,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[7]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 8
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[8]) %>%
@ -391,15 +393,13 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[8]) +
scale_x_continuous(trans='log2')
# Label 9
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[9]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green")+
# ggtitle(levels(ransomware_big_families$label)[9]) +
# scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 10
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[10]) %>%
@ -407,7 +407,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[10]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 11
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[11]) %>%
@ -415,7 +419,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[11]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 12
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[12]) %>%
@ -423,7 +431,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[12]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 13
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[13]) %>%
@ -431,7 +444,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[13]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 14
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[14]) %>%
@ -439,7 +456,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[14]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 15
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[15]) %>%
@ -447,7 +468,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[15]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 16
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[16]) %>%
@ -455,15 +481,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[16]) +
scale_x_continuous(trans='log2')
# Label 17
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[17]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green", size = .5)+
# ggtitle(levels(ransomware_big_families$label)[17]) +
# scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 18
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[18]) %>%
@ -471,15 +494,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[18]) +
scale_x_continuous(trans='log2')
# Label 19
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[19]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green")+
# ggtitle(levels(ransomware_big_families$label)[19]) +
# scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 20
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[20]) %>%
@ -487,15 +507,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[20]) +
scale_x_continuous(trans='log2')
# Label 21
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[21]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green", size = .5)+
# ggtitle(levels(ransomware_big_families$label)[21]) +
# scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 22
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[22]) %>%
@ -503,7 +520,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[22]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 23
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[23]) %>%
@ -511,7 +532,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[23]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 24
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[24]) %>%
@ -519,23 +544,12 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[24]) +
scale_x_continuous(trans='log2')
# Label 25
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[25]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green", size = .5)+
# ggtitle(levels(ransomware_big_families$label)[25]) +
# scale_x_continuous(trans='log2')
# Label 26
#ransomware_big_families %>%
# filter(label==levels(ransomware_big_families$label)[26]) %>%
# select(income) %>%
# ggplot(aes(x=income, y = ..density..)) +
# geom_density(col = "green")+
# ggtitle(levels(ransomware_big_families$label)[26]) +
# scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 27
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[27]) %>%
@ -543,7 +557,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[27]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 28
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[28]) %>%
@ -551,7 +569,11 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[28]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
# Label 29
ransomware_big_families %>%
filter(label==levels(ransomware_big_families$label)[29]) %>%
@ -559,49 +581,51 @@ ransomware_big_families %>%
ggplot(aes(x=income, y = ..density..)) +
geom_density(col = "green")+
ggtitle(levels(ransomware_big_families$label)[29]) +
scale_x_continuous(trans='log2')
scale_x_continuous(trans='log2') +
theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
plot.title = element_text(size = 9, face = "bold"),
axis.title.x=element_blank(),
axis.title.y=element_blank())
```
The income distribution (taken here as an example feature) for each ransomware group not only differs from the distribution for *white* addresses, it also varies from group to group. This makes income a good feature to use when training the models.
```{r shrimp-percentage, echo=FALSE, include=FALSE}
# Count how many wallets have less than one full bitcoin
shrimp <- ransomware %>% filter(income < 10^8 )
# Count how many wallets have less than one hundred bitcoins
shrimp <- ransomware %>% filter(income < 10^10 )
```
Of the wallets with a balance of less than one full bitcoin, the proportion that are *black* is `r mean(shrimp$bw == "black")`.
Of the wallets with a balance of less than one hundred bitcoins, the proportion that are *black* is `r mean(shrimp$bw == "black")`. I have no idea why this is meaningful, but I can calculate it at least.
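As a minimal sketch of what the inline expression above computes, consider a toy data frame (the values and the names `toy`, `shrimp_toy`, `prop_black` and `pct_black` are invented for illustration and are not taken from the data set). Taking `mean()` of a logical vector gives the proportion of `TRUE` values, so the result is the share of low-balance addresses labelled *black*, expressed as a fraction rather than a percentage.

```r
# Toy stand-in for the `ransomware` data frame. Incomes are assumed to be in
# satoshis (10^8 satoshis = 1 bitcoin); these values are illustrative only.
toy <- data.frame(
  income = c(5e7, 2e8, 3e9, 5e10, 9e7, 1e11),
  bw     = factor(c("black", "white", "white", "white", "black", "white"),
                  levels = c("black", "white"))
)

# Keep addresses with less than one hundred bitcoins (10^10 satoshis)
shrimp_toy <- toy[toy$income < 10^10, ]

# Proportion of those addresses labelled "black"
prop_black <- mean(shrimp_toy$bw == "black")

# Multiply by 100 to report the same quantity as a percentage
pct_black <- 100 * prop_black
```

If a percentage is what the text should report, the inline expression would need the same multiplication by 100.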
###############################################################################
### Insights gained from exploration
### Insights Gained from Exploration
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware related addresses are very sparse in the data set, making up less than 2% of all addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are known to be impractical otherwise.
After visually and statistically exploring the data, it becomes clear what the challenge is. Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses. This small percentage is further classified into 28 groups. Perhaps the original paper was overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses. To simplify our approach, we will categorize the addresses in a binary way as either *white* or *black*, where *black* signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that have been shown to be impractical otherwise.
---
## Modeling approach
## Modelling approach
Akcora et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11] Considering all ransomware addresses as belonging to a single group might improve the predictive power of such methods, making Random Forest worth another try.
Akcora, et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11] Considering all ransomware addresses as belonging to a single group may improve the predictive power of such methods, making Random Forest worth another try.
The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough to merit further investigation.
The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps (SOMs), supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough to merit further investigation.
Initially, the categorization of ransomware into the 28 different families was attempted using SOMs. This proved to be very resource intensive, requiring more time and RAM than was available at the time. Although it did help to illuminate how SOMs are configured, the resource requirements of the script became a deterrent. It was at this point that the SOMS were applied in a binary way, classifying all ransomware addresses as merely "black". This seemed to reduce RAM usage to the point of being feasible on the available hardware.
Initially, the categorization of ransomware into the 28 different families was attempted using SOMs. This proved to be very resource intensive, requiring more time and RAM than was available. Although it did help to illuminate how SOMs are configured, the resource requirements of the algorithm became a deterrent. It was at this point that the SOMs were applied in a binary way, classifying all ransomware addresses as merely *black*, initially in an attempt to simply get the algorithm to run to completion without error. This seemed to reduce RAM usage to the point of being feasible on the available hardware.
Being unsure of the SOM method, since it was not covered in the coursework at any point, a familiar method was sought out to compare the results to. This is when Random Forest was applied to the data set in a binary way. Much to the surprise of the author, not only did the Random Forest approach result in an acceptable model, it surpassed the model produced by the SOM approach.
Self Organizing Maps were not covered in the coursework at any point, so a familiar method was sought out for comparison. Random Forest was chosen and applied to the data set in a binary way, classifying every address as either *white* or *black* and ignoring the ransomware families. Surprisingly, not only did the Random Forest approach result in an acceptable model, it did so much more quickly than expected, taking only a few minutes to produce results.
The author was tempted to leave it there and write up a comparison of the two approaches to the binary problem. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of categorizing the ransomware addresses into the 28 families. Given the high accuracy and precision of the Random Forest approach to the binary problem, it became apparent that the sparseness of the ransomware in the larger set had been eliminated completely, as had any chances of false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method has never produced a false positive yet, meaning it never seems to predict a truly white address as being black. Hence, by applying the Random Forest method first, we have essentially filtered out any possibility of false positives, which is exactly what plagued the original paper by Akcora et al.[3]
At this point, it was very tempting to leave it there and write up a comparison of the two approaches to the binary problem of classifying all ransomware-related addresses as *black*. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families. Given the high accuracy and precision of the binary Random Forest approach, the sparseness of the ransomware in the larger set has been eliminated completely, along with any chances of false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method does not seem to produce many false positives (if any), meaning it rarely, if ever, predicts a truly *white* address as being *black*. Hence, by applying the Random Forest method first, we have effectively filtered out any possibility of false positives by correctly identifying a very large set of purely *white* addresses, which are then removed from the set. The best model used in the original paper by Akcora, et al. resulted in more false positives than true positives. This low precision rate is what made it impractical for real-world usage.[3]
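The precision argument above can be made concrete with a small base R sketch. The labels below are hypothetical and chosen only so that false positives outnumber true positives, which is the situation described for the original paper; they are not results from this project.

```r
# Hypothetical true labels and predictions; "black" is the positive class.
truth <- factor(c("black", "black", "black",
                  "white", "white", "white", "white", "white"),
                levels = c("black", "white"))
pred  <- factor(c("black", "black", "white",
                  "black", "black", "black", "white", "white"),
                levels = c("black", "white"))

# Confusion matrix with predictions in rows and reference labels in columns
cm <- table(Prediction = pred, Reference = truth)

tp <- cm["black", "black"]  # black addresses correctly flagged
fp <- cm["black", "white"]  # white addresses wrongly flagged as black
fn <- cm["white", "black"]  # black addresses that were missed

# Precision: share of flagged addresses that really are black
precision <- tp / (tp + fp)

# Recall (sensitivity): share of black addresses that were flagged
recall <- tp / (tp + fn)
```

Whenever `fp` exceeds `tp`, precision falls below 0.5, so most flagged addresses would be false alarms.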
Finally, a two-part method was devised to first separate the addresses into black and white groups, and then further classify the black addresses into ransomware families. Let's explore each of these separately.
This all inspired a two-part method to first separate the addresses into *black* and *white* groups, and then further classify the *black* addresses into ransomware families. We shall explore each of these steps separately.
### Method Part 0: Binary SOMS as a first attempt to isolate ransomware addresses.
### Method Part 0: Binary SOMs
Let's see how well the SOM approach can model the data in a black/white fashion.
The first working model that ran to completion without exhausting computer resources did not make use of the ransomware family labels and instead used the two categories of *black* and *white*. The `kohonen` package provides algorithms for both supervised and unsupervised model building. A supervised approach was used since the data set includes information about the membership of ransomware families that can be used to train the model.
```{r binary SOMs, echo=FALSE, include=FALSE}
```{r binary_SOMs, echo=FALSE, include=FALSE}
##############################################################################
## This is a first attempt using SOMs to model the data set as "black" and
## "white" addresses only.
@ -660,7 +684,7 @@ som_model1 <- xyf(som1_train_mat, som1_train_bw,
grid = som1_train_grid,
rlen = 100,
mode="pbatch",
cores = detectCores(), # detectCores() - 1
cores = n_cores,
keep.data = TRUE
)
@ -672,7 +696,7 @@ som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)
ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)
# Confusion Matrix
# Confusion matrix
som1_cm_bw <-
confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)
@ -694,40 +718,48 @@ valid_list <- list(independent = valid_mat, dependent = valid_bw)
# Requires up to 16GB of RAM, skip if resources are limited
ransomware.prediction1.validation <- predict(som_model1, newdata = valid_list)
# Confusion Matrix
# Confusion matrix
cm_bw.validation <-
confusionMatrix(ransomware.prediction1.validation$prediction[[2]],
validation$bw)
```
Here are the results of the binary SOM model.
After training the model, we obtain the confusion matrices for the test set and the validation set separately.
Test set:
```{r binary_SOM_results, echo=FALSE, results='asis' }
```{r binary som results1, echo=FALSE}
cm1_test_set <- som1_cm_bw %>% as.matrix() %>%
knitr::kable(format = "latex", booktabs = TRUE)
som1_cm_bw %>% as.matrix() %>% knitr::kable()
cm1_validation_set <- cm_bw.validation %>% as.matrix() %>%
knitr::kable(format = "latex", booktabs = TRUE)
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{test set}
\\centering",
cm1_test_set,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{validation set}",
cm1_validation_set,
"\\end{minipage}
\\end{table}"
))
```
Validation set:
```{r binary som results2, echo=FALSE}
This is a very intensive and somewhat inaccurate method compared to what follows. It was left out of the final version of the script and has been included here only for model comparison and to track developmental evolution.
cm_bw.validation %>% as.matrix() %>% knitr::kable()
### Method Part 1: Binary Random Forest
```
A Random Forest model is trained using ten-fold cross validation and a tuning grid in which the number of variables randomly sampled as candidates at each split (`mtry`) takes the values $\{2, 4, 6, 8, 10, 12\}$, each one being evaluated to find the optimum.
This winds up being the most resource intensive and least accurate method out of those tried. It was left out of the final version of the script and is not really worth running except to compare to the next method.
### Method Part 1: Binary Random Forest to isolate ransomware addresses before categorization.
A Random Forest model is trained using ten-fold cross validation and a tuning grid in which the number of variables randomly sampled as candidates at each split, `mtry`, takes the values $\{2, 4, 6, 8, 10, 12\}$, each one being evaluated to find the optimum.
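The tuning setup described above might look roughly like the following sketch, which uses `caret::train()` with `method = "rf"` (this also requires the `randomForest` package). The toy data, the object names `toy`, `ctrl`, `rf_grid` and `fit_rf`, and the reduced `ntree` value are assumptions made so the example runs quickly; the chunk used in the report is not shown in this diff and may differ.

```r
library(caret)  # train() and trainControl(); method = "rf" uses randomForest

set.seed(1)

# Toy data with 12 numeric predictors so that every mtry value in the grid
# is valid. The outcome depends on the first two predictors plus noise.
n <- 200
toy <- as.data.frame(matrix(rnorm(n * 12), ncol = 12))
toy$bw <- factor(ifelse(toy$V1 + toy$V2 + rnorm(n, sd = 0.5) > 0,
                        "black", "white"))

# Ten-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)

# Candidate values for mtry, as listed in the text
rf_grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))

# ntree is passed through to randomForest(); 50 trees keeps the sketch fast
fit_rf <- train(bw ~ ., data = toy,
                method = "rf",
                trControl = ctrl,
                tuneGrid = rf_grid,
                ntree = 50)

# The mtry value with the best cross-validated accuracy
fit_rf$bestTune
```

`fit_rf$results` holds the cross-validated accuracy for each candidate `mtry`, which is useful for checking how sensitive the model is to that parameter.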
```{r random-forest-prep, echo=FALSE, include=FALSE, warning=FALSE}
```{r random_forest_prep, echo=FALSE, include=FALSE, warning=FALSE}
##############################################################################
## This is a better attempt using Random Forest to model the data set as
## "black" and "white" addresses only.
@ -761,64 +793,108 @@ cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
The confusion matrix for the test set shows excellent results, specifically in the areas of accuracy and precision.
```{r random-forest-output_test1, echo=FALSE}
```{r random-forest-output_test, echo=FALSE}
# Confusion matrix for test set
cm_test %>% as.matrix() %>% knitr::kable()
```
Here are the overall results...
```{r random-forest-output_test2, echo=FALSE}
cm2_test_set <- cm_test %>% as.matrix() %>%
knitr::kable(format = "latex", booktabs = TRUE)
# overall results
cm_test$overall %>% knitr::kable()
cm2_overall <- cm_test$overall %>%
knitr::kable(format = "latex", booktabs = TRUE)
# by class.
cm2_byClass <- cm_test$byClass %>%
knitr::kable(format = "latex", booktabs = TRUE)
# Confusion matrix for full ransomware set,
cm3_full_set <- cm_ransomware %>% as.matrix() %>%
knitr::kable(format = "latex", booktabs = TRUE)
# overall results
cm3_overall <- cm_ransomware$overall %>%
knitr::kable(format = "latex", booktabs = TRUE)
# by class.
cm3_byClass <- cm_ransomware$byClass %>%
knitr::kable(format = "latex", booktabs = TRUE)
```
Here are the results by class...
Here are the confusion matrices for the test set and the full set resulting from the Random Forest model, respectively.
```{r random-forest-output_test3, echo=FALSE}
```{r random-forest-confusion_matrices, echo=FALSE, results='asis'}
# Print both tables on one line
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{confusion matrix for test set}
\\centering",
cm2_test_set,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{confusion matrix for full set}",
cm3_full_set,
"\\end{minipage}
\\end{table}"
))
# by class.
cm_test$byClass %>% knitr::kable()
```
The confusion matrix for the full ransomware set is very similar to that of the test set.
Here is the confusion matrix for the full ransomware data set.
The overall statistics for both the test set and the full set are good.
```{r random-forest-output_big1, echo=FALSE}
```{r random-forest-overall_results, echo=FALSE, results='asis'}
# Confusion matrix for full ransomware set,
cm_ransomware %>% as.matrix() %>% knitr::kable()
# Print both tables on one line
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{test set overall results}
\\centering",
cm2_overall,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{full set overall results}",
cm3_overall,
"\\end{minipage}
\\end{table}"
))
```
Here are the big overall results....
Results by class for the test and full sets. What can you say about these, specifically?
```{r random-forest-output_big2, echo=FALSE}
```{r random-forest-results_by_class, echo=FALSE, results='asis'}
# overall results
cm_ransomware$overall %>% knitr::kable()
# Print both tables on one line
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{test set results by class}
\\centering",
cm2_byClass,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{full set results by class}",
cm3_byClass,
"\\end{minipage}
\\end{table}"
))
```
Here are the big set results by class....
This is a much quicker way of removing most of the *white* addresses, and will be used in the final composite model to save time.
```{r random-forest-output_big3, echo=FALSE}
### Method Part 2: Categorical SOMs
# by class.
cm_ransomware$byClass %>% knitr::kable()
```
### Method Part 2: Categorical SOMs to categorize predicted ransomware addresses.
Now we train a new model after throwing away all "white" addresses. The predictions from the Random Forest model are used to isolate all "black" addresses for further classification into ransomware addresses using SOMs.
Now we train a new model after throwing away all *white* addresses. The predictions from the Random Forest model are used to isolate all *black* addresses for further classification into ransomware families using SOMs. The reduced set is then categorized using a supervised SOM method with the 28 ransomware families as the target classification groups.
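The filtering step described above can be sketched in base R as follows. The data frame `addresses`, the vector `rf_prediction`, and the family names are hypothetical stand-ins; only the idea of keeping rows predicted as *black* comes from the text.

```r
# Hypothetical stand-in for the full data set with its true labels
addresses <- data.frame(
  address = paste0("addr", 1:6),
  label   = factor(c("white", "familyA", "white",
                     "familyB", "familyA", "white")),
  stringsAsFactors = FALSE
)

# Hypothetical Random Forest predictions for the same six rows
rf_prediction <- factor(c("white", "black", "white",
                          "black", "black", "black"),
                        levels = c("black", "white"))

# Keep only the rows predicted to be black and record the prediction
keep <- rf_prediction == "black"
black_addresses <- addresses[keep, ]
black_addresses$prediction <- rf_prediction[keep]

# Drop factor levels that no longer occur in the reduced set
black_addresses$label <- droplevels(black_addresses$label)
```

In this toy example the last row is truly *white* but predicted *black*, so it survives the filter; any such false positive would be passed on to the second stage.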
```{r soms-prep, echo=FALSE, include=FALSE}
@ -887,8 +963,7 @@ som_model2 <- xyf(train_mat, train_label,
grid = train_grid,
rlen = 100,
mode="pbatch",
cores = detectCores(), # Use all cores
# cores = detectCores() - 1, # Leave one core for system
cores = n_cores,
keep.data = TRUE
)
@ -898,7 +973,7 @@ test_list <- list(independent = test_mat, dependent = test_label)
# Generate predictions
ransomware_group.prediction <- predict(som_model2, newdata = test_list)
# Confusion Matrix
# Confusion matrix
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
test_set$label)
@ -909,27 +984,31 @@ When selecting the grid size for a Self Organizing Map, there are at least two d
A summary of the results for the categorization of black addresses into ransomware families follows. For the full table of predictions and statistics, see the Appendix.
Here are the overall results.
Here are the overall results of the final categorization.
```{r cm_overall, echo=FALSE}
# Overall section of the confusion matrix formatted through kable()
cm_labels$overall %>% knitr::kable()
cm_labels$overall %>% knitr::kable(caption="overall categorization results")
```
Here are the results by class.
Here are the final results by class.
```{r soms-output-by-class, echo=FALSE, size="tiny"}
# By Class section of the confusion matrix formatted through kable()
cm_labels$byClass %>% knitr::kable()
cm_labels$byClass %>% knitr::kable(caption="categorization results by class")
```
### Clustering Visualizations: Heatmaps and K-means clustering
\newpage
Here are some graphs, tell a bit more about them.
### Clustering Visualizations
Heatmaps and K-means clustering
Toroidal neural node maps are used to generate the models, and can be visualized in a number of ways.
```{r binary som graphs, echo=FALSE, fig.show="hold", out.width='35%'}
@ -983,7 +1062,7 @@ plot(som_model2, type = 'property', property = som_model2$codes[[1]][,6],
```
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model. Say a bit more about it here....
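As a rough illustration of the clustering step, the sketch below runs base R `kmeans()` on a random matrix that stands in for the SOM codebook (one row per map node, one column per feature). The matrix `codes`, the choice of four clusters, and the name `som_cluster_toy` are assumptions for the example; in the report the codebook would come from `som_model2$codes[[1]]`.

```r
set.seed(1)

# Stand-in for the SOM codebook: 100 nodes by 6 numeric features
codes <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6,
                dimnames = list(NULL, c("length", "weight", "count",
                                        "looped", "neighbors", "income")))

# Number of clusters is arbitrary here; it is a tuning choice in practice
n_clusters <- 4

# Several random starts make the result less dependent on initial centers
som_cluster_toy <- kmeans(codes, centers = n_clusters, nstart = 10)

# One cluster id per node
table(som_cluster_toy$cluster)
```

A vector of per-node cluster ids like `som_cluster_toy$cluster` is what gets passed to `add.cluster.boundaries()` to draw the boundaries on a plotted map.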
```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
@ -1038,12 +1117,12 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
This is a computer known for being slow and clunky. Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds. At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, but it does show that the script can be run on very modest hardware to completion.
#### Pine64 Quartz64
#### Pine64 Quartz64 Model A
- CPU: Rockchip RK3566 SoC aarch64 (64-bit quad-core ARM)
- RAM: DDR4 xxxxMB (8 GB)
- RAM: DDR4 8080MB (8 GB)
Single board computer / Development board. This was run to benchmark a modern 64-bit ARM processor. The script runs in about xxxx minutes on this platform, just for reference.
Single board computer / Development board. This was run to benchmark a modern 64-bit ARM processor. The script runs in about 860 seconds on this platform, roughly half the time required by the Atom processor above.
---
@ -1098,9 +1177,9 @@ Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christ
## Appendix:
### Categorical SOM ransowmare family prediction table and confusion matrix
### Categorical SOM prediction table and confusion matrix
Here are the full prediction results for the categorization of black addresses into ransomware families. It is assumed that all white addresses have already been removed.
Here are the full prediction results for the categorization of *black* addresses into ransomware families. It is assumed that all *white* addresses have already been removed.
```{r soms-output-table, echo=FALSE}
@ -1109,22 +1188,3 @@ Here are the full prediction results for the categorization of black addresses i
cm_labels
```
```{r toc, echo=FALSE}
#End timer
toc()
```
```{r empty block, echo=FALSE, include=FALSE}
##############################################################################
## Description of block goes here.
## Include notes and resources as necessary.
##############################################################################
# First comment goes here.
```

Binary file not shown.