---
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain using Random Forest and Self Organizing Maps
subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
          \vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "11/14/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. Many attempts towards this goal have not made use of sophisticated machine learning methods. Those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
keywords:
- Bitcoin
- blockchain
- ransomware
- machine learning
- Random Forest
- Self Organizing Maps
- SOMs
- cryptocurrency
output: pdf_document
header-includes:
- \usepackage{booktabs}
geometry: margin=2cm
---
\def\bitcoinA{%
  \leavevmode
  \vtop{\offinterlineskip %\bfseries
    \setbox0=\hbox{B}%
    \setbox2=\hbox to\wd0{\hfil\hskip-.03em
    \vrule height .3ex width .15ex\hskip .08em
    \vrule height .3ex width .15ex\hfil}
    \vbox{\copy2\box0}\box2}}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
def.chunk.hook  <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x,
                                              "\n\n \\normalsize"), x)
})


```

\newpage
&nbsp;
\vspace{25pt}
\tableofcontents

\newpage

## Introduction

  Ransomware attacks are of interest to security professionals, law enforcement, and financial regulatory officials.$^{[1]}$  The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location.  The victims (usually hospitals or other large organizations) come to learn that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address by a certain deadline to have the data decrypted or else it will all be deleted automatically.

  The deeper legal and financial implications of ransomware attacks are inconsequential to the work in this report, as we are merely interested in being able to classify Bitcoin addresses by their connection to ransomware transactions. Many researchers are already tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain in order to minimize financial losses.  Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$  For example, consider a ransomware attack conducted against an illegal darknet market site. The news of such an attack might not be announced at all, to prevent loss of trust among its users. By analyzing the global transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. Such a flag may, in fact, be the first public notice of the event. Any suspicious addresses could then be blacklisted or banned from using other services, if so desired.

  Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results.  In that paper, Akcora, et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as *white*.  The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*.  Edges are formed between the nodes when a transaction can be associated with a particular address.

  Any given address on the Bitcoin network may appear many times, possibly with different inputs and outputs each time.  The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference, allowing for variables to be defined in a specific and meaningful way.  For example, *speed* can be defined as the number of blocks the coin appears in during a 24-hour period, and provides information on how quickly a coin moves through the network. *Speed* may be an indicator of money laundering or "coin mixing", as typical payments only involve a limited number of addresses in a given 24 hour period, and thus have lower *speeds* when compared to "mixed" coins.  The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.

 With the graph specified as such, the following six numerical features$^{[3]}$ are associated with a given address:

   1)  *Income* - the total amount of bitcoins sent to an address

   2)  *Neighbors* - the number of transactions that have this address as one of their output addresses

   3)  *Weight* - the sum of the fractions of bitcoins that reach this address from transactions that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"

   4)  *Length* - the number of non-starter transactions on its longest chain, where a chain is defined as an acyclic directed path originating from any starter transaction and ending at the address in question

   5)  *Count* - the number of starter transactions connected to this address through a chain

   6)  *Looped* - the number of starter transactions connected to this address by more than one path

These variables are defined somewhat conceptually, by viewing the blockchain as a topological graph with nodes and edges.  The rationale behind this approach is to facilitate quantification of specific transaction patterns. Akcora, et al.$^{[3]}$ give a thorough explanation of how and why these features were chosen.  We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that.  Machine learning methods will be applied to the original data set from the same paper, and the new results will be compared to the original ones.
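
To make the two simplest definitions concrete, the following is a minimal sketch on a made-up table of transaction outputs. The `outputs` data frame, its column names, and its values are hypothetical and are not part of the BitcoinHeist data; only *income* and *neighbors* are illustrated, since the remaining features require the full starter-transaction graph.

```r
# Hypothetical toy data: each row is one output of one transaction
outputs <- data.frame(
  tx      = c("t1", "t2", "t3", "t3", "t4"),
  address = c("a1", "a1", "a1", "a2", "a2"),
  amount  = c(0.5, 1.25, 2.0, 0.75, 1.0),
  stringsAsFactors = FALSE
)

# Income: total amount of bitcoins sent to each address
income <- tapply(outputs$amount, outputs$address, sum)

# Neighbors: number of distinct transactions that list the address as an output
neighbors <- tapply(outputs$tx, outputs$address,
                    function(x) length(unique(x)))

toy_features <- data.frame(
  address   = names(income),
  income    = as.numeric(income),
  neighbors = as.integer(neighbors[names(income)])
)
print(toy_features)
```

In this toy example address `a1` receives outputs from three transactions and `a2` from two, so the two addresses differ in both features.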

### Data

   The data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$  as suggested in the project instructions.  The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining for them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term.  This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).  The data set was downloaded and the exploration began.

```{r install_load_libraries-and-download_data, echo=FALSE, include=FALSE}

# Set the repository mirror to “0-Cloud” for maximum availability
r = getOption("repos")
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)
rm(r)

# Install necessary packages if not already present
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(randomForest)) install.packages("randomForest")
if(!require(kohonen)) install.packages("kohonen")
if(!require(parallel)) install.packages("parallel")
if(!require(matrixStats)) install.packages("matrixStats")
if(!require(xtable)) install.packages("xtable")
if(!require(tictoc)) install.packages("tictoc")

# Load Libraries
library(tidyverse)
library(caret)
library(randomForest)
library(kohonen)
library(parallel)
library(matrixStats)
library(xtable)
library(tictoc)

# Set number of cores, use detectCores() - 1 to leave one for the system
n_cores <- detectCores()

# Download data
url <-
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if(!dir.exists("data"))dir.create("data")
if(!file.exists(dest_file))download.file(url, destfile = dest_file)

# Unzip as CSV
if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file,
                                                   "BitcoinHeistData.csv",
                                                   exdir="data")

# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")

# Define custom tictoc messages
# tic() message
my.msg.tic <- function(tic, msg)
{
   # No end time exists when a timer starts, so only announce the start
   if (is.null(msg) || length(msg) == 0 || is.na(msg))
   {
      outmsg <- "Starting timer..."
   }
   else
   {
      outmsg <- paste("Starting ", msg, "...", sep="")
   }
   outmsg
}
# toc() message
my.msg.toc <- function(tic, toc, msg, info)
{
   if (is.null(msg) || is.na(msg) || length(msg) == 0)
   {
      outmsg <- paste(round(toc - tic, 3), " seconds elapsed", sep="")
   }
   else
   {
      outmsg <- paste(info, ": ", msg, ": ",
                   round(toc - tic, 3), " seconds elapsed", sep="")
   }
}


```

A summary of the data set shows the range of values and size of the sample.  Some of the features, such as *weight* for example, already appear to be very skewed just from the quartiles.  In the case of *weight*, the third quartile is only `r quantile(ransomware$weight, 0.75)`, meaning that 75% of the data is at or below this value for *weight* (with a minimum of  `r min(ransomware$weight)`).  The maximum *weight* value, however, is `r max(ransomware$weight)`.  This means that nearly the entire range of values occurs in the upper 25%.  In fact, many of the numerical features are similarly skewed, as you can see in the following summary.

```{r data_summary, echo=FALSE, size="tiny"}

# Summary
ransomware %>% select(-address, -label) %>% summary() %>% knitr::kable(caption="Summary of data set")

```
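
The quartile argument above can be reproduced on synthetic data. The sketch below assumes a log-normal sample as a stand-in for a right-skewed feature; the sample, the seed, and the variable names are illustrative and do not come from the data set.

```r
# Synthetic right-skewed sample (an assumption, not the real weight column)
set.seed(1)
x <- rlnorm(10000, meanlog = 0, sdlog = 2)

# Five-number summary, as reported by summary()
q <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
print(q)

# Share of the total range that lies above the third quartile
upper_share <- (max(x) - q[["75%"]]) / (max(x) - min(x))
print(upper_share)
```

When `upper_share` is close to 1, almost the entire range of values is occupied by the top quarter of the observations, which is the pattern described for *weight*.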

This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain.  The ten features include *address* as a unique identifier, the six numerical features defined previously (*income, neighbors, weight, length, count, looped*), two temporal features in the form of *year* and *day* (day of the year as an integer from 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (i.e. not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$. A listing of the first ten rows provides a sample of the features associated with each observation.

```{r data_head, echo=FALSE, size="tiny"}

# Inspect data
ransomware %>% head(10) %>%
  knitr::kable(caption="First ten entries of data set")

```

The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. *White* Bitcoin addresses were capped at one thousand per day, whereas the entire network sees up to 800,000 addresses daily.$^{[5]}$
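
The two filtering rules described above can be sketched on toy data. This is only an illustration under stated assumptions: the `edges` and `white` data frames are hypothetical, the cap is lowered from one thousand to two so that it has a visible effect, and taking the first rows per day is a stand-in, since the source does not say how the capped addresses were chosen.

```r
# Hypothetical transfers: day index and amount in bitcoins
edges <- data.frame(
  day    = c(1, 1, 1, 2, 2),
  amount = c(0.10, 0.30, 2.50, 0.05, 1.00)
)

# Drop edges that transfer less than the threshold
min_amount <- 0.3
kept_edges <- edges[edges$amount >= min_amount, ]

# Hypothetical white addresses observed on each day
white <- data.frame(
  day     = c(1, 1, 1, 2),
  address = c("w1", "w2", "w3", "w4"),
  stringsAsFactors = FALSE
)

# Keep at most `cap` rows per day (first rows used as a stand-in for sampling)
cap_per_day <- function(df, cap) {
  do.call(rbind, lapply(split(df, df$day), head, n = cap))
}
white_capped <- cap_per_day(white, cap = 2)

print(kept_edges)
print(white_capped)
```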

### Goal

  The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing a practical predictive model for categorizing ransomware addresses with an acceptable degree of accuracy.  Increasing the precision, while not strictly necessary for the purposes of the project, would be a notable sign of success.

###  Outline of Steps Taken

1. Analyze data set numerically and visually, look for insights in any patterns.
2. Binary separation using Self Organizing Maps.
3. Faster binary separation using Random Forest.
4. Categorical classification using Self Organizing Maps.
5. Visualize clustering to analyze results further.
6. Generate confusion matrix to quantify results.

## Data Analysis

### Hardware Specification

   All of the analysis in this report was conducted on a single laptop computer, a **Lenovo Yoga S1** from late 2013 with the following specifications.

   - CPU:  Intel i7-4600U @ 3.300GHz (4th Gen dual-core, four-thread i7 `x86_64`)
   - RAM:  8217MB DDR3L @ 1600 MHz  (8 GB)
   - OS:   Slackware64-current (15.0 RC1) `x86_64-slackware-linux-gnu` (64-bit GNU/Linux)
   - R version 4.0.0 (2020-04-24) -- "Arbor Day" (built from source using scripts from [slackbuilds.org](https://slackbuilds.org/))
   - RStudio Version 1.4.1106 "Tiger Daylily" (2389bc24, 2021-02-11) for CentOS 8 (converted using `rpm2tgz`)

###  Data Preparation

  It is immediately apparent that this is a rather large data set.  The usual practice of partitioning out 80% to 90% of the data for training results in a training set that is too large to process given the hardware limitations.  For reasons that are no longer relevant, the original data set was first split in half, with 50% reserved as the *validation set* and the other 50% used as the *working set*.  This working set was again split in half to give a *training set* and a *test set*, each of a reasonable size to deal with.  Since these partitions were small enough to work with, the partition size ratio was not further refined; a better partitioning scheme is a potential area for later optimization.  Careful sampling was carried out to ensure that the ransomware groups were represented in each sample as much as possible.

```{r data_prep, echo=FALSE, include=FALSE}

# Turn labels into factors, "bw" is binary factor for ransomware/non-ransomware
ransomware <- ransomware %>%
  mutate(label=as.factor(label),
         bw=as.factor(ifelse(label=="white", "white", "black")))

# Validation set made from 50% of BitcoinHeist data, for RAM considerations
test_index <- createDataPartition(y = ransomware$bw,
                                  times = 1, p = .5, list = FALSE)

workset <- ransomware[-test_index,]
validation <- ransomware[test_index,]

# Split the working set into a training set and a test set @ 50%, RAM dictated
test_index <- createDataPartition(y = workset$bw,
                                  times = 1, p = .5, list = FALSE)

train_set <- workset[-test_index,]
test_set <- workset[test_index,]

# Find proportion of full data set that is ransomware
ransomprop <- mean(ransomware$bw=="black")

# Check for NAs
no_nas <- sum(is.na(ransomware))

```
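
The nested 50/50 partitioning can be sketched on a small synthetic table. The helper `stratified_index()` below is a hypothetical base R stand-in for class-stratified sampling; it is not the `caret` function used in this report, and the toy data frame and seed are assumptions made for illustration.

```r
set.seed(42)

# Synthetic data with the same kind of class imbalance (2% "black")
toy <- data.frame(
  id = 1:1000,
  bw = factor(c(rep("black", 20), rep("white", 980)))
)

# Sample a fraction p of the row indices within each class
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(i) {
    i[sample.int(length(i), round(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# First split: validation set versus working set
val_idx    <- stratified_index(toy$bw, 0.5)
validation <- toy[val_idx, ]
workset    <- toy[-val_idx, ]

# Second split: test set versus training set
test_idx  <- stratified_index(workset$bw, 0.5)
test_set  <- workset[test_idx, ]
train_set <- workset[-test_idx, ]

# Partition sizes and the share of "black" rows in each partition
sapply(list(validation = validation, test = test_set, train = train_set),
       function(d) c(rows = nrow(d), black_share = mean(d$bw == "black")))
```

Because sampling is done within each class, every partition keeps the same proportion of *black* rows as the full toy table.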

### Exploration and Visualization

```{r cv_calcs, echo=FALSE}

# Keep only numeric columns, ignoring temporal features
ransomware_num <- ransomware %>%
  select(income, neighbors, weight, length, count, looped)

# Check for variation across numerical columns using coefficients of variation
#
# Calculate standard deviations for each column
sds <- ransomware_num %>% as.matrix() %>% colSds()

# Calculate means for each column
means <- ransomware_num %>% as.matrix() %>% colMeans()

# Calculate CVs for each column
coeff_vars <- sds / means

#  Select the two features with the highest coefficients of variation
selected_features <- names(sort(coeff_vars, decreasing=TRUE))[1:2]

#Sample every 100th row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]

# Keep only numeric columns with highest coefficients of variation
train_num <- train_samp %>% select(selected_features[1], selected_features[2])

# Binary labels, black = ransomware, white = non-ransomware, train set
train_bw <- train_samp$bw

# Sample every 100th row due to memory constraints to make test sample same size
test_samp <- test_set[seq(1, nrow(train_set), 100), ]

# Dimension reduction again, selecting features with highest CVs
test_num <- test_samp %>% select(selected_features[1], selected_features[2])

# Binary labels for test set
test_bw <- test_samp$bw


# Summarize ransomware family membership
labels <- ransomware$label  %>% summary()


```


The proportion of ransomware addresses in the original data set is `r ransomprop`. Thus, they make up less than 2% of all observations.  This presents a challenge as the target observations are sparse within the data set, especially when we consider that this small percentage is then further divided into 28 subsets.  In fact, some of the ransomware groups have only a single member, making categorization a dubious task.

The total number of `NA` or missing values in the original data set is `r no_nas`. At least there are no missing values to worry about.  The original data set is clean in that sense.

A listing of all ransomware families in the full original data set, plus a member count for each family, is shown in Table 3.  As can be seen, `r length(unname(labels)[unname(labels)<10])` of the 28 families have fewer than 10 addresses associated with them.  We shall keep this in mind for later.

```{r ransomware_families, echo=FALSE}

# Print ransomware family summary table
knitr::kable(list(labels[1:10], labels[11:20], labels[21:29]),
             caption="Ransomware families and membership counts",
             booktabs = TRUE,
             format = "latex",
             col.names = c("n") )

```

We can take a look at the overall distribution of the different features.  The temporal features have been left out. Those plots are essentially flat due to the capped nature of the address collection, making each day of the year equally represented across the set. The skewed nature of the non-temporal features causes the plots to look better on a log$_2$ scale $x$-axis.

```{r histograms, echo=FALSE, warning=FALSE, fig.align="center"}
########################################################
## Histograms of each of the columns to show skewness
## Plot histograms for each column using facet wrap
########################################################

# Remove non-numerical and temporal columns to look for patterns in
# topologically defined features
train_hist <- train_samp %>% select(-address, -label, -bw, -day, -year)

# Apply pivot_longer function to facilitate facet wrapping
train_long <- train_hist %>%
  pivot_longer(colnames(train_hist)) %>%
  as.data.frame()

# Log scale on value axis,
histograms <- ggplot(train_long, aes(x = value)) +
  geom_histogram(aes(y = ..density..), bins=20) +
  geom_density(col = "green", size = .5) +
  scale_x_continuous(trans='log2') +
  facet_wrap(~ name, scales = "free") +
  ggtitle("Histograms and density plots for non-temporal features")

histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))


```
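
The effect of the log$_2$ scale can be checked numerically. The following sketch uses a synthetic log-normal sample as an assumed stand-in for a skewed feature and compares how much of the sample falls into the fullest bin on each scale; the names and the seed are illustrative only.

```r
# Synthetic right-skewed sample (an assumption, not the real data)
set.seed(7)
x <- rlnorm(5000, meanlog = 0, sdlog = 3)

# Bin counts on the raw scale and on the log2 scale, without plotting
raw_counts <- hist(x, breaks = 20, plot = FALSE)$counts
log_counts <- hist(log2(x), breaks = 20, plot = FALSE)$counts

# Fraction of all observations that land in the single fullest bin
raw_top_share <- max(raw_counts) / length(x)
log_top_share <- max(log_counts) / length(x)

print(c(raw = raw_top_share, log2 = log_top_share))
```

On the raw scale nearly every observation collapses into one bin, whereas on the log$_2$ scale the observations are spread across many bins, which is why the transformed plots are easier to read.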

We can easily compare the relative spread of each feature by calculating the coefficient of variation for each column.  Larger coefficients of variation indicate larger relative spread compared to other columns.  A listing of the coefficients of variation for the non-temporal features is shown in Table 4.
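
As a minimal sketch of the calculation, the coefficient of variation of a column is its standard deviation divided by its mean. The toy matrix below is made up for illustration and uses base R only.

```r
# Two hypothetical columns with very different relative spread
toy <- cbind(
  steady = c(9, 10, 11, 10, 10),
  skewed = c(1, 1, 1, 1, 96)
)

# Coefficient of variation: standard deviation divided by mean, per column
cv <- apply(toy, 2, sd) / colMeans(toy)

# Columns ordered from largest to smallest relative spread
sort(cv, decreasing = TRUE)
```

The `skewed` column, where a single large value dominates, has a much larger coefficient of variation than the `steady` column, even though the coefficient has no units and can be compared across columns measured on different scales.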

```{r coefficients_of_variation, echo=FALSE}

# Summarize CV results in a table
knitr::kable(
    list(coeff_vars[1:2], coeff_vars[3:4], coeff_vars[5:6]),
  format = "latex", booktabs = TRUE, caption="Coefficients of Variation",
  col.names = c("CV") )

```

From this, it appears that `r selected_features[1]` has the widest range of variability, followed by `r selected_features[2]`.  These are also the features that are most strongly skewed to the right, meaning that a few addresses have really high values for each of these features while the bulk of the data set has very low values.

Taking the feature with the highest variation,  `r selected_features[1]`, we can take a look at the distribution for individual ransomware families to see if there is a similarity across families. This can be done for all the features, but we will focus on  `r selected_features[1]` in the interest of saving space.  Since it has the highest coefficient of variation, its distribution plots show the most variation, so it is a good one to focus on.


```{r variation_histograms, echo=FALSE, fig.height=2, fig.width=2.5, fig.show="hold", out.width='35%', warning=FALSE}

# Density plots of the feature with highest variation
selected_feature1 <- selected_features[1]

# Make sure the selected feature column is stored as numeric
ransomware_big_families <- ransomware %>%
  mutate(across(all_of(selected_feature1), as.numeric))

# Note: Putting these graphs into a for loop breaks some of the formatting.
# Low membership makes some of the graphs not very informative
# Relatively meaningless graphs have been left out to save time and space.
# Label numbers correspond to the ransomware families listed previously.

# Label 1
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[1]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[1]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 4
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[4]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[4]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 5
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[5]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[5]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 6
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[6]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[6]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 7
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[7]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[7]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 8
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[8]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[8]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 10
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[10]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[10]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 11
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[11]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[11]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 12
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[12]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[12]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 13
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[13]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[13]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 14
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[14]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[14]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 15
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[15]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[15]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 16
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[16]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[16]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 18
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[18]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[18]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 20
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[20]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[20]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 22
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[22]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[22]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 23
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[23]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[23]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 24
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[24]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[24]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 27
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[27]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[27]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 28
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[28]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[28]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())
# Label 29
ransomware_big_families %>%
  filter(label==levels(ransomware_big_families$label)[29]) %>%
  select(income) %>%
  ggplot(aes(x=income,  y = ..density..)) +
  geom_density(col = "green")+
  ggtitle(levels(ransomware_big_families$label)[29]) +
  scale_x_continuous(trans='log2')  +
  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
        plot.title = element_text(size = 9, face = "bold"),
        axis.title.x=element_blank(),
        axis.title.y=element_blank())

```

It appears that the  `r selected_features[1]` distribution for ransomware groups not only differs from the distribution pattern for *white* addresses, but also varies from group to group.  This makes it a good feature to use in the training of the models.

### Insights gained from exploration

  After visually and numerically exploring the data, it becomes clear what the challenge is.  Ransomware-related addresses are very sparse, comprising `r ransomprop*100`% of all addresses.  This small percentage is also further classified into 28 groups.  Perhaps the original paper was overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses.  To simplify our approach, we will categorize the addresses in a binary way: as either *white* or *black*, where *black* signifies an association with ransomware transactions.  Asking this as a "ransomware or not-ransomware" question allows for application of methods that have been shown to be impractical otherwise.
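
The binary relabeling can be sketched in a few lines. The label vector below is a small hypothetical example, and the two family names are used only as illustrative values.

```r
# Hypothetical labels: mostly "white", plus two illustrative family names
label <- factor(c("white", "white", "montrealCryptoLocker", "white",
                  "paduaCryptoWall", "white", "white", "white"))

# Collapse every ransomware family into a single "black" class
bw <- factor(ifelse(label == "white", "white", "black"))

# Class counts and the proportion of ransomware addresses
table(bw)
mean(bw == "black")
```

Whatever the number of families in `label`, the resulting factor `bw` has only two levels, which is what turns the problem into a "ransomware or not-ransomware" question.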

## Modeling approach

  Akcora, et al. applied a Random Forest approach to the data; however "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".$^{[3]}$ Considering all ransomware addresses as belonging to a single group may help to improve the predictive power of such methods, making Random Forest worth another try.

  The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other.  Searching for *topo* in the documentation for the `caret` package$^{[6]}$ resulted in the entry for Self Organizing Maps (SOMs), supplied by the `kohonen` package.$^{[11]}$  The description at CRAN$^{[7]}$ was intriguing enough to merit further investigation.

  Initially, the categorization of ransomware into the 29 different families (including *white*) was attempted using SOMs.  This proved to be very resource intensive, requiring more time and RAM than was available.  Although it did help to illuminate how SOMs are configured, the resource requirements of the algorithm became a deterrent.  It was at this point that the SOMs were applied in a binary way, classifying all ransomware addresses as merely *black*, initially in an attempt to simply get the algorithm to run to completion without error.  This reduced RAM usage to the point of being feasible on the available hardware.

   Self Organizing Maps were not covered in the coursework at any point, therefore a familiar method was sought out to compare the results to.  Random Forest was chosen and applied to the data set in a binary way, classifying every address as either *white* or *black*, ignoring the ransomware families.  Surprisingly, not only did the Random Forest approach result in an acceptable model, it did so much quicker than expected, taking only a few minutes to produce results.

   It was very tempting to leave it there and write up a comparison of the two approaches to the binary problem by classifying all ransomware related addresses as *black*.  However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families.  Given the high accuracy and precision of the binary Random Forest approach, the sparseness of the ransomware in the larger set has been mostly eliminated, along with many of the chances for false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method does not seem to produce many false positives (if any), meaning it never seems to predict a truly *white* address as being *black*.  Hence, by applying the Random Forest method first, we have effectively filtered out any possibility of false positives by correctly identifying a very large set of purely *white* addresses, which are then removed from the set.  The best model used in the original paper by Akcora, et al. resulted in more false positives than true positives.  This low precision rate is what made it impractical for real-world usage.$^{[3]}$

   All of these factors combined to inspire a two-part method: first to separate the addresses into *black* and *white* groups, and then to further classify the *black* addresses into ransomware families.  We shall explore each of these steps separately.

### Method Part 0:  Binary SOMs

The first working model that ran to completion without exhausting computer resources ignored the ransomware family labels and instead used the two categories of *black* and *white*.  The `kohonen` package provides algorithms for both unsupervised and supervised model building, using Self Organizing Maps and Super Organizing Maps respectively.$^{[11]}$  A supervised approach was used since the data set includes information about the membership of ransomware families that can be used to train the model.

```{r binary_SOMs}
##############################################################################
## This is a first attempt using SOMs to model the data set as "black" and
## "white" addresses only.
##
## NOTE:  This is the most computationally heavy part of the paper and takes
## several hours to run to completion.  It is also completely optional, only
## used to compare with the quicker method. If, for some reason, you want to
## compile the report without this section, you can just comment it all out
## or remove it because nothing is needed from Method Part 0 for any of the
## other methods.  In other words, it can be safely skipped if you are short on
## time or RAM.
##############################################################################

# Start timer
tic("Binary SOMs", quiet = FALSE, func.tic = my.msg.tic)

# Keep only numeric columns, ignoring dates and looped
som1_train_num <- train_set %>% select(length, weight, count, neighbors, income)

# SOM function can only work on matrices
som1_train_mat <- as.matrix(scale(som1_train_num))

# Select the same numeric columns from the test set
som1_test_num <- test_set %>% select(length, weight, count, neighbors, income)

# Note that when we rescale our testing data we need to scale it
# according to how we scaled our training data
som1_test_mat <-
  as.matrix(scale(som1_test_num, center = attr(som1_train_mat, "scaled:center"),
                  scale = attr(som1_train_mat, "scaled:scale")))

# Binary outputs, black=ransomware, white=non-ransomware, train set
som1_train_bw <- train_set$bw %>% classvec2classmat()

# Same for test set
som1_test_bw <- test_set$bw %>% classvec2classmat()

# Create Data list for supervised SOM
som1_train_list <-
  list(independent = som1_train_mat, dependent = som1_train_bw)

############################################################################
## Calculate ideal grid size according to:
## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################

# Formulaic method 1
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2
#grid_size = ceiling(sqrt(length(unique(ransomware$bw))))

# Create SOM grid
som1_train_grid <-
  somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)

## Now build the model
som_model1 <- xyf(som1_train_mat, som1_train_bw,
                 grid = som1_train_grid,
                 rlen = 100,
                 mode="pbatch",
                 cores = n_cores,
                 keep.data = TRUE
)

# Now test predictions
som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)

ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)

# Confusion matrix
som1_cm_bw <-
  confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)

# Now test predictions of validation set

# Select the same numeric columns from the validation set
valid_num <- validation %>% select(length, weight, count, neighbors, income)

# Note that when we rescale our validation data we need to scale it
# according to how we scaled our training data
valid_mat <-
  as.matrix(scale(valid_num, center = attr(som1_train_mat,  "scaled:center"),
                  scale = attr(som1_train_mat, "scaled:scale")))

valid_bw <- validation$bw

valid_list <- list(independent = valid_mat, dependent = valid_bw)

# Requires up to 16GB of RAM, skip if resources are limited
ransomware.prediction1.validation <- predict(som_model1, newdata = valid_list)

# Confusion matrix
cm_bw.validation <-
  confusionMatrix(ransomware.prediction1.validation$prediction[[2]],
                  validation$bw)

# End timer
toc(quiet = FALSE, func.toc = my.msg.toc, info = "Run Time")

```
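
The rescaling step in the chunk above reuses the centre and scale of the training matrix so that test and validation rows land in the same coordinate system as the trained map. The following is a minimal sketch of that idea on invented data, using only base R; `toy_train` and `toy_test` are hypothetical stand-ins for the numeric feature tables.

```r
set.seed(1)

# Invented stand-ins for the numeric training and test features
toy_train <- data.frame(income = rexp(100, rate = 1e-8), count = rpois(100, 5))
toy_test  <- data.frame(income = rexp(20, rate = 1e-8),  count = rpois(20, 5))

# Scale the training data and keep the centre and scale it was given
toy_train_mat <- scale(as.matrix(toy_train))
train_center  <- attr(toy_train_mat, "scaled:center")
train_scale   <- attr(toy_train_mat, "scaled:scale")

# Apply the *training* centre and scale to the test data
toy_test_mat <- scale(as.matrix(toy_test),
                      center = train_center, scale = train_scale)

# The same transformation written out by hand, for comparison
by_hand <- sweep(sweep(as.matrix(toy_test), 2, train_center, "-"),
                 2, train_scale, "/")
max_diff <- max(abs(toy_test_mat - by_hand))
```

Scaling the test data with its own mean and standard deviation instead would shift every test row relative to the codebook vectors learned from the training data.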


After training the model, we obtain the confusion matrices for the test set and the validation set, separately.  As you can see in Tables 5 and 6, the results are very good in both cases.

```{r binary_SOM_results, echo=FALSE, results='asis' }


cm1_test_set <- som1_cm_bw %>% as.matrix() %>%
  knitr::kable(format = "latex", booktabs = TRUE)

cm1_validation_set <- cm_bw.validation %>% as.matrix() %>%
  knitr::kable(format = "latex", booktabs = TRUE)


cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{Test set confusion matrix}
      \\centering",
        cm1_test_set,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{Validation set confusion matrix}",
        cm1_validation_set,
    "\\end{minipage}
\\end{table}"
))

```


This is a very intensive method compared to what follows. It was left out of the final version of the script and has been included here only for model comparison and to track developmental evolution.

### Method Part 1:  Binary Random Forest

A Random Forest model is trained using ten-fold cross validation and a tuning grid in which the number of variables randomly sampled as candidates at each split (`mtry`) takes the values $\{2, 4, 6, 8, 10, 12\}$, with each value evaluated in turn.

```{r random_forest, warning=FALSE}
##############################################################################
## This is a better attempt using Random Forest to model the data set as
## "black" and "white" addresses only.
##############################################################################

# Start timer
tic("Random Forest", quiet = FALSE, func.tic = my.msg.tic)

# Cross Validation, ten fold
control <- trainControl(method="cv", number = 10)

# Control grid with variation on mtry
grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))

# Run Cross Validation using control and grid set above
rf_model <- train(train_num, train_bw, method="rf",
                  trControl = control, tuneGrid=grid)

# Supervised fit of model using the cross-validated value of mtry
# (randomForest() has no minNode argument, so the tuned value is passed as mtry)
fit_rf <- randomForest(train_samp, train_bw,
                       mtry = rf_model$bestTune$mtry)

# Measure accuracy of model against test sample
y_hat_rf <- predict(fit_rf, test_samp)
cm_test <- confusionMatrix(y_hat_rf, test_bw)


# Measure accuracy of model against full ransomware set
ransomware_y_hat_rf <- predict(fit_rf, ransomware)
cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)

# End timer
toc(quiet = FALSE, func.toc = my.msg.toc, info = "Run Time")

```


The confusion matrix for the test set shows very good results, specifically in the areas of accuracy and precision.  Although not as good as the SOM model used previously, the results are good enough to justify the time saved.

```{r random-forest-output_test, echo=FALSE}

# Confusion matrix for test set
cm2_test_set <- cm_test %>% as.matrix() %>%
  knitr::kable(format = "latex", booktabs = TRUE)

# overall results
cm2_overall <- cm_test$overall %>%
  knitr::kable(format = "latex", booktabs = TRUE,
               col.names=c("score"))

# by class.
cm2_byClass <- cm_test$byClass %>%
  knitr::kable(format = "latex", booktabs = TRUE,
               col.names=c("score"))


# Confusion matrix for full ransomware set,
cm3_full_set <- cm_ransomware %>% as.matrix() %>%
  knitr::kable(format = "latex", booktabs = TRUE)

# overall results
cm3_overall <- cm_ransomware$overall %>%
  knitr::kable(format = "latex", booktabs = TRUE,
               col.names=c("score"))

#  by class.
cm3_byClass <- cm_ransomware$byClass %>%
  knitr::kable(format = "latex", booktabs = TRUE,
               col.names=c("score"))


```

Tables 7 and 8 show the confusion matrices for the test set and the full set resulting from the Random Forest model, respectively.  Note the absence of false positives (upper right hand corners), meaning that no truly *white* addresses were predicted to be *black*.  The converse is not necessarily true: a few truly *black* addresses get marked as *white* (lower left hand corners).
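
The cell positions can be made concrete with a small sketch. It assumes the layout that `caret::confusionMatrix()` prints, with predictions in rows and reference values in columns, and *black* as the first (positive) level. The counts are invented for illustration and are not results from this report.

```r
# Invented counts; matrix() fills column by column
cm <- matrix(c(40,   3,    # reference "black": predicted black, predicted white
                0, 957),   # reference "white": predicted black, predicted white
             nrow = 2,
             dimnames = list(Prediction = c("black", "white"),
                             Reference  = c("black", "white")))

tp <- cm["black", "black"]  # upper left:  black predicted as black
fp <- cm["black", "white"]  # upper right: white predicted as black
fn <- cm["white", "black"]  # lower left:  black predicted as white
tn <- cm["white", "white"]  # lower right: white predicted as white

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
accuracy  <- (tp + tn) / sum(cm)
```

With these toy counts the precision is exactly one because the upper right cell is empty, while the three addresses in the lower left cell lower the recall.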

```{r random-forest-comfusion_matrices, echo=FALSE, results='asis'}

# Print both tables on one line
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{Test set confusion matrix}
      \\centering",
        cm2_test_set,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{Full set confusion matrix}",
        cm3_full_set,
    "\\end{minipage}
\\end{table}"
))


```

Tables 9 and 10 show the overall accuracy statistics, including the accuracy confidence intervals, for the test set and the full set, respectively.

```{r random-forest-overall_results, echo=FALSE, results='asis'}

# Print both tables on one line
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{Test set accuracy}
      \\centering",
        cm2_overall,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{Full set accuracy}",
        cm3_overall,
    "\\end{minipage}
\\end{table}"
))

```

Tables 11 and 12 show the results by class for each set.

```{r random-forest-results_by_class, echo=FALSE, results='asis'}

# Print both tables on one line
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{Test set results}
      \\centering",
        cm2_byClass,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{Full set results}",
        cm3_byClass,
    "\\end{minipage}
\\end{table}"
))

```

As can be seen from these results, Random Forest is a much quicker way of removing most of the *white* addresses, while providing a comparable level of accuracy and precision.  This method will be used in the final composite model to save time.

### Method Part 2:  Categorical SOMs

Now we train a new model after removing all *white* addresses.  The predictions from the Random Forest model are used to isolate all *black* addresses.  The reduced set is then categorized using a supervised SOM method with the 28 ransomware families as the target classification groups.

```{r soms-families, warning=FALSE}

# Start timer
tic("Categorical SOMs", quiet = FALSE, func.tic = my.msg.tic)

# Now use this prediction to reduce the original set to only "black" addresses
# First append the full set of predictions to the original set
ransomware$prediction <- ransomware_y_hat_rf

# Filter out all the predicted "white" addresses,
# leaving only predicted "black" addresses
black_addresses <- ransomware %>% filter(prediction=="black")

# Split the reduced black-predictions into a training set and a test set @ 50%
test_index <- createDataPartition(y = black_addresses$prediction,
                                  times = 1, p = .5, list = FALSE)

train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]

# Keep only numeric columns, ignoring temporal variables
train_num <- train_set %>%
  select(income, neighbors, weight, length, count, looped)

# SOM function can only work on matrices
train_mat <- as.matrix(scale(train_num))

# Select non-temporal numerical features only
test_num <- test_set %>%
  select(income, neighbors, weight, length, count, looped)

# Testing data is scaled according to how we scaled our training data
test_mat <- as.matrix(scale(test_num,
                            center = attr(train_mat, "scaled:center"),
                            scale = attr(train_mat, "scaled:scale")))

# Categorical labels for training set
train_label <- train_set$label %>% classvec2classmat()

# Same for test set
test_label <- test_set$label %>% classvec2classmat()

# Create data list for supervised SOM
train_list <- list(independent = train_mat, dependent = train_label)

############################################################################
## Calculate ideal grid size according to:
## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################

# Formulaic method 1, makes a larger graph in this case
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))

# Based on categorical number, method 2, smaller graph with fewer cells
#grid_size = ceiling(sqrt(length(unique(ransomware$label))))

# Create SOM grid
train_grid <- somgrid(xdim=grid_size, ydim=grid_size,
                      topo="hexagonal", toroidal = TRUE)

## Now build the SOM model using the supervised method xyf()
som_model2 <- xyf(train_mat, train_label,
                  grid = train_grid,
                  rlen = 100,
                  mode="pbatch",
                  cores = n_cores,
                  keep.data = TRUE
)

# Now test predictions of test set, create data list for test set
test_list <- list(independent = test_mat, dependent = test_label)

# Generate predictions
ransomware_group.prediction <- predict(som_model2, newdata = test_list)

# Confusion matrix
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
                             test_set$label)

# End timer
toc(quiet = FALSE, func.toc = my.msg.toc, info = "Run Time")

```

When selecting the grid size for a Self Organizing Map, there are at least two different schools of thought. The two that were tried here are explained (with supporting documentation) on a ResearchGate$^{[8]}$ forum.  The first method is based on the size of the training set, and in this case results in a larger, more accurate map.  The second method is based on the number of known categories to classify the data into, and in this case results in a smaller, less accurate map.  For this script, a grid size of `r grid_size` has been selected.
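
To give a feel for how far apart the two heuristics can land, here is a minimal sketch using the same two formulas as the chunk above. The training set size `n_train` is a made-up round number, not the size of the actual training set.

```r
n_train   <- 10000  # hypothetical number of training rows
n_classes <- 29     # 28 ransomware families plus "white"

# Method 1: about 5 * sqrt(n) nodes in total, arranged on a square grid
grid_by_size <- round(sqrt(5 * sqrt(n_train)))

# Method 2: just enough nodes on a square grid to give each class one cell
grid_by_classes <- ceiling(sqrt(n_classes))

# Total number of cells under each heuristic
c(by_size = grid_by_size^2, by_classes = grid_by_classes^2)
```

For these inputs the first method gives a 22 x 22 grid (484 cells) and the second a 6 x 6 grid (36 cells), which is consistent with the first producing the larger map.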

A summary of the results for the categorization of black addresses into ransomware families follows.  For the full table of predictions and statistics, see the Appendix.

Table 13 shows the overall results of the final categorization.

```{r cm_overall, echo=FALSE}

# Overall section of the confusion matrix formatted through kable()
cm_labels$overall %>%
  knitr::kable(caption="Overall categorization results",
               col.names = c("score") )

```

Table 14 shows the final results by class.  It appears that many of the families with lower membership were not predicted at all.  In fact, all the addresses classified as *black* by the Random Forest method have been grouped into only 7 families, a quarter of the actual 28.  The relatively high accuracy rate would suggest that the larger families were predicted correctly, and that the smaller families were lumped in with the most similar of the larger families.  This could be an area for further refinement of the second SOM algorithm.

```{r soms-output-by-class, echo=FALSE, size="tiny"}

# By Class section of the confusion matrix formatted through kable()
cm_labels$byClass %>%
  knitr::kable(caption="Categorization results by class")

```

### Map Visualizations and Clusterings

Toroidal neural node maps are used to generate the models, and can be visualized in a number of ways.  The toroidal nature means that the top and bottom edges can be matched together, and the same with the left and right edges, forming a toroid, or donut shape.
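
The effect of joining the edges can be illustrated with a wrap-around distance. This is a simplified sketch on a square grid of integer coordinates; it ignores the offsets of a hexagonal layout and is not how `kohonen` computes unit distances internally. The helper `wrap_dist()` is a hypothetical name introduced here.

```r
# Wrap-around distance along one axis of a grid with `dim` cells
wrap_dist <- function(a, b, dim) {
  d <- abs(a - b)
  pmin(d, dim - d)
}

# Hypothetical 10 x 10 grid; cells (1, 1) and (10, 10) sit in opposite corners
flat_dist  <- sqrt((10 - 1)^2 + (10 - 1)^2)
torus_dist <- sqrt(wrap_dist(1, 10, 10)^2 + wrap_dist(1, 10, 10)^2)

c(flat = flat_dist, torus = torus_dist)
```

On the flat grid the two corner cells are as far apart as possible, whereas on the torus they are diagonal neighbours. This is why groups that appear split across opposite edges of a plotted map can in fact be contiguous.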

The Training progress plot shows how many iterations the model had to undergo before the distances on the map stabilized. The Mapping plot is a visual representation of the individual observations and where they lie in the two-dimensional grid generated by the model. The Quality plot shows the mean distance between the addresses mapped to each cell and that cell's codebook vector, so smaller values indicate a better fit.  The Counts plot shows the number of observations mapped to each cell of the grid.

```{r categorical som graphs, echo=FALSE, fig.show="hold", out.width='50%'}

# SOM visualization plots

# Visualize training progress
plot(som_model2, type = 'changes', pch = 19, palette.name = topo.colors)

# Visualize neural network mapping
plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)

# Distance map
plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)

# Visualize counts
plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)


```


We can also look at heatmaps for each of the non-temporal features.  This is where the grouping and the toroidal nature of the maps starts to become apparent.  The color represents the average value for that feature in that cell.

```{r heatmaps, echo=FALSE, fig.show="hold", out.width='50%'}


# Visualize heatmap for variable 1
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,1],
     main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)

# Visualize heatmap for variable 2
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,2],
     main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)

# Visualize heatmap for variable 3
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,3],
     main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)

# Visualize heatmap for variable 4
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,4],
     main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)

# Visualize heatmap for variable 5
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,5],
     main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)

# Visualize heatmap for variable 6
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,6],
     main=colnames(train_num)[6], pch = 19, palette.name = topo.colors)


```

The code plots show how much of each feature is represented by each cell in the map.  The left plot shows the codebook vectors of the features used in the model.  The standard code plot draws these as pie-like segment plots, one per grid cell, where the radius of a wedge represents the magnitude in a particular dimension.  The right plot shows the codebook vectors for the ransomware families, which can be directly interpreted as an indication of how likely a given class is at a certain unit.  For large numbers of categories (such as with the ransomware families), the default behavior is to make a line plot instead of a segment plot, which leads to the density-like patterns seen there.  From these plots, visual patterns start to emerge, as similar addresses are grouped together according to similarities in their pie representations.

```{r fan diagrams graphs, echo=FALSE}

# Visualize fan diagram
plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors,
     main = c("Codes for training features", "Codes for ransomware families"))

```

Clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.  Ideally, it is a visual representation of the final grouping.  There are multiple algorithms for doing this.

K-means clustering is said to be better for smaller maps, while Hierarchical clustering is supposed to be better for larger maps.  In this case, Hierarchical clustering does not converge on the right number of groups, while K-means requires the number of groups be specified ahead of time.  Since we already know how many ransomware families are represented by the data set, K-means clustering is used to visualize the final categorization of the data on the map.
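
The difference between the two approaches can be sketched on invented codebook vectors with base R alone. Here `toy_codes` is a hypothetical stand-in for `som_model2$codes[[1]]`, drawn around three well-separated centres, and `cutree()` is used to cut the hierarchical tree at a chosen number of groups, which is one possible way to revisit the comparison and not necessarily what was tried for this report.

```r
set.seed(42)

# Invented codebook vectors: 30 units, 2 features, 3 well-separated groups
centres   <- matrix(c(0, 0,
                      5, 5,
                      0, 5), ncol = 2, byrow = TRUE)
toy_codes <- centres[rep(1:3, each = 10), ] +
  matrix(rnorm(60, sd = 0.3), ncol = 2)

k <- 3

# K-means needs the number of groups up front
km <- kmeans(toy_codes, centers = k, nstart = 10)

# Hierarchical clustering builds a full tree, which is then cut into k groups
hc <- cutree(hclust(dist(toy_codes)), k = k)

# Cross-tabulate the two assignments
agreement <- table(kmeans = km$cluster, hclust = hc)
agreement
```

On data this clean both methods recover the same three groups, so each row of the table has a single non-zero entry. Real codebook vectors with 28 uneven families are unlikely to separate so neatly.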


```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
## K-Means Clustering to visualize the categorization of the SOM
## For a good tutorial, see:
## https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
#############################################################################

# Set number of clusters to be equal to number of known ransomware groups
n_groups <- length(unique(ransomware$label)) - 1

# Generate k-means clustering
som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)


```

```{r clustering-plots, echo=FALSE,  fig.align="center"}

# Plot K-means clustering results
plot(som_model2,
     main = 'K-Means Clustering',
     type = "property",
     property = som.cluster$cluster,
     palette.name = topo.colors)
add.cluster.boundaries(som_model2, som.cluster$cluster)

```

##  Results & Performance

### Results

   The first attempt to isolate ransomware from *white* addresses using SOMs resulted in a model with an accuracy of `r toString(cm_bw.validation$overall["Accuracy"])` and precision `r toString(cm_bw.validation$byClass[3])`.

   The second attempt to isolate ransomware from *white* addresses using Random Forest resulted in a model with an accuracy of `r toString(cm_ransomware$overall["Accuracy"])` and precision `r toString(cm_ransomware$byClass[3])`.

   Classifying the ransomware predicted by the second attempt into 28 ransomware families using SOMs resulted in a model with an overall accuracy of `r toString(cm_labels$overall["Accuracy"])` and minimum nonzero precision of `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`.
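
The "minimum nonzero precision" expression above relies on `which()` to drop both zero and `NA` entries before taking the minimum. A minimal sketch on an invented precision vector (the family names and values are placeholders, not results) shows why the filter is needed:

```r
# Invented per-class precision values; NA arises when a class is never predicted
toy_precision <- c(familyA = 0.95, familyB = 0, familyC = NA, familyD = 0.61)

# A plain min() is dominated by the NA (or by the zero if NAs are removed)
min(toy_precision)                 # NA
min(toy_precision, na.rm = TRUE)   # 0

# which() keeps only positions where the comparison is TRUE, dropping NA and 0
min_nonzero <- min(toy_precision[which(toy_precision > 0)])
min_nonzero
```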


### Performance

  The script runs on the aforementioned hardware in 235 seconds and uses less than 4GB of RAM.  Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, even using moderate computing resources.  Just for comparison, the final script was also run on lower powered machines with the following specifications:

#### ASUS Eee PC 1025C

   - CPU:  Intel Atom N2600 @ 1.6GHz (64-bit Intel Atom quad-core x86)
   - RAM:  3911MB DDR3 @ 800 MT/s  (4 GB)

   This is a computer known for being slow and clunky.  Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds.  At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, although it does show that the script can be run on very modest hardware to completion.

#### Pine64 Quartz64 Model A

  - CPU:  Rockchip RK3566 SoC `aarch64` @1.8GHz (64-bit quad-core ARM)
  - RAM:  DDR4 8080MB (8 GB)

  This is a single board computer / development board, which runs the same software as the others (ported to `aarch64`), except for RStudio.  It is of personal interest to benchmark a modern 64-bit ARM processor in addition to the two Intel CPUs.  The script runs in about 860 seconds on this platform, roughly half the time taken by the Atom processor above.  This is still not fast enough to analyze each block in real time, but it is a significant improvement given the low power usage of such processors.
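
The real-time argument comes down to comparing each run time with the average block interval. A short sketch of that arithmetic, using the run times reported in this section (the machine names in the vector are shorthand introduced here):

```r
block_interval <- 600  # seconds; one block every ten minutes on average

# Total script run times in seconds, as reported in this section
run_time <- c(main = 235, atom = 1665, quartz64 = 860)

# Fraction of a block interval consumed by one full run
ratio <- run_time / block_interval

# TRUE where a full run finishes before the next block is expected
keeps_up <- run_time < block_interval

round(ratio, 2)
keeps_up
```

Only the main machine finishes within one block interval. The Atom needs close to three intervals per run and the Quartz64 about one and a half.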


## Summary

### Comparison to results from original paper

   In the original paper by Akcora et al., several different sets of parameters were tested on their TDA model.  According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive.** In turn, this number is 27.44 for the best non-TDA models."$^{[3]}$  In fact, the **highest** precision [a.k.a. Positive Predictive Value, defined as TP/(TP+FP), where TP = the number of true positives, and FP = the number of false positives] they achieved was only 0.1610.  By comparison, although several of our predicted classes had zero or NA precision values due to low family membership in some cases, the **lowest** non-zero precision value is `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`, with many well above that, equaling one in a few cases.
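
The quoted false-positive ratios can be put on the same scale as precision. Dividing the numerator and denominator of TP/(TP+FP) by TP gives 1/(1 + FP/TP), so a count of false positives per true positive converts directly. The helper name `precision_from_ratio()` is introduced here for illustration only.

```r
# Precision implied by a given number of false positives per true positive:
# TP / (TP + FP) = 1 / (1 + FP / TP)
precision_from_ratio <- function(fp_per_tp) 1 / (1 + fp_per_tp)

p_tda     <- precision_from_ratio(16.59)  # best TDA models, as quoted
p_non_tda <- precision_from_ratio(27.44)  # best non-TDA models, as quoted

round(c(tda = p_tda, non_tda = p_non_tda), 3)
```

These work out to roughly 0.057 and 0.035, both below the 0.1610 best case quoted above.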

  One might say that we are comparing apples to oranges by benchmarking a single-method model against a two-method stack.  The two-model approach is justified and seems superior in this case, especially when measured in terms of total run time and having the benefit of avoiding false positives to a great degree.


### Limitations

  SOMs have many different parameters that seem easy to misconfigure, and usually require significantly more computing resources than less sophisticated algorithms.  Perhaps a dual Random Forest approach would be better, if the loss of accuracy or precision was worth the time gain.

### Future Work

  We only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation. For example, the grid size used to train the SOM was calculated using an algorithm based on the size of the training set, and while this performed better than a grid size based on the number of categories, it may not be ideal.  Optimization around grid size could still be carried out.  Hexagonal grids with toroidal topology were the only type used.  Other types, such as square grids and non-toroidal topology are also possible, and may also be worth investigating.

 A dual Random Forest approach, used first to isolate the ransomware addresses and then to classify them, might be quick enough to run in under ten minutes on all the hardware listed.  Conversely, a dual SOM method could be created for maximum precision if the necessary computing resources were available.

  The script itself has a few areas that could be further optimized.  The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized further. The Random Forest algorithm could be trained on more than just two features in an attempt to reduce the number of false positives. The second SOM algorithm could be optimized to correctly predict more of the low-membership families.

  Hierarchical clustering was attempted in addition to K-means clustering. The correct number of families was difficult to achieve, whereas it is a direct input of the K-means method. Another look at the clustering techniques might yield different results.  Other clustering techniques exist, such as "Hierarchical K-Means"$^{[13]}$, which could be explored for even more clustering visualizations.

### Conclusion

   This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives to a high degree by filtering out *white* addresses using a binary method before classifying the remaining addresses further.  It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy-oriented coins.  Certain cryptocurrency networks, such as Monero, utilize privacy features that obfuscate transactions from being analyzed in the same way that the Bitcoin network has been analyzed here.  Some progress has been made towards analyzing these networks$^{[9]}$. At the same time, the developers of such networks continually evolve the code to complicate transaction tracking.  This could be another promising area for future research.

## References

[1] Adam Brian Turner, Stephen McCombie and Allon J. Uhlmann (November 30, 2020) [Analysis Techniques for Illicit Bitcoin Transactions](https://doi.org/10.3389/fcomp.2020.600596)

[2] Daniel Goldsmith, Kim Grauer and Yonah Shmalo (April 16, 2020) [Analyzing hack subnetworks in the
bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)

[3] Cuneyt Gurcan Akcora, Yitao Li, Yulia R. Gel, Murat Kantarcioglu (June 19, 2019) [BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain](https://arxiv.org/abs/1906.07852)

[4] UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php

[5] BitcoinHeist Ransomware Address Dataset
https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset

[6] Available Models - The `caret` package http://topepo.github.io/caret/available-models.html

[7] Ron Wehrens and Johannes Kruisselbrink, Package ‘`kohonen`’ @ CRAN (2019) https://cran.r-project.org/web/packages/kohonen/kohonen.pdf

[8] How many nodes for self-organizing maps? (Oct 22, 2021) https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps

[9] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)

[10] KR Tejeda, Detecting Bitcoin Ransomware, https://git.disroot.org/shelldweller/ransomware

[11] Wehrens R, Buydens LMC (2007). “Self- and Super-Organizing Maps in R: The kohonen Package.” _Journal of
Statistical Software_, *21*(5), 1-19. doi: 10.18637/jss.v021.i05 (URL:
https://doi.org/10.18637/jss.v021.i05).

[and] Wehrens R, Kruisselbrink J (2018). “Flexible Self-Organizing Maps in kohonen 3.0.” _Journal of Statistical
Software_, *87*(7), 1-18. doi: 10.18637/jss.v087.i07 (URL: https://doi.org/10.18637/jss.v087.i07).

[12] Difference between K means and Hierarchical Clustering (Jul 07, 2021) https://www.geeksforgeeks.org/difference-between-k-means-and-hierarchical-clustering/

[13] Hierarchical K-Means Clustering: Optimize Clusters (Oct 15 2021) https://www.datanovia.com/en/lessons/hierarchical-k-means-clustering-optimize-clusters/

\newpage

## Appendix:

### Categorical SOM prediction table and confusion matrix

Here are the full prediction results for the categorization of *black* addresses into ransomware families.  It is assumed that all *white* addresses have already been removed.

```{r soms-output-table, echo=FALSE}

# Final results: categorization of "black" addresses into ransomware families.
cm_labels
```