
---
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain using Random Forests and Self Organizing Maps
subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project \vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "10/31/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. Many attempts toward this goal have not made use of sophisticated machine learning methods, and even those that have often produce models with poor specificity or other performance issues. A two-step method is developed here to address the issue of false positives and improve on previous results."
keywords:
- Bitcoin
- blockchain
- ransomware
- machine learning
- Random Forests
- Self Organizing Maps
- SOMs
- cryptocurrency
output: pdf_document
geometry: margin=2cm
---
\def\bitcoinA{%
\leavevmode
\vtop{\offinterlineskip %\bfseries
\setbox0=\hbox{B}%
\setbox2=\hbox to\wd0{\hfil\hskip-.03em
\vrule height .3ex width .15ex\hskip .08em
\vrule height .3ex width .15ex\hfil}
\vbox{\copy2\box0}\box2}}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
x <- def.chunk.hook(x, options)
ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})
```
\newpage
&nbsp;
\vspace{25pt}
\tableofcontents
\newpage
## Introduction
Ransomware attacks have gained the attention of security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) come to find that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address before a certain deadline to have their data decrypted, otherwise the data will be deleted.
The legal and financial implications of ransomware attacks are not of concern for the purpose of this analysis. Many parties are interested in tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ A ransomware attack could be perpetrated on an illegal darknet market site, for example. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as "white". The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Addresses on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference. Speed is defined as the number of blocks the coin appears in during a 24-hour period and provides information on how quickly a coin moves through the network. Speed can be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a 24 hour period, and thus have lower speeds when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
With the graph defined as such, the following six numerical features$^{[2]}$ are associated with a given address:
1) *Income* - the total amount of coins sent to an address (decimal value with 8 decimal places)
2) *Neighbors* - the number of transactions that have this address as one of its output addresses (integer)
3) *Weight* - the sum of the fractions of coins that reach this address from addresses that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions" (decimal value)
4) *Length* - the number of non-starter transactions on its longest chain, where a chain is defined as an
acyclic directed path originating from any starter transaction and ending at the address in question (integer)
5) *Count* - The number of starter addresses connected to this address through a chain (integer)
6) *Loop* - The number of starter addresses connected to this address by more than one path (integer)
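To make these definitions concrete, a single observation can be pictured as one row of six numbers plus its categorical label. The values below are entirely invented for illustration and are not drawn from the data set:

```r
# A hypothetical example observation: the six numerical features for a
# single address, plus its label. Values are illustrative only.
example_address <- data.frame(
  income    = 2.5e8,   # total satoshi received (2.5 BTC)
  neighbors = 2L,      # transactions listing this address as an output
  weight    = 0.25,    # summed coin fractions from starter transactions
  length    = 12L,     # non-starter transactions on the longest chain
  count     = 3L,      # starter addresses connected through any chain
  looped    = 1L,      # starter addresses connected by more than one path
  label     = "white"  # no known ransomware association
)
example_address
```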
These variables are defined rather abstractly, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions. Several machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the results will be compared.
### Data
This data set was discovered while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
```{r install-load-libraries&download-data, echo=FALSE, include=FALSE}
# Set the repository
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
rm(r)
# Install necessary packages if not already present
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(randomForest)) install.packages("randomForest")
if(!require(kohonen)) install.packages("kohonen")
if(!require(parallel)) install.packages("parallel")
if(!require(matrixStats)) install.packages("matrixStats")
# Load Libraries
library(tidyverse)
library(caret)
library(randomForest)
library(kohonen)
library(parallel)
library(matrixStats)
# Download data
url <-
"https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if(!dir.exists("data"))dir.create("data")
if(!file.exists(dest_file))download.file(url, destfile = dest_file)
# Unzip as CSV
if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file,
"BitcoinHeistData.csv",
exdir="data")
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
```
A summary of the data set gives the range of values and the size of the sample.
```{r data-summary, echo=FALSE, size="tiny"}
# Summary
ransomware %>% summary() %>% knitr::kable()
```
A listing of the first ten rows provides a sample of the features associated with each observation.
```{r data-head, echo=FALSE, size="tiny"}
# Inspect data
ransomware %>% head() %>% knitr::kable()
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1-365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity) or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$.
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day, while the entire network has up to 800,000 addresses daily.$^{[5]}$
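Note that *income* is denominated in satoshi (1 BTC = 10^8 satoshi, as used later when filtering low-income wallets), so the \bitcoinA 0.3 edge filter corresponds to a simple threshold on the income scale. The variable names below are my own:

```r
# Income in the data set is recorded in satoshi: 1 BTC = 100,000,000 satoshi.
# The 0.3 BTC edge filter from the original paper therefore corresponds
# to the following threshold on the income scale used here.
satoshi_per_btc   <- 1e8
threshold_btc     <- 0.3
threshold_satoshi <- threshold_btc * satoshi_per_btc
threshold_satoshi
```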
### Goal
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper to produce an acceptable predictive model for categorizing ransomware addresses correctly. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
### Outline of Steps Taken
1) Analyze data set numerically and visually. Notice any pattern, look for insights.
2) Binary classification using Random Forests.
3) Binary classification using Self Organizing Maps.
4) Categorical classification using Self Organizing Maps.
5) Two step method using Random Forests and Self Organizing Maps.
6) Visualize clustering to analyze results further.
7) Generate Confusion Matrix to quantify results.
---
## Data Analysis
### Hardware Specification
All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specs:
- CPU: Intel i7-4600U @ 3.300GHz (4th Gen quad-core i7)
- RAM: 8217MB DDR3L @ 1600 MHz (8 GB)
- OS: Slackware64-current (15.0 RC1) `x86_64-slackware-linux-gnu` (64-bit GNU/Linux)
- R version 4.0.0 (2020-04-24) -- "Arbor Day" (built from source using scripts from [slackbuilds.org](https://slackbuilds.org/))
- RStudio Version 1.4.1106 "Tiger Daylily" (2389bc24, 2021-02-11) for CentOS 8 (converted using `rpm2tgz`)
### Data Preparation
It is immediately apparent that this is a rather large data set. The usual practice of partitioning out eighty to ninety percent of the data for a training set results in a training set that is too large to process on the hardware available. Instead, the original data set was first split in half, with 50% reserved as a "validation set" and the other 50% used as the "working set". The working set was again split in half to give a "training set" of manageable size. At that point the partitions were small enough to work with, so they were not refined further; this is a potential area for later optimization. Stratified sampling on the binary ransomware label was used to ensure that ransomware addresses were represented in each partition.
```{r data-prep, echo=FALSE, include=FALSE}
# Turn labels into factors, "bw" is binary factor for ransomware/non-ransomware
ransomware <- ransomware %>%
mutate(label=as.factor(label),
bw=as.factor(ifelse(label=="white", "white", "black")))
# Validation set made from 50% of BitcoinHeist data, for RAM considerations
test_index <- createDataPartition(y = ransomware$bw,
times = 1, p = .5, list = FALSE)
workset <- ransomware[-test_index,]
validation <- ransomware[test_index,]
# Split the working set into a training set and a test set @ 50%, RAM dictated
test_index <- createDataPartition(y = workset$bw,
times = 1, p = .5, list = FALSE)
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
# Find proportion of full data set that is ransomware
ransomprop <- mean(ransomware$bw=="black")
# Check for NAs
no_nas <- sum(is.na(ransomware))
```
---
### Exploration and Visualization
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge, since the target observations are sparse within the data set, especially considering that they are further divided into 28 families. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
```{r cv-calcs, echo=FALSE}
# Keep only numeric columns, ignoring temporal features
ransomware_num <- ransomware %>%
select(length, weight, count, looped, neighbors, income)
# Check for variation across numerical columns using coefficients of variation
#
# Calculate standard deviations for each column
sds <- ransomware_num %>% as.matrix() %>% colSds()
# Calculate means for each column
means <- ransomware_num %>% as.matrix() %>% colMeans()
# Calculate CVs for each column
coeff_vars <- sds / means
# Select the two features with the highest coefficients of variation
selected_features <- names(sort(coeff_vars, decreasing=TRUE))[1:2]
#Sample every 100th row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]
# Keep only numeric columns with highest coefficients of variation
train_num <- train_samp %>% select(selected_features[1], selected_features[2])
# Binary labels, black = ransomware, white = non-ransomware, train set
train_bw <- train_samp$bw
# Sample every 100th row of the test set due to memory constraints
test_samp <- test_set[seq(1, nrow(test_set), 100), ]
# Dimension reduction again, selecting features with highest CVs
test_num <- test_samp %>% select(selected_features[1], selected_features[2])
# Binary labels for test set
test_bw <- test_samp$bw
```
```{r data-sparsness, echo=FALSE}
message("The proportion of ransomware addresses in the original data set is ", ransomprop, ".")
message("The total number of NA or missing values in the original data set is ", no_nas, ".")
labels <- ransomware$label %>% summary()
knitr::kable(
list(labels[1:15], labels[16:29]),
caption = 'Ransomware group labels and frequency counts for full data set',
booktabs = TRUE
)
```
Let's take a look at the distribution of the different features. Note how skewed the non-temporal features are, some of them being bimodal:
```{r histograms, echo=FALSE}
# Histograms of each of the columns to show skewness
# Plot histograms for each column using facet wrap
train_long <- train_num %>% # Apply pivot_longer function
pivot_longer(colnames(train_num)) %>%
as.data.frame()
# Histograms per column
ggp1 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
facet_wrap(~ name, scales = "free")
ggp1
# Log scale on value axis, does not make sense for temporal variables
ggp2 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
ggp2
```
Now we can compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
```{r cv-results, echo=FALSE}
message("The features with the highest coefficients of variation are ",
        selected_features[1], " and ", selected_features[2],
        ", which will be used to train the binary model.")
# Summarize results in a table and a plot
knitr::kable(coeff_vars)
plot(coeff_vars)
```
From this, it appears that *income* has the widest range of variability, followed by *neighbors*. These are also the features that are most strongly skewed to the right, meaning that a few addresses have really high values for each of these features while the bulk of the data set has very low values for these numbers.
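Since the coefficient of variation is simply the standard deviation divided by the mean, a small sketch with invented vectors shows why a strongly right-skewed feature dominates this measure:

```r
# Coefficient of variation: CV = sd(x) / mean(x).
# A right-skewed feature, where a few observations are enormous relative
# to the bulk, produces a much larger CV than a tightly clustered one.
# These vectors are illustrative and not drawn from the data set.
cv <- function(x) sd(x) / mean(x)

tight  <- c(10, 11, 9, 10, 10, 12, 9)   # low relative spread
skewed <- c(1, 1, 2, 1, 1, 1, 500)      # one extreme value

cv(tight)   # small
cv(skewed)  # large
```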
Further exploration could examine how the features are distributed within each ransomware family, whether ransomware is more prevalent on particular days of the week, and whether binning the numerical features and comparing the proportion of ransomware per bin reveals trends or correlations between groups and variables. As a first step in that direction, consider the addresses with very small incomes.
```{r shrimp-percentage, echo=FALSE, include=FALSE}
# Keep only wallets that received less than one full bitcoin (10^8 satoshi)
shrimp <- train_samp %>% filter(income < 10^8)
```
The proportion of these low-income addresses that are associated with ransomware is:
```{r shrimp-output, echo=FALSE}
# Proportion of sub-1-BTC wallets that are associated with ransomware
mean(shrimp$bw == "black")
```
---
### Insights Gained from Exploration
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are impractical otherwise.
---
## Modeling Approach
Akcora et al. mention that they tried to model the data using a Random Forests method, but that the complexity of the data set led to problems with that approach.$^{[3]}$ Switching to a binary perspective on the problem might alleviate some of that complexity, and is worth another look. The topological nature of the way the data set has been described numerically led me to search for topological machine learning methods. Searching for *topo* in the documentation for the `caret` package$^{[6]}$ resulted in the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN$^{[7]}$ was intriguing enough for me to investigate further.
The first experiments applied categorical SOMs to the full data set, attempting to classify addresses into every ransomware family at once. That approach was then simplified to a binary SOM, and finally a Random Forest was applied to the binary problem, with surprisingly strong results. This suggested re-applying the categorical SOM only to those addresses predicted to be "black" by the binary Random Forest. The result is the following two-step approach, with optional clustering visualizations at the end.
### Method Part 1: Binary Random Forests to isolate ransomware addresses first.
```{r random-forest-prep, echo=FALSE, include=FALSE, warning=FALSE}
# Cross Validation, ten fold
control <- trainControl(method="cv", number = 10)
# Control grid with variation on mtry
grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))
# Run Cross Validation using control and grid set above
rf_model <- train(train_num, train_bw, method="rf",
trControl = control, tuneGrid=grid)
# Fit the final model on the selected features,
# using the cross-validated value of mtry
fit_rf <- randomForest(train_num, train_bw,
                       mtry = rf_model$bestTune$mtry)
# Measure accuracy of model against test sample
y_hat_rf <- predict(fit_rf, test_samp)
cm_test <- confusionMatrix(y_hat_rf, test_bw)
# Measure accuracy of model against full ransomware set
ransomware_y_hat_rf <- predict(fit_rf, ransomware)
cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
```
Here are the results for the test set.
```{r random-forest-output_test, echo=FALSE}
cm_test %>% as.matrix() %>% knitr::kable()
cm_test$overall %>% knitr::kable()
cm_test$byClass %>% knitr::kable()
```
Here are the results for the full original set.
```{r random-forest-output_big, echo=FALSE}
cm_ransomware %>% as.matrix() %>% knitr::kable()
cm_ransomware$overall %>% knitr::kable()
cm_ransomware$byClass %>% knitr::kable()
```
### Method Part 2: Categorical SOMs to categorize predicted ransomware addresses.
Now we train a new model after throwing away all "white" addresses.
```{r soms-prep, echo=FALSE, include=FALSE}
##############################################################################
## Now we use the Random Forest model to exclude the "white" addresses from
## the full ransomware set, to categorize the "black" addresses into families.
##############################################################################
# Now use this prediction to reduce the original set to only "black" addresses
# First append the full set of predictions to the original set.
ransomware$prediction <- ransomware_y_hat_rf
# Filter out all the predicted "white" addresses,
# leaving only predicted "black" addresses
black_addresses <- ransomware %>% filter(prediction=="black")
# Split the reduced black-predictions into a training set and a test set @ 50%
test_index <- createDataPartition(y = black_addresses$prediction,
times = 1, p = .5, list = FALSE)
train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]
# Keep only numeric columns, ignoring temporal variables.
train_num <- train_set %>%
select(length, weight, count, looped, neighbors, income)
# SOM function can only work on matrices.
train_mat <- as.matrix(scale(train_num))
# Select non-temporal numerical features only
test_num <- test_set %>%
select(length, weight, count, looped, neighbors, income)
# Testing data is scaled according to how we scaled our training data.
test_mat <- as.matrix(scale(test_num,
center = attr(train_mat, "scaled:center"),
scale = attr(train_mat, "scaled:scale")))
# Categorical labels for training set
train_label <- train_set$label %>% classvec2classmat()
# Same for test set
test_label <- test_set$label %>% classvec2classmat()
# Create data list for supervised SOM
train_list <- list(independent = train_mat, dependent = train_label)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
# Formulaic method 1, makes a larger graph in this case
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2, smaller graph with less cells
#grid_size = ceiling(sqrt(length(unique(ransomware$label))))
# Create SOM grid
train_grid <- somgrid(xdim=grid_size, ydim=grid_size,
topo="hexagonal", toroidal = TRUE)
## Now build the SOM model using the supervised method xyf()
som_model2 <- xyf(train_mat, train_label,
grid = train_grid,
rlen = 100,
mode="pbatch",
cores = detectCores(), # Use all cores
# cores = detectCores() - 1, # Leave one core for system
keep.data = TRUE
)
# Now test predictions of test set, create data list for test set
test_list <- list(independent = test_mat, dependent = test_label)
# Generate predictions
ransomware_group.prediction <- predict(som_model2, newdata = test_list)
# Confusion Matrix
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
test_set$label)
```
The grid size was selected using a commonly cited heuristic that sets the total number of map nodes to approximately $5\sqrt{N}$, where $N$ is the number of training samples; the side length of a square grid is then the square root of that node count. An alternative heuristic, basing the side length on the number of target categories, yields a much smaller grid that was judged too coarse to separate the ransomware families.
```{r grid-size, echo=FALSE}
message("A grid size of ", grid_size, " has been chosen.")
```
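The two heuristics can be compared directly with a minimal sketch; the sample size below is hypothetical (in the script it is `nrow(train_set)`):

```r
# Two common heuristics for sizing a square SOM grid:
# 1) total nodes ~ 5 * sqrt(N) for N training samples, so the side
#    length of a square grid is sqrt(5 * sqrt(N));
# 2) side length ~ ceiling(sqrt(number of target categories)).
# n_samples below is hypothetical, for illustration only.
n_samples <- 12000
n_classes <- 28   # ransomware families in this data set

side_method1 <- round(sqrt(5 * sqrt(n_samples)))  # larger grid
side_method2 <- ceiling(sqrt(n_classes))          # smaller grid

c(method1 = side_method1, method2 = side_method2)  # 23 and 6
```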
Here is a summary of the results for the categorization of black addresses into ransomware families. For the full table of predictions and statistics, see the Appendix.
```{r soms-output, echo=FALSE, size="tiny"}
cm_labels$overall %>% knitr::kable()
cm_labels$byClass %>% knitr::kable()
```
### Clustering Visualizations: K-means clustering
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
## K-Means Clustering to visualize the categorization of the SOM
## For a good tutorial, visit:
## https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
#############################################################################
# Set number of clusters to be equal to number of known ransomware groups
n_groups <- length(unique(ransomware$label)) - 1
# Generate k-means clustering
som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
```
The resulting clusters and their boundaries are plotted on the SOM grid below.
```{r clustering-plot, echo=FALSE}
# Plot clustering results
plot(som_model2,
main = 'K-Means Clustering',
type = "property",
property = som.cluster$cluster,
palette.name = topo.colors)
add.cluster.boundaries(som_model2, som.cluster$cluster)
```
---
## Results & Performance
### Results
In the original paper, Akcora et al. tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."$^{[3]}$ The highest Precision (also known as Positive Predictive Value, defined as TP/(TP+FP)) that they achieved was 0.1610. By comparison, the two-step method developed here achieves a Precision of 1.000 on the test set. A result that strong should be scrutinized for possible overfitting or data leakage before being taken at face value, but even allowing for some optimism it represents a substantial improvement.
### Performance
The script runs on the aforementioned hardware in less than five minutes and uses less than 4 GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, using even moderate computing resources.
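The feasibility claim reduces to simple arithmetic on figures already stated above:

```r
# One Bitcoin block arrives roughly every 10 minutes on average, while
# the full analysis script completes in under 5 minutes on the test
# laptop, so per-block, real-time analysis is feasible in principle.
block_interval_min <- 10
script_runtime_min <- 5   # upper bound observed on the test hardware
blocks_per_day     <- 24 * 60 / block_interval_min

script_runtime_min < block_interval_min  # TRUE: analysis keeps up
blocks_per_day                           # 144 blocks per day
```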
## Summary
### Comparison to original paper and impact of findings
The two-step method compares very favorably to the results reported in the original paper. By filtering out predicted "white" addresses before attempting family classification, the false-positive problem that limited the TDA models is largely eliminated, and precision improves dramatically.
### Limitations
SOMs appear easy to misconfigure, and the configuration used here has not been exhaustively validated. Perhaps a dual Random Forest approach would perform better.
### Future Work
This analysis only scratched the surface of the SOM algorithm, which has many implementations and parameters that could be investigated further and possibly optimized via cross-validation.
### Conclusions
#### Get Monero!
This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while avoiding false positives by filtering them out with a binary model before classifying further. It leaves the author wondering how long it will be before ransomware operators move to privacy coins such as Monero; the traceability of the Monero blockchain has been analyzed empirically by Möser et al.$^{[XMR]}$
## References
[1] Adam Brian Turner, Stephen McCombie and Allon J. Uhlmann (November 30, 2020) [Analysis Techniques for Illicit Bitcoin Transactions](https://doi.org/10.3389/fcomp.2020.600596)
[2] Daniel Goldsmith, Kim Grauer and Yonah Shmalo (April 16, 2020) [Analyzing hack subnetworks in the bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[3] Cuneyt Gurcan Akcora, Yitao Li, Yulia R. Gel, Murat Kantarcioglu (June 19, 2019) [BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain](https://arxiv.org/abs/1906.07852)
[4] UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php](https://archive.ics.uci.edu/ml/index.php)
[5] BitcoinHeist Ransomware Address Dataset [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
[6] Available Models - The `caret` package [http://topepo.github.io/caret/available-models.html](http://topepo.github.io/caret/available-models.html)
[7] Ron Wehrens and Johannes Kruisselbrink, Package `kohonen` @ CRAN (2019) [https://cran.r-project.org/web/packages/kohonen/kohonen.pdf](https://cran.r-project.org/web/packages/kohonen/kohonen.pdf)
[XMR] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava, Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
\newpage
## Appendix:
### Categorical SOM ransomware family prediction table and confusion matrix (detailed)
```{r soms-output-table, echo=FALSE}
cm_labels
```