---
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain using Random Forest and Self Organizing Maps
subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
          \vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "11/11/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
keywords: 
- Bitcoin
- blockchain
- ransomware
- machine learning
- Random Forest
- Self Organizing Maps
- SOMs
- cryptocurrency
output: pdf_document
header-includes:
- \usepackage{booktabs}
geometry: margin=2cm
---
\def\bitcoinA{%
  \leavevmode
  \vtop{\offinterlineskip %\bfseries
    \setbox0=\hbox{B}%
    \setbox2=\hbox to\wd0{\hfil\hskip-.03em
    \vrule height .3ex width .15ex\hskip .08em
    \vrule height .3ex width .15ex\hfil}
    \vbox{\copy2\box0}\box2}}
    
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
def.chunk.hook  <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x,
                                              "\n\n \\normalsize"), x)
})


```

\newpage
&nbsp;
\vspace{25pt}
\tableofcontents

\newpage

## Introduction

  Ransomware attacks are of interest to security professionals, law enforcement, and financial regulatory officials.$^{[1]}$  The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location.  The victims (usually hospitals or other large organizations) come to learn that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address by a certain deadline to have the data decrypted or else it will be deleted automatically.  
  
  The deeper legal and financial implications of ransomware attacks are inconsequential to the work in this report, as we are merely interested in being able to classify Bitcoin addresses by their connection to ransomware transactions. Many researchers are already tracking illicit activity (such as ransomware payments) on the Bitcoin blockchain, with the aim of flagging it as early as possible to minimize financial losses.  Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$  For example, consider a ransomware attack against an illegal darknet market site. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. Such a flag may, in fact, be the first public notice of the event. Any suspicious addresses could then be blacklisted or banned from using other services, if that is so desired.
  
  Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results.  In that paper, Akcora, et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as *white*.  The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*.  Edges are formed between the nodes when a transaction can be associated with a particular address.  
  
  Any given address on the Bitcoin network may appear many times, with different inputs and outputs each time.  The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference.  This way, variables can be defined in a specific and meaningful way.  For example, *speed* can be defined as the number of blocks the coin appears in during a 24-hour period, and provides information on how quickly a coin moves through the network. *Speed* may be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a given 24 hour period, and thus have lower speeds when compared to "mixed" coins.  The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.

 With the graph specified as such, the following six numerical features$^{[2]}$ are associated with a given address:
   
   1)  *Income* - the total amount of coins sent to an address
   
   2)  *Neighbors* - the number of transactions that have this address as one of their output addresses
   
   3)  *Weight* - the sum of the fractions of coins that reach this address from addresses that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"
   
   4)  *Length* - the number of non-starter transactions on its longest chain, where a chain is defined as an acyclic directed path originating from any starter transaction and ending at the address in question
   
   5)  *Count* - the number of starter addresses connected to this address through a chain
   
   6)  *Looped* - the number of starter addresses connected to this address by more than one path
   
These variables are defined rather conceptually, viewing the blockchain as a topological graph with nodes and edges.  The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen.  We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that.  Machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the new results will be compared to the original ones.
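
As a minimal sketch of how the two simplest features could be computed, the following base R example derives *income* and *neighbors* from a toy table of transaction outputs.  The table `toy_outputs`, its column names, and all of its values are invented for illustration; this is not the procedure used by Akcora, et al., and the remaining four features require the full chain structure, which is not reproduced here.

```r
# Toy transaction outputs: each row is one output of a transaction.
# All identifiers and amounts are invented for illustration.
toy_outputs <- data.frame(
  tx      = c("t1", "t1", "t2", "t3", "t3"),
  address = c("a1", "a2", "a1", "a2", "a3"),
  amount  = c(0.5, 1.5, 2.0, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Income: total amount of coins sent to each address
income <- tapply(toy_outputs$amount, toy_outputs$address, sum)

# Neighbors: number of distinct transactions that pay each address
neighbors <- tapply(toy_outputs$tx, toy_outputs$address,
                    function(tx) length(unique(tx)))

# Both results are ordered by address, so they can be combined directly
toy_features <- data.frame(address   = names(income),
                           income    = as.numeric(income),
                           neighbors = as.integer(neighbors))
print(toy_features)
```

In this toy table, address `a1` receives coins from two transactions, so its income is the sum of both amounts and its neighbor count is two.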

### Data

   This data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$  as suggested in the project instructions.  The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term.  This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).  The data set was downloaded and the exploration began.
  
```{r install_load_libraries-and-download_data, echo=FALSE, include=FALSE}

# Set the repository mirror to “0-Cloud” for maximum availability
r = getOption("repos") 
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)
rm(r)

# Install necessary packages if not already present
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(randomForest)) install.packages("randomForest")
if(!require(kohonen)) install.packages("kohonen")
if(!require(parallel)) install.packages("parallel")
if(!require(matrixStats)) install.packages("matrixStats")
if(!require(xtable)) install.packages("xtable")

# Load Libraries
library(tidyverse)
library(caret)
library(randomForest)
library(kohonen)
library(parallel)
library(matrixStats)
library(xtable)

# Set # of cores, use detectCores() - 1 to leave one for the system
n_cores <- detectCores()

# Download data
url <- 
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if(!dir.exists("data"))dir.create("data")
if(!file.exists(dest_file))download.file(url, destfile = dest_file)

# Unzip as CSV
if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file, 
                                                   "BitcoinHeistData.csv", 
                                                   exdir="data")

# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")

```

A summary of the data set shows the range of values and size of the sample.

```{r data_summary, echo=FALSE, size="tiny"}

# Summary
ransomware %>% summary() %>% knitr::kable(caption="Summary of data set")

```

A listing of the first ten rows provides a sample of the features associated with each observation.

```{r data_head, echo=FALSE, size="tiny"}

# Inspect data
ransomware %>% head(10) %>% knitr::kable(caption="First ten entries of data set")

```

This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain.  The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, looped*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (meaning not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$. 

The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24-hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. *White* Bitcoin addresses were capped at one thousand per day, whereas the entire network sees up to 800,000 addresses daily.$^{[5]}$

### Goal

  The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing an acceptable predictive model for categorizing ransomware addresses correctly.  Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.  
  
###  Outline of Steps Taken 

1. Analyze data set numerically and visually, look for insights in any patterns.
2. Binary separation using Self Organizing Maps.
3. Fast binary separation using Random Forest.
4. Categorical classification using Self Organizing Maps.
5. Visualize clustering to analyze results further.
6. Generate confusion matrix to quantify results.

---

## Data Analysis

### Hardware Specification

   All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specifications.
   
   - CPU:  Intel i7-4600U @ 3.300GHz (4th Gen dual-core i7 with four threads, x86_64)
   - RAM:  8217MB DDR3L @ 1600 MHz  (8 GB)
   - OS:   Slackware64-current (15.0 RC1) `x86_64-slackware-linux-gnu` (64-bit GNU/Linux)
   - R version 4.0.0 (2020-04-24) -- "Arbor Day" (built from source using scripts from [slackbuilds.org](https://slackbuilds.org/))
   - RStudio Version 1.4.1106 "Tiger Daylily" (2389bc24, 2021-02-11) for CentOS 8 (converted using `rpm2tgz`)

###  Data Preparation

  It is immediately apparent that this is a rather large data set.  The usual practice of partitioning out 80% to 90% of the data for training results in a training set that is too large to process given the hardware limitations.  For reasons that no longer apply, the original data set was first split in half, with 50% reserved as a *validation set* and the other 50% used as the *working set*.  This working set was again split in half to give a *training set* and a *test set*, each about 25% of the original data.  These partitions were small enough to work with, so the partition size ratio was not further refined.  This is a potential area for later optimization.  The splits were stratified on the binary ransomware/non-ransomware label so that ransomware addresses were represented in each partition.
  
```{r data_prep, echo=FALSE, include=FALSE}

# Turn labels into factors, "bw" is binary factor for ransomware/non-ransomware
ransomware <- ransomware %>%
  mutate(label=as.factor(label), 
         bw=as.factor(ifelse(label=="white", "white", "black")))

# Validation set made from 50% of BitcoinHeist data, for RAM considerations
test_index <- createDataPartition(y = ransomware$bw, 
                                  times = 1, p = .5, list = FALSE)

workset <- ransomware[-test_index,]
validation <- ransomware[test_index,]

# Split the working set into a training set and a test set @ 50%, RAM dictated
test_index <- createDataPartition(y = workset$bw,
                                  times = 1, p = .5, list = FALSE)

train_set <- workset[-test_index,]
test_set <- workset[test_index,]

# Find proportion of full data set that is ransomware
ransomprop <- mean(ransomware$bw=="black")

# Check for NAs
no_nas <- sum(is.na(ransomware))

```
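
The two-stage halving described above can be sketched in base R.  The report itself uses `createDataPartition` from the `caret` package; the helper `stratified_half`, the toy data frame, and the seed below are assumptions made only to keep the example self-contained.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Toy data: 1,000 addresses, 2% of them labelled "black" (values invented)
toy <- data.frame(id = 1:1000,
                  bw = factor(rep(c("black", "white"), times = c(20, 980))))

# Hypothetical helper: draw half of the row indices within each class
stratified_half <- function(classes) {
  idx_by_class <- split(seq_along(classes), classes)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), length(idx) %/% 2)]
  }), use.names = FALSE)
}

# First split: validation set versus working set
val_idx        <- stratified_half(toy$bw)
toy_validation <- toy[val_idx, ]
toy_workset    <- toy[-val_idx, ]

# Second split: test set versus training set, both taken from the working set
test_idx  <- stratified_half(toy_workset$bw)
toy_test  <- toy_workset[test_idx, ]
toy_train <- toy_workset[-test_idx, ]

# Share of "black" addresses in each partition
black_share <- sapply(
  list(validation = toy_validation, test = toy_test, train = toy_train),
  function(d) mean(d$bw == "black"))
print(black_share)
```

Because sampling is done within each class, every partition keeps the same share of *black* addresses as the toy data set, which is the property that matters when the target class is this sparse.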

### Exploration and Visualization 

By graphing the values, we can get an idea of how the data is distributed across the various features.

```{r cv_calcs, echo=FALSE}

# Keep only numeric columns, ignoring temporal features
ransomware_num <- ransomware %>% 
  select(income, neighbors, weight, length, count, looped)

# Check for variation across numerical columns using coefficients of variation
#
# Calculate standard deviations for each column
sds <- ransomware_num %>% as.matrix() %>% colSds()

# Calculate means for each column
means <- ransomware_num %>% as.matrix() %>% colMeans()

# Calculate CVs for each column
coeff_vars <- sds / means

#  Select the two features with the highest coefficients of variation
selected_features <- names(sort(coeff_vars, decreasing=TRUE))[1:2]

#Sample every 100th row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]

# Keep only numeric columns with highest coefficients of variation
train_num <- train_samp %>% select(selected_features[1], selected_features[2])

# Binary labels, black = ransomware, white = non-ransomware, train set
train_bw <- train_samp$bw 

#Sample every 100th row due to memory constraints to make test sample same size.
test_samp <- test_set[seq(1, nrow(train_set), 100), ]

# Dimension reduction again, selecting features with highest CVs
test_num <- test_samp %>% select(selected_features[1], selected_features[2])

# Binary labels for test set 
test_bw <- test_samp$bw 

```  

The proportion of ransomware addresses in the original data set is `r ransomprop`.  The total number of NA or missing values in the original data set is `r no_nas`.

The ransomware addresses make up less than 2% of the overall data set.  This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets.  In fact, some of the ransomware groups have only a single member, making categorization a dubious task.  At least there are no missing values to worry about.

```{r data_sparsness, echo=FALSE}

labels <- ransomware$label  %>% summary() 

knitr::kable(
    list(labels[1:10], labels[11:20], labels[21:29]),
  caption = 'Ransomware group labels and frequency counts for full data set',
  booktabs = TRUE)

```

Let's take a look at the distribution of the different features.  Note how skewed the non-temporal features are, some of them being bimodal.  The distributions are easier to read with a log-scale x-axis.

```{r histograms, echo=FALSE, warning=FALSE, fig.align="center"}
########################################################
## Histograms of each of the columns to show skewness
## Plot histograms for each column using facet wrap
########################################################

# Remove non-numerical and temporal columns to look for patterns in 
# topologically defined features
train_hist <- train_samp %>% select(-address, -label, -bw, -day, -year)

# Apply pivot_longer function to facilitate facet wrapping
train_long <- train_hist %>%         
  pivot_longer(colnames(train_hist)) %>% 
  as.data.frame()

# Log scale on value axis, 
histograms <- ggplot(train_long, aes(x = value)) +  
  geom_histogram(aes(y = ..density..), bins=20) + 
  geom_density(col = "green", size = .5) +
  scale_x_continuous(trans='log2') +
  facet_wrap(~ name, scales = "free")

histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))


```


Now let us compare the relative spread of each feature by calculating the coefficient of variation for each column.  Larger coefficients of variation indicate larger relative spread compared to other columns.
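
As a small worked example, the coefficient of variation is the standard deviation of a column divided by its mean.  The toy columns below are invented for illustration and are not taken from the data set; the helper name `coeff_var` is likewise an assumption.

```r
# Toy columns (invented values): one tightly clustered, one strongly right-skewed
toy_cols <- data.frame(
  steady = c(9, 10, 11, 10, 10),
  skewed = c(1, 1, 1, 1, 96)
)

# Coefficient of variation = standard deviation / mean
coeff_var <- function(x) sd(x) / mean(x)

toy_cvs <- sapply(toy_cols, coeff_var)
print(round(toy_cvs, 3))
```

The skewed column, where a single large value dominates, has a far larger coefficient of variation than the steady column, even though the two columns are on different scales.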

```{r cv_results, echo=FALSE, fig.align="center"}

# Summarize results in a table
knitr::kable(
    list(coeff_vars[1:2], coeff_vars[3:4], coeff_vars[5:6]),
  caption = 'Coefficients of Variation for each feature',
  booktabs = TRUE)

```

From this, it appears that `r selected_features[1]` has the widest range of variability, followed by `r selected_features[2]`.  These are also the features that are most strongly skewed to the right, meaning that a few addresses have very high values for each of these features while the bulk of the data set has very low values.

Taking the feature with the highest variation, `r selected_features[1]`, let us take a look at the distribution for individual ransomware families.  Perhaps there is a similarity across families.


```{r variation_histograms, echo=FALSE, fig.height=2, fig.width=2.5, fig.show="hold", out.width='35%', warning=FALSE}

# Density plots of the feature with highest variation
selected_feature1 <- selected_features[1]

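# Copy the selected feature into its own numeric column; .data[[...]] looks
# the column up by name instead of coercing the name string itself
ransomware_big_families <- ransomware %>% 
  mutate(selected_feature1 = as.numeric(.data[[selected_features[1]]]))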

# Note: Putting these graphs into a for loop breaks some of the formatting.
# Low membership makes some of the graphs not very informative
# Relatively meaningless graphs have been left out to save time and space.

# Helper: density plot of income for the i-th label level.  Each call below is
# made at the top level of the chunk so that knitr prints every plot separately.
plot_family_income <- function(i) {
  family_name <- levels(ransomware_big_families$label)[i]
  ransomware_big_families %>% 
    filter(label == family_name) %>% 
    select(income) %>% 
    ggplot(aes(x = income, y = ..density..)) +
    geom_density(col = "green") +
    ggtitle(family_name) +
    scale_x_continuous(trans = 'log2') + 
    theme(axis.text.x = element_text(size = 8, angle = 30, hjust = 1),
          plot.title = element_text(size = 9, face = "bold"),
          axis.title.x = element_blank(),
          axis.title.y = element_blank())
}

# One plot per label level that has enough members to be informative
plot_family_income(1)
plot_family_income(4)
plot_family_income(5)
plot_family_income(6)
plot_family_income(7)
plot_family_income(8)
plot_family_income(10)
plot_family_income(11)
plot_family_income(12)
plot_family_income(13)
plot_family_income(14)
plot_family_income(15)
plot_family_income(16)
plot_family_income(18)
plot_family_income(20)
plot_family_income(22)
plot_family_income(23)
plot_family_income(24)
plot_family_income(27)
plot_family_income(28)
plot_family_income(29)


``` 

It appears that, although the income distribution (as an example feature to consider) for ransomware groups does differ from the distribution pattern for *white* addresses, it also varies from group to group.  For this reason, this makes a good feature to use in the training of the models.

```{r shrimp-percentage, echo=FALSE, include=FALSE}

# Keep addresses with an income below one hundred bitcoins (10^10 satoshis)
shrimp <- ransomware %>% filter(income < 10^10 )

```  
  
Among addresses with an income of less than one hundred bitcoins, the proportion associated with ransomware is `r mean(shrimp$bw == "black")`.  The significance of this figure is not yet clear, but it is recorded here for reference.

### Insights gained from exploration

  After visually and statistically exploring the data, it becomes clear what the challenge is.  Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses.  This small percentage is also further classified into 28 groups.  Perhaps the original paper was overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses.  To simplify our approach, we will categorize the addresses in a binary way as either *white* or *black*, where *black* signifies an association with ransomware transactions.  Asking this as a "ransomware or not-ransomware" question allows for application of methods that have been shown to be impractical otherwise.
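
The binary relabelling described above can be sketched as follows.  The family names `familyA` and `familyB` are placeholders rather than labels from the data set, and the vector names are assumptions made for this example.

```r
# Toy labels (family names are placeholders, not taken from the data set)
toy_label <- factor(c("white", "familyA", "white", "familyB", "familyA", "white"))

# Collapse every non-white label into a single "black" class
toy_bw <- factor(ifelse(toy_label == "white", "white", "black"))

# Cross-tabulate the original labels against the binary labels
print(table(toy_label, toy_bw))
```

Every family ends up in the *black* column of the table, so the 29-way problem reduces to a two-class problem without discarding any observations.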

---

## Modelling approach

  Akcora, et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".$^{[3, 11]}$  Considering all ransomware addresses as belonging to a single group may improve the predictive power of such methods, making Random Forest worth another try.  
  
  The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other.  Searching for *topo* in the documentation for the `caret` package$^{[6]}$ resulted in the entry for Self Organizing Maps (SOMs), supplied by the `kohonen` package.  The description at CRAN$^{[7]}$ was intriguing enough to merit further investigation. 
  
  Initially, the categorization of ransomware into the 28 different families was attempted using SOMs.  This proved to be very resource intensive, requiring more time and RAM than was available.  Although it did help to illuminate how SOMs are configured, the resource requirements of the algorithm became a deterrent.  It was at this point that the SOMs were applied in a binary way, classifying all ransomware addresses as merely *black*, initially in an attempt to simply get the algorithm to run to completion without error.  This seemed to reduce RAM usage to the point of being feasible on the available hardware.  
  
   Self Organizing Maps were not covered in the coursework at any point, so a familiar method was sought against which to compare the results.  Random Forest was chosen and applied to the data set in a binary way, classifying every address as either *white* or *black*, ignoring the ransomware families.  Surprisingly, not only did the Random Forest approach result in an acceptable model, it did so much more quickly than expected, taking only a few minutes to produce results.
   
   At this point, it was very tempting to leave it there and write up a comparison of the two approaches to the binary problem, in which all ransomware-related addresses are classified as *black*.  However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families.  The binary Random Forest approach has high accuracy and precision, so removing the addresses it predicts as *white* largely eliminates the sparseness of ransomware in the remaining set.  There are a few cases of false negatives, depending on how the randomization is done during the sampling process.  However, the Random Forest method does not seem to produce many false positives (if any), meaning it almost never predicts a truly *white* address as being *black*.  Hence, by applying the Random Forest method first, we filter out nearly all potential false positives by correctly identifying a very large set of *white* addresses, which are then removed from the set.  By contrast, the best model used in the original paper by Akcora, et al. resulted in more false positives than true positives.  This low precision is what made it impractical for real-world usage.[3]
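
The distinction between false positives and false negatives drawn above can be made concrete with a small base R sketch.  The two vectors below are invented toy data, not results from this report; they show a classifier with no false positives (precision of one) that still misses one *black* address (recall below one).

```r
# Toy ground truth and predictions (hypothetical values)
lv    <- c("black", "white")
truth <- factor(c("black", "black", "black", "white", "white",
                  "white", "white", "white"), levels = lv)
pred  <- factor(c("black", "black", "white", "white", "white",
                  "white", "white", "white"), levels = lv)

# Confusion matrix: rows are predictions, columns are the reference labels
cm <- table(Prediction = pred, Reference = truth)

tp <- cm["black", "black"]  # black addresses correctly flagged
fp <- cm["black", "white"]  # white addresses wrongly flagged
fn <- cm["white", "black"]  # black addresses that were missed

precision <- tp / (tp + fp)  # 2 / (2 + 0) = 1
recall    <- tp / (tp + fn)  # 2 / (2 + 1) = 0.667 (rounded)
```

A filter with this profile never discards a stage-two candidate wrongly flagged as ransomware, because it flags none wrongly; its cost is the occasional ransomware address that slips through as *white*.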
   
   This all inspired a two-part method to first separate the addresses into *black* and *white* groups, and then further classify the *black* addresses into ransomware families.  We shall explore each of these steps separately.
  
### Method Part 0:  Binary SOMs

The first working model that ran to completion without exhausting computer resources did not make use of the ransomware family labels and instead used only the two categories *black* and *white*.  The `kohonen` package provides algorithms for both supervised and unsupervised model building.  A supervised approach was used since the data set includes known class labels that can be used to train the model.

```{r binary_SOMs, echo=FALSE, include=FALSE}
##############################################################################
## This is a first attempt using SOMs to model the data set as "black" and
## "white" addresses only.
##
## NOTE:  This is the most computationally heavy part of the paper and takes
## several hours to run to completion.  It is also completely optional, only
## used to compare with the better method. If, for some reason, you want to 
## compile the report without this section, you can just comment it all out
## or remove it because nothing is needed from Method Part 0 for any of the
## other methods.  In other words, it can be safely skipped if you are short on 
## time or RAM.
##############################################################################

# Keep only numeric columns, ignoring dates and looped.
som1_train_num <- train_set %>% select(length, weight, count, neighbors, income)

# SOM function can only work on matrices
som1_train_mat <- as.matrix(scale(som1_train_num))

# Select the same numeric columns from the test set
som1_test_num <- test_set %>% select(length, weight, count, neighbors, income)

# Note that when we rescale our testing data we need to scale it
# according to how we scaled our training data.
som1_test_mat <- 
  as.matrix(scale(som1_test_num, center = attr(som1_train_mat, "scaled:center"),
                  scale = attr(som1_train_mat, "scaled:scale")))

# Binary outputs, black=ransomware, white=non-ransomware, train set
som1_train_bw <- train_set$bw %>% classvec2classmat()

# Same for test set
som1_test_bw <- test_set$bw %>% classvec2classmat()

# Create Data list for supervised SOM
som1_train_list <-
  list(independent = som1_train_mat, dependent = som1_train_bw)

############################################################################
## Calculate ideal grid size according to:
## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################

# Formulaic method 1
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2
#grid_size = ceiling(sqrt(length(unique(ransomware$bw))))
grid_size

# Create SOM grid
som1_train_grid <- 
  somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)

## Now build the model.
som_model1 <- xyf(som1_train_mat, som1_train_bw,
                 grid = som1_train_grid, 
                 rlen = 100,
                 mode="pbatch", 
                 cores = n_cores, 
                 keep.data = TRUE
)


# Now test predictions

som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)

ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)


# Confusion matrix
som1_cm_bw <- 
  confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)

# Now test predictions of validation set

# Select the same numeric columns from the validation set
valid_num <- validation %>% select(length, weight, count, neighbors, income)

# Note that when we rescale our validation data we need to scale it 
# according to how we scaled our training data.
valid_mat <- 
  as.matrix(scale(valid_num, center = attr(som1_train_mat,  "scaled:center"),
                  scale = attr(som1_train_mat, "scaled:scale")))

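# Convert labels to a class matrix, as was done for the train and test sets
valid_bw <- validation$bw %>% classvec2classmat()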

valid_list <- list(independent = valid_mat, dependent = valid_bw)

# Requires up to 16GB of RAM, skip if resources are limited
ransomware.prediction1.validation <- predict(som_model1, newdata = valid_list)

# Confusion matrix
cm_bw.validation <- 
  confusionMatrix(ransomware.prediction1.validation$prediction[[2]],
                  validation$bw)

```  

After training the model, we obtain the confusion matrices for the test set and the validation set, separately.

```{r binary_SOM_results, echo=FALSE, results='asis' }

cm1_test_set <- som1_cm_bw %>% as.matrix() %>% 
  knitr::kable(format = "latex", booktabs = TRUE)

cm1_validation_set <- cm_bw.validation %>% as.matrix() %>% 
  knitr::kable(format = "latex", booktabs = TRUE)


cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{test set}
      \\centering",
        cm1_test_set,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{validation set}",
        cm1_validation_set,
    "\\end{minipage} 
\\end{table}"
))  

```  


This is a computationally intensive and somewhat inaccurate method compared to what follows.  It was left out of the final version of the script and has been included here only for model comparison and to track developmental evolution.
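
One detail of the SOM preparation above is worth isolating: the test and validation sets are rescaled with the centering and scaling values computed from the training set, rather than with their own.  The following is a minimal base R sketch of that step on toy matrices; the names `train`, `test`, `train_scaled`, and `test_scaled` are hypothetical and unrelated to the objects used in the report.

```r
set.seed(1)  # reproducible toy data

# Two numeric features for a toy training set and a toy test set
train <- matrix(rnorm(20, mean = 10, sd = 2), ncol = 2)
test  <- matrix(rnorm(10, mean = 10, sd = 2), ncol = 2)

# scale() stores the column means and standard deviations as attributes
train_scaled <- scale(train)

# Reuse the training means and standard deviations for the test set
test_scaled <- scale(test,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

# Training columns are centered at zero; test columns generally are not
colMeans(train_scaled)
colMeans(test_scaled)
```

Scaling each set independently would place the same raw value at different positions in the two sets, so a model trained on one would see shifted inputs in the other.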

### Method Part 1:  Binary Random Forest

A Random Forest model is trained using ten-fold cross validation and a tuning grid in which the number of variables randomly sampled as candidates at each split (`mtry`) takes the values $\{2, 4, 6, 8, 10, 12\}$, each of which is evaluated to find the optimum.

```{r random_forest_prep, echo=FALSE, include=FALSE, warning=FALSE}
##############################################################################
## This is a better attempt using Random Forest to model the data set as
## "black" and "white" addresses only.
##############################################################################

# Cross Validation, ten fold
control <- trainControl(method="cv", number = 10)

# Control grid with variation on mtry
grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))

# Run Cross Validation using control and grid set above
rf_model <- train(train_num, train_bw, method="rf", 
                  trControl = control, tuneGrid=grid)

# Supervised fit of model using the cross-validated value of mtry
fit_rf <- randomForest(train_samp, train_bw,
                       mtry = rf_model$bestTune$mtry)

# Measure accuracy of model against test sample
y_hat_rf <- predict(fit_rf, test_samp)
cm_test <- confusionMatrix(y_hat_rf, test_bw)


# Measure accuracy of model against full ransomware set
ransomware_y_hat_rf <- predict(fit_rf, ransomware)
cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)


```  

The confusion matrix for the test set shows excellent results, specifically in the areas of accuracy and precision.


```{r random-forest-output_test, echo=FALSE}

# Confusion matrix for test set
cm2_test_set <- cm_test %>% as.matrix() %>%
  knitr::kable(format = "latex", booktabs = TRUE)

# overall results
cm2_overall <- cm_test$overall %>%
  knitr::kable(format = "latex", booktabs = TRUE)

# by class.
cm2_byClass <- cm_test$byClass %>%
  knitr::kable(format = "latex", booktabs = TRUE)


# Confusion matrix for full ransomware set,
cm3_full_set <- cm_ransomware %>% as.matrix() %>%
  knitr::kable(format = "latex", booktabs = TRUE)

# overall results 
cm3_overall <- cm_ransomware$overall %>%
  knitr::kable(format = "latex", booktabs = TRUE)

#  by class.
cm3_byClass <- cm_ransomware$byClass %>%
  knitr::kable(format = "latex", booktabs = TRUE)


```  

Here are the confusion matrices for the test set and the full set resulting from the Random Forest model, respectively.

```{r random-forest-comfusion_matrices, echo=FALSE, results='asis'}

# Print both tables on one line
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{confusion matrix for test set}
      \\centering",
        cm2_test_set,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{confusion matrix for full set}",
        cm3_full_set,
    "\\end{minipage} 
\\end{table}"
))  


```  

The confusion matrix for the full ransomware set is very similar to that of the test set.

The overall statistics for the test and full sets are also good.

```{r random-forest-overall_results, echo=FALSE, results='asis'}

# Print both tables on one line
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{test set overall results}
      \\centering",
        cm2_overall,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{full set overall results}",
        cm3_overall,
    "\\end{minipage} 
\\end{table}"
))  

```  

The results by class for the test and full sets follow.

```{r random-forest-results_by_class, echo=FALSE, results='asis'}

# Print both tables on one line
cat(c("\\begin{table}[!htb]
    \\begin{minipage}{.5\\linewidth}
      \\caption{test set results by class}
      \\centering",
        cm2_byClass,
    "\\end{minipage}%
    \\begin{minipage}{.5\\linewidth}
      \\centering
        \\caption{full set results by class}",
        cm3_byClass,
    "\\end{minipage} 
\\end{table}"
))  

```  

This is a much quicker way of removing most of the *white* addresses than the binary SOM, and it will be used in the final composite model to save time.

### Method Part 2:  Categorical SOMs

Now we train a new model after throwing away all *white* addresses.  The predictions from the Random Forest model are used to isolate all *black* addresses for further classification into ransomware families using SOMs.  The reduced set is then categorized using a supervised SOM method with the 28 ransomware families as the target classification groups.

```{r soms-prep, echo=FALSE, include=FALSE}

##############################################################################
## Now we use the Random Forest model to classify the data set into "black" 
## and "white" categories with better precision.
##############################################################################

# Now use this prediction to reduce the original set to only "black" addresses
# First append the full set of predictions to the original set.
ransomware$prediction <- ransomware_y_hat_rf

# Filter out all the predicted "white" addresses, 
# leaving only predicted "black" addresses
black_addresses <- ransomware %>% filter(prediction=="black")

# Split the reduced black-predictions into a training set and a test set @ 50%
test_index <- createDataPartition(y = black_addresses$prediction,
                                  times = 1, p = .5, list = FALSE)

train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]

# Keep only numeric columns, ignoring temporal variables.
train_num <- train_set %>% 
  select(income, neighbors, weight, length, count, looped)

# SOM function can only work on matrices. 
train_mat <- as.matrix(scale(train_num))

# Select non-temporal numerical features only
test_num <- test_set %>% 
  select(income, neighbors, weight, length, count, looped)

# Testing data is scaled according to how we scaled our training data.
test_mat <- as.matrix(scale(test_num, 
                            center = attr(train_mat, "scaled:center"),
                            scale = attr(train_mat, "scaled:scale")))

# Categorical labels for training set
train_label <- train_set$label %>% classvec2classmat()

# Same for test set
test_label <- test_set$label %>% classvec2classmat()

# Create data list for supervised SOM
train_list <- list(independent = train_mat, dependent = train_label)

############################################################################
## Calculate ideal grid size according to:
## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################

# Formulaic method 1, makes a larger graph in this case
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))

# Based on categorical number, method 2, smaller graph with fewer cells
#grid_size = ceiling(sqrt(length(unique(ransomware$label))))

# Create SOM grid
train_grid <- somgrid(xdim=grid_size, ydim=grid_size, 
                      topo="hexagonal", toroidal = TRUE)

## Now build the SOM model using the supervised method xyf()
som_model2 <- xyf(train_mat, train_label,
                  grid = train_grid, 
                  rlen = 100,
                  mode="pbatch", 
                  cores = n_cores,
                  keep.data = TRUE
)

# Now test predictions of test set, create data list for test set
test_list <- list(independent = test_mat, dependent = test_label)

# Generate predictions
ransomware_group.prediction <- predict(som_model2, newdata = test_list)

# Confusion matrix
cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
                             test_set$label)


```  

When selecting the grid size for a Self Organizing Map, there are at least two different schools of thought.  The two that were tried here are explained (with supporting documentation) on a ResearchGate forum.[8]  The first method is based on the size of the training set, and in this case results in a larger, more accurate map.  The second method is based on the number of known categories to classify the data into, and in this case results in a smaller, less accurate map.  For this script, a grid size of `r grid_size` has been selected.
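
The two grid-size heuristics can be compared directly.  The sketch below mirrors the two formulas that appear in the script, but the training-set size `n_train` is an invented round number used only for illustration, so the resulting values are not the ones used in this report.

```r
n_train   <- 20000  # hypothetical number of training rows
n_classes <- 28     # number of known ransomware families

# Method 1: about 5 * sqrt(N) map units in total, arranged on a square grid
grid_by_size <- round(sqrt(5 * sqrt(n_train)))

# Method 2: just enough units on a square grid to give each class one unit
grid_by_classes <- ceiling(sqrt(n_classes))

grid_by_size     # 27, i.e. a 27 x 27 map
grid_by_classes  # 6, i.e. a 6 x 6 map
```

With these toy numbers the first method produces a map with roughly twenty times as many units, which illustrates why it is the more expensive of the two to train.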

A summary of the results for the categorization of black addresses into ransomware families follows.  For the full table of predictions and statistics, see the Appendix.

Here are the overall results of the final categorization.

```{r cm_overall, echo=FALSE}

# Overall section of the confusion matrix formatted through kable()
cm_labels$overall %>% knitr::kable(caption="overall categorization results")

```  

Here are the final results by class.

```{r soms-output-by-class, echo=FALSE, size="tiny"}

# By Class section of the confusion matrix formatted through kable()
cm_labels$byClass %>% knitr::kable(caption="categorization results by class")

```  

\newpage

### Clustering Visualizations

Toroidal neural node maps are used to generate the models, and can be visualized in a number of ways, including heatmaps and K-means clustering.

```{r binary som graphs, echo=FALSE, fig.show="hold", out.width='35%'}

# Be careful with these, some are really large and take a long time to produce.

# Visualize neural network mapping
plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)
#cat(" \n")

# Quality map: mean distance of mapped objects to their unit's codebook vector
plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize counts
plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize fan diagram
plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize heatmap for variable 1
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,1],
     main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize heatmap for variable 2
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,2],
     main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize heatmap for variable 3
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,3],
     main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize heatmap for variable 4
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,4],
     main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize heatmap for variable 5
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,5],
     main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
#cat(" \n")

# Visualize heatmap for variable 6
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,6],
     main=colnames(train_num)[6], pch = 19, palette.name = topo.colors)
#cat(" \n")

```  

K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.

```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
## K-Means Clustering to visualize the categorization of the SOM
## For a good tutorial, see:
## https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
#############################################################################

# Set number of clusters to be equal to number of known ransomware groups
n_groups <- length(unique(ransomware$label)) - 1

# Generate k-means clustering
som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)

```  

K-means clustering categorizes the SOM grid by adding boundaries to the classification groups.  This is the author's favorite graph in the entire report.

```{r clustering-plot, echo=FALSE, fig.align="center"}

# Plot K-means clustering results
plot(som_model2,
     main = 'K-Means Clustering',
     type = "property",
     property = som.cluster$cluster,
     palette.name = topo.colors)
add.cluster.boundaries(som_model2, som.cluster$cluster)

```  
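
The clustering step can be illustrated without a trained SOM.  In the sketch below, a random matrix named `codes` stands in for the codebook vectors of a map (one row per map unit); the matrix, its dimensions, and the choice of four clusters are all hypothetical.

```r
set.seed(42)  # k-means starts from random centers

# Hypothetical codebook: 25 map units (a 5 x 5 grid) by 3 scaled features
codes <- matrix(rnorm(25 * 3), nrow = 25, ncol = 3)

# Group the units into 4 clusters; nstart repeats the random initialization
km <- kmeans(codes, centers = 4, nstart = 10)

# One cluster id per map unit, in the same order as the rows of `codes`
km$cluster
```

Because there is one cluster id per map unit, the vector can be passed as a per-unit property when plotting a map, which is how cluster boundaries end up drawn between neighboring units.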

---

##  Results & Performance

### Results

   The first attempt to isolate ransomware using SOMs resulted in a model with an accuracy of `r toString(cm_bw.validation$overall["Accuracy"])` and a precision of `r toString(cm_bw.validation$byClass[3])`.
   
   The second attempt to isolate ransomware using Random Forest resulted in a model with an accuracy of `r toString(cm_ransomware$overall["Accuracy"])` and a precision of `r toString(cm_ransomware$byClass[3])`.
   
   Classifying the ransomware predicted by the second attempt into 28 ransomware families resulted in a model with an overall accuracy of `r toString(cm_labels$overall["Accuracy"])` and minimum nonzero precision of `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`.


### Performance

  The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM.  Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, even with moderate computing resources.  Just for kicks, the final script was also run on a more humble computer with the following specifications:

#### ASUS Eee PC 1025C  
     
   - CPU:  Intel Atom N2600 @ 1.600GHz (64-bit Intel Atom quad-core x86)
   - RAM:  3911MB DDR3 @ 800 MT/s  (4 GB)
   
   This is a computer known for being slow and clunky.  Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds.  At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, but it does show that the script can be run to completion on very modest hardware.
   
#### Pine64 Quartz64 Model A

  - CPU:  Rockchip RK3566 SoC aarch64 (64-bit quad-core ARM)
  - RAM:  DDR4 8080MB (8 GB)
  
  Single board computer / development board.  This was run to benchmark a modern 64-bit ARM processor.  The script runs in about 860 seconds on this platform, roughly half the time required by the Atom processor above.

---

## Summary

### Comparison to results from original paper

   Akcora et al. tested several different sets of parameters on their TDA model.  According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."[3]  In fact, the highest precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610.  By comparison, although several of our predicted classes had zero or NA precision values, the lowest non-zero precision value is `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`, with many well above that, approaching one in a few cases.
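
The quoted false-positive ratios can be put on the same scale as precision.  The helper below, `precision_from_ratio`, is a hypothetical function written for this illustration; it treats each quoted figure as an aggregate count of false positives per true positive, which is an assumption about how those figures were computed, so the results need not match any per-family precision reported in the original paper.

```r
# If a model yields r false positives per true positive, then
# precision = TP / (TP + FP) = 1 / (1 + r)
precision_from_ratio <- function(fp_per_tp) 1 / (1 + fp_per_tp)

precision_from_ratio(16.59)  # best TDA models, about 0.057
precision_from_ratio(27.44)  # best non-TDA models, about 0.035
```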

  One might say that we are comparing apples to oranges in a sense, because their method was one single model, while these results are from a two-method stack.  Still, given the run time of the final script, I think the two-model approach is superior in this case, especially when measured in terms of precision and avoiding false positives.


### Limitations

  SOMs seem like they are easy to misconfigure.  Perhaps a dual Random Forest approach would be better.  This has not been attempted yet, as the two-method approach presented here was satisfactory enough to present in a report.

### Future Work

  I only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation.  Also, a dual Random Forest approach, first isolating the ransomware addresses and then classifying them into families, could be attempted.
  
  The script itself has a few areas that could be further optimized.  The sampling method does what it needs to do, but the ratios taken for each set could possibly be improved.

### Conclusion

   This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives by filtering them out using a binary method before classifying them further.  It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy coins.  Privacy coins such as Monero obfuscate transactions so that they cannot be analyzed in the same way that the Bitcoin network has been analyzed here.  Some progress has been made towards analyzing such networks[9], but the developers of such networks continually evolve the code to complicate transaction tracking.  This could be another good area for future research.

## References  

[1] Adam Brian Turner, Stephen McCombie and Allon J. Uhlmann (November 30, 2020) [Analysis Techniques for Illicit Bitcoin Transactions](https://doi.org/10.3389/fcomp.2020.600596)

[2] Daniel Goldsmith, Kim Grauer and Yonah Shmalo (April 16, 2020) [Analyzing hack subnetworks in the bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)

[3] Cuneyt Gurcan Akcora, Yitao Li, Yulia R. Gel, Murat Kantarcioglu (June 19, 2019) [BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain](https://arxiv.org/abs/1906.07852)

[4] UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php

[5] BitcoinHeist Ransomware Address Dataset https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset

[6] Available Models - The `caret` package http://topepo.github.io/caret/available-models.html

[7] Ron Wehrens and Johannes Kruisselbrink, Package ‘`kohonen`’ @ CRAN (2019) https://cran.r-project.org/web/packages/kohonen/kohonen.pdf

[8] How many nodes for self-organizing maps? (Oct 22, 2021) https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps

[9] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava, Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)

\newpage

## Appendix:  

### Categorical SOM prediction table and confusion matrix

Here are the full prediction results for the categorization of *black* addresses into ransomware families.  It is assumed that all *white* addresses have already been removed.

```{r soms-output-table, echo=FALSE}

# Final results of categorization of "black" addresses
# into ransomware families.
cm_labels

```
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								---
-												graphs are starting to look good.  Now pick the best ones and leave the rest.....

											
										
										
											2021-10-28 15:04:34 +02:00
+								title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain using Random Forest and Self Organizing Maps
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								          \vspace{.5in}
 								author: "Kaylee Robert Tejeda"
-												quick reboot, things are acting weird, but the graphs should have labels that do not overlap now, at least....

											
										
										
											2021-11-01 16:45:40 +01:00
+								date: "11/11/2021"
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								abstract: "Ransomware is a persisent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
-												Finished first draft of Introduction section, up to Visual Analysis and Exploration.  I am satisfied with it up to that point.  Chunk #2 begins tomorrow, produce the basic analysis graphs I have already outlined....

											
										
										
											2021-10-13 07:20:59 +02:00
+								keywords:
 								- Bitcoin
 								- blockchain
 								- ransomware
 								- machine learning
-												graphs are starting to look good.  Now pick the best ones and leave the rest.....

											
										
										
											2021-10-28 15:04:34 +02:00
+								- Random Forest
-												Finished first draft of Introduction section, up to Visual Analysis and Exploration.  I am satisfied with it up to that point.  Chunk #2 begins tomorrow, produce the basic analysis graphs I have already outlined....

											
										
										
											2021-10-13 07:20:59 +02:00
+								- Self Organizing Maps
 								- SOMs
 								- cryptocurrency
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								output: pdf_document
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								header-includes:
 								- \usepackage{booktabs}
-												Finished first draft of Introduction section, up to Visual Analysis and Exploration.  I am satisfied with it up to that point.  Chunk #2 begins tomorrow, produce the basic analysis graphs I have already outlined....

											
										
										
											2021-10-13 07:20:59 +02:00
+								geometry: margin=2cm
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								---
-												Added introductory section (chunk #1).

											
										
										
											2021-10-08 08:07:38 +02:00
+								\def\bitcoinA{%
 								  \leavevmode
 								  \vtop{\offinterlineskip %\bfseries
 								    \setbox0=\hbox{B}%
 								    \setbox2=\hbox to\wd0{\hfil\hskip-.03em
 								    \vrule height .3ex width .15ex\hskip .08em
 								    \vrule height .3ex width .15ex\hfil}
 								    \vbox{\copy2\box0}\box2}}
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								```{r setup, include=FALSE}
 								knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
-												Added introductory section (chunk #1).

											
										
										
											2021-10-08 08:07:38 +02:00
+								def.chunk.hook  <- knitr::knit_hooks$get("chunk")
 								knitr::knit_hooks$set(chunk = function(x, options) {
 								  x <- def.chunk.hook(x, options)
-												graphs are starting to look good.  Now pick the best ones and leave the rest.....

											
										
										
											2021-10-28 15:04:34 +02:00
+								  ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x,
 								                                              "\n\n \\normalsize"), x)
-												Added introductory section (chunk #1).

											
										
										
											2021-10-08 08:07:38 +02:00
+								})
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								```
 								\newpage
 								&nbsp;
 								\vspace{25pt}
 								\tableofcontents
 								\newpage
 								## Introduction
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								  Ransomware attacks are of interest to security professionals, law enforcement, and financial regulatory officials.$^{[1]}$  The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location.  The victims (usually hospitals or other large organizations) come to learn that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address by a certain deadline to have the data decrypted or else it will be deleted automatically.
+								  The deeper legal and financial implications of ransomware attacks are outside the scope of this report, as we are only interested in classifying Bitcoin addresses by their connection to ransomware transactions. Many researchers already track illicit activity (such as ransomware payments) on the Bitcoin blockchain, aiming to flag it as early as possible to minimize financial losses.  Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$  For example, consider a ransomware attack conducted against an illegal darknet market site. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. Such a flag may, in fact, be the first public notice of the event. Any suspicious addresses could then be blacklisted or banned from using other services, if so desired.
+								  Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well-known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results.  In that paper, Akcora et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as *white*.  The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*.  Edges are formed between the nodes when a transaction can be associated with a particular address.
+								  Any given address on the Bitcoin network may appear many times, with different inputs and outputs each time.  The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference.  This way, variables can be defined in a specific and meaningful way.  For example, *speed* can be defined as the number of blocks the coin appears in during a 24-hour period, which provides information on how quickly a coin moves through the network. *Speed* may be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a given 24-hour period, and thus have lower speeds when compared to "mixed" coins.  The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
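As an illustration of the 24-hour windowing, the following is a minimal sketch (not the original authors' code) of how a block timestamp could be mapped to the year and day of its UTC-6 window. The function name `window_of` and the example timestamp are assumptions made for this sketch.

```r
# Map a UTC timestamp to the (year, day-of-year) of its 24-hour window,
# using UTC-6 as the reference time zone.
window_of <- function(utc_time) {
  # Shift the clock back six hours, then read the calendar fields in UTC;
  # this avoids depending on the system's time zone database.
  shifted <- utc_time - 6 * 3600
  list(year = as.integer(format(shifted, "%Y", tz = "UTC")),
       day  = as.integer(format(shifted, "%j", tz = "UTC")))
}

# 03:00 UTC on 1 January 2018 is still 31 December 2017 in UTC-6
t <- as.POSIXct("2018-01-01 03:00:00", tz = "UTC")
w <- window_of(t)
```

Here `w$year` is 2017 and `w$day` is 365, so a transaction shortly after midnight UTC falls in the previous day's window.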
+								 With the graph specified as such, the following six numerical features$^{[2]}$ are associated with a given address:
+								1)  *Income* - the total amount of coins sent to an address
+								2)  *Neighbors* - the number of transactions that have this address as one of their output addresses
+								3)  *Weight* - the sum of the fractions of coins that reach this address from addresses that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"
+								4)  *Length* - the number of non-starter transactions on its longest chain, where a chain is defined as an acyclic directed path originating from any starter transaction and ending at the address in question
+								5)  *Count* - the number of starter addresses connected to this address through a chain
+								6)  *Looped* - the number of starter addresses connected to this address by more than one path
+								These variables are defined rather conceptually, viewing the blockchain as a topological graph with nodes and edges.  The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen.  We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that.  Machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the new results will be compared to the original ones.
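To make the two simplest definitions concrete, here is a small sketch that computes *income* and *neighbors* from a toy table of transaction outputs. The table `outputs` and its column names are hypothetical and are not taken from the BitcoinHeist data set. The remaining four features require traversing chains in the graph and are not attempted here.

```r
# Toy table of transaction outputs: one row per (transaction, output address)
outputs <- data.frame(
  tx      = c("t1", "t1", "t2", "t3", "t3"),
  address = c("A",  "B",  "A",  "A",  "C"),
  amount  = c(0.5,  1.0,  2.0,  0.25, 3.0)
)

# Income: total amount of coins sent to each address
income <- tapply(outputs$amount, outputs$address, sum)

# Neighbors: number of distinct transactions with the address as an output
neighbors <- tapply(outputs$tx, outputs$address,
                    function(x) length(unique(x)))
```

In this toy example address `A` receives coins from three transactions, so its *neighbors* value is 3 and its *income* is 2.75.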
 								### Data
+								   This data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$  as suggested in the project instructions.  The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term.  This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).  The data set was downloaded and the exploration began.
+								```{r install_load_libraries-and-download_data, echo=FALSE, include=FALSE}
+								# Set the repository mirror to "0-Cloud" for maximum availability
+								r = getOption("repos")
+								r["CRAN"] = "http://cran.rstudio.com"
+								options(repos = r)
 								rm(r)
+								# Install necessary packages if not already present
+								if(!require(tidyverse)) install.packages("tidyverse")
 								if(!require(caret)) install.packages("caret")
+								if(!require(randomForest)) install.packages("randomForest")
 								if(!require(kohonen)) install.packages("kohonen")
 								if(!require(parallel)) install.packages("parallel")
 								if(!require(matrixStats)) install.packages("matrixStats")
+								if(!require(xtable)) install.packages("xtable")
 								# Load Libraries
 								library(tidyverse)
 								library(caret)
+								library(randomForest)
 								library(kohonen)
 								library(parallel)
 								library(matrixStats)
+								library(xtable)
 								# Set # of cores, use detectCores() - 1 to leave one for the system
 								n_cores <- detectCores()
 								# Download data
+								url <-
 								  "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
+								dest_file <- "data/data.zip"
 								if(!dir.exists("data"))dir.create("data")
 								if(!file.exists(dest_file))download.file(url, destfile = dest_file)
+								# Unzip as CSV
 								if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file,
 								                                                   "BitcoinHeistData.csv",
 								                                                   exdir="data")
 								# Import data from CSV
 								ransomware <- read_csv("data/BitcoinHeistData.csv")
+								```
+								A summary of the data set shows the range of values and size of the sample.
+								```{r data_summary, echo=FALSE, size="tiny"}
 								# Summary
+								ransomware %>% summary() %>% knitr::kable(caption="Summary of data set")
+								```
+								A listing of the first ten rows provides a sample of the features associated with each observation.
+								```{r data_head, echo=FALSE, size="tiny"}
+								# Inspect data
+								ransomware %>% head(10) %>% knitr::kable(caption="First ten entries of data set")
 								```
+								This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain.  The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, looped*), two temporal features in the form of *year* and *day* (the day of the year, from 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (meaning not connected to any ransomware activity) or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua).$^{[3]}$
+								The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24-hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out, since ransom amounts are rarely below this threshold. Ransomware addresses were taken from three widely adopted studies: Montreal, Princeton, and Padua. *White* Bitcoin addresses were capped at one thousand per day, whereas the entire network sees up to 800,000 addresses daily.$^{[5]}$
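The two filtering rules described above can be sketched on toy data as follows. This is only an assumption-laden illustration, not the original team's pipeline: the data frames `edges` and `white`, the helper `cap_per_day`, and the use of `head()` (rather than whatever selection the original authors applied) are all hypothetical, and the cap is set to 2 instead of 1,000 so the effect is visible.

```r
# Rule 1: drop edges that transfer less than 0.3 BTC
edges <- data.frame(
  day    = c(1, 1, 1, 2, 2),
  amount = c(0.10, 0.50, 1.20, 0.29, 0.30)
)
min_amount <- 0.3
kept <- edges[edges$amount >= min_amount, ]

# Rule 2: cap the number of white addresses kept per day
cap_per_day <- function(df, cap) {
  # Split by day, keep at most `cap` rows of each day, and recombine
  parts <- lapply(split(df, df$day), function(d) head(d, cap))
  do.call(rbind, parts)
}
white <- data.frame(
  day     = c(1, 1, 1, 2),
  address = c("a", "b", "c", "d")
)
capped <- cap_per_day(white, cap = 2)
```

With these inputs, `kept` retains three of the five edges, and `capped` retains two addresses from day 1 and one from day 2.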
+								### Goal
+								  The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing an acceptable predictive model for categorizing ransomware addresses correctly.  Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
+								###  Outline of Steps Taken
+								1. Analyze data set numerically and visually, look for insights in any patterns.
+								2. Binary separation using Self Organizing Maps.
+								3. Fast binary separation using Random Forest.
+								4. Categorical classification using Self Organizing Maps.
 								5. Visualize clustering to analyze results further.
+								6. Generate confusion matrix to quantify results.
+								---
+								## Data Analysis
+								### Hardware Specification
+								   All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specifications.
+								   - CPU:  Intel i7-4600U @ 3.300GHz (4th Gen dual-core, four-thread i7 x86_64)
+								   - RAM:  8217MB DDR3L @ 1600 MHz  (8 GB)
 								   - OS:   Slackware64-current (15.0 RC1) `x86_64-slackware-linux-gnu` (64-bit GNU/Linux)
 								   - R version 4.0.0 (2020-04-24) -- "Arbor Day" (built from source using scripts from [slackbuilds.org](https://slackbuilds.org/))
 								   - RStudio Version 1.4.1106 "Tiger Daylily" (2389bc24, 2021-02-11) for CentOS 8 (converted using `rpm2tgz`)
 								###  Data Preparation
+								  It is immediately apparent that this is a rather large data set.  The usual practice of partitioning out 80% to 90% of the data for training results in a training set that is too large to process given the hardware limitations.  For reasons that no longer apply, the original data set was first split in half, with 50% reserved as a *validation set* and the other 50% used as the *working set*.  This working set was again split in half to give a *training set* and a *test set* of a manageable size.  Because these partitions were small enough to work with, the partition size ratio was not further refined; this is a potential area for later optimization. Careful sampling was carried out to ensure that the ransomware groups were represented in each sample.
+								```{r data_prep, echo=FALSE, include=FALSE}
+								# Turn labels into factors, "bw" is binary factor for ransomware/non-ransomware
 								ransomware <- ransomware %>%
 								  mutate(label=as.factor(label),
 								         bw=as.factor(ifelse(label=="white", "white", "black")))
+								# Validation set made from 50% of BitcoinHeist data, for RAM considerations
 								test_index <- createDataPartition(y = ransomware$bw,
 								                                  times = 1, p = .5, list = FALSE)
 								workset <- ransomware[-test_index,]
 								validation <- ransomware[test_index,]
+								# Split the working set into a training set and a test set @ 50%, RAM dictated
 								test_index <- createDataPartition(y = workset$bw,
 								                                  times = 1, p = .5, list = FALSE)
 								train_set <- workset[-test_index,]
 								test_set <- workset[test_index,]
+								# Find proportion of full data set that is ransomware
 								ransomprop <- mean(ransomware$bw=="black")
 								# Check for NAs
 								no_nas <- sum(is.na(ransomware))
+								```
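Because the partitions are stratified on the binary `bw` factor, a rare ransomware group could in principle end up missing from one of them. The sketch below shows one way such a check might be written. The helper `missing_labels` and the toy labels `groupA` and `groupB` are hypothetical and are used only for illustration.

```r
# Return the labels that occur in the full data but not in a partition
missing_labels <- function(full_labels, part_labels) {
  setdiff(unique(as.character(full_labels)),
          unique(as.character(part_labels)))
}

# Toy labels: four white addresses and two small ransomware groups
full  <- factor(c("white", "white", "white", "white",
                  "groupA", "groupA", "groupB"))
part1 <- full[c(1, 2, 5, 7)]  # contains every label
part2 <- full[c(3, 4, 6)]     # lacks "groupB"

absent1 <- missing_labels(full, part1)
absent2 <- missing_labels(full, part2)
```

Here `absent1` is empty while `absent2` contains `"groupB"`. Applied to the report's objects, a call such as `missing_labels(ransomware$label, train_set$label)` would list any groups absent from the training set.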
+								### Exploration and Visualization
 								By graphing the feature values, we can get an idea of how the data are distributed across the various features.
 								```{r cv_calcs, echo=FALSE}
 								# Keep only numeric columns, ignoring temporal features
 								ransomware_num <- ransomware %>%
+								  select(income, neighbors, weight, length, count, looped)
 								# Check for variation across numerical columns using coefficients of variation
 								#
 								# Calculate standard deviations for each column
 								sds <- ransomware_num %>% as.matrix() %>% colSds()
 								# Calculate means for each column
 								means <- ransomware_num %>% as.matrix() %>% colMeans()
 								# Calculate CVs for each column
 								coeff_vars <- sds / means
 								#  Select the two features with the highest coefficients of variation
 								selected_features <- names(sort(coeff_vars, decreasing=TRUE))[1:2]
 								#Sample every 100th row due to memory constraints
 								train_samp <- train_set[seq(1, nrow(train_set), 100), ]
 								# Keep only numeric columns with highest coefficients of variation
 								train_num <- train_samp %>% select(selected_features[1], selected_features[2])
 								# Binary labels, black = ransomware, white = non-ransomware, train set
 								train_bw <- train_samp$bw
 								#Sample every 100th row due to memory constraints to make test sample same size.
 								test_samp <- test_set[seq(1, nrow(train_set), 100), ]
 								# Dimension reduction again, selecting features with highest CVs
 								test_num <- test_samp %>% select(selected_features[1], selected_features[2])
 								# Binary labels for test set
 								test_bw <- test_samp$bw
 								```
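The coefficient-of-variation ranking computed in the chunk above can be illustrated on a small example. This is only a sketch: the toy data frame and its values are made up, and base R's `sapply(..., sd)` stands in for `colSds()`. The coefficient of variation is the standard deviation divided by the mean, using ordinary division.

```r
# Hypothetical toy data standing in for the numeric BitcoinHeist columns
toy <- data.frame(
  income = c(1, 2, 3, 4, 100),    # one extreme value, large relative spread
  length = c(10, 12, 14, 16, 18)  # evenly spaced, small relative spread
)

# Coefficient of variation = standard deviation / mean, per column
sds   <- sapply(toy, sd)
means <- colMeans(toy)
coeff_vars <- sds / means

# Rank features from largest to smallest relative spread
ranked_features <- names(sort(coeff_vars, decreasing = TRUE))
```

Because the measure is unitless, columns on very different scales can be compared directly.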
+								The proportion of ransomware addresses in the original data set is `r ransomprop`.  The total number of NA or missing values in the original data set is `r no_nas`.
 								The ransomware addresses make up less than 2% of the overall data set.  This presents a challenge, as the target observations are sparse within the data set, especially since they are further divided among 28 ransomware families.  In fact, some of the ransomware families have only a single member, making categorization a dubious task.  At least there are no missing values to worry about.
 								```{r data_sparsness, echo=FALSE}
+								labels <- ransomware$label  %>% summary()
 								knitr::kable(
+								    list(labels[1:10], labels[11:20], labels[21:29]),
+								  caption = 'Ransomware group labels and frequency counts for full data set',
+								  booktabs = TRUE)
 								```
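To make the sparsity issue concrete, here is a small sketch with a made-up label vector (the family names and counts are hypothetical, not taken from the BitcoinHeist data). It counts members per label, measures the share of non-"white" observations, and picks out families with a single member.

```r
# Hypothetical label vector: one dominant class and a few rare families
labels <- factor(c(rep("white", 980), rep("familyA", 15),
                   rep("familyB", 4), "familyC"))

# Frequency count per label, as summary() gives for a factor
counts <- summary(labels)

# Share of observations that are not in the dominant class
minority_share <- mean(labels != "white")

# Families with only one member
singletons <- names(counts[counts == 1])
```

A family with a single member can appear in the training set or the test set, but not both, which is why per-family classification is unreliable for such groups.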
 								Let's take a look at the distribution of the different features.  Note how skewed the non-temporal features are, some of them being bimodal.  The distributions are easier to read on a log-scale x-axis.
+								```{r histograms, echo=FALSE, warning=FALSE, fig.align="center"}
+								########################################################
 								## Histograms of each of the columns to show skewness
 								## Plot histograms for each column using facet wrap
 								########################################################
+								# Remove non-numerical and temporal columns to look for patterns in
 								# topologically defined features
 								train_hist <- train_samp %>% select(-address, -label, -bw, -day, -year)
+								# Apply pivot_longer function to facilitate facet wrapping
 								train_long <- train_hist %>%
+								  pivot_longer(colnames(train_hist)) %>%
+								  as.data.frame()
+								# Log scale on value axis,
 								histograms <- ggplot(train_long, aes(x = value)) +
+								  geom_histogram(aes(y = ..density..), bins=20) +
 								  geom_density(col = "green", size = .5) +
 								  scale_x_continuous(trans='log2') +
 								  facet_wrap(~ name, scales = "free")
 								histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))
 								```
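The effect of the log-scale axis can be sketched numerically. The example below uses simulated log-normal draws as a stand-in for a right-skewed feature (an assumption; the real features need not be log-normal) and a hand-written sample skewness function, and compares skewness before and after a `log2` transform.

```r
set.seed(42)
# Hypothetical right-skewed feature: log-normal draws
x <- rlnorm(10000, meanlog = 0, sdlog = 2)

# Sample skewness: third central moment scaled by the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)        # strongly positive: long right tail
skew_log <- skewness(log2(x))  # near zero: roughly symmetric
```

On the raw scale a handful of extreme values dominates the histogram; after the transform the shape of the bulk of the data becomes visible.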
+								Now let us compare the relative spread of each feature by calculating the coefficient of variation for each column.  Larger coefficients of variation indicate larger relative spread compared to other columns.
+								```{r cv_results, echo=FALSE, fig.align="center"}
+								# Summarize results in a table
+								knitr::kable(
 								    list(coeff_vars[1:2], coeff_vars[3:4], coeff_vars[5:6]),
 								  caption = 'Coefficients of Variation for each feature',
 								  booktabs = TRUE)
 								```
 								From this, it appears that `r selected_features[1]` has the greatest relative variability, followed by `r selected_features[2]`.  These are also the features most strongly skewed to the right: a few addresses have very high values for each of them, while the bulk of the data set has very low values.
 								Taking the feature with the highest variation, `r selected_features[1]`, let us look at its distribution within individual ransomware families.  Perhaps there is a similarity across families.
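The per-family density plots all follow one pattern: filter to a single label, then draw a density curve of `income` on a log2 x-axis. As a hypothetical sketch, that pattern could be wrapped in a helper. The function name `plot_family_density` and the toy data are assumptions, not part of the original script, and how such a helper behaves inside the report's figure layout has not been checked here.

```r
library(ggplot2)

# Hypothetical helper: density of `income` for one ransomware family,
# drawn on a log2 x-axis
plot_family_density <- function(data, family) {
  family_rows <- data[data$label == family, , drop = FALSE]
  ggplot(family_rows, aes(x = income)) +
    geom_density(col = "green") +
    ggtitle(family) +
    scale_x_continuous(trans = "log2") +
    theme(axis.text.x = element_text(size = 8, angle = 30, hjust = 1),
          plot.title = element_text(size = 9, face = "bold"),
          axis.title.x = element_blank(),
          axis.title.y = element_blank())
}

# Toy data with two made-up families
toy <- data.frame(label = factor(rep(c("familyA", "familyB"), each = 50)),
                  income = 2^seq(20, 30, length.out = 100))
p <- plot_family_density(toy, "familyA")
```

Calling `print(p)` draws the plot; the function returns one ggplot object per family.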
+								```{r variation_histograms, echo=FALSE, fig.height=2, fig.width=2.5, fig.show="hold", out.width='35%', warning=FALSE}
 								# Density plots of the feature with highest variation
 								selected_feature1 <- selected_features[1]
 								ransomware_big_families <- ransomware %>%
 								  mutate(selected_feature1 = as.numeric(.data[[selected_feature1]]))
 								# Note: Putting these graphs into a for loop breaks some of the formatting.
 								# Low membership makes some of the graphs not very informative
+								# Relatively meaningless graphs have been left out to save time and space.
+								# Label 1
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[1]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[1]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 4
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[4]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[4]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 5
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[5]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[5]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 6
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[6]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[6]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 7
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[7]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[7]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 8
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[8]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[8]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 10
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[10]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[10]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 11
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[11]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[11]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 12
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[12]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[12]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 13
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[13]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[13]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 14
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[14]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[14]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 15
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[15]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[15]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 16
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[16]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[16]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 18
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[18]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[18]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
+								# Label 20
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[20]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[20]) +
+								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
 								# Label 22
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[22]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[22]) +
 								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
 								# Label 23
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[23]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[23]) +
 								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
 								# Label 24
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[24]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[24]) +
 								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
 								# Label 27
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[27]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[27]) +
 								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
 								# Label 28
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[28]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[28]) +
 								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
 								# Label 29
 								ransomware_big_families %>%
 								  filter(label==levels(ransomware_big_families$label)[29]) %>%
 								  select(income) %>%
 								  ggplot(aes(x=income,  y = ..density..)) +
 								  geom_density(col = "green")+
 								  ggtitle(levels(ransomware_big_families$label)[29]) +
 								  scale_x_continuous(trans='log2')  +
 								  theme(axis.text.x = element_text(size = 8, angle=30, hjust=1),
 								        plot.title = element_text(size = 9, face = "bold"),
 								        axis.title.x=element_blank(),
 								        axis.title.y=element_blank())
```

Taking income as an example feature, the distribution for the ransomware families differs from the distribution for *white* addresses, and it also varies from family to family.  Because it helps to separate *white* addresses from ransomware as well as one family from another, income is a good feature to use when training the models.

```{r shrimp-percentage, echo=FALSE, include=FALSE}
 								# Keep addresses with an income below one hundred bitcoins (10^10 satoshis)
 								shrimp <- ransomware %>% filter(income < 10^10 )
```

Among addresses with an income below one hundred bitcoins ($10^{10}$ satoshis), the proportion labeled *black* is `r mean(shrimp$bw == "black")`.  The practical significance of this figure is unclear, but it is reported here for completeness.

### Insights gained from exploration

After exploring the data visually and statistically, the challenge becomes clear.  Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses.  This small percentage is further divided into 28 families.  Perhaps the original paper was overly ambitious in trying to sort all the addresses into 29 categories, including the vastly prevalent *white* addresses.  To simplify the approach, we categorize the addresses in a binary way as either *white* or *black*, where *black* signifies an association with ransomware transactions.  Posing this as a "ransomware or not-ransomware" question allows the application of methods that have been shown to be impractical otherwise.
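
The binary relabeling described above can be sketched in a few lines of base R.  This is only an illustration on toy data: the data frame `addresses` and the family names in it are stand-ins, and the level order of the new `bw` factor is an assumption rather than something taken from the report's own code.

```r
# Toy data: `label` mimics the family labels; the names are examples only.
addresses <- data.frame(
  label = c("white", "montrealCryptoLocker", "white",
            "paduaCryptoWall", "white"),
  stringsAsFactors = FALSE
)

# Collapse every ransomware family into a single "black" class.
# The level order (black first) is an assumption for this sketch.
addresses$bw <- factor(
  ifelse(addresses$label == "white", "white", "black"),
  levels = c("black", "white")
)

# Count the addresses in each of the two classes.
table(addresses$bw)
```

Every label other than `white` is mapped to `black`, so the 28 families and the *white* class reduce to two levels.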

---

## Modelling approach

Akcora, et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11]  Treating all ransomware addresses as a single group may improve the predictive power of such methods, making Random Forest worth another try.

The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other.  Searching for *topo* in the documentation for the `caret` package [6] turned up the entry for Self Organizing Maps (SOMs), supplied by the `kohonen` package.  The description at CRAN [7] was intriguing enough to merit further investigation.

Initially, SOMs were used to attempt the categorization of ransomware into the 28 different families.  This proved to be very resource intensive, requiring more time and RAM than was available.  Although the attempt helped to illuminate how SOMs are configured, the resource requirements of the algorithm became a deterrent.  At this point the SOMs were instead applied in a binary way, classifying all ransomware addresses as merely *black*, at first simply to get the algorithm to run to completion without error.  This seemed to reduce RAM usage enough to make the computation feasible on the available hardware.

Self Organizing Maps were not covered in the coursework at any point, so a familiar method was sought for comparison.  Random Forest was chosen and applied to the data set in a binary way, classifying every address as either *white* or *black* and ignoring the ransomware families.  Surprisingly, the Random Forest approach not only resulted in an acceptable model, it did so much more quickly than expected, taking only a few minutes to produce results.

At this point, it was very tempting to stop and write up a comparison of the two approaches to the binary problem, in which all ransomware-related addresses are classified as *black*.  However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families.  The binary Random Forest approach has high accuracy and precision: it seems to produce few false positives (if any), meaning it almost never predicts a truly *white* address as *black*.  There are a few false negatives, depending on how the randomization is done during the sampling process.  Hence, by applying the Random Forest method first, we identify a very large set of *white* addresses, which are then removed from the set.  This removes most of the sparseness of ransomware in the remaining data and, with it, most of the opportunity for false positives.  By comparison, the best model in the original paper by Akcora, et al. produced more false positives than true positives, and this low precision is what made it impractical for real-world usage.[3]
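
To make the precision argument concrete: precision is the number of true positives divided by the number of all positive predictions, so whenever false positives outnumber true positives, precision falls below 0.5.  The sketch below uses made-up counts chosen only to illustrate the arithmetic; they are not results from this report or from [3].

```r
# Hypothetical counts, for illustration only.
true_pos  <- 40   # black addresses predicted as black
false_pos <- 60   # white addresses predicted as black
false_neg <- 10   # black addresses predicted as white

# Precision: share of "black" predictions that are actually black.
precision <- true_pos / (true_pos + false_pos)

# Recall: share of actual black addresses that were found.
recall <- true_pos / (true_pos + false_neg)

precision  # 0.4
recall     # 0.8
```

With these numbers, more than half of the addresses flagged as ransomware would be false alarms, even though most of the real ransomware addresses are found.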

This all inspired a two-part method: first separate the addresses into *black* and *white* groups, and then further classify the *black* addresses into ransomware families.  We shall explore each of these steps separately.

### Method Part 0:  Binary SOMs

The first working model that ran to completion without exhausting computer resources did not make use of the ransomware family labels and instead used the two categories *black* and *white*.  The `kohonen` package provides algorithms for both supervised and unsupervised model building.  A supervised approach was used since the data set includes labels identifying which addresses belong to ransomware families, and these can be used to train the model.
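
The SOM grid for this model is sized with a rule of thumb of roughly $5\sqrt{n}$ nodes for $n$ training rows, arranged as a square.  A minimal sketch of that arithmetic follows; the row count `n_train` is a hypothetical value for illustration, not the size of the actual training set.

```r
# Hypothetical number of training rows, for illustration only.
n_train <- 1e6

# Rule of thumb: about 5 * sqrt(n) nodes in total.
n_nodes <- 5 * sqrt(n_train)        # 5000

# Side length of a square grid holding roughly that many nodes.
grid_size <- round(sqrt(n_nodes))   # 71

grid_size^2                         # 5041 nodes in the final grid
```

The node count grows with the square root of the number of rows, and the side length with its fourth root, so the grid stays manageable even for large training sets.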

```{r binary_SOMs, echo=FALSE, include=FALSE}
 								##############################################################################
 								## This is a first attempt using SOMs to model the data set as "black" and
 								## "white" addresses only.
 								##
 								## NOTE:  This is the most computationally heavy part of the paper and takes
 								## several hours to run to completion.  It is also completely optional, only
 								## used to compare with the better method. If, for some reason, you want to
 								## compile the report without this section, you can just comment it all out
 								## or remove it because nothing is needed from Method Part 0 for any of the
 								## other methods.  In other words, it can be safely skipped if you are short on
 								## time or RAM.
 								##############################################################################
 								# Keep only numeric columns, ignoring dates and looped.
 								som1_train_num <- train_set %>% select(length, weight, count, neighbors, income)
 								# SOM function can only work on matrices
 								som1_train_mat <- as.matrix(scale(som1_train_num))
 								# Switching to supervised SOMs
 								som1_test_num <- test_set %>% select(length, weight, count, neighbors, income)
 								# Note that when we rescale our testing data we need to scale it
 								# according to how we scaled our training data.
 								som1_test_mat <-
 								  as.matrix(scale(som1_test_num, center = attr(som1_train_mat, "scaled:center"),
 								                  scale = attr(som1_train_mat, "scaled:scale")))
 								# Binary outputs, black=ransomware, white=non-ransomware, train set
 								som1_train_bw <- train_set$bw %>% classvec2classmat()
 								# Same for test set
 								som1_test_bw <- test_set$bw %>% classvec2classmat()
 								# Create Data list for supervised SOM
 								som1_train_list <-
 								  list(independent = som1_train_mat, dependent = som1_train_bw)
 								############################################################################
 								## Calculate ideal grid size according to:
 								## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
 								############################################################################
 								# Formulaic method 1
 								grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
 								# Based on categorical number, method 2
 								#grid_size = ceiling(sqrt(length(unique(ransomware$bw))))
 								grid_size
 								# Create SOM grid
 								som1_train_grid <-
 								  somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
 								## Now build the model.
 								som_model1 <- xyf(som1_train_mat, som1_train_bw,
 								                 grid = som1_train_grid,
 								                 rlen = 100,
 								                 mode="pbatch",
 								                 cores = n_cores,
 								                 keep.data = TRUE
 								)
 								# Now test predictions
 								som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)
 								ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)
 								# Confusion matrix
 								som1_cm_bw <-
 								  confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)
 								# Now test predictions of validation set
 								# Switching to supervised SOMs
 								valid_num <- validation %>% select(length, weight, count, neighbors, income)
 								# Note that when we rescale our testing data we need to scale it
 								# according to how we scaled our training data.
 								valid_mat <-
 								  as.matrix(scale(valid_num, center = attr(som1_train_mat,  "scaled:center"),
 								                  scale = attr(som1_train_mat, "scaled:scale")))
 								valid_bw <- validation$bw
 								valid_list <- list(independent = valid_mat, dependent = valid_bw)
 								# Requires up to 16GB of RAM, skip if resources are limited
 								ransomware.prediction1.validation <- predict(som_model1, newdata = valid_list)
 								# Confusion matrix
 								cm_bw.validation <-
 								  confusionMatrix(ransomware.prediction1.validation$prediction[[2]],
 								                  validation$bw)
```

After training the model, we obtain the confusion matrices for the test set and the validation set separately.

```{r binary_SOM_results, echo=FALSE, results='asis' }
 								cm1_test_set <- som1_cm_bw %>% as.matrix() %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
 								cm1_validation_set <- cm_bw.validation %>% as.matrix() %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
 								cat(c("\\begin{table}[!htb]
 								    \\begin{minipage}{.5\\linewidth}
 								      \\caption{test set}
 								      \\centering",
 								        cm1_test_set,
 								    "\\end{minipage}%
 								    \\begin{minipage}{.5\\linewidth}
 								      \\centering
 								        \\caption{validation set}",
 								        cm1_validation_set,
 								    "\\end{minipage}
 								\\end{table}"
 								))
```

This is a very computationally intensive and somewhat inaccurate method compared to what follows.  It was left out of the final version of the script and is included here only for model comparison and to track the development of the approach.

### Method Part 1:  Binary Random Forest
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								A Random Forest model is trained using ten-fold cross validation and a tuning grid over the number of variables randomly sampled as candidates at each split (`mtry`).  Each of the candidate values $\{2, 4, 6, 8, 10, 12\}$ is evaluated, and the best-performing value is kept.
-												nap time.  re-read from beginning and start into chunk3/4 text cleanup AFTER nap.

											
										
										
											2021-10-20 19:59:26 +02:00
+								```{r random_forest_prep, echo=FALSE, include=FALSE, warning=FALSE}
+								##############################################################################
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
+								## This is a better attempt using Random Forest to model the data set as
 								## "black" and "white" addresses only.
+								##############################################################################
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								# Cross Validation, ten fold
+								control <- trainControl(method="cv", number = 10)
+								# Control grid with variation on mtry
 								grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))
+								# Run Cross Validation using control and grid set above
 								rf_model <- train(train_num, train_bw, method="rf",
 								                  trControl = control, tuneGrid=grid)
+								# Supervised fit of model using cross validated optimization
+								fit_rf <- randomForest(train_samp, train_bw,
 								                       mtry = rf_model$bestTune$mtry)
 								# Measure accuracy of model against test sample
 								y_hat_rf <- predict(fit_rf, test_samp)
 								cm_test <- confusionMatrix(y_hat_rf, test_bw)
+								# Measure accuracy of model against full ransomware set
 								ransomware_y_hat_rf <- predict(fit_rf, ransomware)
 								cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
+								```
+								The confusion matrix for the test set shows excellent results, specifically in the areas of accuracy and precision.
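As a reminder of how these two measures come out of a confusion matrix, here is a minimal sketch in base R. The counts are invented for illustration and are not results from this report; the layout (predictions in rows, reference labels in columns) follows the one used by `caret`, and *black* is assumed to be the positive class.

```r
# Toy 2x2 confusion matrix; the counts are invented for illustration only.
# Rows are predictions, columns are reference labels.
cm <- matrix(c(40,   5,
               10, 945),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("black", "white"),
                             Reference  = c("black", "white")))

# Treat "black" as the positive class (an assumption of this sketch).
tp <- cm["black", "black"]   # black addresses predicted as black
fp <- cm["black", "white"]   # white addresses predicted as black
fn <- cm["white", "black"]   # black addresses predicted as white

accuracy  <- sum(diag(cm)) / sum(cm)   # share of all predictions that are correct
precision <- tp / (tp + fp)            # share of "black" predictions that are correct
recall    <- tp / (tp + fn)            # share of true black addresses that are found

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Because *white* addresses vastly outnumber *black* ones in data of this kind, accuracy alone can look high even when many *black* predictions are wrong, which is why precision is tracked separately.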
 								```{r random-forest-output_test, echo=FALSE}
+								# Confusion matrix for test set
+								cm2_test_set <- cm_test %>% as.matrix() %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
+								# overall results
 								cm2_overall <- cm_test$overall %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
+								# by class.
 								cm2_byClass <- cm_test$byClass %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
+								# Confusion matrix for full ransomware set,
 								cm3_full_set <- cm_ransomware %>% as.matrix() %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
+								# overall results
 								cm3_overall <- cm_ransomware$overall %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
+								#  by class.
 								cm3_byClass <- cm_ransomware$byClass %>%
 								  knitr::kable(format = "latex", booktabs = TRUE)
 								```
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								The confusion matrices produced by the Random Forest model for the test set and for the full set are shown side by side.
+								```{r random-forest-confusion_matrices, echo=FALSE, results='asis'}
+								# Print both tables side by side
 								cat(c("\\begin{table}[!htb]
 								    \\begin{minipage}{.5\\linewidth}
 								      \\caption{confusion matrix for test set}
 								      \\centering",
 								        cm2_test_set,
 								    "\\end{minipage}%
 								    \\begin{minipage}{.5\\linewidth}
 								      \\centering
 								        \\caption{confusion matrix for full set}",
 								        cm3_full_set,
 								    "\\end{minipage}
 								\\end{table}"
 								))
 								```
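The same two-minipage layout is repeated for each pair of tables in this section. A minimal sketch of how it could be factored into one function is given here; `side_by_side()` is a hypothetical helper that is not part of the original script, and the two `tabular` strings are stand-ins for the output of `knitr::kable(format = "latex")`.

```r
# Hypothetical helper (not in the original script): wraps two already-rendered
# LaTeX tables in side-by-side minipages and returns a single string.
side_by_side <- function(left, right, left_caption, right_caption) {
  paste(c(
    "\\begin{table}[!htb]",
    "  \\begin{minipage}{.5\\linewidth}",
    paste0("    \\caption{", left_caption, "}"),
    "    \\centering",
    left,
    "  \\end{minipage}%",
    "  \\begin{minipage}{.5\\linewidth}",
    paste0("    \\caption{", right_caption, "}"),
    "    \\centering",
    right,
    "  \\end{minipage}",
    "\\end{table}"
  ), collapse = "\n")
}

# Stand-in table bodies; in the report these would be kable() results.
left_tab  <- "\\begin{tabular}{lr} a & 1 \\\\ \\end{tabular}"
right_tab <- "\\begin{tabular}{lr} b & 2 \\\\ \\end{tabular}"

out <- side_by_side(left_tab, right_tab, "test set", "full set")
cat(out)
```

As with the chunks in this section, the result would need to be emitted with `cat()` from a chunk that sets `results='asis'` so that the LaTeX is passed through unescaped.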
+								The confusion matrix for the full ransomware set is very similar to that of the test set.
+								The overall statistics for the test set and the full set are similarly good.
 								```{r random-forest-overall_results, echo=FALSE, results='asis'}
 								# Print both tables on one line
 								cat(c("\\begin{table}[!htb]
 								    \\begin{minipage}{.5\\linewidth}
 								      \\caption{test set overall results}
 								      \\centering",
 								        cm2_overall,
 								    "\\end{minipage}%
 								    \\begin{minipage}{.5\\linewidth}
 								      \\centering
 								        \\caption{full set overall results}",
 								        cm3_overall,
 								    "\\end{minipage}
 								\\end{table}"
 								))
+								```
+								The results by class for the test set and the full set are shown below.
 								```{r random-forest-results_by_class, echo=FALSE, results='asis'}
 								# Print both tables on one line
 								cat(c("\\begin{table}[!htb]
 								    \\begin{minipage}{.5\\linewidth}
 								      \\caption{test set results by class}
 								      \\centering",
 								        cm2_byClass,
 								    "\\end{minipage}%
 								    \\begin{minipage}{.5\\linewidth}
 								      \\centering
 								        \\caption{full set results by class}",
 								        cm3_byClass,
 								    "\\end{minipage}
 								\\end{table}"
 								))
 								```
+								The Random Forest model is a much quicker way of removing most of the *white* addresses than the previous method, and it is used in the final composite model to save time.
+								### Method Part 2:  Categorical SOMs
-												Added introductory section (chunk #1).

											
										
										
											2021-10-08 08:07:38 +02:00
+								A second model is now trained after discarding every address predicted to be *white*.  The predictions from the Random Forest model are used to isolate the predicted *black* addresses, and this reduced set is then classified with a supervised SOM that uses the 28 ransomware families as the target classes.
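Two of the preparation steps for this stage are easy to get wrong, so a minimal sketch on toy data follows: scaling the test set with the center and scale of the training set, and choosing the side length of a square SOM grid. The toy matrices and variable names are assumptions made for illustration only; the two grid-size rules are the heuristics used in the preparation code for this part.

```r
set.seed(1)  # toy data only; not the ransomware data set

# Two small numeric feature matrices standing in for the training and test sets.
train_toy <- matrix(rnorm(200, mean = 10, sd = 3), ncol = 2,
                    dimnames = list(NULL, c("income", "weight")))
test_toy  <- matrix(rnorm(100, mean = 10, sd = 3), ncol = 2,
                    dimnames = list(NULL, c("income", "weight")))

# Center and scale the training data; scale() stores the parameters it used
# in the "scaled:center" and "scaled:scale" attributes.
train_scaled <- scale(train_toy)

# Reuse the training parameters so both sets share one coordinate system.
test_scaled <- scale(test_toy,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

# Heuristic 1: grid side length from the number of training rows.
grid_side_rows <- round(sqrt(5 * sqrt(nrow(train_toy))))

# Heuristic 2: grid side length from the number of target classes (28 families).
grid_side_classes <- ceiling(sqrt(28))

c(rows = grid_side_rows, classes = grid_side_classes)
```

Scaling the test set with its own mean and standard deviation would place the two sets on slightly different axes, so distances to the trained SOM codebook vectors would no longer be comparable.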
+								```{r soms-prep, echo=FALSE, include=FALSE}
+								##############################################################################
+								## Now we use the Random Forest model to classify the data set into "black"
 								## and "white" categories with better precision.
+								##############################################################################
+								# Now use this prediction to reduce the original set to only "black" addresses
 								# First append the full set of predictions to the original set.
 								ransomware$prediction <- ransomware_y_hat_rf
+								# Filter out all the predicted "white" addresses,
 								# leaving only predicted "black" addresses
 								black_addresses <- ransomware %>% filter(prediction=="black")
+								# Split the reduced black-predictions into a training set and a test set @ 50%
 								test_index <- createDataPartition(y = black_addresses$prediction,
 								                                  times = 1, p = .5, list = FALSE)
+								train_set <- black_addresses[-test_index,]
 								test_set <- black_addresses[test_index,]
+								# Keep only numeric columns, ignoring temporal variables.
 								train_num <- train_set %>%
-												quick reboot, things are acting weird, but the graphs should have labels that do not overlap now, at least....

											
										
										
											2021-11-01 16:45:40 +01:00
+								  select(income, neighbors, weight, length, count, looped)
+								# SOM function can only work on matrices.
+								train_mat <- as.matrix(scale(train_num))
+								# Select non-temporal numerical features only
 								test_num <- test_set %>%
+								  select(income, neighbors, weight, length, count, looped)
+								# Testing data is scaled according to how we scaled our training data.
 								test_mat <- as.matrix(scale(test_num,
 								                            center = attr(train_mat, "scaled:center"),
 								                            scale = attr(train_mat, "scaled:scale")))
+								# Categorical labels for training set
+								train_label <- train_set$label %>% classvec2classmat()
 								# Same for test set
 								test_label <- test_set$label %>% classvec2classmat()
+								# Create data list for supervised SOM
+								train_list <- list(independent = train_mat, dependent = train_label)
+								############################################################################
 								## Calculate ideal grid size according to:
 								## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
 								############################################################################
+								# Formulaic method 1, produces a larger grid in this case
+								grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
 								# Method 2, based on the number of categories: smaller graph with fewer cells
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								#grid_size = ceiling(sqrt(length(unique(ransomware$label))))
 								# Create SOM grid
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								train_grid <- somgrid(xdim=grid_size, ydim=grid_size,
 								                      topo="hexagonal", toroidal = TRUE)
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								## Now build the SOM model using the supervised method xyf()
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								som_model2 <- xyf(train_mat, train_label,
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								                  grid = train_grid,
 								                  rlen = 100,
 								                  mode="pbatch",
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								                  cores = n_cores,
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								                  keep.data = TRUE
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								)
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								# Now test predictions of test set, create data list for test set
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								test_list <- list(independent = test_mat, dependent = test_label)
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								# Generate predictions
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								ransomware_group.prediction <- predict(som_model2, newdata = test_list)
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								# Confusion matrix
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
 								                             test_set$label)
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
 								```
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								When selecting the grid size for a Self Organizing Map, there are at least two different schools of thought. The two that were tried here are explained (with supporting documentation) on a ResearchGate forum.[8]  The first method is based on the size of the training set, and in this case results in a larger, more accurate map.  The second method is based on the number of known categories into which the data are classified, and in this case results in a smaller, less accurate map.  For this script, a grid size of `r grid_size` has been selected.
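As a rough illustration of how the two sizing rules differ, the sketch below applies both formulas from the chunk above to made-up inputs. The values `n_train` and `n_labels` are assumptions chosen for the example, not the sizes of the actual data set.

```r
# Hypothetical inputs for illustration only; not the report's actual data
n_train  <- 20000   # assumed number of rows in the training set
n_labels <- 29      # assumed number of distinct labels

# Method 1: side length derived from the training set size,
# i.e. roughly 5 * sqrt(n) nodes arranged on a square grid
grid_size_1 <- round(sqrt(5 * sqrt(n_train)))

# Method 2: smallest square grid with at least one node per label
grid_size_2 <- ceiling(sqrt(n_labels))

c(method_1 = grid_size_1, method_2 = grid_size_2)
```

With these assumed inputs the first rule gives a much larger grid than the second, which matches the difference in map size described above.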
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								A summary of the results for the categorization of black addresses into ransomware families follows.  For the full table of predictions and statistics, see the Appendix.
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								Here are the overall results of the final categorization.
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
-												my wife has a limit to how much she can care, but she wants me to care in an unlimited fashion.  she needs a shoulder to cry on, but cannot provide a shoulder for others to cry on.  That is some sort of mental disorder, for real.

											
										
										
											2021-10-22 06:58:54 +02:00
+								```{r cm_overall, echo=FALSE}
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								# Overall section of the confusion matrix formatted through kable()
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								cm_labels$overall %>% knitr::kable(caption="overall categorization results")
-												my wife has a limit to how much she can care, but she wants me to care in an unlimited fashion.  she needs a shoulder to cry on, but cannot provide a shoulder for others to cry on.  That is some sort of mental disorder, for real.

											
										
										
											2021-10-22 06:58:54 +02:00
 								```
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								Here are the final results by class.
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
-												my wife has a limit to how much she can care, but she wants me to care in an unlimited fashion.  she needs a shoulder to cry on, but cannot provide a shoulder for others to cry on.  That is some sort of mental disorder, for real.

											
										
										
											2021-10-22 06:58:54 +02:00
+								```{r soms-output-by-class, echo=FALSE, size="tiny"}
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								# By Class section of the confusion matrix formatted through kable()
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								cm_labels$byClass %>% knitr::kable(caption="categorization results by class")
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								```
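For readers unfamiliar with how the overall and by-class figures are derived, here is a minimal base R sketch using a toy confusion table. The labels and predictions are invented for illustration; the report itself obtains these statistics from `confusionMatrix()`.

```r
# Toy predicted and reference labels (made up for illustration)
lv        <- c("famA", "famB", "famC")
predicted <- factor(c("famA", "famA", "famB", "famB", "famB", "famA"), levels = lv)
reference <- factor(c("famA", "famB", "famB", "famB", "famC", "famA"), levels = lv)

# Rows are predictions, columns are reference labels
tab <- table(Prediction = predicted, Reference = reference)

# Overall accuracy: correctly classified cases / all cases
accuracy <- sum(diag(tab)) / sum(tab)

# Per-class precision: true positives / everything predicted as that class
# (NaN here for a class that is never predicted)
precision <- diag(tab) / rowSums(tab)

# Smallest nonzero precision; which() silently drops NaN entries
min_nonzero_precision <- min(precision[which(precision > 0)])
```

A class that the model never predicts has no defined precision, which is why a summary such as the minimum nonzero precision has to filter those entries out first.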
-												nap time.  re-read from beginning and start into chunk3/4 text cleanup AFTER nap.

											
										
										
											2021-10-20 19:59:26 +02:00
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								\newpage
 								### Clustering Visualizations
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								 Heatmaps and K-means clustering
 								Toroidal neural node maps are used to generate the models, and can be visualized in a number of ways.
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								```{r binary som graphs, echo=FALSE, fig.show="hold", out.width='35%'}
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Be careful with these, some are really large and take a long time to produce.
 								# Visualize neural network mapping
 								plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Distance map
 								plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize counts
 								plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize fan diagram
 								plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize heatmap for variable 1
 								plot(som_model2, type = 'property', property = som_model2$codes[[1]][,1],
 								     main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize heatmap for variable 2
 								plot(som_model2, type = 'property', property = som_model2$codes[[1]][,2],
 								     main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize heatmap for variable 3
 								plot(som_model2, type = 'property', property = som_model2$codes[[1]][,3],
 								     main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize heatmap for variable 4
 								plot(som_model2, type = 'property', property = som_model2$codes[[1]][,4],
 								     main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize heatmap for variable 5
 								plot(som_model2, type = 'property', property = som_model2$codes[[1]][,5],
 								     main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Visualize heatmap for variable 6
 								plot(som_model2, type = 'property', property = som_model2$codes[[1]][,6],
 								     main=colnames(train_num)[6], pch = 19, palette.name = topo.colors)
-												Visualizations are finally in place and looking good enough to show others.  Now all that is needed is to fill in the text and clean up the flow.  One or two more sessions, and then I can sent out reading drafts.

											
										
										
											2021-11-02 18:34:50 +01:00
+								#cat(" \n")
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								```
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												Graphs finally all look good.  Now focus on the words!  Words words words, write up the words and then put out a "first reading draft".

											
										
										
											2021-11-04 12:12:12 +01:00
+								K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
-												nap time.  re-read from beginning and start into chunk3/4 text cleanup AFTER nap.

											
										
										
											2021-10-20 19:59:26 +02:00
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								```{r clustering-setup, echo=FALSE, include=FALSE}
 								#############################################################################
 								## K-Means Clustering to visualize the categorization of the SOM
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
+								## For a good tutorial, see:
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								## https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
 								#############################################################################
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								# Set number of clusters to be equal to number of known ransomware groups
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								n_groups <- length(unique(ransomware$label)) - 1
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								# Generate k-means clustering
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
-												The script is complete.  The report has all the code from the script, and it compiles.  I still need to add the textual parts for chunks 3 and 4.  Final step is still the visuals, although I am thinking less is more on that one.

											
										
										
											2021-10-20 08:33:26 +02:00
+								```
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								K-means clustering categorizes the SOM grid by adding boundaries to the classification groups.  This is the author's favorite graph in the entire report.
-												nap time.  re-read from beginning and start into chunk3/4 text cleanup AFTER nap.

											
										
										
											2021-10-20 19:59:26 +02:00
-												my wife has a limit to how much she can care, but she wants me to care in an unlimited fashion.  she needs a shoulder to cry on, but cannot provide a shoulder for others to cry on.  That is some sort of mental disorder, for real.

											
										
										
											2021-10-22 06:58:54 +02:00
+								```{r clustering-plot, echo=FALSE, fig.align="center"}
-												Things are finally starting to look balanced, visually speaking. Next is to create a few group_by graphs and put them in the right place.  Get scripts working before putting into full document.

											
										
										
											2021-10-29 05:02:34 +02:00
 								# Plot K-means clustering results
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
+								plot(som_model2,
 								     main = 'K-Means Clustering',
 								     type = "property",
 								     property = som.cluster$cluster,
 								     palette.name = topo.colors)
 								add.cluster.boundaries(som_model2, som.cluster$cluster)
 								```
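Because k-means starts from randomly chosen centers, the cluster boundaries can shift from one run to the next. The sketch below is one possible way to make the clustering step repeatable, shown on a toy matrix that stands in for the SOM codebook. The names `toy_codes` and `toy_cluster`, the seed value, and the use of `nstart` are assumptions for illustration and are not part of the original script.

```r
set.seed(1)   # k-means starts from random centers; fix the seed for repeatable output

# Toy stand-in for a SOM codebook: 36 nodes (e.g. a 6 x 6 grid), 3 variables,
# drawn from two well-separated groups
toy_codes <- rbind(
  matrix(rnorm(18 * 3, mean = 0), ncol = 3),
  matrix(rnorm(18 * 3, mean = 5), ncol = 3)
)

# Cluster the nodes; nstart reruns the algorithm from several random
# starting points and keeps the best solution
toy_cluster <- kmeans(toy_codes, centers = 2, nstart = 10)

# One cluster id per node: the kind of vector passed as `property` when plotting
table(toy_cluster$cluster)
```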
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
-												nap time.  re-read from beginning and start into chunk3/4 text cleanup AFTER nap.

											
										
										
											2021-10-20 19:59:26 +02:00
+								---
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								##  Results & Performance
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
 								### Results
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								   The first attempt to isolate ransomware using SOMs resulted in a model with an accuracy of `r toString(cm_bw.validation$overall["Accuracy"])` and precision `r toString(cm_bw.validation$byClass[3])`.
 								   The second attempt to isolate ransomware using Random Forest resulted in a model with an accuracy of `r toString(cm_ransomware$overall["Accuracy"])` and precision `r toString(cm_ransomware$byClass[3])`.
 								   Classifying the ransomware predicted by the second attempt into 28 ransomware families resulted in a model with an overall accuracy of `r toString(cm_labels$overall["Accuracy"])` and minimum nonzero precision of `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`.
-												cut out binary soms from paper (saves hours of compile time, might stick with it.).  Also made up first draft of Final Method script.  Need to use first half to inform the set for the second half.  Might work on that tonight....

											
										
										
											2021-10-19 04:39:23 +02:00
-												Started new Rmd document, got outline done at least.

											
										
										
											2021-10-04 06:55:21 +02:00
+								### Performance
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								  The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM.  Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, even with moderate computing resources.  Just for kicks, the final script was also run on more humble hardware with the following specifications:
-												quick reboot, things are acting weird, but the graphs should have labels that do not overlap now, at least....

											
										
										
											2021-11-01 16:45:40 +01:00
 								#### ASUS Eee PC 1025C
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
-												quick reboot, things are acting weird, but the graphs should have labels that do not overlap now, at least....

											
										
										
											2021-11-01 16:45:40 +01:00
+								   - CPU:  Intel Atom N2600 @ 1.600GHz (64-bit Intel Atom quad-core x86)
-												It is finally starting to look good.  Keep going.  Chunk #2 needs you.  You are its only hope.

											
										
										
											2021-10-23 16:36:04 +02:00
+								   - RAM:  3911MB DDR3 @ 800 MT/s  (4 GB)
 								   This is a computer known for being slow and clunky.  Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds.  At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, but it does show that the script can be run on very modest hardware to completion.
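The comparison with the block interval can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above: the ten-minute average block interval and the 1665-second run time.

```r
block_interval_s <- 10 * 60   # average Bitcoin block interval, in seconds
atom_runtime_s   <- 1665      # run time reported above for the Eee PC

runtime_min    <- atom_runtime_s / 60                 # run time in minutes
blocks_per_run <- atom_runtime_s / block_interval_s   # blocks announced, on average, during one run
keeps_pace     <- atom_runtime_s <= block_interval_s  # can one run finish before the next block?

c(runtime_min = runtime_min, blocks_per_run = blocks_per_run)
keeps_pace
```

On average, almost three new blocks would be announced during a single run on this machine, so it cannot keep pace with the chain.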

#### Pine64 Quartz64 Model A

- CPU:  Rockchip RK3566 SoC aarch64 (64-bit quad-core ARM)
- RAM:  DDR4 8080 MB (8 GB)

This is a single-board computer (development board), run here to benchmark a modern 64-bit ARM processor.  The script runs in about 860 seconds (just over 14 minutes) on this platform, roughly half the run time of the Atom processor above.


---


## Summary


### Comparison to results from original paper

In the original paper, Akcora et al. tested several different sets of parameters on their TDA model.  According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."[3]  In fact, the highest precision (a.k.a. positive predictive value, defined as TP/(TP+FP)) they achieved was only 0.1610.  By comparison, although several of our predicted classes had zero or NA precision values, the lowest non-zero precision value is `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`, with many well above that, approaching one in a few cases.

One might say that this compares apples to oranges, because their method was a single model, while these results come from a two-model stack.  Still, given the run time of the final script, I think the two-model approach is superior in this case, especially when measured in terms of precision and avoiding false positives.

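To make the precision comparison concrete, here is a minimal base R sketch of how per-class precision and its lowest non-zero value can be computed from a confusion table.  The labels are toy data invented for illustration, not the report's results, and the variable names are hypothetical.  The last lines convert the quoted ratio of 16.59 false positives per true positive into the precision it implies under the TP/(TP+FP) definition.

```r
# Toy labels for illustration only; these are NOT the report's data.
# "d" is a class that is never predicted, so its precision is undefined.
classes   <- c("a", "b", "c", "d")
actual    <- factor(c("a", "a", "b", "b", "c", "c", "c"), levels = classes)
predicted <- factor(c("a", "b", "b", "b", "c", "c", "a"), levels = classes)

# Confusion table with predictions in rows and true classes in columns
tab <- table(Predicted = predicted, Actual = actual)

# Per-class precision: TP / (TP + FP).  A class that is never predicted
# gives 0 / 0 here, which R reports as NaN.
precision <- diag(tab) / rowSums(tab)

# Lowest precision among classes where it is defined and non-zero
min_nonzero_precision <- min(precision[!is.na(precision) & precision > 0])

# Precision implied by 16.59 false positives for each true positive
precision_from_ratio <- 1 / (1 + 16.59)

print(round(precision, 3))
print(min_nonzero_precision)
print(round(precision_from_ratio, 4))
```

Filtering out `NaN` and zero values before taking the minimum mirrors the "lowest non-zero precision" wording above; without that filter, `min()` would return `NaN` whenever some class is never predicted.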

### Limitations

SOMs seem easy to misconfigure.  Perhaps a dual Random Forest approach would be better.  This has not been attempted yet, as the two-method approach presented here was satisfactory enough to present in a report.


### Future Work

I only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation.  Also, a dual Random Forest approach, used first to isolate the ransomware addresses and then to classify them into families, could be tested.

The script itself has a few areas that could be further optimized.  The sampling method does what it needs to do, but the ratios taken for each set could possibly be improved.


### Conclusion

This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while avoiding false positives by filtering them out with a binary method before classifying the remaining addresses further.  It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy coins.  Privacy coins such as Monero obfuscate transactions so that they cannot be analyzed in the same way that the Bitcoin network has been analyzed here.  Some progress has been made towards analyzing such networks[9], but the developers of such networks continually evolve the code to complicate transaction tracking.  This could be another good area for future research.


## References

[1] Adam Brian Turner, Stephen McCombie and Allon J. Uhlmann (November 30, 2020) [Analysis Techniques for Illicit Bitcoin Transactions](https://doi.org/10.3389/fcomp.2020.600596)

[2] Daniel Goldsmith, Kim Grauer and Yonah Shmalo (April 16, 2020) [Analyzing hack subnetworks in the bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)

[3] Cuneyt Gurcan Akcora, Yitao Li, Yulia R. Gel, Murat Kantarcioglu (June 19, 2019) [BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain](https://arxiv.org/abs/1906.07852)

[4] [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)

[5] [BitcoinHeist Ransomware Address Dataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)

[6] [Available Models - The `caret` package](http://topepo.github.io/caret/available-models.html)

[7] Ron Wehrens and Johannes Kruisselbrink (2019) [Package `kohonen` @ CRAN](https://cran.r-project.org/web/packages/kohonen/kohonen.pdf)

[8] (Oct 22, 2021) [How many nodes for self-organizing maps?](https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps)

[9] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava, Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)

\newpage

## Appendix:


### Categorical SOM prediction table and confusion matrix

Here are the full prediction results for the categorization of *black* addresses into ransomware families.  It is assumed that all *white* addresses have already been removed.

```{r soms-output-table, echo=FALSE}
# Final results of categorization of "black" addresses
# into ransomware families.
cm_labels
```