Finished first draft of Introduction section, up to Visual Analysis and Exploration. I am satisfied with it up to that point. Chunk #2 begins tomorrow, produce the basic analysis graphs I have already outlined....

This commit is contained in:
shelldweller 2021-10-12 23:20:59 -06:00
parent 8d453a78f2
commit d89266b7a4
2 changed files with 54 additions and 38 deletions

@ -1,11 +1,21 @@
---
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain using Random Forests and Self Organizing Maps
subtitle: \vspace{.5in}HarvardX Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "10/31/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. There is much interest in detecting and tracking transactions made to ransomware operators. While many of the attempts towards achieving this have not relied on sophisticated machine learning methods, even those that do have resulted in models with poor specificity. A two-step method is developed to address the issue of false positives and improve on previous results."
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor specificity or other performance issues. A two-step method is developed to address the issue of false positives and improve on previous results."
keywords:
- Bitcoin
- blockchain
- ransomware
- machine learning
- Random Forests
- Self Organizing Maps
- SOMs
- cryptocurrency
output: pdf_document
geometry: margin=2cm
---
\def\bitcoinA{%
\leavevmode
@ -35,35 +45,34 @@ knitr::knit_hooks$set(chunk = function(x, options) {
## Introduction
Ransomware attacks have gained the attention of security professionals and are of specific interest to international law enforcement and financial regulatory officials.$^{[1]}$ The pseudo-anonymous nature of the Bitcoin blockchain makes it a convenient payment method for attackers who deploy ransomware to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) first come to find that all of their important organizational data has been encrypted with a secret key, and are then instructed to make a payment to a specific Bitcoin address in order to have their data decrypted by a certain deadline, otherwise the data will be deleted forever.
Ransomware attacks have gained the attention of security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) discover that much, if not all, of their important organizational data has been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address before a certain deadline to have their data decrypted; otherwise, the data will be deleted.
For the purposes of this paper, we will ignore the legal and financial implications of ransomware attacks. It will suffice to say that certain parties are interested in tracking and tracing illicit activity on and around the Bitcoin blockchain, and that ransomware transactions are a good example of such activity. For a more detailed overview of how and why such analysis is carried out, the reader is referred to Daniel Goldsmith's work at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ It can be said that there is significant interest in detecting illicit activity on the Bitcoin blockchain as soon as possible to minimize financial losses. For example, it could be the case that a ransomware attack is being perpetrated on an illegal darknet market site. The news of such an attack might not be published at all, let alone in popular media. However, by analyzing the transaction record with a blockchain explorer, such as [BTC.com](https://btc.com/), we might be able to flag suspicious activity in real time if we have a model that is sufficiently robust. It may, in fact, be the first public notice of such an event. At that point, the suspicious address can be blacklisted or banned from using other services.
The legal and financial implications of ransomware attacks are not of concern for the purpose of this analysis. Many parties are interested in tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ A ransomware attack could be perpetrated on an illegal darknet market site, for example. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services.
Ransomware attackers provide their victims with a payment address, allowing for a list of known ransomware payment addresses to be compiled and analyzed using various methods. One well-known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of the data set and the baseline by which we will compare our results. In that paper, Akcora et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 29 known ransomware address groups. Otherwise they are classified as "white", meaning there is no ransomware activity associated with that address. They then consider the blockchain as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well-known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 29 known ransomware address groups. Addresses with no known ransomware associations are classified as "white". The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Addresses on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data is divided into 24-hour long windows with the UTC-6 timezone as a reference. Doing so provides information on how quickly a coin moves through the network, with speed measured as the number of blocks the coin appears in during a 24-hour period, with the maximum being 144 blocks per 24 hours (at an average rate of one block every ten minutes). This speed can be an indicator of money laundering. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
Addresses on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference. Speed is defined as the number of blocks the coin appears in during a 24-hour period and provides information on how quickly a coin moves through the network. Speed can be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a 24-hour period, and thus have lower speeds when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
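The windowing and speed definition can be illustrated with a minimal sketch in base R. The timestamps below are invented for illustration, and the variable names (`block_times`, `window`, `speed`) are assumptions rather than anything taken from the original analysis. With roughly one block every ten minutes, a 24-hour window holds about 144 blocks, which bounds the speed.

```r
# Hypothetical UTC timestamps of the blocks in which one coin appears
block_times <- as.POSIXct(
  c("2017-03-01 05:50:00", "2017-03-01 06:10:00",
    "2017-03-01 18:30:00", "2017-03-02 05:59:00",
    "2017-03-02 06:00:00"),
  tz = "UTC"
)

# Shift to the UTC-6 reference by subtracting six hours,
# then take the calendar date as the 24-hour window
window <- as.Date(block_times - 6 * 3600, tz = "UTC")

# Speed: number of blocks the coin appears in per window
speed <- table(window)

# Approximate upper bound: one block every ten minutes
blocks_per_day <- 24 * 60 / 10
```

In this toy example the first timestamp falls into the previous day's window once it is shifted to UTC-6, so the middle window contains three blocks.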
With the graph formed this way, the following six numerical features$^{[2]}$ can be associated with a given address:
With the graph defined as such, the following six numerical features$^{[2]}$ are associated with a given address:
1) Income - the total amount of coins sent to an address
1) Income - the total amount of coins sent to an address (decimal value with 8 decimal places)
2) Neighbors - the number of transactions that have this address as one of its output addresses
2) Neighbors - the number of transactions that have this address as one of its output addresses (integer)
3) Weight - the sum of the fractions of coins that reach this address from addresses that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"
3) Weight - the sum of the fractions of coins that reach this address from addresses that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions" (decimal value)
4) Length - the number of non-starter transactions on its longest chain, where a chain is defined as an
acyclic directed path originating from any starter transaction and ending at the address in question
acyclic directed path originating from any starter transaction and ending at the address in question (integer)
5) Count - The number of starter addresses connected to this address through a chain
5) Count - The number of starter addresses connected to this address through a chain (integer)
6) Loop - The number of starter addresses connected to this address by more than one path
6) Loop - The number of starter addresses connected to this address by more than one path (integer)
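As a rough illustration of the two simplest features, the sketch below computes *income* and *neighbors* from a toy table of transaction outputs. The table layout and the names `outputs`, `tx`, and `amount` are assumptions made for this example. The remaining four features depend on chains of starter transactions and are not attempted here.

```r
# Toy output records: one row per (transaction, output address) pair
outputs <- data.frame(
  tx      = c("t1", "t1", "t2", "t3", "t3"),
  address = c("a1", "a2", "a1", "a1", "a3"),
  amount  = c(0.5, 1.25, 2.0, 0.75, 3.0),
  stringsAsFactors = FALSE
)

# Income: total amount of coins sent to each address
income <- tapply(outputs$amount, outputs$address, sum)

# Neighbors: number of distinct transactions that have the address as an output
neighbors <- tapply(outputs$tx, outputs$address,
                    function(x) length(unique(x)))
```

Here address `a1` receives coins from three transactions, so its income is the sum of those three amounts and its neighbor count is three.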
These variables are defined in a somewhat abstract way, viewing the blockchain as a topological graph with nodes and edges. The rationale is to be able to quantify specific transaction patterns. For a deeper discussion of how and why these variables were chosen, Akcora$^{[3]}$ gives a thorough explanation in the original paper. For the purposes of this report, we will just be treating the variables as abstract numerical features rather than trying to justify their definitions. Instead, we will run the same data set as used in the original paper by Akcora$^{[3]}$ through a few different machine learning methods to see how closely we can come to their results.
These variables are defined rather abstractly, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions. Several machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the results will be compared.
### Data
This data set was discovered while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the instructions for this project. The author of this report, having been interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining for them on a ASUS netbook in rural Peru in late 2010, found "cryptocurrencies" to be a natural search term. This brings up a single data set entitled [BitcoinHeist: Ransomeware Address Data Set](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
This data set was discovered while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
```{r data-prep, echo=FALSE, include=FALSE}
@ -88,13 +97,19 @@ if(!file.exists("data/BitcoinHeistData.csv"))unzip(dest_file, "BitcoinHeistData.
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
# Inspect data
# str(ransomware)
```
A summary of the data set tells the range of values and size of the sample.
```{r data-summary, echo=FALSE, size="tiny"}
# Summary
ransomware %>% summary() %>% knitr::kable()
```
We can inspect the first ten observations to get an idea of what features are present.
A listing of the first few rows provides a sample of the features associated with each observation.
```{r data-head, echo=FALSE, size="tiny"}
@ -103,15 +118,15 @@ ransomware %>% head() %>% knitr::kable()
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined above (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of year as 1-365), and a categorical factor called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or else one of the 29 known ransomware groups, as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua).$^{[3]}$
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1-365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or one of 29 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua).$^{[3]}$
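One way to work with the *label* feature is to collapse it into a two-level factor separating "white" addresses from everything else. The sketch below does this on a small invented data frame. The column name `bw`, the level name "black", and the example label values are assumptions for illustration only.

```r
# Toy stand-in for the label column (values are illustrative)
toy <- data.frame(
  address = c("a1", "a2", "a3", "a4", "a5"),
  label   = c("white", "princetonCerber", "white",
              "montrealCryptoLocker", "white"),
  stringsAsFactors = FALSE
)

# Collapse every ransomware group into a single "black" level
toy$bw <- factor(ifelse(toy$label == "white", "white", "black"),
                 levels = c("black", "white"))

# Class counts and proportions show how imbalanced the two classes are
bw_counts <- table(toy$bw)
bw_share  <- prop.table(bw_counts)
```

Checking the class proportions early matters because a heavily imbalanced label makes accuracy alone a misleading measure of model quality.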
The original research team downloaded and parsed the entire Bitcoin transaction graph from 2009 January to 2018 December. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transfer less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day while the entire network has up to 800,000 addresses daily.$^{[5]}$
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24-hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day while the entire network has up to 800,000 addresses daily.$^{[5]}$
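The two preprocessing rules described here, the amount threshold and the daily cap on "white" addresses, can be sketched on toy data as follows. This is only an illustration: the data frames, the helper `cap_per_day`, and the choice to keep the first rows of each day are assumptions, since the source does not say how the original team selected the capped addresses.

```r
# Toy edges with amounts in BTC (hypothetical values)
edges <- data.frame(
  from   = c("a1", "a2", "a3", "a4", "a5"),
  amount = c(0.10, 0.30, 0.50, 0.05, 2.00),
  stringsAsFactors = FALSE
)

# Drop edges that transfer less than the 0.3 BTC threshold
threshold  <- 0.3
kept_edges <- edges[edges$amount >= threshold, ]

# Toy "white" addresses with the year/day window they were seen in
white <- data.frame(
  address = c("w1", "w2", "w3", "w4"),
  year    = c(2017, 2017, 2017, 2018),
  day     = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Keep at most `cap` rows per (year, day) window
cap_per_day <- function(df, cap) {
  groups <- split(df, list(df$year, df$day), drop = TRUE)
  capped <- lapply(groups, function(g) head(g, cap))
  do.call(rbind, capped)
}

# The data set uses a cap of 1000; a cap of 2 keeps the toy example small
capped_white <- cap_per_day(white, cap = 2)
```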
### Goal
The goal of this paper is to apply different machine learning algorithms to the same data set as in the original paper to produce a predictive model without some of the drawbacks that were present there.
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper to produce an acceptable predictive model for categorizing ransomware addresses correctly. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
### Outline of Steps Taken
### Outline of Steps Taken (refine this as steps are written up...)
1) Analyze data set numerically and visually. Notice any pattern, look for insights.
@ -123,15 +138,23 @@ The original research team downloaded and parsed the entire Bitcoin transaction
5) Two step method using Random Forests and Self Organizing Maps.
6) Visualize clustering to analyze results.
6) Visualize clustering to analyze results further.
7) Generate Confusion Matrix to quantify results.
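Since the stated concern is poor specificity and false positives, the final step can be sketched with a confusion matrix built in base R. The predictions below are invented, and treating "black" (ransomware) as the positive class is an assumption of this example rather than a convention fixed by the report.

```r
# Hypothetical true and predicted classes for six addresses
lv        <- c("black", "white")
actual    <- factor(c("black", "white", "white", "black", "white", "white"),
                    levels = lv)
predicted <- factor(c("black", "white", "black", "white", "white", "white"),
                    levels = lv)

# Confusion matrix: rows are predictions, columns are true classes
cm <- table(Predicted = predicted, Actual = actual)

# Cell counts with "black" as the positive class
tp <- cm["black", "black"]
fn <- cm["white", "black"]
fp <- cm["black", "white"]
tn <- cm["white", "white"]

# Sensitivity: share of ransomware addresses that were caught
sensitivity <- tp / (tp + fn)

# Specificity: share of white addresses correctly left unflagged
specificity <- tn / (tn + fp)
```

A false positive here is a white address flagged as ransomware, so a low specificity means many legitimate addresses would be blacklisted.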
---
## Exploration & Visualization (chunk #2)
## Data Analysis (chunk #2)
### Notes on Graphs (remove later)
### Hardware
List computer specs here. Laptop, OS, and R versions.
### Data Preparation
What did I do to prepare the data? Factoring the labels. Adding the b/w label. Splitting into partitions (twice) to reduce set size. Etc..... (see code).
### Exploration and Visualization
I need better graphs. I have plenty, but I need them to look better and/or have more labels, etc.
@ -145,20 +168,13 @@ The original research team downloaded and parsed the entire Bitcoin transaction
4) List group counts in a table
Other fancy graph ideas? Look through sample work for possibilities
5) Check for missing values / NAs.
### Notes End Here
6) Break into groups somehow. Graph variables per group? Show how the variables are distributed for each ransomware group? Percent ransomware per each day of the week, for example. Is ransomware more prevalent on a particular day of the week? Break other numerical values into bins, and graph percentage per bin. Look for trends and correlations between groups/variables, and display them here.
## Data Analysis (chunk #2.5)
List computer specs here. Laptop, OS, and R versions.
### Preparation
What did I do to prepare the data?
### Exploration and Visualization
7) Principal Component Analysis can go here. See "Interlinkages of Malaysian Banking Systems" for an example of detailed PCA. Is it exploratory analysis, or is it a predictive method? I was under the assumption that it is a form of analysis, but the paper mentioned extends it to a form of predictive modeling. How to do this *right* (?!?!)
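As a starting point for the PCA idea, the sketch below runs `prcomp` from base R on randomly generated stand-ins for four of the numerical features. The simulated values say nothing about the real data. Used this way, PCA is purely exploratory: it shows how much variance each component explains and which features load on it.

```r
set.seed(1)

# Simulated stand-ins for four numerical features (not real data)
toy <- data.frame(
  income    = rexp(50),
  neighbors = rpois(50, 2),
  weight    = runif(50),
  length    = rpois(50, 10)
)

# Center and scale so that no single feature dominates the components
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings: contribution of each feature to each component
loadings <- pca$rotation
```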
```{r visuals, echo=FALSE, include=FALSE}
# Do some graphical exploration before applying any models.
@ -326,7 +342,7 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[4] UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php](https://archive.ics.uci.edu/ml/index.php)
[5] BitcoinHeistRansomwareAddressDataset Data Set [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
[5] BitcoinHeist Ransomware Address Dataset [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)

Binary file not shown.