my wife has a limit to how much she can care, but she wants me to care in an unlimited fashion. she needs a shoulder to cry on, but cannot provide a shoulder for others to cry on. That is some sort of mental disorder, for real.

shelldweller 2021-10-21 22:58:54 -06:00
parent 6876625dfb
commit 571bf48790
2 changed files with 31 additions and 22 deletions


@@ -4,7 +4,7 @@ subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "10/31/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor specificity or other performance issues. A two-step method is developed to address the issue of false positives and improve on previous results."
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives and improve on previous results."
keywords:
- Bitcoin
- blockchain
@@ -45,7 +45,7 @@ knitr::knit_hooks$set(chunk = function(x, options) {
## Introduction
Ransomware attacks have gained the attention of security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) come to find that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address before a certain deadline to have their data decrypted, otherwise the data will be deleted.
Ransomware attacks have gained the attention of security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) come to find that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address before a certain deadline to have their data decrypted; otherwise, the data will be deleted permanently.
The legal and financial implications of ransomware attacks are not of concern for the purpose of this analysis. Many parties are interested in tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ A ransomware attack could be perpetrated on an illegal darknet market site, for example. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services.
@@ -134,29 +134,25 @@ ransomware %>% head() %>% knitr::kable()
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1-365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or one of 29 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$ .
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or one of 29 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$ .
The original research team downloaded and parsed the entire Bitcoin transaction graph from 2009 January to 2018 December. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transfered less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day while the entire network has up to 800,000 addresses daily.$^{[5]}$
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24-hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses were taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day while the entire network has up to 800,000 addresses daily.$^{[5]}$
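The filtering and capping described above can be illustrated with a small base R sketch. This is not the original research team's code: the data frame, its column names, the placeholder label `familyA`, and the choice to keep the first addresses of each day (rather than a random sample) are all assumptions made for illustration, and the daily cap is lowered from 1000 to 1 so the effect is visible on six rows.

```r
# Toy stand-in for a slice of the daily transaction graph. Every column
# name and value is hypothetical, not taken from the original pipeline.
edges <- data.frame(
  day     = c(1, 1, 1, 1, 2, 2),
  address = c("a1", "a2", "a3", "a4", "a5", "a6"),
  amount  = c(0.10, 0.50, 1.20, 0.35, 0.05, 2.00),  # in bitcoin
  label   = c("white", "white", "white", "familyA", "white", "white"),
  stringsAsFactors = FALSE
)

min_amount <- 0.3  # threshold stated in the text
daily_cap  <- 1    # the text uses 1000; 1 keeps the toy example readable

# Step 1: drop edges that transfer less than the threshold.
kept <- edges[edges$amount >= min_amount, ]

# Step 2: number the rows within each (day, label) group, then keep all
# non-white rows and at most `daily_cap` white rows per day.
rank_in_group <- ave(seq_len(nrow(kept)), kept$day, kept$label,
                     FUN = seq_along)
capped <- kept[kept$label != "white" | rank_in_group <= daily_cap, ]
```

Capping only the "white" class is what keeps the ransomware addresses from being diluted even further in the final sample.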
### Goal
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper to produce an acceptable predictive model for categorizing ransomware addresses correctly. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
### Outline of Steps Taken (refine this as steps are written up...)
### Outline of Steps Taken
1) Analyze data set numerically and visually. Notice any pattern, look for insights.
2) Binary classification using Random Forests.
3) Binary classification using Self Organizing Maps.
3) Categorical classification using Self Organizing Maps.
4) Categorical classification using Self Organizing Maps.
4) Visualize clustering to analyze results further.
5) Two step method using Random Forests and Self Organizing Maps.
6) Visualize clustering to analyze results further.
7) Generate Confusion Matrix to quantify results.
5) Generate Confusion Matrix to quantify results.
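As a minimal sketch of the last step, a binary confusion matrix and the metrics derived from it can be computed in base R as follows. The `truth` and `pred` vectors are made-up values for illustration only and are not results from this project.

```r
# Hypothetical true and predicted classes for ten addresses.
lv    <- c("black", "white")
truth <- factor(c("black", "black", "black", "white", "white",
                  "white", "white", "white", "white", "white"), levels = lv)
pred  <- factor(c("black", "black", "white", "black", "white",
                  "white", "white", "white", "white", "white"), levels = lv)

# Rows are predictions, columns are the actual classes.
cm <- table(Predicted = pred, Actual = truth)

# Treat "black" (ransomware) as the positive class.
tp <- cm["black", "black"]  # ransomware correctly flagged
fp <- cm["black", "white"]  # white addresses flagged by mistake
fn <- cm["white", "black"]  # ransomware addresses that were missed
tn <- cm["white", "white"]  # white addresses correctly passed

precision   <- tp / (tp + fp)  # share of flagged addresses that are ransomware
sensitivity <- tp / (tp + fn)  # share of ransomware addresses that were flagged
specificity <- tn / (tn + fp)  # share of white addresses that were passed
```

With so few ransomware addresses in the data, specificity can stay high while precision is poor, so it helps to look at both.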
---
@@ -208,6 +204,8 @@ no_nas <- sum(is.na(ransomware))
---
###############################################################################
## looks good up to here, continue down to Chunk #3
###############################################################################
---
### Exploration and Visualization ( Chunk #2, do this part last....)
@@ -346,16 +344,21 @@ mean(shrimp$bw == "black")
### Insights Gained from Exploration
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are impractical otherwise.
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the total addresses. That small percentage is also further classified into 29 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 30 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are impractical otherwise.
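The binary relabeling described above might look like the following sketch. The toy data frame and the family names `familyA` and `familyB` are placeholders, and the column name `bw` is assumed here to match the one used elsewhere in the report.

```r
# Toy labels; "familyA" and "familyB" are hypothetical placeholder names.
toy <- data.frame(
  label = c("white", "white", "familyA", "white", "familyB"),
  stringsAsFactors = FALSE
)

# Collapse every ransomware family into a single "black" class.
toy$bw <- factor(ifelse(toy$label == "white", "white", "black"),
                 levels = c("black", "white"))

# Share of addresses associated with ransomware in the toy data.
black_share <- mean(toy$bw == "black")
```

The original `label` column is kept alongside `bw`, so the family information is still available for a later categorical step.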
---
## Modeling approach (chunk #3, needs rewriting of text parts only)
## Modeling approach
Akcora et al. mention that they tried to model the data using a Random Forests method, but that the complexity of the data set lead to problems with that approach.[3] Switching to a binary perspective on the problem might alleviate some of that complexity, and is worth another look. The topological nature of the way the data set has been described numerically lead me to search for topological machine learning methods. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough for me to investigate further.
Akcora et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11] Considering all ransomware addresses as belonging to a single group might improve the predictive power of such methods, making Random Forest worth another try. The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough to merit further investigation.
[[Describe how you started with categorical SOMs, switched to binary SOMs, then applied randomForest to the binary problem, and was surprised with the results. Decided to re-apply categorical SOMS to black-only addresses, as predicted by the binary Random forest approach. The result is the following two-step approach, with the optional clustering visualizations at the end]]
Initially, the categorization of ransomware into the 29 different families was attempted using SOMs. This proved to be very resource intensive, requiring more time and RAM than was available at the time. Although it did help to illuminate how SOMs are configured, the runtime of the script became a deterrent.
Describe how you started with categorical SOMs, switched to binary SOMs, then applied randomForest to the binary problem, and was surprised with the results. Decided to re-apply categorical SOMS to black-only addresses, as predicted by the binary Random forest approach. The result is the following two-step approach, with the optional clustering visualizations at the end
### Method Part 1: Binary Random Forests to isolate ransomware addresses first.
@@ -506,9 +509,14 @@ message("A grid size of ", grid_size, " has been chosen.")
Here is a summary of the results for the categorization of black addresses into ransomware families. For the full table of predictions and statistics, see the Appendix.
```{r soms-output, echo=FALSE, size="tiny"}
```{r cm_overall, echo=FALSE}
cm_labels$overall %>% knitr::kable()
```
```{r soms-output-by-class, echo=FALSE, size="tiny"}
cm_labels$byClass %>% knitr::kable()
@@ -533,9 +541,9 @@ som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
```
Here is a nice graph. How do I center it?
Here is a nice graph.
```{r clustering-plot, echo=FALSE}
```{r clustering-plot, echo=FALSE, fig.align="center"}
# Plot clustering results
plot(som_model2,
main = 'K-Means Clustering',
@@ -589,13 +597,14 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[4] UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php](https://archive.ics.uci.edu/ml/index.php)
[5] BitcoinHeist Ransomware Address Dataset /n [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
[5] BitcoinHeist Ransomware Address Dataset
[https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
[6] Available Models - The `caret` package [http://topepo.github.io/caret/available-models.html](http://topepo.github.io/caret/available-models.html)
[7] Ron Wehrens and Johannes Kruisselbrink, Package `kohonen` @ CRAN (2019) [https://cran.r-project.org/web/packages/kohonen/kohonen.pdf](https://cran.r-project.org/web/packages/kohonen/kohonen.pdf)
[XMR] Malte Möser*, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
[XMR] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
\newpage

Binary file not shown.