finalized table font sizes.

shelldweller 2021-11-04 23:11:03 -06:00
parent 16a9f84614
commit f6d8ab1cb4
2 changed files with 58 additions and 44 deletions


@@ -51,32 +51,32 @@ knitr::knit_hooks$set(chunk = function(x, options) {
Ransomware attacks are of interest to security professionals, law enforcement, and financial regulatory officials.$^{[1]}$ The pseudo-anonymous Bitcoin network provides a convenient method for ransomware attackers to accept payments without revealing their identity or location. The victims (usually hospitals or other large organizations) come to learn that much if not all of their important organizational data have been encrypted with a secret key by an unknown attacker. They are instructed to make a payment to a specific Bitcoin address by a certain deadline to have the data decrypted or else it will be deleted automatically.
The deeper legal and financial implications of ransomware attacks are inconsequential to the work in this report, as we are merely interested in being able to classify bitcoin addresses by their connection to ransomware transactions. Many researchers are already tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ For example, consider a ransomware attack conducted towards an illegal darknet market site. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services, if that is so desired.
The deeper legal and financial implications of ransomware attacks are beyond the scope of this report; we are merely interested in classifying Bitcoin addresses by their connection to ransomware transactions. Many researchers already track illicit activity (such as ransomware payments) on the Bitcoin blockchain as soon as possible in order to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ For example, consider a ransomware attack conducted against an illegal darknet market site. The news of such an attack might not be announced at all, to prevent loss of trust among its users. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services, if so desired.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well-known paper, "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$, will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as *white*. The blockchain is then modeled as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Any given address on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference. This way, variables can be defined in a specific and meaningful way. For example, *speed* can be defined as the number of blocks the coin appears in during a 24-hour period, and provides information on how quickly a coin moves through the network. *Speed* may be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a given 24 hour period, and thus have lower speeds when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
Any given address on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals, with the UTC-6 timezone as a reference, allowing variables to be defined in a specific and meaningful way. For example, *speed* can be defined as the number of blocks the coin appears in during a 24-hour period, and provides information on how quickly a coin moves through the network. *Speed* may be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a given 24-hour period, and thus have lower *speeds* when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
With the graph specified as such, the following six numerical features$^{[2]}$ are associated with a given address:
1) *Income* - the total amount of coins sent to an address
1) *Income* - the total amount of bitcoins sent to an address
2) *Neighbors* - the number of transactions that have this address as one of their output addresses
3) *Weight* - the sum of fraction of coins that reach this address from address that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"
3) *Weight* - the sum of the fractions of bitcoins that reach this address from transactions that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"
4) *Length* - the number of non-starter transactions on its longest chain, where a chain is defined as an
acyclic directed path originating from any starter transaction and ending at the address in question
5) *Count* - The number of starter addresses connected to this address through a chain
5) *Count* - the number of starter addresses connected to this address through a chain
6) *Looped* - The number of starter addresses connected to this address by more than one path
6) *Looped* - the number of starter addresses connected to this address by more than one path
These variables are defined rather conceptually, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that. Machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the new results will be compared to the original ones.
These variables are defined somewhat conceptually, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to facilitate quantification of specific transaction patterns. Akcora, et al.$^{[3]}$ give a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions beyond that. Machine learning methods will be applied to the original data set from the same paper$^{[3]}$, and the new results will be compared to the original ones.
### Data
This data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
This data set was found while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining for them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
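For orientation, the snippet below is a minimal sketch of how the file might be fetched and read into R; the UCI mirror URL and the name of the extracted CSV are assumptions on my part, and the actual download and library loading happen in the hidden chunk that follows.
```{r data-download-sketch, eval=FALSE}
# Sketch only (not evaluated): fetch the BitcoinHeist data from the UCI
# repository and read it into a data frame. The URL and file names below
# are assumptions, not necessarily those used by the real script.
zip_url  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
zip_file <- "data.zip"
if (!file.exists(zip_file)) download.file(zip_url, zip_file)
unzip(zip_file)                                 # assumed to contain BitcoinHeistData.csv
ransomware <- read.csv("BitcoinHeistData.csv")  # roughly 2.9 million rows, ten columns
```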
```{r install_load_libraries-and-download_data, echo=FALSE, include=FALSE}
@@ -142,13 +142,13 @@ ransomware %>% head() %>% knitr::kable(caption="First ten entries of data set")
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (meaning not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$.
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (day of the year as an integer from 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (i.e. not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$.
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. *White* Bitcoin addresses were capped at one thousand per day, whereas the entire network sees up to 800,000 addresses daily.$^{[5]}$
### Goal
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing an acceptable predictive model for categorizing ransomware addresses correctly. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
The goal of this project is to apply different machine learning algorithms to the same data set used in the original paper, producing a predictive model that categorizes ransomware addresses with an acceptable degree of accuracy. Improving on the results of the original paper in some way, while not strictly necessary for the purposes of the project, would be a notable sign of success.
### Outline of Steps Taken
@@ -165,7 +165,7 @@ The original research team downloaded and parsed the entire Bitcoin transaction
### Hardware Specification
All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specifications.
All of the analysis in this report was conducted on a single laptop computer, a **Lenovo Yoga S1** from late 2013 with the following specifications.
- CPU: Intel i7-4600U @ 3.3 GHz (4th Gen dual-core/4-thread i7, x86_64)
- RAM: 8217MB DDR3L @ 1600 MHz (8 GB)
@@ -603,27 +603,27 @@ The percentage of wallets with less than one hundred bitcoins as their balance i
### Insights gained from exploration
After visually and statistically exploring of the data, it becomes clear what the challenge is. Ransomware related addresses are very sparse in the data set, making up less than 2% of all addresses. This small percentage is also further classified into 28 groups. Perhaps the original paper was a overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses. To simplify our approach, we will categorize the addresses in a binary way as either *white* or *black*, where *black* signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that have been shown to be impractical otherwise.
After visually and statistically exploring the data, it becomes clear what the challenge is. Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses. This small percentage is further divided into 28 groups. Perhaps the original paper was overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses. To simplify our approach, we will categorize the addresses in a binary way: either *white* or *black*, where *black* signifies an association with ransomware transactions. Framing this as a "ransomware or not-ransomware" question allows for the application of methods that have been shown to be impractical otherwise.
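In code, the binary label can be derived from the existing *label* column roughly as sketched below; the column name `bw` matches the factor referenced later by the Random Forest confusion matrix, but the exact recoding used in the hidden analysis chunks may differ.
```{r binary-label-sketch, eval=FALSE}
# Sketch only: collapse the 28 ransomware families into a single "black" class.
# Anything not labeled "white" is treated as ransomware-related.
library(dplyr)
ransomware <- ransomware %>%
  mutate(bw = factor(ifelse(label == "white", "white", "black")))
table(ransomware$bw)   # black addresses make up less than 2% of the total
```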
---
## Modelling approach
Akcora, et al. applied a Random Forest approach to the data, however "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11] Considering all ransomware addresses as belonging to a single group may improve the predictive power of such methods, making Random Forest worth another try.
Akcora et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3] Considering all ransomware addresses as belonging to a single group may improve the predictive power of such methods, making Random Forest worth another try.
The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps (SOMs), supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough to merit further investigation.
The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps (SOMs), supplied by the `kohonen` package.[11] The description at CRAN [7] was intriguing enough to merit further investigation.
Initially, the categorization of ransomware into the 28 different families was attempted using SOMs. This proved to be very resource intensive, requiring more time and RAM than was available. Although it did help to illuminate how SOMs are configured, the resource requirements of the algorithm became a deterrent. It was at this point that the SOMs were applied in a binary way, classifying all ransomware addresses as merely *black*, initially in an attempt to simply get the algorithm to run to completion without error. This seemed to reduce RAM usage to the point of being feasible on the available hardware.
Initially, the categorization of ransomware into the 29 different families (including *white*) was attempted using SOMs. This proved to be very resource intensive, requiring more time and RAM than was available. Although it did help to illuminate how SOMs are configured, the resource requirements of the algorithm became a deterrent. It was at this point that the SOMs were applied in a binary way, classifying all ransomware addresses as merely *black*, initially in an attempt to simply get the algorithm to run to completion without error. This reduced RAM usage to the point of being feasible on the available hardware.
Self Organizing Maps were not covered in the coursework at any point, so a familiar method was sought out for comparison. Random Forest was chosen and applied to the data set in a binary way, classifying every address as either *white* or *black* and ignoring the ransomware families. Surprisingly, not only did the Random Forest approach result in an acceptable model, it did so much more quickly than expected, taking only a few minutes to produce results.
At this point, it was very tempting to leave it there and write up a comparison of the two approaches to the binary problem, by classifying all ransomware related addresses as *black*. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families. Given the high accuracy and precision of the binary Random Forest approach, the sparseness of the ransomware in the larger set has been eliminated completely, along with any chances of false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method does not seem to produce many false positive (if any), meaning it never seems to predict a truly white address as being black. Hence, by applying the Random Forest method first, we have effectively filtered out any possibility of false positives by correctly identifying a very large set of purely *white* addresses, which are then removed from the set. The best model used in the original paper by Akcora, et al. resulted in more false positives than true positives. This low precision rate is what made it impractical for real-world usage.[3]
It was very tempting to leave it there and write up a comparison of the two approaches to the binary problem by classifying all ransomware-related addresses as *black*. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families. Given the high accuracy and precision of the binary Random Forest approach, the sparseness of the ransomware in the larger set has been eliminated almost entirely, along with most chances of false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method does not seem to produce many false positives (if any), meaning it rarely, if ever, predicts a truly white address as being black. Hence, by applying the Random Forest method first, we have effectively filtered out false positives by correctly identifying a very large set of purely *white* addresses, which are then removed from the set. The best model used in the original paper by Akcora et al. resulted in more false positives than true positives. This low precision is what made it impractical for real-world usage.[3]
This all inspired a two-part method to first separate the addresses into *black* and *white* groups, and then further classify the *black* addresses into ransomware families. We shall explore each of these steps separately.
All of these factors combined to inspire a two-part method: first to separate the addresses into *black* and *white* groups, and then to further classify the *black* addresses into ransomware families. We shall explore each of these steps separately.
### Method Part 0: Binary SOMs
The first working model that ran to completion without exhausting computer resources did not make use of the ransomware family labels and instead the two categories of *black* and *white*. The `kohonen` package provides algorithms for both supervised and unsupervised model building. A supervised approach was used since the data set includes information about the membership of ransomware families that can be used to train the model.
The first working model that ran to completion without exhausting computer resources ignored the ransomware family labels and instead used the two categories of *black* and *white*. The `kohonen` package provides algorithms for both unsupervised and supervised model building, using Self Organizing Maps and Super Organizing Maps, respectively.[11] A supervised approach was used, since the data set includes information about ransomware family membership that can be used to train the model.
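A heavily simplified sketch of that supervised training step is shown below. The grid dimensions, training length, and object names are illustrative placeholders rather than the settings used in the hidden chunk that follows, and the feature column names are assumed from the data description above.
```{r supervised-som-sketch, eval=FALSE}
library(kohonen)

# Scaled numerical features form one data layer; the black/white factor forms
# a second layer, which is what makes the map supervised.
feature_cols <- c("income", "neighbors", "weight", "length", "count", "looped")
X_train <- scale(as.matrix(train_set[, feature_cols]))

som_grid <- somgrid(xdim = 20, ydim = 20, topo = "hexagonal", toroidal = TRUE)

som_model <- supersom(data = list(measurements = X_train, bw = train_set$bw),
                      grid = som_grid, rlen = 100)
```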
```{r binary_SOMs, echo=FALSE, include=FALSE}
##############################################################################
@@ -725,7 +725,7 @@ cm_bw.validation <-
```
After training the model, weobtain the confusion matricies for the test set and the validation set, separately.
After training the model, we obtain the confusion matrices for the test set and the validation set separately. As shown below, the results are very good in both cases.
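Roughly speaking, those matrices come from mapping the held-out observations onto the trained SOM and comparing the predicted classes with the true ones, along the lines of the sketch below; the object names are illustrative, and the actual chunk may differ in detail.
```{r som-confusion-sketch, eval=FALSE}
library(caret)

# Map the held-out observations onto the trained SOM using only the numerical
# layer, then read off the predicted black/white class for each address.
# (In practice the centering and scaling from the training set should be reused.)
X_test   <- scale(as.matrix(test_set[, c("income", "neighbors", "weight",
                                         "length", "count", "looped")]))
som_pred <- predict(som_model, newdata = list(measurements = X_test),
                    whatmap = "measurements")

cm_bw.test <- confusionMatrix(som_pred$predictions[["bw"]], test_set$bw)
```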
```{r binary_SOM_results, echo=FALSE, results='asis' }
@@ -738,13 +738,13 @@ cm1_validation_set <- cm_bw.validation %>% as.matrix() %>%
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{test set}
\\caption{test set confusion matrix}
\\centering",
cm1_test_set,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{validation set}",
\\caption{validation set confusion matrix}",
cm1_validation_set,
"\\end{minipage}
\\end{table}"
@@ -753,7 +753,7 @@ cat(c("\\begin{table}[!htb]
```
This is a very intensive and somewhat inaccurate method compared to what follows. It was left out of the final version of the script and has been included here only for model comparison and to track developmental evolution.
This is a very computationally intensive method compared to what follows. It was left out of the final version of the script and is included here only for model comparison and to document how the approach evolved.
### Method Part 1: Binary Random Forest
@@ -791,8 +791,7 @@ cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
```
The confusion matrix for the test set shows excellent results, specifically in the areas of accuracy and precision.
The confusion matrix for the test set shows very good results, specifically in the areas of accuracy and precision. Although not as good as the SOM model used previously, the results are good enough to justify the time saved.
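For reference, a stripped-down sketch of this step is given below; the formula, tree count, and training-set name are illustrative assumptions, while `ransomware_y_hat_rf` and `cm_ransomware` are the object names visible in the accompanying chunks.
```{r random-forest-sketch, eval=FALSE}
library(randomForest)
library(caret)

# Sketch only: binary Random Forest on the six numerical features.
# Column names and ntree are assumptions, not the script's exact settings.
rf_fit <- randomForest(bw ~ income + neighbors + weight + length + count + looped,
                       data = train_set, ntree = 100)

ransomware_y_hat_rf <- predict(rf_fit, newdata = ransomware)
cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
```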
```{r random-forest-output_test, echo=FALSE}
@@ -824,20 +823,20 @@ cm3_byClass <- cm_ransomware$byClass %>%
```
Here are the confusion matrices for the test set and the full set resulting from the Random Forest model, respectively.
Here are the confusion matrices for the test set and the full set resulting from the Random Forest model, respectively. Note the high values of accuracy and precision.
```{r random-forest-comfusion_matrices, echo=FALSE, results='asis'}
# Print all three tables on one line
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\caption{confusion matrix for test set}
\\caption{test set confusion matrix}
\\centering",
cm2_test_set,
"\\end{minipage}%
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{confusion matrix for full set}",
\\caption{full set confusion matrix}",
cm3_full_set,
"\\end{minipage}
\\end{table}"
@@ -848,7 +847,7 @@ cat(c("\\begin{table}[!htb]
The confusion matrix for the full ransomware set is very similar to that of the test set.
Overall results for test and full sets show good results.
Overall results for the test and full sets are very good.
```{r random-forest-overall_results, echo=FALSE, results='asis'}
@@ -869,7 +868,7 @@ cat(c("\\begin{table}[!htb]
```
Results by class for the test and full sets. What can you say about these, specifically?
Results by class for the test and full sets are shown below.
```{r random-forest-results_by_class, echo=FALSE, results='asis'}
@@ -1006,9 +1005,11 @@ cm_labels$byClass %>% knitr::kable(caption="categorization results by class")
### Clustering Visualizations
Heatmaps and K-means clustering
Heatmaps and K-means clustering provide two complementary views of the trained maps.
Toroidal nerual node maps are used to generate the models, and can be visualized n a number of ways.
Toroidal neural node maps are used to generate the models, and can be visualized in a number of ways.
The heatmap visualizations are shown first, followed by the K-means clustering results.
```{r binary som graphs, echo=FALSE, fig.show="hold", out.width='35%'}
@@ -1062,7 +1063,9 @@ plot(som_model2, type = 'property', property = som_model2$codes[[1]][,6],
```
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model. Say a bit more about it here....
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
The codebook vectors of the trained SOM grid are grouped using K-means, and the resulting cluster boundaries are drawn directly on the map.
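A compact sketch of how such a plot can be produced is given below; the number of clusters is an arbitrary illustration, while `som_model2`, `som.cluster`, and `add.cluster.boundaries` are the names used in the accompanying chunks.
```{r kmeans-boundaries-sketch, eval=FALSE}
# Sketch only: run K-means on the trained map's codebook vectors, then draw
# the resulting cluster boundaries on top of the SOM grid.
codes       <- som_model2$codes[[1]]
som.cluster <- kmeans(codes, centers = 2)

plot(som_model2, type = "mapping",
     bgcol = c("lightgray", "steelblue")[som.cluster$cluster],
     main = "K-means clusters on the SOM grid")
add.cluster.boundaries(som_model2, som.cluster$cluster)
```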
```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
@@ -1108,21 +1111,21 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
### Performance
The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, then real-time analysis could theoretically be conducted on each block as it is announced using even moderate computing resources. Just for kicks, the final script was also run on a more humble computer with the following specifications:
The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, using even moderate computing resources. Just for comparison, the final script was also run on lower-powered machines with the following specifications:
#### ASUS Eee PC 1025C
- CPU: Intel Atom N2600 @ 1.600GHz (64-bit Intel Atom quad-core x86)
- CPU: Intel Atom N2600 @ 1.6GHz (64-bit dual-core/4-thread Intel Atom x86)
- RAM: 3911MB DDR3 @ 800 MT/s (4 GB)
This is a computer known for being slow and clunky. Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds. At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, but it does show that the script can be run to completion on very modest hardware.
#### Pine64 Quartz64 Model A
- CPU: Rockchip RK3566 SoC aarch64 (64-bit quad-core ARM)
- CPU: Rockchip RK3566 SoC `aarch64` @1.8GHz (64-bit quad-core ARM)
- RAM: DDR4 8080MB (8 GB)
Single board computer / Development board. This was run to benchmark a modern 64-bit ARM processor. The script runs in about 860 minutes on this platform, nearly half of that for the Atom processor above.
This is a single board computer / development board, which runs the same software as the others (ported to `aarch64`), except for RStudio. It is nice to be able to benchmark a modern 64-bit ARM processor. The script runs in about 860 seconds on this platform, a bit over 14 minutes and nearly half the time taken by the Atom processor above. That is still not fast enough to analyze each block in real time, but it is a significant improvement given the low power usage of such processors.
---
@@ -1133,22 +1136,24 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
In the original paper, Akcora et al. tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the highest precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610. By comparison, although several of our predicted classes had zero or NA precision values, the lowest non-zero precision value is `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`, with many well above that, approaching one in a few cases.
One might say that we are comparing apples to oranges in a sense, because their method was one single model, while these results are from a two-method stack. Still, given the run time of the final script, I think the two-model approach is superior in this case, especially when measured in terms of precision and avoiding false positives.
One might say that we are comparing apples to oranges in a sense, because their method was a single model, while these results come from a two-method stack. Still, given the run time of the final script, the two-model approach is superior in this case, especially when measured in terms of precision and avoiding false positives.
### Limitations
SOMs seem like they are easy to misconfigure. Perhaps a dual Random Forest approach would be better. this has not been attempted yet, as the two method approach presented here was satisfactory enough to present in a report.
SOMs seem easy to misconfigure, and they require significantly more computing resources than less sophisticated algorithms. Perhaps a dual Random Forest approach would be better. This has not been attempted yet, as the two-method approach presented here was satisfactory enough to present in a report.
### Future Work
I only scratched he surface of the SOM algorithm which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation. Also, a dual Random Forest approach to first isolate the ransomware addresses and also
I have only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation. For example, the grid size used to train the SOM was calculated using an algorithm based on the size of the training set, and while this performed better than a grid size based on the number of categories, it may not be ideal. Optimization of the grid size could still be carried out.
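As one possible starting point, a commonly cited rule of thumb (not necessarily the one used in this script) sizes the map at roughly five times the square root of the number of training observations; the grid dimension derived from it could then be treated as a tuning parameter.
```{r grid-size-sketch, eval=FALSE}
# Sketch only: heuristic grid sizing, roughly 5 * sqrt(n) total units,
# arranged here as a square toroidal hexagonal grid.
n_units  <- 5 * sqrt(nrow(train_set))
grid_dim <- ceiling(sqrt(n_units))
som_grid <- kohonen::somgrid(xdim = grid_dim, ydim = grid_dim,
                             topo = "hexagonal", toroidal = TRUE)
```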
A dual Random Forest approach, used first to isolate the ransomware addresses and then to classify them, might be quick enough to run in under ten minutes on all the hardware listed. Conversely, a dual-SOM method could be created for maximum precision if the necessary computing resources were available.
The script itself has a few areas that could be further optimization. The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized.
The script itself has a few areas that could be further optimized. The sampling method does what it needs to do, but the ratios used for each set could probably be tuned further.
### Conclusion
This paper/report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives by filtering them out using a binary method before classifying them further. It leaves the author of the paper wondering how much harder it would be to perform the same task for ransomware that uses privacy coins. Certain cryptocurrency networks utilize privacy coins, such as Monero, that obfuscate transactions from being analyzed in the same way that the Bitcoin network has been analyzed here. Some progress has been made towards analyzing such networks[9], but the developers of such networks continually evolve the code to complicate transaction tracking. This could be another good area for future research.
This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives by filtering them out with a binary method before classifying them further. It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy-centric coins. Certain cryptocurrency networks utilize privacy coins, such as Monero, that obfuscate transactions and prevent them from being analyzed in the same way that the Bitcoin network has been analyzed here. Some progress has been made towards analyzing such networks[9], but the developers of such networks continually evolve the code to complicate transaction tracking. This could be another good area for future research.
## References
@@ -1161,10 +1166,10 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[4] UCI Machine Learning Repository https://archive.ics.uci.edu/ml/index.php
[5] BitcoinHeist Ransomware Address Dataset
[5] BitcoinHeist Ransomware Address Dataset
https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset
[6] Available Models - The `caret` package http://topepo.github.io/caret/available-models.html
[6] Available Models - The `caret` package http://topepo.github.io/caret/available-models.html
[7] Ron Wehrens and Johannes Kruisselbrink, Package `kohonen` @ CRAN (2019) https://cran.r-project.org/web/packages/kohonen/kohonen.pdf
@@ -1173,6 +1178,15 @@ https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset
[9] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
[10] KR Tejeda, Detecting Bitcoin Ransomware, https://git.disroot.org/shelldweller/ransomware
[11a] Wehrens R, Buydens LMC (2007). “Self- and Super-Organizing Maps in R: The kohonen Package.” _Journal of Statistical Software_, *21*(5), 1-19. doi: 10.18637/jss.v021.i05 (URL: https://doi.org/10.18637/jss.v021.i05).
[11b] Wehrens R, Kruisselbrink J (2018). “Flexible Self-Organizing Maps in kohonen 3.0.” _Journal of Statistical Software_, *87*(7), 1-18. doi: 10.18637/jss.v087.i07 (URL: https://doi.org/10.18637/jss.v087.i07).
\newpage
## Appendix:

Binary file not shown.