It is finally starting to look good. Keep going. Chunk #2 needs you. You are its only hope.

This commit is contained in:
shelldweller 2021-10-23 08:36:04 -06:00
parent d811046c5e
commit 50f05fcd5c
3 changed files with 231 additions and 54 deletions

View File

@ -94,7 +94,7 @@ coeff_vars <- sds / means
selected_features <- names(sort(coeff_vars, decreasing=TRUE))[1:2]
message("The features with the highest coefficients of variation are ",
selected_features[1], selected_features[2],
selected_features[1], " and ", selected_features[2],
", which will be used to train the binary model.")
#Sample every 100th row due to memory constraints

View File

@ -49,7 +49,7 @@ knitr::knit_hooks$set(chunk = function(x, options) {
The legal and financial implications of ransomware attacks are not of concern for the purpose of this analysis. Many parties are interested in tracking illicit activity (such as ransomware payments) around the Bitcoin blockchain as soon as possible to minimize financial losses. Daniel Goldsmith explains some of the reasons and methods of blockchain analysis at [Chainalysis.com](https://www.chainalysis.com/).$^{[2]}$ A ransomware attack could be perpetrated on an illegal darknet market site, for example. The news of such an attack might not be published at all, let alone in popular media. By analyzing the transaction record with a blockchain explorer such as [BTC.com](https://btc.com/), suspicious activity could be flagged in real time given a sufficiently robust model. It may, in fact, be the first public notice of such an event. Any suspicious addresses could then be blacklisted or banned from using other services.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora, et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 29 known ransomware address groups. Addresses with no known ransomware associations are classified as "white". The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Lists of known ransomware payment addresses have been compiled and analyzed using various methods. One well known paper entitled "BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain"$^{[3]}$ will be the source of our data set and the baseline to which we will compare our results. In that paper, Akcora, et al. use Topological Data Analysis (TDA) to classify addresses on the Bitcoin blockchain into one of 28 known ransomware address groups. Addresses with no known ransomware associations are classified as "white". The blockchain is then considered as a heterogeneous Directed Acyclic Graph (DAG) with two types of nodes describing *addresses* and *transactions*. Edges are formed between the nodes when a transaction can be associated with a particular address.
Addresses on the Bitcoin network may appear many times, with different inputs and outputs each time. The Bitcoin network data has been divided into 24-hour time intervals with the UTC-6 timezone as a reference. Speed is defined as the number of blocks the coin appears in during a 24-hour period and provides information on how quickly a coin moves through the network. Speed can be an indicator of money laundering or coin mixing, as normal payments only involve a limited number of addresses in a 24-hour period, and thus have lower speeds when compared to "mixed" coins. The temporal data can also help distinguish transactions by geolocation, as criminal transactions tend to cluster in time.
@ -134,7 +134,7 @@ ransomware %>% head() %>% knitr::kable()
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or one of 29 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$ .
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (of the year as 1 to 365), and a categorical feature called *label* that categorizes each address as either "white" (meaning not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$.
The original research team downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Based on a 24 hour time interval, daily transactions on the network were extracted and the Bitcoin graph was formed. Network edges that transferred less than \bitcoinA 0.3 were filtered out since ransom amounts are rarely below this threshold. Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. "White" Bitcoin addresses were capped at one thousand per day while the entire network has up to 800,000 addresses daily.$^{[5]}$
@ -146,13 +146,15 @@ The original research team downloaded and parsed the entire Bitcoin transaction
1) Analyze data set numerically and visually. Notice any pattern, look for insights.
2) Binary classification using Random Forests.
2) Binary separation using Self Organizing Maps.
3) Categorical classification using Self Organizing Maps.
3) Improved binary separation using Random Forests.
4) Visualize clustering to analyze results further.
4) Categorical classification using Self Organizing Maps.
5) Generate Confusion Matrix to quantify results.
5) Visualize clustering to analyze results further.
6) Generate Confusion Matrix to quantify results.
---
@ -199,7 +201,6 @@ ransomprop <- mean(ransomware$bw=="black")
# Check for NAs
no_nas <- sum(is.na(ransomware))
```
---
@ -208,9 +209,9 @@ no_nas <- sum(is.na(ransomware))
###############################################################################
---
### Exploration and Visualization ( Chunk #2, do this part last....)
### Exploration and Visualization
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 29 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
```{r cv-calcs, echo=FALSE}
@ -250,7 +251,6 @@ test_num <- test_samp %>% select(selected_features[1], selected_features[2])
# Binary labels for test set
test_bw <- test_samp$bw
```
```{r data-sparsness, echo=FALSE}
@ -259,7 +259,7 @@ message("The proportion of ransomware addresses in the original data set is ", r
message("The total number of NA or missing values in the original data set is ", no_nas, ".")
labels <- ransomware$label %>% summary()
knitr::kable(
list(labels[1:15], labels[16:29]),
@ -273,9 +273,10 @@ knitr::kable(
Let's take a look at the distribution of the different features. Note how skewed the non-temporal features are, some of them being bimodal:
```{r histograms, echo=FALSE}
# Histograms of each of the columns to show skewness
# Plot histograms for each column using facet wrap
########################################################
## Histograms of each of the columns to show skewness
## Plot histograms for each column using facet wrap
########################################################
train_long <- train_num %>% # Apply pivot_longer function
pivot_longer(colnames(train_num)) %>%
@ -325,7 +326,6 @@ Now do the following (after filling in methods, results, and conclusions, since
# Count how many wallets have less than one full bitcoin
shrimp <- train_samp %>% filter(income < 10^8 )
```
```{r shrimp-output, echo=FALSE}
@ -333,36 +333,196 @@ shrimp <- train_samp %>% filter(income < 10^8 )
# Print the proportion of sub-1-BTC wallets that are associated with ransomware
mean(shrimp$bw == "black")
```
---
###############################################################################
## Clean up and add text from here to end.....
## End graphic cleanup here.
###############################################################################
---
### Insights Gained from Exploration
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the total addresses. That small percentage is also further classified into 29 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 30 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are impractical otherwise.
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for the application of methods that would otherwise be impractical.
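The binary `bw` factor used throughout the rest of the script presumably comes from collapsing `label` in this way. A minimal sketch (assumed, not the committed code, hence left unevaluated):
```{r bw-sketch, eval=FALSE}
# Collapse the 28 ransomware families into a single "black" class;
# everything else stays "white".
ransomware <- ransomware %>%
  mutate(bw = factor(ifelse(label == "white", "white", "black")))
```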
---
## Modeling approach
Akcora et al. mention applied a Random Forest approach to the data, however "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11] Considering all ransomware addresses as belonging to a single group might improve the predictive power of such methods, making Random Forest worth another try. The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough to merit further investigation.
Akcora et al. applied a Random Forest approach to the data; however, "Despite improving data scarcity, [...] tree based methods (i.e., Random Forest and XGBoost) fail to predict any ransomware family".[3, 11] Considering all ransomware addresses as belonging to a single group might improve the predictive power of such methods, making Random Forest worth another try.
The topological description of the data set inspired a search for topological machine learning methods, although one does not necessitate the other. Searching for *topo* in the documentation for the `caret` package [6] resulted in the entry for Self Organizing Maps, supplied by the `kohonen` package. The description at CRAN [7] was intriguing enough to merit further investigation.
Initially, the categorization of ransomware into the 29 different families was attempted using SOMs. This proved to be very resource intensive, requiring more time and RAM than was available at the time. Although it did help to illuminate how SOMs are configured, the runtime of the script became a deterrent.
Initially, the categorization of ransomware into the 28 different families was attempted using SOMs. This proved to be very resource-intensive, requiring more time and RAM than was available at the time. Although it did help to illuminate how SOMs are configured, the resource requirements of the script became a deterrent. It was at this point that the SOMs were applied in a binary way, classifying all ransomware addresses as merely "black". This seemed to reduce RAM usage to the point of being feasible on the available hardware.
Since the SOM method was not covered in the coursework at any point, a more familiar method was sought out to compare the results against. This is when Random Forest was applied to the data set in a binary way. Much to the author's surprise, not only did the Random Forest approach result in an acceptable model, it surpassed the model produced by the SOM approach.
The author was tempted to leave it there and write up a comparison of the two approaches to the binary problem. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of sorting the ransomware addresses into the 28 families. Given the high accuracy and precision of the Random Forest approach to the binary problem, it became apparent that the sparseness of the ransomware in the larger set had been eliminated completely, as had any chance of false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method has yet to produce a false positive, meaning it never seems to predict a truly white address as being black. Hence, by applying the Random Forest method first, we have essentially filtered out any possibility of false positives, which is exactly what plagued the original paper by Akcora et al.[3]
Finally, a two-part method was devised to first separate the addresses into black and white groups, and then further classify the black addresses into ransomware families. Let's explore each of these separately.
### Method Part 0: Binary SOMs as a first attempt to isolate ransomware addresses.
Let's see how well the SOM approach can model the data in a black/white fashion.
```{r binary SOMs, echo=FALSE, include=FALSE}
##############################################################################
## This is a first attempt using SOMs to model the data set as "black" and
## "white" addresses only.
##
## NOTE: This is the most computationally heavy part of the paper and takes
## several hours to run to completion. It is also completely optional, only
## used to compare with the better method. If, for some reason, you want to
## compile the report without this section, you can just comment it all out
## or remove it because nothing is needed from Method Part 0 for any of the
## other methods. In other words, it can be safely skipped if you are short on
## time or RAM.
##############################################################################
# Keep only numeric columns, ignoring dates and looped.
som1_train_num <- train_set %>% select(length, weight, count, neighbors, income)
# SOM function can only work on matrices
som1_train_mat <- as.matrix(scale(som1_train_num))
# Switching to supervised SOMs
som1_test_num <- test_set %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
som1_test_mat <- as.matrix(scale(som1_test_num, center = attr(som1_train_mat,
"scaled:center"), scale = attr(som1_train_mat, "scaled:scale")))
# Binary outputs, black=ransomware, white=non-ransomware, train set
som1_train_bw <- train_set$bw %>% classvec2classmat()
# Same for test set
som1_test_bw <- test_set$bw %>% classvec2classmat()
# Create data list for supervised SOM
som1_train_list <- list(independent = som1_train_mat, dependent = som1_train_bw)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
# Formulaic method 1
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
# Based on categorical number, method 2
#grid_size = ceiling(sqrt(length(unique(ransomware$bw))))
grid_size
# Create SOM grid
som1_train_grid <- somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
## Now build the model.
som_model1 <- xyf(som1_train_mat, som1_train_bw,
grid = som1_train_grid,
rlen = 100,
mode="pbatch",
cores = detectCores(), # detectCores() - 1
keep.data = TRUE
)
# Now test predictions
# https://clarkdatalabs.github.io/soms/SOM_NBA
som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)
ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)
# Confusion Matrix
som1_cm_bw <- confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)
# Now test predictions of validation set
# Switching to supervised SOMs
valid_num <- validation %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it according to how we scaled our training data.
valid_mat <- as.matrix(scale(valid_num, center = attr(som1_train_mat,
"scaled:center"), scale = attr(som1_train_mat, "scaled:scale")))
valid_bw <- validation$bw
valid_list <- list(independent = valid_mat, dependent = valid_bw)
# Requires up to 16GB of RAM, skip if resources are limited
ransomware.prediction1.validation <- predict(som_model1, newdata = valid_list)
# Confusion Matrix
cm_bw.validation <- confusionMatrix(ransomware.prediction1.validation$prediction[[2]], validation$bw)
```
Several diagnostic graphs are available for SOM models. They are left commented out below because some of them are very large and take a long time to produce.
```{r binary som graphs, echo=FALSE}
# Be careful with these, some are really large and take a long time to produce.....
# Visualize clusters
#plot(som_model1, type = 'mapping', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Distance map
#plot(som_model1, type = 'quality', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize counts
#plot(som_model1, type = 'counts', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize fan diagram
#plot(som_model1, type = 'codes', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 1
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,1], main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 2
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,2], main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 3
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,3], main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 4
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,4], main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 5
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,5], main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
#cat(" \n")
```
Here are the confusion matrices for the binary SOM model, first for the test set and then for the validation set.
```{r binary som results, echo=FALSE}
som1_cm_bw %>% as.matrix() %>% knitr::kable()
cm_bw.validation %>% as.matrix() %>% knitr::kable()
```
This winds up being the most resource-intensive and least accurate method out of those tried. It was left out of the final version of the script and is not really worth running, except to compare with the next method.
### Method Part 1: Binary Random Forests to isolate ransomware addresses before categorization.
A Random Forest model is trained using ten-fold cross-validation and a tuning grid for `mtry`, the number of variables randomly sampled as candidates at each split, set to the values $\{2, 4, 6, 8, 10, 12\}$, each one being checked for optimization.
### Method Part 1: Binary Random Forests to isolate ransomware addresses first.
```{r random-forest-prep, echo=FALSE, include=FALSE, warning=FALSE}
##############################################################################
## This is a better attempt using Random Forest to model the data set as "black"
## and "white" addresses only.
##############################################################################
# Cross Validation, ten fold
control <- trainControl(method="cv", number = 10)
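# --- Illustrative sketch, not the committed code: the actual training call is
# --- elided by the hunk below. caret's tuneGrid for method="rf" is a data
# --- frame with an mtry column; train_num and train_bw are assumed names for
# --- the numeric features and binary labels from the earlier sampling chunk.
# tune_grid <- data.frame(mtry = c(2, 4, 6, 8, 10, 12))
# fit_rf <- train(train_num, train_bw, method = "rf",
#                 trControl = control, tuneGrid = tune_grid)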
@ -390,20 +550,22 @@ cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
```
Here are the results for the test set.
The confusion matrix for the test set shows excellent results, specifically in the areas of accuracy and precision.
```{r random-forest-output_test, echo=FALSE}
# Confusion matrix for test set, overall results and by class.
cm_test %>% as.matrix() %>% knitr::kable()
cm_test$overall %>% knitr::kable()
cm_test$byClass %>% knitr::kable()
```
Here are the results for the full original set.
The confusion matrix for the full ransomware set is very similar to that of the test set.
```{r random-forest-output_big, echo=FALSE}
# Confusion matrix for full ransomware set, overall results and by class.
cm_ransomware %>% as.matrix() %>% knitr::kable()
cm_ransomware$overall %>% knitr::kable()
cm_ransomware$byClass %>% knitr::kable()
@ -413,13 +575,13 @@ cm_ransomware$byClass %>% knitr::kable()
### Method Part 2: Categorical SOMs to categorize predicted ransomware addresses.
Now we train a new model after throwing away all "white" addresses.
Now we train a new model after throwing away all "white" addresses. The predictions from the Random Forest model are used to isolate all "black" addresses for further classification into ransomware families using SOMs.
```{r soms-prep, echo=FALSE, include=FALSE}
##############################################################################
## Now we use the Random Forest model to exclude the "white" addresses from
## the full ransomware set, to categorize the "black" addresses into families.
## Now we use the Random Forest model to classify the data set into "black"
## and "white" categories with better precision.
##############################################################################
# Now use this prediction to reduce the original set to only "black" addresses
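# Sketch of this step (commented out; black_set is an assumed name, and the
# committed code is elided by the hunk below). ransomware_y_hat_rf holds the
# Random Forest predictions over the full set, so predicted-black rows are:
# black_set <- ransomware[ransomware_y_hat_rf == "black", ]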
@ -499,27 +661,22 @@ cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
```
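The training call for this second SOM is elided by the hunk above. It presumably mirrors the supervised `xyf()` call from Method Part 0, with the family labels as the dependent layer; a hedged sketch follows (names like `som2_train_mat` and `black_train` are assumptions, so the chunk is not evaluated):
```{r som2-sketch, eval=FALSE}
# Supervised SOM over the predicted-black subset; the dependent layer is the
# ransomware family label rather than the binary bw factor.
som2_train_grid <- somgrid(xdim = grid_size, ydim = grid_size,
                           topo = "hexagonal", toroidal = TRUE)
som_model2 <- xyf(som2_train_mat,                        # scaled features (assumed name)
                  classvec2classmat(black_train$label),  # family labels (assumed name)
                  grid = som2_train_grid,
                  rlen = 100,
                  mode = "pbatch",
                  cores = detectCores(),
                  keep.data = TRUE)
```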
A specific method was used to select the optimal grid size. Cite source here and explain options, and why you chose the one you did.
When selecting the grid size for a Self Organizing Map, there are at least two different schools of thought. The two that were tried here are explained (with supporting documentation) in a ResearchGate forum thread.[8] The first method is based on the size of the training set, and in this case results in a larger, more accurate map. The second method is based on the number of known categories to classify the data into, and in this case results in a smaller, less accurate map. For this script, a grid size of `r grid_size` has been selected.
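For reference, the two heuristics as they appear in the earlier script chunk (method 1 is the one in use; the chunk is shown here without re-evaluating it):
```{r grid-size-sketch, eval=FALSE}
# Method 1: scale the grid with the size of the training set (used here)
grid_size <- round(sqrt(5 * sqrt(nrow(train_set))))
# Method 2: scale the grid with the number of known categories (not used)
grid_size <- ceiling(sqrt(length(unique(ransomware$bw))))
```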
```{r grid-size, echo=FALSE}
message("A grid size of ", grid_size, " has been chosen.")
```
Here is a summary of the results for the categorization of black addresses into ransomware families. For the full table of predictions and statistics, see the Appendix.
A summary of the results for the categorization of black addresses into ransomware families follows. For the full table of predictions and statistics, see the Appendix.
```{r cm_overall, echo=FALSE}
cm_labels$overall %>% knitr::kable()
# Overall section of the confusion matrix formatted through kable()
cm_labels$overall %>% knitr::kable()
```
```{r soms-output-by-class, echo=FALSE, size="tiny"}
# By Class section of the confusion matrix formatted through kable()
cm_labels$byClass %>% knitr::kable()
```
### Clustering Visualizations: K-means clustering
@ -541,7 +698,7 @@ som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
```
Here is a nice graph.
K-means clustering partitions the SOM grid into groups, and boundaries are then drawn around the resulting clusters. This is the author's favorite graph in the entire report.
```{r clustering-plot, echo=FALSE, fig.align="center"}
# Plot clustering results
@ -556,35 +713,52 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
---
## Results & Performance (chunk #4, write up after chunk #3 is done)
## Results & Performance
### Results
In the original paper by Akcora et al, they tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each
true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the highest Precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610. Compare this to our final Precision value of 1.000? It is almost embarrassing... did I do something wrong here?
The first attempt to isolate ransomware using SOMs resulted in a model with an accuracy of `r toString(cm_bw.validation$overall["Accuracy"])` and precision `r toString(cm_bw.validation$byClass[3])`.
The second attempt to isolate ransomware using Random Forest resulted in a model with an accuracy of `r toString(cm_ransomware$overall["Accuracy"])` and precision `r toString(cm_ransomware$byClass[3])`.
Classifying the ransomware predicted by the second attempt into 28 ransomware families resulted in a model with an overall accuracy of `r toString(cm_labels$overall["Accuracy"])` and minimum nonzero precision of `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`.
### Performance
The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, then real-time analysis could theoretically be conducted on each block as it is announced using even moderate compuitng resources.
The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, then real-time analysis could theoretically be conducted on each block as it is announced using even moderate computing resources. Just for kicks, the final script was also run on a more humble computer with the following specifications:
- CPU: Intel Atom N2600 @ 1.600GHz (64-bit Intel Atom quad-core)
- RAM: 3911MB DDR3 @ 800 MT/s (4 GB)
This is a computer known for being slow and clunky. Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds. At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, but it does show that the script can be run on very modest hardware to completion.
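For anyone reproducing these timings, the obvious R idiom suffices; a trivial sketch (the script file name is hypothetical):
```{r timing-sketch, eval=FALSE}
# Measure total elapsed run time of the full pipeline in seconds
start_time <- Sys.time()
source("ransomware-analysis.R")  # hypothetical file name
difftime(Sys.time(), start_time, units = "secs")
```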
---
## Summary
### Comparison to original paper and impact of findings
### Comparison to results from original paper
In the original paper by Akcora et al., they tested several different sets of parameters on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each
true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the highest Precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610. By comparison, although several of our predicted classes had zero or NA precision values, the lowest non-zero precision value is `r toString(min(cm_labels$byClass[,5][which(cm_labels$byClass[,5] > 0)]))`, with many well above that, approaching one in a few cases.
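For reference, the inline values above are extracted from the caret confusion matrix as follows; in multiclass output, column 5 of `byClass` holds the per-class Precision (shown unevaluated):
```{r precision-sketch, eval=FALSE}
# Per-family Precision; zero entries are families that were never predicted
family_precision <- cm_labels$byClass[, 5]
min(family_precision[which(family_precision > 0)])
```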
One might say that we are comparing apples to oranges in a sense, because their method was one single model, while these results are from a two-method stack. Still, given the run time of the final script, I think the two-model approach is superior in this case, especially when measured in terms of precision and avoiding false positives.
In short, the two-method stack wins this comparison decisively.
### Limitations
SOMs seem like they are easy to misconfigure. Perhaps a dual Random Forest approach would be better?
SOMs seem like they are easy to misconfigure. Perhaps a dual Random Forest approach would be better. This has not been attempted yet, as the two-method approach presented here was satisfactory enough to present in a report.
### Future Work
I only scratched he surface of the SOM algorithm which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation, somehow.
I only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation. Also, a dual Random Forest approach, using one model to first isolate the ransomware addresses and a second to classify them into families, could be explored.
The script itself has a few areas that could be further optimized. The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized.
### Conclusions
### Conclusion
#### Get Monero!
This paper/report presents a reliable method for classifying bitcoin addresses into know ransomware families, while at the same time avoiding false positives by filtering them out using a binary method before classifying them further. It leaves the author of the paper wondering how long before we see ransomware using privacy coins such as Monero. Find and cite a recent paper on the untracability of the Monero blockchain.
This paper/report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives by filtering them out using a binary method before classifying them further. It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy coins. Networks built around privacy coins such as Monero obfuscate their transaction records, preventing the kind of analysis that has been applied to the Bitcoin network here. Some progress has been made towards analyzing such networks[9], but the developers of these networks continually evolve the code to complicate transaction tracking. This could be another good area for future research.
## References
@ -604,20 +778,23 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[7] Ron Wehrens and Johannes Kruisselbrink, Package `kohonen` @ CRAN (2019) [https://cran.r-project.org/web/packages/kohonen/kohonen.pdf](https://cran.r-project.org/web/packages/kohonen/kohonen.pdf)
[XMR] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
[8] How many nodes for self-organizing maps? (Oct 22, 2021) [https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps](https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps)
[9] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
\newpage
## Appendix:
### Categorical SOM ransomware family prediction table and confusion matrix - detailed
```{r soms-output-table, echo=FALSE}
cm_labels
```
```{r empty block, echo=FALSE, include=FALSE}
# Comment goes here....
# Use this for other blocks, etc.

Binary file not shown.