Final edit completed.

shelldweller 2021-11-13 20:19:32 -07:00
parent 1169804ec1
commit f6c46cd79d
3 changed files with 24 additions and 29 deletions


@ -11,7 +11,7 @@
###############################################################################
# Uncomment next line to time script
tic()
#tic()
# Set the repository mirror to “1: 0-Cloud” for maximum availability
r = getOption("repos")
@ -148,7 +148,7 @@ cm_ransomware
##############################################################################
# Now use this prediction to reduce the original set to only "black" addresses
# First append the full set of predictions to the original set.
# First append the full set of predictions to the original set
ransomware$prediction <- ransomware_y_hat_rf
# Filter out all the predicted "white" addresses,
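## A minimal sketch of the filtering step referenced above (the actual
## line sits outside this hunk; `prediction` is assumed to be a factor
## with levels "black" and "white", and dplyr attached as elsewhere):
black_addresses <- ransomware %>% filter(prediction == "black")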
@ -162,18 +162,18 @@ test_index <- createDataPartition(y = black_addresses$prediction,
train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]
# Keep only numeric columns, ignoring temporal variables.
# Keep only numeric columns, ignoring temporal variables
train_num <- train_set %>%
select(length, weight, count, looped, neighbors, income)
# SOM function can only work on matrices.
# SOM function can only work on matrices
train_mat <- as.matrix(scale(train_num))
# Select non-temporal numerical features only
test_num <- test_set %>%
select(length, weight, count, looped, neighbors, income)
# Testing data is scaled according to how we scaled our training data.
# Testing data is scaled according to how we scaled our training data
test_mat <- as.matrix(scale(test_num,
center = attr(train_mat, "scaled:center"),
scale = attr(train_mat, "scaled:scale")))
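## Illustrative check (not part of the script): the stored attributes
## confirm each test column is standardized with the *training* mean/sd
all.equal(unname(test_mat[, "income"]),
          (test_num$income - attr(train_mat, "scaled:center")["income"]) /
            attr(train_mat, "scaled:scale")["income"])  # TRUE if consistent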
@ -247,4 +247,4 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
message("Overall accuracy is ", cm_labels$overall["Accuracy"])
# End timer
toc(quiet=FALSE)
#toc(quiet=FALSE)
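## For reference, the tictoc pattern toggled by this commit, as a
## self-contained sketch:
# library(tictoc)
# tic("Full script")  # start a named timer
# ...                 # the work being timed
# toc(quiet = FALSE)  # prints e.g. "Full script: 123.4 sec elapsed"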


@ -3,7 +3,7 @@ title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain usin
subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "11/13/2021"
date: "11/14/2021"
abstract: "Ransomware is a persisent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. Many attempts towards this goal have not made use of sophisticated machine learning methods. Those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
keywords:
- Bitcoin
@ -163,7 +163,7 @@ ransomware %>% select(-address, -label) %>% summary() %>% knitr::kable(caption="
```
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six numerical features defined previously (*income, neighbors, weight, length, count, loop*), two temporal features in the form of *year* and *day* (day of the year as an integer from 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (i.e. not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$. A listing of the first ten rows provides a sample of the features associated with each observation.
This data set has 2,916,697 observations of ten features associated with a sample of transactions from the Bitcoin blockchain. The ten features include *address* as a unique identifier, the six numerical features defined previously (*income, neighbors, weight, length, count, looped*), two temporal features in the form of *year* and *day* (day of the year as an integer from 1 to 365), and a categorical feature called *label* that categorizes each address as either *white* (i.e. not connected to any ransomware activity), or one of 28 known ransomware groups as identified by three independent ransomware analysis teams (Montreal, Princeton, and Padua)$^{[3]}$. A listing of the first ten rows provides a sample of the features associated with each observation.
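A sketch of what such a listing amounts to (the chunk below does the actual rendering; the data frame name `ransomware` matches its use elsewhere in the script):

```r
# First ten rows of the data set
ransomware %>% head(10) %>% knitr::kable()
```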
```{r data_head, echo=FALSE, size="tiny"}
@ -266,7 +266,7 @@ train_num <- train_samp %>% select(selected_features[1], selected_features[2])
# Binary labels, black = ransomware, white = non-ransomware, train set
train_bw <- train_samp$bw
#Sample every 100th row due to memory constraints to make test sample same size.
# Sample every 100th row due to memory constraints to make test sample same size
# (cap the sequence at the smaller set to avoid out-of-range NA rows)
test_samp <- test_set[seq(1, min(nrow(train_set), nrow(test_set)), 100), ]
# Dimension reduction again, selecting features with highest CVs
@ -342,9 +342,9 @@ knitr::kable(
```
From this, it appears that `r selected_features[1]` has the widest range of variability, followed by `r selected_features[2]`. These are also the features that are most strongly skewed to the right, meaning that a few addresses have really high values for each of these features while the bulk of the data set has very low values for these numbers.
From this, it appears that `r selected_features[1]` has the widest range of variability, followed by `r selected_features[2]`. These are also the features that are most strongly skewed to the right, meaning that a few addresses have really high values for each of these features while the bulk of the data set has very low values.
Taking the feature with the highest variation `r selected_features[1]`, we can take a look at the distribution for individual ransomware families to see if there is a similarity across families. This can be done for all the features, but we will focus on `r selected_features[1]` in the interest of saving space and to avoid repetition and redundancy. The distribution plots for `r selected_features[1]` show the most variation since it is the feature with the highest coefficient of variation, so it is a good one to focus on.
Taking the feature with the highest variation, `r selected_features[1]`, we can look at its distribution for each ransomware family to see whether there is any similarity across families. This can be done for all of the features, but in the interest of saving space and avoiding repetition we will focus on `r selected_features[1]`, since as the feature with the highest coefficient of variation it shows the most variation in its distribution plots.
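For context, a coefficient-of-variation ranking like the `selected_features` vector used above can be computed along these lines (a sketch; the report's own computation happens in an earlier chunk not shown in this diff):

```r
# Rank the six numeric features by coefficient of variation (sd / mean)
cv <- sapply(ransomware[, c("income", "neighbors", "weight",
                            "length", "count", "looped")],
             function(x) sd(x) / mean(x))
selected_features <- names(sort(cv, decreasing = TRUE))
```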
```{r variation_histograms, echo=FALSE, fig.height=2, fig.width=2.5, fig.show="hold", out.width='35%', warning=FALSE}
@ -633,7 +633,7 @@ It appears that, although the `r selected_features[1]` distribution for ransomw
Self Organizing Maps were not covered in the coursework at any point, so a familiar method was sought out for comparison. Random Forest was chosen and applied to the data set in a binary way, classifying every address as either *white* or *black* and ignoring the ransomware families. Surprisingly, not only did the Random Forest approach result in an acceptable model, it did so much more quickly than expected, taking only a few minutes to produce results.
It was very tempting to leave it there and write up a comparison of the two approaches to the binary problem by classifying all ransomware related addresses as *black*. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families. Given the high accuracy and precision of the binary Random Forest approach, the sparseness of the ransomware in the larger set has been eliminated completely, along with any chances of false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method does not seem to produce many false positive (if any), meaning it never seems to predict a truly white address as being black. Hence, by applying the Random Forest method first, we have effectively filtered out any possibility of false positives by correctly identifying a very large set of purely *white* addresses, which are then removed from the set. The best model used in the original paper by Akcora, et al. resulted in more false positives than true positives. This low precision rate is what made it impractical for real-world usage.$^{[3]}$
It was very tempting to leave it there and write up a comparison of the two approaches to the binary problem by classifying all ransomware related addresses as *black*. However, a nagging feeling that more could be done eventually inspired a second look at the categorical problem of grouping the ransomware addresses into the 28 known families. Given the high accuracy and precision of the binary Random Forest approach, the sparseness of the ransomware in the larger set has been mostly eliminated, along with many of the chances for false positives. There are a few cases of false negatives, depending on how the randomization is done during the sampling process. However, the Random Forest method does not seem to produce many false positives (if any), meaning it rarely, if ever, predicts a truly white address as black. Hence, by applying the Random Forest method first, we have effectively removed the false-positive problem by correctly identifying a very large set of purely *white* addresses, which are then excluded from further classification. The best model used in the original paper by Akcora et al. resulted in more false positives than true positives, and this low precision is what made it impractical for real-world usage.$^{[3]}$
All of these factors combined to inspire a two-part method: first to separate the addresses into *black* and *white* groups, and then to further classify the *black* addresses into ransomware families. We shall explore each of these steps separately.
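A compact sketch of that two-part flow (feature choices and parameters are placeholders wherever this diff does not show them):

```r
library(randomForest)

# Part 1: binary random forest separates "black" from "white" addresses
rf_fit <- randomForest(x = train_num, y = train_bw)  # two selected features
ransomware_y_hat_rf <- predict(rf_fit, ransomware[, names(train_num)])

# Part 2: keep only the predicted-black addresses, then classify them into
# families with a supervised SOM (kohonen::xyf), as in Method Part 2 below
ransomware$prediction <- ransomware_y_hat_rf
black_addresses <- ransomware %>% filter(prediction == "black")
```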
@ -648,7 +648,7 @@ The first working model that ran to completion without exhausting computer resou
##
## NOTE: This is the most computationally heavy part of the paper and takes
## several hours to run to completion. It is also completely optional, only
## used to compare with the better method. If, for some reason, you want to
## used to compare with the quicker method. If, for some reason, you want to
## compile the report without this section, you can just comment it all out
## or remove it because nothing is needed from Method Part 0 for any of the
## other methods. In other words, it can be safely skipped if you are short on
@ -658,7 +658,7 @@ The first working model that ran to completion without exhausting computer resou
# Start timer
tic("Binary SOMs", quiet = FALSE, func.tic = my.msg.tic)
# Keep only numeric columns, ignoring dates and looped.
# Keep only numeric columns, ignoring dates and looped
som1_train_num <- train_set %>% select(length, weight, count, neighbors, income)
# SOM function can only work on matrices
@ -668,7 +668,7 @@ som1_train_mat <- as.matrix(scale(som1_train_num))
som1_test_num <- test_set %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it
# according to how we scaled our training data.
# according to how we scaled our training data
som1_test_mat <-
as.matrix(scale(som1_test_num, center = attr(som1_train_mat, "scaled:center"),
scale = attr(som1_train_mat, "scaled:scale")))
@ -697,7 +697,7 @@ grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
som1_train_grid <-
somgrid(xdim=grid_size, ydim=grid_size, topo="hexagonal", toroidal = TRUE)
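## The grid dimension follows the common heuristic of ~5*sqrt(N) total
## map units; e.g. round(sqrt(5 * sqrt(1e5))) = 40 for 100,000 training
## rows, giving a 40 x 40 hexagonal, toroidal grid.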
## Now build the model.
## Now build the model
som_model1 <- xyf(som1_train_mat, som1_train_bw,
grid = som1_train_grid,
rlen = 100,
@ -706,14 +706,11 @@ som_model1 <- xyf(som1_train_mat, som1_train_bw,
keep.data = TRUE
)
# Now test predictions
som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)
ransomware.prediction1 <- predict(som_model1, newdata = som1_test_list)
# Confusion matrix
som1_cm_bw <-
confusionMatrix(ransomware.prediction1$prediction[[2]], test_set$bw)
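## Headline metrics can be read off the caret confusion-matrix object,
## for example:
som1_cm_bw$overall["Accuracy"]
som1_cm_bw$byClass[c("Sensitivity", "Specificity", "Precision")]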
@ -724,7 +721,7 @@ som1_cm_bw <-
valid_num <- validation %>% select(length, weight, count, neighbors, income)
# Note that when we rescale our testing data we need to scale it
# according to how we scaled our training data.
# according to how we scaled our training data
valid_mat <-
as.matrix(scale(valid_num, center = attr(som1_train_mat, "scaled:center"),
scale = attr(som1_train_mat, "scaled:scale")))
@ -932,7 +929,7 @@ Now we train a new model after removing all *white* addresses. The predictions
tic("Categorical SOMs", quiet = FALSE, func.tic = my.msg.tic)
# Now use this prediction to reduce the original set to only "black" addresses
# First append the full set of predictions to the original set.
# First append the full set of predictions to the original set
ransomware$prediction <- ransomware_y_hat_rf
# Filter out all the predicted "white" addresses,
@ -946,18 +943,18 @@ test_index <- createDataPartition(y = black_addresses$prediction,
train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]
# Keep only numeric columns, ignoring temporal variables.
# Keep only numeric columns, ignoring temporal variables
train_num <- train_set %>%
select(income, neighbors, weight, length, count, looped)
# SOM function can only work on matrices.
# SOM function can only work on matrices
train_mat <- as.matrix(scale(train_num))
# Select non-temporal numerical features only
test_num <- test_set %>%
select(income, neighbors, weight, length, count, looped)
# Testing data is scaled according to how we scaled our training data.
# Testing data is scaled according to how we scaled our training data
test_mat <- as.matrix(scale(test_num,
center = attr(train_mat, "scaled:center"),
scale = attr(train_mat, "scaled:scale")))
@ -1189,13 +1186,13 @@ true positive.** In turn, this number is 27.44 for the best non-TDA models."$^{[
A dual Random Forest approach, used first to isolate the ransomware addresses and then to classify them, might be quick enough to run in under ten minutes on all of the hardware listed. Conversely, a dual SOM method could be built for maximum precision if the necessary computing resources were available.
The script itself has a few areas that could be further optimized. The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized further. The second SOM algorithm could be optimized to correctly predict more of the low-membership families.
The script itself has a few areas that could be improved. The sampling method does what it needs to do, but the ratios taken for each set could be tuned further. The Random Forest algorithm could be trained on more than just two features in an attempt to reduce the number of false positives. The second SOM algorithm could be optimized to correctly predict more of the low-membership families.
Hierarchical clustering was attempted in addition to K-means clustering, but the correct number of families proved difficult to achieve that way, whereas it is a direct input to the K-means method. Another look at the clustering techniques might yield different results. Other clustering techniques exist, such as "Hierarchical K-Means"$^{[13]}$, which could be explored for even more clustering visualizations.
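A minimal sketch of that comparison, run on the SOM codebook vectors (assuming `som_model2` from Method Part 2 and k = 28 families):

```r
codes <- som_model2$codes[[1]]              # codebook vectors of the trained SOM
km <- kmeans(codes, centers = 28)           # K-means: k is a direct input
hc <- cutree(hclust(dist(codes)), k = 28)   # hierarchical: k imposed by cutting
table(kmeans = km$cluster, hclust = hc)     # how closely the partitions agree
```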
### Conclusion
This report presents a reliable method for classifying Bitcoin addresses into known ransomware families, while at the same time avoiding false positives by filtering them out using a binary method before classifying them further. It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy-oriented coins. Certain cryptocurrency networks, such as Monero, utilize privacy features that obfuscate transactions from being analyzed in the same way that the Bitcoin network has been analyzed here. Some progress has been made towards analyzing these networks$^{[9]}$. At the same time, the developers of such networks continually evolve the code to complicate transaction tracking. This could be another good area for future research.
This report presents a reliable method for classifying Bitcoin addresses into known ransomware families while largely avoiding false positives, by filtering out *white* addresses with a binary method before classifying the remaining addresses further. It leaves the author wondering how much harder it would be to perform the same task for ransomware that uses privacy-oriented coins. Certain cryptocurrency networks, such as Monero, utilize privacy features that prevent transactions from being analyzed in the way the Bitcoin network has been analyzed here. Some progress has been made towards analyzing these networks$^{[9]}$, but at the same time the developers of such networks continually evolve the code to complicate transaction tracking. This could be another promising area for future research.
## References
@ -1226,9 +1223,7 @@ Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christ
Statistical Software_, *21*(5), 1-19. doi: 10.18637/jss.v021.i05 (URL:
https://doi.org/10.18637/jss.v021.i05).
- and -
Wehrens R, Kruisselbrink J (2018). “Flexible Self-Organizing Maps in kohonen 3.0.” _Journal of Statistical
[and] Wehrens R, Kruisselbrink J (2018). “Flexible Self-Organizing Maps in kohonen 3.0.” _Journal of Statistical
Software_, *87*(7), 1-18. doi: 10.18637/jss.v087.i07 (URL: https://doi.org/10.18637/jss.v087.i07).
[12] Difference between K means and Hierarchical Clustering (Jul 07, 2021) https://www.geeksforgeeks.org/difference-between-k-means-and-hierarchical-clustering/

Binary file not shown.