Things are finally starting to look balanced, visually speaking. Next: create a few group_by graphs and put them in the right place, and get the scripts working before folding them into the full document.

This commit is contained in:
shelldweller 2021-10-28 21:02:34 -06:00
parent 6801d0f326
commit 4b710cd413
4 changed files with 186 additions and 110 deletions


@ -11,12 +11,13 @@
###############################################################################
# Timer command, uncomment the following lines to time the script.
# The tictoc library needs to be installed ahead of time for this to work.
library(tictoc)
tic(quiet = FALSE)
# Set the repository
# Set the repository mirror to "0-Cloud" for maximum availability
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)
rm(r)


@ -17,6 +17,17 @@ keywords:
output: pdf_document
geometry: margin=2cm
---
```{r tic, echo=FALSE, include=FALSE}
##############################################################################
## Uncomment these commands to time the compilation of the script.
## The tictoc library needs to be installed for this to work.
##############################################################################
library(tictoc)
tic(quiet = FALSE)
```
\def\bitcoinA{%
\leavevmode
\vtop{\offinterlineskip %\bfseries
@ -56,18 +67,18 @@ knitr::knit_hooks$set(chunk = function(x, options) {
With the graph defined as such, the following six numerical features$^{[2]}$ are associated with a given address:
1) *Income* - the total amount of coins sent to an address (decimal value with 8 decimal places)
1) *Income* - the total amount of coins sent to an address
2) *Neighbors* - the number of transactions that have this address as one of their output addresses (integer)
2) *Neighbors* - the number of transactions that have this address as one of their output addresses
3) *Weight* - the sum of the fractions of coins that reach this address from transactions that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions" (decimal value)
3) *Weight* - the sum of the fractions of coins that reach this address from transactions that do not have any other inputs within the 24-hour window, which are referred to as "starter transactions"
4) *Length* - the number of non-starter transactions on its longest chain, where a chain is defined as an
acyclic directed path originating from any starter transaction and ending at the address in question (integer)
acyclic directed path originating from any starter transaction and ending at the address in question
5) *Count* - The number of starter addresses connected to this address through a chain (integer)
5) *Count* - The number of starter addresses connected to this address through a chain
6) *Loop* - The number of starter addresses connected to this address by more than one path (integer)
6) *Loop* - The number of starter addresses connected to this address by more than one path
These variables are defined rather abstractly, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions. Several machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the results will be compared.
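As a quick orientation, the sketch below summarizes the six features as plain numeric columns. This assumes the data set has already been loaded into a data frame named `ransomware` with the column names used in the modeling chunks later in this report (note that the *Loop* feature is stored as `looped`); it is illustrative only, not one of the report's own chunks.

```r
# Minimal sketch: numeric summary of the six topological features.
# Assumes `ransomware` has the columns income, neighbors, weight,
# length, count, and looped, as used elsewhere in this report.
library(dplyr)

ransomware %>%
  select(income, neighbors, weight, length, count, looped) %>%
  summary()
```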
@ -76,11 +87,11 @@ These variables are defined rather abstractly, viewing the blockchain as a topol
This data set was discovered while exploring the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)$^{[4]}$ as suggested in the project instructions. The author of this report, interested in Bitcoin and other cryptocurrencies since (unsuccessfully) mining them on an ASUS netbook in rural Peru in late 2010, used *cryptocurrency* as a preliminary search term. This brought up a single data set entitled ["BitcoinHeist: Ransomware Address Data Set"](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#). The data set was downloaded and the exploration began.
```{r install-load-libraries&download-data, echo=FALSE, include=FALSE}
```{r install-load-libraries-download-data, echo=FALSE, include=FALSE}
# Set the repository
# Set the repository mirror to "0-Cloud" for maximum availability
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
r["CRAN"] = "http://cran.rstudio.com"
options(repos = r)
rm(r)
@ -199,9 +210,8 @@ no_nas <- sum(is.na(ransomware))
```
~ ~ ~ ! ! ! ~ ~ ~
### Exploration and Visualization (rework from here to Modeling Approach)
### Exploration and Visualization (graphic rework from here to Modeling Approach)
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
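Both claims, the sparsity of ransomware addresses and the existence of single-member families, can be checked directly. A minimal sketch, assuming the `label` column described above, where `"white"` marks non-ransomware addresses:

```r
# Fraction of addresses associated with any ransomware family
mean(ransomware$label != "white")

# Family sizes, smallest first, to expose single-member groups
library(dplyr)
ransomware %>%
  count(label) %>%
  arrange(n)
```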
@ -254,52 +264,42 @@ labels <- ransomware$label %>% summary()
knitr::kable(
list(labels[1:15], labels[16:29]),
caption = 'Ransomware group labels and frequency counts for full data set',
booktabs = TRUE
)
booktabs = TRUE)
```
Let's take a look at the distribution of the different features. Note how skewed the non-temporal features are, some of them being bimodal:
Let's take a look at the distribution of the different features. Note how skewed the non-temporal features are, some of them being bimodal; they are easier to read on a log-scale x-axis.
```{r histograms, echo=FALSE, warning=FALSE}
```{r histograms, echo=FALSE, warning=FALSE, fig.align="center"}
########################################################
## Histograms of each of the columns to show skewness
## Plot histograms for each column using facet wrap
########################################################
train_hist <- train_samp %>% select(-address, -label, -bw)
# Remove non-numerical and temporal columns to look for patterns in
# topologically defined features
train_hist <- train_samp %>% select(-address, -label, -bw, -day, -year)
train_long <- train_hist %>% # Apply pivot_longer function
# Apply pivot_longer function to facilitate facet wrapping
train_long <- train_hist %>%
pivot_longer(colnames(train_hist)) %>%
as.data.frame()
# Histograms per column
ggp1 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
facet_wrap(~ name, scales = "free")
ggp1
```
Some of these look better on a log scale x-axis.
```{r log scale histograms, echo=FALSE, warning=FALSE}
# Log scale on value axis, does not make sense for temporal variables
ggp2 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
# Log scale on value axis
histograms <- ggplot(train_long, aes(x = value)) +
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
ggp2
histograms
```
Now we can compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
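The following chunk only prints a precomputed `coeff_vars` object. One way such a table could be built is sketched below, assuming `train_samp` is the sample drawn earlier and that the temporal columns are excluded; this is not necessarily the exact code used by the script.

```r
# Sketch: coefficient of variation (sd / mean) for each numeric feature
library(dplyr)
library(tidyr)

coeff_vars <- train_samp %>%
  select(income, neighbors, weight, length, count, looped) %>%
  summarise(across(everything(), ~ sd(.x) / mean(.x))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "coeff_var")
```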
```{r cv-results, echo=FALSE}
```{r cv-results, echo=FALSE, fig.align="center"}
# Summarize results in a table and a plot
knitr::kable(coeff_vars)
@ -317,16 +317,14 @@ Now do the following (after filling in methods, results, and conclusions, since
```{r shrimp-percentage, echo=FALSE, include=FALSE}
# Addresses that received less than one full bitcoin (income < 10^8 satoshis)
shrimp <- train_samp %>% filter(income < 10^8 )
shrimp <- ransomware %>% filter(income < 10^8 )
```
The proportion of addresses receiving less than one full bitcoin that are associated with ransomware ("black") is `r mean(shrimp$bw == "black")`.
~ ~ ~ ! ! ! ~ ~ ~
### Insights Gained from Exploration
### Insights Gained from Exploration (graphic rework ends here)
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Framing this as a "ransomware or not-ransomware" question allows for the application of methods that would otherwise be impractical.
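A minimal sketch of that binary re-labeling, assuming the `bw` factor used by the later modeling chunks is derived from the `label` column in this way (the scripts may construct it elsewhere):

```r
# Collapse the 28 ransomware families plus "white" into a two-level factor
library(dplyr)

ransomware <- ransomware %>%
  mutate(bw = factor(if_else(label == "white", "white", "black")))
```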
@ -386,12 +384,13 @@ som1_train_bw <- train_set$bw %>% classvec2classmat()
som1_test_bw <- test_set$bw %>% classvec2classmat()
# Create Data list for supervised SOM
#
som1_train_list <-
list(independent = som1_train_mat, dependent = som1_train_bw)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################
## Calculate ideal grid size according to:
## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################
# Formulaic method 1
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
@ -414,7 +413,6 @@ som_model1 <- xyf(som1_train_mat, som1_train_bw,
# Now test predictions
# https://clarkdatalabs.github.io/soms/SOM_NBA
som1_test_list <- list(independent = som1_test_mat, dependent = som1_test_bw)
@ -450,60 +448,19 @@ cm_bw.validation <-
```
Here are some graphs, but be careful with these....
Here are the results of the binary SOM model.
```{r binary som graphs, echo=FALSE}
Test set:
# Be careful with these, some are really large and take a long time to produce.
```{r binary som results1, echo=FALSE}
# Visualize clusters
#plot(som_model1, type = 'mapping', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Distance map
#plot(som_model1, type = 'quality', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize counts
#plot(som_model1, type = 'counts', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize fan diagram
#plot(som_model1, type = 'codes', pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 1
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,1],
#main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 2
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,2],
#main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 3
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,3],
#main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 4
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,4],
#main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
#cat(" \n")
# Visualize heatmap for variable 5
#plot(som_model1, type = 'property', property = som_model$codes[[1]][,5],
#main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
#cat(" \n")
som1_cm_bw %>% as.matrix() %>% knitr::kable()
```
Here are the results of the binary SOM model.
Validation set:
```{r binary som results, echo=FALSE}
som1_cm_bw %>% as.matrix() %>% knitr::kable()
```{r binary som results2, echo=FALSE}
cm_bw.validation %>% as.matrix() %>% knitr::kable()
@ -519,8 +476,8 @@ A Random Forest model is trained using ten-fold cross validation and a tuning gr
```{r random-forest-prep, echo=FALSE, include=FALSE, warning=FALSE}
##############################################################################
## This is a better attempt using Random Forest to model the data set as "black"
## and "white" addresses only.
## This is a better attempt using Random Forest to model the data set as
## "black" and "white" addresses only.
##############################################################################
# Cross Validation, ten fold
@ -551,27 +508,61 @@ cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
The confusion matrix for the test set shows excellent results, specifically in the areas of accuracy and precision.
```{r random-forest-output_test, echo=FALSE}
```{r random-forest-output_test1, echo=FALSE}
# Confusion matrix for test set, overall results and by class.
# Confusion matrix for test set
cm_test %>% as.matrix() %>% knitr::kable()
```
Here are the overall statistics for the test set.
```{r random-forest-output_test2, echo=FALSE}
# overall results
cm_test$overall %>% knitr::kable()
```
Here are the test set results by class.
```{r random-forest-output_test3, echo=FALSE}
# by class.
cm_test$byClass %>% knitr::kable()
```
The confusion matrix for the full ransomware set is very similar to that of the test set.
```{r random-forest-output_big, echo=FALSE}
Here is the confusion matrix for the full ransomware data set.
# Confusion matrix for full ransomware set, overall results and by class.
```{r random-forest-output_big1, echo=FALSE}
# Confusion matrix for full ransomware set
cm_ransomware %>% as.matrix() %>% knitr::kable()
cm_ransomware$overall %>% knitr::kable()
cm_ransomware$byClass %>% knitr::kable()
```
Here are the overall statistics for the full ransomware data set.
```{r random-forest-output_big2, echo=FALSE}
# overall results
cm_ransomware$overall %>% knitr::kable()
```
Here are the full data set results by class.
```{r random-forest-output_big3, echo=FALSE}
# by class.
cm_ransomware$byClass %>% knitr::kable()
```
### Method Part 2: Categorical SOMs to classify predicted ransomware addresses into families
Now we train a new model after discarding all "white" addresses. The predictions from the Random Forest model are used to isolate the predicted "black" addresses for further classification into ransomware families using SOMs.
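A sketch of the hand-off between the two models; `ransomware_y_hat_rf` and `black_addresses` are the object names used in the surrounding chunks, while the exact filtering step shown here is an assumption:

```r
# Attach the Random Forest predictions to the full data set and keep
# only the addresses predicted to be "black" for the categorical SOM stage
library(dplyr)

black_addresses <- ransomware %>%
  mutate(prediction = ransomware_y_hat_rf) %>%
  filter(prediction == "black")
```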
@ -598,7 +589,6 @@ test_index <- createDataPartition(y = black_addresses$prediction,
train_set <- black_addresses[-test_index,]
test_set <- black_addresses[test_index,]
# Keep only numeric columns, ignoring temporal variables.
train_num <- train_set %>%
select(length, weight, count, looped, neighbors, income)
@ -624,8 +614,10 @@ test_label <- test_set$label %>% classvec2classmat()
# Create data list for supervised SOM
train_list <- list(independent = train_mat, dependent = train_label)
# Calculate ideal grid size according to:
# https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################
## Calculate ideal grid size according to:
## https://www.researchgate.net/post/How-many-nodes-for-self-organizing-maps
############################################################################
# Formulaic method 1, makes a larger grid in this case
grid_size <- round(sqrt(5*sqrt(nrow(train_set))))
@ -664,6 +656,8 @@ When selecting the grid size for a Self Organizing Map, there are at least two d
A summary of the results for the categorization of black addresses into ransomware families follows. For the full table of predictions and statistics, see the Appendix.
Here are the overall results.
```{r cm_overall, echo=FALSE}
# Overall section of the confusion matrix formatted through kable()
@ -671,6 +665,8 @@ cm_labels$overall %>% knitr::kable()
```
Here are the results by class.
```{r soms-output-by-class, echo=FALSE, size="tiny"}
# By Class section of the confusion matrix formatted through kable()
@ -678,14 +674,68 @@ cm_labels$byClass %>% knitr::kable()
```
### Clustering Visualizations: K-means clustering
### Clustering Visualizations: Heatmaps and K-means clustering
The following graphs visualize the trained categorical SOM: the mapping of observations onto the grid, the map quality, the node counts, the codebook fan diagram, and a heatmap for each of the six features.
```{r categorical-som-graphs, echo=FALSE, fig.align="center"}
# Be careful with these, some are really large and take a long time to produce.
# Visualize neural network mapping
plot(som_model2, type = 'mapping', pch = 19, palette.name = topo.colors)
cat(" \n")
# Map quality (mean distance to codebook vectors)
plot(som_model2, type = 'quality', pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize counts
plot(som_model2, type = 'counts', pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize fan diagram
plot(som_model2, type = 'codes', pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize heatmap for variable 1
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,1],
main=colnames(train_num)[1], pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize heatmap for variable 2
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,2],
main=colnames(train_num)[2], pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize heatmap for variable 3
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,3],
main=colnames(train_num)[3], pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize heatmap for variable 4
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,4],
main=colnames(train_num)[4], pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize heatmap for variable 5
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,5],
main=colnames(train_num)[5], pch = 19, palette.name = topo.colors)
cat(" \n")
# Visualize heatmap for variable 6
plot(som_model2, type = 'property', property = som_model2$codes[[1]][,6],
main=colnames(train_num)[6], pch = 19, palette.name = topo.colors)
cat(" \n")
```
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
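The hidden setup chunk ends in a `kmeans()` call on the SOM codebook vectors with `centers = n_groups`. One hedged way to pick `n_groups` is sketched below with the elbow method; the report may instead fix the number of clusters by hand.

```r
# Sketch: elbow plot of within-cluster sum of squares for k = 1..15,
# computed on the codebook vectors of the trained categorical SOM.
codes <- data.frame(som_model2$codes[[1]])

wss <- sapply(1:15, function(k) {
  kmeans(codes, centers = k, nstart = 10)$tot.withinss
})

plot(1:15, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```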
```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
## K-Means Clustering to visualize the categorization of the SOM
## For a good tutorial, visit:
## For a good tutorial, see:
## https://www.polarmicrobes.org/microbial-community-segmentation-with-r/
#############################################################################
@ -700,7 +750,8 @@ som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
K-means clustering categorizes the SOM grid by adding boundaries to the classification groups. This is the author's favorite graph in the entire report.
```{r clustering-plot, echo=FALSE, fig.align="center"}
# Plot clustering results
# Plot K-means clustering results
plot(som_model2,
main = 'K-Means Clustering',
type = "property",
@ -786,16 +837,33 @@ Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christ
## Appendix:
### Categorical SOM ransomware family prediction table and confusion matrix - detailed
### Categorical SOM ransomware family prediction table and confusion matrix
Here are the full prediction results for the categorization of black addresses into ransomware families. It is assumed that all white addresses have already been removed.
```{r soms-output-table, echo=FALSE}
# Final results of categorization of "black" addresses
# into ransomware families.
cm_labels
```
```{r toc, echo=FALSE}
# End timer
toc()
```
```{r empty block, echo=FALSE, include=FALSE}
# Comment goes here....
# Use this for other blocks, etc.
##############################################################################
## Description of block goes here.
## Include notes and resources as necessary.
##############################################################################
# First comment goes here.
```

Binary file not shown.


@ -3,6 +3,13 @@
#!! It would be even MORE impressive if I removed all the PREDICTED whites from
#!! the test set instead and started there.
library(tidyverse)
library(caret)
library(randomForest)
library(kohonen)
library(parallel)
library(matrixStats)
blacks <- ransomware %>% filter(label != "white")
# Validation set made from 50% of the BitcoinHeist data; reduce later if possible. Categorical outcomes.