nap time. re-read from beginning and start into chunk3/4 text cleanup AFTER nap.

shelldweller 2021-10-20 11:59:26 -06:00
parent ec4e6c0d96
commit 6876625dfb
2 changed files with 69 additions and 67 deletions


@@ -27,10 +27,6 @@ geometry: margin=2cm
\vbox{\copy2\box0}\box2}}
```{r setup, include=FALSE}
# Load and start timer
library(tictoc)
tic()
knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
# Save the default chunk hook and wrap it to allow per-chunk font sizing
# (used later via the size="tiny" chunk option)
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
@@ -127,7 +123,6 @@ A summary of the data set tells the range of values and size of the sample.
# Summary
ransomware %>% summary() %>% knitr::kable()
```
A listing of the first ten rows provides a sample of the features associated with each observation.
@@ -211,7 +206,10 @@ no_nas <- sum(is.na(ransomware))
```
---
###############################################################################
---
### Exploration and Visualization
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge, since the target observations are sparse within the data set, especially when we consider that they are then divided into 28 ransomware families. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
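This imbalance is easy to quantify directly; a minimal sketch (assuming the `ransomware` data frame and its `label` column as loaded above; not evaluated here):

```{r class-balance-sketch, eval=FALSE}
# Sketch only: proportion of addresses associated with ransomware
mean(ransomware$label != "white")

# Sketch only: size of each ransomware family, smallest first
ransomware %>%
  filter(label != "white") %>%
  count(label) %>%
  arrange(n)
```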
@@ -254,6 +252,7 @@ test_num <- test_samp %>% select(selected_features[1], selected_features[2])
# Binary labels for test set
test_bw <- test_samp$bw
```
```{r data-sparseness, echo=FALSE}
@@ -299,11 +298,11 @@ ggp2 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
facet_wrap(~ name, scales = "free")
ggp2
```
Now we can compare the relative spread of each feature by calculating the coefficient of variation for each column. Larger coefficients of variation indicate larger relative spread compared to other columns.
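The calculation itself is short; a minimal sketch of how a `coeff_vars` vector like the one used below could be computed (assuming `train_num` holds the numeric feature columns; not evaluated here):

```{r cv-sketch, eval=FALSE}
# Sketch only: coefficient of variation = standard deviation / mean, per column
coeff_vars <- sapply(train_num, function(col) sd(col) / mean(col))
sort(coeff_vars, decreasing = TRUE)
```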
```{r cv-results, echo=FALSE}
message("The features with the highest coefficients of variation are ",
@@ -319,65 +318,37 @@ plot(coeff_vars)
From this, it appears that *income* has the widest relative spread, followed by *neighbors*. These are also the features most strongly skewed to the right, meaning that a few addresses have very high values for each of these features while the bulk of the data set has very low values.
Further exploration should break the addresses into groups and examine how the features are distributed within each ransomware group. For example, is ransomware more prevalent on a particular day of the week? Numerical features could likewise be broken into bins and the percentage of ransomware per bin graphed, looking for trends and correlations between groups and variables; the day-of-week idea is sketched below.
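A minimal sketch, assuming the data set's `year` and day-of-year `day` columns and the binary `bw` labels (the column semantics are an assumption; not evaluated here):

```{r weekday-sketch, eval=FALSE}
# Sketch only: convert year + day-of-year to a weekday,
# then compute the percentage of ransomware addresses per weekday
train_samp %>%
  mutate(weekday = weekdays(as.Date(paste(year, day), format = "%Y %j"))) %>%
  group_by(weekday) %>%
  summarize(pct_black = 100 * mean(bw == "black"))
```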
```{r shrimp-percentage, echo=FALSE, include=FALSE}
# Find wallets that have received less than one full bitcoin
# (income is measured in satoshis; 10^8 satoshis = 1 BTC)
shrimp <- train_samp %>% filter(income < 10^8)
```
```{r shrimp-output, echo=FALSE}
# Proportion of these sub-1-BTC wallets that are ransomware-related ("black")
mean(shrimp$bw == "black")
```
Principal Component Analysis offers another view of the numeric features. (See "Interlinkages of Malaysian Banking Systems" for an example of detailed PCA.) PCA is usually treated as a form of exploratory analysis, although the paper cited extends it into a form of predictive modeling; here it is applied only as exploration.
```{r pca, echo=FALSE, include=FALSE}
# Scale the numeric training features before PCA
train_scaled <- scale(train_num)

# (Optional) View distances between points of a sample to look for patterns.
# Left commented out because the resulting image renders too large.
#d <- dist(train_scaled)
#image(as.matrix(d), col = rev(RColorBrewer::brewer.pal(9, "RdBu")))

# Principal Component Analysis
pca <- prcomp(train_scaled)
pca
summary(pca)

# Standard deviation of each principal component
pc <- 1:ncol(train_scaled)
qplot(pc, pca$sdev)

# Plot the first two PCs, with color representing black/white status
data.frame(pca$x[, 1:2], bw = train_samp$bw) %>%
  sample_n(200) %>%
  ggplot(aes(PC1, PC2, fill = bw)) +
  geom_point(cex = 3, pch = 21) +
  coord_fixed(ratio = 1)

# Note: the first two dimensions do NOT preserve distance very well
#d_approx <- dist(pca$x[, 1:2])
#qplot(d, d_approx) + geom_abline(color = "red")
```
---
###############################################################################
---
### Insights Gained from Exploration
From the previous visual and statistical exploration of the data, it becomes clear what the challenge is. Ransomware addresses are very sparse in the data set, making up less than 2% of the addresses. That small percentage is also further classified into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are impractical otherwise.
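A minimal sketch of that binary relabeling (the actual `bw` column used throughout was presumably created along these lines; not evaluated here):

```{r binary-labels-sketch, eval=FALSE}
# Sketch only: collapse all ransomware families into a single "black" class
ransomware <- ransomware %>%
  mutate(bw = factor(ifelse(label == "white", "white", "black")))
```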
---
## Modeling Approach
@@ -387,7 +358,8 @@ data.frame(pca$x[,1:2], bw=train_samp$bw) %>%
The modeling approach evolved in stages: categorical SOMs were tried first, then binary SOMs, and then Random Forests were applied to the binary problem, with surprisingly good results. Categorical SOMs were then re-applied, but only to the addresses predicted to be "black" by the binary Random Forest model. The result is the following two-step approach, with optional clustering visualizations at the end.
### Method Part 1: Binary Random Forests to isolate ransomware addresses first.
```{r random-forest-prep, echo=FALSE, include=FALSE, warning=FALSE}
# Cross Validation, ten fold
control <- trainControl(method="cv", number = 10)
@@ -412,24 +384,34 @@ cm_test <- confusionMatrix(y_hat_rf, test_bw)
ransomware_y_hat_rf <- predict(fit_rf, ransomware)
cm_ransomware <- confusionMatrix(ransomware_y_hat_rf, ransomware$bw)
```
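The body of this chunk is elided above; a minimal sketch of the training and prediction steps it presumably contains (the `fit_rf` and `y_hat_rf` names come from the surrounding code, `train_bw` is assumed by symmetry with `test_bw`, and the exact `train()` call is an assumption; not evaluated here):

```{r random-forest-sketch, eval=FALSE}
# Sketch only: fit a random forest on the selected features with 10-fold CV
fit_rf <- train(x = train_num, y = train_bw,
                method = "rf", trControl = control)

# Sketch only: predict binary labels for the held-out test set
y_hat_rf <- predict(fit_rf, test_num)
```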
Here are the results for the test set.
```{r random-forest-output_test, echo=FALSE}
message("Overall accuracy for the binary separation is ",
cm_test$overall["Accuracy"])
cm_test %>% as.matrix() %>% knitr::kable()
cm_test$overall %>% knitr::kable()
cm_test$byClass %>% knitr::kable()
```
Here are the results for the full original set.
```{r random-forest-output_big, echo=FALSE}
message("Overall accuracy for the full data set is ",
        cm_ransomware$overall["Accuracy"])
cm_ransomware %>% as.matrix() %>% knitr::kable()
cm_ransomware$overall %>% knitr::kable()
cm_ransomware$byClass %>% knitr::kable()
```
### Method Part 2: Categorical SOMs to categorize predicted ransomware addresses.
Now we train a new model after throwing away all "white" addresses.
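A minimal sketch of that filtering step (the `blacks` name is hypothetical; not evaluated here):

```{r black-only-sketch, eval=FALSE}
# Sketch only: keep ransomware-related addresses and drop the unused "white" level
blacks <- ransomware %>%
  filter(bw == "black") %>%
  mutate(label = droplevels(label))
```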
```{r soms-prep, echo=FALSE, include=FALSE}
##############################################################################
@@ -512,24 +494,30 @@ cm_labels <- confusionMatrix(ransomware_group.prediction$prediction[[2]],
test_set$label)
```
The grid size was selected programmatically rather than tuned by hand; one common heuristic for sizing a SOM grid is sketched after the chunk below.
```{r grid-size, echo=FALSE}
message("A grid size of ", grid_size, " has been chosen.")
```
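One widely used rule of thumb for sizing a SOM (whether this exact formula was used here is an assumption) sets the total node count near 5 * sqrt(N) for N training samples and keeps the grid roughly square:

```{r grid-size-sketch, eval=FALSE}
# Sketch only: an n-by-n grid with roughly 5 * sqrt(N) total nodes;
# train_set is assumed to be the SOM training set from the chunk above
n_obs <- nrow(train_set)
grid_size <- ceiling(sqrt(5 * sqrt(n_obs)))
```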
Here is a summary of the results for the categorization of black addresses into ransomware families. For the full table of predictions and statistics, see the Appendix.
message("Overall accuracy is ", cm_labels$overall["Accuracy"])
```{r soms-output, echo=FALSE, size="tiny"}
cm_labels$overall %>% knitr::kable()
cm_labels$byClass %>% knitr::kable()
```
### Clustering Visualizations: K-means clustering
K-means clustering offers a nice way of visualizing the final SOM grid and the categorical boundaries that were formed by the model.
```{r clustering-setup, echo=FALSE, include=FALSE}
#############################################################################
## K-Means Clustering to visualize the categorization of the SOM
@@ -545,6 +533,8 @@ som.cluster <- kmeans(data.frame(som_model2$codes[[1]]), centers=n_groups)
```
The following plot overlays the k-means cluster boundaries on the trained SOM grid.
```{r clustering-plot, echo=FALSE}
# Plot clustering results
plot(som_model2,
@@ -556,16 +546,18 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
```
---
## Results & Performance (chunk #4, write up after chunk #3 is done)
### Results
In the original paper by Akcora et al., several different sets of parameters were tested on their TDA model. According to them, "In the best TDA models for each ransomware family, we predict **16.59 false positives for each true positive**. In turn, this number is 27.44 for the best non-TDA models."[3] In fact, the highest Precision (a.k.a. Positive Predictive Value, defined as TP/(TP+FP)) they achieved was only 0.1610. Compare this to our final Precision value of 1.000. The gap is almost embarrassing, and such a result deserves scrutiny for mistakes such as leakage between the training and test sets.
### Performance
The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, using even moderate computing resources.
## Summary
@@ -597,7 +589,7 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[4] UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/index.php](https://archive.ics.uci.edu/ml/index.php)
[5] BitcoinHeist Ransomware Address Dataset [https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset)
[6] Available Models - The `caret` package [http://topepo.github.io/caret/available-models.html](http://topepo.github.io/caret/available-models.html)
@@ -605,10 +597,20 @@ bitcoin transaction graph](https://doi.org/10.1007/s41109-020-00261-7)
[XMR] Malte Möser, Kyle Soska, Ethan Heilman, Kevin Lee, Henry Heffan, Shashvat Srivastava,
Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christin (April 23, 2018) [An Empirical Analysis of Traceability in the Monero Blockchain](https://arxiv.org/pdf/1704.04299/)
```{r end-timer, echo=FALSE}
# End timer and report total elapsed run time
message("....that's all, folks!")
toc()
```
\newpage
## Appendix
### Categorical SOM ransomware family prediction table and confusion matrix - detailed
```{r soms-output-table, echo=FALSE}
cm_labels
```
