quick reboot, things are acting weird, but the graphs should have labels that do not overlap now, at least....

shelldweller 2021-11-01 09:45:40 -06:00
parent 4b710cd413
commit 310cb800f8
7 changed files with 48 additions and 12 deletions


@ -3,7 +3,7 @@ title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain usin
subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "10/31/2021"
date: "11/11/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives and improve on previous results."
keywords:
- Bitcoin
@ -78,7 +78,7 @@ acyclic directed path originating from any starter transaction and ending at the
5) *Count* - The number of starter addresses connected to this address through a chain
6) *Loop* - The number of starter addresses connected to this address by more than one path
6) *Looped* - The number of starter addresses connected to this address by more than one path
These variables are defined rather abstractly, viewing the blockchain as a topological graph with nodes and edges. The rationale behind this approach is to quantify specific transaction patterns. Akcora$^{[3]}$ gives a thorough explanation in the original paper of how and why these features were chosen. We shall treat the features as general numerical variables and will not seek to justify their definitions. Several machine learning methods will be applied to the original data set from the paper by Akcora$^{[3]}$, and the results will be compared.
@ -171,7 +171,7 @@ The original research team downloaded and parsed the entire Bitcoin transaction
All of the analysis in this report was conducted on a single laptop computer, a Lenovo Yoga S1 from late 2013 with the following specs:
- CPU: Intel i7-4600U @ 3.300GHz (4th Gen quad-core i7)
- CPU: Intel i7-4600U @ 3.300GHz (4th Gen quad-core i7 x86_64)
- RAM: 8217MB DDR3L @ 1600 MHz (8 GB)
- OS: Slackware64-current (15.0 RC1) `x86_64-slackware-linux-gnu` (64-bit GNU/Linux)
- R version 4.0.0 (2020-04-24) -- "Arbor Day" (built from source using scripts from [slackbuilds.org](https://slackbuilds.org/))
@ -210,8 +210,9 @@ no_nas <- sum(is.na(ransomware))
```
#########################################################################
### Exploration and Visualization (graphic rework from here to Modeling Approach)
### Exploration and Visualization
The ransomware addresses make up less than 2% of the overall data set. This presents a challenge as the target observations are sparse within the data set, especially when we consider that this is then divided into 28 subsets. In fact, some of the ransomware groups have only a single member, making categorization a dubious task. At least there are no missing values to worry about.
@ -219,7 +220,7 @@ no_nas <- sum(is.na(ransomware))
# Keep only numeric columns, ignoring temporal features
ransomware_num <- ransomware %>%
select(length, weight, count, looped, neighbors, income)
select(income, neighbors, weight, length, count, looped)
# Check for variation across numerical columns using coefficients of variation
#
@ -291,7 +292,8 @@ histograms <- ggplot(train_long, aes(x = value)) +
geom_density(col = "green", size = .5) +
scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
histograms
histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))
```
@ -323,8 +325,9 @@ shrimp <- ransomware %>% filter(income < 10^8 )
The percentage of wallets with less than one full bitcoin as their balance is `r mean(shrimp$bw == "black")` .
###############################################################################
### Insights Gained from Exploration (graphic rework ends here)
### Insights Gained from Exploration
From the previous visual and statistical exploration of the data, the challenge becomes clear. Ransomware-related addresses are very sparse in the data set, making up less than 2% of all addresses. That small percentage is further divided into 28 groups. Perhaps the original paper was a bit too ambitious in trying to categorize all the addresses into 29 categories, including the "white" addresses. To simplify our approach, we will categorize the addresses in a binary way, either "white" or "black", where "black" signifies an association with ransomware transactions. Asking this as a "ransomware or not-ransomware" question allows for application of methods that are known to be impractical otherwise.
@ -591,14 +594,14 @@ test_set <- black_addresses[test_index,]
# Keep only numeric columns, ignoring temporal variables.
train_num <- train_set %>%
select(length, weight, count, looped, neighbors, income)
select(income, neighbors, weight, length, count, looped)
# SOM function can only work on matrices.
train_mat <- as.matrix(scale(train_num))
# Select non-temporal numerical features only
test_num <- test_set %>%
select(length, weight, count, looped, neighbors, income)
select(income, neighbors, weight, length, count, looped)
# Testing data is scaled according to how we scaled our training data.
test_mat <- as.matrix(scale(test_num,
@ -777,12 +780,20 @@ add.cluster.boundaries(som_model2, som.cluster$cluster)
### Performance
The script runs on the aforementioned hardware in less than five minutes and uses less than 4GB of RAM. Given that the Bitcoin network produces one new block every ten minutes on average, real-time analysis could theoretically be conducted on each block as it is announced, even with moderate computing resources. Just for kicks, the final script was also run on a more humble computer with the following specifications:
#### ASUS Eee PC 1025C
- CPU: Intel Atom N2600 @ 1.600GHz (64-bit Intel Atom quad-core)
- CPU: Intel Atom N2600 @ 1.600GHz (64-bit Intel Atom quad-core x86)
- RAM: 3911MB DDR3 @ 800 MT/s (4 GB)
This is a computer known for being slow and clunky. Even on this device, which runs the same operating system and software as the hardware listed previously, the total run time for the script is around 1665 seconds. At nearly 28 minutes, this is not fast enough to analyze the Bitcoin blockchain in real time, but it does show that the script can be run on very modest hardware to completion.
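The comparison against the block interval can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted above (1665 seconds of run time, one block every ten minutes on average); the variable names are illustrative and do not come from the report's script.

```r
# Figures quoted in the text; variable names are hypothetical.
run_time_s <- 1665            # total script run time on the Eee PC, in seconds
block_interval_s <- 10 * 60   # average Bitcoin block interval, in seconds

# Convert the run time to minutes for comparison with the text.
run_time_min <- run_time_s / 60

# Average number of blocks announced while one run completes.
blocks_per_run <- run_time_s / block_interval_s

# Real-time analysis requires finishing before the next block arrives.
keeps_up <- run_time_s < block_interval_s

print(run_time_min)
print(blocks_per_run)
print(keeps_up)
```

On these figures the run takes 27.75 minutes, so nearly three blocks would be announced on average before a single run finished.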
#### Pine64 Quartz64
- CPU: Rockchip RK3566 SoC aarch64 (64-bit quad-core ARM)
- RAM: DDR4 xxxxMB (8 GB)
Single board computer / Development board. This was run to benchmark a modern 64-bit ARM processor. The script runs in about xxxx minutes on this platform, just for reference.
---

Binary file not shown.


@ -91,7 +91,8 @@ ggp2 <- ggplot(train_long, aes(x = value)) + # Draw each column as histogram
geom_density(col = "green", size = .5) +
scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
ggp2
ggp2 + theme(axis.text.x = element_text(size = 8))
# Clean up environment

scratch/graphtests.R Normal file

@ -0,0 +1,16 @@
train_hist <- train_samp %>% select(-address, -label, -bw, -day, -year)
# Apply pivot_longer function to facilitate facet wrapping
train_long <- train_hist %>%
pivot_longer(colnames(train_hist)) %>%
as.data.frame()
# Log scale on value axis,
histograms <- ggplot(train_long, aes(x = value) )+
geom_histogram(aes(y = ..density..), bins=20) +
geom_density(col = "green", size = .5) +
scale_x_continuous(trans='log2') +
facet_wrap(~ name, scales = "free")
histograms + theme(axis.text.x = element_text(size = 8, angle=30, hjust=1))


@ -0,0 +1,8 @@
for(i in seq_along(levels(ransomware$label))){
print(
ransomware %>% filter(label==levels(ransomware$label)[i]) %>%
select(income) %>%
ggplot(aes(x=income, y = ..density..)) + geom_histogram(bins=30) +
geom_density(col = "green", size = .5)
)
}