diff --git a/Detecting_Bitcoin_Ransomware.Rmd b/Detecting_Bitcoin_Ransomware.Rmd
new file mode 100644
index 0000000..6964873
--- /dev/null
+++ b/Detecting_Bitcoin_Ransomware.Rmd
@@ -0,0 +1,229 @@
+---
+title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain
+subtitle: \vspace{.5in}HarvardX Final Capstone CYO Project
+  \vspace{.5in}
+author: "Kaylee Robert Tejeda"
+date: "10/31/2021"
+abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. There is much interest in detecting and tracking transactions made to ransomware operators. Many attempts at this have not relied on sophisticated machine learning methods, and even those that have done so yield models with poor specificity. A two-step method is developed to possibly improve on previous results."
+output: pdf_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
+```
+
+\newpage
+ 
+\vspace{25pt}
+\tableofcontents
+
+\newpage
+
+## Introduction
+
+ Definitions and motivations. Try to complete one section per day. Turn this in sometime in the next week.
+
+### Data
+
+ Cite the original paper that the data is from. Specifically, describe how each variable is defined.
+
+```{r data-prep, echo=FALSE }
+
+# Install necessary packages
+if(!require(tidyverse)) install.packages("tidyverse")
+if(!require(caret)) install.packages("caret")
+
+# Load Libraries
+library(tidyverse)
+library(caret)
+
+# Download data
+url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
+dest_file <- "data/data.zip"
+if(!dir.exists("data")) dir.create("data")
+if(!file.exists(dest_file)) download.file(url, destfile = dest_file)
+
+# Unzip
+if(!file.exists("data/BitcoinHeistData.csv")) unzip(dest_file, "BitcoinHeistData.csv", exdir="data")
+
+# Import data from CSV
+ransomware <- read_csv("data/BitcoinHeistData.csv")
+
+# Turn labels into factors, grey is a binary factor for ransomware/non-ransomware
+ransomware <- ransomware %>% mutate(label=as.factor(label), grey=as.factor(ifelse(label=="white", "white", "black")))
+
+# Validation set made from 50% of BitcoinHeist data, reduce later if possible. Binary outcomes (grey)
+test_index <- createDataPartition(y = ransomware$grey, times = 1, p = .5, list = FALSE)
+
+workset <- ransomware[-test_index,]
+validation <- ransomware[test_index,]
+
+# Split the working set into a training set and a test set @ 50%, reduce later if possible. Binary outcomes (grey)
+test_index <- createDataPartition(y = workset$grey, times = 1, p = .5, list = FALSE)
+
+train_set <- workset[-test_index,]
+test_set <- workset[test_index,]
+
+# Clean up environment
+rm(dest_file, url)
+
+
+```
+
+### Goal
+
+ What is the goal of this paper? Besides graduating?
+
+### Outline of Steps Taken
+
+1)
+
+2)
+
+3) ...
+
+
+
+## Exploration & Visualization
+
+ I need better graphs. I have plenty, but I need them to look better and/or have more labels, etc.
+
+
+
+## Data Analysis
+
+### Preparation
+
+ What did I do to prepare the data?
+
+### Exploration and Visualization
+
+```{r visuals, echo=FALSE}
+# Do some graphical exploration before applying any models.
+# Look at the example work for some ideas.
+# Add any compelling visuals as needed here.
+
+# ?? Cluster graphs go at the end.
+
+# Install matrixStats package if needed
+if(!require(matrixStats)) install.packages("matrixStats")
+
+# Load matrixStats library
+library(matrixStats)
+
+## Principal Component Analysis
+
+names(ransomware)
+str(ransomware)
+
+# Sample every 100th row due to memory constraints
+train_samp <- train_set[seq(1, nrow(train_set), 100), ]
+
+# What percentage of sample is ransomware?
+mean(train_samp$grey=="black")
+
+# Keep only numeric columns
+train_num <- train_samp %>% select(year, day, length, weight, count, looped, neighbors, income)
+
+# Center and scale the numeric columns
+train_scaled <- train_num %>% scale()
+
+
+# Histograms of each of the columns to show skewness
+train_num$year %>% hist(main = paste("Histogram of","year"))
+
+train_num$day %>% hist(main = paste("Histogram of","day"))
+
+train_num$length %>% hist(main = paste("Histogram of","length"))
+
+train_num$weight %>% hist(main = paste("Histogram of","weight"))
+
+train_num$count %>% hist(main = paste("Histogram of","count"))
+
+train_num$looped %>% hist(main = paste("Histogram of","looped"))
+
+train_num$neighbors %>% hist(main = paste("Histogram of","neighbors"))
+
+train_num$income %>% hist(main = paste("Histogram of","income"))
+
+# Check for variability across numerical columns using coefficients of variation (sd / mean)
+sds <- train_num %>% as.matrix() %>% colSds()
+means <- train_num %>% as.matrix() %>% colMeans()
+coeff_vars <- sds / means
+plot(coeff_vars)
+coeff_vars
+
+# View distances between points of a sample to look for patterns
+# This one seems to be problematic unless I can make the image smaller somehow...
+#x <- train_scaled %>% as.matrix()
+#d <- dist(x)
+#image(as.matrix(d), col = rev(RColorBrewer::brewer.pal(9, "RdBu"))) # Change colors or Orange/Blue
+
+# Principal Component Analysis
+pca <- prcomp(train_scaled)
+pca
+summary(pca)
+
+pc <- 1:ncol(train_scaled)
+qplot(pc, pca$sdev)
+
+# Plot the first two PCs with color representing black/white
+data.frame(pca$x[,1:2], Grey=train_samp$grey) %>%
+  sample_n(200) %>%
+  ggplot(aes(PC1,PC2, fill = Grey))+
+  geom_point(cex=3, pch=21) +
+  coord_fixed(ratio = 1)
+
+# First two dimensions do NOT preserve distance very well
+#d_approx <- dist(pca$x[, 1:2])
+#qplot(d, d_approx) + geom_abline(color="red")
+
+# Clean up environment (x and d are only created by the commented-out code above)
+rm(pca, coeff_vars, means, pc, sds)
+
+```
+
+### Insights Gained from Exploration
+
+### Modeling approach
+
+ An overview of why I picked the methods that I did. The original paper suggested that Random Forests were hard to apply here, and the data was all topological to begin with, which led me to SOMs. Also, describe the reasoning behind the binary approach. Describe what you learned about SOMs.
+
+#### Random Forests
+
+#### Self Organizing Maps
+
+
+
+### Method 1: Binary Random Forests
+
+If we ask a simpler question, is this a useful approach?
+
+### Method 2: Binary SOMs
+
+If we ask the same question to a more sophisticated and
+### Method 3: Categorical SOMs
+
+### Final Method: Combined Methods 1 and 3
+
+## Results & Performance
+
+### Results
+
+### Performance
+
+ In terms of what? Time? RAM?
+
+## Summary
+
+### Comparison to original paper and impact of findings
+
+### Limitations
+
+### Future Work
+
+### Conclusions
+
+ Get Monero!
+
+
diff --git a/Detecting_Bitcoin_Ransomware.pdf b/Detecting_Bitcoin_Ransomware.pdf
new file mode 100644
index 0000000..39f815f
Binary files /dev/null and b/Detecting_Bitcoin_Ransomware.pdf differ