Started new Rmd document, got outline done at least.

shelldweller 2021-10-03 22:55:21 -06:00
parent 6d44655fde
commit 39bd24e059
2 changed files with 229 additions and 0 deletions

---
title: \vspace{1in}Detecting Ransomware Addresses on the Bitcoin Blockchain
subtitle: \vspace{.5in}HarvardX Final Capstone CYO Project \vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "10/31/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. There is much interest in detecting and tracking transactions made to ransomware operators. Many attempts at doing so have not relied on sophisticated machine learning methods, and even those that have tend to produce models with poor specificity. A two-step method is developed here in an attempt to improve on previous results."
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, out.width="400px", dpi=120)
```
\newpage
 
\vspace{25pt}
\tableofcontents
\newpage
## Introduction
Definitions and motivations. Try to complete one section per day. Turn this in sometime in the next week.
### Data
Cite the original paper that the data set comes from. Specifically, describe how each variable is defined.
```{r data-prep, echo=FALSE }
# Install necessary packages
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
# Load Libraries
library(tidyverse)
library(caret)
# Download data
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if(!dir.exists("data")) dir.create("data")
if(!file.exists(dest_file)) download.file(url, destfile = dest_file)
# Unzip
if(!file.exists("data/BitcoinHeistData.csv")) unzip(dest_file, "BitcoinHeistData.csv", exdir = "data")
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
# Turn labels into factors, grey is a binary factor for ransomware/non-ransomware
ransomware <- ransomware %>% mutate(label=as.factor(label), grey=as.factor(ifelse(label=="white", "white", "black")))
# Validation set made from 50% of BitcoinHeist data, reduce later if possible. Binary outcomes (grey)
test_index <- createDataPartition(y = ransomware$grey, times = 1, p = .5, list = FALSE)
workset <- ransomware[-test_index,]
validation <- ransomware[test_index,]
# Split the working set into a training set and a test set @ 50%, reduce later if possible. Binary outcomes (grey)
test_index <- createDataPartition(y = workset$grey, times = 1, p = .5, list = FALSE)
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
# Clean up environment
rm(dest_file, url)
```
### Goal
What is the goal of this paper? Besides graduating?
### Outline of Steps Taken
1)
2)
3) ...
## Exploration & Visualization
I need better graphs. I have plenty, but I need them to look better and/or have more labels, etc.
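As a note for that polishing pass, below is a minimal sketch (not evaluated) of adding a title, axis labels, and a legend title with ggplot2's labs(); the income and grey columns are only placeholders for whatever variables the final figures end up using.
```{r plot-labels-sketch, eval=FALSE}
# Sketch of a more fully labeled plot; the variables shown are placeholders
train_set %>%
  ggplot(aes(x = log10(income), fill = grey)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of income by address type",
       x = "log10(income)",
       y = "Number of addresses",
       fill = "Address type")
```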
## Data Analysis
### Preparation
What did I do to prepare the data?
### Exploration and Visualization
```{r visuals, echo=FALSE}
# Do some graphical exploration before applying any models.
# Look at the example work for some ideas.
# Add any compelling visuals as needed here.
# ?? Cluster graphs go at the end.
# Install matrixStats package if needed
if(!require(matrixStats)) install.packages("matrixStats")
# Load matrixStats library
library(matrixStats)
## Principal Component Analysis
names(ransomware)
str(ransomware)
# Sample every 100th row due to memory constraints
train_samp <- train_set[seq(1, nrow(train_set), 100), ]
# What percentage of sample is ransomware?
mean(train_samp$grey=="black")
# Keep only numeric columns
train_num <- train_samp %>% select(year, day, length, weight, count, looped, neighbors, income)
# Center and scale the numeric columns
train_scaled <- train_num %>% scale()
# Histograms of each of the columns to show skewness
for (column in names(train_num)) {
  hist(train_num[[column]], main = paste("Histogram of", column), xlab = column)
}
# Check for variability across numerical columns using coefficients of variation
sds <- train_num %>% as.matrix() %>% colSds()
means <- train_num %>% as.matrix() %>% colMeans()
coeff_vars <- sds / means
plot(coeff_vars)
coeff_vars
# View distances between points of a sample to look for patterns
# This one seems to be problematic unless I can make the image smaller somehow...
#x <- train_scaled %>% as.matrix()
#d <- dist(x)
#image(as.matrix(d), col = rev(RColorBrewer::brewer.pal(9, "RdBu"))) # Change colors or Orange/Blue
# Principal Component Analysis
pca <- prcomp(train_scaled)
pca
summary(pca)
pc <- 1:ncol(train_scaled)
qplot(pc, pca$sdev)
# Plot the first two PCs with color representing black/white
data.frame(pca$x[,1:2], Grey=train_samp$grey) %>%
sample_n(200) %>%
ggplot(aes(PC1,PC2, fill = Grey))+
geom_point(cex=3, pch=21) +
coord_fixed(ratio = 1)
# First two dimensions do NOT preserve distance very well
#d_approx <- dist(pca$x[, 1:2])
#qplot(d, d_approx) + geom_abline(color="red")
# Clean up environment
rm(pca, coeff_vars, means, pc, sds)
```
### Insights Gained from Exploration
### Modeling approach
An overview of why I picked the methods that I did. The original paper suggests that Random Forests are hard to apply to this data set and that the features are topological to begin with, which is what led me to SOMs. Also, describe the reasoning behind the binary approach and what I learned about SOMs along the way.
#### Random Forests
#### Self Organizing Maps
### Method 1: Binary Random Forests
If we ask a simpler question (is an address ransomware at all, rather than which ransomware family it belongs to), is this a useful approach?
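A rough sketch of what this step might look like, assuming the randomForest package (caret's train() would be an alternative) and reusing the train_samp sample and test_set created in the chunks above; the chunk is not evaluated here.
```{r binary-rf-sketch, eval=FALSE}
# Assumed package choice for this sketch; the final model may use caret instead
if(!require(randomForest)) install.packages("randomForest")
library(randomForest)
# Fit a binary (grey: black vs. white) random forest on the numeric features
rf_fit <- randomForest(grey ~ year + day + length + weight + count +
                         looped + neighbors + income,
                       data = train_samp, ntree = 100)
# Predict on the test set and check sensitivity/specificity
rf_preds <- predict(rf_fit, test_set)
confusionMatrix(rf_preds, test_set$grey)
```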
### Method 2: Binary SOMs
If we ask the same binary question of a more sophisticated, topology-aware method, do the results improve?
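A sketch of the binary SOM step, following the supervised-mapping pattern from the kohonen package documentation and reusing train_samp and test_set from above; the grid size and rlen are placeholder values, and the chunk is not evaluated.
```{r binary-som-sketch, eval=FALSE}
# Assumed package choice for this sketch
if(!require(kohonen)) install.packages("kohonen")
library(kohonen)
# Scale the numeric training features, then apply the same scaling to the test set
features <- c("year", "day", "length", "weight", "count", "looped", "neighbors", "income")
train_mat <- scale(as.matrix(train_samp[, features]))
test_mat <- scale(as.matrix(test_set[, features]),
                  center = attr(train_mat, "scaled:center"),
                  scale = attr(train_mat, "scaled:scale"))
# Train a supervised SOM with the binary grey outcome as a second data layer
som_grid <- somgrid(xdim = 10, ydim = 10, topo = "hexagonal")
som_fit <- supersom(list(measurements = train_mat, grey = train_samp$grey),
                    grid = som_grid, rlen = 100)
# Map the test set (measurements only) onto the SOM and read off predicted classes
som_pred <- predict(som_fit, newdata = list(measurements = test_mat))
confusionMatrix(som_pred$predictions[["grey"]], test_set$grey)
```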
### Method 3: Categorical SOMs
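Again only a sketch, reusing train_mat, test_mat, and the kohonen setup from the Method 2 chunk, but with the full multi-class label column as the outcome layer instead of the binary grey factor.
```{r categorical-som-sketch, eval=FALSE}
# Same supervised-SOM pattern as Method 2, but predicting the ransomware family
som_cat_fit <- supersom(list(measurements = train_mat, label = train_samp$label),
                        grid = somgrid(xdim = 10, ydim = 10, topo = "hexagonal"),
                        rlen = 100)
# Predict a family for each test address and compare to the true labels
som_cat_pred <- predict(som_cat_fit, newdata = list(measurements = test_mat))
table(predicted = som_cat_pred$predictions[["label"]], actual = test_set$label)
```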
### Final Method: Combined Methods 1 and 3
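A sketch of how the two models might be chained, reusing rf_fit from Method 1 and som_cat_fit, features, and train_mat from Method 3: the random forest acts as a binary filter, and only the addresses it flags as black are passed to the categorical SOM for family assignment.
```{r combined-sketch, eval=FALSE}
# Step 1: flag likely ransomware addresses with the binary random forest
flagged <- test_set[predict(rf_fit, test_set) == "black", ]
# Step 2: hand only the flagged addresses to the categorical SOM
flagged_mat <- scale(as.matrix(flagged[, features]),
                     center = attr(train_mat, "scaled:center"),
                     scale = attr(train_mat, "scaled:scale"))
flagged_pred <- predict(som_cat_fit, newdata = list(measurements = flagged_mat))
# Addresses the forest calls white stay white; the SOM assigns a family to the rest
table(flagged_pred$predictions[["label"]])
```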
## Results & Performance
### Results
### Performance
In terms of what? Time? RAM?
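Whatever the final framing, both can be measured directly; the sketch below (not evaluated) times the Method 1 random forest with system.time() and checks memory footprints with object.size().
```{r performance-sketch, eval=FALSE}
# Wall-clock time for fitting and for predicting with the binary random forest
fit_time <- system.time(
  rf_fit <- randomForest(grey ~ year + day + length + weight + count +
                           looped + neighbors + income,
                         data = train_samp, ntree = 100)
)
pred_time <- system.time(rf_preds <- predict(rf_fit, test_set))
fit_time["elapsed"]
pred_time["elapsed"]
# Approximate memory footprint of the full data set and of the fitted model
format(object.size(ransomware), units = "MB")
format(object.size(rf_fit), units = "MB")
```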
## Summary
### Comparison to original paper and impact of findings
### Limitations
### Future Work
### Conclusions
Get Monero!
