Initial Commit

shelldweller 2021-09-19 21:52:11 -06:00
parent 70d6fd7c40
commit e1db85998f
3 changed files with 232 additions and 0 deletions

.gitignore

@@ -0,0 +1,8 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
s41109-020-00261-7.pdf
data
1906.07852.pdf
Ransomware-Bitcoin-Addresses.Rproj

@@ -0,0 +1,46 @@
# Install necessary packages
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
# Load Libraries
library(tidyverse)
library(caret)
# Download data
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if (!dir.exists("data")) dir.create("data")
if (!file.exists(dest_file)) download.file(url, destfile = dest_file)
# Unzip data archive into CSV file
if (!file.exists("data/BitcoinHeistData.csv")) unzip(dest_file, "BitcoinHeistData.csv", exdir = "data")
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
# Validation set made from 10% of BitcoinHeist data
# Seed the RNG so the partitions are reproducible (the seed value is arbitrary)
set.seed(1)
test_index <- createDataPartition(y = ransomware$label, times = 1, p = 0.1, list = FALSE)
workset <- ransomware[-test_index,]
temp <- ransomware[test_index,]
# Make sure addresses in validation set are also in working set
validation <- temp %>%
  semi_join(workset, by = "address")
# Add rows removed from validation set back into working set
removed <- anti_join(temp, validation)
workset <- rbind(workset, removed)
# Split the working set into a training set and a test set
test_index <- createDataPartition(y = workset$label, times = 1, p = 0.1, list = FALSE)
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
# Clean up environment
rm(dest_file, url, temp, removed, ransomware, test_index)
# Inspect data frames
test_set %>% str()
test_set %>% head()
train_set %>% str()
train_set %>% head()

@@ -0,0 +1,178 @@
---
title: "Ransomware-Bitcoin-Addresses"
author: "Kaylee Robert Tejeda"
date: "9/19/2021"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Bitcoin Heist Ransomware Address Dataset

https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset

https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip

## BitcoinHeistRansomwareAddressDataset Data Set
Abstract: The BitcoinHeist dataset contains address features on the heterogeneous Bitcoin network to identify ransomware payments.

| Property | Value |
|---|---|
| Data Set Characteristics | Multivariate, Time-Series |
| Number of Instances | 2916697 |
| Area | Computer |
| Attribute Characteristics | Integer, Real |
| Number of Attributes | 10 |
| Date Donated | 2020-06-17 |
| Associated Tasks | Classification, Clustering |
| Missing Values? | N/A |
| Number of Web Hits | 35019 |

Source:

- Cuneyt Gurcan Akcora (cuneyt.akcora '@' umanitoba.ca), University of Manitoba, Canada
- Yulia Gel (ygl '@' utdallas.edu), University of Texas at Dallas, USA
- Murat Kantarcioglu (muratk '@' utdallas.edu), University of Texas at Dallas, USA

Data Set Information:

We downloaded and parsed the entire Bitcoin transaction graph from January 2009 to December 2018. Using a time interval of 24 hours, we extracted daily transactions on the network and formed the Bitcoin graph. We filtered out the network edges that transfer less than 0.3 BTC, since ransom amounts are rarely below this threshold.

Ransomware addresses are taken from three widely adopted studies: Montreal, Princeton and Padua. Please see the BitcoinHeist article for references.
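
The daily graph construction and the 0.3 bitcoin edge filter described above can be sketched in base R. This is only an illustration on a made-up edge list: the data frame `edges`, its column names, and its values are assumptions, not the dataset authors' actual pipeline.

```r
# Toy edge list; column names and values are invented for illustration.
edges <- data.frame(
  from   = c("a1", "a2", "a3", "a4"),
  to     = c("b1", "b2", "b3", "b4"),
  amount = c(0.05, 0.30, 1.20, 0.29),  # transferred amount in bitcoin
  day    = c(1, 1, 2, 2)               # index of the 24-hour window
)

threshold <- 0.3  # edges transferring less than this amount are dropped

# Keep only the edges at or above the threshold.
kept <- edges[edges$amount >= threshold, ]

# Group the surviving edges into one edge list per 24-hour window.
daily_graphs <- split(kept, kept$day)
```

Here `daily_graphs` is a named list with one data frame of edges per day, which is the shape a per-day graph feature extraction would start from.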
Attribute Information:

Features

- address: String. Bitcoin address.
- year: Integer. Year.
- day: Integer. Day of the year. 1 is the first day, 365 is the last day.
- length: Integer.
- weight: Float.
- count: Integer.
- looped: Integer.
- neighbors: Integer.
- income: Integer. Satoshi amount (1 bitcoin = 100 million satoshis).
- label: Category String. Name of the ransomware family (e.g., CryptXXX, CryptoLocker, etc.) or white (i.e., not known to be ransomware).

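
Since income is recorded in satoshis, converting it to bitcoin is a single division by 100 million. The sketch below uses hypothetical income values chosen for illustration; the names `satoshi_per_btc`, `income_satoshi`, and `income_btc` are not part of the dataset.

```r
satoshi_per_btc <- 1e8  # 1 bitcoin = 100 million satoshis

# Hypothetical income values in satoshis, for illustration only.
income_satoshi <- c(30000000, 100000000, 250000000)

# Convert to bitcoin.
income_btc <- income_satoshi / satoshi_per_btc
print(income_btc)
```

Note that R's native integer type is 32-bit (maximum 2147483647, about 21.5 bitcoin in satoshis), so larger satoshi amounts have to be held as doubles rather than integers.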
Our graph features are designed to quantify specific transaction patterns. Loop is intended to count how many transactions (i) split their coins, (ii) move these coins through the network along different paths, and finally (iii) merge them in a single address. Coins at this final address can then be sold and converted to fiat currency. Weight quantifies the merge behavior (i.e., the transaction has more input addresses than output addresses), where coins in multiple addresses are each passed through a succession of merging transactions and accumulated in a final address. Similar to weight, the count feature is designed to quantify the merging pattern. However, the count feature represents information on the number of transactions, whereas the weight feature represents information on the amount (what percent of these transactions' output?) of transactions. Length is designed to quantify mixing rounds on Bitcoin, where transactions receive and distribute similar amounts of coins in multiple rounds with newly created addresses to hide the coin origin.
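
As a rough illustration of the merge pattern described above, a transaction can be tagged by comparing its number of input and output addresses. The text only defines merging (more inputs than outputs); treating the opposite case as a split is an assumption made here, and the data frame `tx` with its columns is invented for the example. This is not how the dataset's weight or count features are actually computed.

```r
# Toy transactions; IDs and address counts are invented for illustration.
tx <- data.frame(
  tx_id     = c("t1", "t2", "t3"),
  n_inputs  = c(5, 1, 2),
  n_outputs = c(1, 4, 2)
)

# "merge": more input addresses than output addresses (as in the text).
# "split": more outputs than inputs (assumed mirror-image definition).
tx$pattern <- ifelse(tx$n_inputs > tx$n_outputs, "merge",
              ifelse(tx$n_inputs < tx$n_outputs, "split", "other"))
```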
White Bitcoin addresses are capped at 1K per day (Bitcoin has 800K addresses daily).
Note that although we are certain about ransomware labels, we do not know if all white addresses are in fact not related to ransomware.
When compared to non-ransomware addresses, ransomware addresses exhibit more pronounced right skewness in the distributions of their feature values.
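
Right skewness can be checked numerically with the sample skewness, taken here as the third standardized moment (one common definition among several). The sketch below uses synthetic samples only; `skewness`, `symmetric_sample`, and `right_skewed_sample` are illustrative names, and none of the values come from the BitcoinHeist data.

```r
# Sample skewness as the third standardized moment.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Synthetic data for illustration only.
set.seed(1)
symmetric_sample    <- rnorm(1000)   # roughly symmetric, skewness near 0
right_skewed_sample <- rlnorm(1000)  # log-normal has a long right tail

skewness(symmetric_sample)
skewness(right_skewed_sample)
```

Applied per feature and per label group, a clearly larger positive value for the ransomware group would be consistent with the claim above.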
Relevant Paper:
1 - Rivera-Castro, R., Pilyugina, P., & Burnaev, E. (2019, November). Topological Data Analysis for Portfolio Management of Cryptocurrencies. In 2019 International Conference on Data Mining Workshops (ICDMW) (pp. 238-243). IEEE.
https://arxiv.org/abs/1906.07852
https://arxiv.org/pdf/1906.07852
Citation Request:
@article{akcora2019bitcoinheist,
title={BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain},
author={Akcora, Cuneyt Gurcan and Li, Yitao and Gel, Yulia R and Kantarcioglu, Murat},
journal={arXiv preprint [Web Link]},
year={2019}
}
```{r data-prep}
# Install necessary packages
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
# Load Libraries
library(tidyverse)
library(caret)
# Download data
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip"
dest_file <- "data/data.zip"
if (!dir.exists("data")) dir.create("data")
if (!file.exists(dest_file)) download.file(url, destfile = dest_file)
# Unzip data archive into CSV file
if (!file.exists("data/BitcoinHeistData.csv")) unzip(dest_file, "BitcoinHeistData.csv", exdir = "data")
# Import data from CSV
ransomware <- read_csv("data/BitcoinHeistData.csv")
# Validation set made from 10% of BitcoinHeist data
# Seed the RNG so the partitions are reproducible (the seed value is arbitrary)
set.seed(1)
test_index <- createDataPartition(y = ransomware$label, times = 1, p = 0.1, list = FALSE)
workset <- ransomware[-test_index,]
temp <- ransomware[test_index,]
# Make sure addresses in validation set are also in working set
validation <- temp %>%
  semi_join(workset, by = "address")
# Add rows removed from validation set back into working set
removed <- anti_join(temp, validation)
workset <- rbind(workset, removed)
# Split the working set into a training set and a test set
test_index <- createDataPartition(y = workset$label, times = 1, p = 0.1, list = FALSE)
train_set <- workset[-test_index,]
test_set <- workset[test_index,]
# Clean up environment
rm(dest_file, url, temp, removed, ransomware, test_index)
# Inspect data frames
test_set %>% str()
test_set %>% head()
train_set %>% str()
train_set %>% head()
```
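
The semi-join and anti-join steps in the chunk above can be hard to picture on a dataset of nearly three million rows, so here is a minimal base R sketch of the same logic on toy data. The data frames `toy_workset` and `toy_temp` and their contents are invented; the `%in%` filters are base R stand-ins for `semi_join()` and `anti_join()`, assumed to behave the same way for this address-based match.

```r
# Toy stand-ins for `workset` and `temp`; addresses and labels are invented.
toy_workset <- data.frame(address = c("A", "B", "C"),
                          label   = c("white", "white", "x"),
                          stringsAsFactors = FALSE)
toy_temp    <- data.frame(address = c("A", "D"),
                          label   = c("white", "x"),
                          stringsAsFactors = FALSE)

# Like semi_join(temp, workset, by = "address"):
# keep held-out rows whose address also occurs in the working set.
in_workset     <- toy_temp$address %in% toy_workset$address
toy_validation <- toy_temp[in_workset, ]

# Like the anti_join step: rows dropped from the validation set...
toy_removed <- toy_temp[!in_workset, ]
# ...are returned to the working set so that no observation is lost.
toy_workset <- rbind(toy_workset, toy_removed)
```

After these steps every validation address also appears in the working set, and the two sets together still contain all of the original rows.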