proofreading done.

shelldweller 2021-11-13 12:10:00 -07:00
parent 42dd93894c
commit 325fe17095
1 changed files with 10 additions and 17 deletions


@@ -4,7 +4,7 @@ subtitle: \vspace{.5in}HarvardX PH125.9x Final Capstone CYO Project
\vspace{.5in}
author: "Kaylee Robert Tejeda"
date: "11/11/2021"
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. While many attempts towards this goal have not made use of sophisticated machine learning methods, even those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
abstract: "Ransomware is a persistent and growing threat in the world of cybersecurity. A specific area of focus involves detecting and tracking payments made to ransomware operators. Many attempts towards this goal have not made use of sophisticated machine learning methods. Those that have often result in models with poor precision or other performance issues. A two-step method is developed to address the issue of false positives."
keywords:
- Bitcoin
- blockchain
@@ -617,15 +617,6 @@ ransomware_big_families %>%
It appears that, although the `r selected_features[1]` distribution for ransomware groups does differ from the distribution pattern for *white* addresses, it also varies from group to group. For this reason, this makes a good feature to use in the training of the models.
```{r shrimp-percentage, echo=FALSE, include=FALSE}
# Count how many wallets have less than one hundred bitcoins
shrimp <- ransomware %>% filter(income < 10^10 )
```
The percentage of wallets with less than one hundred bitcoins as their balance is `r mean(shrimp$bw == "black")`. I have no idea why this is meaningful, but I can calculate it at least. **What else can I do here?** [A few more of these calculations might be good enough to wrap this section up, actually.]
### Insights gained from exploration
After visually and numerically exploring the data, it becomes clear what the challenge is. Ransomware-related addresses are very sparse, comprising `r ransomprop*100`% of all addresses. This small percentage is also further classified into 28 groups. Perhaps the original paper was overly ambitious in trying to categorize all the addresses into 29 categories, including the vastly prevalent *white* addresses. To simplify our approach, we will categorize the addresses in a binary way: as either *white* or *black*, where *black* signifies an association with ransomware transactions. Framing this as a "ransomware or not-ransomware" question allows for application of methods that have been shown to be impractical otherwise.
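The binary relabeling described above can be sketched in a few lines of base R. This is only an illustration on a toy table: the column name `label`, the placeholder family names, and the variable `ransomprop_toy` are assumptions made for the example, not names taken from the script.

```r
# Toy stand-in for the address table; `label` and the family names are hypothetical.
addresses <- data.frame(
  label = c("white", "white", "familyA", "white", "familyB", "white"),
  stringsAsFactors = FALSE
)

# Collapse every ransomware family into a single "black" class.
addresses$bw <- factor(
  ifelse(addresses$label == "white", "white", "black"),
  levels = c("black", "white")
)

# Proportion of ransomware-related addresses in the toy table.
ransomprop_toy <- mean(addresses$bw == "black")

# Class counts, which make the imbalance visible at a glance.
print(table(addresses$bw))
```

On real data the *black* class would be far smaller than in this toy table, which is the imbalance the rest of the report has to deal with.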
@@ -1194,11 +1185,11 @@ true positive.** In turn, this number is 27.44 for the best non-TDA models."$^{[
### Future Work
I only scratched he surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation. For example, the grid size used to train the SOM was calculated using an algorithm based on the size of the training set, and while this performed better than a grid size based on the number of categories, it may not be ideal. Optimization around grid size could still be carried out. Hexagonal grids with toroidal topology were the only type used. Other types, such as square grids and non-toroidal topology are also possible, and may also be worth investigating.
We only scratched the surface of the SOM algorithm, which seems to have many implementations and parameters that could be investigated further and possibly optimized via cross-validation. For example, the grid size used to train the SOM was calculated using an algorithm based on the size of the training set, and while this performed better than a grid size based on the number of categories, it may not be ideal. Optimization around grid size could still be carried out. Hexagonal grids with toroidal topology were the only type used. Other types, such as square grids and non-toroidal topology are also possible, and may also be worth investigating.
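As a rough illustration of how a grid size can be derived from the training-set size, the sketch below uses the common rule of thumb of about $5\sqrt{N}$ map units. This heuristic is an assumption chosen for the example and is not necessarily the algorithm used in the script; `som_grid_dim` and `n_train` are hypothetical names. The `kohonen::somgrid` calls run only if the package is installed, and they show how the alternative grid types mentioned above could be requested.

```r
# Assumed heuristic: roughly 5 * sqrt(N) units, arranged as a square grid.
som_grid_dim <- function(n_train) {
  units <- 5 * sqrt(n_train)
  side <- ceiling(sqrt(units))
  # Round up to an even side length as a precaution for hexagonal toroidal maps.
  side + side %% 2
}

n_train <- 10000                    # hypothetical training-set size
grid_dim <- som_grid_dim(n_train)   # side length of the square grid

if (requireNamespace("kohonen", quietly = TRUE)) {
  # The configuration described in the text: hexagonal units on a torus.
  hex_toroidal <- kohonen::somgrid(grid_dim, grid_dim,
                                   topo = "hexagonal", toroidal = TRUE)
  # One of the untested alternatives: a flat rectangular grid.
  square_flat <- kohonen::somgrid(grid_dim, grid_dim,
                                  topo = "rectangular", toroidal = FALSE)
}
```

A cross-validated search could then loop over several candidate side lengths around `grid_dim` instead of relying on a single heuristic value.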
A dual Random Forest approach, used first to isolate the ransomware addresses and then to classify them, might be quick enough to run in under ten minutes on all the hardware listed. Conversely, a dual SOM method could be created for maximum precision if the necessary computing resources were available.
The script itself has a few areas that could be further optimization. The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized further. The second SOM algorithm could be optimized to correctly predict more of the low-membership families.
The script itself has a few areas that could be further optimized. The sampling method does what it needs to do, but the ratios taken for each set could possibly be optimized further. The second SOM algorithm could be optimized to correctly predict more of the low-membership families.
Hierarchical clustering was attempted in addition to K-means clustering. The correct number of families was difficult to achieve, whereas it is a direct input of the K-means method. Another look at the clustering techniques might yield different results. Other clustering techniques exist, such as "Hierarchical K-Means"$^{[13]}$, which could be explored for even more clustering visualizations.
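To make the comparison concrete, here is a minimal base R sketch on synthetic two-dimensional data. It is not the script's clustering step: the data, the choice of `k`, and the Ward linkage are all assumptions made for illustration. It shows that `kmeans()` takes the number of clusters directly, while a hierarchical tree from `hclust()` has to be cut afterwards, for example with `cutree()`.

```r
set.seed(1)

# Synthetic data: three well-separated groups of 20 points each (hypothetical).
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

k <- 3  # assumed number of families for this toy example

# K-means: the number of clusters is a direct input.
km <- kmeans(x, centers = k, nstart = 10)

# Hierarchical clustering: build the full tree, then cut it into k groups.
hc <- hclust(dist(x), method = "ward.D2")
hc_clusters <- cutree(hc, k = k)

# Cross-tabulate the two assignments to see how far they agree.
print(table(kmeans = km$cluster, hclust = hc_clusters))
```

Cutting the tree at a height (`cutree(hc, h = ...)`) instead of at a fixed `k` is the case where the resulting number of groups is harder to control.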
@@ -1231,12 +1222,14 @@ Kyle Hogan, Jason Hennessey, Andrew Miller, Arvind Narayanan, and Nicolas Christ
[10] KR Tejeda, Detecting Bitcoin Ransomware, https://git.disroot.org/shelldweller/ransomware
[11b] Wehrens R, Kruisselbrink J (2018). “Flexible Self-Organizing Maps in kohonen 3.0.” _Journal of Statistical
Software_, *87*(7), 1-18. doi: 10.18637/jss.v087.i07 (URL: https://doi.org/10.18637/jss.v087.i07).
[11a] Wehrens R, Buydens LMC (2007). “Self- and Super-Organizing Maps in R: The kohonen Package.” _Journal of
[11] Wehrens R, Buydens LMC (2007). “Self- and Super-Organizing Maps in R: The kohonen Package.” _Journal of
Statistical Software_, *21*(5), 1-19. doi: 10.18637/jss.v021.i05 (URL:
https://doi.org/10.18637/jss.v021.i05).
https://doi.org/10.18637/jss.v021.i05).
- and -
Wehrens R, Kruisselbrink J (2018). “Flexible Self-Organizing Maps in kohonen 3.0.” _Journal of Statistical
Software_, *87*(7), 1-18. doi: 10.18637/jss.v087.i07 (URL: https://doi.org/10.18637/jss.v087.i07).
[12] Difference between K means and Hierarchical Clustering (Jul 07, 2021) https://www.geeksforgeeks.org/difference-between-k-means-and-hierarchical-clustering/