Co-author: Trishala Neeraj, Data Scientist at CyberCube.
Cyber attacks have a massive impact on a global economy that relies ever more heavily on technology. As internet connectivity increases, more individuals and enterprises become vulnerable to cyber attacks. Furthermore, in recent years, cyber-related news has attracted significant attention from media outlets as well as viewers. Many journalists are now dedicated to covering technology news in general, and cyber news in particular.
In this work, we conducted a correlation analysis over four years of cyber-related news articles obtained from the Global Database of Events, Language, and Tone (GDELT). We applied both supervised and unsupervised text analysis techniques to understand spatial, temporal and distributional topic patterns. Experimental results show interesting trends with respect to cyber attacks such as ransomware, data breaches and denial of service attacks, as well as more general cyber-related concepts such as cryptocurrency. This work helps practitioners understand a rapidly evolving spectrum of cyber events.
This material was recently presented at the 5th IEEE International Conference on Data Science in Cyberspace (IEEE DSC 2020). Below is a summary of findings:
We explored the use of both supervised and unsupervised machine learning algorithms on a large-scale data set of cyber-related news articles spanning 2016 to 2019. We applied a variety of traditional as well as state-of-the-art text analysis methods. The resulting insights into the text patterns and associations of cyber-related concepts are summarized below:
- A temporal correlation analysis using a bag-of-words representation shows an increasing statistical trend in areas such as phishing, scamming, Remote Desktop Protocol (RDP) and cryptocurrency, and a decreasing one in areas such as malware and denial of service attacks. These trends may indicate an increase (or decrease) in the volume of events related to those areas, in the media attention they receive, or both.
- Using a similar model, but trained on the spatial aspect of the events, we observed strong correlations between particular countries and cyber-related concepts. These correlations can be used to infer attribution (e.g., WannaCry and SamSam have their highest positive correlations with North Korea and Iran, respectively). They can also help in understanding the focus of cyber events and/or cyber news coverage (e.g., hacking correlates positively with the United States but negatively with the United Kingdom).
- A sentiment correlation model showed the highest correlation between very negative news coverage and both hacking and scamming. This correlation can serve as a proxy for the severity of these attacks, which result in the loss of large amounts of personal and identifiable information.
- Well-developed cyber-related "topics" emerged from the news published between 2016 and 2019. We observed how topics evolved over the years, i.e., how the probability of each topic changed in the topic mixture that the Latent Dirichlet Allocation (LDA) algorithm produces. We noticed that increasing or decreasing trends in topics often coincided with major cyber attacks or political events.
- A document embedding analysis with dimensionality reduction shows a distinctive grouping of news articles. These groups were diverse in their coverage with an average of 1,142 unique news sources per group.
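To make the temporal correlation idea in the first finding more concrete, here is a minimal sketch in Python. It is not the model used in the paper: it assumes a simplified setup in which each article is reduced to a set of words (a binary bag-of-words), and the trend for a term is taken to be the Pearson correlation between the publication year and the share of that year's articles mentioning the term. The corpus `ARTICLES` and the helper names `tokenize`, `yearly_share`, `pearson` and `temporal_trend` are hypothetical and exist only for illustration.

```python
import math
import re
from collections import defaultdict

# Hypothetical toy corpus of (publication year, headline) pairs.
# The real study used a far larger set of GDELT news articles.
ARTICLES = [
    (2016, "Malware campaign hits banks with denial of service attack"),
    (2016, "New malware strain spreads through email attachments"),
    (2017, "Ransomware outbreak follows malware infection"),
    (2017, "Phishing emails target hospital staff"),
    (2018, "Phishing scam steals cryptocurrency wallets"),
    (2018, "Cryptocurrency exchange reports phishing attack"),
    (2019, "Phishing and cryptocurrency fraud continue to rise"),
    (2019, "Cryptocurrency miners abuse phishing kits"),
]


def tokenize(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def yearly_share(articles, term):
    """Return the sorted years and, per year, the share of articles containing the term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in articles:
        totals[year] += 1
        if term in tokenize(text):
            hits[year] += 1
    years = sorted(totals)
    return years, [hits[y] / totals[y] for y in years]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def temporal_trend(articles, term):
    """Correlate publication year with the term's yearly share of articles."""
    years, shares = yearly_share(articles, term)
    return pearson(years, shares)


if __name__ == "__main__":
    for term in ("phishing", "cryptocurrency", "malware"):
        print(f"{term}: {temporal_trend(ARTICLES, term):+.2f}")
```

On this toy data, "phishing" and "cryptocurrency" come out with positive correlations and "malware" with a negative one, which mirrors the direction of the trends reported above. With only four yearly data points, such a coefficient says nothing about statistical significance. The same pattern could in principle be adapted to the spatial and sentiment findings by replacing the year with a country indicator or a tone score, though that adaptation is likewise an assumption rather than the paper's method.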
The IEEE gathering was a great place to begin the debate on modeling and understanding cyber news. As the field is ever-growing, it is very important to build tools to effectively extract insights from cyber news at scale. This research is a major step forward towards achieving that goal.