As an industry, we spend a great deal of time talking about how badly we need claims data, particularly within the relatively new cyber realm. After all, the core actuarial functions of pricing and reserving rely heavily on empirical information. In established lines with many sophisticated players, superior extraction of useful information from a historical database is often the only way to build a genuine edge over the competition. Yet while such information undoubtedly provides great insight within a growing cyber line, we need to be acutely aware of several aspects of the data that give us good reason to pause before nonchalantly feeding numbers into an algorithm that could ultimately shape corporate strategy. In no particular order, these aspects are:
Quality
Despite multiple ongoing efforts to collect comprehensive breach claim and related financial loss information, we face several realities when using such sources. Firstly, and perhaps most obviously, merging multiple sources means dealing with deduplication, which is not always a trivial task depending on how cleanly event metadata is captured. Secondly, we have to concern ourselves with how accurately and usefully claims are described and categorized, as a too-concise, typo-riddled, or confusing description destroys value immediately. Thirdly, because financial loss figures are understandably difficult for an outsider to obtain, we encounter many cases where loss data is missing entirely even when categorical or qualitative metadata exists.
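To make the deduplication point a bit more concrete, here is a minimal sketch of the kind of matching involved, assuming two hypothetical breach sources with illustrative field names; the similarity threshold and date window are arbitrary choices, and real event metadata is rarely this clean.

```python
from datetime import date
from difflib import SequenceMatcher

# Hypothetical records pulled from two breach databases; all field names and
# values are illustrative only, not drawn from any particular data source.
source_a = [
    {"entity": "Acme Retail, Inc.", "event_date": date(2016, 3, 1), "records": 40_000},
    {"entity": "Globex Health",     "event_date": date(2016, 7, 12), "records": 1_200},
]
source_b = [
    {"entity": "ACME Retail Inc",   "event_date": date(2016, 3, 3), "records": 41_000},
    {"entity": "Initech Bank",      "event_date": date(2016, 9, 30), "records": 500},
]

def normalize(name: str) -> str:
    """Crude normalization: lowercase and strip punctuation and a common suffix."""
    return (name.lower()
                .replace(",", "")
                .replace(".", "")
                .replace(" inc", "")
                .strip())

def likely_duplicate(rec_a, rec_b, name_threshold=0.85, max_days_apart=7) -> bool:
    """Flag two records as probable duplicates if the entity names are similar
    and the reported event dates fall within a small window."""
    name_sim = SequenceMatcher(None, normalize(rec_a["entity"]),
                               normalize(rec_b["entity"])).ratio()
    days_apart = abs((rec_a["event_date"] - rec_b["event_date"]).days)
    return name_sim >= name_threshold and days_apart <= max_days_apart

# Merge: keep everything from source A, plus only those source B records that
# do not look like duplicates of something already kept.
merged = list(source_a)
for rec_b in source_b:
    if not any(likely_duplicate(rec_a, rec_b) for rec_a in source_a):
        merged.append(rec_b)

for rec in merged:
    print(rec["entity"], rec["event_date"], rec["records"])
```

In practice the matching logic tends to grow far messier, drawing on additional metadata such as industry, geography, and record counts, precisely because names and dates alone are captured so inconsistently.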
Completeness
Currently, data breach regulation in the United States is marked by a significant lack of standardization. Each state maintains a different set of stipulations on who must be notified, as well as when and how that notification needs to occur. Determining a notification threshold across many states (for multi-state conglomerates) thus becomes a headache, as the threshold varies by state depending on what information is exposed, how many people are affected, and whether or not that information is encrypted. It is also worth noting that, even where mandatory breach notification laws exist, there may historically have been events in which an affected entity chose not to notify and instead absorbed penalties and fines, judging the reputational hit of disclosure to be the more severe cost. Knowing that data breaches are but one type of attritional claim of interest, and that reporting of even breaches may be spotty in places, we need to be aware of the potential incompleteness of any historical claims database.
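One pragmatic, if crude, way to reflect this incompleteness in frequency work is to gross up observed event counts by an assumed reporting probability. The sketch below illustrates the idea via Poisson thinning; the underlying rate and the reporting probability are purely illustrative assumptions, not empirical figures.

```python
import numpy as np

# If true breach events arrive as Poisson(lam) per year but each event is reported
# with probability p, the observed count is Poisson(lam * p); dividing the observed
# rate by an assumed reporting probability recovers the underlying rate.
# Both figures below are assumptions chosen for illustration only.
rng = np.random.default_rng(seed=1)
true_annual_rate = 150.0
assumed_reporting_rate = 0.70

true_events = rng.poisson(true_annual_rate, size=10_000)        # events that actually occurred
observed = rng.binomial(true_events, assumed_reporting_rate)    # events that made it into the data

estimated_rate = observed.mean() / assumed_reporting_rate
print(f"Observed mean: {observed.mean():.1f}, grossed-up estimate: {estimated_rate:.1f}")
```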
Bias
Notably and naturally, different industry reports on breach cost potential focus on different slices of the breach population. Some may focus on breaches of a certain size (e.g. with the number of records exposed < ###), while others concentrate on mega breaches. This focus matters when interpreting results, and we have to be careful not to generalize findings too broadly. While there may be richer data on breaches that have gained media attention, it does not follow that ultimate loss figures for those breaches can be relied on as credible forward-looking estimates. Each breach is different, and we must recognize that variability.
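The statistical consequence of such slicing is a form of truncation bias. As a rough sketch under assumed parameters, the example below fits a lognormal severity curve to a sample that only captures losses above a reporting threshold, first naively and then with the truncation acknowledged; the distribution choice, threshold, and parameter values are all illustrative assumptions.

```python
import numpy as np
from scipy import stats, optimize

# Simulate "true" breach severities from a lognormal, then mimic a database that
# only captures losses above a reporting threshold.  All parameters are illustrative.
rng = np.random.default_rng(seed=7)
mu, sigma = 11.0, 1.5            # assumed true log-mean / log-sd of severity (USD)
threshold = 200_000.0            # assumed threshold below which losses never reach the database
losses = rng.lognormal(mu, sigma, size=20_000)
captured = losses[losses > threshold]

# Naive fit: pretend the captured sample represents the whole population.
naive_mu = np.mean(np.log(captured))
naive_sigma = np.std(np.log(captured))

# Truncation-aware fit: maximize the left-truncated lognormal likelihood.
def neg_loglik(params):
    m, s = params
    if s <= 0:
        return np.inf
    dist = stats.lognorm(s=s, scale=np.exp(m))
    return -(np.sum(dist.logpdf(captured)) - len(captured) * np.log(dist.sf(threshold)))

fit = optimize.minimize(neg_loglik, x0=[naive_mu, naive_sigma], method="Nelder-Mead")
print("True params:      ", (mu, sigma))
print("Naive fit:        ", (round(float(naive_mu), 2), round(float(naive_sigma), 2)))
print("Truncation-aware: ", tuple(round(float(v), 2) for v in fit.x))
```

The naive fit typically overstates the log-mean and understates the spread relative to the truncation-aware fit, which is exactly the kind of distortion that creeps in when a report's slice of the breach population is generalized too broadly.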
Relevance
Within the cyber realm, to use a cliché, the only constant is change. The threat landscape shifts often and abruptly, and the presence of thinking adversaries behind attack events adds further complexity to modelling. As victims recover from attacks and build resiliency for the future, attackers are simultaneously probing new attack vectors, attempting to stay one step ahead of inevitable patches. Given this ever-changing peril, we simply need to ask whether data (perfect or imperfect) from long ago remain relevant today. Are we using claims data built around historical attack types to parameterize the attacks of tomorrow?
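There is no clean fix for this, but one common pragmatic response is to down-weight older experience when estimating frequency or severity. The sketch below applies an exponential decay weight by experience year; the claim counts and half-life are hypothetical, and decay weighting remains only a partial answer when the peril itself is shifting rather than merely drifting.

```python
# Hypothetical annual claim counts by experience year (purely illustrative).
claim_counts = {2012: 30, 2013: 34, 2014: 41, 2015: 55, 2016: 63}
valuation_year = 2017
half_life_years = 2.0   # assumption: relevance of experience halves every two years

def decay_weight(year: int) -> float:
    """Exponential decay weight so that older years influence the estimate less."""
    age = valuation_year - year
    return 0.5 ** (age / half_life_years)

weights = {yr: decay_weight(yr) for yr in claim_counts}
weighted_frequency = (sum(w * claim_counts[yr] for yr, w in weights.items())
                      / sum(weights.values()))

simple_average = sum(claim_counts.values()) / len(claim_counts)
print(f"Simple average frequency: {simple_average:.1f}")
print(f"Decay-weighted frequency: {weighted_frequency:.1f}")
```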
Granularity
We can safely assume that as a company recovers from a breach, publicly disclosing actual financial losses by cost component, both before and after the application of insurance terms and conditions, is not high on its to-do list. Unfortunately, there is little incentive for a company to ever report information at this much-needed level of granularity. The exact split across notification costs, legal liability, data restoration, and so on is both generally unavailable and critical for the actuarial modelling exercise, demanding creative solutions that carefully maneuver around the pitfalls of blindly trusting empirical data.
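One such creative solution is to allocate a reported total loss across cost components using assumed proportions and then apply policy terms component by component. The sketch below illustrates the mechanics with a hypothetical split, sublimit, and retention; in practice these assumptions would need to be sourced from industry cost studies or underwriting judgment rather than invented.

```python
# Hypothetical allocation of a reported gross loss across cost components,
# followed by application of a per-component sublimit and an overall retention.
# Every figure here is an assumption for illustration, not market data.
gross_loss = 2_500_000.0

assumed_split = {            # assumed share of gross loss by component
    "notification":     0.15,
    "legal_liability":  0.40,
    "forensics":        0.20,
    "data_restoration": 0.15,
    "pr_response":      0.10,
}
sublimits = {"notification": 250_000.0}   # e.g. an assumed notification-cost sublimit
retention = 100_000.0                     # assumed policy-level self-insured retention

# Allocate, cap each component at its sublimit (if any), then net off the retention.
component_losses = {k: gross_loss * share for k, share in assumed_split.items()}
capped = {k: min(v, sublimits.get(k, float("inf"))) for k, v in component_losses.items()}
insured_loss = max(sum(capped.values()) - retention, 0.0)

for k, v in capped.items():
    print(f"{k:>16}: {v:>12,.0f}")
print(f"{'insured loss':>16}: {insured_loss:>12,.0f}")
```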
While there is certainly overlap among these categories, they nonetheless represent a core set of aspects to keep in mind as we obtain, cleanse, prepare, and use data for cyber pricing or attritional loss modelling. Honestly and openly working to understand the caveats and potential pitfalls up front will help us build robustness into modelled results and address gaps more clearly in a domain that will continue to evolve rapidly.