Since CrowdStrike pushed a faulty update to customers on July 19, the cyber (re)insurance industry has had much to consider. Concern about catastrophic cyber risk is regularly identified as one of the main factors constraining the growth of the market. It’s frequently said in analytics and modeling circles that we need events to learn, improve models, and increase our confidence in what they tell us. Following the impact of the CrowdStrike outage – coined CrowdOut by CyberCube’s Cyber Aggregation Event Response Service (CAERS) team – we now have an important opportunity to do those things.
These lessons are costly to learn – CyberCube has estimated a total insurance cost that could range from $400 million to $1.5 billion – so let’s make sure we glean as much from this event as possible. Five themes stand out for me.
The first is a question I’ve been curious about for years: how Business Interruption (BI) and Contingent Business Interruption (CBI) claims actually get calculated and paid. The BI and CBI calculations in modeling are not complicated – start with a company’s annual revenue, divide by 365, then by 24 to get hourly revenue, multiply by the number of hours of downtime, apply a profit margin, and you’re close to finished. The math is simple, but that doesn’t mean it’s right. There’s the matter of income (or profit) truly lost versus merely deferred. Forensic accountants become involved, and policyholder claims will be subject to considerable scrutiny.
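For concreteness, here is that arithmetic as a minimal sketch in Python; the revenue, downtime, and margin figures below are purely illustrative assumptions, not values tied to this event.

```python
def estimate_bi_loss(annual_revenue, downtime_hours, profit_margin):
    """Hourly revenue x hours of downtime x profit margin, per the arithmetic above."""
    hourly_revenue = annual_revenue / 365 / 24
    return hourly_revenue * downtime_hours * profit_margin

# Hypothetical inputs: a $500M-revenue firm, 19 hours of downtime, 10% profit margin.
print(f"Rough BI estimate: ${estimate_bi_loss(500_000_000, 19, 0.10):,.0f}")
```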
Given that a high percentage of modeled tail losses – which drive capital requirements and reinsurance purchasing – comes from BI and CBI, or Dependent BI coverage, it’s important for (re)insurers to study such claims when they occur. We almost had such an opportunity from the Change Healthcare attack in February, but UnitedHealth’s actions to provide short-term cashflow to medical providers likely stemmed what otherwise could have been a larger wave of CBI claims. This was a fortunate turn for insurers financially, but a missed opportunity to learn.
With CrowdOut, it seems unavoidable that we will get the chance to learn from claims. Sure, it’s (contingent) system failure coverage – which is often sublimited – but the hurdles for getting claims paid will be the same.
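To make those coverage mechanics concrete, here is a minimal sketch of how a waiting period and a sublimit might whittle down a gross BI estimate before anything is paid; the policy terms and loss figure are hypothetical.

```python
def payable_claim(gross_bi_loss, downtime_hours, waiting_period_hours, sublimit):
    """Apply a time-based waiting period and a sublimit to a gross BI estimate.

    Simplified: assumes loss accrues evenly over the outage. Real policy wording
    (retentions, proof of loss, deferred vs. lost income) adds further hurdles.
    """
    if downtime_hours <= waiting_period_hours:
        return 0.0
    compensable_share = (downtime_hours - waiting_period_hours) / downtime_hours
    return min(gross_bi_loss * compensable_share, sublimit)

# Hypothetical: $108K gross loss, 19 hours down, 12-hour waiting period, $250K sublimit.
print(f"Payable (hypothetical): ${payable_claim(108_000, 19, 12, 250_000):,.0f}")
```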
The cyber insurance world and most cyber modelers conceptualize three types of catastrophic peril: a widespread cloud outage, a widespread ransomware (or wiper malware) attack, and a widespread privacy breach. For the most part, these perils are considered separate and distinct loss processes.
CrowdOut demonstrated that such thinking is misguided. As it turns out, cloud networks and corporate networks both run on computers(!), and most of them are Windows computers. Whether a company rents its computing (cloud) or owns it (on premises) does not change the reliance on common software and operating systems.
True, it was tenant environments and providers of cloud software-as-a-service (SaaS) – not cloud infrastructure itself – that were affected by this event. This is an important distinction, and one that we make in our modeling (uniquely, I believe). Why? A common concern we hear from cloud experts is that most people do not understand the shared security model. Cloud providers are committed to doing their job, but in many cases their job descriptions are far narrower than people assume. You must also do yours.
The companies directly impacted by CrowdOut (the first-order impacts) were generally the ones doing things right, with automatic security updates enabled to protect their critical systems. Ideally, companies would test such software updates in sandbox environments before deploying them, and many do. But for the systems we most rely on – the “always on” systems such as 911 emergency services – the drive for constant availability compels us to embrace automatic updates, to maximize the chance that systems are patched, or at least alerted to vulnerabilities, before those vulnerabilities can be exploited. In this situation, that exact behavior is what made systems vulnerable.
While insurers count the claims coming in, it’s worth remembering: this was an accidental event, and it was resolved very quickly. CrowdStrike revoked its faulty update ~90 minutes after it was initially pushed, and data from SevCo Security indicates that about 90% of affected systems were restored within 19 hours of the push.
It’s been noted that adversaries of the US are certainly paying careful attention to this kind of vulnerability in our network dependencies – as if they didn’t know already.
Security researchers have been warning about the complex dependencies amongst software supply chains for some time. In CrowdOut, we are again fortunate to have stumbled upon one such dependency – literally by accident – when in other circumstances it might have been far more costly to discover.
For cyber insurance underwriting purposes, the companies affected here would have looked like the good risks. And in a real sense, they are. We are too quick to assign blame and reassess situations with the seemingly clear perspective of hindsight. “Resulting” is the term for the fallacy of judging the quality of a decision by the quality of its outcome¹. Remember, if insurers were underwriting to security controls, these companies would have scored well on patch management.
What protects us in one situation leaves us vulnerable in another. Software supply chain issues behave differently than typical malware exploits – regardless of whether issues arise by accident or on purpose.
This highlights a tension for organizations and particularly their insurers: doing the right thing to avoid most risks may be exactly the thing that opens up vulnerability to large-scale systemic risk. Insurers, at least, would do well to consider cyber pricing carefully – a lower attritional loss rate may warrant a higher catastrophe load, and vice versa.
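As a toy illustration of that pricing tension (the limit, rates, and expense ratio are invented assumptions, not CyberCube figures), one could decompose a technical premium into attritional and catastrophe components:

```python
def technical_premium(limit, attritional_rate, cat_load_rate, expense_ratio=0.30):
    """Toy decomposition: expected loss (attritional + catastrophe) grossed up for expenses."""
    expected_loss = limit * (attritional_rate + cat_load_rate)
    return expected_loss / (1 - expense_ratio)

# Two hypothetical insureds buying a $5M limit: one relies on automatic updates
# (lower attritional rate, higher systemic/cat load), the other on staged patching.
auto_update_firm = technical_premium(5_000_000, attritional_rate=0.010, cat_load_rate=0.006)
staged_patch_firm = technical_premium(5_000_000, attritional_rate=0.014, cat_load_rate=0.003)
print(f"auto-update insured:  ${auto_update_firm:,.0f}")
print(f"staged-patch insured: ${staged_patch_firm:,.0f}")
```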
One specific note of importance: we understand that XDR/MDR vendors are less and less interested in giving customers the option of manual updates; most customers are subscribed to automatic updates unless they push back strongly. In light of the CrowdStrike outage, this is at least one behavior in the cybersecurity ecosystem that should be reexamined. The counterpoint to consider: what would the exposure to a malicious zero-day or novel exploit have looked like for these organizations? In a widespread exploit situation, we could imagine the outcome being far worse for organizations that waited to apply patches for end-user validation or sandboxing.
This nuanced debate is not new to IT and cyber teams, but for many organizations it will likely prompt a reinvestigation of which systems should be updated automatically and which manually.
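As a back-of-the-envelope sketch of that tradeoff (every probability and duration below is a made-up assumption for illustration, not a measured rate), one could compare expected disruption under the two update policies:

```python
def expected_annual_disruption(p_bad_update, bad_update_hours,
                               p_exploited_while_waiting, breach_hours):
    """Expected hours of disruption per year under a given update policy."""
    return (p_bad_update * bad_update_hours
            + p_exploited_while_waiting * breach_hours)

# Automatic updates: more exposure to a faulty push, smaller unpatched window.
auto = expected_annual_disruption(p_bad_update=0.02, bad_update_hours=19,
                                  p_exploited_while_waiting=0.005, breach_hours=120)

# Sandboxed/staged updates: faulty pushes mostly caught, but a longer unpatched window.
staged = expected_annual_disruption(p_bad_update=0.002, bad_update_hours=19,
                                    p_exploited_while_waiting=0.03, breach_hours=120)

print(f"auto-update:   {auto:.1f} expected hours/year")
print(f"staged update: {staged:.1f} expected hours/year")
```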
CyberCube recently published a paper exploring how counterfactual analysis can be used to understand and quantify tail risk in tangible terms. CrowdOut gives us another prime example to consider questions such as:
What if the fix had taken longer?
What if, instead of an accident, this event had been malicious in nature – put in motion either by an external actor or a rogue agent operating within CrowdStrike?
What would the recovery costs have totaled if the event had lasted days or weeks?
These alternative versions of events would suggest a return period different from the 2-to-6 year range that we estimated for the event as it actually unfolded, but that’s exactly the point. By looking at CrowdOut through counterfactuals, we can see further out into the tail. While the particulars differ, if you squint, CrowdOut could follow a set of counterfactuals similar to those we explored for SolarWinds in our paper.
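As a sketch of how that counterfactual exercise might be laid out (the base loss echoes the insured-loss range cited above, while the severity multipliers and assumed return periods are purely hypothetical):

```python
# Base loss is within the ~$0.4B-$1.5B insured-loss range cited above; the severity
# multipliers and assumed return periods are hypothetical, for illustration only.
base_insured_loss = 1.0e9

counterfactuals = {
    "as observed (fix within a day)":          {"severity_x": 1.0,  "return_period_yrs": 4},
    "fix takes a week":                        {"severity_x": 4.0,  "return_period_yrs": 15},
    "malicious variant, destructive payload":  {"severity_x": 10.0, "return_period_yrs": 40},
}

for name, cf in counterfactuals.items():
    loss = base_insured_loss * cf["severity_x"]
    print(f"{name:40s} ~${loss / 1e9:4.1f}B  (assumed 1-in-{cf['return_period_yrs']}-year)")
```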
We are fortunate in many ways that this event was as benign as it was. As systems recover and claims are adjusted, let’s make sure cyber (re)insurers are learning lessons, so that our industry is as prepared for future events as it can and should be.
¹ See AnnieDuke.com; or John Kay and Mervyn King, Radical Uncertainty, 2020.