Last week saw another in a series of major system outages that continue to plague the major “Infrastructure as a Service” (IaaS) cloud providers. This time it was Amazon’s AWS that suffered an outage of around 8 hours in its Northern Virginia region (US-EAST-1), disrupting many hundreds of reliant customers, typically “Software as a Service” (SaaS) providers who rely on Amazon’s infrastructure to run their commercial applications.
This is not the first time that Amazon have suffered a major system outage (interestingly, the Northern Virginia operation seems to have seen more than its fair share of issues over the years compared with other sites). We saw a Distributed Denial of Service (DDoS) attack impact Amazon customers in October 2019 (reported by CyberCube here). Adverse weather conditions took AWS services down for 10 hours in 2016, and these major failures go back to 2013, when the infamous “Friday the 13th” outage saw a load-balancing issue take systems down for a 2-hour period, once again at the Virginia site.
It should be pointed out that, in every instance, Amazon’s professionalism and expertise have shone through. The “Big 3” cloud service providers (Amazon, Google and Microsoft) all boast world-class infrastructure operations and management capabilities and, thankfully, these capabilities have kept customer disruption to a minimum. It should also be pointed out that Amazon are by no means the only company to suffer these kinds of issues: June 2019, for example, saw Google’s services take a 4-hour hit due to a series of system configuration issues.
As I consider this latest outage event, a few things strike me as important to consider, both for the organisations that rely on cloud infrastructure and for the cyber insurers who provide them with coverage.
Firstly, most major outage events to date have not been due to malicious actors. Imagine how much worse they could be with a well-financed and motivated criminal actor behind them.
Secondly, it seems possible to create fairly substantial chaos within these huge compute environments through fairly trivial acts (this latest outage was caused by “a relatively small addition of capacity”, according to Amazon).
Thirdly, we haven’t seen “the big one” yet. Looking at this through the lens of cyber insurance, most coverages come with an 8-to-12-hour waiting-period retention, which an insured must bear before coverage applies. So, “disaster”, at least from an insurance perspective, hasn’t really occurred in the cloud yet.
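To make the waiting-period mechanics concrete, here is a minimal illustrative sketch in Python. The waiting period, hourly business-interruption loss and outage lengths below are entirely hypothetical figures chosen for illustration, not terms from any real policy:

```python
# Illustrative only: hypothetical policy terms, not a real coverage calculation.
WAITING_PERIOD_HOURS = 10      # assumed retention, midpoint of an 8-12 hour range
HOURLY_BI_LOSS = 50_000        # hypothetical business-interruption loss per hour

def covered_loss(outage_hours: float) -> float:
    """Loss payable once the outage exceeds the waiting-period retention."""
    compensable_hours = max(0.0, outage_hours - WAITING_PERIOD_HOURS)
    return compensable_hours * HOURLY_BI_LOSS

# An 8-hour outage like last week's never clears the retention...
print(covered_loss(8))    # 0.0
# ...while a 16+ hour event starts to generate real insured loss.
print(covered_loss(16))   # 300000.0
```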
One of our most important jobs at CyberCube is to create models that allow our customers to run simulations of possible major cyber disaster events across insurance portfolios in order to model financial risk. Cloud and the sorts of events that last week’s outage hints at are top of mind for us right now… and for good reason.
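To give a rough flavour of what such a simulation can look like, here is my own simplified sketch, not CyberCube’s model: the portfolio size, outage-duration distribution, loss figures and the assumption that a single outage hits every insured at once are all invented for illustration.

```python
import random

# Hypothetical portfolio: each insured has an hourly BI loss and a waiting period.
# All parameters are invented; a real model would use calibrated severity curves
# and dependency structures, not a single lognormal draw applied to everyone.
random.seed(42)
N_SIMULATIONS = 10_000
PORTFOLIO = [{"hourly_loss": random.uniform(5_000, 100_000), "waiting_period": 10}
             for _ in range(500)]

def simulate_event() -> float:
    """One simulated cloud outage affecting every insured in the portfolio."""
    outage_hours = random.lognormvariate(2.0, 0.8)   # invented severity distribution
    return sum(max(0.0, outage_hours - p["waiting_period"]) * p["hourly_loss"]
               for p in PORTFOLIO)

losses = sorted(simulate_event() for _ in range(N_SIMULATIONS))
print(f"Mean portfolio loss per event: {sum(losses) / N_SIMULATIONS:,.0f}")
print(f"1-in-200 (99.5th percentile) loss: {losses[int(0.995 * N_SIMULATIONS)]:,.0f}")
```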
I think it is very likely that we will see a 16+ hour outage event at one of the major cloud providers in the next 5 years. This event will most likely be caused by a combination of malicious actors intent on disrupting businesses at scale, and internal issues at the targeted cloud provider as it struggles to conduct root-cause analysis. The event will cause a “cascade” effect across platform and application providers and, ultimately, businesses across a variety of vertical markets, leading to major loss accumulation, probably on a national scale.
For me, it is only a matter of time before we see a combination of factors converging to create this kind of event. In the meantime, organisations that are heavily reliant on cloud, and the insurers that cover them with cyber policies, should consider the following:
If I am right in my 5-year prediction (and I hope I am not), careful consideration of the above may lessen the blow of a true “disaster in the cloud”.