The Ever-Changing Face of Cyber Data

Written by Richard Ford | May 20, 2026 7:45:01 AM

All models are built on data. Get the inputs wrong, and the output will be wrong: it’s the classic “garbage in, garbage out” problem. In cyber risk, that relationship is usually treated as a data quality problem: better data leads to better models. Improve the coverage and quality of the data and, voilà, the predictions improve.

Unfortunately, there’s a deeper issue. In cyber, the data itself isn’t stable. The “right answer” today is not the right answer tomorrow — not just because new vulnerabilities emerge or systems change, but because the structure of the Internet, and the meaning of the data we observe, is constantly evolving.

This article explores that shift: how cyber data is changing underneath us, and what that means for those trying to measure and price risk. To understand why this matters for cyber insurance, we first need to look at how the Internet itself has changed, and why that makes cyber data harder to interpret.

The structural evolution of the Internet

Much of the friction in modeling today comes from a fundamental change in how the Internet is built: a move away from a direct link between a piece of hardware and the service it provides, toward environments that are abstracted, multi-tenant, and software-defined.

In the early days of the Internet, one IP address generally meant one service. If you were standing up a new public server, you'd get a new public IP address, and away you went. Contiguous IP blocks were at a premium, but still comparatively easy to get. The Internet was largely static, changing slowly over months and years.

All of this changed due to two primary drivers: a need for resilience/efficiency improvements, and IPv4 exhaustion. Architects quickly figured out that serving up static content from places that were closer to end users was more efficient, and that the whole "one IP, one service" model was incredibly brittle. Thus, content delivery networks (CDNs) were born.

This process accelerated as the Internet continued to grow, as there simply weren't enough IP addresses for the “one to one” mapping model to survive. Thus, providers began to take advantage of CDN-like technology to map many domains to one IP.

This shift created a practical challenge for anyone trying to map cyber exposure — seeing an IP address is no longer the same as understanding who or what is behind it.

The attribution problem

Attribution is one of the clearest examples of why cyber data can be difficult to translate into risk. The same observable infrastructure can represent multiple services, customers, and owners depending on context.

On the web, this many-to-one mapping is disambiguated using either the Host header (HTTP) or the Server Name Indication (SNI) field in the TLS handshake (HTTPS), each of which names the site the client is actually asking for.
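
To make the many-to-one mapping concrete, here is a minimal sketch of how a shared edge might pick a tenant from the name a client supplies. The table, the tenant labels, and the route function are all hypothetical, and the addresses come from ranges reserved for documentation; real edge networks are far more involved.

```python
from typing import Optional

# Hypothetical edge configuration: one shared IP fronting several tenants.
# All names and addresses are illustrative placeholders.
EDGE_CONFIG = {
    "203.0.113.10": {
        "shop.example.com": "tenant-a",
        "blog.example.org": "tenant-b",
        "api.example.net": "tenant-c",
    }
}


def route(ip: str, server_name: Optional[str]) -> Optional[str]:
    """Return the tenant that would handle a request, or None if unknown.

    server_name stands in for the HTTP Host header or the SNI value.
    """
    tenants = EDGE_CONFIG.get(ip, {})
    if server_name is None:
        # A bare IP scan carries no name, so the edge cannot pick a tenant.
        return None
    # Hostnames are case-insensitive, so normalize before the lookup.
    return tenants.get(server_name.lower())


if __name__ == "__main__":
    ip = "203.0.113.10"
    print(route(ip, "shop.example.com"))
    print(route(ip, "blog.example.org"))
    print(route(ip, None))
```

The same address resolves to a different tenant for each name, and to nothing at all when no name is supplied, which is roughly the position an IP-only scanner finds itself in.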

This SNI/Host header approach works well for HTTP and TLS, but it's mostly ineffective for other protocols; these are handled by port forwarding. All of this works well for service providers and end users, but as a side effect it makes mapping the Internet harder: we see multiple customers sharing the same IP address on different ports. Furthermore, this forwarded traffic is often pushed to load balancers that are hidden from the outside. Without knowing the port mapping, it can be impossible to figure out who owns port 5432 on a particular device. Even when you can establish that a particular port is open, knowing who logically (not physically) owns the asset that responds is impossible without additional data.

A snapshot that starts to decay

Finally, these mappings are not inherent in the network itself — they are defined in configuration, and can change instantly without any observable signal at the IP layer. This makes creating a picture that is both complete and clear effectively impossible without “internal” knowledge of the system as a whole.
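
A small simulation shows why a configuration change can be invisible from the outside. This is a sketch under stated assumptions: the forwarding tables, tenant names, and the scan function are invented for illustration, and a real scanner sees far more than a set of open ports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What an external scanner can see: an address and an open port."""
    ip: str
    port: int
    is_open: bool


def scan(ip: str, forwarding: dict[int, str]) -> set[Observation]:
    """Simulate an outside-in scan: ports are visible, owners are not."""
    return {Observation(ip, port, True) for port in forwarding}


# Hypothetical internal forwarding tables (port -> logical owner),
# known only to the provider. Names are placeholders.
config_monday = {443: "tenant-a", 5432: "tenant-b"}
config_tuesday = {443: "tenant-a", 5432: "tenant-c"}  # port 5432 reassigned

ip = "198.51.100.7"
seen_monday = scan(ip, config_monday)
seen_tuesday = scan(ip, config_tuesday)

# The external view is identical even though ownership changed.
unchanged_outside = seen_monday == seen_tuesday
changed_inside = config_monday[5432] != config_tuesday[5432]
print(unchanged_outside, changed_inside)  # prints: True True
```

Both scans return the same set of observations, so any attribution made on Monday is silently wrong on Tuesday.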

Much of this complexity is obscured behind shared, externally facing infrastructure, making attribution significantly harder. Where we once could scan an IPv4 address and work backward to identify ownership, we now often need to start with a domain or service identity (remember, edge networks use mechanisms like SNI to route traffic internally, meaning the same IP address may serve many different underlying systems… and owners). Put another way, you need to declare recipient identity to even get your traffic routed to the right destination.

This creates layers of ownership: who controls the IP itself (for example, AWS), who owns the physical infrastructure (also AWS, in our current example), and who owns the application actually handling the request — often a completely different entity, abstracted away by configuration.
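
One way to keep these layers apart in a dataset is to record them as separate fields rather than a single "owner" column. The record type below is a hypothetical sketch, not a description of any particular product's schema; the field names and the example address are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attribution:
    """Three layers of ownership for one observed endpoint."""
    ip: str
    ip_holder: str                    # who controls the address itself
    infrastructure_owner: str         # who owns the physical infrastructure
    application_owner: Optional[str]  # who handles the request; often unknown

    def is_resolved(self) -> bool:
        """True only when the application layer has been attributed."""
        return self.application_owner is not None


# Illustrative record: the provider is visible, the tenant is not.
record = Attribution(
    ip="192.0.2.44",
    ip_holder="AWS",
    infrastructure_owner="AWS",
    application_owner=None,
)
print(record.is_resolved())
```

Keeping the application owner optional makes the gap explicit: the first two layers can usually be filled from public data, while the third often cannot.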

Viewed this way, the challenge is not simply that our data is incomplete. Even with a perfect snapshot of the world at a given moment, that picture begins to decay immediately as the system evolves. We are observing a moving target.

The problem is no longer that we lack data. It’s that the meaning of the data we have is constantly shifting — sometimes gradually, sometimes in sudden jumps.

Silent signal decay

The result is not always a dramatic failure. More often, it is a slow weakening of signals that once helped explain risk. Observations mean nothing in a vacuum: it's what we do with these observations (essentially to turn data into information – or in our case, risk) that matters.

All models, at some level, do the same thing: some combination of observations becomes a signal that either increases or decreases risk. The art and science (and yes, there is some “art” there) of modeling is to combine these observations into signals that are highly predictive in nature.

When signals stop working

Unfortunately, as both the world and our observations of it shift, models have a strong tendency to degrade without any obvious triggering event. It might be comforting to think that a model's lifespan is driven by its ability to account for new vulnerabilities, but what actually happens is more insidious: steady contextual drift drives model drift. Put simply, what worked last year slowly stops working this year.
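
One simple way to watch for this kind of decay is to track how much a signal raises the observed incident rate in each period. The sketch below uses synthetic, hand-made records and a basic lift ratio; the data, the period labels, and the lift_by_period helper are all assumptions made for illustration, and a production monitor would need confidence intervals and far more data.

```python
from collections import defaultdict

# Synthetic, hand-made records: (period, signal_present, had_incident).
# These are not real observations; they only illustrate the calculation.
RECORDS = (
    [("2024", True, True)] * 30 + [("2024", True, False)] * 70
    + [("2024", False, True)] * 10 + [("2024", False, False)] * 90
    + [("2025", True, True)] * 15 + [("2025", True, False)] * 85
    + [("2025", False, True)] * 10 + [("2025", False, False)] * 90
)


def lift_by_period(records):
    """Return {period: P(incident | signal) / P(incident | no signal)}.

    Assumes every period has both groups and a non-zero baseline rate;
    otherwise the division below raises ZeroDivisionError.
    """
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for period, signal, incident in records:
        bucket = counts[period][signal]
        bucket[0] += int(incident)  # incidents
        bucket[1] += 1              # total observations
    lifts = {}
    for period, by_signal in counts.items():
        with_rate = by_signal[True][0] / by_signal[True][1]
        without_rate = by_signal[False][0] / by_signal[False][1]
        lifts[period] = with_rate / without_rate
    return lifts


lifts = lift_by_period(RECORDS)
for period in sorted(lifts):
    print(period, round(lifts[period], 2))
```

In this made-up data the signal is still present just as often in 2025, but its lift halves from 3.0 to 1.5. Nothing breaks visibly; the signal simply explains less than it used to.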

In many ways, it would be simpler if the decay were purely event-driven. Modelers would see the change, and deal with it. However, because the decay is slow, the loss in efficacy isn't immediately obvious. In addition, given that risk calculations impact insurability and action, the model is not simply a passive observer of the system: it influences it.

What data evolution means for cyber insurance

Cyber risk isn't just evolving — the data needed to accurately measure it and model it is changing too.

Moreover, this evolution is not smooth and steady. It is punctuated by fundamental changes in the semantic meaning of data, which in turn change how predictive and how useful each signal is. Add a dynamic cyber insurance market, and it's not an easy time to be on the "sell" side of insurance.

So what happens when the relationship between data and risk is no longer stable?

Check out part two, What Data Drift Means for Cyber Insurance Models, where we explore the implications for insurers, reinsurers, and modelers — from pricing and underwriting to portfolio management, model governance, and the need to treat models as adaptive processes rather than static answers.
