Network telemetry is a big-data problem
Your video call freezes for three seconds, then recovers. Somewhere in the network between you and the other person, something went briefly wrong. The interesting question for an engineer is: could anyone actually prove what happened, after the fact? Answering that turns out to be a data problem, and a surprisingly large one. The systems that watch a network for a living generate a torrent of data about it, and handling that torrent well is the same kind of high-throughput data-infrastructure challenge I usually write about in the context of AI. This post is about why network monitoring is, underneath, a big-data problem, and why its hard parts rhyme with the hard parts of feeding data to AI.
I am going to assume no networking background. If you have used the internet, you know enough to follow along.
What “telemetry” even means
Telemetry is just a fancy word for “measurements a system reports about itself.” A car’s dashboard is telemetry: speed, fuel, engine temperature. A network produces telemetry too, and it comes in three main flavors. The cleanest way to keep them straight is by how much detail each one carries, and how much data that detail costs you.
1. Packets. When data moves across a network, it does not travel as one continuous stream. It is chopped into small chunks called packets, and each packet carries a header that says where it came from, where it is going, and what kind of traffic it is. Watching the actual packets go by is called packet capture. It is the most detailed view possible, the equivalent of recording every byte, and also the most overwhelming.
2. Flows. Most of the time you do not need every packet. You just need to know “machine A talked to machine B for five minutes, sent 4 megabytes, and it was video-call traffic.” That summary is called a flow record: one row per conversation, with the totals but not the contents. Flows are far smaller than full packet capture but still tell you most of what you want for everyday monitoring.
3. Metrics. Finally there are plain numbers sampled over time: how many packets per second crossed this link, how many had errors, how full was this queue, what was the average delay. These are metrics (the ones that only ever count up are often called counters). Each individual number is tiny, but you collect thousands of them, every few seconds, forever.
A useful way to see the three together is as successive summaries of the same traffic. Each step down throws away detail and, in exchange, shrinks the data dramatically.
flowchart TD
P["Packets<br/>every byte on the wire<br/>highest detail, highest volume"]
F["Flows<br/>one summary row per conversation<br/>medium detail, medium volume"]
M["Metrics<br/>counts and averages over time<br/>lowest detail, lowest volume"]
P -->|"summarize each conversation"| F
F -->|"count and average"| M
Logs and traces exist too, but packets, flows, and metrics are enough to make the point.
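To make the "successive summaries" idea concrete, here is a minimal sketch in Python. The `Packet` fields, the `to_flows` and `to_metrics` helpers, and the sample traffic are all made up for illustration; real flow exporters key on more fields and handle timeouts, which this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    ts: float   # seconds since the start of capture
    src: str    # sending machine
    dst: str    # receiving machine
    app: str    # kind of traffic, e.g. "video"
    size: int   # bytes in this packet


def to_flows(packets):
    """Summarize packets into one row per (src, dst, app) conversation."""
    flows = {}
    for p in packets:
        key = (p.src, p.dst, p.app)
        f = flows.setdefault(
            key, {"packets": 0, "bytes": 0, "first": p.ts, "last": p.ts}
        )
        f["packets"] += 1
        f["bytes"] += p.size
        f["first"] = min(f["first"], p.ts)
        f["last"] = max(f["last"], p.ts)
    return flows


def to_metrics(flows):
    """Collapse flows into a few totals; per-conversation detail is gone."""
    return {
        "flows": len(flows),
        "packets": sum(f["packets"] for f in flows.values()),
        "bytes": sum(f["bytes"] for f in flows.values()),
    }


# Illustrative traffic: three video packets A->B and one web packet C->D.
packets = [
    Packet(0.0, "A", "B", "video", 1200),
    Packet(0.1, "A", "B", "video", 1200),
    Packet(0.2, "C", "D", "web", 500),
    Packet(0.3, "A", "B", "video", 800),
]
flows = to_flows(packets)
metrics = to_metrics(flows)
```

Four packets become two flow rows, and two flow rows become three numbers. Each step is cheaper to store and impossible to reverse.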
Why this is a big-data problem and not a small one
Here is the part that surprises people. A single busy network link can carry tens of billions of bits per second, which works out to several gigabytes every second. If you wanted to capture every packet on it, you would have to write to disk, continuously, at a rate that many storage systems cannot sustain. And a large company does not have one link. It has thousands, across data centers, offices, and cloud regions.
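A quick back-of-the-envelope calculation shows the scale. The numbers here are my assumptions, not measurements: a 10 Gb/s link running flat out, and a fleet of 1,000 such links. Real links are rarely saturated, so treat this as an upper bound.

```python
def capture_bytes(link_gbps: float, seconds: float, utilization: float = 1.0) -> float:
    """Bytes written if every bit crossing the link is captured."""
    bits = link_gbps * 1e9 * utilization * seconds
    return bits / 8


SECONDS_PER_DAY = 86_400

# Assumed: one fully utilized 10 Gb/s link.
per_second = capture_bytes(10, 1)                 # 1.25e9 bytes, i.e. 1.25 GB/s
per_day = capture_bytes(10, SECONDS_PER_DAY)      # 1.08e14 bytes, i.e. 108 TB

# Assumed: 1,000 such links across the company.
fleet_per_day = per_day * 1000                    # 1.08e17 bytes, i.e. 108 PB
```

Even at a tenth of that utilization, a single day of full capture across the fleet lands in the petabyte range.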
So you immediately hit the defining tension of the whole field:
- Keep everything, and you drown. The cost of storing full packet capture for even a day across a large network is enormous, and most of it is boring traffic you will never look at.
- Keep a sample, and you save money, but you might throw away the one packet that explains the outage. Rare events are exactly the ones that matter, and rare events are exactly what sampling tends to miss.
This is not a tooling detail you can shrug off. It is the central design decision, and every network-monitoring product is, in large part, an opinion about how to resolve it: what to keep in full, what to summarize, what to throw away, and how fast.
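The sampling side of the tension can be put in numbers too. This sketch assumes each packet is kept independently with a fixed probability, which is a simplification of how real samplers behave, and the function name `p_miss` is mine.

```python
def p_miss(sample_rate: float, event_packets: int) -> float:
    """Chance that random sampling keeps none of the packets in an event.

    Assumes each packet is kept independently with probability sample_rate.
    """
    return (1.0 - sample_rate) ** event_packets


# A rare event made of 5 packets, sampled at 1 in 1,000.
rare = p_miss(1 / 1000, 5)

# A bulk event made of 10,000 packets, same sampling rate.
bulk = p_miss(1 / 1000, 10_000)
```

Under these assumptions the five-packet event is missed more than 99% of the time, while the ten-thousand-packet event is almost never missed. Sampling is reliable for the common case and close to blind for the rare one.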
If that tension sounds familiar from my other writing, it should. It is the same problem as deciding what data to retain and at what fidelity when you are building pipelines to feed an AI system. The volumes and the words differ, but the shape is identical.
It maps cleanly onto the data stack you already know
In an earlier post on data engineering I described the modern data stack as four layers: sources, storage, transformation, and serving. Network telemetry slots into exactly the same four layers. That is really the whole thesis of this post: network monitoring is a data-pipeline problem wearing a networking costume.
flowchart LR
S[("Sources<br/>packets, flows, metrics")] --> I["Ingestion<br/>(capture at line rate)"]
I --> ST[("Storage<br/>time-series + retention tiers")]
ST --> T["Transformation<br/>(rollups + enrichment)"]
T --> SV["Serving<br/>dashboards, alerts, forensic search, AIOps"]
Let me walk the layers, because each one has a twist that the networking context adds.
Sources. The packets, flows, and metrics above. The twist: unlike a database you can query at your leisure, this data flies past once. If you do not capture it as it happens, it is gone. There is no “re-run yesterday.”
Ingestion. Getting the data off the wire and into a system without dropping it. This is the genuinely hard part, because the data arrives at “line rate,” meaning as fast as the network can physically move it. The monitoring system has to keep up at full speed or it starts dropping the very data it exists to collect. In ordinary data engineering you can usually slow down a source if your pipeline is behind. Here you cannot ask the network to wait.
Storage. Once captured, the data is almost always time-series, meaning every record is stamped with when it happened and you mostly ask time-based questions (“show me the last hour”). Nobody keeps the most detailed data forever. The standard move is retention tiers: keep the last few hours in full detail (expensive, fast to search), keep summaries for a few weeks (cheaper), and keep only coarse aggregates for a year (cheapest). Hot, warm, cold. Your phone’s photo storage works the same way: recent photos stay on the device, older ones get pushed to the cloud.
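As a sketch, a retention policy is little more than a lookup from a record's age to a tier. The boundaries below (6 hours, 4 weeks, 365 days) are placeholders I picked to match "hours, weeks, a year"; a real system would tune them to budget and query patterns.

```python
from datetime import timedelta

# Hypothetical tier boundaries, ordered from freshest to oldest.
TIERS = [
    ("hot", timedelta(hours=6), "full detail"),
    ("warm", timedelta(weeks=4), "summaries"),
    ("cold", timedelta(days=365), "coarse aggregates"),
]


def tier_for(age: timedelta):
    """Return (tier name, fidelity) for data of the given age.

    Returns None when the data is older than the longest tier,
    meaning it is eligible for deletion.
    """
    for name, max_age, fidelity in TIERS:
        if age <= max_age:
            return name, fidelity
    return None
```

For example, `tier_for(timedelta(days=10))` falls in the warm tier, and `tier_for(timedelta(days=400))` returns `None`.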
Transformation. Turning raw firehose into something a human can read: rolling per-second numbers up into per-minute averages, counting flows by application, attaching context (“this address belongs to the finance department”). This is the same aggregate-and-enrich work that pipelines do everywhere, including in the tools I build around Floe. The network flavor just runs continuously and never really finishes.
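Here is roughly what a rollup and an enrichment step look like in Python. The subnet-to-department table `OWNERS` and the sample values are invented for the example; in practice that context would come from an inventory or address-management system.

```python
import ipaddress
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from address ranges to the teams that own them.
OWNERS = {
    ipaddress.ip_network("10.1.0.0/16"): "finance",
    ipaddress.ip_network("10.2.0.0/16"): "engineering",
}


def owner_of(addr: str) -> str:
    """Enrichment: attach a human-meaningful owner to a raw address."""
    ip = ipaddress.ip_address(addr)
    for network, owner in OWNERS.items():
        if ip in network:
            return owner
    return "unknown"


def rollup_per_minute(samples):
    """Rollup: average (unix_seconds, value) samples into per-minute buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % 60].append(value)  # key is the start of the minute
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


# Illustrative per-second packet counts spanning two minutes.
samples = [(0, 100), (1, 200), (59, 300), (60, 50), (61, 150)]
per_minute = rollup_per_minute(samples)
```

This version works on a finished list. A streaming version would emit each minute's average as soon as the minute closes, which is the "never really finishes" part.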
Serving. Dashboards that show red or green, alerts that page someone at 3am, and forensic search for when you need to reconstruct exactly what happened during that three-second video freeze. Increasingly, this layer also feeds machine-learning systems that try to spot trouble automatically, which I will come back to.
The one genuinely networky idea worth learning: cardinality
There is a single concept that makes telemetry harder than it looks, and it is worth understanding because it shows up everywhere once you see it. It is called cardinality, which is just a heavy word for “how many distinct things you are counting separately.”
Counting total packets per second is cheap: it is one running number. But the moment you want to count packets per second for every pair of communicating machines, broken down by application, you are no longer keeping one number. You might be keeping millions of separate little counters, one for each unique combination. As the network grows, the number of combinations can explode far faster than the traffic itself.
High cardinality is the silent budget killer of monitoring systems. It is the reason you cannot simply “track everything, sliced every possible way.” The slicing is what gets expensive, not the raw volume. A huge part of designing a telemetry system is deciding which breakdowns are worth keeping and which are not, which is, again, a data-modeling decision, the same kind covered in the data-engineering post.
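The arithmetic behind that explosion is just multiplication. In this sketch the dimension sizes (1,000 sources, 1,000 destinations, 20 applications) are assumptions chosen for illustration, and both helper names are mine.

```python
from math import prod


def worst_case_series(dimension_sizes: dict) -> int:
    """Upper bound on separate counters: the product of distinct values per dimension."""
    return prod(dimension_sizes.values())


def observed_series(records, dimensions) -> int:
    """Counters actually needed: distinct combinations seen in the data."""
    return len({tuple(r[d] for d in dimensions) for r in records})


# No breakdown at all: one running number.
total_only = worst_case_series({})

# Assumed sizes: 1,000 sources x 1,000 destinations x 20 applications.
sliced = worst_case_series({"src": 1000, "dst": 1000, "app": 20})

records = [
    {"src": "A", "dst": "B", "app": "video"},
    {"src": "A", "dst": "B", "app": "web"},
    {"src": "C", "dst": "D", "app": "web"},
]
by_pair = observed_series(records, ("src", "dst"))
by_pair_and_app = observed_series(records, ("src", "dst", "app"))
```

One counter becomes up to twenty million by adding three breakdowns, without a single extra packet on the wire. The observed count is usually well below the worst case, but it grows every time you add a dimension.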
A sketch of how you would actually build one
It helps to make this concrete. Here is roughly what a telemetry system looks like sketched on a whiteboard. None of the boxes are exotic. They are the same boxes you would draw for any high-throughput data pipeline, which is exactly the point.
flowchart LR
subgraph capture["Capture at the edges"]
A["Network taps /<br/>mirror ports"]
B["Router exporters<br/>(NetFlow, sFlow)"]
C["Host agents"]
end
A --> Q
B --> Q
C --> Q
Q["Buffer<br/>(message queue)"] --> SP["Stream processor<br/>sample, roll up, enrich"]
SP --> TS[("Time-series DB<br/>metrics + flows")]
SP --> OBJ[("Object storage<br/>raw packet capture")]
TS --> API["Query + alerting layer"]
OBJ --> API
API --> D["Dashboards"]
API --> AL["Alerts / paging"]
API --> AI["AIOps / anomaly detection"]
Walking it left to right, a handful of design decisions carry most of the weight:
Capture at the edges. You collect data where the traffic already is: passive taps or mirror ports that copy packets off a link, exporters built into routers that emit flow records, and agents running on individual machines. Each source gives a different detail-versus-volume tradeoff, which is why real systems use all three rather than picking one.
Put a buffer in front of everything. A message queue sits between capture and processing so that a momentary slowdown downstream does not turn into dropped packets upstream. This is the single most important move in the whole design, because, as before, the network will not wait. The buffer absorbs bursts and lets the rest of the system run at a sane average pace instead of the terrifying peak.
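A toy simulation shows why the buffer matters. Everything here is simplified and assumed: time moves in discrete ticks, the processor drains a fixed number of records per tick, and the arrival pattern is a made-up burst.

```python
def simulate(arrivals, drain_per_tick: int, buffer_size: int) -> int:
    """Return how many records are dropped because the buffer was full.

    arrivals: records arriving at each tick.
    drain_per_tick: records the downstream processor handles per tick.
    buffer_size: maximum records the buffer can hold.
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        accepted = min(arriving, buffer_size - queued)
        dropped += arriving - accepted
        queued += accepted
        queued -= min(queued, drain_per_tick)  # downstream catches up
    return dropped


# One burst of 100 in otherwise quiet traffic; the average is about 21 per tick.
arrivals = [10, 10, 100, 10, 10, 10, 10, 10]

small_buffer_drops = simulate(arrivals, drain_per_tick=25, buffer_size=25)
large_buffer_drops = simulate(arrivals, drain_per_tick=25, buffer_size=100)
```

The processor is identical in both runs and comfortably faster than the average arrival rate. With the small buffer it still loses 75 records during the burst; with the larger one it loses none and simply works through the backlog over the next few ticks.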
Process as a stream, not a batch. The sampling, the rollups (per-second into per-minute), and the enrichment (attaching “this address is the finance department”) all happen in flight, as the data flows past, not in a nightly job. This is where you make the keep-everything-versus-keep-a-sample decision real, and where you decide which cardinality breakdowns are allowed to exist.
Split storage by how you will read it. The numbers you query constantly (metrics and flows) go into a time-series database tuned for “show me the last hour.” The raw packet capture you will rarely touch but must keep briefly goes into cheap object storage. Layered on top is tiered retention: full detail for hours, summaries for weeks, coarse aggregates for a year.
flowchart LR
H["Hot<br/>last few hours<br/>full detail, fast, costly"] --> W["Warm<br/>last few weeks<br/>summarized, cheaper"]
W --> C["Cold<br/>up to a year<br/>coarse aggregates, cheapest"]
Serve through one query layer. Dashboards, alerting, and any AIOps models all read from the same query layer rather than each reaching into storage directly, so there is one consistent answer to “what was happening at 2:47am.”
If you have built a data pipeline before, none of this is new. Swap “packets” for “events” and this is a generic streaming architecture. The networking details (line-rate capture, taps, NetFlow) change the constraints, but not the shape.
Garbage in, garbage out, for the systems that are supposed to tell the truth
Now the part that ties back to the thesis running through everything I write. My standing argument is that data infrastructure, not the model, is the real bottleneck for AI, because “garbage in, garbage out” governs any system fed by data. Network telemetry is a sharp example, with an extra sting: this is the data that is supposed to tell you whether everything else is healthy.
Consider what “garbage in” looks like here:
- Dropped data. The monitoring system fell behind at line rate and silently skipped some traffic. Now your dashboard shows a dip in activity that never happened, or worse, hides a spike that did.
- Clock skew. Two devices disagree about what time it is by a few hundred milliseconds. When you stitch their telemetry together to reconstruct an incident, the order of events is wrong, and you chase the wrong cause.
- Sampling bias. You kept one packet in a thousand to save money, and the problem lived in the packets you discarded. The data says everything is fine because the evidence was thrown away before anyone looked.
- Truncated cardinality. To control cost, the system lumped many distinct sources into an “other” bucket, and the one misbehaving machine is now invisible inside that bucket.
In every case the dashboard can glow green while users suffer. A monitoring system built on bad telemetry does not just fail to help; it actively misleads, because people trust it precisely when they are stressed and debugging. The quality of the underlying data is not a nice-to-have. It is the entire value of the system.
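Of these failure modes, dropped data is the one you can most directly check for, provided the source numbers its records. The sketch below assumes each record carries a sequence number that increases by one, arrives in order, and never wraps around. Those are simplifying assumptions, and the function name is mine.

```python
def count_missing(sequence_numbers) -> int:
    """Count records skipped between consecutive sequence numbers.

    Assumes numbers arrive in increasing order with no wraparound.
    """
    missing = 0
    previous = None
    for seq in sequence_numbers:
        if previous is not None and seq > previous + 1:
            missing += seq - previous - 1
        previous = seq
    return missing


# Records 4, 5, and 6 never arrived.
gap = count_missing([1, 2, 3, 7, 8])
```

A nonzero result does not tell you what was lost, only that something was. That is still far better than a dashboard that silently reports a dip that never happened.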
Why this is worth caring about
Two reasons, one practical and one bigger.
The practical one: networks are now being watched by machine learning, not just humans. The industry calls this AIOps, the idea that models can spot anomalies and point at root causes faster than a person scrolling dashboards. It is a genuinely good idea. But it inherits the rule without exception: a model trying to detect network anomalies is only as good as the telemetry feeding it, and if that telemetry is dropped, skewed, or sampled away, the model confidently learns the wrong normal and raises the wrong alarms. The smartest anomaly detector in the world cannot detect what it was never shown.
The bigger reason: networks are critical infrastructure. The same plumbing carries hospital records, payment systems, power-grid coordination, and government services. When it degrades, knowing precisely what happened, and proving it, is not a luxury. And that knowing is, at bottom, a data-infrastructure capability: capture the right signals at full speed, store them affordably at the right fidelity, and keep them honest enough to trust at 3am. I spent an early part of my career on the security side of a large financial network, automating away slow manual processes, and the lesson stuck: a network is only as understandable as the data it is willing and able to report about itself.
So when I say network telemetry is a big-data problem, I do not mean it as a metaphor. It is the same ingestion, the same storage tradeoffs, the same transformation work, the same quality discipline, and the same painful tension between keeping everything and affording anything that runs through all of modern data infrastructure. The packets are just a particularly fast, particularly unforgiving source. And like every other source, what you get out of the system is decided long before the dashboard, by how well you handled the data going in.