Data infrastructure is the bottleneck for enterprise AI, not the models
There is a popular narrative right now that AI is bottlenecked by model capabilities — that we need bigger models, smarter agents, better reasoning. After several years of building data infrastructure in logistics and now enterprise AI, I have come to a different view: the real bottleneck is data infrastructure, and most companies will never get the full value of AI until they fix it.
The GIGO (“garbage in, garbage out”) principle has been around in computing for as long as computing has existed. It applies in full force to AI. A frontier model fed with stale, biased, malformed, or fragmented data will produce stale, biased, malformed, or fragmented outputs — and at machine speed and machine scale. The model is the easy part. The hard part is everything that has to happen before the data ever reaches the model.
This post is about why I think enterprise AI dominance — both at the company level and the country level — depends much more on the underlying data plumbing than on which model sits on top of it.
New to data infrastructure as a topic? My companion piece on what data engineering is and why AI needs it is a gentler introduction to the same territory — start there if you have not built a pipeline before.
Models are getting commoditized fast
OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, Qwen: there are now at least half a dozen credible frontier model providers, and the gap between them on most enterprise tasks is small. Open-weight models released a year ago were roughly comparable to the proprietary models available at the time of their release. Inference cost per token has fallen by orders of magnitude.
What that means in practice for an enterprise: the model you choose is not your moat. Anyone can swap from one model to another with a few config changes. If your competitive advantage is “we use GPT-5 instead of GPT-4,” your competitive advantage will evaporate the day GPT-6 ships.
The actual moat is what you feed the model. Specifically:
- The data your AI has access to that nobody else has
- The freshness of that data
- The quality, structure, and completeness of that data
- Your ability to feed it to the model reliably, at scale, with appropriate filtering and grounding
This is all infrastructure. None of it is “AI” in the popular sense.
The data problem is bigger than the model problem
Anyone who has worked on a large-scale data pipeline knows that the messy reality of production data is the real engineering challenge. Data arrives late, in the wrong format, with missing fields, with duplicate records, with schema changes that nobody told you about, with timezone confusion, with units that disagree across upstream systems. A pipeline that handles all of this gracefully (catching anomalies before they hit production, degrading cleanly when an upstream system goes down, maintaining lineage so you can debug what happened three weeks ago) is not a weekend project. It is years of infrastructure work.
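To make a few of those failure modes concrete, here is a minimal sketch of a cleaning step that handles three of them: ambiguous timezones, disagreeing units, and duplicate records. The field names (`event_id`, `event_time`, `weight`, `weight_unit`) and the choice of kilograms as the canonical unit are assumptions made up for illustration, not a schema from any real system.

```python
from datetime import datetime, timezone

# Hypothetical conversion factors to a canonical unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def normalize(record):
    """Return a cleaned copy of one raw record, or raise KeyError/ValueError."""
    ts = datetime.fromisoformat(record["event_time"])
    if ts.tzinfo is None:
        # A naive timestamp is ambiguous; refuse to guess the zone.
        raise ValueError("timestamp has no timezone offset")
    unit = record["weight_unit"].lower()
    if unit not in TO_KG:
        raise ValueError(f"unknown unit: {unit}")
    return {
        "event_id": record["event_id"],
        "event_time_utc": ts.astimezone(timezone.utc),
        "weight_kg": record["weight"] * TO_KG[unit],
    }


def clean_batch(records):
    """Normalize records, keeping the first occurrence of each event_id."""
    seen, cleaned, rejected = set(), [], []
    for record in records:
        try:
            row = normalize(record)
        except (KeyError, ValueError) as exc:
            # Keep the bad record and the reason instead of dropping it silently.
            rejected.append((record, str(exc)))
            continue
        if row["event_id"] in seen:
            continue  # exact re-delivery of an event we already have
        seen.add(row["event_id"])
        cleaned.append(row)
    return cleaned, rejected
```

Even this toy version has to make policy decisions (reject naive timestamps rather than guess, first duplicate wins), which is a large part of why the production version takes years rather than a weekend.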
When companies bolt an LLM on top of a data system that has been duct-taped together for a decade, the LLM does not magically clean things up. It cheerfully produces confident-sounding answers based on whatever it was given. If the data was wrong, the answer is wrong, just expressed in better English.
I have seen this pattern repeatedly. At Amazon, the team I was on ran event-driven pipelines processing very large volumes of delivery performance data every day, the infrastructure underneath the incentive program for the company’s last-mile delivery partners across multiple countries. The hard problem there was not the analytics on top; it was attribution. A delivery performance metric that does not distinguish controllable defects from uncontrollable ones (bad weather, traffic, system errors, customer-side data errors) produces unfair signals no matter how sophisticated the downstream model is. Getting that right meant joining heterogeneous signals (GPS, weather, traffic, system logs, address validation) at the per-event level. Until your infrastructure can do that, no model on top can make decisions that people will trust.
At Microsoft, I see the same shape of problem in enterprise AI. I work on platform infrastructure that supports products like Copilot, Power Apps, Dynamics 365, and Power Automate across Azure regions globally, including government cloud environments, with high availability targets. The platform itself is extremely reliable. What tends to be missing is on the other side of the API: customers adopting these tools want agents, RAG (retrieval-augmented generation), tool use, copilots embedded in every workflow — but they have a tangle of internal systems, inconsistent identity, stale documents, manual approvals, ad-hoc data sources. The AI works fine in a demo and falls over in production, because the production data infrastructure on the customer side is not ready.
What “good” data infrastructure looks like
If I had to name the properties that distinguish data infrastructure good enough for AI from infrastructure that is not, I would list four:
1. Reliability. Pipelines run on schedule, retry on failure, alert on anomalies, and never silently produce wrong outputs. If a piece of data is bad, the system knows and routes it to a quarantine queue rather than letting it propagate.
2. Freshness. Data flows from source to consumer with predictable, bounded latency. For most enterprise AI applications, freshness measured in minutes is the difference between useful and useless.
3. Quality validation at ingest. Bad data does not enter the system. Schema enforcement, statistical anomaly detection, and reference-data validation catch issues before they reach the model.
4. Lineage. When a downstream system produces a wrong answer, you can trace exactly which upstream record caused it, how it was transformed along the way, and when. Without lineage, you cannot debug AI failures.
Notably, none of these are AI features. They are good engineering practices that have existed for decades. What is new is that they are no longer optional: once you put an AI model on top, every weakness in the underlying infrastructure gets amplified.
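As a rough illustration of properties 1 and 3 from the list above, the sketch below validates records at ingest and routes failures to a quarantine list with a reason attached, rather than letting them propagate. Everything specific in it is an assumption for the example: the field names, the `KNOWN_REGIONS` reference set, and the crude "more than 100 times the historical median" anomaly rule, which a real system would replace with a proper statistical check.

```python
from dataclasses import dataclass, field
from statistics import median

# Hypothetical schema and reference data, for illustration only.
REQUIRED_FIELDS = {"order_id": str, "region": str, "amount_usd": float}
KNOWN_REGIONS = {"NA", "EU", "APAC"}


@dataclass
class IngestResult:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)  # (record, reason) pairs


def check_record(record, typical_amount):
    """Return None if the record passes, otherwise a reason string."""
    # 1. Schema enforcement: required fields with the expected types.
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected):
            return f"schema: {name} missing or not {expected.__name__}"
    # 2. Reference-data validation.
    if record["region"] not in KNOWN_REGIONS:
        return f"reference: unknown region {record['region']!r}"
    # 3. Crude anomaly check: negative, or far above the historical median.
    if record["amount_usd"] < 0 or record["amount_usd"] > 100 * typical_amount:
        return "anomaly: amount outside plausible range"
    return None


def ingest(records, history_amounts):
    """Split a batch into accepted records and quarantined (record, reason) pairs."""
    typical = median(history_amounts)
    result = IngestResult()
    for record in records:
        reason = check_record(record, typical)
        if reason is None:
            result.accepted.append(record)
        else:
            result.quarantined.append((record, reason))
    return result
```

The design point is that a bad record is never silently dropped or silently passed through: it lands in `quarantined` with a reason a person can act on, and only `accepted` moves downstream toward the model.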
Why this matters at the country level
I think the same argument scales. The countries that will lead in enterprise AI are not the ones with the best models — that race is being run by a handful of companies and the gap is closing every quarter. The lead will go to the countries whose enterprises have built the best underlying data infrastructure: ingestion pipelines, quality frameworks, governance systems, identity, lineage, and the people who know how to build all of that.
This is true for the economy as a whole and it is true for individual sectors. Logistics, healthcare, financial services, and manufacturing are the places where AI applied to good data will create huge value. They are also the places where data infrastructure is hardest, because the data is heterogeneous, regulated, and historically siloed.
Most of the people doing the visible work right now are working on agents and models. There is plenty of attention on that. But there is also a quieter set of engineers building the plumbing — and that work will, I think, end up being the more durable contribution.
What I am working on
A lot of my time is now spent on this kind of infrastructure work, both in my day job and in personal open-source projects. I started Floe recently — an early-stage open-source engine that targets one specific piece of this plumbing: keeping derived data tables fresh, automatically, as their sources change.
If that does not immediately mean anything, here is the simplest way I can put it:
Floe is like a spreadsheet formula, but for database tables. When you put =SUM(A1:A10) into a spreadsheet cell, the cell updates automatically when any of A1 through A10 changes. Floe does the same thing, except the “cells” are large database tables with millions of rows, the “formula” is a SQL query, and “recalculating” means re-running the SQL incrementally on the part of the data that actually changed.
The problem in concrete terms. Imagine you run an e-commerce business. Your raw data is a table called orders — every transaction as it happens. For analytics, dashboards, or AI, you do not want to query raw orders directly: it is messy, it does not have customer region joined in, refunds are not subtracted, currency is not normalized. You want a derived table — say, daily_sales_by_region — that already does all of those things, kept up to date as new orders arrive.
The standard way to build that today is to write a custom Spark or Redshift script that reads from orders, applies the transformations, writes the result to daily_sales_by_region, schedules itself to run every hour, monitors itself for failure, and gets rebuilt whenever the schema changes. Every team writes that boilerplate by hand for every derived table. A typical enterprise data platform might have hundreds of these. The result: most of an analytics team’s engineering time goes into orchestration plumbing rather than actual analysis.
What Floe does instead. You declare the derived table once, in plain SQL:
CREATE DYNAMIC TABLE silver.daily_sales_by_region
REFRESH_MODE = INCREMENTAL AS
SELECT region, SUM(amount_usd) AS total_sales
FROM bronze.orders
GROUP BY region;
That is the entire pipeline. Floe figures out which raw tables this one depends on, watches them for new data, runs the SELECT only on the rows that have changed since the last refresh (not the entire history), writes the result back, and stamps every output row with the upstream snapshot it was computed from so you can trace any result back to its source.
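The core idea of refreshing only what changed can be sketched in a few lines of Python. This is a conceptual illustration, not Floe's implementation: the `IncrementalSum` class, the `snapshot_id` field on source rows, and the assumption that the source is append-only are all made up for the example. It maintains the same `SUM(amount_usd) ... GROUP BY region` as the query above and records which upstream snapshot last contributed to each output row.

```python
from collections import defaultdict


class IncrementalSum:
    """Maintains SUM(amount_usd) GROUP BY region over an append-only source."""

    def __init__(self):
        self.totals = defaultdict(float)  # region -> running total
        self.row_snapshot = {}            # region -> latest snapshot folded in
        self.last_snapshot = 0            # high-water mark already processed

    def refresh(self, source_rows):
        """Fold in rows newer than the high-water mark; return how many."""
        # A real table format would read only the new snapshot's files;
        # filtering a list here just stands in for that.
        new_rows = [r for r in source_rows
                    if r["snapshot_id"] > self.last_snapshot]
        for row in new_rows:
            region = row["region"]
            self.totals[region] += row["amount_usd"]
            # Lineage stamp: which upstream snapshot this output row reflects.
            self.row_snapshot[region] = max(
                self.row_snapshot.get(region, 0), row["snapshot_id"])
        if new_rows:
            self.last_snapshot = max(r["snapshot_id"] for r in new_rows)
        return len(new_rows)

    def table(self):
        """Return (region, total_sales, source_snapshot) rows, sorted by region."""
        return sorted((region, total, self.row_snapshot[region])
                      for region, total in self.totals.items())
```

A sum over appended rows is the easy case, because new rows can simply be added to the running total. Updates, deletes, joins, and aggregates such as `MAX` or `COUNT(DISTINCT ...)` need more bookkeeping, and handling those correctly is where an incremental engine earns its keep.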
Floe does all of this on top of Apache Iceberg, an open table format that has rapidly become a de facto standard for data lakes. That means the resulting tables are readable by any modern data tool that supports Iceberg (Spark, DuckDB, Trino, Snowflake, Databricks, BigQuery) without vendor lock-in. You can switch your query engine without rewriting your data.
Why it matters for AI. Models do not consume raw data. The pipeline from “events streaming into a database” to “data ready for a model to use” is exactly the layer where derived tables live — and where most of the engineering cost in an enterprise AI project actually sits.
flowchart LR
A[("Raw tables<br/>orders, events, telemetry")] -->|Floe keeps fresh| B[("Derived tables<br/>cleaned, joined, aggregated")]
B --> C[Analytics dashboards]
B --> D[ML feature pipelines]
B --> E[LLM retrieval & agent context]
If Floe makes that middle layer dramatically cheaper to build and operate, more organizations can afford to do real AI on real production data — not just polished demos on cleaned-up training sets. That is the long-term bet.
It is early — v0.1 ships the core refresh engine and polling-based event-driven refresh; data quality routing, push-based event hooks, and multi-cloud deployment profiles are still on the roadmap. But the direction feels right. If “data infrastructure is the bottleneck for AI” is the thesis, then the most valuable thing I can spend nights and weekends on is not another agent demo — it is making the underlying plumbing better, more open, and more accessible to everyone trying to do real AI work.
If you are working on similar problems, I would love to hear from you.