Data infrastructure is the bottleneck for enterprise AI, not the models

2026-05-05

Most discussion of enterprise AI is about models. This post is about the layer underneath them: the data infrastructure that quietly decides whether any of that AI actually works. It builds up from the ground.

The first half is a plain-language primer: what a data pipeline even is, and why an AI system needs one. If you have used ChatGPT, Gemini, or Copilot but never built a model or a pipeline yourself, start here. The second half is the argument I actually want to make, that this layer (not the model) is the real bottleneck for enterprise AI. If you already build pipelines for a living, skip ahead to why the data is the bottleneck.

Start with a question: why does AI need anything between it and the data?

When you ask ChatGPT, Gemini, or Copilot a general-knowledge question, it “knows” the answer in some sense: the knowledge is baked into the model from training. That is the easy case.

Now imagine an AI system that is supposed to answer questions about your company. “How many packages did we ship to California last week?” The model was trained on the public internet, so it does not know anything about your company. To answer that question, the relevant data has to be pulled from your databases, transformed into something the model can read, and handed to it as context.

Here is the catch: the data you have in your databases is almost never in a form that the model, or anything else, can use directly.

What raw data actually looks like

Imagine you run an e-commerce business. Your “shipping data” is not one table. In reality it is:

An orders table in your transactional database, with one row per order placed.
A customers table in a separate system, with one row per customer.
A returns table, populated by a different system, with one row per item returned.
A shipments feed from your logistics provider’s API, refreshed every few minutes, with carrier and tracking number.
A delivery_photos bucket in object storage (Amazon S3, Azure Blob Storage, Google Cloud Storage), where the driver uploads a “proof of delivery” photo of the package at the doorstep when each shipment is completed. The file paths look like s3://delivery-photos/2026/05/14/<shipment_id>.jpg, but the metadata about which shipment a photo belongs to, when it was taken, and whether the customer disputed delivery lives back in another database.
A regions mapping in a Google Sheet that the operations team updates by hand, mapping zip codes to delivery regions.
A currency_rates feed from a financial data vendor, pulled hourly.

Notice that the data lives in different kinds of systems. Some of it sits in relational databases as neat rows and columns. Some of it sits as photos (or PDFs, audio clips, video) in object storage, where each file has a path but no schema. Some of it lives in spreadsheets that humans edit by hand. A complete picture of the business requires reading from all of these systems and assembling the pieces, and each kind of system has its own access pattern, access controls, and quirks.

To answer “how many packages did we ship to California last week, by carrier, and what were they worth in USD”, you have to join several of these sources together. The customer’s California-ness lives in customers. The carrier lives in shipments. The dollar amount lives in orders, but in the customer’s local currency, so you have to convert it using currency_rates. You have to subtract returns. You have to group by carrier. You have to filter to last week.

None of that is AI work. It is data work. And no model in the world can do it without the data being prepared first. (Yes, you could in principle ask a model to write the SQL for you, and modern models are quite good at this. But you still have to run that SQL against the actual data, on infrastructure that knows where every piece lives and can put them together reliably and on schedule.)
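To make that join concrete, here is a minimal sketch using Python's built-in sqlite3 module and a tiny in-memory dataset. Everything specific in it is an assumption made for illustration: the column names (customer_id, state, amount, currency, usd_rate, shipped_on), the carrier names, the amounts, the EUR rate, and the choice to drop returned orders entirely rather than net them out.

```python
import sqlite3

# Tiny in-memory stand-in for the sources described above. All column names,
# carriers, amounts, and the EUR rate are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     amount REAL, currency TEXT);
CREATE TABLE shipments (order_id INTEGER, carrier TEXT, shipped_on TEXT);
CREATE TABLE returns (order_id INTEGER);
CREATE TABLE currency_rates (currency TEXT PRIMARY KEY, usd_rate REAL);

INSERT INTO customers VALUES (1, 'CA'), (2, 'CA'), (3, 'NY');
INSERT INTO orders VALUES (10, 1, 100.0, 'USD'), (11, 2, 50.0, 'EUR'),
                          (12, 3, 80.0, 'USD'), (13, 1, 40.0, 'USD');
INSERT INTO shipments VALUES (10, 'carrier_a', '2026-05-12'),
                             (11, 'carrier_b', '2026-05-13'),
                             (12, 'carrier_a', '2026-05-13'),
                             (13, 'carrier_a', '2026-05-14');
INSERT INTO returns VALUES (13);
INSERT INTO currency_rates VALUES ('USD', 1.0), ('EUR', 1.25);
""")


def shipped_by_carrier(conn, state, week_start, week_end):
    """Packages and USD value shipped to one state in a date range, by carrier."""
    query = """
        SELECT s.carrier,
               COUNT(*)                     AS packages,
               SUM(o.amount * r.usd_rate)   AS total_usd
        FROM shipments s
        JOIN orders o         ON o.order_id = s.order_id
        JOIN customers c      ON c.customer_id = o.customer_id
        JOIN currency_rates r ON r.currency = o.currency
        WHERE c.state = ?
          AND s.shipped_on BETWEEN ? AND ?   -- ISO dates compare correctly as text
          AND o.order_id NOT IN (SELECT order_id FROM returns)
        GROUP BY s.carrier
        ORDER BY s.carrier
    """
    return conn.execute(query, (state, week_start, week_end)).fetchall()


rows = shipped_by_carrier(conn, "CA", "2026-05-11", "2026-05-17")
for carrier, packages, total_usd in rows:
    print(carrier, packages, total_usd)
```

Even this toy version needs five tables to agree on keys, currencies, and date formats. That agreement is what the rest of this post is about.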

What is “data ingestion”?

Before any of that, before you can model the data, before you can build pipelines, before you can join customers to shipments, you have to actually get the data into a place where you can work with it. That step is called data ingestion, and it is its own engineering discipline. The phrase shows up regularly in job postings as “data ingestion engineer,” “data integration engineer,” or sometimes folded into “data platform engineer.”

Ingestion is the work of moving data from where it is generated (application databases, third-party APIs, IoT sensors, vendor file drops, mobile app event streams, photo uploads from a delivery driver’s phone) into a central place where it can be queried and combined. Concretely, this might mean:

Replicating new rows from the production orders PostgreSQL database into your data warehouse every few minutes (a technique called change data capture, or CDC).
Pulling the latest currency_rates CSV from your vendor’s SFTP server every hour.
Subscribing to a Kafka topic of mobile app events and writing each event out as a Parquet file in S3.
Polling a third-party logistics API every five minutes and storing each response as a JSON record.
Copying delivery photos uploaded by drivers into your central object-storage bucket, with metadata recorded in a separate index so downstream queries can find them.

Done well, ingestion is invisible: data shows up where downstream teams expect it, fresh, complete, and on schedule. Done poorly, every downstream team builds its own bespoke connection to every source, the same data gets pulled five different ways, nobody knows which copy is authoritative, and your cloud bill starts climbing mysteriously.
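As a rough illustration of incremental ingestion, the sketch below copies only rows that changed since the previous run, using a "high-water mark" on an updated_at column. This is a simplified, polling-based stand-in for change data capture, which in practice usually reads the database's change log instead. The table names (orders, raw_orders), the updated_at column, and the use of SQLite for both sides are assumptions for the sake of a runnable example.

```python
import sqlite3


def ingest_new_rows(source, warehouse):
    """Copy rows changed since the last run; return how many were copied."""
    # The watermark is the newest updated_at already present in the warehouse.
    (watermark,) = warehouse.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM raw_orders"
    ).fetchone()
    # Strict '>' is a simplification: a row that later arrives with exactly
    # the watermark timestamp would be missed.
    rows = source.execute(
        "SELECT order_id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Upsert on the primary key so an updated row replaces its old copy
    # instead of creating a duplicate.
    warehouse.executemany(
        "INSERT OR REPLACE INTO raw_orders (order_id, amount, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    warehouse.commit()
    return len(rows)


source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, "2026-05-14T10:00:00"), (2, 35.0, "2026-05-14T10:05:00")],
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE raw_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)

first_run = ingest_new_rows(source, warehouse)   # copies both rows
second_run = ingest_new_rows(source, warehouse)  # nothing new to copy

# One existing order changes and one new order arrives.
source.execute(
    "UPDATE orders SET amount = 25.0, updated_at = '2026-05-14T10:10:00' "
    "WHERE order_id = 1"
)
source.execute("INSERT INTO orders VALUES (3, 12.0, '2026-05-14T10:12:00')")
third_run = ingest_new_rows(source, warehouse)   # one update plus one insert
```

The design choice worth noticing is that the function is safe to re-run: a second call with no new data copies nothing, and an updated row overwrites its earlier copy.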

Ingestion is upstream of everything else in this post. Without it, there is nothing to model and nothing to transform.

Data is never clean

Here is something that is rarely covered in school: the data you ingest is never clean.

The customers table I described above is a good example. The same person might have signed up three times with slightly different details: john.smith@gmail.com, John.Smith@gmail.com, and jsmith@gmail.com (different email, probably same person), with three slightly different mailing addresses (123 Main Street, 123 Main St, 123 Main St. Apt 4B) and three different phone formats. To the database, those are three customers. To your business, they are one. If you join orders directly against customers without resolving this, your “customers in California” count is overstated on day one, and the gap widens over time.

The shipments feed has its own duplicate problem: the carrier’s API sometimes emits the same shipment event twice (network retries, idempotency bugs upstream), so one physical shipment shows up as two rows with the same tracking number. If you do not catch this, your “packages shipped last week” number goes up artificially, and so does your estimated operational cost.

These are not edge cases. Every real-world data source has issues like this: duplicates, schema drift, late-arriving rows, time-zone confusion, units that disagree across systems, encoding bugs, sentinel values like 9999-12-31 standing in for “no date,” and so on. A substantial part of the data engineering job is finding these issues, writing rules to resolve them (often called entity resolution for the deduplication case), and continuously applying those rules as new data flows in. The cleaned, deduplicated, validated tables are what downstream consumers, including AI models, actually want to query. The raw tables are not safe to use directly.
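Two of the simplest cleaning rules from the examples above can be sketched in a few lines of Python. The record layouts and the tracking numbers are hypothetical, and the rules are deliberately naive: lowercasing an email merges John.Smith with john.smith, but it does not recognize jsmith as the same person. That case needs fuzzier matching on name, address, or phone, which is where real entity resolution gets hard.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so case-only variants compare equal."""
    return email.strip().lower()


def dedupe_customers(customers):
    """Keep the first record seen for each normalized email address."""
    unique = {}
    for customer in customers:
        unique.setdefault(normalize_email(customer["email"]), customer)
    return list(unique.values())


def dedupe_shipments(events):
    """Drop repeated events that carry a tracking number already seen."""
    seen = set()
    result = []
    for event in events:
        if event["tracking_number"] in seen:
            continue
        seen.add(event["tracking_number"])
        result.append(event)
    return result


raw_customers = [
    {"email": "john.smith@gmail.com", "address": "123 Main Street"},
    {"email": "John.Smith@gmail.com", "address": "123 Main St"},
    {"email": "jsmith@gmail.com", "address": "123 Main St. Apt 4B"},
]
raw_shipments = [
    {"tracking_number": "TRK-001", "carrier": "carrier_a"},
    {"tracking_number": "TRK-001", "carrier": "carrier_a"},  # retried event
    {"tracking_number": "TRK-002", "carrier": "carrier_b"},
]

clean_customers = dedupe_customers(raw_customers)
clean_shipments = dedupe_shipments(raw_shipments)
print(len(raw_customers), "->", len(clean_customers), "customers")
print(len(raw_shipments), "->", len(clean_shipments), "shipments")
```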

Modern AI makes this worse, not better. If you feed a model an inflated customer count or a duplicated shipment, the model will confidently report the wrong answer. It will not flag the duplicates for you. The AI inherits whatever sloppiness exists in your data and produces wrong answers at a much larger scale than a human analyst would.

What is “data modeling”?

The set of decisions about what shape your data should take to be useful is what data engineers call data modeling. It is design work, not coding.

For our e-commerce example, a data engineer would decide that:

There will be a clean, deduplicated customers table with one row per real-world customer (resolving duplicate sign-ups, normalizing email addresses, deciding what “customer” even means when a household shares an account).
There will be a daily_shipments table with one row per day, region, and carrier, with amounts converted to USD and the customer-derived attributes already joined in.
There will be a monthly_revenue_by_region table aggregated up from the daily one, ready for the finance team.

Each of these is a derived table, meaning it is computed from the raw tables, not directly captured. The data engineer’s job is to decide what derived tables should exist, what columns they should have, and how they are computed from raw data.

This is design work, like designing a database schema or an API. Two different teams given the same raw data will produce very different derived tables, with very different downstream consequences. Good data modeling pays back for years. Bad data modeling produces years of pain.

What is a “data pipeline”?

OK, so you have decided which derived tables you want. Now you need to compute and maintain them. That is what a data pipeline is: the machinery that reads from raw tables, transforms the data according to your model, and writes the results into derived tables, kept fresh as new raw data arrives.

A pipeline for the daily_shipments table might look like:

Every hour, read all new rows from orders and shipments since the last run.
Join them with customers to attach region and segment.
Join with currency_rates to convert to USD.
Subtract any returns recorded in the returns table for orders in this batch.
Group by day, region, and carrier.
Write the result into daily_shipments, replacing the day’s rows if they were partially recomputed.
Log what was processed, in case something goes wrong and you need to backfill.

Each step is code, often SQL, sometimes Python, sometimes Spark or Flink, running on a server somewhere, on a schedule (or in response to an event). A typical enterprise has hundreds of these pipelines, all running, all needing to be monitored, fixed when an upstream source goes down, updated when a schema changes, and re-run when a bug is found.
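The steps above can be sketched as a single re-runnable function. This is a simplified illustration, not a production pipeline: it recomputes one whole day at a time instead of reading only new rows, it uses SQLite in place of a warehouse, and all column names plus the pipeline_log table are assumptions. The part worth noticing is the delete-then-insert inside one transaction, which makes a re-run or backfill replace the day's rows instead of doubling them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     amount REAL, currency TEXT);
CREATE TABLE shipments (order_id INTEGER, carrier TEXT, shipped_on TEXT);
CREATE TABLE returns (order_id INTEGER);
CREATE TABLE currency_rates (currency TEXT PRIMARY KEY, usd_rate REAL);
CREATE TABLE daily_shipments (day TEXT, region TEXT, carrier TEXT,
                              packages INTEGER, total_usd REAL);
CREATE TABLE pipeline_log (day TEXT, rows_written INTEGER);

-- Made-up sample data.
INSERT INTO customers VALUES (1, 'west'), (2, 'east');
INSERT INTO orders VALUES (10, 1, 100.0, 'USD'), (11, 1, 40.0, 'USD'),
                          (12, 2, 50.0, 'EUR');
INSERT INTO shipments VALUES (10, 'carrier_a', '2026-05-14'),
                             (11, 'carrier_a', '2026-05-14'),
                             (12, 'carrier_b', '2026-05-14');
INSERT INTO returns VALUES (11);
INSERT INTO currency_rates VALUES ('USD', 1.0), ('EUR', 1.25);
""")


def refresh_day(conn, day):
    """Recompute daily_shipments for one day; safe to run more than once."""
    with conn:  # one transaction: both writes commit together or roll back
        conn.execute("DELETE FROM daily_shipments WHERE day = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_shipments (day, region, carrier, packages, total_usd)
            SELECT s.shipped_on, c.region, s.carrier,
                   COUNT(*), SUM(o.amount * r.usd_rate)
            FROM shipments s
            JOIN orders o         ON o.order_id = s.order_id
            JOIN customers c      ON c.customer_id = o.customer_id
            JOIN currency_rates r ON r.currency = o.currency
            WHERE s.shipped_on = ?
              AND o.order_id NOT IN (SELECT order_id FROM returns)
            GROUP BY s.shipped_on, c.region, s.carrier
            """,
            (day,),
        )
        (written,) = conn.execute(
            "SELECT COUNT(*) FROM daily_shipments WHERE day = ?", (day,)
        ).fetchone()
        # Record what was processed, to help with debugging and backfills.
        conn.execute("INSERT INTO pipeline_log VALUES (?, ?)", (day, written))
    return written


first = refresh_day(conn, "2026-05-14")
second = refresh_day(conn, "2026-05-14")  # re-run: same rows, no duplicates
result = conn.execute(
    "SELECT * FROM daily_shipments ORDER BY region"
).fetchall()
```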

This is the “data engineering” part of the stack. It is most of the engineering work that goes into a real-world AI system.

The layers, top to bottom

Once you internalize the idea, the modern data stack falls into a few layers:

Sources — where data is created. Application databases, event streams, third-party APIs, files dropped into a folder, sensors.
Storage — where data is kept once you collect it. Data warehouses (Snowflake, BigQuery) and data lakes (Amazon S3 with Apache Iceberg, Databricks) are the two main flavors. The distinction matters less than it used to.
Transformation — the pipelines that turn raw data into derived tables. This is where tools like dbt, Snowflake Dynamic Tables, Apache Spark, and the project I am building, Floe, live.
Serving — how downstream consumers read the cleaned, modeled data. BI dashboards (Tableau, Power BI), ML feature stores, vector indexes for LLM retrieval, and the AI applications themselves.

flowchart LR
    S[("Sources<br/>databases, events, APIs")] --> ST[("Storage<br/>warehouses, lakes")]
    ST --> T["Transformation<br/>(pipelines + modeling)"]
    T --> SV[("Derived tables<br/>daily_shipments, etc.")]
    SV --> D[Dashboards &amp; BI]
    SV --> M[ML feature pipelines]
    SV --> A[LLM retrieval / agents]

Most enterprise AI ambition lives in the last layer on the right, the ML feature pipelines and LLM retrieval / agents boxes. But most of the engineering cost lives in the Transformation layer in the middle. That is the gap I find most interesting to work on, and the gap most enterprises are still digging out of.

Why this exists at all

Here is the honest answer to “why does AI need a data pipeline?”: the same reason a CEO does, the same reason your finance team does, the same reason your customer-success dashboard does. None of them can read raw transactional data and produce useful answers either.

A data pipeline is not an AI thing. It is the universal layer that any consumer of company data, human or model, needs in order to work. AI just made it more visible, because (a) AI systems are extraordinarily sensitive to data quality, and (b) AI projects have higher visibility and bigger budgets, so when an AI rollout fails because of a broken pipeline, people notice in a way they did not notice when an executive dashboard had bad numbers.

The alternative (every team querying raw data directly, building its own transformations, getting different numbers, and breaking when schemas change) does exist in plenty of companies. It is what people mean when they say “our data is a mess.” It is also what most data engineers spend their careers trying to fix.

Why this is good news if you are learning

If you have been hearing AI hype and feeling like the only way to participate is to learn how to train transformer models, here is the good news: most of the actual work, and most of the actual hiring, is in the layer I just described. SQL, data modeling, pipeline orchestration, infrastructure: these are durable, learnable skills that do not require a PhD or a GPU cluster, and they are less prone to obsolescence than any specific modeling technique.

That layered picture is also exactly why I think this layer, not the model, is where enterprise AI is really won or lost. After several years of building data infrastructure in logistics and now enterprise AI, my view is blunt: the real bottleneck is data infrastructure, and most companies will never get the full value of AI until they fix it. The GIGO (“garbage in, garbage out”) principle has been around for as long as computing itself, and it applies in full force to AI: a frontier model fed stale, biased, malformed, or fragmented data produces stale, biased, malformed, or fragmented outputs, at machine speed and machine scale. The model is the easy part. The rest of this post is why that bottleneck, both at the company level and the country level, depends far more on the underlying data plumbing than on which model sits on top of it.

Models are getting commoditized fast

OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, Qwen: there are now more than half a dozen credible frontier model providers, and the gap between them on most enterprise tasks is small. Open-weight models released a year ago were roughly comparable to the proprietary models available when they came out. Inference cost per token has fallen by orders of magnitude.

What that means in practice for an enterprise: the model you choose is not your moat. Anyone can swap from one model to another with a few config changes. If your competitive advantage is “we use GPT-5 instead of GPT-4,” your competitive advantage will evaporate the day GPT-6 ships.

The actual moat is what you feed the model. Specifically:

The data your AI has access to that nobody else has
The freshness of that data
The quality, structure, and completeness of that data
Your ability to feed it to the model reliably, at scale, with appropriate filtering and grounding

This is all infrastructure. None of it is “AI” in the popular sense.

The data problem is bigger than the model problem

Anyone who has worked on a large-scale data pipeline knows that the messy reality of production data is the real engineering challenge. Data arrives late, in the wrong format, with missing fields, with duplicate records, with schema changes that nobody told you about, with timezone confusion, with units that disagree across upstream systems. A pipeline that handles all of this (one that catches anomalies before they hit production, degrades gracefully when an upstream source goes down, and maintains lineage so you can debug what happened three weeks ago) is not a weekend project. It is years of infrastructure work.

When companies bolt an LLM on top of a data system that has been duct-taped together for a decade, the LLM does not magically clean things up. It cheerfully produces confident-sounding answers based on whatever it was given. If the data was wrong, the answer is wrong, just expressed in better English.

I have seen this pattern repeatedly. At Amazon, the team I was on ran event-driven pipelines processing very large volumes of delivery performance data every day — the infrastructure underneath the incentive program for the company’s last-mile delivery partners across multiple countries. The hard problem there was not the analytics on top, it was attribution: a delivery performance metric that does not distinguish controllable defects from uncontrollable ones — bad weather, traffic, system errors, customer-side data errors — produces unfair signals no matter how sophisticated the downstream model is. Getting that right meant joining heterogeneous signals (GPS, weather, traffic, system logs, address validation) at the per-event level. Until your infrastructure can do that, no model on top can make decisions that people will trust.

At Microsoft, I see the same shape of problem in enterprise AI. I work on platform infrastructure that supports products like Copilot, Power Apps, Dynamics 365, and Power Automate across Azure regions globally, including government cloud environments, with high availability targets. The platform itself is extremely reliable. What tends to be missing is on the other side of the API: customers adopting these tools want agents, RAG (retrieval-augmented generation), tool use, copilots embedded in every workflow — but they have a tangle of internal systems, inconsistent identity, stale documents, manual approvals, ad-hoc data sources. The AI works fine in a demo and falls over in production, because the production data infrastructure on the customer side is not ready.

What “good” data infrastructure looks like

If I had to name the properties that distinguish data infrastructure good enough for AI from infrastructure that is not, I would list four:

1. Reliability. Pipelines run on schedule, retry on failure, alert on anomalies, and never silently produce wrong outputs. If a piece of data is bad, the system knows and routes it to a quarantine queue rather than letting it propagate.

2. Freshness. Data flows from source to consumer with predictable, bounded latency. For many enterprise AI applications, freshness measured in minutes rather than hours or days is the difference between useful and useless.

3. Quality validation at ingest. Bad data does not enter the system. Schema enforcement, statistical anomaly detection, and reference-data validation catch issues before they reach the model.

4. Lineage. When a downstream system produces a wrong answer, you can trace exactly which upstream record caused it, how it was transformed along the way, and when. Without lineage, you cannot debug AI failures.

Notably, none of these is an AI feature. They are good engineering practices that have existed for decades. What is new is that we can no longer treat them as optional: once you put an AI model on top, every weakness in the underlying infrastructure gets amplified.
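Properties 1 and 3 are easier to picture with a small example. The sketch below checks each incoming record against a few rules and routes failures to a quarantine list along with the reasons, so bad data is visible rather than silently propagated. The field names and the rules themselves are assumptions chosen for illustration. A real system would add statistical checks and reference-data lookups, and would write quarantined records somewhere durable.

```python
from datetime import date

# Placeholder value some systems use to mean "no date".
SENTINEL_DATE = date(9999, 12, 31)

# Hypothetical schema for one shipment record: field name -> expected type.
REQUIRED_FIELDS = {"tracking_number": str, "carrier": str, "shipped_on": date}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    if record.get("shipped_on") == SENTINEL_DATE:
        problems.append("sentinel date in shipped_on")
    return problems


def route(records):
    """Split records into accepted ones and quarantined ones with reasons."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined


incoming = [
    {"tracking_number": "TRK-001", "carrier": "carrier_a",
     "shipped_on": date(2026, 5, 14)},
    {"tracking_number": "TRK-002", "shipped_on": date(2026, 5, 14)},
    {"tracking_number": "TRK-003", "carrier": "carrier_b",
     "shipped_on": SENTINEL_DATE},
]

accepted, quarantined = route(incoming)
for item in quarantined:
    print(item["record"]["tracking_number"], item["problems"])
```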

Why this matters at the country level

I think the same argument scales. The countries that will lead in enterprise AI are not the ones with the best models — that race is being run by a handful of companies and the gap is closing every quarter. The lead will go to the countries whose enterprises have built the best underlying data infrastructure: ingestion pipelines, quality frameworks, governance systems, identity, lineage, and the people who know how to build all of that.

This is true for the economy as a whole and it is true for individual sectors. Logistics, healthcare, financial services, and manufacturing are the places where AI applied to good data will create huge value. They are also the places where data infrastructure is hardest, because the data is heterogeneous, regulated, and historically siloed.

Most of the people doing the visible work right now are working on agents and models. There is plenty of attention on that. But there is also a quieter set of engineers building the plumbing — and that work will, I think, end up being the more durable contribution.

What I am working on

A lot of my time is now spent on this kind of infrastructure work, both in my day job and in personal open-source projects. I started Floe recently — an early-stage open-source engine that targets one specific piece of this plumbing: keeping derived data tables fresh, automatically, as their sources change.

If that does not immediately mean anything, here is the simplest way I can put it:

Floe is like a spreadsheet formula, but for database tables. When you put =SUM(A1:A10) into a spreadsheet cell, the cell updates automatically when any of A1 through A10 changes. Floe does the same thing, except the “cells” are large database tables with millions of rows, the “formula” is a SQL query, and “recalculating” means re-running the SQL incrementally on the part of the data that actually changed.

The problem in concrete terms. Imagine you run an e-commerce business. Your raw data is a table called orders — every transaction as it happens. For analytics, dashboards, or AI, you do not want to query raw orders directly: it is messy, it does not have customer region joined in, refunds are not subtracted, currency is not normalized. You want a derived table — say, daily_sales_by_region — that already does all of those things, kept up to date as new orders arrive.

The standard way to build that today is to write a custom Spark or Redshift script that reads from orders, applies the transformations, writes the result to daily_sales_by_region, schedules itself to run every hour, monitors itself for failure, and gets rebuilt whenever the schema changes. Every team writes that boilerplate by hand for every derived table. A typical enterprise data platform might have hundreds of these. The result: most of an analytics team’s engineering time goes into orchestration plumbing rather than actual analysis.

What Floe does instead. You declare the derived table once, in plain SQL:

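-- Assumes bronze.orders has an order_date column holding the order's calendar day.
CREATE DYNAMIC TABLE silver.daily_sales_by_region
  REFRESH_MODE = INCREMENTAL AS
  SELECT order_date, region, SUM(amount_usd) AS total_sales
  FROM bronze.orders
  GROUP BY order_date, region;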

That is the entire pipeline. Floe figures out which raw tables this one depends on, watches them for new data, runs the SELECT only on the rows that have changed since the last refresh (not the entire history), writes the result back, and stamps every output row with the upstream snapshot it was computed from so you can trace any result back to its source.
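To give a feel for what "only the rows that have changed" can mean for a grouped SUM, here is a toy Python sketch. It is not Floe's implementation, which this post does not describe in detail. It only shows the general idea of folding a new batch into existing totals and remembering which source snapshot last touched each group. The class name, the snapshot ids, and the sample rows are all hypothetical, and the sketch handles appended rows only. Updates and deletes would need signed deltas.

```python
from collections import defaultdict


class IncrementalSum:
    """Maintains SUM(amount) per group key by reading only newly added rows."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.source_snapshot = {}  # group key -> last snapshot that changed it
        self.rows_read = 0

    def refresh(self, new_rows, snapshot_id):
        """Fold one batch of (key, amount) rows into the running totals."""
        for key, amount in new_rows:
            self.totals[key] += amount
            self.source_snapshot[key] = snapshot_id
            self.rows_read += 1
        return dict(self.totals)


def full_recompute(all_rows):
    """Reference answer: re-read the entire history every time."""
    totals = defaultdict(float)
    for key, amount in all_rows:
        totals[key] += amount
    return dict(totals)


# Group keys are regions here for brevity.
batch_1 = [("west", 100.0), ("east", 50.0)]
batch_2 = [("west", 25.0)]

view = IncrementalSum()
view.refresh(batch_1, snapshot_id=1)
incremental = view.refresh(batch_2, snapshot_id=2)

# Same answer as re-reading everything, but the second refresh read one row.
print(incremental == full_recompute(batch_1 + batch_2))
print(view.source_snapshot)
```

With two small batches the saving is trivial. With millions of historical rows and a few thousand new ones per refresh, reading only the new rows is what makes frequent refreshes affordable.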

It does all of this on top of Apache Iceberg, an open table format that has been widely adopted for data lakes. That means the resulting tables are readable by many modern data tools (Spark, DuckDB, Trino, Snowflake, Databricks, BigQuery) without vendor lock-in. You can switch your query engine without rewriting your data.

Why it matters for AI. Models do not consume raw data. The pipeline from “events streaming into a database” to “data ready for a model to use” is exactly the layer where derived tables live — and where most of the engineering cost in an enterprise AI project actually sits.

flowchart LR
    A[("Raw tables<br/>orders, events, telemetry")] -->|Floe keeps fresh| B[("Derived tables<br/>cleaned, joined, aggregated")]
    B --> C[Analytics dashboards]
    B --> D[ML feature pipelines]
    B --> E[LLM retrieval &amp; agent context]

If Floe makes that middle layer dramatically cheaper to build and operate, more organizations can afford to do real AI on real production data — not just polished demos on cleaned-up training sets. That is the long-term bet.

It is early — v0.1 ships the core refresh engine and polling-based event-driven refresh; data quality routing, push-based event hooks, and multi-cloud deployment profiles are still on the roadmap. But the direction feels right. If “data infrastructure is the bottleneck for AI” is the thesis, then the most valuable thing I can spend nights and weekends on is not another agent demo — it is making the underlying plumbing better, more open, and more accessible to everyone trying to do real AI work.

If you are working on similar problems, I would love to hear from you.