What is data engineering, and why does AI need it?

2026-05-14

A friend recently read my last post on why data infrastructure is the real bottleneck for enterprise AI, and asked exactly the right question: what is a data pipeline, and why does AI need one in the first place? If you have used ChatGPT, Gemini, or Copilot but never built a model or a pipeline yourself, that question deserves a proper answer — and I realized I had never written one.

This post is that answer. It is a tour of the part of the AI stack that nobody puts in a product launch, but where most of the engineering effort actually goes — and the part where the most interesting opportunities to learn and contribute live, even if you do not have a machine learning background.

Start with a question: why does AI need anything between it and the data?

When you ask ChatGPT, Gemini, or Copilot a general-knowledge question, it “knows” the answer in some sense — the knowledge is baked into the model from training. That is the easy case.

Now imagine an AI system that is supposed to answer questions about your company. “How many packages did we ship to California last week?” The model was trained on the public internet, so it does not know anything about your company. To answer that question, it has to be given the relevant data — pulled from your databases, transformed into something it can read, and handed to it as context.

Here is the catch: the data you have in your databases is almost never in a form that the model — or anything else — can use directly.

What raw data actually looks like

Imagine you run an e-commerce business. Your “shipping data” is not one table. In reality it is:

An orders table in your transactional database, with one row per order placed.
A customers table in a separate system, with one row per customer.
A returns table, populated by a different system, with one row per item returned.
A shipments feed from your logistics provider’s API, refreshed every few minutes, with carrier and tracking number.
A delivery_photos bucket in object storage (Amazon S3, Azure Blob Storage, Google Cloud Storage), where the driver uploads a “proof of delivery” photo of the package at the doorstep when each shipment is completed. The file paths look like s3://delivery-photos/2026/05/14/<shipment_id>.jpg, but the metadata about which shipment a photo belongs to, when it was taken, and whether the customer disputed delivery lives back in another database.
A regions mapping in a Google Sheet that the operations team updates by hand, mapping zip codes to delivery regions.
A currency_rates feed from a financial data vendor, pulled hourly.

Notice that the data lives in different kinds of systems. Some of it sits in relational databases as neat rows and columns. Some of it sits as photos (or PDFs, audio clips, video) in object storage, where each file has a path but no schema. Some of it lives in spreadsheets that humans edit by hand. A complete picture of the business requires reading from all of these systems and assembling the pieces — and each kind of system has its own access pattern, access controls, and quirks.

To answer “how many packages did we ship to California last week, by carrier, and what were they worth in USD?”, you have to join several of these together. The customer’s California-ness lives in customers. The carrier lives in shipments. The dollar amount lives in orders, but in the customer’s local currency, so you have to convert it using currency_rates. You have to subtract returns. You have to group by carrier. You have to filter to last week.
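To make that concrete, here is a minimal sketch of the query, run against an in-memory SQLite database. The schema is a simplified assumption of mine: column names such as state, amount_local, and rate_to_usd, the carrier names, the sample rows, the exchange rate, and the hard-coded "last week" dates are all invented for illustration.

```python
import sqlite3

# Hypothetical, simplified schema. All names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     amount_local REAL, currency TEXT);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER,
                        carrier TEXT, shipped_on TEXT);
CREATE TABLE returns (order_id INTEGER);
CREATE TABLE currency_rates (currency TEXT PRIMARY KEY, rate_to_usd REAL);

INSERT INTO customers VALUES (1, 'CA'), (2, 'CA'), (3, 'NY');
INSERT INTO orders VALUES (10, 1, 100.0, 'USD'), (11, 2, 50.0, 'EUR'),
                          (12, 3, 80.0, 'USD'), (13, 1, 40.0, 'USD'),
                          (14, 2, 70.0, 'USD');
INSERT INTO shipments VALUES (100, 10, 'CarrierA', '2026-05-05'),
                             (101, 11, 'CarrierB', '2026-05-06'),
                             (102, 12, 'CarrierA', '2026-05-06'),
                             (103, 13, 'CarrierA', '2026-05-07'),
                             (104, 14, 'CarrierB', '2026-04-28');
INSERT INTO returns VALUES (13);
INSERT INTO currency_rates VALUES ('USD', 1.0), ('EUR', 1.25);  -- made-up rate
""")

rows = conn.execute("""
    SELECT s.carrier,
           COUNT(*) AS packages,
           SUM(o.amount_local * r.rate_to_usd) AS amount_usd
    FROM shipments s
    JOIN orders o ON o.order_id = s.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN currency_rates r ON r.currency = o.currency
    WHERE c.state = 'CA'                                       -- California only
      AND s.shipped_on BETWEEN '2026-05-04' AND '2026-05-10'   -- "last week"
      AND o.order_id NOT IN (SELECT order_id FROM returns)     -- drop returns
    GROUP BY s.carrier
    ORDER BY s.carrier
""").fetchall()

for carrier, packages, amount_usd in rows:
    print(carrier, packages, amount_usd)
```

Even in this toy version, five tables have to be in one place, with matching keys, before the question can be asked at all.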

None of that is AI work. It is data work. And no model in the world can do it without the data being prepared first. (Yes, you could in principle ask a model to write the SQL for you — modern models are quite good at this — but you still have to run that SQL against the actual data, on infrastructure that knows where every piece lives and can put them together reliably and on schedule.)

What is “data ingestion”?

Before any of that — before you can model the data, before you can build pipelines, before you can join customers to shipments — you have to actually get the data into a place where you can work with it. That step is called data ingestion, and it is its own engineering discipline. The phrase shows up regularly in job postings as “data ingestion engineer,” “data integration engineer,” or sometimes folded into “data platform engineer.”

Ingestion is the work of moving data from where it is generated — application databases, third-party APIs, IoT sensors, vendor file drops, mobile app event streams, photo uploads from a delivery driver’s phone — into a central place where it can be queried and combined. Concretely, this might mean:

Replicating new and changed rows from the production orders PostgreSQL database into your data warehouse every few minutes (a technique called change data capture, or CDC).
Pulling the daily currency_rates CSV from your vendor’s SFTP server every morning.
Subscribing to a Kafka topic of mobile app events and writing each event out as a Parquet file in S3.
Polling a third-party logistics API every five minutes and storing each response as a JSON record.
Copying delivery photos uploaded by drivers into your central object-storage bucket, with metadata recorded in a separate index so downstream queries can find them.
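As a rough illustration of the first item, here is a sketch of incremental ingestion using a timestamp watermark. This is a simpler cousin of CDC rather than log-based CDC itself, and everything in it is an assumption for the example: two in-memory SQLite databases stand in for the production database and the warehouse, and the raw_orders table, the updated_at column, and the ingest_new_orders function are hypothetical names.

```python
import sqlite3

def ingest_new_orders(source, warehouse):
    """Copy rows changed since the last run; return how many were copied."""
    # The watermark is the newest updated_at value already in the warehouse.
    (watermark,) = warehouse.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM raw_orders"
    ).fetchone()
    rows = source.execute(
        "SELECT order_id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE keys on order_id, so a changed row overwrites its
    # older copy instead of creating a duplicate.
    warehouse.executemany(
        "INSERT OR REPLACE INTO raw_orders VALUES (?, ?, ?)", rows
    )
    warehouse.commit()
    return len(rows)

# Stand-in for the production database (sample values are invented).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, "2026-05-14T09:00:00"), (2, 35.0, "2026-05-14T09:05:00")],
)

# Stand-in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE raw_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)

first_run = ingest_new_orders(source, warehouse)    # copies both existing rows
source.execute("INSERT INTO orders VALUES (3, 12.5, '2026-05-14T09:10:00')")
second_run = ingest_new_orders(source, warehouse)   # copies only the new row
third_run = ingest_new_orders(source, warehouse)    # nothing new to copy
```

A timestamp watermark like this cannot see rows that were deleted in the source, and it can miss rows that share the watermark's exact timestamp. Log-based CDC tools read the database's change log instead, which avoids both problems.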

Done well, ingestion is invisible: data shows up where downstream teams expect it, fresh, complete, and on schedule. Done poorly, every downstream team builds its own bespoke connection to every source, the same data gets pulled five different ways, nobody knows which copy is authoritative, and your cloud bill starts climbing mysteriously.

Ingestion is upstream of everything else in this post. Without it, there is nothing to model and nothing to transform.

Data is never clean

Here is something that is rarely covered in school: the data you ingest is never clean.

The customers table I described above is a good example. The same person might have signed up three times with slightly different details — john.smith@gmail.com, John.Smith@gmail.com, and jsmith@gmail.com (different email, probably same person) — with three slightly different mailing addresses (123 Main Street, 123 Main St, 123 Main St. Apt 4B) and three different phone formats. To the database, those are three customers. To your business, they are one. If you join orders directly against customers without resolving this, your “customers in California” count is overstated on day one, and the gap widens over time.

The shipments feed has its own duplicate problem: the carrier’s API sometimes emits the same shipment event twice (network retries, idempotency bugs upstream), so one physical shipment shows up as two rows with the same tracking number. If you do not catch this, your “packages shipped last week” number goes up artificially, and so does your estimated operational cost.

These are not edge cases. Every real-world data source has issues like this — duplicates, schema drift, late-arriving rows, time-zone confusion, units that disagree across systems, encoding bugs, sentinel values like 9999-12-31 standing in for “no date,” and so on. A substantial part of the data engineering job is finding these issues, writing rules to resolve them (often called entity resolution for the deduplication case), and continuously applying those rules as new data flows in. The cleaned, deduplicated, validated tables are what downstream consumers — including AI models — actually want to query. The raw tables are not safe to use directly.
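Two of the simplest rules of this kind can be sketched in a few lines. The record shapes, tracking numbers, and function names below are hypothetical; real entity resolution uses many more signals (addresses, phone numbers, fuzzy matching) than an exact match on a normalized email.

```python
def normalize_email(email):
    """Trim whitespace and lowercase, so case differences do not split a customer."""
    return email.strip().lower()

def dedupe_customers(customers):
    """Keep the first record seen for each normalized email address."""
    first_seen = {}
    for customer in customers:
        first_seen.setdefault(normalize_email(customer["email"]), customer)
    return list(first_seen.values())

def dedupe_shipments(events):
    """Keep the first event seen for each tracking number."""
    seen = set()
    unique = []
    for event in events:
        if event["tracking_number"] not in seen:
            seen.add(event["tracking_number"])
            unique.append(event)
    return unique

customers = [
    {"id": 1, "email": "john.smith@gmail.com"},
    {"id": 2, "email": "John.Smith@gmail.com"},
    {"id": 3, "email": "jsmith@gmail.com"},
]
events = [
    {"tracking_number": "TRK-001", "carrier": "CarrierA"},
    {"tracking_number": "TRK-001", "carrier": "CarrierA"},  # retry duplicate
    {"tracking_number": "TRK-002", "carrier": "CarrierB"},
]

unique_customers = dedupe_customers(customers)
unique_shipments = dedupe_shipments(events)
print(len(unique_customers), len(unique_shipments))
```

Note what the email rule does not catch: jsmith@gmail.com survives as a separate customer, because no amount of normalization turns it into john.smith@gmail.com. Deciding whether those two are the same person takes additional evidence, which is why this work never fully ends.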

Modern AI makes this worse, not better. If you feed a model an inflated customer count or a duplicated shipment, the model will confidently report the wrong answer. It will not flag the duplicates for you. Whatever sloppiness exists in your data, the AI inherits — and produces wrong answers at a much larger scale than a human analyst would have.

What is “data modeling”?

The set of decisions about what shape your data should take to be useful is what data engineers call data modeling. It is design work, not coding.

For our e-commerce example, a data engineer would decide that:

There will be a clean, deduplicated customers table with one row per real-world customer (resolving duplicate sign-ups, normalizing email addresses, deciding what “customer” even means when a household shares an account).
There will be a daily_shipments table with one row per day, region, and carrier, in USD, with the joins to customers and currency rates already done.
There will be a monthly_revenue_by_region table aggregated up from the daily one, ready for the finance team.

Each of these is a derived table — meaning it is computed from the raw tables, not directly captured. The data engineer’s job is to decide what derived tables should exist, what columns they should have, and how they are computed from raw data.
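A derived table can be as small as one aggregation over another table. The sketch below builds monthly_revenue_by_region from daily_shipments in an in-memory SQLite database; the column names (ship_date, revenue_usd), the assumption that daily_shipments is already aggregated by day, region, and carrier, and the sample rows are mine, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical shape for the daily table; values are invented.
CREATE TABLE daily_shipments (ship_date TEXT, region TEXT,
                              carrier TEXT, revenue_usd REAL);
INSERT INTO daily_shipments VALUES
    ('2026-04-30', 'West', 'CarrierA', 100.0),
    ('2026-05-01', 'West', 'CarrierA', 50.0),
    ('2026-05-02', 'West', 'CarrierB', 25.0),
    ('2026-05-02', 'East', 'CarrierA', 40.0);

-- The derived table: computed from daily_shipments, never captured directly.
CREATE TABLE monthly_revenue_by_region AS
SELECT substr(ship_date, 1, 7) AS month,   -- 'YYYY-MM'
       region,
       SUM(revenue_usd) AS revenue_usd
FROM daily_shipments
GROUP BY substr(ship_date, 1, 7), region;
""")

monthly = conn.execute(
    "SELECT month, region, revenue_usd FROM monthly_revenue_by_region "
    "ORDER BY month, region"
).fetchall()
for row in monthly:
    print(row)
```

The modeling decisions are the parts that look trivial here: that a month is a calendar month, that revenue is summed across carriers, and that region is the grouping the finance team cares about.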

This is design work, like designing a database schema or an API. Two different teams given the same raw data will produce very different derived tables, with very different downstream consequences. Good data modeling pays back for years. Bad data modeling produces years of pain.

What is a “data pipeline”?

OK, so you have decided which derived tables you want. Now you need to compute and maintain them. That is what a data pipeline is: the machinery that reads from raw tables, transforms the data according to your model, and writes the results into derived tables — kept fresh as new raw data arrives.

A pipeline for the daily_shipments table might look like:

Every hour, read all new rows from orders and shipments since the last run.
Join them with customers to attach region and segment.
Join with currency_rates to convert to USD.
Subtract any returns recorded in the returns table for orders in this batch.
Group by day, region, and carrier.
Write the result into daily_shipments, replacing the day’s rows if they were partially recomputed.
Log what was processed, in case something goes wrong and you need to backfill.
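The steps above can be sketched as a single function. This is a toy version under stated assumptions: SQLite stands in for the warehouse, the schema, the pipeline_log table, and the rebuild_day function are hypothetical, and it rebuilds one whole day per call rather than reading only rows that arrived since the last run. The part worth noticing is the delete-then-insert inside one transaction, which makes a re-run or backfill replace the day's rows instead of doubling them.

```python
import sqlite3

def rebuild_day(conn, day):
    """Recompute daily_shipments for one day. Safe to run repeatedly."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM daily_shipments WHERE ship_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_shipments
            SELECT s.ship_date, c.region, s.carrier,
                   COUNT(*), SUM(o.amount_local * r.rate_to_usd)
            FROM shipments s
            JOIN orders o ON o.order_id = s.order_id
            JOIN customers c ON c.customer_id = o.customer_id
            JOIN currency_rates r ON r.currency = o.currency
            WHERE s.ship_date = ?
              AND o.order_id NOT IN (SELECT order_id FROM returns)
            GROUP BY s.ship_date, c.region, s.carrier
            """,
            (day,),
        )
        (written,) = conn.execute(
            "SELECT COUNT(*) FROM daily_shipments WHERE ship_date = ?", (day,)
        ).fetchone()
        # Record what was processed, to help with later debugging or backfills.
        conn.execute("INSERT INTO pipeline_log VALUES (?, ?)", (day, written))
    return written

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     amount_local REAL, currency TEXT);
CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER,
                        carrier TEXT, ship_date TEXT);
CREATE TABLE returns (order_id INTEGER);
CREATE TABLE currency_rates (currency TEXT PRIMARY KEY, rate_to_usd REAL);
CREATE TABLE daily_shipments (ship_date TEXT, region TEXT, carrier TEXT,
                              shipments INTEGER, revenue_usd REAL);
CREATE TABLE pipeline_log (day TEXT, rows_written INTEGER);

INSERT INTO customers VALUES (1, 'West'), (2, 'East');
INSERT INTO orders VALUES (10, 1, 100.0, 'USD'), (11, 1, 40.0, 'EUR'),
                          (12, 2, 60.0, 'USD'), (13, 2, 30.0, 'USD');
INSERT INTO shipments VALUES (100, 10, 'CarrierA', '2026-05-14'),
                             (101, 11, 'CarrierA', '2026-05-14'),
                             (102, 12, 'CarrierB', '2026-05-14'),
                             (103, 13, 'CarrierB', '2026-05-14');
INSERT INTO returns VALUES (13);
INSERT INTO currency_rates VALUES ('USD', 1.0), ('EUR', 1.25);  -- made-up rate
""")

first = rebuild_day(conn, "2026-05-14")
second = rebuild_day(conn, "2026-05-14")  # re-run: same rows, no duplicates
result = conn.execute(
    "SELECT region, carrier, shipments, revenue_usd "
    "FROM daily_shipments ORDER BY region"
).fetchall()
print(first, second, result)
```

A production pipeline would add scheduling, retries, alerting, and schema checks around this core, but the idempotent write is what lets you fix a bug and simply run the day again.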

Each step is code — often SQL, sometimes Python, sometimes Spark or Flink — running on a server somewhere, on a schedule (or in response to an event). A typical enterprise has hundreds of these pipelines, all running, all needing to be monitored, fixed when an upstream source goes down, updated when a schema changes, and re-run when a bug is found.

This is the “data engineering” part of the stack. It is most of the engineering work that goes into a real-world AI system.

The layers, top to bottom

Once you internalize the idea, the modern data stack falls into a few layers:

Sources — where data is created. Application databases, event streams, third-party APIs, files dropped into a folder, sensors.
Storage — where data is kept once you collect it. Data warehouses (Snowflake, BigQuery) and data lakes (Amazon S3 with Apache Iceberg, Databricks) are the two main flavors. The distinction matters less than it used to.
Transformation — the pipelines that turn raw data into derived tables. This is where tools like dbt, Snowflake Dynamic Tables, Apache Spark, and the project I am building, Floe, live.
Serving — how downstream consumers read the cleaned, modeled data. BI dashboards (Tableau, Power BI), ML feature stores, vector indexes for LLM retrieval, and the AI applications themselves.

flowchart LR
    S[("Sources<br/>databases, events, APIs")] --> ST[("Storage<br/>warehouses, lakes")]
    ST --> T["Transformation<br/>(pipelines + modeling)"]
    T --> SV[("Derived tables<br/>daily_shipments, etc.")]
    SV --> D[Dashboards & BI]
    SV --> M[ML feature pipelines]
    SV --> A[LLM retrieval / agents]

Most enterprise AI ambition lives in the last layer on the right — the ML feature pipelines and LLM retrieval / agents boxes. But most of the engineering cost lives in the Transformation layer in the middle. That is the gap I find most interesting to work on, and the gap most enterprises are still digging out of.

Why this exists at all

Here is the honest answer to “why does AI need a data pipeline?”: the same reason a CEO does, the same reason your finance team does, the same reason your customer-success dashboard does. None of them can read raw transactional data and produce useful answers either.

A data pipeline is not an AI thing. It is the universal layer that any consumer of company data — human or model — needs in order to work. AI just made it more visible, because (a) AI systems are extraordinarily sensitive to data quality, and (b) AI projects have higher visibility and bigger budgets, so when an AI rollout fails because of a broken pipeline, people notice in a way they did not notice when an executive dashboard had bad numbers.

The alternative — every team querying raw data directly, building their own transformations, getting different numbers, breaking when schemas change — does exist, in plenty of companies. It is what people mean when they say “our data is a mess.” It is also what most data engineers spend their careers trying to fix.

Why this is good news if you are learning

If you have been hearing AI hype and feeling like the only way to participate is to learn how to train transformer models, here is the good news: most of the actual work, and most of the actual hiring, is in the layer I just described. SQL, data modeling, pipeline orchestration, and infrastructure are durable, learnable skills that do not require a PhD or a GPU cluster, and they are less prone to obsolescence than any specific modeling technique.

I write more about why this layer specifically is the bottleneck for enterprise AI — and what I am building to make it easier — in the companion piece. If the layered picture above made sense, that one will too.