From manual script to production pipeline

2026-04-20

A pattern I have seen in nearly every company I have worked at, regardless of size or sophistication: somewhere in the engineering org, there is a critical data pipeline that is “automated” only in the loosest sense. An engineer runs a script. The script does some things. Sometimes it works. Sometimes it produces malformed output and somebody has to spend two hours figuring out what went wrong, manually cleaning up, and re-running.

These pipelines are not automated. They are manually operated, with code as an implementation detail. And they are everywhere: in startups, in scale-ups, even at FAANG companies. They tend to accumulate in places that the company does not yet consider “real” infrastructure: insurance data ingestion, partner onboarding, billing reconciliation, internal reporting feeds.

The interesting engineering question is not “should we automate this” — usually the answer is yes, eventually. The harder questions are: when does the cost of leaving it manual exceed the cost of automating? And when you do automate, what does the right replacement architecture look like?

I rebuilt one of these pipelines at a health-tech startup during the COVID-19 pandemic, going from manually executed scripts to a worker-based automated system. Throughput went up 8x and engineering toil dropped to near zero. Here is what I learned about when to make this kind of investment and how to do it well.

Signs your manual pipeline has hit its expiration date

A manually-operated pipeline is fine when:

It runs rarely (once a week or less)
The data volume is small enough to validate by eye
Failures are rare and recoverable in minutes
The engineering cost to automate exceeds the cumulative manual operating cost

It needs to be replaced when:

The pipeline is on the critical path for the business. If failures or delays here cause customer-facing problems, manual operation is incompatible with reliability.
The frequency has gone up by 5–10x without anyone noticing. A pipeline that ran weekly two years ago and now runs daily was never re-evaluated.
The “fix” workflow is consuming significant engineering hours. If even one engineer spends a day a week recovering from pipeline failures, that is 50+ engineering days a year of pure overhead.
The error mode is silent. If the pipeline can produce wrong output without anyone knowing, that is the worst combination — the manual operator’s eye is the only quality gate, and the operator is busy.
The team is growing and the institutional knowledge is concentrated in one person. When the only engineer who knows how to run the script goes on vacation, you find out very fast how critical the pipeline was.
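
The cost comparison behind both lists can be put into rough numbers. The sketch below is a back-of-the-envelope break-even calculation, not a model from the rebuild described in this post. The function name and every figure in it (a 60-day rewrite, half a day a month of upkeep) are hypothetical; only the "one day a week" of recovery work comes from the list above.

```python
def breakeven_weeks(automation_days: float,
                    manual_days_per_week: float,
                    residual_days_per_week: float = 0.0) -> float:
    """Weeks until a one-off automation cost is repaid by the toil it removes."""
    saved_per_week = manual_days_per_week - residual_days_per_week
    if saved_per_week <= 0:
        return float("inf")  # automation never pays for itself
    return automation_days / saved_per_week


# Hypothetical inputs: a 60-day rewrite that replaces one day a week of
# failure recovery and leaves about half a day a month (0.125 days/week) of upkeep.
weeks = breakeven_weeks(automation_days=60,
                        manual_days_per_week=1.0,
                        residual_days_per_week=0.125)
print(f"break-even after roughly {weeks:.0f} weeks")
```

The number only covers engineering hours. Silent errors and single-person knowledge do not show up in it, which is why those signs deserve weight even when the arithmetic looks comfortable.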

In the case I rebuilt, all five of these were true at once. The original system was a Ruby script that processed insurance member data files dropped from carriers. Engineers ran it on demand. It frequently produced malformed records that had to be manually cleaned up. As the company scaled toward 350,000+ test kits per month, the pipeline went from “OK” to “active liability” within a few months.

When NOT to do this

I have also seen this kind of rewrite go badly. Common failure modes:

The team rewrites the pipeline before anyone agrees on the requirements. You end up with a beautiful new system that automates the wrong thing.
The team chooses an over-engineered architecture for a small problem. A 100-record-per-day pipeline does not need Kafka. A simple cron + script with proper logging is fine.
The team replaces the manual pipeline but does not change the upstream contracts. The new system is now silently producing the same garbage outputs the manual one did, just at higher throughput.

The point of automation is to reduce engineering toil and increase reliability. If the rewrite does not do those things, it does not matter how clean the architecture diagram is.

What “good” automation actually looks like

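The naive automation move, wrapping the unchanged script in a cron job, is usually a mistake. It keeps the worst aspects of manual operation (no observability, no error handling, no parallelism) and removes the one good aspect (a human in the loop catching obvious failures). Cron is a fine trigger for a small job, as noted above, but only once the script underneath logs and handles its own failures.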

A worker-based architecture is generally a better target. Concretely, the shape it converges on looks like this:

flowchart LR
    SRC[("Source")] --> ING["Ingest<br/>+ validate"]
    ING -->|"valid"| Q["Work queue<br/>(durable)"]
    ING -->|"invalid"| DLQ[("Quarantine /<br/>dead-letter")]
    Q --> W["Workers<br/>(idempotent, retried)"]
    W --> ST[("State + output<br/>store")]
    W -.failures.-> DLQ
    W --> OBS["Observability<br/>(metrics, logs, audit)"]

Everything in the checklist further down is just making each box in that diagram trustworthy.

This architecture has several properties that the wrapped-cron version does not:

Horizontal scaling. Run more workers, process more data. No pipeline rewrite needed.
Graceful degradation. If one record is malformed, that record alone fails. Everything else continues.
Self-healing on transient failures. A worker that crashes mid-task does not lose work; the queue redrives the item.
Visible health. When the queue is backing up, you see it. When malformed data is coming in, you see it.
No engineer in the loop on the happy path. Engineers only get involved when something genuinely needs human judgment.

If all you needed was the decision of whether and when, you can stop here. The rest of this post is the detailed version: the specific properties that turn that diagram into something you can leave running unattended. Think of it less as a tutorial and more as a list of questions to ask about any pipeline you would be paged for at 2am.

1. Durable job state

A script holds its state in memory and in the engineer’s head: which files it has processed, where it stopped, what it was about to do next. The moment the process dies, all of that is gone, and recovery means an engineer reconstructing it by hand.

A pipeline keeps that state somewhere durable: a queue, a database table, a job ledger. “We have processed records 1 through 4,000; 4,001 onward are still pending” should survive a crash, a deploy, and a restart. The test is simple: if you kill the process right now, can it pick up exactly where it left off without a human deciding where that was? If the answer lives only in someone’s memory, you have a script.
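
A minimal sketch of a job ledger, assuming SQLite as the durable store. The table layout and the function names (`open_ledger`, `enqueue`, `pending`, `mark_done`, `run`) are illustrative, not taken from the system described in this post. The `crash_after` parameter exists only to simulate the process dying mid-run.

```python
import sqlite3


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    # Use a file path (or a real database) in practice; ":memory:" keeps the demo self-contained.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " item_id TEXT PRIMARY KEY,"
        " status  TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, item_ids):
    # INSERT OR IGNORE makes enqueueing safe to repeat after a crash.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO jobs (item_id) VALUES (?)",
            [(item_id,) for item_id in item_ids],
        )


def pending(conn):
    rows = conn.execute(
        "SELECT item_id FROM jobs WHERE status = 'pending' ORDER BY item_id"
    ).fetchall()
    return [row[0] for row in rows]


def mark_done(conn, item_id):
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE item_id = ?", (item_id,))


def run(conn, handler, crash_after=None):
    """Process every pending item; the ledger, not the process, remembers progress."""
    for count, item_id in enumerate(pending(conn)):
        if crash_after is not None and count == crash_after:
            raise RuntimeError("simulated crash")
        handler(item_id)
        mark_done(conn, item_id)
```

After a crash, calling `run` again picks up only what is still marked pending. Note that the handler and `mark_done` are not atomic here, so an item can be handled twice if the process dies between them, which is the reason the next property matters.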

2. Idempotency

Idempotency means processing the same item twice produces the same result as processing it once. This sounds like a nicety. It is actually the property that makes every other reliability mechanism safe, because retries, redrives, and reruns all reprocess things that may have partially succeeded.

In practice this means: deduplicate on a stable key, make writes upserts rather than blind inserts, and design so that “did this already happen?” is a question the system can answer. Without idempotency, a retry after a partial failure double-counts, double-charges, or double-sends. With it, “just run it again” becomes a safe operation instead of a gamble, which is the difference between a pipeline you can operate and one you have to babysit.
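
As a small illustration of "upsert on a stable key", here is a sketch using SQLite's `ON CONFLICT ... DO UPDATE` clause (available in SQLite 3.24 and later). The `members` table, its columns, and the choice of `(carrier, member_id)` as the stable key are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members ("
    " carrier TEXT, member_id TEXT, plan TEXT,"
    " PRIMARY KEY (carrier, member_id))"  # hypothetical stable key
)


def upsert_member(conn, record):
    """Write a record so that replaying it leaves the table unchanged."""
    with conn:
        conn.execute(
            "INSERT INTO members (carrier, member_id, plan) "
            "VALUES (:carrier, :member_id, :plan) "
            "ON CONFLICT (carrier, member_id) DO UPDATE SET plan = excluded.plan",
            record,
        )


record = {"carrier": "acme", "member_id": "m-001", "plan": "gold"}
upsert_member(conn, record)
upsert_member(conn, record)  # a retry or rerun: same row, no duplicate
row_count = conn.execute("SELECT COUNT(*) FROM members").fetchone()[0]
```

A blind `INSERT` in the same position would either fail on the second call or, without the primary key, quietly create a second row.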

3. Retries with backoff, and a limit

Transient failures are not the exception in distributed systems; they are the weather. A provider times out, a database has a brief blip, a network hiccups. A script treats the first transient failure as fatal. A pipeline expects them and retries.

The details matter. Retry with exponential backoff so you do not hammer a struggling dependency into the ground. Add jitter so a thousand workers do not all retry in lockstep and create a thundering herd. And crucially, cap the retries: an item that has failed ten times is not transient; it is broken, and it needs to stop retrying and go somewhere a human will see it. Which brings us to the next item.
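
The three details above (exponential backoff, jitter, a hard cap) fit in a few lines. This is a sketch under assumptions: `TransientError` is a hypothetical marker exception, the delay uses the "full jitter" variant (uniform between zero and the exponential ceiling), and the default limits are placeholders rather than recommendations.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, brief outages)."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_attempts=5, sleep=time.sleep, base=0.5, cap=30.0):
    """Call fn, retrying transient failures up to max_attempts times in total."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: the caller should quarantine the item
            sleep(backoff_delay(attempt, base=base, cap=cap))
```

Only `TransientError` is retried. Anything else propagates immediately, on the assumption that a malformed record will not fix itself by waiting.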

4. A quarantine path, not a halt

This is the single biggest behavioral difference between a script and a pipeline. When a script hits a bad record, the whole run stops. One malformed row out of fifty thousand, and nothing gets processed until someone intervenes.

A pipeline isolates the failure. The bad item goes to a quarantine area, often called a dead-letter queue, with enough context to debug it later, and every other item keeps flowing. The blast radius of one bad record is one record. This is what lets the pipeline maintain throughput while still surfacing problems, rather than forcing a choice between “stop everything” and “silently skip.” You get neither halt nor silent loss: you get a visible pile of failures you can triage on your own schedule.
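
A minimal sketch of that isolation, with an in-memory list standing in for the dead-letter queue. The function name `process_batch` and the fields stored with each quarantined item are illustrative choices.

```python
import traceback


def process_batch(records, handler):
    """Apply handler to each record; a failure quarantines that record only."""
    done, quarantined = [], []
    for index, record in enumerate(records):
        try:
            done.append(handler(record))
        except Exception as exc:  # broad on purpose: one bad record must not stop the run
            quarantined.append({
                "index": index,
                "record": record,  # keep the original input for later debugging
                "error": f"{type(exc).__name__}: {exc}",
                "trace": traceback.format_exc(),
            })
    return done, quarantined


# Hypothetical handler and data: the middle row is malformed.
rows = [{"weight_kg": "1.5"}, {"weight_kg": "abc"}, {"weight_kg": "2"}]
done, quarantined = process_batch(rows, lambda r: float(r["weight_kg"]))
```

Here two rows are processed and one lands in `quarantined` with its index, the original record, and the error, which is the "enough context to debug it later" part.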

5. Validation at the boundary

Every upstream you depend on has its own private definition of “valid input,” and real data is worse than you expect: missing fields, extra fields, wrong types, and the dangerous category, values that are technically well-formed but semantically nonsense (a delivery timestamp from 1970, a negative weight, a country code that does not exist).

Validate at ingestion, before the data is deep inside your processing logic. Type and schema checks are the easy half. The hard, valuable half is semantic validation: range checks, referential checks, sanity checks that encode what the data is actually supposed to mean. Data caught at the boundary is a quarantined item with a clear reason. The same data caught three stages later is a corrupted output and a confusing bug. This is the practical face of “garbage in, garbage out,” and it is cheaper to enforce at the door than to chase downstream.
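
To make the two halves concrete, here is a sketch of a boundary validator that runs schema checks first and semantic checks second, using the three examples above (a 1970 timestamp, a negative weight, an unknown country code). The field names, the reference set of countries, and the year-2000 cutoff are all assumptions for illustration.

```python
from datetime import datetime, timezone

KNOWN_COUNTRIES = {"US", "CA", "GB"}  # hypothetical reference data
EARLIEST_PLAUSIBLE = datetime(2000, 1, 1, tzinfo=timezone.utc)  # assumed cutoff

SCHEMA = (("delivered_at", str), ("weight_kg", (int, float)), ("country", str))


def validate(record):
    """Return the reasons a record is invalid; an empty list means it passes."""
    problems = []
    # Schema and type checks: the easy half.
    for field, kind in SCHEMA:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], kind):
            problems.append(f"wrong type for {field}")
    if problems:
        return problems

    # Semantic checks: well-formed values that cannot be true.
    try:
        delivered = datetime.fromisoformat(record["delivered_at"])
    except ValueError:
        return ["delivered_at is not an ISO 8601 timestamp"]
    if delivered.tzinfo is None:
        delivered = delivered.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    if delivered < EARLIEST_PLAUSIBLE:
        problems.append("delivered_at is implausibly old")
    if record["weight_kg"] <= 0:
        problems.append("weight_kg must be positive")
    if record["country"] not in KNOWN_COUNTRIES:
        problems.append("unknown country code")
    return problems
```

Returning reasons rather than a boolean is the design choice that matters: the list becomes the "clear reason" attached to the quarantined item.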

6. An audit trail

When someone asks “why does this record look wrong?” three weeks later, you need to be able to answer it. That means the pipeline records what it did: which input produced which output, when it ran, what version of the code and config processed it, and what decisions it made along the way.

This is not the same as logging. Logs are for debugging the system; an audit trail is for explaining a specific result after the fact. In regulated or financial contexts it is mandatory, but it earns its keep everywhere, because the question “what happened to this particular item?” is one you will be asked constantly, and “let me read the code and guess” is not an answer you want to give.
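
One possible shape for an audit entry, sketched under assumptions: inputs and outputs are JSON-serializable, content hashes are enough to tie an output back to its input, and `code_version` is whatever identifier your deploys already carry (a commit hash, for example). The function and field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(value):
    """Stable SHA-256 of a JSON-serializable value, independent of key order."""
    blob = json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def audit_entry(item_id, source, output, code_version, decisions):
    """Describe one processed item: what went in, what came out, and why."""
    return {
        "item_id": item_id,
        "input_sha256": fingerprint(source),
        "output_sha256": fingerprint(output),
        "code_version": code_version,
        "decisions": list(decisions),  # e.g. "defaulted missing plan to basic"
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
```

Entries like this are meant to be appended to durable storage and looked up by `item_id`, which is what separates them from log lines that scroll away.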

7. Observability you can act on

A script’s health is “the engineer who ran it is still watching the terminal.” A pipeline runs unattended, so it has to expose its own health: queue depth, processing rate, error rate, end-to-end latency, and the size of the quarantine pile. These are the vital signs, and they need to be on a dashboard the on-call engineer actually looks at.

The bar is not “we have logs.” The bar is: when the pipeline is backing up or quietly dropping things, does a signal reach a human before a customer does? If queue depth quietly climbing for six hours produces no alert, the pipeline is automated but not observable, and the first you will hear of it is the downstream complaint. I wrote more about why this telemetry is itself a hard data problem in a separate post on network telemetry; the same discipline applies to any pipeline watching itself.
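
The "queue depth quietly climbing for six hours" case can be turned into a simple alert rule. The sketch below assumes one depth sample per hour and flags a backlog that rose on every one of the last six samples. The thresholds are placeholders, and a real setup would normally express this in the monitoring system's own rule language rather than in application code.

```python
def backlog_is_growing(depth_samples, min_samples=6, min_growth=100):
    """True if queue depth rose on each of the last min_samples readings
    and grew by at least min_growth items over that window."""
    recent = depth_samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to judge
    rising = all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return rising and (recent[-1] - recent[0]) >= min_growth


# Hypothetical hourly readings of queue depth.
steady = [40, 55, 38, 60, 45, 52]
climbing = [40, 90, 150, 230, 310, 420]
alerts = [backlog_is_growing(steady), backlog_is_growing(climbing)]
```

A rule on the trend catches slow degradation that a fixed "depth above N" threshold would miss until it is already a customer problem.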

8. Safe reruns

Things go wrong, and when they do you will need to reprocess: a day’s worth of data, a single failed batch, everything since Tuesday. A pipeline makes this a routine, parameterized operation (“reprocess this window”), not a heroic one-off where an engineer hand-edits the script and prays.

Safe reruns are where items 1 through 6 pay off together. Durable state tells you what to reprocess. Idempotency makes reprocessing non-destructive. The quarantine path gives you a clean set of failures to retry. The audit trail lets you confirm the rerun did what you expected. If reruns are scary, it is usually a sign one of those foundations is missing, and the fear is the signal telling you where.
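
A sketch of what "reprocess this window" can look like as a routine call. It assumes each item carries a `day` and a stable `id`, and that output is written by key so repeating a window overwrites rather than duplicates. The names and the dictionary standing in for the output store are illustrative.

```python
from datetime import date


def reprocess_window(items, start, end, handler, output):
    """Re-run handler for items whose day falls in [start, end] (inclusive).

    Writes are keyed by item id, so running the same window twice is harmless.
    Returns the number of items reprocessed.
    """
    touched = 0
    for item in items:
        if start <= item["day"] <= end:
            output[item["id"]] = handler(item)
            touched += 1
    return touched


# Hypothetical data: three days of items, with the middle day needing a rerun.
items = [
    {"id": "a", "day": date(2026, 4, 13), "value": 1},
    {"id": "b", "day": date(2026, 4, 14), "value": 2},
    {"id": "c", "day": date(2026, 4, 15), "value": 3},
]
output = {}
first = reprocess_window(items, date(2026, 4, 14), date(2026, 4, 14),
                         lambda item: item["value"] * 10, output)
snapshot = dict(output)
second = reprocess_window(items, date(2026, 4, 14), date(2026, 4, 14),
                          lambda item: item["value"] * 10, output)
```

The second call touches the same single item and leaves `output` identical to `snapshot`, which is the property that takes the fear out of reruns.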

9. The database is part of the pipeline

A point that is easy to miss: a pipeline that was written at single-engineer speed often has a database that is also at single-engineer speed. The moment you parallelize the work across many workers, those queries get hit far harder, and the bottleneck just moves from the script to the database.

So pipeline design includes data-access design: the right indexes for the queries the workers actually run, denormalizing the few hot paths that need it, caching reads that do not change, and being honest about which queries will run a million times a day. Otherwise the shiny new worker pool simply queues up behind a slow query, and you have moved the problem without solving it.
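
One way to be honest about a hot query is to ask the database how it plans to run it. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` before and after adding an index. The `members` table, the query, and the index name are made up for the example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, carrier TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO members (carrier, status) VALUES (?, ?)",
    [(f"carrier-{i % 20}", "active" if i % 3 else "lapsed") for i in range(5000)],
)

# Hypothetical query that every worker runs for every file it processes.
QUERY = "SELECT id FROM members WHERE carrier = ? AND status = ?"


def query_plan(conn):
    """Return SQLite's plan for QUERY as one human-readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("carrier-7", "active")).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the plan detail


before = query_plan(conn)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_members_carrier_status ON members (carrier, status)")
after = query_plan(conn)   # expected to mention the new index
print(before)
print(after)
```

The same habit carries over to other databases through their own `EXPLAIN` variants. The point is to check the plan for the queries the workers actually run, not the ones you assume they run.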

The architecture is not the hard part

Here is the dirty secret of this kind of work: the architecture diagram is the easy part. The hard parts are:

1. Migrating without dropping data. You cannot just “switch over.” You need a period of running both systems in parallel, comparing outputs, validating that the new pipeline produces the same results as the old (modulo the bugs you are fixing). This usually takes longer than building the new pipeline itself.
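
For the parallel-run period, the comparison itself can be a small, boring function. This sketch assumes both systems produce records that can be keyed by a stable string id, and that you keep an explicit set of keys where a difference is expected because the new system fixes a known bug. The names are illustrative.

```python
def compare_outputs(old, new, known_fixes=frozenset()):
    """Diff two {key: record} outputs produced from the same input."""
    report = {"missing_in_new": [], "extra_in_new": [], "mismatched": []}
    for key in sorted(old.keys() | new.keys()):
        if key not in new:
            report["missing_in_new"].append(key)
        elif key not in old:
            report["extra_in_new"].append(key)
        elif old[key] != new[key] and key not in known_fixes:
            report["mismatched"].append(key)
    return report


# Hypothetical outputs from one day of running both systems on the same files.
old_out = {"m-1": {"plan": "gold"}, "m-2": {"plan": "GOLD "}, "m-3": {"plan": "basic"}}
new_out = {"m-1": {"plan": "gold"}, "m-2": {"plan": "gold"}, "m-4": {"plan": "basic"}}
report = compare_outputs(old_out, new_out, known_fixes={"m-2"})
```

Keeping `known_fixes` explicit forces every intentional difference to be written down, so an empty `mismatched` list means something.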

2. Designing schema validation that catches real bad data. Every upstream provider you depend on has its own definition of “valid input.” Real data has missing fields, extra fields, fields with wrong types, and fields with values that are technically valid but semantically nonsensical. Schema validation needs to catch that last category, which is much harder than catching type mismatches.

3. Retrofitting database performance. A pipeline that was running at single-engineer speed often has database queries that are also at single-engineer speed. When you parallelize the pipeline, those queries hit the database harder. You usually need to add indexes, denormalize a few tables, and cache hot reads. In the case I worked on, applying targeted indexing strategies to the underlying database improved filtering query performance by 40x — separately from the pipeline rewrite. Without that, the worker pool would have just shifted the bottleneck downstream.

4. Data quality work that exposes existing bugs. The first time the new system runs, you will find data quality issues that were silently being papered over by the manual operator. This is good — those bugs were always there — but it means your “switch over” timeline includes fixing problems that have nothing to do with the pipeline rewrite itself.

5. Operational handoff. The on-call engineer needs to know how to debug the new system. The runbook needs to exist. The dashboards need to be intuitive. None of this is glamorous; all of it is required for the system to actually run unattended.

But the checklist is what you are migrating toward, and it is worth being explicit about it, because every item is a place where a pipeline silently degrades back into a script when no one is watching. Retries get removed during an incident and never restored. Validation gets loosened to push a deadline. The dashboard breaks and nobody notices because nothing is on fire yet.

A production pipeline is not a thing you build once. It is a set of properties you keep choosing to maintain, and the checklist is how you keep score. The same instinct shows up in the open-source pipeline tooling I work on around Floe: the interesting problems are almost never the transformation logic. They are durability, idempotency, and safe reprocessing, the unglamorous properties that decide whether the thing can be trusted to run without you.