Designing fairness-aware performance metrics for gig-economy workforces

2026-05-02

Gig-economy platforms — Uber, Lyft, DoorDash, Instacart, Amazon DSP, and many smaller ones — share a deep operational problem: they need to evaluate the performance of large numbers of independent workers using objective data, but the data is full of factors the worker cannot control.

A delivery is late. Was it the driver’s fault? Or was it the weather? Or was traffic gridlocked? Or did the dispatch system route them through an impossible path? Or was the address itself wrong because the customer mistyped it?

If your performance metric counts every late delivery against the driver, you punish drivers for things they did not cause. Trust evaporates. Drivers leave the platform. The data signal becomes meaningless because everyone learns to game whatever they can control.

If your metric does not count any of these events, you give up the ability to enforce baseline professional behavior. People who genuinely fail to do their jobs are indistinguishable from people who are blameless.

The interesting engineering problem in the middle is: how do you build a performance metric that holds workers accountable for what they can actually control, and only that?

I spent two years at Amazon working on exactly this problem in the context of last-mile delivery, and the conclusions generalize far beyond any single company. Here is the mental model I now use when thinking about fairness in performance metrics.

Step 1: Make the controllable / uncontrollable distinction explicit

The first move sounds obvious but is often skipped: write down every category of “defect” or “negative event” your platform can record, and explicitly label each one as driver-controllable or not.

For a delivery defect, the categories might look like:

Defect type	Controllable?	Reasoning
Driver did not attempt delivery during window	Yes	Driver chose not to attempt
Driver delivered to wrong address	Yes	Driver did not verify the address
Driver did not follow delivery instructions	Yes	Driver chose not to follow
Address not found because customer entered wrong address	No	Customer / data error
Severe weather prevented safe delivery	No	Environmental
Traffic incident on route	No	Environmental
Dispatch system error caused incorrect routing	No	Platform error
Customer not available and refused alternate delivery	No	Customer behavior

This table is the foundation of everything else. Without it, every conversation about fairness becomes opinion. With it, the discussion shifts from “is this fair” to “is each row labeled correctly,” which is a tractable, verifiable engineering question.
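As a minimal sketch, the table can be encoded directly in code so that every defect type must carry an explicit label. The enum names and the `controllable_defect_count` helper are illustrative choices of mine, not part of any production system:

```python
from enum import Enum
from typing import Iterable


class DefectType(Enum):
    NO_ATTEMPT_IN_WINDOW = "driver did not attempt delivery during window"
    DELIVERED_TO_WRONG_ADDRESS = "driver delivered to wrong address"
    INSTRUCTIONS_NOT_FOLLOWED = "driver did not follow delivery instructions"
    CUSTOMER_ADDRESS_ERROR = "customer entered wrong address"
    SEVERE_WEATHER = "severe weather prevented safe delivery"
    TRAFFIC_INCIDENT = "traffic incident on route"
    DISPATCH_ROUTING_ERROR = "dispatch system error caused incorrect routing"
    CUSTOMER_UNAVAILABLE = "customer not available and refused alternate delivery"


# One row per defect type, mirroring the table: True means driver-controllable.
CONTROLLABLE = {
    DefectType.NO_ATTEMPT_IN_WINDOW: True,
    DefectType.DELIVERED_TO_WRONG_ADDRESS: True,
    DefectType.INSTRUCTIONS_NOT_FOLLOWED: True,
    DefectType.CUSTOMER_ADDRESS_ERROR: False,
    DefectType.SEVERE_WEATHER: False,
    DefectType.TRAFFIC_INCIDENT: False,
    DefectType.DISPATCH_ROUTING_ERROR: False,
    DefectType.CUSTOMER_UNAVAILABLE: False,
}

# Fail fast if someone adds a defect type without labeling it.
assert set(CONTROLLABLE) == set(DefectType), "every defect type needs a label"


def controllable_defect_count(defects: Iterable[DefectType]) -> int:
    """Count only the defects the driver could have prevented."""
    return sum(1 for d in defects if CONTROLLABLE[d])
```

The import-time assertion is the point: a new defect type cannot ship without someone deciding which column it belongs in.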

Step 2: Attribute causality to events using multi-source data

Once you have the categories, the harder problem is: for any given delivery defect, what category was it actually in?

This is fundamentally a causality attribution problem. It requires combining data sources that were never designed to be joined with one another:

The delivery record itself (timestamps, GPS trace, status updates)
Weather data (was there severe weather at the delivery location at the relevant time?)
Traffic data (was there a known incident on the planned route?)
System logs (did our routing system output something nonsensical?)
Address validation data (does the address geocode to a real location?)
Customer behavior signals (did the customer respond to delivery attempts?)

Each individual data source is messy, has gaps, and arrives with its own latency. Joining them at the per-event level is an industrial-scale data engineering problem before it is a fairness problem. At large-platform scale, that can mean processing tens or hundreds of terabytes a day, joining heterogeneous sources, and producing one classification per delivery event.
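To make the shape of the per-event output concrete, here is a hedged, rule-based sketch of the attribution step once the sources have already been joined. The `DeliverySignals` fields, the category strings, and the rule order are all assumptions for illustration. A real system would have many more signals and far messier rules:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeliverySignals:
    """Joined per-event signals. None means that source had no usable data."""
    severe_weather: Optional[bool]
    traffic_incident_on_route: Optional[bool]
    routing_error_logged: Optional[bool]
    address_geocodes: Optional[bool]
    customer_responded: Optional[bool]
    driver_attempted: Optional[bool]


def attribute_cause(s: DeliverySignals) -> str:
    """Return one cause category for a defect event (hypothetical rule order)."""
    # Positive evidence of a non-driver cause wins first.
    if s.routing_error_logged is True:
        return "platform_error"
    if s.address_geocodes is False:
        return "customer_data_error"
    if s.severe_weather is True or s.traffic_incident_on_route is True:
        return "environmental"
    if s.driver_attempted is True and s.customer_responded is False:
        return "customer_behavior"
    # Blame the driver only when every exculpatory source was present and clear.
    if s.driver_attempted is False:
        exculpatory = (
            s.severe_weather,
            s.traffic_incident_on_route,
            s.routing_error_logged,
            s.address_geocodes,
        )
        if all(value is not None for value in exculpatory):
            return "driver_controllable"
    return "ambiguous"
```

Note that a missing source (`None`) never counts as evidence against the driver. It pushes the event into `ambiguous` instead.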

This is why fairness is, at its core, a data infrastructure problem. The conceptual classification logic is something a small team can write down. The infrastructure to compute it correctly at scale, day in and day out, with all the edge cases, is where the real work lies.

Step 3: Decide your default

A subtle but important question: when the data is ambiguous, do you default to “controllable” or “uncontrollable”?

There are good arguments both ways. Defaulting to “uncontrollable” means you let some real driver-fault defects slip through, which can erode platform integrity. Defaulting to “controllable” means you punish workers for ambiguous events, which is exactly the unfairness you are trying to avoid.

My view: default to uncontrollable when data is ambiguous, and invest engineering effort over time in reducing ambiguity. A metric that occasionally lets a real defect slip through but never punishes blameless workers is more trusted by the workforce than the inverse. Trust compounds. Erosion of trust compounds faster.
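One way to express this default in code, assuming the attribution step produces a probability that the defect was controllable: count the defect against the worker only above a high confidence bar, and treat missing evidence as uncontrollable. The function name and the 0.9 threshold are illustrative assumptions, not recommended values:

```python
from typing import Optional


def final_label(p_controllable: Optional[float], threshold: float = 0.9) -> str:
    """Map an attribution probability to a label, with ambiguity favoring the worker.

    p_controllable is None when there was not enough data to score the event.
    """
    if p_controllable is None:
        return "uncontrollable"
    if not 0.0 <= p_controllable <= 1.0:
        raise ValueError("p_controllable must be between 0 and 1")
    return "controllable" if p_controllable >= threshold else "uncontrollable"
```

A threshold like this reduces, but cannot eliminate, wrongly attributed defects. The appeals and iteration loops described in the later steps are what catch the rest.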

Step 4: Make the categorization auditable

Workers need to know why a specific defect counted against them, or why it did not. This is partly a fairness requirement and partly a debugging requirement: when the classification is wrong, you need to be able to fix it.

In practice this means logging, for every defect event:

The classification result
Which signals contributed to that classification
The version of the classification logic that was applied

A worker who appeals a defect should be able to see “this defect was classified as controllable because GPS showed you stopped 200 meters from the delivery address for 30 seconds and then drove away without attempting delivery.” That is a debuggable claim. “Our model says it was your fault” is not.
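A minimal sketch of such an audit record, with the three logged fields from the list above and a helper that turns them into a human-readable explanation. The class and field names are hypothetical:

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class Signal:
    source: str       # e.g. "gps", "weather", "routing_log"
    observation: str  # what that source showed for this event


@dataclass(frozen=True)
class AuditRecord:
    defect_id: str
    classification: str         # the classification result
    logic_version: str          # version of the classification logic applied
    signals: Tuple[Signal, ...]  # which signals contributed


def explain(record: AuditRecord) -> str:
    """Render the record as a claim a worker can inspect and dispute."""
    reasons = "; ".join(f"{s.source}: {s.observation}" for s in record.signals)
    return (
        f"Defect {record.defect_id} was classified as {record.classification} "
        f"(logic {record.logic_version}) because {reasons}."
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize the record as one JSON line for an append-only audit log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping `logic_version` on every record is what later makes it possible to ask "would this defect still count under the current rules?"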

Step 5: Iterate the classification logic continuously

You will not get the categories right on day one. New defect types appear. Edge cases accumulate. The data gets better as upstream systems improve. Workers find legitimate gaps in your logic and surface them through appeals.

The classification rules need to be a living artifact. The platform needs to invest in:

A feedback loop where appeal outcomes feed back into the classification logic
Regular review by a cross-functional team (engineering, operations, workforce ops, sometimes legal)
Versioning so that you can replay old events with the new logic to assess impact
Statistical monitoring to detect when classification distributions drift unexpectedly
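The replay idea can be sketched in a few lines: run the old and new logic over the same historical events and count how labels move. The two classifier versions and the event fields below are toy assumptions chosen only to show the mechanism:

```python
from collections import Counter
from typing import Callable, Dict, Iterable

Event = Dict[str, bool]
Classifier = Callable[[Event], str]


def classify_v1(event: Event) -> str:
    # Hypothetical old logic: every late delivery counts against the driver.
    return "controllable" if event["late"] else "none"


def classify_v2(event: Event) -> str:
    # Hypothetical new logic: late deliveries in severe weather are excused.
    if not event["late"]:
        return "none"
    return "uncontrollable" if event["severe_weather"] else "controllable"


def replay_diff(events: Iterable[Event], old: Classifier, new: Classifier) -> Counter:
    """Count (old_label, new_label) pairs over historical events."""
    return Counter((old(e), new(e)) for e in events)
```

The off-diagonal entries, such as `("controllable", "uncontrollable")`, are the events whose outcome would change, which is the number a review team needs before shipping a new version.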

This is not glamorous AI work. It is metric engineering. But it is what makes the difference between a platform whose workforce trusts the system and one where they do not.
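For the drift monitoring item in the list above, a simple starting point is to compare the share of each label in the current period against a baseline period. This sketch uses total variation distance, which is one reasonable choice among several. The function names and the 0.1 tolerance are assumptions for illustration:

```python
from collections import Counter
from typing import Dict, Iterable


def label_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of events carrying each classification label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels to summarize")
    return {label: n / total for label, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two label distributions (0 to 1)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


def has_drifted(baseline: Iterable[str], current: Iterable[str],
                tolerance: float = 0.1) -> bool:
    """Flag the current period when its label mix moves past the tolerance."""
    return total_variation(label_shares(baseline), label_shares(current)) > tolerance
```

A flag here does not say the classification is wrong, only that something changed (upstream data, the logic, or the real world) and a human should look.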

Step 6: Build the coaching dashboard

The classification system tells you that a driver had N controllable defects last week. Knowing the number is not the same as being able to act on it. To actually improve performance, operations needs a per-defect coaching dashboard — a view a manager can sit down and walk through with the driver, one event at a time.

For each defect, the dashboard should show:

Where the delivery was going. The address, the customer-provided delivery instructions, and any prior notes for that address.
Where the driver actually went. The planned route, the actual GPS trace, and how the actual stop locations compare to the delivery address.
What the driver did at the stop. Time on site, whether they attempted delivery, whether they followed the instructions, photos or scans captured, customer responsiveness signals.
Why the defect was raised in the first place. Which signals triggered the classification, what threshold was crossed, and the version of the logic applied.

This overlaps with the auditability data in Step 4, but the audience is different: Step 4 is the worker’s view of their own defects, used for appeals. Step 6 is the operations view, used to have a productive conversation. A driver who sees “your GPS shows you stopped two blocks from the address for 30 seconds and never approached the door” can talk about what happened. A driver who hears “you got dinged” cannot.
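As a sketch of how one dashboard row might derive "how far from the address, and for how long" from a GPS stop, here is a great-circle (haversine) distance plus a time-on-site calculation. The `Stop` fields and the `stop_summary` output keys are hypothetical:

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass
class Stop:
    lat: float
    lon: float
    arrived_s: int   # arrival time, seconds since epoch
    departed_s: int  # departure time, seconds since epoch


def stop_summary(stop: Stop, address_lat: float, address_lon: float) -> dict:
    """Facts a manager and driver can look at together for one stop."""
    return {
        "distance_to_address_m": round(
            haversine_m(stop.lat, stop.lon, address_lat, address_lon)
        ),
        "time_on_site_s": stop.departed_s - stop.arrived_s,
    }
```

Raw GPS is noisy, so in practice the distance would need a tolerance that accounts for positioning error before it is shown as evidence.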

Without this layer, the fairness work in Steps 1–5 is academic. The metric is fair, but the workforce experience stays opaque, and there is no mechanism for the platform to actually help workers improve.

Why this generalizes beyond delivery

The same problem shape applies to any platform that uses data to evaluate humans:

Rideshare (Uber, Lyft): driver ratings, cancellations, pickup times — all confounded by traffic, weather, customer behavior, surge events
Food delivery (DoorDash, Instacart): order accuracy, delivery time — confounded by restaurant prep time, customer instructions
Customer service (call centers, gig support): handle time, resolution rate — confounded by ticket complexity, system outages
Sales platforms: deal close rate, response time — confounded by lead quality, market conditions
Online education: student outcomes — confounded by socioeconomic factors and prior knowledge

In each of these, the same anti-pattern shows up: a “simple” metric that measures something measurable, without controlling for the factors outside the worker’s control. And the same fix applies: explicit controllable / uncontrollable categorization, multi-source causality attribution, ambiguity-favors-the-worker defaults, auditability, iteration, and coaching surfaces that translate the metric into per-event conversations.

Closing thought

There is a common framing that “AI” or “ML models” will solve fairness problems in workforce evaluation. I disagree. Models can help with the causality attribution step — for example, by combining noisy signals to produce a probabilistic classification. But the core engineering problem is not modeling. It is the data infrastructure to feed the classification with timely, multi-source, ground-truth data; the rules to translate raw classifications into accountable decisions; and the auditability and iteration that maintain trust with the workforce over time.

Fairness in performance metrics is, fundamentally, an infrastructure and process problem. It is not a model problem. The platforms that get this right will have lower workforce churn, higher service quality, and a more defensible position when scrutinized by regulators or workers’ advocates. The ones that do not will continue to have unhappy workforces and unstable metrics, regardless of how sophisticated their downstream AI is.