Designing fairness-aware performance metrics for gig-economy workforces
Gig-economy platforms — Uber, Lyft, DoorDash, Instacart, Amazon DSP, and many smaller ones — share a deep operational problem: they need to evaluate the performance of large numbers of independent workers using objective data, but the data is full of factors the worker cannot control.
A delivery is late. Was it the driver’s fault? Or was it the weather? Or was traffic gridlocked? Or did the dispatch system route them through an impossible path? Or was the address itself wrong because the customer mistyped it?
If your performance metric counts every late delivery against the driver, you punish drivers for things they did not cause. Trust evaporates. Drivers leave the platform. The data signal becomes meaningless because everyone learns to game whatever they can control.
If your metric does not count any of these events, you give up the ability to enforce baseline professional behavior. People who genuinely fail to do their jobs are indistinguishable from people who are blameless.
The interesting engineering problem in the middle is: how do you build a performance metric that holds workers accountable for what they can actually control, and only that?
I spent two years at Amazon working on exactly this problem in the context of last-mile delivery, and the conclusions generalize far beyond any single company. Here is the mental model I now use when thinking about fairness in performance metrics.
Step 1: Make the controllable / uncontrollable distinction explicit
The first move sounds obvious but is often skipped: write down every category of “defect” or “negative event” your platform can record, and explicitly label each one as driver-controllable or not.
For a delivery defect, the categories might look like:
| Defect type | Controllable? | Reasoning |
|---|---|---|
| Driver did not attempt delivery during window | Yes | Driver chose not to attempt |
| Driver delivered to wrong address | Yes | Driver did not verify the address |
| Driver did not follow delivery instructions | Yes | Driver chose not to follow |
| Address not found because customer entered wrong address | No | Customer / data error |
| Severe weather prevented safe delivery | No | Environmental |
| Traffic incident on route | No | Environmental |
| Dispatch system error caused incorrect routing | No | Platform error |
| Customer not available and refused alternate delivery | No | Customer behavior |
This table is the foundation of everything else. Without it, every conversation about fairness becomes opinion. With it, the discussion shifts from “is this fair” to “is each row labeled correctly,” which is a tractable, verifiable engineering question.
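One way to keep the table honest is to store it as data rather than as a slide. The sketch below is a minimal illustration, not a description of any real platform's schema: the short codes such as `NO_ATTEMPT` and the names `DefectCategory`, `TAXONOMY`, and `is_controllable` are invented here. The rows and their labels come from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Controllability(Enum):
    CONTROLLABLE = "controllable"
    UNCONTROLLABLE = "uncontrollable"


@dataclass(frozen=True)
class DefectCategory:
    code: str  # hypothetical short identifier, invented for this sketch
    description: str
    controllability: Controllability
    reasoning: str


YES = Controllability.CONTROLLABLE
NO = Controllability.UNCONTROLLABLE

# One entry per row of the table above.
_ROWS = [
    DefectCategory("NO_ATTEMPT", "Driver did not attempt delivery during window", YES, "Driver chose not to attempt"),
    DefectCategory("WRONG_ADDRESS", "Driver delivered to wrong address", YES, "Driver did not check"),
    DefectCategory("IGNORED_INSTRUCTIONS", "Driver did not follow delivery instructions", YES, "Driver chose not to follow"),
    DefectCategory("BAD_ADDRESS", "Address not found because customer entered wrong address", NO, "Customer / data error"),
    DefectCategory("SEVERE_WEATHER", "Severe weather prevented safe delivery", NO, "Environmental"),
    DefectCategory("TRAFFIC_INCIDENT", "Traffic incident on route", NO, "Environmental"),
    DefectCategory("DISPATCH_ERROR", "Dispatch system error caused incorrect routing", NO, "Platform error"),
    DefectCategory("CUSTOMER_UNAVAILABLE", "Customer not available and refused alternate delivery", NO, "Customer behavior"),
]

TAXONOMY = {row.code: row for row in _ROWS}


def is_controllable(code: str) -> bool:
    """Look up a defect code; an unknown code raises KeyError instead of guessing."""
    return TAXONOMY[code].controllability is Controllability.CONTROLLABLE
```

Because every row carries its reasoning, a review of "is each row labeled correctly" becomes a review of this one structure.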
Step 2: Attribute causality to events using multi-source data
Once you have the categories, the harder problem is: for any given delivery defect, what category was it actually in?
This is fundamentally a causality attribution problem. It requires combining data sources that were never designed to be joined with one another:
- The delivery record itself (timestamps, GPS trace, status updates)
- Weather data (was there severe weather at the delivery location at the relevant time?)
- Traffic data (was there a known incident on the planned route?)
- System logs (did our routing system output something nonsensical?)
- Address validation data (does the address geocode to a real location?)
- Customer behavior signals (did the customer respond to delivery attempts?)
Each individual data source is messy, has gaps, and arrives with its own latency. Joining them at the per-event level is an industrial-scale data engineering problem before it is a fairness problem. At a large platform this can mean processing tens or hundreds of terabytes a day, joining heterogeneous sources, and producing one classification per delivery event.
This is why fairness is, at its core, a data infrastructure problem. The conceptual classification logic is something a small team can write down. The infrastructure to compute it correctly at scale, day in and day out, with all the edge cases, is where the real work lies.
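To make the per-event join concrete, here is a toy version of it in memory. Everything in it is an assumption made for illustration: the field names on `DefectSignals`, the cause codes, the idea that each source is a dictionary keyed by delivery ID, and above all the order in which causes are checked. A production system would do this join in a distributed pipeline, not a Python loop.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DefectSignals:
    """One joined row per defect event. None means the source had no data."""
    severe_weather: Optional[bool] = None
    traffic_incident_on_route: Optional[bool] = None
    routing_error_logged: Optional[bool] = None
    address_geocodes: Optional[bool] = None
    customer_responded: Optional[bool] = None
    delivery_attempted: Optional[bool] = None


def join_signals(delivery_id: str, *sources: Dict[str, dict]) -> DefectSignals:
    """Merge whatever each source knows about one delivery into a single row."""
    merged: dict = {}
    for source in sources:
        merged.update(source.get(delivery_id, {}))
    return DefectSignals(**merged)


def attribute_cause(s: DefectSignals) -> str:
    """Return a hypothetical cause code.

    Platform, data, and environmental causes are checked before driver
    causes. That ordering is an assumption of this sketch.
    """
    if s.routing_error_logged:
        return "DISPATCH_ERROR"
    if s.address_geocodes is False:
        return "BAD_ADDRESS"
    if s.severe_weather:
        return "SEVERE_WEATHER"
    if s.traffic_incident_on_route:
        return "TRAFFIC_INCIDENT"
    if s.delivery_attempted and s.customer_responded is False:
        return "CUSTOMER_UNAVAILABLE"
    if s.delivery_attempted is False:
        return "NO_ATTEMPT"
    return "UNKNOWN"


# Usage: two sources, keyed by delivery ID.
weather = {"d1": {"severe_weather": True}}
delivery = {"d1": {"delivery_attempted": False}, "d2": {"delivery_attempted": False}}
cause_d1 = attribute_cause(join_signals("d1", weather, delivery))
cause_d2 = attribute_cause(join_signals("d2", weather, delivery))
```

Note that `d2` is attributed to the driver even though the weather source has no record for it. Whether a missing record should count as "no severe weather" or as "unknown" is a policy decision, not a coding detail.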
Step 3: Decide your default
A subtle but important question: when the data is ambiguous, do you default to “controllable” or “uncontrollable”?
There are good arguments both ways. Defaulting to “uncontrollable” means you let some real driver-fault defects slip through, which can erode platform integrity. Defaulting to “controllable” means you punish workers for ambiguous events, which is exactly the unfairness you are trying to avoid.
My view: default to uncontrollable when data is ambiguous, and invest engineering effort over time in reducing ambiguity. A metric that occasionally lets a real defect slip through but rarely punishes a blameless worker is more trusted by the workforce than the inverse. Trust compounds. Erosion of trust compounds faster.
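In code, this default can be as small as a threshold plus a counter. The sketch below assumes some upstream step produces a score between 0 and 1 for "this defect was driver-controllable", or `None` when it cannot score the event at all. The 0.9 and 0.1 cut-offs and both function names are illustrative assumptions, not recommended values.

```python
from typing import Optional, Sequence


def counts_against_driver(p_controllable: Optional[float], threshold: float = 0.9) -> bool:
    """Charge the defect to the driver only when the evidence is strong.

    A missing score or a score below the threshold falls through to the
    worker-favoring default.
    """
    if p_controllable is None:
        return False
    return p_controllable >= threshold


def ambiguity_rate(
    scores: Sequence[Optional[float]], threshold: float = 0.9, floor: float = 0.1
) -> float:
    """Share of events that are neither clearly controllable nor clearly not.

    This is the number to drive down over time with better data.
    """
    if not scores:
        return 0.0
    ambiguous = [p for p in scores if p is None or floor < p < threshold]
    return len(ambiguous) / len(scores)
```

Tracking the ambiguity rate alongside the metric keeps the cost of the default visible: every ambiguous event is one the platform chose not to charge, and one it could classify properly with better signals.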
Step 4: Make the categorization auditable
Workers need to know why a specific defect counted against them, or why it did not. This is partly a fairness requirement and partly a debugging requirement: when the classification is wrong, you need to be able to fix it.
In practice this means logging, for every defect event:
- The classification result
- Which signals contributed to that classification
- The version of the classification logic that was applied
A worker who appeals a defect should be able to see “this defect was classified as controllable because GPS showed you stopped 200 meters from the delivery address for 30 seconds and then drove away without attempting delivery.” That is a debuggable claim. “Our model says it was your fault” is not.
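A minimal sketch of such a log record follows. The three logged items come from the list above; the field names, the JSON log format, and the `explain` wording are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    defect_id: str
    classification: str         # the classification result
    contributing_signals: dict  # which signals contributed
    logic_version: str          # version of the classification logic applied
    classified_at: str          # UTC timestamp, ISO 8601


def make_audit_record(defect_id: str, classification: str, signals: dict, logic_version: str) -> AuditRecord:
    now = datetime.now(timezone.utc).isoformat()
    return AuditRecord(defect_id, classification, dict(signals), logic_version, now)


def to_log_line(record: AuditRecord) -> str:
    """Serialize for an append-only log; signal values must be JSON-serializable."""
    return json.dumps(asdict(record), sort_keys=True)


def explain(record: AuditRecord) -> str:
    """Render the record as a claim a worker can check and dispute."""
    reasons = "; ".join(f"{k}={v}" for k, v in sorted(record.contributing_signals.items()))
    return (
        f"Defect {record.defect_id} was classified as {record.classification} "
        f"(logic {record.logic_version}) because: {reasons}"
    )


record = make_audit_record(
    "d-1",
    "controllable",
    {"stop_distance_m": 200, "stop_duration_s": 30, "delivery_attempted": False},
    "v3",
)
message = explain(record)
```

The point of `explain` is that every clause in the message maps back to a logged signal, so an appeal can contest a specific fact rather than the verdict as a whole.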
Step 5: Iterate the classification logic continuously
You will not get the categories right on day one. New defect types appear. Edge cases accumulate. The data gets better as upstream systems improve. Workers find legitimate gaps in your logic and surface them through appeals.
The classification rules need to be a living artifact. The platform needs to invest in:
- A feedback loop where appeal outcomes feed back into the classification logic
- Regular review by a cross-functional team (engineering, operations, workforce ops, sometimes legal)
- Versioning so that you can replay old events with the new logic to assess impact
- Statistical monitoring to detect when classification distributions drift unexpectedly
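The replay and monitoring items can be sketched in a few lines. This is one possible shape, under stated assumptions: classification logic is modeled as a plain function from an event to a label, and drift is measured with total variation distance between label distributions. The 0.1 tolerance is arbitrary, and other drift measures would work as well.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def replay(events: Iterable[dict], old_logic: Callable[[dict], str], new_logic: Callable[[dict], str]) -> List[Tuple[dict, str, str]]:
    """Re-run stored events through both versions and return the ones whose label changes."""
    changed = []
    for event in events:
        before, after = old_logic(event), new_logic(event)
        if before != after:
            changed.append((event, before, after))
    return changed


def label_shares(labels: Iterable[str]) -> Dict[str, float]:
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the L1 distance between two label distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(baseline: Iterable[str], current: Iterable[str], tolerance: float = 0.1) -> bool:
    return total_variation(label_shares(baseline), label_shares(current)) > tolerance


# Usage: a rule change that widens the allowed stop distance from 100 m to 200 m.
events = [{"id": 1, "stop_m": 150}, {"id": 2, "stop_m": 400}]
v1 = lambda e: "controllable" if e["stop_m"] > 100 else "uncontrollable"
v2 = lambda e: "controllable" if e["stop_m"] > 200 else "uncontrollable"
impact = replay(events, v1, v2)
```

Replaying before rollout answers "who would this change have affected last month"; the drift check answers "did something shift that nobody intended".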
This is not glamorous AI work. It is metric engineering. But it is what makes the difference between a platform whose workforce trusts the system and one where they do not.
Step 6: Build the coaching dashboard
The classification system tells you that a driver had N controllable defects last week. Knowing the number is not the same as being able to act on it. To actually improve performance, operations needs a per-defect coaching dashboard — a view a manager can sit down and walk through with the driver, one event at a time.
For each defect, the dashboard should show:
- Where the delivery was going. The address, the customer-provided delivery instructions, and any prior notes for that address.
- Where the driver actually went. The planned route, the actual GPS trace, and how the actual stop locations compare to the delivery address.
- What the driver did at the stop. Time on site, whether they attempted delivery, whether they followed the instructions, photos or scans captured, customer responsiveness signals.
- Why the defect was raised in the first place. Which signals triggered the classification, what threshold was crossed, and the version of the logic applied.
This overlaps with the auditability data in Step 4, but the audience is different: Step 4 is the worker’s view of their own defects, used for appeals. Step 6 is the operations view, used to have a productive conversation. A driver who sees “your GPS shows you stopped two blocks from the address for 30 seconds and never approached the door” can talk about what happened. A driver who hears “you got dinged” cannot.
Without this layer, the fairness work in Steps 1 through 5 is academic. The metric is fair, but the workforce experience stays opaque, and there is no mechanism for the platform to actually help workers improve.
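As a rough illustration of what one dashboard row might be computed from, the sketch below compares the driver's recorded stops with the delivery address. The `Stop` type, the field names in the returned dictionary, and the assumption that stops have already been extracted from the GPS trace are all hypothetical. The distance uses the standard haversine formula with a mean Earth radius of 6,371 km.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


@dataclass
class Stop:
    lat: float
    lon: float
    seconds_on_site: int


def coaching_row(
    address: Tuple[float, float],
    instructions: str,
    stops: List[Stop],
    attempted: bool,
    trigger: str,
    logic_version: str,
) -> dict:
    """Assemble one per-defect row: where it was going, where the driver went, and why it was flagged."""
    row = {
        "instructions": instructions,
        "delivery_attempted": attempted,
        "trigger": trigger,
        "logic_version": logic_version,
        "closest_stop_distance_m": None,
        "seconds_at_closest_stop": None,
    }
    if stops:
        closest = min(stops, key=lambda s: haversine_m(s.lat, s.lon, *address))
        row["closest_stop_distance_m"] = round(haversine_m(closest.lat, closest.lon, *address))
        row["seconds_at_closest_stop"] = closest.seconds_on_site
    return row


row = coaching_row(
    address=(47.6001, -122.3300),
    instructions="Leave at side door",
    stops=[Stop(47.6000, -122.3300, 30), Stop(47.6100, -122.3300, 10)],
    attempted=False,
    trigger="no delivery attempt recorded",
    logic_version="v3",
)
```

A row with no stops leaves the distance fields empty rather than inventing a value, which keeps the conversation grounded in what was actually recorded.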
Why this generalizes beyond delivery
The same problem shape applies to any platform that uses data to evaluate humans:
- Rideshare (Uber, Lyft): driver ratings, cancellations, and pickup times, all confounded by traffic, weather, customer behavior, and surge events
- Food and grocery delivery (DoorDash, Instacart): order accuracy and delivery time, confounded by restaurant or store prep time and customer instructions
- Customer service (call centers, gig support): handle time and resolution rate, confounded by ticket complexity and system outages
- Sales platforms: deal close rate and response time, confounded by lead quality and market conditions
- Online education: student outcomes, confounded by socioeconomic factors and prior knowledge
In each of these, the same anti-pattern shows up: a “simple” metric that measures something measurable, without controlling for the factors outside the worker’s control. And the same fix applies: explicit controllable / uncontrollable categorization, multi-source causality attribution, ambiguity-favors-the-worker defaults, auditability, iteration, and coaching surfaces that translate the metric into per-event conversations.
Closing thought
There is a common framing that “AI” or “ML models” will solve fairness problems in workforce evaluation. I disagree. Models can help with the causality attribution step — for example, by combining noisy signals to produce a probabilistic classification. But the core engineering problem is not modeling. It is the data infrastructure to feed the classification with timely, multi-source, ground-truth data; the rules to translate raw classifications into accountable decisions; and the auditability and iteration that maintain trust with the workforce over time.
Fairness in performance metrics is, fundamentally, an infrastructure and process problem. It is not a model problem. The platforms that get this right will have lower workforce churn, higher service quality, and a more defensible position when scrutinized by regulators or workers’ advocates. The ones that do not will continue to have unhappy workforces and unstable metrics, regardless of how sophisticated their downstream AI is.