How I designed the end-to-end evaluation workflow, information architecture, and decision frameworks for 50+ users assessing AI agent performance across 1M+ data points.
An AI research team was developing an autonomous web agent capable of navigating websites, extracting information, and completing tasks on behalf of users. To improve the model, the team needed human evaluators to accurately annotate thousands of real-world agent workflows — assessing whether the AI successfully completed each task, identifying where it failed, and categorizing the type and severity of errors.
The problem: the evaluation process was complex. Annotators needed to distinguish between multiple workflow statuses (success, failure, unknown, out of scope, demo, guardrail), identify specific failure steps, classify errors across a severity hierarchy, and assign granular failure buckets — all while handling sensitive customer data under strict security protocols.
I started by mapping the full annotation workflow end-to-end, identifying every decision point a user faces. I analyzed the cognitive demands at each step — from basic pattern recognition (identifying workflow statuses) to complex judgment calls (classifying failure severity). I catalogued the types of errors being made and traced them back to gaps in the existing tool design and documentation.
I redesigned the annotation conventions around the user's actual task flow — not the system's architecture. Each section opens with context, states the objective, provides guided examples, and closes with practice scenarios. This structure supports users who need to understand the "why" before the "how."
I built a severity-based error taxonomy with explicit rules: success runs can only pair with low-impact or no-failure categories, while failure runs require high-impact classifications. Decision tables replaced dense paragraphs, making correct pairings scannable at a glance.
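The pairing rules can be sketched as a small decision table in code. This is a minimal sketch, assuming simplified status and category names (`success`, `failure`, `low_impact`, etc.) that stand in for the team's actual taxonomy:

```python
# Decision table for which failure categories each workflow status may pair with.
# Status and category labels here are illustrative, not the team's exact taxonomy.
ALLOWED_PAIRINGS = {
    "success": {"no_failure", "low_impact"},
    "failure": {"high_impact"},
}

def validate_annotation(status: str, failure_category: str) -> bool:
    """Return True if the status/failure-category pair is permitted."""
    allowed = ALLOWED_PAIRINGS.get(status)
    if allowed is None:
        # Statuses like "unknown" or "out of scope" carry no pairing
        # constraint in this sketch.
        return True
    return failure_category in allowed

validate_annotation("success", "low_impact")   # permitted pairing
validate_annotation("success", "high_impact")  # blocked pairing
```

Encoding the pairings as data rather than prose is what makes the rules scannable for humans and checkable by tooling at the same time.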
Real-world AI behavior is messy. I developed a pattern library of edge cases covering scenarios such as workflows with minor errors but successful outcomes, ambiguous prompts, cookie-popup handling, human-in-the-loop activations, and model hallucinations. Each pattern included the scenario, the correct evaluation, and the reasoning — turning abstract rules into concrete, reusable decision frameworks.
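The three-part structure of each pattern (scenario, correct evaluation, reasoning) can be represented as a simple record. The two example entries and their verdicts below are illustrative guesses, not the team's actual rulings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeCasePattern:
    """One entry in the edge-case pattern library."""
    scenario: str
    correct_evaluation: str  # e.g. "success", "failure", "unknown"
    reasoning: str

# Hypothetical entries following the scenario → evaluation → reasoning shape.
PATTERN_LIBRARY = [
    EdgeCasePattern(
        scenario="Agent dismisses a cookie popup, then completes the task",
        correct_evaluation="success",
        reasoning="A recoverable interruption that does not change the "
                  "outcome is not a failure.",
    ),
    EdgeCasePattern(
        scenario="Agent reports a confirmation detail that never appears on screen",
        correct_evaluation="failure",
        reasoning="A hallucinated detail is a failure even when the rest of "
                  "the run looks clean.",
    ),
]
```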
I designed a structured audit system with a 7-question pass/fail rubric, standardized error codes with severity levels (Critical P0, Major P1, Minor P2), and clear scoring thresholds. I evaluated effectiveness across multiple levels — from user satisfaction and task accuracy to behavioral changes and business impact on quality metrics.
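The audit mechanics can be sketched as a scoring function over the rubric. The question wording and the passing threshold below are hypothetical; the source specifies only a 7-question pass/fail rubric with P0/P1/P2 severity levels:

```python
# Hypothetical phrasing of the 7 rubric questions.
RUBRIC = [
    "Is the workflow status labeled correctly?",
    "Is the failure step (if any) identified?",
    "Is the severity level appropriate?",
    "Is the failure bucket specific enough?",
    "Does the reasoning cite evidence from the run?",
    "Were sensitive-data handling rules followed?",
    "Is the annotation internally consistent?",
]
PASS_THRESHOLD = 6  # hypothetical: allow at most one miss

# Severity levels as named in the case study.
SEVERITY_CODES = {"P0": "Critical", "P1": "Major", "P2": "Minor"}

def score_audit(answers: list[bool]) -> dict:
    """Score one audited annotation: one pass/fail answer per rubric question."""
    if len(answers) != len(RUBRIC):
        raise ValueError("one answer per rubric question is required")
    score = sum(answers)
    return {"score": score, "passed": score >= PASS_THRESHOLD}
```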
AI models evolve fast, and the tools need to keep pace. I implemented a versioning system with dated entries, clear change descriptions, and "UPDATED" / "ADDED" tags inline so users could quickly identify what changed. This reduced confusion during transitions and gave experienced users a fast path to the latest updates.
Rather than organizing by concept, I structured the system to mirror the user's actual task flow — step 1 through step 9. This reduced cognitive load and made the interface usable as a real-time reference during annotation.
For complex classification rules (like status + failure category combinations), I used scannable tables with columns for status, category, and reasoning. This made correct pairings visible at a glance instead of buried in paragraphs.
I created a "what the model can and can't do" reference — organized into CAN, CANNOT, SHOULDN'T, FORBIDDEN, and NOT YET GOOD AT. This gave users the context they needed to make fair evaluations without over- or under-penalizing the AI.
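Structurally, the reference is a lookup from capability bucket to behaviors. The five bucket names come from the case study; the example entries under each are illustrative guesses:

```python
# The five capability buckets from the case study; entries are hypothetical.
CAPABILITY_REFERENCE = {
    "CAN": ["navigate multi-page sites", "extract visible on-page text"],
    "CANNOT": ["act outside the browser"],
    "SHOULDN'T": ["submit payments without explicit confirmation"],
    "FORBIDDEN": ["bypass login or security controls"],
    "NOT YET GOOD AT": ["long multi-step form filling"],
}

def capability_bucket(action: str) -> str:
    """Return which bucket an action falls under, or 'unlisted'."""
    for bucket, actions in CAPABILITY_REFERENCE.items():
        if action in actions:
            return bucket
    return "unlisted"
```

A lookup like this lets an annotator check, before penalizing the model, whether a miss reflects a known limitation or a genuine regression.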
Instead of expecting users to read full release notes, I added "UPDATED" and "ADDED" markers directly in the relevant sections. Experienced users could scan for changes without re-reading the entire system.
- Users adopted the standardized workflow and evaluation framework
- Improved model accuracy through better annotation quality
- Reduced AI inconsistencies and biases
- Increased operational efficiency through standardized processes
This project reinforced something I believe deeply about product design: the best tools don't just tell people what to do — they help people make better decisions. Annotation work is fundamentally about decision-making under ambiguity, and the experience needed to reflect that. These users are experienced professionals who need autonomy, clarity, and immediate applicability.
If I were to iterate further, I'd build interactive scenario-based onboarding flows where users practice classifying real (anonymized) workflows before going live — with branching paths that mirror the decision complexity they face on the job. The edge case pattern library is a strong foundation for that kind of practice-driven product experience.