How I designed the end-to-end evaluation workflow, information architecture, and decision frameworks for 50+ users assessing AI agent performance across 1M+ data points.
An AI research team was developing an autonomous web agent capable of navigating websites, extracting information, and completing tasks on behalf of users. To improve the model, the team needed human evaluators to accurately annotate thousands of real-world agent workflows — assessing whether the AI successfully completed each task, identifying where it failed, and categorizing the type and severity of errors.
The problem: the evaluation process was complex. Annotators needed to distinguish between multiple workflow statuses (success, failure, unknown, out of scope, demo, guardrail), identify specific failure steps, classify errors across a severity hierarchy, and assign granular failure buckets — all while handling sensitive customer data under strict security protocols.
I started by mapping the full annotation workflow end-to-end, identifying every decision point a user faces. I analyzed the cognitive demands at each step — from basic pattern recognition (identifying workflow statuses) to complex judgment calls (classifying failure severity). I catalogued the types of errors being made and traced them back to gaps in the existing tool design and documentation.
I redesigned the annotation conventions around the user's actual task flow — not the system's architecture. Each section opens with context, states the objective, provides guided examples, and closes with practice scenarios. This structure supports users who need to understand the "why" before the "how."
I built a severity-based error taxonomy with explicit rules: success runs can only pair with low-impact or no-failure categories, while failure runs require high-impact classifications. Decision tables replaced dense paragraphs, making correct pairings scannable at a glance.
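The pairing rules can be sketched as a small decision table in code. This is a minimal sketch, assuming simplified status and category names (`success`, `failure`, `low_impact`, etc.) that stand in for the team's actual taxonomy:

```python
# Decision table for which failure categories each workflow status may pair with.
# Status and category labels here are illustrative, not the team's exact taxonomy.
ALLOWED_PAIRINGS = {
    "success": {"no_failure", "low_impact"},
    "failure": {"high_impact"},
}

def validate_annotation(status: str, failure_category: str) -> bool:
    """Return True if the status/failure-category pair is permitted."""
    allowed = ALLOWED_PAIRINGS.get(status)
    if allowed is None:
        # Statuses like "unknown" or "out of scope" carry no pairing
        # constraint in this sketch.
        return True
    return failure_category in allowed

validate_annotation("success", "low_impact")   # permitted pairing
validate_annotation("success", "high_impact")  # blocked pairing
```

Encoding the pairings as data rather than prose is what makes the rules scannable for humans and checkable by tooling at the same time.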
Real-world AI behavior is messy. I developed a pattern library of edge cases covering scenarios such as workflows with minor errors but successful outcomes, ambiguous prompts, cookie-popup handling, human-in-the-loop activations, and model hallucinations. Each pattern included the scenario, the correct evaluation, and the reasoning — turning abstract rules into concrete, reusable decision frameworks.
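The three-part structure of each pattern (scenario, correct evaluation, reasoning) can be represented as a simple record. The two example entries and their verdicts below are illustrative guesses, not the team's actual rulings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EdgeCasePattern:
    """One entry in the edge-case pattern library."""
    scenario: str
    correct_evaluation: str  # e.g. "success", "failure", "unknown"
    reasoning: str

# Hypothetical entries following the scenario → evaluation → reasoning shape.
PATTERN_LIBRARY = [
    EdgeCasePattern(
        scenario="Agent dismisses a cookie popup, then completes the task",
        correct_evaluation="success",
        reasoning="A recoverable interruption that does not change the "
                  "outcome is not a failure.",
    ),
    EdgeCasePattern(
        scenario="Agent reports a confirmation detail that never appears on screen",
        correct_evaluation="failure",
        reasoning="A hallucinated detail is a failure even when the rest of "
                  "the run looks clean.",
    ),
]
```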
I designed a structured audit system with a 7-question pass/fail rubric, standardized error codes with severity levels (Critical P0, Major P1, Minor P2), and clear scoring thresholds. I evaluated effectiveness across multiple levels — from user satisfaction and task accuracy to behavioral changes and business impact on quality metrics.
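The audit mechanics can be sketched as a scoring function over the rubric. The question wording and the passing threshold below are hypothetical; the source specifies only a 7-question pass/fail rubric with P0/P1/P2 severity levels:

```python
# Hypothetical phrasing of the 7 rubric questions.
RUBRIC = [
    "Is the workflow status labeled correctly?",
    "Is the failure step (if any) identified?",
    "Is the severity level appropriate?",
    "Is the failure bucket specific enough?",
    "Does the reasoning cite evidence from the run?",
    "Were sensitive-data handling rules followed?",
    "Is the annotation internally consistent?",
]
PASS_THRESHOLD = 6  # hypothetical: allow at most one miss

# Severity levels as named in the case study.
SEVERITY_CODES = {"P0": "Critical", "P1": "Major", "P2": "Minor"}

def score_audit(answers: list[bool]) -> dict:
    """Score one audited annotation: one pass/fail answer per rubric question."""
    if len(answers) != len(RUBRIC):
        raise ValueError("one answer per rubric question is required")
    score = sum(answers)
    return {"score": score, "passed": score >= PASS_THRESHOLD}
```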
AI models evolve fast, and the tools need to keep pace. I implemented a versioning system with dated entries, clear change descriptions, and "UPDATED" / "ADDED" tags inline so users could quickly identify what changed. This reduced confusion during transitions and gave experienced users a fast path to the latest updates.
Rather than organizing by concept, I structured the system to mirror the user's actual task flow — step 1 through step 9. This reduced cognitive load and made the interface usable as a real-time reference during annotation.
For complex classification rules (like status + failure category combinations), I used scannable tables with columns for status, category, and reasoning. This made correct pairings visible at a glance instead of buried in paragraphs.
I created a "what the model can and can't do" reference — organized into CAN, CANNOT, SHOULDN'T, FORBIDDEN, and NOT YET GOOD AT. This gave users the context they needed to make fair evaluations without over- or under-penalizing the AI.
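Structurally, the reference is a lookup from capability bucket to behaviors. The five bucket names come from the case study; the example entries under each are illustrative guesses:

```python
# The five capability buckets from the case study; entries are hypothetical.
CAPABILITY_REFERENCE = {
    "CAN": ["navigate multi-page sites", "extract visible on-page text"],
    "CANNOT": ["act outside the browser"],
    "SHOULDN'T": ["submit payments without explicit confirmation"],
    "FORBIDDEN": ["bypass login or security controls"],
    "NOT YET GOOD AT": ["long multi-step form filling"],
}

def capability_bucket(action: str) -> str:
    """Return which bucket an action falls under, or 'unlisted'."""
    for bucket, actions in CAPABILITY_REFERENCE.items():
        if action in actions:
            return bucket
    return "unlisted"
```

A lookup like this lets an annotator check, before penalizing the model, whether a miss reflects a known limitation or a genuine regression.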
Instead of expecting users to read full release notes, I added "UPDATED" and "ADDED" markers directly in the relevant sections. Experienced users could scan for changes without re-reading the entire system.
- Users adopted the standardized workflow and evaluation framework
- Improved model accuracy through better annotation quality
- Reduced AI inconsistencies and biases
- Increased operational efficiency through standardized processes
This project reinforced something I believe deeply about product design: the best tools don't just tell people what to do — they help people make better decisions. Annotation work is fundamentally about decision-making under ambiguity, and the experience needed to reflect that. These users are experienced professionals who need autonomy, clarity, and immediate applicability.
If I were to iterate further, I'd build interactive scenario-based onboarding flows where users practice classifying real (anonymized) workflows before going live — with branching paths that mirror the decision complexity they face on the job. The edge case pattern library is a strong foundation for that kind of practice-driven product experience.