AI Product Case Study · Raoul Kahn

EvalBench: Evaluating AI-Generated Airbnb Listings

A spec-driven evaluation of an AI listing description generator using the Hamel/Shreya error analysis methodology — identifying 9 failure modes, building validated LLM judges, and reducing the top failure rate from 82% to 1%.

Raoul Kahn · raoulkahn.com/portfolio

The Scenario

Airbnb's listing creation flow collects structured data — property type, bedrooms, amenities — plus a free-text description field. Today, the platform auto-generates a generic placeholder that ignores everything the host provided:

[Screenshot: Airbnb's listing creation form. Prompt: "Create your description. Share what makes your place special." Auto-generated placeholder: "You'll have a great time at this comfortable place to stay." (59/500 characters)]

Imagine the PM team wants to introduce an AI assistant that takes the host's rough notes and structured data, then generates a polished listing description. The hypothesis: better descriptions could increase renter engagement and booking rates.

I built a prototype of this AI assistant, then systematically evaluated whether the output was trustworthy enough to ship. The question wasn't just "can AI write listings?" — it was "can we trust the output before it reaches real users?"

This project applies the offline eval methodology used by applied AI teams at companies like Anthropic and Airbnb: evaluate systematically, find and fix failure modes, validate your measurements, then decide whether to ship.

Methodology

I followed the error analysis methodology taught by Hamel Husain (ex-Airbnb, ex-GitHub ML) and Shreya Shankar (UC Berkeley) — the standard approach for application-centric AI evaluation.

01

Define the Spec

Wrote a 9-rule quality spec for listing descriptions based on Airbnb research — platform policies, real listing analysis, and guest complaints from Reddit.

02

Generate Synthetic Traces

Created 100 input-output pairs: realistic host inputs across 3 property types and 3 price tiers, each run through a zero-shot baseline prompt. Included 10 adversarial edge cases.

03

Open Coding (Manual Review)

Manually reviewed every trace. Read the host input, read the AI output, noted the first thing wrong. No LLM assistance — this step cannot be automated.

04

Axial Coding (Categorization)

Organized raw notes into 9 formal failure categories using LLM-assisted pattern matching, then risk-tiered each one by severity.

05

Build Binary Judges

Created 4 LLM-as-judge evaluators — one per top failure mode. Each returns TRUE (failure present) or FALSE. No 1-5 scales.

06

Validate Judges

Compared each judge's labels against 34 human-labeled traces using confusion matrices. When judge and human disagreed, documented why — disagreement analysis.

07

Improve & Measure

Improved the generation prompt with spec-derived guardrails, re-ran all 100 traces, measured before/after failure rates.

The Spec (Excerpt)

Before reviewing any AI output, I wrote a 9-rule quality spec defining what a good listing description must do. Every failure mode maps back to a spec violation. Here are three example rules:

Rule 1 — Amenity Accuracy

The description must only mention amenities the host selected or described. Do not add qualifiers ("high-speed", "fully equipped", "luxury") unless the host used those exact words.

Rule 2 — No Unverified Claims

Do not add neighborhood details, landmarks, or proximity claims the host did not provide. "Near restaurants" cannot become "trendy restaurants" or "world-class dining."

Rule 6 — No Liability Language

Do not make promises the host cannot control ("guaranteed quiet", "perfectly safe") or encourage unsafe behavior ("dive into the pool"). No safety claims about the neighborhood.

Each rule is binary and operationalizable — you can look at any trace and determine pass or fail without subjective judgment. That's what makes them useful as judge criteria.
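To make that concrete, here is a minimal sketch of how a binary rule like Rule 1 can be phrased as an LLM-as-judge prompt with strict TRUE/FALSE parsing. The prompt wording and function names are illustrative, not the project's actual judge prompts:

```python
# Hypothetical sketch: turning a binary spec rule into a judge prompt.

RULE_1_JUDGE_PROMPT = """You are auditing an AI-written Airbnb listing.

Rule: The description must only mention amenities the host selected or
described, with no added qualifiers ("high-speed", "fully equipped",
"luxury") unless the host used those exact words.

Host input:
{host_input}

Generated description:
{description}

Does the description VIOLATE this rule? Answer with exactly one word:
TRUE (violation present) or FALSE (no violation)."""


def parse_judge_verdict(raw: str) -> bool:
    """Map the judge's reply to a boolean; fail loudly on anything else."""
    verdict = raw.strip().upper()
    if verdict not in ("TRUE", "FALSE"):
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return verdict == "TRUE"


prompt = RULE_1_JUDGE_PROMPT.format(
    host_input="2BR apartment, wifi, kitchen",
    description="Cozy 2BR with high-speed wifi and a fully equipped kitchen.",
)
print(parse_judge_verdict("TRUE"))  # -> True (failure present)
```

Forcing a single-word verdict and raising on anything else keeps the judge's output machine-checkable, which is what makes the confusion-matrix validation later possible.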

Failure Modes Discovered

Manual review revealed 9 distinct failure categories. The AI wasn't making random errors — it was systematically embellishing, overselling, and inventing details in predictable patterns.

Failure Mode | Risk Tier | Example
Liability-risk language | Legal | "Dive into crystal-clear waters" — unknown pool depth
Discriminatory content | Legal | Host said "no kids" → AI softened to 21+ age restriction
Misleading property claims | Legal | 3BR + den listed as "4-Bedroom" in title
Contradiction handling | Trust | Pool not in amenities list but AI promoted it from notes
Amenity embellishment | Trust | "wifi" → "high-speed wifi", "kitchen" → "fully equipped kitchen"
Location exaggeration | Quality | "5 blocks from ocean" → "Steps from the beach"
Insufficient input handling | Quality | Host wrote "nice place" → AI fabricated entire listing
Injected marketing adjectives | Cosmetic | "Charming," "stunning," "stylish" — host used no adjectives
AI breaking character | Cosmetic | Meta-note to host: "I've reframed the listing to be welcoming..."

Judge Validation

Built 4 binary LLM judges and validated each against my human labels on 34 traces. Agreement percentage alone is misleading — the confusion matrix reveals the real performance.

Amenity Embellishment

Agreement: 79% · Precision: 93% · Recall: 84%

             | Judge: FALSE | Judge: TRUE
Human: FALSE |       1      |      2
Human: TRUE  |       5      |     26

Injected Adjectives

Agreement: 88% · Precision: 100% · Recall: 88%

             | Judge: FALSE | Judge: TRUE
Human: FALSE |       0      |      0
Human: TRUE  |       4      |     30

Liability Language

Agreement: 82% · Precision: 14% · Recall: 100%

             | Judge: FALSE | Judge: TRUE
Human: FALSE |      27      |      6
Human: TRUE  |       0      |      1

Location Exaggeration

Agreement: 68% · Precision: 75% · Recall: 53%

             | Judge: FALSE | Judge: TRUE
Human: FALSE |      14      |      3
Human: TRUE  |       8      |      9

The liability judge is the key insight. 82% agreement looks fine — but the confusion matrix reveals only 14% precision. The judge was flagging marketing enthusiasm ("perfect getaway") as legal liability. After refining the prompt to distinguish marketing phrases from actual safety/guarantee language, precision improved significantly.
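The arithmetic behind that diagnosis is worth making explicit. Here is a minimal sketch of the metric computation, using the liability judge's confusion-matrix counts (1 true positive, 6 false positives, 0 false negatives, 27 true negatives over the 34 labeled traces):

```python
def judge_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Agreement, precision, and recall from confusion-matrix counts.

    TRUE = judge flags a failure, so precision asks: when the judge
    flags, how often is a human-confirmed failure actually present?
    """
    total = tp + fp + fn + tn
    return {
        "agreement": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Liability judge counts from the 34 human-labeled traces
m = judge_metrics(tp=1, fp=6, fn=0, tn=27)
print(f"{m['agreement']:.0%} agreement, "
      f"{m['precision']:.0%} precision, {m['recall']:.0%} recall")
# -> 82% agreement, 14% precision, 100% recall
```

High agreement with terrible precision is exactly the pattern to watch for when the positive class (actual liability language) is rare: the judge can agree on all the easy negatives while being wrong almost every time it flags.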

Results: Before & After

After improving the generation prompt with spec-derived guardrails, failure rates dropped dramatically across all four evaluated categories.

The guardrail strategy was direct: I injected each spec rule as an explicit constraint into the system prompt. For example, Rule 1 became "ONLY mention amenities the host listed. Do not upgrade them — don't say 'high-speed wifi' if the host said 'wifi'." Nine rules in total, each mapped to a discovered failure mode. No structured output or complex prompting techniques — just clear, specific instructions derived from real failure data.
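A minimal sketch of that assembly step (the guardrail wording and helper names are illustrative, not the project's actual prompt):

```python
# Hypothetical sketch: each spec rule becomes an explicit, numbered
# constraint appended to the generation system prompt.

SPEC_GUARDRAILS = [
    "ONLY mention amenities the host listed. Do not upgrade them: "
    "don't say 'high-speed wifi' if the host said 'wifi'.",
    "Do not add neighborhood details, landmarks, or proximity claims "
    "the host did not provide.",
    "Do not make promises the host cannot control or encourage unsafe "
    "behavior. No safety claims about the neighborhood.",
    # ...one guardrail per spec rule / discovered failure mode
]

BASE_SYSTEM_PROMPT = (
    "You write Airbnb listing descriptions from the host's structured "
    "data and notes. Follow every constraint below without exception:"
)

def build_system_prompt(guardrails: list[str]) -> str:
    """Join the base instruction with a numbered constraint list."""
    rules = "\n".join(f"{i}. {g}" for i, g in enumerate(guardrails, 1))
    return f"{BASE_SYSTEM_PROMPT}\n\n{rules}"

print(build_system_prompt(SPEC_GUARDRAILS))
```

Keeping the guardrails as a plain list means each constraint stays traceable to the spec rule and failure mode that motivated it.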

Failure Mode | Before | After | Change
Amenity Embellishment | 73% | 0% | -73%
Injected Adjectives | 82% | 1% | -81%
Liability Language | 30% | 1% | -29%
Location Exaggeration | 26% | 0% | -26%
Caveat

These results are from offline evaluation with synthetic data. I'd want to validate on real host inputs before shipping, and the near-zero numbers likely benefit from using the same model to generate and judge. In production, I'd add human spot-checks on a rolling sample to guard against judge drift.

What Would I Ship?

Not every failure mode has the same fix. Some are prompt problems, some are product problems, and some are acceptable risks.

Failure Mode | Before → After | Action | Rationale
Liability language | 30% → 1% | Block launch until 0% | Legal risk. Add post-processing filter as safety net.
Amenity embellishment | 73% → 0% | Ship with verification UX | Prompt fixed it, but add a host review step before publishing to catch edge cases.
Location exaggeration | 26% → 0% | Ship, monitor | Prompt fixed it. In production, cross-reference against verified location data.
Sparse input hallucination | N/A | UX gate: minimum input | Not a prompt problem. Require a minimum character count before AI generates, or have the AI ask follow-up questions.
Injected adjectives | 82% → 1% | Accept for v1 | Cosmetic. If data shows "charming" drives bookings without increasing complaints, consider allowing it.

Beyond Prompt Fixes

The evaluation revealed that a production-grade AI listing assistant needs more than a good prompt. These are the product-level recommendations I'd advocate for as a PM:

Multi-signal input architecture.

Cross-reference text notes, structured data, and photos. Three signals agreeing = high confidence. Signals conflicting = flag, don't guess.

Human-in-the-loop review.

Show the host the AI's draft alongside their original input and let them approve, edit, or regenerate. This catches embellishments at the source and builds host trust in the tool.

Upstream input validation.

If the host provides fewer than N characters, don't generate — ask follow-up questions instead. The hallucination problem shrinks the more verified input the AI has to work with.
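This gate is a few lines of product logic. A minimal sketch; the threshold N is a product decision, and the 40-character value here is purely illustrative:

```python
# Sketch of the upstream input gate. MIN_INPUT_CHARS is a hypothetical
# threshold, not a value from the project.

MIN_INPUT_CHARS = 40

def can_generate(host_notes: str) -> bool:
    """Gate AI generation on input length; too-sparse input should
    trigger follow-up questions instead of a generated listing."""
    return len(host_notes.strip()) >= MIN_INPUT_CHARS

print(can_generate("nice place"))  # -> False: ask follow-up questions
print(can_generate(
    "2BR garden flat, wifi, full kitchen, 5 blocks from the L train"
))  # -> True
```

A character count is a crude proxy; a production version might also require at least one concrete amenity or location detail before generating.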

Offline eval as quality gate.

Before any A/B test reaches real users, the AI output passes through validated eval judges. This pattern applies beyond Airbnb — any platform where AI assists user-generated content benefits from the same framework.
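Concretely, the gate can be a per-judge failure-rate threshold keyed to risk tier. A sketch under assumed thresholds (the limits below are illustrative, not the project's actual gates):

```python
# Hypothetical quality gate: each judge's measured failure rate must be
# at or under its risk-tier limit before the feature advances to A/B test.

GATE_THRESHOLDS = {                 # max tolerated failure rate
    "liability_language": 0.00,     # legal tier: block launch above 0%
    "amenity_embellishment": 0.02,  # trust tier
    "location_exaggeration": 0.02,  # quality tier
    "injected_adjectives": 0.05,    # cosmetic tier: looser
}

def passes_gate(failure_rates: dict[str, float]) -> bool:
    """True only if every judge is within its threshold; a judge with
    no measurement counts as failing (rate defaults to 1.0)."""
    return all(
        failure_rates.get(judge, 1.0) <= limit
        for judge, limit in GATE_THRESHOLDS.items()
    )

# Rates shaped like the after-improvement results in this case study
measured = {
    "liability_language": 0.01,
    "amenity_embellishment": 0.00,
    "location_exaggeration": 0.00,
    "injected_adjectives": 0.01,
}
print(passes_gate(measured))  # -> False: liability is 1%, gate demands 0%
```

Under these assumed thresholds the gate blocks the launch, which matches the "block launch until 0%" decision for liability language above.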

Tradeoff Worth Testing

Strict anti-embellishment rules may reduce warmth and persuasiveness. If internal data shows listings with words like "charming" drive 80% more bookings with minimal complaints, then injected adjectives might be a feature, not a failure. The eval identifies what the AI is doing; business data determines whether it matters. I'd A/B test the strict prompt against a version that allows subjective adjectives while still blocking verifiable embellishments like "high-speed wifi."

What This Demonstrates

This project demonstrates how I approach AI features: spec-first, failure-mapping, judge validation, and product-layer safeguards before experimentation. The eval is the quality gate between "the AI can do this" and "we should ship this to users." That's the gap where most AI products fail — not because the model isn't capable, but because no one systematically checked whether the output is trustworthy.

Technical Details

Model: Claude Sonnet 4.5 (generation + judges)

Traces: 100 synthetic (50 whole home, 30 private room, 20 unique stay) across 3 price tiers, including 10 adversarial edge cases

Judges: 4 binary LLM-as-judge evaluators, validated against 34 human-labeled traces

Total API cost: ~$5 across all generation and evaluation runs

Tags: LLM evaluation · error analysis · open coding · axial coding · LLM-as-judge · confusion matrix · disagreement analysis · prompt engineering · risk tiering · Python · Anthropic API