UXBench

Measuring the actionability of LLM-generated UX critiques.

Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.

Method

Benchmark Construction

UXBench fixes everything downstream of the judge so that any score difference reflects the report, not the environment. Only the first two stages depend on the judge under test.

**Evaluation pipeline.** Real-product anchors and synthetic siblings form the fixture suite; each judge explores the same runnable surfaces, writes an evidence-grounded critique, and is compared through a fixed repair-and-score pipeline.

Run protocol

Explore (judge model): Browser exploration with coverage checks.
Report (judge model): Evidence-grounded critique over seven dimensions.
Repair (held fixed): Fixed agent edits from the report.
Score (held fixed): Fixed scorer rates the repaired interface.
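
The four stages above can be sketched as a single function in which only the first two callables change between judges. This is a minimal illustration, not the benchmark's actual harness: the function names, the `coverage` field, the 0.8 threshold, and the toy scores are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    fixture_id: str
    baseline_score: float
    repaired_score: float

    @property
    def lift(self) -> float:
        # Repair lift: repaired score minus the unrepaired baseline.
        return self.repaired_score - self.baseline_score


def run_fixture(fixture_id, explore, write_report, repair, score, min_coverage=0.8):
    """Run one fixture; `explore` and `write_report` belong to the judge under test."""
    evidence = explore(fixture_id)              # stage 1: judge-dependent
    if evidence["coverage"] < min_coverage:     # coverage gate (hypothetical threshold)
        raise ValueError(f"coverage {evidence['coverage']:.2f} is below the gate")
    report = write_report(evidence)             # stage 2: judge-dependent
    repaired_ui = repair(fixture_id, report)    # stage 3: held fixed
    # Stage 4: the same fixed scorer rates both the baseline and the repaired UI.
    return RunResult(fixture_id, score(fixture_id), score(repaired_ui))


# Toy stand-ins so the sketch runs end to end (all values invented).
def toy_explore(fixture_id):
    return {"coverage": 0.9, "notes": ["cards overflow the mobile viewport"]}

def toy_report(evidence):
    return "; ".join(evidence["notes"])

def toy_repair(fixture_id, report):
    return fixture_id + "+repaired"

def toy_score(ui):
    return 4.0 if ui.endswith("+repaired") else 3.0


result = run_fixture("booking-01", toy_explore, toy_report, toy_repair, toy_score)
print(result.lift)
```

Because `repair` and `score` are passed in unchanged for every judge, any difference in `lift` between two judges in this sketch can only come from what they explored and reported.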

Rubric dimensions

Fixture families and catalog

41 fixtures across 10 surface families: 11 real-product anchors and 30 synthetic siblings.

Main results

Leaderboard

Every judge starts from the same unrepaired baseline. Each result below carries its own toggle, so you can read the leaderboard table, the pairwise win-rate heatmap, the per-family profiles, and fixture-level reliability independently — through either the automated repair-lift scorer or the blind human study.

Automated evaluation

Win-rate heatmap
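
One plausible way to derive a pairwise win-rate matrix from per-fixture repaired scores is sketched below. The page does not define how a "win" is counted, so the rule here (strictly higher repaired score on a shared fixture, ties counted as half a win) is an assumption, and the judge names and scores are invented.

```python
from itertools import permutations


def win_rate_matrix(scores):
    """scores[judge][fixture] -> repaired score.

    Returns {(row, col): fraction of shared fixtures where row beats col},
    counting ties as half a win (an assumed convention).
    """
    rates = {}
    for a, b in permutations(scores, 2):
        shared = scores[a].keys() & scores[b].keys()
        if not shared:
            continue  # no common fixtures, so no cell for this pair
        wins = 0.0
        for fixture in shared:
            if scores[a][fixture] > scores[b][fixture]:
                wins += 1.0
            elif scores[a][fixture] == scores[b][fixture]:
                wins += 0.5
        rates[(a, b)] = wins / len(shared)
    return rates


toy_scores = {
    "judge_a": {"f1": 4.0, "f2": 3.5, "f3": 4.2, "f4": 4.1},
    "judge_b": {"f1": 3.8, "f2": 3.5, "f3": 4.5, "f4": 3.0},
}
rates = win_rate_matrix(toy_scores)
print(rates[("judge_a", "judge_b")], rates[("judge_b", "judge_a")])
```

With ties split evenly, the two cells for any pair sum to one, which makes the heatmap antisymmetric around 0.5.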

Scores by surface family

Surface families

One dial per surface family. Each wedge is one of the eight judges, drawn to its mean repaired score within that family, with the family's average and best score annotated below.

Agent per-family profiles: each surface family shows the eight judges as wedges sized by their repaired score

Blind-human per-family profiles: each surface family shows the eight judges as wedges sized by their repaired score

Category-level repaired-score profiles across the eight evaluated models. Each panel reports the category mean and the best repaired score. The leading model changes across surface types, and the gap between category averages and best-model scores varies noticeably across surfaces.
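
The two numbers annotated on each panel, the family mean and the best score, can be computed as in the sketch below. The input layout and the toy values are assumptions for illustration; only the family names come from this page.

```python
from statistics import mean


def family_profile(scores):
    """scores[judge][family] -> mean repaired score of that judge in that family."""
    families = sorted({fam for per_judge in scores.values() for fam in per_judge})
    profile = {}
    for fam in families:
        by_judge = {j: s[fam] for j, s in scores.items() if fam in s}
        best_judge = max(by_judge, key=by_judge.get)
        profile[fam] = {
            "mean": mean(by_judge.values()),   # family average across judges
            "best": by_judge[best_judge],      # best repaired score in the family
            "best_judge": best_judge,          # which judge leads this family
        }
    return profile


toy_family_scores = {
    "judge_a": {"dashboard": 4.0, "booking": 3.0},
    "judge_b": {"dashboard": 3.0, "booking": 3.5},
}
profile = family_profile(toy_family_scores)
print(profile["dashboard"], profile["booking"])
```

In the toy data the leader differs between the two families, which is the pattern the caption describes for the real results.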

Scores by fixture

Per-fixture robustness

Each judge's mean repaired score (dot) and its min–max spread across individual fixtures, repeated on the right as Δ over each fixture's own baseline.

Agent fixture-level range plot: per-judge mean repaired score and min–max spread, plus delta over baseline

Blind-human fixture-level range plot: per-judge mean repaired score and min–max spread, plus delta over baseline

Per-fixture repaired scores and uplift Δ relative to each fixture's baseline. Models with higher mean lift also show wider fixture-level spreads, a reliability difference that the aggregate column cannot expose.
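
The quantities in the range plot (mean, min–max spread, and Δ over each fixture's own baseline) reduce to a few lines of arithmetic. The sketch below assumes one judge's scores keyed by fixture id; the fixture ids and values are invented.

```python
from statistics import mean


def fixture_spread(repaired, baseline):
    """Summarize one judge: repaired[fixture] and baseline[fixture] are scores."""
    # Delta is taken against each fixture's own unrepaired baseline.
    deltas = {f: repaired[f] - baseline[f] for f in repaired}
    return {
        "mean": mean(repaired.values()),
        "min": min(repaired.values()),
        "max": max(repaired.values()),
        "mean_delta": mean(deltas.values()),
        "min_delta": min(deltas.values()),
        "max_delta": max(deltas.values()),
    }


toy_baseline = {"f1": 3.0, "f2": 2.5, "f3": 3.5}
toy_repaired = {"f1": 4.0, "f2": 3.0, "f3": 3.5}
spread = fixture_spread(toy_repaired, toy_baseline)
print(spread)
```

A judge can have a high `mean_delta` and still show a `min_delta` of zero on some fixture, which is the kind of unevenness the spread is meant to surface.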

Human review

Human validation

This section holds the two protocols side by side: how the ranking reorders, where human and automated scores diverge by surface, and the paired significance behind every lift estimate.

Human score gaps

Where humans and the automated scorer diverge

Each cell is the blind human rating minus the automated LLM score for one judge on one surface family. Green means human reviewers rated the repaired page higher than the automated scorer; red means the reverse.
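
As a small worked example of the cell definition above, the sketch below subtracts the automated score from the human rating for each (judge, family) pair and maps the sign to the heatmap colour. The keys and scores are invented, and the "neutral" label for an exact zero is an assumption.

```python
def human_minus_auto(human, auto):
    """Both inputs map (judge, family) -> repaired score under that protocol."""
    gaps = {}
    for key in human.keys() & auto.keys():
        gap = human[key] - auto[key]
        # Green: humans rated the repaired page higher; red: the reverse.
        colour = "green" if gap > 0 else "red" if gap < 0 else "neutral"
        gaps[key] = (gap, colour)
    return gaps


toy_human = {("judge_a", "dashboard"): 4.5, ("judge_a", "booking"): 3.5}
toy_auto = {("judge_a", "dashboard"): 4.0, ("judge_a", "booking"): 3.75}
gaps = human_minus_auto(toy_human, toy_auto)
print(gaps[("judge_a", "dashboard")], gaps[("judge_a", "booking")])
```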

Heatmap of human-minus-LLM repaired-score gaps for eight judge models across ten surface families; values are mostly positive (green), largest on dashboard, data-visualization, chatbot, and mobile surfaces.

Calibration, not contradiction. The gaps are almost entirely positive and widest on visually dense, stateful, or mobile-facing surfaces — dashboard, data visualization, chatbot, and mobile UI — where the automated scorer is least sensitive to the visual coherence and interaction legibility that humans perceive directly. Automated repair lift stays a reliable screen; human review is what interprets the close ranks.

Rank shift & correlation

Rank comparison

Directional, not identical. GPT-5.4 tops both protocols and Claude-Sonnet-4.6 stays near the top, but the middle of the pack reorders once humans rate the repaired interfaces.
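
Rank shift and rank correlation between the two protocols can be illustrated with a Spearman coefficient. The page does not say which correlation statistic it reports, so Spearman's rho is an assumption here, as are the judge names and scores; the closed-form formula used below also assumes there are no tied scores.

```python
def ranks(scores):
    """Map judge -> rank, where rank 1 is the highest score (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {judge: position + 1 for position, judge in enumerate(ordered)}


def spearman_rho(rank_a, rank_b):
    """Spearman correlation from two tie-free rankings of the same judges."""
    n = len(rank_a)
    d_squared = sum((rank_a[j] - rank_b[j]) ** 2 for j in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


toy_auto = {"judge_a": 4.2, "judge_b": 4.0, "judge_c": 3.8, "judge_d": 3.5}
toy_human = {"judge_a": 4.5, "judge_b": 4.1, "judge_c": 4.3, "judge_d": 3.9}

auto_rank, human_rank = ranks(toy_auto), ranks(toy_human)
# Positive shift: the judge moves up when humans rate the repaired interfaces.
shift = {j: auto_rank[j] - human_rank[j] for j in auto_rank}
rho = spearman_rho(auto_rank, human_rank)
print(shift, round(rho, 3))
```

In the toy data the top and bottom judges keep their places while the two middle judges swap, giving a high but imperfect correlation, which mirrors the "directional, not identical" reading.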

Automated lift — site-paired tests

Human lift — site-paired tests

Bootstrap CIs over fixtures; paired t and Wilcoxon p-values against each fixture's unrepaired baseline; d_z is the paired effect size. Every judge stays significantly above zero under both protocols.
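
The paired statistics named above can be sketched with the standard library alone. This is an illustration under assumptions, not the study's analysis code: it uses a percentile bootstrap with 2,000 resamples and a fixed seed, takes d_z as the mean paired difference divided by the standard deviation of the differences, and leaves out p-values (SciPy's `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` would be the usual way to obtain them). The scores are invented.

```python
import math
import random
from statistics import mean, stdev


def paired_summary(repaired, baseline, n_boot=2000, seed=0):
    """Paired lift statistics over fixtures; inputs are aligned score lists."""
    diffs = [r - b for r, b in zip(repaired, baseline)]
    n = len(diffs)
    m, s = mean(diffs), stdev(diffs)
    d_z = m / s                          # paired effect size
    t_stat = m / (s / math.sqrt(n))      # paired t statistic, n - 1 degrees of freedom

    # Percentile bootstrap: resample fixtures with replacement, keep the mean lift.
    rng = random.Random(seed)
    boots = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    ci_low = boots[int(0.025 * n_boot)]
    ci_high = boots[int(0.975 * n_boot) - 1]
    return {"mean_lift": m, "d_z": d_z, "t": t_stat, "ci": (ci_low, ci_high)}


toy_baseline = [3.0, 2.5, 3.5, 3.0, 2.0]
toy_repaired = [4.0, 3.0, 4.0, 3.5, 3.5]
summary = paired_summary(toy_repaired, toy_baseline)
print(summary)
```

A lift counts as "above zero" in this sketch when the lower end of the bootstrap interval is positive; the t statistic and d_z are related by t = d_z · √n.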

Findings

Key findings

UX judging is measurable, useful, and still unresolved: all eight judges improve the same fixtures under the same fixed repair and scoring pipeline, yet they differ along several axes at once. The six findings below summarize what the benchmark reveals.

Case study

AeroIQ dashboard comparison

Explore two runnable AeroIQ versions of the same monitoring dashboard, then choose which one better supports triage and diagnosis.

Example report

Report excerpt

One excerpt from a Booking-family report. Each issue is grounded in a visible failure mode and already phrased as a localized fix the repair agent can act on.

Booking fixture · persona: curious first-time visitor · 9 findings

“The core hotel funnel is generally understandable.” The biggest UX weaknesses are in result filtering and the mobile checkout layout, where overflow, tiny targets, and weak progress feedback make completion feel brittle.

High severity · mobile usability · Key booking pages overflow the mobile viewport

Pages: reservation, confirmation, hotel-detail, room-selection
Evidence: Content wider than the screen on every funnel step, pushing cards and header links off-screen.
Fix: Collapse containers, cards, headers, and form rows to a single-column mobile layout with no horizontal scroll.

High severity · feedback · Results pages render contradictory state

Pages: tokyo.html, shinjuku.html
Evidence: Hotel cards and counts render while the page simultaneously claims that no properties match the filters.
Fix: Render exactly one state at a time: real results, or a true empty state with recovery actions.

Medium severity · accessibility · Several core inputs are unlabeled

Pages: signin.html, register.html, reservation.html
Evidence: Placeholders stand in for labels, leaving password, guest-count, and payment fields weakly announced.
Fix: Add explicit labels, helper text, and validation messages that stay visible after typing begins.

Medium severity · flow · Checkout progress is weak where commitment is highest

Pages: reservation.html, confirmation.html
Evidence: The funnel asks for sensitive data without a persistent sense of what remains, what is optional, or when the booking is final.
Fix: Add inline validation and a visible completion checklist near the primary action.
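
The findings in the excerpt share a common shape (severity, category, title, pages, evidence, fix), which could be represented as a small record type. The schema below is a guess based on the excerpt, not the benchmark's actual report format; the field names, the `low` severity level, and the `to_repair_brief` helper are all hypothetical.

```python
from dataclasses import dataclass

# Assumed ordering; the excerpt only shows "high" and "medium".
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Finding:
    severity: str
    category: str
    title: str
    pages: list
    evidence: str
    fix: str


def to_repair_brief(findings):
    """Render findings as one line each, most severe first, for a repair agent."""
    ordered = sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
    return "\n".join(
        f"[{f.severity}] {f.title} ({', '.join(f.pages)}): {f.fix}" for f in ordered
    )


findings = [
    Finding(
        severity="medium",
        category="accessibility",
        title="Several core inputs are unlabeled",
        pages=["signin.html", "register.html", "reservation.html"],
        evidence="Placeholders stand in for labels.",
        fix="Add explicit labels, helper text, and validation messages.",
    ),
    Finding(
        severity="high",
        category="feedback",
        title="Results pages render contradictory state",
        pages=["tokyo.html", "shinjuku.html"],
        evidence="Hotel cards render while the page claims no properties match.",
        fix="Render exactly one state at a time.",
    ),
]
brief = to_repair_brief(findings)
print(brief)
```

Keeping `pages` and `fix` as separate fields is what lets each issue stay localized: the repair agent can be told where to edit and what to change without parsing free text.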

Citation

If UXBench helps your research, please cite the manuscript. This placeholder will be updated after public release or review.

% Citation placeholder · update after public release
@misc{uxbench2026,
  title     = {UXBench: Measuring the Actionability of LLM-Generated UX Critiques},
  author    = {Anonymous},
  year      = {2026},
  note      = {Manuscript in preparation}
}