
UXBench

Measuring the actionability of LLM-generated UX critiques.

Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.

Method

Benchmark Construction

UXBench fixes everything downstream of the judge so that any score difference reflects the report, not the environment. Only the first two stages depend on the judge under test.

Evaluation pipeline. Real-product anchors and synthetic siblings form the fixture suite; each judge explores the same runnable surfaces, writes an evidence-grounded critique, and is compared through a fixed repair-and-score pipeline.

Run protocol

  1. Explore (judge model)

    Browser exploration with coverage checks.

  2. Report (judge model)

    Evidence-grounded critique over seven dimensions.

  3. Repair (held fixed)

    Fixed agent edits from the report.

  4. Score (held fixed)

    Fixed scorer rates the repaired interface.
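The four stages can be sketched as a single function. This is a minimal illustration, not the benchmark's implementation: `run_protocol`, `RunResult`, `required_states`, `min_coverage`, and the four callables are all hypothetical names, and the coverage gate is only one plausible reading of "coverage checks".

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RunResult:
    fixture_id: str
    baseline_score: float
    repaired_score: float

    @property
    def lift(self) -> float:
        # Repair lift: repaired score minus the fixture's own baseline.
        return self.repaired_score - self.baseline_score


def run_protocol(
    fixture_id: str,
    page: str,
    required_states: Iterable[str],
    explore: Callable[[str], Iterable[str]],          # stage 1, judge under test
    report: Callable[[str, Iterable[str]], str],      # stage 2, judge under test
    repair: Callable[[str, str], str],                # stage 3, held fixed
    score: Callable[[str], float],                    # stage 4, held fixed
    min_coverage: float = 1.0,                        # assumed threshold
) -> RunResult:
    required = set(required_states)
    visited = set(explore(page))
    coverage = len(visited & required) / len(required)
    if coverage < min_coverage:
        # Hypothetical gate: no report is accepted without enough evidence.
        raise RuntimeError(f"coverage {coverage:.2f} below {min_coverage:.2f}")
    critique = report(page, visited)
    repaired_page = repair(page, critique)
    return RunResult(fixture_id, score(page), score(repaired_page))
```

Because `repair` and `score` are the same callables for every judge, any difference in `lift` between two runs on the same fixture comes from the first two stages.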

Rubric dimensions


    Fixture families and catalog

    41 fixtures across 10 surface families: 11 real-product anchors and 30 synthetic siblings.

    Main results

    Leaderboard

    Every judge starts from the same unrepaired baseline. Each result below (the leaderboard table, the pairwise win-rate heatmap, the per-family profiles, and fixture-level reliability) can be read independently under either protocol: the automated repair-lift scorer or the blind human study.

    Automated evaluation

    Win-rate heatmap

    Figure: Pairwise win-rate heatmaps across the eight judge models, under the automated (agent) scorer and the blind human study.
    Each cell is P(row model beats column model) on shared fixtures; the diagonal is omitted. Red favours the column, green the row.
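A minimal sketch of how such a cell could be computed from per-fixture repaired scores. The function name, the input layout, and the choice to count ties as half a win are assumptions; the page does not say how ties are handled.

```python
def pairwise_win_rates(scores):
    """Map {model: {fixture: repaired_score}} to {(row, col): P(row beats col)}.

    Only fixtures scored for both models are compared, and the diagonal
    (row == col) is omitted. Ties count as half a win (an assumption).
    """
    rates = {}
    for row in scores:
        for col in scores:
            if row == col:
                continue
            shared = scores[row].keys() & scores[col].keys()
            if not shared:
                continue  # no shared fixtures, so the cell stays undefined
            wins = 0.0
            for fixture in shared:
                a, b = scores[row][fixture], scores[col][fixture]
                if a > b:
                    wins += 1.0
                elif a == b:
                    wins += 0.5
            rates[(row, col)] = wins / len(shared)
    return rates
```

With this tie rule the two cells for any pair sum to 1, so the matrix is fully described by its upper triangle.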
    Scores by surface family

    Surface families

    One dial per surface family. Each wedge is one of the eight judges, drawn to its mean repaired score within that family, with the family's average and best score annotated below.

    Figure: Per-family profiles under the automated (agent) scorer and the blind human study; each surface family shows the eight judges as wedges sized by their repaired score.

    Category-level repaired-score profiles across the eight evaluated models. Each panel reports the category mean and the best repaired score. The leading model changes across surface types, and the gap between category averages and best-model scores varies noticeably across surfaces.

    Scores by fixture

    Per-fixture robustness

    Each judge's mean repaired score (dot) and its min–max spread across individual fixtures, repeated on the right as Δ over each fixture's own baseline.

    Figure: Fixture-level range plots under the automated (agent) scorer and the blind human study: per-judge mean repaired score and min–max spread, plus delta over baseline.

    Per-fixture repaired scores and uplift Δ relative to site baseline. Models with higher mean lift also show wider site-level spreads, a reliability difference that the aggregate score alone cannot expose.
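The quantities in this plot can be sketched for one judge as follows. The function and key names are hypothetical; the only assumption about the data is that each fixture has one repaired score and one baseline score.

```python
from statistics import fmean


def fixture_profile(repaired, baseline):
    """Summarise one judge across fixtures.

    repaired, baseline: {fixture_id: score}. Every fixture in `repaired`
    must also appear in `baseline`.
    """
    scores = list(repaired.values())
    # Delta is taken against each fixture's own baseline, not a global one.
    deltas = [repaired[f] - baseline[f] for f in repaired]
    return {
        "mean": fmean(scores),
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
        "mean_delta": fmean(deltas),
        "min_delta": min(deltas),
        "max_delta": max(deltas),
    }
```

Reporting `spread` next to `mean` is what separates a judge that helps everywhere from one with a high average but uneven fixture-level results.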

    Human review

    Human validation

    This section holds the two protocols side by side: how the ranking reorders, where human and automated scores diverge by surface, and the paired significance behind every lift estimate.

    Human score gaps

    Where humans and the automated scorer diverge

    Each cell is the blind human rating minus the automated LLM score for one judge on one surface family. Green means human reviewers rated the repaired page higher than the automated scorer; red means the reverse.

    Heatmap of human-minus-LLM repaired-score gaps for eight judge models across ten surface families; values are mostly positive (green), largest on dashboard, data-visualization, chatbot, and mobile surfaces.

    Calibration, not contradiction. The gaps are almost entirely positive and widest on visually dense, stateful, or mobile-facing surfaces (dashboard, data visualization, chatbot, and mobile UI), where the automated scorer is least sensitive to the visual coherence and interaction legibility that humans perceive directly. Automated repair lift remains a reliable screen; human review is needed to separate closely ranked judges.

    Rank shift & correlation

    Rank comparison

    Directional, not identical. GPT-5.4 tops both protocols and Claude-Sonnet-4.6 stays near the top, but the middle of the pack reorders once humans rate the repaired interfaces.
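One way to quantify "directional, not identical" is a per-model rank shift plus a rank correlation. The sketch below uses Spearman's rho as an assumed choice, since the page does not name the coefficient; the function names are hypothetical and scores are assumed to have no ties.

```python
def ranks(scores):
    """Rank models by score, 1 = best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_shift_and_spearman(auto_scores, human_scores):
    """Return ({model: auto_rank - human_rank}, Spearman rho).

    A positive shift means the model ranks higher under human review.
    Requires the same set of at least two models under both protocols.
    """
    if auto_scores.keys() != human_scores.keys():
        raise ValueError("both protocols must score the same models")
    auto_rank, human_rank = ranks(auto_scores), ranks(human_scores)
    shift = {m: auto_rank[m] - human_rank[m] for m in auto_rank}
    n = len(shift)
    # Closed form for Spearman's rho, valid only without ties.
    rho = 1 - 6 * sum(d * d for d in shift.values()) / (n * (n * n - 1))
    return shift, rho
```

A rho close to 1 with non-zero shifts in the middle of the table would match the pattern described here: the same leaders, a reordered middle.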

    Automated lift — site-paired tests

    Human lift — site-paired tests

    Bootstrap CIs over fixtures; paired t and Wilcoxon p-values against each fixture's unrepaired baseline; dz is the paired effect size. Every judge stays significantly above zero under both protocols.
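The statistics named above can be sketched with the standard library alone. This is an illustration under assumptions: the function names, the percentile bootstrap, the 2000 resamples, and the fixed seed are all choices made here, not details taken from the benchmark. P-values for the paired t and Wilcoxon tests would typically come from a statistics package such as SciPy (`scipy.stats.ttest_rel`, `scipy.stats.wilcoxon`).

```python
import math
import random
from statistics import fmean, stdev


def paired_diffs(repaired, baseline):
    """Per-fixture lift: repaired score minus that fixture's baseline."""
    return [r - b for r, b in zip(repaired, baseline)]


def paired_effect_size(diffs):
    """Cohen's d_z: mean paired difference over its sample standard deviation."""
    return fmean(diffs) / stdev(diffs)


def paired_t_statistic(diffs):
    """Paired t statistic, equal to d_z * sqrt(n), with n - 1 degrees of freedom."""
    return paired_effect_size(diffs) * math.sqrt(len(diffs))


def bootstrap_mean_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean lift, resampling fixtures."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]
```

A judge is "significantly above zero" in this sense when the lower end of the interval on the mean lift is positive and the paired tests reject a zero difference.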

    Findings

    Key findings

    UX judging is measurable, useful, and still unresolved: all eight judges improve the same fixtures under the same fixed repair and scoring pipeline, yet they differ along several axes at once. The six findings below summarize what the benchmark reveals.

    Case study

    AeroIQ dashboard comparison

    Explore two runnable AeroIQ versions of the same monitoring dashboard, then choose which one better supports triage and diagnosis.

    Example report

    Report excerpt

    One excerpt from a Booking-family report. Each issue is grounded in a visible failure mode and already phrased as a localized fix the repair agent can act on.

    Booking fixture · persona: curious first-time visitor · 9 findings

    “The core hotel funnel is generally understandable.” The biggest UX weaknesses are in result filtering and the mobile checkout layout, where overflow, tiny targets, and weak progress feedback make completion feel brittle.
    High severity · mobile usability: Key booking pages overflow the mobile viewport
    Pages: reservation, confirmation, hotel-detail, room-selection
    Evidence: Content wider than the screen on every funnel step, pushing cards and header links off-screen.
    Fix: Collapse containers, cards, headers, and form rows to a single-column mobile layout with no horizontal scroll.

    High severity · feedback: Results pages render contradictory state
    Pages: tokyo.html, shinjuku.html
    Evidence: Hotel cards and counts render while the page simultaneously claims that no properties match the filters.
    Fix: Render exactly one state at a time: real results, or a true empty state with recovery actions.

    Medium severity · accessibility: Several core inputs are unlabeled
    Pages: signin.html, register.html, reservation.html
    Evidence: Placeholders stand in for labels, leaving password, guest-count, and payment fields weakly announced.
    Fix: Add explicit labels, helper text, and validation messages that stay visible after typing begins.

    Medium severity · flow: Checkout progress is weak where commitment is highest
    Pages: reservation.html, confirmation.html
    Evidence: The funnel asks for sensitive data without a persistent sense of what remains, what is optional, or when the booking is final.
    Fix: Add inline validation and a visible completion checklist near the primary action.
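The shape of a finding in this excerpt can be sketched as a small record type. The class, field, and function names are hypothetical and do not reflect the benchmark's actual report schema; the "low" severity level is assumed, since the excerpt shows only high and medium.

```python
from dataclasses import dataclass

# "high" and "medium" appear in the excerpt; "low" is an assumed third level.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass(frozen=True)
class Finding:
    severity: str
    dimension: str
    title: str
    pages: tuple[str, ...]
    evidence: str
    fix: str

    def __post_init__(self):
        if self.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {self.severity!r}")
        # A finding is only actionable if it names where, what, and how to fix.
        if not (self.pages and self.evidence.strip() and self.fix.strip()):
            raise ValueError("finding needs pages, evidence, and a fix")


def repair_queue(findings):
    """Order findings so that the highest-severity issues come first."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
```

Rejecting findings without evidence or a fix mirrors the property highlighted in the excerpt: every issue is grounded and already phrased as a localized repair.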
    Citation

    If UXBench helps your research, please cite the manuscript. This placeholder will be updated after public release or review.

    % Citation placeholder · update after public release
    @misc{uxbench2026,
      title     = {UXBench: Measuring the Actionability of LLM-Generated UX Critiques},
      author    = {Anonymous},
      year      = {2026},
      note      = {Manuscript in preparation}
    }