uxtest runs synthetic-user UX studies against live web pages with Playwright
and EDSL remote inference. This README is written for coding agents that have
installed the package and need to discover the docs, copy examples, run studies,
and inspect evidence from a terminal.
If the package is not already installed, install it from the public repository:
pip install git+https://github.com/expectedparrot/uxtest.git
python -m playwright install chromium
When uxtest is published to PyPI, the install command can become pip install uxtest.
After installing uxtest, do not assume you have the source repository. Use the
built-in documentation commands first:
uxtest doctor
uxtest --version
uxtest docs
uxtest docs list
uxtest docs show root
uxtest examples list
uxtest figma doctor
Use docs show when you need instructions in the current terminal context:
uxtest docs show task-discovery
uxtest docs show report-writer-agent
uxtest docs show conversion-path-testing
uxtest docs show information-architecture
uxtest docs show study-types
uxtest docs show spec
Use docs path when another tool needs a filesystem path:
uxtest docs path root
uxtest docs path task-discovery
Use examples path when you only need to inspect a bundled fixture:
uxtest examples path expectedparrot-enterprise-demo
uxtest examples path saas-regression-edsl
Use examples copy before editing or running a bundled example as a project
artifact:
uxtest examples copy ./uxtest-examples
uxtest examples copy ./enterprise-demo.yaml --name expectedparrot-enterprise-demo
uxtest examples copy ./task-discovery-guide --name task-discovery
docs open exists for local interactive use, but sandboxed agents may need
permission before launching GUI apps:
uxtest docs open root
Use this map when deciding what kind of research study to run:
- First impression and orientation: use task-discovery to learn what visitors think a page is for, what they click first, what they misunderstand, and where they hesitate.
- Message and content interpretation: use content-comprehension to learn whether visitors can explain the offer, audience, claims, jargon, proof, and next step.
- Target-action paths: use conversion-path-testing to study demo, signup, pricing, checkout, contact-sales, lead form, or gated-asset paths.
- Findability and navigation: use information-architecture to study where visitors expect content such as docs, security, pricing, examples, support, API references, or case studies to live.
- Feature or capability proof: use feature-findability to learn whether visitors can determine if a product supports a feature, integration, workflow, API, export, permission model, or use case.
- Enterprise evaluation: use enterprise-buying-research to study whether buyers, technical evaluators, risk reviewers, and operators can find enough product, proof, security, implementation, and commercial evidence to continue.
- Competitive or variant comparison: use competitive-benchmark-studies to compare the same task across competitors, variants, staging/production, or before/after designs.
- First-run product activation: use onboarding-activation to study whether newly invited or signed-up users can reach a first meaningful action.
- Authenticated product workflows: use post-login-workflow-testing for role-specific logged-in tasks such as inviting users, configuring integrations, exporting reports, or changing settings.
- Inclusive UX risk discovery: use accessibility-inclusive-ux to locate risks for mobile-only, low-confidence, plain-language, low-vision, keyboard-oriented, or unfamiliar-domain users. This complements, but does not replace, formal accessibility testing.
- Tracking fixes over time: use longitudinal-regression to rerun known UX tasks, expected flaws, and redesign hypotheses across releases.
Each study guide includes an EDSL AgentList persona export pattern, fixture
template, run command, log.html inspection workflow, narrative report shape,
human screenshot validation guidance, and follow-on studies.
Choose the smallest doc that matches the task:
- root: this operational agent guide.
- report-writer-agent: instructions for a research/coding agent that needs to turn uxtest evidence into a narrative stakeholder report.
- task-discovery: fully worked study guide for first-impression and first-click discovery studies.
- conversion-path-testing: demo, signup, pricing, checkout, contact, or other target-action paths.
- information-architecture: expected content findability across nav, menus, search/find behavior, and mobile.
- enterprise-buying-research: enterprise trust, proof, technical, risk, and commercial evidence.
- competitive-benchmark-studies: same-task comparison across competitors, variants, or before/after designs.
- content-comprehension: messaging, audience, value proposition, jargon, claims, and next-step interpretation.
- feature-findability: feature, integration, workflow, or use-case evidence.
- onboarding-activation: first-run setup and first meaningful product action.
- post-login-workflow-testing: authenticated role-specific product workflows.
- accessibility-inclusive-ux: constrained personas and device/access needs.
- longitudinal-regression: repeated tasks, known flaws, and release regression checks.
- study-types: index of UXR study patterns bundled with the package.
- spec: implementation and architecture notes.
Choose examples by alias:
- expectedparrot: all Expected Parrot live-site fixtures.
- expectedparrot-task-discovery: first-impression and first-click fixture.
- expectedparrot-content-comprehension: homepage comprehension fixture.
- expectedparrot-conversion-path: demo/contact next-step fixture.
- expectedparrot-information-architecture: docs, product, company, and resource findability fixture.
- expectedparrot-feature-findability: EDSL/programmatic workflow evidence fixture.
- expectedparrot-enterprise-demo: live-site fixture for enterprise demo intent.
- expectedparrot-credibility: live-site fixture for credibility and seriousness.
- jjh-discovery: live-site fixture for homepage discovery.
- jjh-targeted: targeted live-site fixture.
- saas-regression: deterministic local SaaS fixture.
- saas-regression-edsl: EDSL-backed local SaaS fixture.
- task-discovery: the task discovery guide as an example resource.
If an alias is not enough, run:
uxtest docs list
uxtest examples list
Then pass any listed relative path to docs show, docs path, examples path,
or examples copy.
From a source checkout, use uv run:
uv sync
uv run uxtest --help
uv run python -m pytest -q tests
From an installed package, call uxtest directly unless your environment
requires a wrapper:
uxtest --help
uxtest doctor
EDSL is a normal PyPI dependency. Do not depend on a local EDSL checkout path.
Remote inference credentials are expected in .env or the environment. Normal
browser runs use EDSL remote inference and should not require a local OpenAI API
key.
uxtest doctor checks that uxtest is importable, EDSL is importable,
Playwright Chromium can launch, and pandoc is available for HTML/PDF narrative
reports. If Playwright browsers are missing, run:
python -m playwright install chromium
uxtest stores state under .uxtest/ in the current project:
- .uxtest/personas/*.yaml: reusable persona templates.
- .uxtest/studies/<study-id>/study.yaml: study definition.
- .uxtest/studies/<study-id>/runs/<run-id>/: raw traces, screenshots, and run metadata.
- .uxtest/studies/<study-id>/analysis/: generated reports and summaries.
- .uxtest/comparisons/*.html: multi-study comparison reports.
Raw run traces are the source of truth. Reports are derived and can be regenerated.
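Because the raw runs live at a predictable path, an agent can enumerate them without parsing any report. The following is a minimal sketch that relies only on the directory layout listed above; the function name and the my-study id are illustrative and are not part of uxtest.

```python
from pathlib import Path


def list_run_dirs(project_root, study_id):
    """Return the run directories of one study, sorted by name.

    Relies only on the documented layout:
    <project_root>/.uxtest/studies/<study-id>/runs/<run-id>/
    """
    runs_root = Path(project_root) / ".uxtest" / "studies" / study_id / "runs"
    if not runs_root.is_dir():
        return []  # study missing or never run
    return sorted(p for p in runs_root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # "my-study" is a placeholder study id, not a bundled one.
    for run_dir in list_run_dirs(".", "my-study"):
        print(run_dir.name)
```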
Copy a fixture into your workspace before editing it:
uxtest examples copy ./enterprise-demo.yaml --name expectedparrot-enterprise-demo
uxtest ci ./enterprise-demo.yaml
From a source checkout, existing source-tree examples can also be run directly:
uv run uxtest ci examples/saas_site/regression-edsl.yaml
uv run uxtest ci examples/jjh_site/discovery.yaml
uv run uxtest ci examples/jjh_site/targeted.yaml
The fixture runner will:
- Create or update personas.
- Create or update fixture-backed studies.
- Launch Playwright browser runs.
- Ask EDSL for browser decisions at each step when driver: edsl.
- Analyze the runs.
- Generate reports, animations, eval outputs, and comparison HTML when configured.
For public live sites, keep max_concurrent_runs low to avoid request bursts
from one IP.
Use this when no fixture exists:
uxtest study new "Homepage discovery" \
--url "https://example.com/" \
--task "Starting from the homepage, decide what you would click next and why." \
--success-criteria "The visitor identifies a relevant next action." \
--persona academic-researcher \
--runs-per-persona 1
Then run and analyze:
uxtest study run <study-id> --driver edsl --max-steps 8 --max-concurrent-runs 2
uxtest analyze <study-id> --include-interrupted
uxtest animate <study-id>
uxtest uxr <study-id>
A typical live-site fixture:
id: my-site-discovery
name: My Site Discovery
mode: live-site
comparison_title: My Site Discovery
comparison_output: my-site-discovery.html
url_template: https://example.com/
study_title: My Site Discovery ({variant})
task: >
  Starting from the homepage, decide whether this product is relevant to your
  goal. Find the most useful next item and explain what you would do next.
success_criteria: >
  The visitor identifies relevant evidence and can explain a concrete next
  action.
personas:
  - academic-researcher
runs_per_persona: 1
driver: edsl
max_steps: 8
max_concurrent_runs: 2
keep_runs: 8
analysis_driver: local
eval_policy: report_only
variants:
  - name: desktop
    device: desktop
  - name: mobile
    device: iphone
Built-in devices are desktop, iphone, and pixel.
For sign-in and multi-step setup, use deterministic setup_steps before EDSL
takes over. Setup values can come from environment variables and sensitive typed
values are redacted from setup traces.
env_file: secrets.env
redact_patterns:
  - "test-user-[^\\s]+"
  - "s3cr3t-[^\\s]+"
auth_state:
  save: .uxtest/auth/example-user.json
setup_steps:
  - type: click
    label: Log in
  - type: type
    name: email
    env: TEST_USER_EMAIL
    sensitive: true
  - type: type
    name: password
    env: TEST_USER_PASSWORD
    sensitive: true
  - type: click
    label: Continue
Load saved Playwright storage state in later studies:
auth_state:
  load: .uxtest/auth/example-user.json
Supported setup actions are click, type, select, wait, back, scroll,
and find. Supported selectors include selector, name, placeholder, and
label.
Do not use real user credentials. Use staging accounts, static test OTPs, or a test auth bypass. CAPTCHA and live MFA need a test bypass or manual hook.
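The redact_patterns entries in the setup example are regular expressions. To check what a pattern would cover before relying on it, a small Python sketch can help. This illustrates regex-based redaction in general and is not uxtest's own implementation; the [REDACTED] placeholder and the sample string are assumptions.

```python
import re

# Patterns copied from the redact_patterns example in this section.
# In the YAML they appear double-quoted, so "\\s" there is the regex \s.
REDACT_PATTERNS = [r"test-user-[^\s]+", r"s3cr3t-[^\s]+"]


def redact(text, patterns=REDACT_PATTERNS, placeholder="[REDACTED]"):
    """Replace every match of every pattern with a placeholder string."""
    for pattern in patterns:
        text = re.sub(pattern, placeholder, text)
    return text


print(redact("typed test-user-42@example.com then s3cr3t-abc123 into the form"))
# prints: typed [REDACTED] then [REDACTED] into the form
```

Each pattern stops at the first whitespace character, so a value that contains spaces would be only partly covered.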
After a run, inspect these first:
.uxtest/comparisons/<comparison>.html
.uxtest/studies/<study-id>/analysis/report.html
.uxtest/studies/<study-id>/analysis/log.html
.uxtest/studies/<study-id>/analysis/uxr_report.html
.uxtest/studies/<study-id>/analysis/findings.json
.uxtest/studies/<study-id>/analysis/scores.json
.uxtest/studies/<study-id>/analysis/animations/index.html
Use log.html to debug the system. It shows step-level details: persona,
scenario, screenshots, EDSL prompts, remote job metadata, model answers, and
browser action results.
Use report.html, uxr_report.html, and comparison reports to review findings.
Use raw traces when the reports look wrong:
uxtest show <study-id> --json
uxtest show <study-id> <run-id> --trace --json
uxtest trace <study-id>
uxtest trace <study-id> --edsl-jobs
rg -n '"outcome"|"action_recovery"|"type": "find"|"gave_up"|"max_steps"' .uxtest/studies/<study-id>
rg -n '"action_outcome"|"no_visible_change"|"menu_opened"|"same_page_state_change"' .uxtest/studies/<study-id>
rg -n '"stop_signal"|"stop_quality"|"enough_evidence_but_continued"|"blocked_by_auth"' .uxtest/studies/<study-id>
Generate a narrative report when the user wants a stakeholder-readable summary instead of the technical evidence report:
uxtest report <study-id>
uxtest report <study-id> --format html
uxtest report <study-id> --format md,html,pdf
The report command reads existing traces, screenshots, findings, and scores. It
writes analysis/narrative_report.md by default and uses pandoc for HTML/PDF.
When a coding or research agent is writing the final narrative itself, read the packaged report-writing guide first:
uxtest docs show report-writer-agent
That guide explains which artifacts to inspect, how to cite screenshots and
traces, how to use action_outcome and stop_quality, and how to avoid
overclaiming from synthetic browser-agent evidence.
Use batch report when several study reports need to become one cross-study
research synthesis. This command reads existing scores.json, findings.json,
run metadata, traces, screenshots, and logs. It deduplicates recurring findings,
summarizes outcomes, flags trace-quality signals such as clicks with no visible
advance, links source reports, and writes a narrative Markdown report. New
traces classify each browser action as url_navigation, hash_change,
new_tab, menu_opened, same_page_state_change, scroll,
no_visible_change, or a related outcome so agents do not treat all
non-navigation clicks as failures.
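To see how browser actions in a study were classified, the trace files can be tallied directly. This is a minimal sketch under two assumptions that the text does not confirm: each line of trace.jsonl is one JSON object, and the classification is stored as a string under a top-level action_outcome key. Check a real trace before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def count_action_outcomes(study_dir):
    """Count action_outcome values across <study_dir>/runs/*/trace.jsonl.

    Assumed schema: one JSON object per line, with the outcome stored as a
    string under a top-level "action_outcome" key.
    """
    counts = Counter()
    for trace_path in sorted(Path(study_dir).glob("runs/*/trace.jsonl")):
        with trace_path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    step = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip partial lines from interrupted runs
                if not isinstance(step, dict):
                    continue
                outcome = step.get("action_outcome")
                if isinstance(outcome, str):
                    counts[outcome] += 1
    return counts


if __name__ == "__main__":
    # Replace the placeholder path with a real study directory.
    print(count_action_outcomes(".uxtest/studies/my-study").most_common())
```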
Batch reports also classify run resolution quality. Use stop_quality in
scores.json and the "Run Resolution" section to distinguish done,
enough_evidence_but_continued, looping, blocked_by_auth,
blocked_by_no_visible_advance, unresolved, and error. New traces include
per-step stop_signal hints when the browser agent appears to have enough
evidence to answer an exploratory task but has not stopped yet.
uxtest batch report expectedparrot-cross-study \
--title "Expected Parrot Cross-Study Report" \
--study <study-id-a> \
--study <study-id-b> \
--comparison .uxtest/comparisons/<comparison-a>.html \
--format md,html,pdf
Use batch run when you have a YAML manifest listing fixture files and want to
run them before generating the synthesis:
id: expectedparrot-cross-study
title: Expected Parrot Cross-Study Report
formats: [md, html, pdf]
fixtures:
- examples/expectedparrot_site/task-discovery.yaml
- examples/expectedparrot_site/content-comprehension.yaml
- examples/expectedparrot_site/conversion-path.yaml
Run it:
uxtest batch run expectedparrot-batch.yaml
Batch reports are written to .uxtest/comparisons/ by default:
.uxtest/comparisons/<name>.md
.uxtest/comparisons/<name>.html
.uxtest/comparisons/<name>.pdf
.uxtest/comparisons/<name>.manifest.json
Use figma commands when the target is a Figma design or prototype rather than
a live web page. There are two workflows:
- Static design frames: import Figma image exports and ask EDSL vision models what a persona understands or would click next.
- Clickable prototypes: audit Figma metadata for visible labels versus wired interactions, then generate a Playwright runner that opens the shared prototype URL, captures screenshots, asks EDSL for the next action, and records a step trace.
Set a Figma access token before importing:
export FIGMA_ACCESS_TOKEN=...
uxtest figma doctor
Static frame imports require FIGMA_ACCESS_TOKEN. Prototype runners can run
without the Figma API, but prototype audits and high-quality prototype runners
use Figma metadata when FIGMA_ACCESS_TOKEN is set. Metadata is cached under
.uxtest/figma/cache/; if Figma returns 429, uxtest uses stale cached
metadata when available and records the rate-limit details.
Import a selected frame from a copied Figma selection URL:
uxtest figma import "https://www.figma.com/design/<file-key>/<name>?node-id=<node>"
If the URL is a whole file rather than a selected frame, import top-level frames:
uxtest figma import "https://www.figma.com/design/<file-key>/<name>" --frames top-level --limit 12
The command writes a local design evidence bundle:
.uxtest/figma/<import-id>/manifest.json
.uxtest/figma/<import-id>/frames/*.png
Generate an EDSL vision study script from an import:
uxtest figma study <import-id> \
--task "Can an enterprise visitor figure out what to click to schedule a demo?"
Or import and generate the script in one step:
uxtest figma study "https://www.figma.com/design/<file-key>/<name>?node-id=<node>" \
--task "What would a new visitor click first?"
The generated script is dry-run by default:
python .uxtest/figma/<import-id>/figma_vision_study.py
Run it with EDSL remote inference:
python .uxtest/figma/<import-id>/figma_vision_study.py --launch
Audit a clickable prototype before asking an agent to navigate it:
uxtest figma audit "https://www.figma.com/proto/<file-key>/<name>?node-id=<node>"
The audit writes:
.uxtest/figma/audit-<file-key>-<node>/audit.json
.uxtest/figma/audit-<file-key>-<node>/audit.md
Use the audit to identify visible labels that are not wired as prototype interactions, vague interaction labels, and likely dead-end affordances. For agents, this is the first command to run against a prototype because Figma renders much of the experience as canvas content; Playwright cannot reliably query visible labels from the DOM.
Generate a clickable prototype runner from a /proto/ URL:
uxtest figma prototype "https://www.figma.com/proto/<file-key>/<name>?node-id=<node>" \
--task "Can an enterprise visitor understand the product and find the demo path?" \
--max-steps 8
The generated runner is dry-run by default:
python .uxtest/figma/<prototype-id>/figma_prototype_runner.py
Launch the browser and EDSL coordinate-click loop:
python .uxtest/figma/<prototype-id>/figma_prototype_runner.py --launch
The runner records failure_type values such as unwired_visible_affordance,
repeated_no_op, coordinate_miss, and agent_invalid_decision. When Figma
interaction metadata is available, the runner gives EDSL exact candidate
interaction centers and snaps matching decisions to those centers before
clicking.
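The runner's exact matching rule is not documented in this guide. As one plausible reading of "snapping", the sketch below moves a proposed click to the nearest candidate interaction center when it lies within a distance threshold, and leaves it unchanged otherwise. The function name, the tuple format, and the 40-pixel threshold are all assumptions made for illustration.

```python
import math


def snap_to_center(click, centers, max_distance=40.0):
    """Return the nearest candidate center within max_distance, else the click.

    click and each center are (x, y) tuples in the same pixel space.
    max_distance is a hypothetical threshold, not a uxtest setting.
    """
    if not centers:
        return click
    nearest = min(centers, key=lambda center: math.dist(click, center))
    if math.dist(click, nearest) <= max_distance:
        return nearest
    return click


# A click a few pixels off a wired hotspot snaps onto it.
print(snap_to_center((105, 98), [(100, 100), (400, 300)]))  # (100, 100)
```

A click that stays unsnapped is a candidate for the coordinate_miss or unwired_visible_affordance failure types, so it is worth checking against the audit output.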
Use --headed when Figma requires browser login or when you need to inspect
prototype behavior manually:
python .uxtest/figma/<prototype-id>/figma_prototype_runner.py --launch --headed
Write a Markdown report of imported frames or a prototype run trace:
uxtest figma report <import-id>
uxtest figma report <prototype-id>
Use this workflow for design-stage questions:
- What does a persona think this screen is for?
- What would they click first?
- Which labels, CTAs, or visual hierarchy are confusing?
- Does the frame communicate credibility or enterprise readiness?
- Where does a clickable prototype route a persona, and where does the flow become blocked?
- How does the design intent compare to the later live site?
Use humanize-export when you want to validate synthetic findings with human
respondents through EDSL humanize(). The exporter does not record human
browser sessions. It turns selected uxtest screenshots into an EDSL survey
script that humans can answer.
uxtest humanize-export <study-id> \
--template task-discovery \
--screenshots representative \
--max-screenshots 8
The command writes:
.uxtest/studies/<study-id>/analysis/humanize_survey.py
.uxtest/studies/<study-id>/analysis/humanize_survey.manifest.json
The generated script is safe by default:
python .uxtest/studies/<study-id>/analysis/humanize_survey.py
It prints the study and scenario count without launching anything. To create the human survey on Expected Parrot, run:
python .uxtest/studies/<study-id>/analysis/humanize_survey.py --launch
The generated script uses EDSL's humanize_schema to control human-survey
presentation. In the script, look for:
HUMANIZE_SCHEMA = {
    "survey": {
        "custom_css": "..."
    }
}
When --launch is used, the script passes that schema to EDSL:
survey.by(scenarios).humanize(
    human_survey_name=args.name,
    scenario_list_method="ordered",
    survey_visibility=args.visibility,
    humanize_schema=HUMANIZE_SCHEMA,
)
The default exporter CSS constrains screenshots so they do not dominate the survey page:
img {
  display: block;
  width: auto !important;
  max-width: min(100%, 760px) !important;
  max-height: 70vh !important;
  height: auto !important;
  object-fit: contain !important;
}
Edit HUMANIZE_SCHEMA["survey"]["custom_css"] in the generated script before
launching if you need smaller screenshots, different borders, tighter spacing,
or other survey-level styling. This is the right place for presentation changes;
do not resize the original trace screenshots unless you specifically need lower
resolution evidence files.
Available templates:
- task-discovery: what is this page for, what would you click next, and how confident are you?
- credibility: what evidence makes the company/product credible, what proof is missing, and how confident are you?
- conversion: what is the next action, what blocks conversion, and how clear is the path?
- comprehension: what does the content say, what is confusing, and how confident is the reader?
Screenshot selection modes:
- representative: first, highest-frustration, and last screenshot per run, deduped up to --max-screenshots.
- first: first screenshot per run.
- last: last screenshot per run.
- first-last: first and last screenshot per run.
- high-frustration or confusing: highest-frustration screenshot per run.
- all: every trace screenshot, capped by --max-screenshots.
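The representative mode described above can be sketched as follows. This illustrates the stated rule (first, highest-frustration, and last screenshot per run, deduplicated and capped) and is not the exporter's actual code. The step dictionaries and their screenshot and frustration keys are hypothetical.

```python
def select_representative(runs, max_screenshots=8):
    """Pick first, peak-frustration, and last screenshots per run.

    runs is a list of runs; each run is a list of step dicts with the
    hypothetical keys "screenshot" (path) and "frustration" (number).
    Duplicates are dropped and the result is capped at max_screenshots.
    """
    selected = []
    seen = set()
    for steps in runs:
        if not steps:
            continue
        peak = max(steps, key=lambda step: step["frustration"])
        for step in (steps[0], peak, steps[-1]):
            shot = step["screenshot"]
            if shot not in seen:
                seen.add(shot)
                selected.append(shot)
    return selected[:max_screenshots]


runs = [
    [
        {"screenshot": "a.png", "frustration": 0},
        {"screenshot": "b.png", "frustration": 3},
        {"screenshot": "c.png", "frustration": 1},
    ],
    [{"screenshot": "d.png", "frustration": 2}],
]
print(select_representative(runs))  # ['a.png', 'b.png', 'c.png', 'd.png']
```

A one-step run contributes a single screenshot, because its first, peak, and last step are the same image.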
Use the manifest to see exactly which run, persona, step, URL, synthetic action, and screenshot were exported. After collecting human responses with EDSL, an agent can compare human stated interpretations and intended clicks against the synthetic browser traces.
Use saliency run when you want visual-attention evidence for screenshots from
a completed study. This is useful for questions about whether CTAs, trust
signals, forms, or navigation are likely to attract attention before the
synthetic visitor acts.
uxtest does not fabricate saliency maps. This command requires a real
external saliency model command. If no command is configured, it fails.
The command writes:
.uxtest/studies/<study-id>/analysis/saliency/manifest.json
.uxtest/studies/<study-id>/analysis/saliency/index.html
.uxtest/studies/<study-id>/analysis/saliency/*-overlay.png
.uxtest/studies/<study-id>/analysis/saliency/*-map.png
Use SUM by wrapping its inference command:
export UXTEST_SUM_DIR=/path/to/SUM
uxtest saliency run <study-id> --sum \
--screenshots representative \
--max-screenshots 12
The --sum shorthand runs a command shaped like:
python $UXTEST_SUM_DIR/inference.py \
--img_path {input} \
--condition 3 \
--output_path {output_dir} \
--saliency_map_type Overlay
If your SUM checkout or another saliency model needs a different invocation,
use --engine command:
uxtest saliency run <study-id> \
--engine command \
--command-template "python /path/to/inference.py --img_path {input} --output_path {output_dir} --condition 3 --saliency_map_type Overlay"
Supported placeholders are:
- {input}: source screenshot path.
- {output}: preferred overlay output path.
- {map}: preferred raw saliency-map output path.
- {output_dir}: per-screenshot working output directory.
- {scenario_id}: stable screenshot scenario id.
If the command does not write {output}, uxtest copies the newest image from
{output_dir} as the overlay. The manifest records the exact command,
return code, stdout, stderr, screenshot, overlay, persona, run, step, and
synthetic action.
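Before launching a long saliency run, it can help to preview what a command template expands to for one screenshot. The sketch below fills the documented placeholders with Python's str.format and splits the result into an argument list. How uxtest itself performs the substitution is not stated here, so treat the quoting behavior and the sample paths as assumptions.

```python
import shlex

TEMPLATE = (
    "python /path/to/inference.py --img_path {input} "
    "--output_path {output_dir} --condition 3 --saliency_map_type Overlay"
)


def build_command(template, **values):
    """Fill placeholders and return an argv list.

    Every value is shell-quoted first so paths containing spaces stay
    one argument after splitting.
    """
    quoted = {key: shlex.quote(str(value)) for key, value in values.items()}
    return shlex.split(template.format(**quoted))


argv = build_command(
    TEMPLATE,
    input="shots/step 1.png",  # hypothetical screenshot path
    output="out/step-1-overlay.png",
    map="out/step-1-map.png",
    output_dir="out",
    scenario_id="run-1-step-1",
)
print(argv)
```

Placeholders that the template does not use are ignored, so the same set of values can be passed to any template.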
Use agents export when you want downstream EDSL jobs or coding agents to work
from completed browser sessions instead of re-reading raw trace JSON. The export
creates one EDSL Agent per run. Each agent includes the original persona,
study task, outcome, final URL, step-by-step journey, visible text snippets,
actions, thinking, frustration scores, and screenshot references.
uxtest agents export <study-id>
The command writes:
.uxtest/studies/<study-id>/analysis/agent_list.py
.uxtest/studies/<study-id>/analysis/agent_list.manifest.json
Inspect the generated list without launching inference:
python .uxtest/studies/<study-id>/analysis/agent_list.py
Inside the generated script, call build_agent_list() to get an EDSL
AgentList. Screenshot paths are also materialized as EDSL FileStore objects
under each agent's screenshot_files trait, so vision-capable EDSL jobs can
inspect the same screens the browser agent saw.
Use interview when you want EDSL to ask follow-up questions of those rich
trace agents:
uxtest interview <study-id> \
--question "What evidence made the company feel serious or not serious?" \
--question "What would you need before scheduling a demo?"
The command writes:
.uxtest/studies/<study-id>/analysis/agent_interview.py
.uxtest/studies/<study-id>/analysis/agent_interview.manifest.json
Dry-run first:
python .uxtest/studies/<study-id>/analysis/agent_interview.py
Launch remote EDSL inference only when ready:
python .uxtest/studies/<study-id>/analysis/agent_interview.py --launch
This pattern is useful for post-study synthesis questions such as:
- What did each synthetic visitor believe after the first screen?
- Which proof points were actually seen before conversion?
- Which screenshots should be quoted in a narrative report?
- Where do persona groups disagree about credibility, clarity, or risk?
Each browser step can create an EDSL remote job. A study with 4 personas,
2 device variants, and max_steps: 8 may create up to 64 remote decision jobs,
plus model-analysis jobs if enabled.
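The upper bound in that example is a plain product, which is worth computing before launching a large fixture. A minimal sketch, assuming one remote decision job per browser step and the runs_per_persona value used in the fixture examples; the function name is illustrative.

```python
def max_decision_jobs(personas, variants, max_steps, runs_per_persona=1):
    """Upper bound on remote decision jobs: one job per step of every run."""
    return personas * variants * runs_per_persona * max_steps


print(max_decision_jobs(personas=4, variants=2, max_steps=8))  # 64
```

Runs that finish early use fewer steps, so the real count is usually lower. Model-analysis jobs, if enabled, come on top of this bound.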
Remote job progress URLs such as the following are expected:
https://www.expectedparrot.com/home/remote-job-progress/<job_uuid>
To verify that EDSL was used, inspect log.html or grep traces:
uxtest trace <study-id> --edsl-jobs
rg -n '"question_type"|"progress_url"|"results_url"' .uxtest/studies/<study-id>/runs
The EDSL browser decision schema supports:
- click: click a supplied interactive element ref.
- type: fill an input ref.
- select: choose an option on a select ref.
- scroll: scroll down.
- find: find and scroll to text or a heading on the page.
- back: go back.
- wait: wait briefly.
- none: no browser action, often with status: done.
The runner supplies screenshot, visible text, visible interactive elements, visible headings, and recent event history.
If EDSL asks to click static headings such as Research or Bio, the runner
can recover many of those requests into find actions when the text exists.
Treat this as UX evidence: synthetic visitors may be expecting section
navigation.
Run outcomes are not the same as runtime errors:
- done: agent marked the task complete or success criteria were detected.
- gave_up: agent gave up or repeated the same action too many times.
- max_steps: step budget was exhausted.
- error: execution error.
- interrupted: stale or incomplete run recovered by the store.
For exploratory studies, max_steps and gave_up can still contain useful
evidence. Check traces before concluding the site failed.
Technical reports are evidence reports. When asked for a narrative report, read the generated artifacts and write a new stakeholder-facing report.
Read at least:
.uxtest/studies/<study-id>/study.yaml
.uxtest/studies/<study-id>/analysis/scores.json
.uxtest/studies/<study-id>/analysis/findings.json
.uxtest/studies/<study-id>/analysis/log.html
.uxtest/studies/<study-id>/runs/*/meta.json
.uxtest/studies/<study-id>/runs/*/trace.jsonl
For comparison studies, also read:
.uxtest/comparisons/<comparison>.html
Write Markdown first:
.uxtest/studies/<study-id>/analysis/narrative_report.md
Then compile to HTML or PDF if requested.
Use this structure unless the user asks otherwise:
- Title and one-paragraph summary.
- Context: product, page, audience, and reason for study.
- Method: personas, devices, runs, max step budget, Playwright, and EDSL remote inference.
- What happened: user journeys, first clicks, navigation paths, confusion, and stopping points.
- Results: completion outcomes, common failure modes, and strongest findings.
- Main conclusions: 3-6 implications tied to observed behavior.
- Follow-on steps: separate product/site work from tool/model limitations.
- Appendix: study IDs, run IDs, and artifact paths.
Style rules:
- Write for stakeholders, not a test harness.
- Lead with what was learned.
- Distinguish UX friction from runtime errors.
- Distinguish site findings from model/tool limitations.
- Do not claim statistical validity from small synthetic samples.
- Prefer "synthetic visitors", "persona runs", or "agents" when discussing evidence.
- Do not invent visual evidence; use screenshots from traces or findings.
Use eval specs to detect known flaws or expected first-click behavior:
uxtest eval <study-id> \
--expect examples/saas_site/expected_flaws.yaml \
--variant clear \
--policy threshold
Fixtures can include:
expected_flaws: expected_flaws.yaml
eval_policy: threshold
minimum_recovered_expected: 1
Use report_only for exploratory live-site studies where the goal is evidence
collection rather than pass/fail gating.
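How uxtest eval scores a run internally is not spelled out in this guide. The sketch below shows one way to read the two policies: threshold passes only when at least minimum_recovered_expected of the expected flaws were recovered, while report_only never gates. The function, the result shape, and the flaw ids are hypothetical.

```python
def evaluate(expected_flaws, recovered_flaws, policy="report_only",
             minimum_recovered_expected=1):
    """Compare recovered flaw ids against expected ones under a policy.

    Assumed semantics: "threshold" gates on the count of recovered expected
    flaws; "report_only" always passes and only reports.
    """
    recovered = sorted(set(expected_flaws) & set(recovered_flaws))
    if policy == "threshold":
        passed = len(recovered) >= minimum_recovered_expected
    else:
        passed = True
    return {"recovered": recovered, "passed": passed}


# Hypothetical flaw ids, for illustration only.
result = evaluate(
    expected_flaws=["hidden-pricing", "vague-cta"],
    recovered_flaws=["vague-cta", "slow-load"],
    policy="threshold",
    minimum_recovered_expected=1,
)
print(result)  # {'recovered': ['vague-cta'], 'passed': True}
```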
Regenerate reports from existing traces:
uxtest analyze <study-id> --include-interrupted
uxtest uxr <study-id>
Analysis writes:
- findings.json
- scores.json
- report.html
- log.html
- study_plan.md
- uxr_report.html
- human_test_protocol.md
When a study looks wrong:
- Check the command exit code.
- Inspect scores.json outcomes.
- Open log.html and inspect the last step for failed runs.
- Distinguish execution errors from UX friction.
- Check whether the model selected a missing ref or static heading.
- Check whether repeated clicks did not navigate.
- Check whether an external site opened and the task should have ended.
- Narrow the task or reduce max_steps if the agent is wandering.
Opening files on macOS requires a GUI command:
open .uxtest/comparisons/<report>.html
Sandboxed agents may need approval before running open.
Before handing off package changes:
uv run python -m pytest -q tests
uv build
Do not delete .uxtest/studies unless explicitly asked. Those artifacts are
often the evidence the user wants to inspect.