Deterministic browser automation
Purpose
Instantiate `PL3-browser-web`’s three architectural properties — deterministic, inspectable, version-controlled — via browser-automation scripts generated at development time, living with the code, and invocable by both humans and agents. Previously encoded as criterion-level prescription in rubric v0.17 through v0.25 (the level-2 anchor named Playwright dev-time codegen as the mechanism); extracted to a recipe in v0.26 because the prescription was mechanism language, which per canon’s rubric-vs-recipe distinction belongs in recipes rather than in the diagnostic instrument.
Architecture
- Tool choice. Playwright is the primary recommendation as of 2026 — cross-browser, cross-platform, mature codegen inspector, first-class TypeScript support, active maintenance. Alternatives: Cypress (strong for web-app E2E; Chromium-family only, more opinionated), Puppeteer (lower-level, Chromium-only, useful when Playwright’s abstractions are too heavy). The specific tool matters less than the architectural properties it delivers; tool choice is a per-project decision.
- Workflow — record-and-codegen, not hand-written. A developer (or agent-assisted human) records the flow in the tool’s inspector (`npx playwright codegen`), reviews the generated script, edits for clarity and stable selectors, and commits via PR. Hand-writing scripts from scratch is possible but slower and tends to produce brittle selectors.
- Scripts live with code. Repo-level directory (conventionally `e2e/`, `tests/browser/`, or similar); the same PR review and code-owner discipline applies as to any other code. Automation artefacts are inspectable by reading the file; the agent’s intent is visible without running anything.
- Stable selectors. Prefer `data-testid`, ARIA roles, text content, or semantic attributes over CSS class selectors. Class names change with styling refactors; testids are a contract. Codegen should be configured to prefer stable selectors, and generated scripts reviewed with that in mind.
- Read-only / dry-run modes. The automation framework offers a mode where destructive actions (form submits, destructive button clicks, network-mutating POSTs) are skipped while traversal, reads, and screenshots still work. In Playwright this can be built from conditional suppression of mutating calls such as `page.locator().click()`, route mocking, or flow-level flags. Dry-run mode is the agent’s default for investigation; explicit mutation mode is required for side-effecting runs.
- Agent invocation surface. The agent can invoke scripts via shell (`npx playwright test e2e/login.spec.ts --project=chromium`) or via an MCP wrapper that exposes scripts as typed actions. Shell invocation is simpler; MCP wrapping gives the agent structured return values (pass/fail, extracted data, screenshots as artefacts).
- Flow-failure loop. Scripts that fail in CI or scheduled runs flag the flow for regeneration. Level-3 extension: the agent re-records the failed flow against the current UI and proposes a regenerated script via PR. Without this loop, scripts rot as UIs evolve.
- Skill vs. test. Browser automation serves two distinct purposes: tests (pass/fail assertions for CI) and skills (agent actions that return data, e.g. “fetch the current dashboard status”). Separate them structurally — different directories, different return contracts. A test asserts; a skill returns.
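The dry-run posture above can be sketched as a small guard around mutating actions. This is a minimal sketch under assumptions: `mutate`, `skippedActions`, and the `DRY_RUN` environment flag are hypothetical repo conventions, not Playwright built-ins.

```typescript
// Hypothetical dry-run guard: a repo convention, not a Playwright built-in.
// Destructive steps route through mutate(); reads and screenshots bypass it.
type Action = () => Promise<void>;

// Record of mutations skipped in dry-run mode, for a post-run report.
const skippedActions: string[] = [];

async function mutate(label: string, action: Action): Promise<void> {
  if (process.env.DRY_RUN === "1") {
    // Dry run: note what would have happened, perform nothing.
    skippedActions.push(label);
    return;
  }
  await action();
}

// In a spec, a destructive click would be wrapped:
//   await mutate("submit-order", () => page.getByTestId("submit-order").click());
// while pure reads run unconditionally:
//   await expect(page.getByRole("heading", { name: "Orders" })).toBeVisible();
```

The label doubles as the entry in the post-run report, so a reviewer can verify exactly which mutations a dry run withheld.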
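The test-vs-skill split can be made concrete through the return contracts. A sketch with hypothetical names (`DashboardStatus`, `fetchDashboardStatus`, the testids in comments); only the Playwright calls shown in comments are real APIs.

```typescript
// A test asserts: it returns nothing and fails loudly (CI consumes the exit code).
// A skill returns: structured data the agent consumes. Shapes here are hypothetical.

interface DashboardStatus {
  healthy: boolean;
  openIncidents: number;
  capturedAt: string; // ISO timestamp
}

// e2e/skills/dashboard-status.ts — skill contract: always return data, never assert.
// The page read is injected so the parsing logic stays browser-free and testable;
// in a real skill it would be () => page.getByTestId("incident-badge").innerText().
async function fetchDashboardStatus(
  readBadge: () => Promise<string>
): Promise<DashboardStatus> {
  const badgeText = await readBadge();
  const openIncidents = Number.parseInt(badgeText, 10) || 0;
  return {
    healthy: openIncidents === 0,
    openIncidents,
    capturedAt: new Date().toISOString(),
  };
}

// e2e/tests/dashboard.spec.ts — test contract: assert, return nothing.
//   test("dashboard is healthy", async ({ page }) => {
//     await page.goto("/dashboard");
//     await expect(page.getByTestId("incident-badge")).toHaveText("0");
//   });
```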
Criteria advanced
`PL3-browser-web` — primary. Deterministic browser automation via dev-time codegen satisfies all three architectural properties (deterministic, inspectable, version-controlled) that the reframed criterion requires. On its own this recipe moves a project to level-2 on the criterion for the critical flows it covers; level-3 requires the flow-failure loop (agent-initiated regeneration), which is partly a matter of this recipe’s maturity and partly of broader agent-workflow capability.
Prerequisites
`PL3-source-control` ≥ 2 (source-control interaction). Automation artefacts must live in version control and be reviewed via PRs. Without agent-capable source-control operations, the recipe can still run humans-only, but it loses the agent-invocable property.
Failure modes
- Runtime-AI drift into codebase. A contributor adds Browseruse, Stagehand, or another runtime-AI-DOM-parsing library for a quick ad-hoc script; the project now has both deterministic and non-deterministic browser automation, violating the architectural property at the codebase level. Mitigation: team convention + lint/import rule that flags runtime-AI-browser-library imports; explicit policy in the repo README about the deterministic-only posture.
- Stale scripts. UI evolves; scripts break or (worse) silently pass on outdated selectors that happen to still match something. Mitigation: CI runs scripts on every PR and nightly; flow-failure tracking tied to observability; agent-initiated regeneration on failure (level-3 of the criterion); stable-selector preference reduces the rot rate.
- Brittle selectors. CSS-class-based selectors change with every styling refactor; scripts require constant maintenance. Mitigation: `data-testid` attributes on interactive elements as a repo convention; codegen configured to prefer testids; PRs reviewed for selector stability.
- “Read-only” isn’t. Dry-run mode accidentally triggers side effects — analytics pings, telemetry events, form-abandonment tracking, password-manager autofill, third-party SDK initialisation. Mitigation: dry-run tests exercised only against ephemeral or staging environments, never production; route mocking for third-party calls; an explicit list of actions the dry-run mode skips; post-run verification that no side-effect metrics fired.
- Secret leakage via recorded flows. Codegen captures credentials, session tokens, or PII as literals in the generated script. Mitigation: post-recording inspection for secrets; test-fixture credentials stored as environment variables and injected; secret scanners (`PL2-secret-hygiene`) cover the automation directory.
- Script-as-test vs. script-as-skill confusion. A script designed as a CI assertion is invoked by the agent as a skill, returning a pass/fail instead of structured data (and vice versa). Mitigation: different directories (`e2e/tests/` vs. `e2e/skills/`); contract documented in a directory README; the agent tool surface wraps only the skill directory.
- Runtime-environment drift. Scripts pass locally but fail in CI (or vice versa) due to browser versions, viewport sizes, network conditions, timezone, or locale. Mitigation: pin browser versions in the Playwright config; run CI in a container matching the developer setup; test against multiple viewports if the UI adapts.
- Flakiness that erodes trust. Race conditions (SPA lazy-loading, async state), random ordering in test fixtures, network variance. Mitigation: Playwright’s auto-wait helps but isn’t sufficient; flaky tests quarantined (not muted — quarantined until fixed) per `PL2-ui-test-coverage` level-3; no flaky script left in the automation corpus long-term.
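The runtime-AI-drift mitigation can be enforced mechanically with ESLint’s `no-restricted-imports` rule. A sketch: the rule and its `paths`/`name`/`message` options are real ESLint, but the banned package names below are illustrative and should be matched to whatever libraries contributors actually reach for.

```javascript
// eslint.config.js (flat config): bans runtime-AI browser libraries from the codebase.
// Package names are illustrative placeholders, not a vetted list.
const bannedBrowserAI = ["stagehand", "browser-use"].map((name) => ({
  name,
  message:
    "Runtime-AI browser automation violates the deterministic-only posture; " +
    "use the committed Playwright scripts in e2e/ instead.",
}));

const config = [
  {
    files: ["**/*.{js,ts,tsx}"],
    rules: {
      "no-restricted-imports": ["error", { paths: bannedBrowserAI }],
    },
  },
];

module.exports = config;
```

Pair the lint rule with the README policy statement: the rule catches the import mechanically, the README explains why it exists.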
Cost estimate
Medium. First deployment: 3–5 engineer-days for tool setup (Playwright install, config, CI integration), data-testid convention on first critical UI surface, recording and review workflow documented, first 3–5 critical flows captured. Per-flow incremental cost after: 15–60 min (record + clean up + PR review). Ongoing maintenance correlates with UI change velocity — typical teams spend 5–10% of front-end engineering time on automation upkeep; pre-commit + CI pressure keeps that bounded.
Open design questions
- MCP wrapper vs. shell invocation for agent surface. Shell is simpler; MCP gives the agent structured returns (extracted data, screenshot URIs, pass/fail records) and tighter typing. Depends on how the agent consumes the automation — ad-hoc shell calls favour shell; integrated-workflow use favours MCP.
- Rigorous dry-run definition. Route mocking? Fake response fixtures? No-network mode? The stricter the definition, the safer — but also the more maintenance. Where the line sits depends on whether dry-run runs against staging (live network, mock mutations) or against snapshot (fully offline).
- Agent-automated recording. Fully automating the record-codegen-commit loop drifts toward runtime-AI interpretation, violating the architectural property this recipe implements. Human-records-agent-commits is probably the right split; the line needs codifying.
- Relationship to `PL5-spec-first-loop`. For UI-change tasks, recording an acceptance Playwright test before implementing is attractive but often over-engineered (UI spikes are explicitly exempt from spec-first per v0.17). When is a pre-implementation Playwright script justified? Per project and per task type.
- Skill-vs-test boundary enforcement. Directory separation is the minimum; whether to enforce more rigorously (separate CI jobs, separate MCP tool surfaces, separate review discipline) is an implementation choice.
- Visual regression. Screenshot-based visual regression testing (Playwright supports it via `toHaveScreenshot`) is adjacent — does it belong in this recipe or its own? Different failure modes (flaky on font rendering, antialiasing, timing), different workflow (baseline approval vs. assertion). Probably its own recipe when written.
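One possible place to draw the dry-run line: allow live reads against staging while blocking third-party mutation and telemetry hosts. The host filter is plain logic and testable offline; the Playwright wiring appears in comments (`page.route`, `route.abort`, `route.continue` are real APIs). Host names are hypothetical.

```typescript
// Hosts a dry run must never reach (hypothetical third parties; adjust per project).
const BLOCKED_HOSTS = ["analytics.example.com", "telemetry.example.com"];

// Blocks the host itself and any of its subdomains.
function shouldBlock(url: string): boolean {
  const host = new URL(url).hostname;
  return BLOCKED_HOSTS.some((h) => host === h || host.endsWith("." + h));
}

// Playwright wiring inside a spec (not executed here):
//   await page.route("**/*", (route) =>
//     shouldBlock(route.request().url()) ? route.abort() : route.continue());
```

The stricter no-network variant replaces `route.continue()` with fixture-backed `route.fulfill()` responses, at the cost of maintaining those fixtures.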
Related recipes
- Composes with: GitHub Actions scheduler — scheduled UI-regression runs (nightly flow tests, weekly critical-path coverage) use the scheduler as the substrate; aligns with `PL2-ui-test-coverage` level-2 “coverage across critical flows, run daily.”
- Composes with: GitOps JIT privilege elevation — destructive browser actions against production (deleting a record, cancelling a subscription) should route through the elevation gate rather than run directly from automation, even with credentials present.
- Composes with: Bot-token credential tenancy — automation credentials (test-user logins, API keys for mocked third-party services) sit under a service identity, not a human account; rotation discipline applies.
- Alternatives to (and why they don’t fit): Runtime-AI DOM parsing (Browseruse, Stagehand, and the broader category of “LLM looks at the page and decides what to do at runtime”). These can work but fail the architectural properties of `PL3-browser-web`: non-deterministic (LLM output varies), non-inspectable (you can’t read what the agent will do without running it), often not version-controlled (prompt-driven actions don’t live as committed artefacts). Appropriate for truly one-off browser tasks where none of the three properties matter; inappropriate as the default browser-interaction substrate.
- Alternatives to (and why they don’t fit): Manual clicking — a human driving the browser. Doesn’t fail the criterion architecturally (humans are deterministic-ish and inspectable; not version-controlled but manually repeatable), yet it doesn’t advance the agent’s capability at all. The criterion is about agent browser interaction; human-in-the-browser is the level-0 baseline.
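The scheduler composition can be sketched as a GitHub Actions workflow. A minimal sketch under assumptions: the `e2e/tests/` layout from above, standard Playwright CI steps, and placeholder cron time and paths.

```yaml
# .github/workflows/nightly-ui-regression.yml -- sketch; cron and paths are placeholders.
name: nightly-ui-regression
on:
  schedule:
    - cron: "0 3 * * *"   # nightly at 03:00 UTC
  workflow_dispatch: {}   # allow manual runs for debugging
jobs:
  flows:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npx playwright test e2e/tests --project=chromium
      - uses: actions/upload-artifact@v4
        if: failure()     # keep traces/screenshots to feed the flow-failure loop
        with:
          name: playwright-report
          path: playwright-report/
```

Uploading the report only on failure keeps artefact storage bounded while still giving the regeneration loop something to work from.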