2. Validation · Safe Agentic

Safe Agentic A working canon · v0.27

§ criteria

Hard validation gates

violations of deterministic code-quality rules (lint, typecheck, format; e.g. Biome, ESLint, SwiftLint) cannot reach the protected branch undetected. Enforcement happens at multiple checkpoints with the same ruleset across layers, so no surprise surfaces between them, and bypassing an earlier checkpoint is caught by a later one

current · target

Test location discipline and coverage

a single test-location convention applies repo-wide (colocated with source, parallel-tree, or other), consistently applied so tests are predictably findable for humans and agents; coverage is enforced via global target *and* per-PR differential threshold

current · target

Test quality verification

tests are verified to actually catch bugs, not just exercise lines. Mutation testing (Stryker for TS, Muter for Swift) or equivalent mechanism confirms that tests assert behaviour, not merely execution

current · target

UI test coverage on mobile / frontend

current · target

SAST / DAST present

static and dynamic application security testing with agent-actionable findings; findings, suppressions, and rule disables carry accountability (rationale, named reviewer, expiry where applicable). Tool choice is project-dependent (e.g. Aikido, SonarQube for compliance cases); the concern is coverage across both testing classes, not a specific vendor

current · target

Aikido blocks new leaks *and* historical secrets are rotated / cleaned

current · target

External PR review

human glance *or* multi-model review (e.g. Claude Code Reviews $25/PR council-of-experts model)

current · target

Qualitative taste validation

humans (not agents) test for taste / UX / defect discovery, not just correctness. Delivered either via a dedicated usability research service (Netizen Experience, UserTesting, Lookback, Maze, PlaybookUX) or via your own user base through canary releases and A/B tests. Catches what automated tests can't: is this confusing? Does this feel wrong? Would a real user hit this edge case?

current · target

Agent action audit trail

every agent decision is logged with reasoning, retrievable, and reversible at granular level (catches *quiet drift* — subtle wrong actions humans don't notice for weeks)

current · target

Load / stress testing

capability exists *and* is actually exercised. Least frequent of validation types — typically weekly or on major releases

current · target