Pillar 02 · Validation
Hard, deterministic rules catch drift.
Hard, deterministic rules that catch non-deterministic output.
Pillar at a glance
Criteria 10
Realistic target 2.0
Current maturity
Recipes available 10
§ criteria
2.1
Hard validation gates
violations of deterministic code-quality rules (lint, typecheck, format; e.g. Biome, ESLint, SwiftLint) cannot reach the protected branch undetected. Enforcement happens at multiple checkpoints with the same ruleset across layers, so no surprise surfaces between them, and bypassing an earlier checkpoint is caught by a later one
current · target
→
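For criterion 2.1, one way to check "the same ruleset across layers" is to compare what each checkpoint actually runs. The sketch below is a minimal illustration under assumptions: the checkpoint names and check names are hypothetical, and in a real repo they would be read from hook and CI configuration rather than hard-coded.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Checkpoint:
    """One enforcement layer and the checks it runs (names are hypothetical)."""
    name: str
    checks: FrozenSet[str]


def ruleset_drift(checkpoints: List[Checkpoint]) -> Dict[str, List[str]]:
    """Return, per checkpoint, the checks it is missing relative to the union.

    An empty dict means every layer enforces the same set of checks.
    """
    all_checks = frozenset().union(*(c.checks for c in checkpoints))
    drift = {}
    for c in checkpoints:
        missing = sorted(all_checks - c.checks)
        if missing:
            drift[c.name] = missing
    return drift


if __name__ == "__main__":
    layers = [
        Checkpoint("pre-commit", frozenset({"lint", "format"})),
        Checkpoint("ci", frozenset({"lint", "format", "typecheck"})),
    ]
    print(ruleset_drift(layers))
```

A non-empty result points at a layer where a violation could pass locally and only surface later, which is the surprise the criterion is meant to rule out.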
2.2
Test location discipline and coverage
a single test-location convention (colocated with source, parallel-tree, or other) is applied consistently repo-wide, so tests are predictably findable for humans and agents; coverage is enforced via a global target *and* a per-PR differential threshold
current · target
→
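For criterion 2.2, the two coverage checks can be sketched as a single gate function. This is an illustration only: the function name, its inputs, and the 80% / 90% thresholds are assumptions, not values taken from this pillar.

```python
from typing import List


def coverage_gate(
    total_covered: int,
    total_lines: int,
    changed_covered: int,
    changed_lines: int,
    global_target: float = 0.80,   # assumed value, tune per project
    diff_threshold: float = 0.90,  # assumed value, tune per project
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []

    # Global target: coverage of the whole codebase.
    global_cov = total_covered / total_lines if total_lines else 1.0
    if global_cov < global_target:
        failures.append(f"global coverage {global_cov:.1%} below {global_target:.0%}")

    # Differential threshold: coverage of lines touched by this PR only.
    # A PR that changes no measurable lines is not penalised.
    if changed_lines:
        diff_cov = changed_covered / changed_lines
        if diff_cov < diff_threshold:
            failures.append(f"diff coverage {diff_cov:.1%} below {diff_threshold:.0%}")

    return failures
```

Checking both numbers matters because a large, well-covered codebase can absorb an untested PR without moving the global figure.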
2.3
Test quality verification
tests are verified to actually catch bugs, not just exercise lines. Mutation testing (Stryker for TS, Muter for Swift) or equivalent mechanism confirms that tests assert behaviour, not merely execution
current · target
→
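For criterion 2.3, the difference between executing a line and asserting its behaviour can be shown with a single hand-written mutant. This is a toy illustration of the idea, not how Stryker or Muter operate internally (those tools generate and run mutants automatically); every name below is hypothetical.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    # Mutation: ">=" replaced by ">". Behaviour differs only at age == 18.
    return age > 18


def weak_suite(fn: Callable[[int], bool]) -> bool:
    """Executes every line of fn but never probes the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn: Callable[[int], bool]) -> bool:
    """Asserts the behaviour at the boundary itself."""
    return fn(18) is True and fn(17) is False


def kills(suite: Callable, original: Callable, mutant: Callable) -> bool:
    """A suite kills a mutant if it passes on the original and fails on the mutant."""
    return suite(original) and not suite(mutant)


if __name__ == "__main__":
    print("weak suite kills mutant:", kills(weak_suite, is_adult, is_adult_mutant))
    print("strong suite kills mutant:", kills(strong_suite, is_adult, is_adult_mutant))
```

Both suites give full line coverage of `is_adult`; only the one that checks the boundary detects the changed operator.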
2.4
UI test coverage on mobile / frontend
current · target
→
2.5
SAST / DAST present
static and dynamic application security testing with agent-actionable findings; findings, suppressions, and rule disables carry accountability (rationale, named reviewer, expiry where applicable). Tool choice is project-dependent (e.g. Aikido, SonarQube for compliance cases); the concern is coverage across both testing classes, not a specific vendor
current · target
→
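For criterion 2.5, accountability on suppressions can be enforced mechanically. The sketch below assumes a hypothetical comment convention, `suppress: RULE reason="..." reviewer=@name expires=YYYY-MM-DD`; this format is invented for illustration and is not the syntax of Aikido, SonarQube, or any other tool. Expiry is treated as optional, matching "expiry where applicable".

```python
import re
from datetime import date
from typing import List

# Hypothetical suppression comment format, invented for this example.
SUPPRESS = re.compile(r"suppress:\s*(?P<rule>\S+)(?P<meta>.*)")
FIELD = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def audit_suppression(line: str, today: date) -> List[str]:
    """Return accountability problems for one source line ([] if none)."""
    m = SUPPRESS.search(line)
    if not m:
        return []  # not a suppression comment

    fields = {}
    for key, raw, quoted in FIELD.findall(m.group("meta")):
        fields[key] = quoted if raw.startswith('"') else raw

    # Rationale and a named reviewer are always required.
    problems = [f"missing {name}" for name in ("reason", "reviewer") if not fields.get(name)]

    # Expiry is optional, but if present it must be valid and in the future.
    expires = fields.get("expires")
    if expires:
        try:
            if date.fromisoformat(expires) < today:
                problems.append("expired")
        except ValueError:
            problems.append("invalid expires date")
    return problems
```

Run over every changed line in a PR, a check like this turns an unexplained rule disable into a visible failure rather than a silent exception.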
2.6
Secret hygiene
a secret scanner (e.g. Aikido) blocks new leaks *and* historical secrets are rotated and cleaned up
current · target
→
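For criterion 2.6, the "blocks new leaks" half can be illustrated by scanning only the lines a diff adds. This is a deliberately small sketch with two well-known patterns; real scanners use far larger detector sets, and nothing here reflects how Aikido itself is implemented. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Two widely recognised secret shapes; a real scanner has many more detectors.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
}


def scan_added_lines(diff_text: str) -> List[Tuple[int, str]]:
    """Return (line number within the diff, pattern name) for each added-line match."""
    findings = []
    for n, line in enumerate(diff_text.splitlines(), start=1):
        # Only lines added by the change; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((n, name))
    return findings
```

This covers new leaks only. A secret that already reached history stays compromised until it is rotated, which is why the criterion names both halves.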
2.7
External PR review
a human glance *or* a multi-model review (e.g. Claude Code Reviews, a council-of-experts model at $25/PR)
current · target
→
2.8
Qualitative taste validation
humans (not agents) test for taste / UX / defect discovery, not just correctness. Delivered either via a dedicated usability research service (Netizen Experience, UserTesting, Lookback, Maze, PlaybookUX) or via your own user base through canary releases and A/B tests. Catches what automated tests can't: is this confusing? Does this feel wrong? Would a real user hit this edge case?
current · target
→
2.9
Agent action audit trail
every agent decision is logged with its reasoning, and is retrievable and reversible at a granular level (catches *quiet drift*: subtle wrong actions humans don't notice for weeks)
current · target
→
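For criterion 2.9, the three properties (logged with reasoning, retrievable, reversible per action) map onto a small data structure. The sketch below is in-memory and hypothetical throughout; a real trail would need durable, tamper-resistant storage, which is out of scope here.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AuditEntry:
    entry_id: int
    action: str                 # what the agent did
    reasoning: str              # why it did it
    undo: Callable[[], None]    # how to reverse exactly this action
    reverted: bool = False


class AuditTrail:
    """In-memory audit trail: one entry per agent action, individually reversible."""

    def __init__(self) -> None:
        self._entries: Dict[int, AuditEntry] = {}

    def record(self, action: str, reasoning: str, undo: Callable[[], None]) -> int:
        entry_id = len(self._entries) + 1
        self._entries[entry_id] = AuditEntry(entry_id, action, reasoning, undo)
        return entry_id

    def get(self, entry_id: int) -> AuditEntry:
        return self._entries[entry_id]

    def revert(self, entry_id: int) -> None:
        """Undo a single action; reverting twice is a no-op."""
        entry = self._entries[entry_id]
        if not entry.reverted:
            entry.undo()
            entry.reverted = True


if __name__ == "__main__":
    config = {"timeout": 30}
    trail = AuditTrail()
    config["timeout"] = 60
    eid = trail.record("set timeout=60", "requests were timing out",
                       lambda: config.update(timeout=30))
    trail.revert(eid)
    print(config, trail.get(eid).reverted)
```

Storing the undo alongside the reasoning is what makes reversal granular: one wrong action can be rolled back weeks later without unwinding everything recorded after it.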
2.10
Load / stress testing
capability exists *and* is actually exercised. This is the least frequent of the validation types, typically run weekly or on major releases
current · target
→
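For criterion 2.10, "capability exists" can start as small as a script that fires concurrent calls and reports latency percentiles. The sketch below uses only the Python standard library; `run_load`, its defaults, and the stand-in target are assumptions for illustration, and a dedicated load-testing tool would replace it in practice.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable, Dict


def run_load(target: Callable[[], None], requests: int = 200, concurrency: int = 20) -> Dict[str, float]:
    """Call target() `requests` times across `concurrency` threads and summarise latency."""

    def timed(_: int) -> float:
        start = time.perf_counter()
        target()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(requests)))

    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(latencies, n=100, method="inclusive")
    return {
        "count": len(latencies),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": latencies[-1],
    }


if __name__ == "__main__":
    # Stand-in for a real request; replace with a call to the system under test.
    print(run_load(lambda: time.sleep(0.005), requests=100, concurrency=10))
```

Scheduling a run like this on a fixed cadence and recording the percentiles is what shows the capability is exercised rather than merely available.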