Agentic Engineering Rubric
The rubric's pillar bodies — the criteria themselves — each live on their own page so they can be browsed without one scroll to rule them all:
- 1. Focus — Narrow the agent’s world to what matters; what remains is the right context.
- 2. Validation — Hard, deterministic rules that catch non-deterministic output.
- 3. Actions — The agent’s ability to act externally in the real world.
- 4. Safe Space — Blast-radius containment, so "going wrong" has bounded cost.
- 5. Workflow — The meta-layer that ties 1–4 together, including periodic and proactive loops.
Version: 0.27
Date: 19 April 2026
Source: Synthesised from Agentic Engineering discussion (Granola, 16 Apr 2026)
Purpose: Score any codebase / project on its readiness for agentic software engineering. The goal is to move from “AI can write code” to “AI can safely, reliably, and continuously ship production value with minimal human shepherding.”
Philosophy
The gap between “AI can do the task” and “AI actually ships production value with trust” is not a model-capability problem — it is an engineering environment problem. Agentic engineering is the discipline of constructing that environment.
Five pillars, working together:
- Focus — narrow the agent’s world to what matters; what remains is the right context. Focus excludes, context includes — the two are inseparable.
- Validation — hard, deterministic rules that catch non-deterministic output
- Actions — the agent’s ability to act externally in the real world
- Safe Space — blast-radius containment, so “going wrong” has bounded cost
- Workflow — the meta-layer that ties 1–4 together, including periodic and proactive loops
Guiding principle: What is good for humans is good for the AI. A tidy, well-instrumented, well-guarded codebase scores well on this rubric whether the next contributor is a senior engineer or an agent. “Agentic engineering” is arguably just engineering done properly.
Corollary: Structural enforcement over procedural gating. Where a concern can be enforced by a mechanism — IaC for infrastructure, branch protection for source control, credential tenancy for identity, policy-as-code for IAM — the rubric scores the mechanism. Humans as judges of a mechanism’s correctness remain load-bearing; humans as executors of a procedural step are flagged as sub-level-2. Agents stress-test at scale and speed what would have broken under human load too.
Scope and boundary
The rubric is focused on engineering environments — it scores the readiness of a codebase to host agentic work. It is not a compliance framework; it does not replace organisational governance, formal attestation standards (SOC 2, ISO 27001, NIST), or third-party-risk programs. Where the rubric’s concerns coincide with those frameworks — most often in access control, change management, monitoring, availability — applying the rubric naturally builds toward compliance readiness on the overlapping dimensions. Where the rubric has not yet addressed a compliance-adjacent concern, open questions about which concerns to absorb, refine, or leave to complementary instruments are captured in the Open Questions section. Coexistence with compliance frameworks is the current stance; convergence is neither the goal nor foreclosed.
The rubric holds one additional boundary explicitly: engineering does not need PII. PII lives on the production-data side of the boundary between production and engineering systems. Logs, memory, caches, git history, CI artefacts, and agent tool surfaces are PII-free by design, not by layered masking. The criteria that implement this — `PL4-pii-masking`, `PL4-memory-safety`, `PL4-prompt-injection-defence`, and the ingestion discipline in `PL1-real-world-feedback` and `PL3-emission-quality` — realise a single bright line, not parallel defences.
Scoring
Each criterion is scored 0 / 1 / 2 / 3:
| Score | Anchor | Meaning |
|---|---|---|
| 0 | Absent | Not in place, or in name only |
| 1 | Present | Exists but with meaningful gaps, inconsistent coverage, or high friction |
| 2 | Effective | Consistently in place, low friction, agent-usable |
| 3 | Compounding | Improves with use — outcomes are captured, fed back, and demonstrably make the criterion cheaper or better over time |
How to read the scale
- 2 is the realistic operational target for most criteria. A project that hits 2 on every line is a well-engineered codebase.
- 3 is the bar for criteria where compounding is structurally possible and high-leverage. Reaching 3 requires building learning infrastructure: instrumentation, retrieval, hygiene, decay protocols.
- Some criteria are tagged `(max 2)` — compounding isn’t structurally meaningful (e.g. lint either passes or it doesn’t). For these, 2 is the ceiling; the rubric doesn’t penalise the absence of a “3.”
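The scale mechanics can be sketched as a small data model. This is an illustrative sketch only: the `(max 2)` assignment on the lint-gate criterion below is an assumption for the example, not the rubric's canonical cap list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    slug: str           # stable PL<pillar>-<slug> reference
    max_score: int = 3  # 2 for "(max 2)" criteria where compounding isn't structurally meaningful


def validate(criterion: Criterion, score: int) -> int:
    """Reject scores outside the 0..max range for this criterion."""
    if not 0 <= score <= criterion.max_score:
        raise ValueError(f"{criterion.slug}: score {score} outside 0..{criterion.max_score}")
    return score


# Hypothetical cap assignments, for illustration only:
memory = Criterion("PL3-memory-substrate")                        # compounding-eligible, max 3
lint_gate = Criterion("PL2-hard-validation-gates", max_score=2)   # assumed (max 2) here
```

Modelling the cap on the criterion, rather than in the scorer, keeps the "no penalty for a missing 3" rule in one place.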
Why this matters
This single-scale design embeds memory and learning into every criterion rather than treating them as a separate concern. A codebase that scores 2s everywhere has capability; a codebase that scores 3s has a compounding system — one that gets cheaper to operate the longer it runs. The gap between the two is the gap between “AI-assisted engineering” and “agentic engineering.”
What scoring requires
Scoring a project needs more than codebase access. The rubric assumes the project’s Actions pillar already provides agent-readable operational access — structured state (`PL3-structured-state-read`), observability (`PL3-emission-quality` / `PL3-agent-queryability`), source control metadata (`PL3-source-control`), CI/deploy results (`PL3-deployment-cicd`). That same access is what makes scoring feasible: a scorer queries the agent’s own read surfaces rather than chasing dashboards by hand. A project with weak Actions is simultaneously harder to use and harder to audit.
For criteria that can’t be fully scored from agent-readable sources alone (e.g. `PL2-taste-validation` human taste validation, `PL2-secret-hygiene` secret rotation confirmation), expect to supplement with brief process interviews.
Maximum total: 146 points. (Calculated below in the Scoring Summary.)
A project should aim to reach the maximum on at least one flagship codebase before attempting to scale the methodology across the portfolio.
Meta-Metrics
Beyond the rubric score, track four operational signals. The rubric measures capability and compounding; these measure whether the loop actually runs and improves.
Glance Threshold — median time to approve a PR
- > 15 min — something upstream failed (planning, actions, or validation)
- 5–15 min — acceptable, but PR is doing too much
- < 5 min — target state: PR is glanceable because trust has compounded
If you have to read a PR for an hour, you might as well have written it yourself.
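The bands above reduce to a trivial classifier. A sketch, with the band labels invented here for illustration:

```python
def glance_status(median_minutes: float) -> str:
    """Map median PR approval time to the Glance Threshold bands."""
    if median_minutes < 5:
        return "target"            # trust has compounded; the PR is glanceable
    if median_minutes <= 15:
        return "acceptable"        # but the PR is doing too much
    return "upstream failure"      # planning, actions, or validation failed
```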
Cost per merged PR
- All-in cost (agent inference + CI minutes + canary infrastructure + log retention) divided by merged PRs in the period
- Tracks whether agentic engineering is actually cheaper than the alternative
- A high rubric score with runaway cost-per-PR means the rubric is being gamed
Signal-to-deploy time — median hours from user signal received to fix deployed
- User signal = review, support ticket, production alert, meeting note, canary metric breach
- Captures whether the full loop (`PL1-real-world-feedback` → `PL5-signal-driven-tasks` → `PL5-outcome-input-loop` → release) actually closes
- This is the metric that proves the “month-long holiday and the app has grown 30 features” vision is real, not aspirational
Compounding Index — fraction of compounding-eligible criteria scored at 3
- Numerator: criteria scored at 3
- Denominator: criteria where 3 is structurally achievable (i.e. excluding `(max 2)` criteria) — currently 46 of 50
- Tracks whether the project is building learning infrastructure or just static capability
- A high Compounding Index is the rubric’s strongest signal that agentic engineering is actually compounding, not just present
- Target: > 0.3 within 12 months of starting; > 0.6 indicates a mature compounding system
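As a sketch, the Compounding Index is a one-line ratio over scored criteria. The slug-to-score mapping shape is assumed for illustration; only the numerator/denominator rule comes from the text above.

```python
def compounding_index(scores: dict[str, int], max_scores: dict[str, int]) -> float:
    """Fraction of compounding-eligible criteria (those with max 3) currently scored at 3."""
    eligible = [slug for slug, mx in max_scores.items() if mx == 3]
    if not eligible:
        return 0.0
    return sum(1 for slug in eligible if scores.get(slug) == 3) / len(eligible)
```

A project at the full-rubric scale would pass 50 entries, 46 of them eligible; the toy inputs in a test can be much smaller.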
Scoring Summary
| Pillar | Criteria | Max |
|---|---|---|
| 1. Focus | 10 | 30 |
| 2. Validation | 10 | 28 |
| 3. Actions | 10 | 30 |
| 4. Safe Space | 10 | 28 |
| 5. Workflow | 10 | 30 |
| Totals | 50 | 146 |
Per-pillar reporting is mandatory
A single total hides where a project is weak. Always report the per-pillar breakdown alongside the total:
Example:
(P1: 20/30, P2: 20/28, P3: 20/30, P4: 20/28, P5: 24/30) → Total 104/146
Pillar weaknesses become visible by inspection; teams can’t hide a P4 problem inside a flattering total.
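The reporting format is mechanical enough to generate. A sketch using the pillar maxima from the Scoring Summary (the function name and input shape are assumptions for illustration):

```python
PILLAR_MAX = {"P1": 30, "P2": 28, "P3": 30, "P4": 28, "P5": 30}


def breakdown(scores: dict[str, int]) -> str:
    """Render the mandatory per-pillar breakdown string alongside the total."""
    parts = ", ".join(f"{p}: {scores[p]}/{PILLAR_MAX[p]}" for p in PILLAR_MAX)
    return f"({parts}) → Total {sum(scores.values())}/{sum(PILLAR_MAX.values())}"
```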
The Risk Floor rule
Pillars 2 (Validation) and 4 (Safe Space) are categorically different from the others — failures here are catastrophic (data loss, prod incidents, PII leaks), not merely slow. So the project’s reported maturity level is capped by its weakest score on P2 or P4:
| P2 or P4 score | Reported maturity ceiling |
|---|---|
| < 50% of pillar max | Exposed — high risk debt; no maturity claim possible regardless of total |
| 50–75% of pillar max | Capable — total score is meaningful but qualified |
| ≥ 75% of pillar max | Mature — total score stands on its own |
Categorical, not arithmetic. A project at 118/146 with P4 at 12/28 (43%) is Exposed, not “almost mature” — regardless of total.
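The Risk Floor classification can be expressed directly from the table. A sketch (thresholds and labels taken from the table above; input shape assumed):

```python
PILLAR_MAX = {"P1": 30, "P2": 28, "P3": 30, "P4": 28, "P5": 30}


def risk_floor(pillar_scores: dict[str, int]) -> str:
    """Maturity ceiling set by the weaker of P2 (Validation) and P4 (Safe Space)."""
    ratio = min(pillar_scores[p] / PILLAR_MAX[p] for p in ("P2", "P4"))
    if ratio < 0.50:
        return "Exposed"   # no maturity claim possible regardless of total
    if ratio < 0.75:
        return "Capable"   # total score is meaningful but qualified
    return "Mature"        # total score stands on its own
```

Running the worked example from the text, P4 at 12/28 (~43%) classifies as Exposed even with every other pillar at maximum.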
Application Workflow
- Baseline — score the target project across all 50 criteria. Report the per-pillar breakdown (P1/30, P2/28, P3/30, P4/28, P5/30), the total (/146), the Risk Floor classification, and the Compounding Index. Most real codebases baseline in the 36–62 / 146 range, with Compounding Index near 0.
- Prioritise — if Risk Floor is Exposed, fix P2/P4 gaps first, full stop. Otherwise, pick the three lowest-scoring criteria across the rubric. For criteria already at 2, pursuing 3 is a separate strategic choice — pick 2–3 high-leverage ones per quarter rather than spreading thin.
- Sprint — dedicate an infrastructure sprint to closing those gaps.
- Re-score — quarterly cadence. Track the four meta-metrics alongside the rubric score.
- Extract — whenever a gap is closed with a reusable mechanism (e.g. `pg_columnmask` setup, dynamic logging skill, agent action audit middleware, memory substrate), extract it into a shared library so the next project starts at a higher baseline. This is the compounding effect at portfolio level.
Candidate pilot projects
- SinarAI / Surge — platform-side scoring. Active, production, known gaps around load testing and virtual-charger actions. Scores the agentic-readiness of the platform itself, independent of any client deployment.
- Gentari (CEP Phase 2 / OCPI roaming hub) — client-deployment scoring of the same codebase under GMOB-specific constraints: IP boundaries (SinarAI vs. GMOB work product), ECSGF cybersecurity compliance, PDPA / PII requirements on Aurora PostgreSQL, Cloudflare WAF for OCPI polling. Divergence between this score and the SinarAI / Surge score is itself diagnostic — it tells you which gaps are platform-level vs. client-context-level.
- Tumpang — currently frozen; rebuilding would be a clean test of whether a high-scoring rubric allows “any new project” to be picked up cheaply.
Open Questions
- Observability placement. (Resolved in v0.2.) Split across Pillar 3 / Actions (`PL3-emission-quality`, `PL3-agent-queryability`) and Pillar 4 / Safe Space (`PL4-dynamic-debug-logging` dynamic / cost-governed logging, PII-safe telemetry folded into `PL4-pii-masking`). Revisit after first scoring pass to check whether emission vs. queryability scores are too correlated to justify separation.
- Learning / Memory placement. (Resolved in v0.5.) Embedded into the scoring scale itself — every criterion has a level-3 “compounding” descriptor that captures the learning dimension. Two dedicated infrastructure criteria: `PL3-memory-substrate` and `PL4-memory-safety` (hygiene, access control, and write-path validation; merged in v0.17 from the former 4.9 + 4.10). Compounding Index added as fourth meta-metric. Revisit after first scoring pass to check whether maturing criteria from 2→3 is too uniformly hard, suggesting need for substrate investment as a prerequisite.
- Agent deployment safety (`PL3-deployment-cicd` / `PL4-dynamic-debug-logging` intersection). (Resolved in v0.17.) Closed by the `PL4-release-strategy` refinement — level-2 now requires platform-enforced constraints on the agent’s deployment actions (parameter caps, immutable pipeline stages, metric-gated promotion). `PL3-deployment-cicd`’s deployment capability is qualified by `PL4-release-strategy`’s enforcement.
- Prompt injection / jailbreak surface at ingestion boundaries. (Resolved in v0.17; scope clarified in v0.19.) Closed by new `PL4-prompt-injection-defence`, which requires a unified sanitisation layer applied consistently across all ingestion surfaces (`PL1-real-world-feedback` feedback loop, `PL5-signal-driven-tasks` signal-driven task generation, `PL4-memory-safety` memory write-path). v0.19 narrowed scope to persistent agent context — interactive ingestion in user-supervised sessions is out of scope for `PL4-prompt-injection-defence`; blast radius there is contained by Pillar 4 substrate. Recipe realisation: Ingestion as PR.
- Taste / UX validation. Sitting in Pillar 2 as `PL2-taste-validation`, but it’s the only qualitative criterion in a quantitative pillar. May deserve its own mini-pillar if it proves load-bearing.
- Brownfield-readiness scoring. The Tumpang test (“can a high-scoring rubric revive a frozen project cheaply?”) is currently not operationalised — the rubric scores current state, not gap-to-viable. Worth considering for v0.6.
- Does this rubric generalise beyond software? The five-pillar framing (focus / validation / actions / safe space / workflow) may apply to agentic product management, agentic HR, etc. The 0-1-2-3 scale with embedded compounding generalises particularly well — every domain has the same “set up vs. learning over time” dimension. Worth testing in the HELP University conversation next week.
- Are some level-3 descriptors aspirational rather than actionable? Some level-3 anchors (e.g. `PL5-multi-agent-delegation` “underperforming roles auto-flagged for prompt / skill refinement”) describe outcomes that may require significant infrastructure to even measure. After first scoring pass, audit which level-3 descriptors are practically achievable vs. which need refinement.
- Does IaC for infrastructure-writ-large warrant a standalone criterion? (Opened in v0.27.) The Corollary’s structural-enforcement principle currently routes IaC concerns into existing anchors: `PL3-deployment-cicd` (deploy-target infra), `PL4-least-privilege` (IAM policy-as-code), `PL5-cicd-pipeline-health` (CI config as code). Whether this routing is sufficient — or whether IaC has its own maturity curve (declaration → drift detection → reconciliation → policy-as-code composition) that deserves a standalone criterion — should be revisited after a few projects have scored against the v0.27 refinements. A standalone criterion would most likely land under Pillar 4 as `PL4-infra-as-code`; purpose-tight criteria discipline (see also rubric-vs-recipe reframings in v0.24–v0.26) argues against premature extraction.
Criteria index
Stable slug references for every criterion. Downstream artefacts (recipes, case studies, integrations, reviews) should reference criteria by slug rather than by current number — numbers reshuffle when criteria are added, merged, or renumbered; slugs survive. Format: `PL<pillar>-<semantic-slug>`.
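A slug lint for downstream artefacts can be sketched as a regular expression. The pattern is inferred from the index below (pillar digit 1–5, lowercase hyphenated words), not a published grammar:

```python
import re

# Inferred shape: "PL" + pillar digit 1-5 + "-" + lowercase hyphen-separated words
SLUG_RE = re.compile(r"PL[1-5]-[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(ref: str) -> bool:
    """Check a criterion reference against the inferred PL<pillar>-<semantic-slug> shape."""
    return SLUG_RE.fullmatch(ref) is not None
```

Such a check in recipe-frontmatter CI would catch the `"4.5"`-style numeric IDs that motivated the slug scheme.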
Pillar 1 — Focus
| Slug | # | Criterion |
|---|---|---|
| `PL1-corpus-taxonomy` | 1.1 | Corpus taxonomy, filing, indexing |
| `PL1-codebase-scoping` | 1.2 | Codebase-aware scoping |
| `PL1-task-decomposition` | 1.3 | Task decomposition |
| `PL1-tech-research` | 1.4 | Tech research precedes implementation |
| `PL1-primary-source-access` | 1.5 | Primary source access |
| `PL1-decision-records` | 1.6 | Decision records (ADRs) |
| `PL1-design-intent` | 1.7 | Design intent accessible |
| `PL1-documentation-loop` | 1.8 | Documentation loop — operational and product docs |
| `PL1-stakeholder-context` | 1.9 | Client / stakeholder context |
| `PL1-real-world-feedback` | 1.10 | Real-world feedback loop |
Pillar 2 — Validation
| Slug | # | Criterion |
|---|---|---|
| `PL2-hard-validation-gates` | 2.1 | Hard validation gates |
| `PL2-test-colocation-coverage` | 2.2 | Test colocation and coverage |
| `PL2-test-quality` | 2.3 | Test quality verification |
| `PL2-ui-test-coverage` | 2.4 | UI test coverage on mobile / frontend |
| `PL2-sast-dast` | 2.5 | SAST / DAST present |
| `PL2-secret-hygiene` | 2.6 | Secret hygiene |
| `PL2-external-pr-review` | 2.7 | External PR review |
| `PL2-taste-validation` | 2.8 | Qualitative taste validation |
| `PL2-agent-audit-trail` | 2.9 | Agent action audit trail |
| `PL2-load-stress-testing` | 2.10 | Load / stress testing |
Pillar 3 — Actions
| Slug | # | Criterion |
|---|---|---|
| `PL3-structured-state-read` | 3.1 | Structured state read access |
| `PL3-emission-quality` | 3.2 | Emission quality |
| `PL3-agent-queryability` | 3.3 | Agent queryability |
| `PL3-memory-substrate` | 3.4 | Memory substrate exists |
| `PL3-source-control` | 3.5 | Source control interaction |
| `PL3-domain-action-skills` | 3.6 | Domain-specific action skills |
| `PL3-deployment-cicd` | 3.7 | Deployment and CI/CD interaction |
| `PL3-browser-web` | 3.8 | Browser / web interaction |
| `PL3-communication` | 3.9 | Communication actions |
| `PL3-skill-library-health` | 3.10 | Skill library health |
Pillar 4 — Safe Space
| Slug | # | Criterion |
|---|---|---|
| `PL4-environment-isolation` | 4.1 | Environment isolation |
| `PL4-least-privilege` | 4.2 | IAM scoped read-only by default |
| `PL4-branch-protection` | 4.3 | Branch protection and source-control write scoping |
| `PL4-pii-masking` | 4.4 | PII masking at data-access and telemetry layers |
| `PL4-prompt-injection-defence` | 4.5 | Prompt injection defence at ingestion boundary |
| `PL4-egress-capability-scoping` | 4.6 | Egress capability scoping at emission boundary |
| `PL4-release-strategy` | 4.7 | Canary / blue-green / partial release |
| `PL4-agent-invokable-rollback` | 4.8 | Rollback is trivial and agent-invokable |
| `PL4-cost-governance` | 4.9 | Operating cost observable, capped, and attributed |
| `PL4-memory-safety` | 4.10 | Memory safety |
Pillar 5 — Workflow
| Slug | # | Criterion |
|---|---|---|
| `PL5-pipeline-reliability` | 5.1 | Pipeline reliability |
| `PL5-cicd-pipeline-health` | 5.2 | CI/CD pipeline health |
| `PL5-change-sets` | 5.3 | Change sets / release management |
| `PL5-multi-agent-delegation` | 5.4 | Multi-agent delegation |
| `PL5-spec-first-loop` | 5.5 | Spec-first agent loop |
| `PL5-pr-reviewability` | 5.6 | PR reviewability |
| `PL5-signal-driven-tasks` | 5.7 | Signal-driven task generation |
| `PL5-outcome-input-loop` | 5.8 | Outcome → input loop |
| `PL5-experiment-tracking` | 5.9 | Experiment tracking |
| `PL5-portfolio-skill-reuse` | 5.10 | Reusable skills extracted across projects |
Slugs introduced v0.20. Historical changelog entries keep numeric references; new entries, new cross-references, and all downstream artefacts use slugs.
Changelog
Pre-release. One-line entries for minor changes; 2–3 lines for structural ones. Cosmetic edits and wording-only fixes not logged.
- v0.27 — Structural-enforcement-over-procedural-gating principle codified; anchors tightened on `PL4-least-privilege` and `PL3-deployment-cicd`. Surfaced by applying the rubric to the canon-web-pass-1 feature’s blockers (CF Pages project creation, domain attachment, Web Analytics enablement) — all “a human clicks the right button” steps that the rubric currently permits at level-2 despite being procedural-not-structural. No new criteria added; no renumbering; PL3 / PL4 / rubric totals unchanged.
  - Philosophy — new Corollary added after the existing guiding principle: Structural enforcement over procedural gating. Where a concern can be enforced by a mechanism (IaC, branch protection, credential tenancy, policy-as-code), the rubric scores the mechanism. Humans as judges of a mechanism’s correctness remain load-bearing; humans as executors of a procedural step are flagged as sub-level-2. Extension of “What is good for humans is good for the AI” — agents stress-test at scale and speed what would have broken under human load too.
  - `PL4-least-privilege` — level-2 anchor tightened. Previous “write requires explicit elevation” permitted a human-ticketed approval that then executed with unscoped credentials as level-2. Reframed to structurally-enforced elevation: platform-gated (IAM policy-as-code + JIT, credential tenancy, GitOps-triggered grants), not procedural. Cross-references `recipes/gitops-jit-privilege-elevation.md` as known-good shape. Level-0, level-1, level-3 anchors unchanged.
  - `PL3-deployment-cicd` — level-2 anchor adds a structural requirement for the deploy target itself to be declared in-repo as IaC (cloud project, DNS, TLS, CDN / proxy, WAF, edge config). Dashboard-only provisioning of the deploy surface is explicitly level-1 regardless of trigger capability — a fully agent-driven CI pipeline whose target project was created by clicking in a vendor console still has out-of-band state in the critical path. Level-1 anchor extended to name this failure shape; level-0 and level-3 unchanged.
  - `PL4-branch-protection` not edited — already structural (“structurally impossible, not merely discouraged”); it’s the gold standard the Corollary generalises from.
  - `PL5-pipeline-reliability` not edited — purpose-tight to trigger / webhook / transition reliability; IaC for pipeline config is covered by `PL5-cicd-pipeline-health`, IaC for deploy target is now covered by `PL3-deployment-cicd`. Adding IaC language here would duplicate coverage.
  - New Open Question — whether IaC warrants a standalone criterion (`PL4-infra-as-code`) once projects have scored against the refined anchors, or whether the current routing into `PL3-deployment-cicd` + `PL4-least-privilege` + `PL5-cicd-pipeline-health` is sufficient. Premature extraction rejected for v0.27 under purpose-tight discipline.
- v0.26 — `PL3-browser-web` reframed to diagnostic; deterministic browser automation extracted to recipe. Continues the rubric-vs-recipe hygiene pass. The criterion had prescribed Playwright dev-time codegen at level-2 (a specific tool and mechanism); reframed to the three architectural properties that matter — deterministic, inspectable, version-controlled. Playwright dev-time codegen is one concrete realisation, now living in `recipes/deterministic-browser-automation.md` (status: proposed). Level-1 retains the runtime-AI-DOM-parsing (Browseruse, Stagehand) call-out as a concrete example of the failure shape — these products are named as illustrations of non-deterministic browser use, not prescribed against at the criterion level (that’s the recipe’s domain). No criterion added or removed; no renumbering; PL3 max unchanged; rubric total unchanged; criteria count unchanged.
- v0.25 — Prescriptive-mechanism language reframed across three Pillar 2 criteria. Continues the rubric-vs-recipe hygiene pass from v0.24. Three criteria had prescribed specific mechanisms where diagnostic concerns would have sufficed; reframes preserve the evaluable state while removing mechanism prescription. Scores on these criteria should not shift materially for any project already at level-2.
  - `PL2-hard-validation-gates` — reframed from “lint, typecheck, format via pre-commit hooks + CI” to the outcome concern: “violations cannot reach the protected branch undetected; enforcement at multiple checkpoints with consistent ruleset across layers; bypass at any earlier checkpoint caught by a later one.” Pre-commit + CI is now one example realisation; merge queue + CI, pre-push + CI, pre-receive server-side hook + CI are equally valid.
  - `PL2-test-colocation-coverage` — reframed from prescribing colocation as the convention to prescribing discipline in choice: “a single test-location convention (colocated with source, parallel-tree, or other), applied repo-wide without exception.” Coverage requirements (global + per-PR differential) unchanged. Platform-support notes for colocation removed from the criterion body; they would move to a colocation-specific recipe if one is written. The slug `PL2-test-colocation-coverage` no longer perfectly reflects the reframed concern — candidate for rename (e.g. `PL2-test-location-coverage`) in a future bump; not acted on here to avoid rotting downstream slug references.
  - `PL2-sast-dast` — “at minimum Aikido; SonarQube if compliance demands” product-floor language dropped. Reframed to category-only: “static and dynamic application security testing with agent-actionable findings; suppression accountability.” Aikido and SonarQube remain named examples but are no longer the prescribed floor.
  - Audit surface: `PL3-browser-web` is the remaining Pillar-3 criterion still prescribing mechanism (Playwright dev-time codegen at level-2). Held for separate discussion.
- v0.24 — Dynamic debug logging extracted from criterion to recipe. `PL4-dynamic-debug-logging` removed from the rubric; mechanism extracted to new recipe `recipes/dynamic-debug-logging.md`. Driver: the criterion’s text (“per-device, time-boxed, cost-aware”) was prescriptive mechanism language, which per canon’s rubric-vs-recipe distinction (rubric is diagnostic — criteria + level anchors only; recipe is prescriptive — known-good patterns) belongs in recipes, not in the evaluation instrument. `PL4-cost-governance` level-2 expanded to name that cost-prone domains with runaway characteristics (verbose logging, inference tokens, canary spin-up) require domain-specific containment mechanisms in addition to global observability; the dynamic-debug-logging recipe is the first such mechanism formalised. PL4 contracts from 11 to 10 criteria; renumbering 4.8 → 4.7 through 4.11 → 4.10 (`PL4-release-strategy`, `PL4-agent-invokable-rollback`, `PL4-cost-governance`, `PL4-memory-safety`). PL4 max 31 → 28; rubric total 149 → 146; criteria count 51 → 50 — balancing the v0.23 `PL4-egress-capability-scoping` addition with a v0.24 extraction and restoring the pre-v0.23 numeric shape. Slugs remain stable for retained criteria (v0.20 discipline). Historical reference to `PL4-dynamic-debug-logging` in `research/soc2-tsc-rubric-mapping.md` (conducted against v0.20) annotated in place rather than edited — the v0.20 mapping remains accurate for that point in time.
PL4-egress-capability-scopingcriterion,PL5-multi-agent-delegationrefined for trifecta-leg separation. Closes two structural gaps in the rubric’s coverage of Willison’s lethal trifecta threat model (seeresearch/rubric-stance-on-lethal-trifecta.md): leg 3 (external communication / egress) and the combinatorial “break the trifecta” architectural principle.- NEW
PL4-egress-capability-scopingat 4.6 (renumbering 4.6 → 4.7 through 4.10 → 4.11 forPL4-dynamic-debug-logging,PL4-release-strategy,PL4-agent-invokable-rollback,PL4-cost-governance,PL4-memory-safety). Egress gate at emission boundary for unsupervised agent outbound paths — chat posts, webhook calls, email, HTTP, image-rendering URLs, link-preview fetches. Scope is application-layer egress from automated / scheduled / unattended agent action; interactive responses in user-supervised sessions are out of scope (symmetric withPL4-prompt-injection-defence’s v0.19 scope narrowing). Level-2 enforces per-destination allowlists, rate limits per destination, elevation gates on novel destinations. Content-based output scanning is defence-in-depth, not primary. Max 3. PL4 max moves 28 → 31; rubric total 146 → 149; criteria count 50 → 51. Surfaced by the lethal-trifecta coverage research during conversations on exfiltration defence and agent-invocable scheduling; slugs remain stable across renumbering (v0.20 discipline). PL5-multi-agent-delegationlevel-2 refined — trifecta-leg separation added alongside the existing segregation-of-approval-authority clause (v0.22). Role partitioning now serves not only parallelism and specialisation but also lethal-trifecta separation: no single role simultaneously holds access to private data, exposure to untrusted content, and ability to externally communicate. Concrete realisations named: sparse-checkout context scoping, differentiated MCP / tool scopes per role, segregated credential tenancy per role. “Segregation of incompatible duties” clause extended to cover both approval duties and trifecta-leg duties. Level-0, level-1, level-3 anchors unchanged.- Out of scope for v0.23: R3 from the research — a level-3 clause on
PL4-prompt-injection-defencebinding it as defence-in-depth paired with the two architectural criteria (PL4-egress-capability-scoping+PL5-multi-agent-delegation) — held for separate confirmation. Level-3 addition toPL5-multi-agent-delegation(auto-audit of trifecta-leg assignment per role on MCP / tool-surface changes) also held. Both candidates for a v0.24 bump if and when the level-3 anchor change is desired.
- NEW
- v0.22 — SOC 2 coexistence pass. Five refinements from the SOC 2 TSC coexistence research (see `research/soc2-tsc-rubric-mapping.md`), all anchor tightenings or documentation additions — no new criteria, no renumbering, no hard caps.
  - Philosophy — new Scope and boundary subsection. Two stance-claims made explicit: (a) the rubric is engineering-environment focused and does not replace compliance frameworks — coexistence, not convergence, with “not yet addressed” rather than “out of scope” framing for compliance-adjacent concerns; (b) engineering does not need PII — PII stays on the production-data side; engineering surfaces are PII-free by design, not by layered masking. Names the five criteria that implement the PII boundary (`PL4-pii-masking`, `PL4-memory-safety`, `PL4-prompt-injection-defence`, `PL1-real-world-feedback`, `PL3-emission-quality`).
  - `PL3-memory-substrate` — criterion description clarifies “customer context” as “customer-context references (pseudonymous; raw PII does not enter the memory substrate)”. Inline duplication of the PII boundary in the memory criterion; no cross-references added.
  - `PL4-memory-safety` — criterion description and level-2 add retention discipline as a fourth concern alongside hygiene, access control, and write-path validation. Retention framing is relevance-primary with time-bound backstops — the primary disposal trigger is relevance decay; time-bound floors and ceilings (from customer contracts, privacy obligations, regulated data) operate as backstops. Preserves the rubric’s freshness-over-time stance while accommodating commitment-driven retention where applicable.
  - `PL5-multi-agent-delegation` — level-2 adds segregation-of-approval-authority clause. Approval authority for material changes remains human; agent-to-agent approval permitted only within a platform-codified low-risk policy with audit trail. Extends `PL4-branch-protection`’s human-approval requirement to approval surfaces outside git (elevation requests, release promotions, production-impacting actions).
  - `PL5-portfolio-skill-reuse` — level-2 adds tenant-boundary discipline. Extracted skills operate on the abstract pattern; tenant-specific context (client names, negotiated decisions, proprietary patterns) stays within its tenant.
  - Compliance-adjacent questions surfaced by the research — not added to the rubric’s Open Questions (the rubric does not assume stakeholders have SOC 2 as a goal); captured in the alignment recipe instead. If any of them (e.g. incident response as a first-class capability) proves genuinely universal via independent motivation, it can be raised as a rubric open question from the rubric’s own perspective without SOC 2 citations. Rubric stays agentic-engineering-universal.
  - Out of scope for v0.22: no new criteria (recommendations R1–R3 from the research’s earlier draft, proposing criterion additions for risk assessment / third-party risk / incident response, were explicitly withdrawn — adoption rationale would have been SOC 2 alignment, which is not a valid reason to grow the rubric). Composition-rule additions to Pillar 4 intro considered and withdrawn — Risk Floor already carries that discipline categorically.
- v0.21 — Engineering-surface PII discipline made explicit at ingestion and emission. Two criterion refinements surfacing a principle the rubric already held implicitly — engineering does not need PII; PII stays on the production-data side of the boundary between production and engineering surfaces.
- `PL1-real-world-feedback` level-2 now explicitly requires ingestion sanitisation against PII alongside the existing structure and instruction-shape requirements; cross-references `PL4-pii-masking` and `PL4-prompt-injection-defence` as the two layers of ingestion discipline.
- `PL3-emission-quality` criterion description and level-2 now require correlation identifiers to be pseudonymous tokens (user-ID, session-ID, request-ID), not PII-derived (email, phone, name); log payloads are scrubbed of PII at emission per `PL4-pii-masking`.
- No new criteria, no renumbering — these are cross-references and anchor clarifications that make the existing principle auditable at the criterion level. Surfaced by the SOC 2 TSC coexistence research (see `research/soc2-tsc-rubric-mapping.md`) while working through how PII retention interacts with rubric memory criteria; the research is also being updated to reflect the refined principle.
- v0.20 — Stable slug references introduced. Every criterion now carries a `PL<pillar>-<semantic-slug>` reference (e.g. `PL4-prompt-injection-defence` for 4.5). Slugs are inline in each criterion row and listed in the new Criteria index section. Downstream artefacts (recipes, integrations, case studies, reviews) now reference criteria by slug rather than by number, so criterion renumbering no longer silently rots references. Motivated by dogfooding: recipe frontmatter used `"4.5"`-style numeric IDs, and the rubric’s own changelog (“renumbering ripples through Pillars 1, 4, 5”) surfaces the failure mode we wanted to avoid downstream. Pillar-prefix (PL1–PL5) chosen as the stability anchor — pillars themselves have been stable since v0.8, and a pillar-level renumbering would force a broader rewrite anyway. Historical changelog entries retain numeric references; all new cross-references and artefacts use slugs.
- v0.19 — 4.5 Prompt injection defence scope narrowed. Criterion text changed from “all external content entering agent context” to “all external content entering persistent agent context.” Scope is durable ingestion paths (memory writes, indexed knowledge, unsupervised scheduled ingestion); interactive turn context in user-supervised sessions is out of scope — the cooperative user is the defence layer, and the blast radius of compromise is contained by the Pillar 4 substrate (4.2 IAM scoping, 4.3 branch protection). Level anchors unchanged; cross-references to 1.10 / 5.7 / 4.10 still hold (all persistent-context surfaces). Surfaced while drafting the Ingestion as PR recipe — the recipe’s scope boundary needed a rubric anchor to match.
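The v0.20 slug scheme is mechanically checkable. A minimal sketch of a downstream-artefact lint — the regex and the example frontmatter are illustrative assumptions, not rubric content:

```python
import re

# Stable criterion references: pillar prefix PL1-PL5 plus a semantic slug,
# e.g. "PL4-prompt-injection-defence". Pillars are the stability anchor.
SLUG_PATTERN = re.compile(r"^PL[1-5]-[a-z][a-z0-9-]*$")

def non_slug_refs(refs):
    """Return references that do NOT follow the slug scheme —
    e.g. legacy numeric IDs like "4.5" that rot on renumbering."""
    return [r for r in refs if not SLUG_PATTERN.match(r)]

# Hypothetical recipe frontmatter mixing slug and legacy numeric references.
frontmatter_refs = ["PL4-prompt-injection-defence", "4.5", "PL1-real-world-feedback"]
print(non_slug_refs(frontmatter_refs))  # → ['4.5']
```

Run against recipe frontmatter, such a check turns "references rot silently" into a hard validation gate in the spirit of Pillar 2.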
- v0.18 — 5.7 Signal-driven task generation tightened. Level-2 now explicitly requires an agent-invokable scheduler (agent can create, edit, and cancel scheduled jobs through the project’s own tool surface, not just observe ops-configured cron); absence of the primitive caps the criterion at level 1 regardless of reactive-source coverage. Criterion intro also flags that scheduling is load-bearing for 2.3, 2.4, 2.10, 4.7, 5.1, and 5.8 — made explicit rather than scored as a standalone criterion, to avoid rubric-growth regression. Surfaced by dogfooding: attempting to schedule a Slack reminder from this repo exposed that the project has no agent-invokable scheduler, and that the rubric had been silently presupposing one across multiple criteria. Parallel to v0.17’s 4.7 refinement (agent-bounding platform constraints made explicit at level-2).
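A minimal sketch of the primitive this level-2 anchor requires — an in-memory stand-in, with class and method names as illustrative assumptions rather than a prescribed tool surface:

```python
import itertools

class SchedulerToolSurface:
    """Agent-invokable scheduler: the agent can create, edit, and cancel
    scheduled jobs through the project's own tool surface, rather than
    only observing ops-configured cron."""

    def __init__(self):
        self._jobs = {}                 # job_id -> {"cron": ..., "task": ...}
        self._ids = itertools.count(1)  # monotonically increasing job IDs

    def create_job(self, cron_expr, task):
        job_id = next(self._ids)
        self._jobs[job_id] = {"cron": cron_expr, "task": task}
        return job_id

    def edit_job(self, job_id, cron_expr):
        self._jobs[job_id]["cron"] = cron_expr

    def cancel_job(self, job_id):
        del self._jobs[job_id]

    def list_jobs(self):
        return dict(self._jobs)

# e.g. the dogfooding case above: an agent scheduling a Slack reminder
# (task name is illustrative).
scheduler = SchedulerToolSurface()
job = scheduler.create_job("0 9 * * MON", "post-slack-reminder")
scheduler.edit_job(job, "0 9 * * FRI")  # agent reschedules without ops help
```

The point of the sketch is the shape of the surface, not the storage: without create/edit/cancel exposed to the agent, the periodic loops in 2.3, 2.4, 2.10, 4.7, 5.1, and 5.8 have nothing to stand on.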
- v0.17 — Major pass integrating Li Theen’s v0.16 review (see `reviews/rubric-review-v0.16.md`) plus structural refinements from triage. 50 criteria preserved; renumbering ripples through Pillars 1, 4, 5.
- Pillar 1 Focus — NEW 1.1 Corpus taxonomy, filing, indexing (substrate for retrieval-dependent criteria). 1.2 renamed monorepo-aware → Codebase-aware scoping; grouped with 1.1 as substrate; strengthened — level-2 now requires structurally-enforced boundaries (sparse-checkout, MCP tool scoping, or equivalent), not declarative-only. 1.3 Task decomposition refined to require type-specific templates with acceptance-criterion fields (load-bearing for 5.5). 1.6/1.7 swapped so ADRs precede Design intent in the internal-knowledge group. 1.7 Runbooks + 1.8 User/admin/dev docs merged into 1.8 Documentation loop (ops + product). 1.10 refined to own the signal-quality dimension; ingestion automation routes to 5.7. Pillar intro now reads “substrate → task-level → external → internal → signal.”
- Pillar 2 Validation — 2.1 renamed Hard pre-merge rules → Hard validation gates; level-2 requires pre-commit hooks alongside CI with matching ruleset, hook bypass caught by CI. 2.2 level-2 adds per-PR differential coverage threshold. 2.5 level-2 adds suppression accountability (rationale + named reviewer; expiry required only on high-severity suppressions). 2.9 level-3 adds audit-to-gate feedback loop.
- Pillar 3 Actions — 3.9 level-2 requires at least one structural safety layer on outbound communication (allowlist of recipients/channels, content filter, dry-run default, rate limiting, OR human approval for sensitive categories). 3.1 cross-reference updated for Pillar 4’s renumbering.
- Pillar 4 Safe Space — pillar intro adds the safety-composition principle (“Safety is a composition of mechanisms, not a single gate”). 4.1 Staging-isolation + 4.2 Load-testing replica merged into 4.1 Environment isolation. NEW 4.3 Branch protection and source-control write scoping (max 2), including platform-enforced freshness-at-merge. NEW 4.5 Prompt injection defence at ingestion boundary — unified sanitisation policy applied consistently across 1.10, 5.7, 4.10. 4.7 Canary / blue-green refined with agent-bounding platform constraints (parameter caps, immutable pipeline stages, platform-verified metric gates). 4.9 Memory hygiene + 4.10 Memory access merged into 4.10 Memory safety, now also covering write-path validation. Max-2 criteria become 4.3 and 4.8 (were 4.1 and 4.7).
- Pillar 5 Workflow — NEW 5.5 Spec-first agent loop — implementation tasks enter the loop with an executable acceptance criterion before code generation (depends on 1.3 type-specific templates). 5.4 Multi-agent delegation strengthened — level-2 now requires differentiated full-stack roles (context scope per 1.2, tools, permissions, skills, prompts); prompt-only role separation drops to level-1. 5.5 PR self-presentation renamed to 5.6 PR reviewability; level-2 adds branch-currency requirement; level-3 reframed from process claim (“presentation quality improves”) to outcome (Glance Threshold trends down) with PR freshness across the review lifecycle. 5.6 Periodic loops + 5.7 Signal ingestion merged into 5.7 Signal-driven task generation (proactive + reactive sources both contribute). 5a now has 6 criteria, 5b has 4.
- Cross-pillar structure — Compounding Index denominator unchanged at 46/50 (max-2 criteria: 2.1, 2.6, 4.3, 4.8). Total max unchanged at 146.
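As a sanity check, the unchanged totals are consistent with 50 criteria scored 0–3 and the four max-2 exceptions:

```python
criteria = 50
max2_criteria = 4          # 2.1, 2.6, 4.3, 4.8 — compounding not structurally meaningful
compounding_eligible = criteria - max2_criteria   # Compounding Index denominator

# Eligible criteria can reach level 3; max-2 criteria cap at level 2.
total_max = compounding_eligible * 3 + max2_criteria * 2
print(compounding_eligible, total_max)  # → 46 146
```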
- v0.16 — Broadened three Actions criteria to dissolve artificial read/write splits. 3.1 (DB access) → Structured state read access, covering app DB and infra-as-code state. 3.5 → Source control interaction, covering both git write ops and PR metadata reads. 3.7 → Deployment and CI/CD interaction, covering both CI invocation and result reading. 3.1 also tightened to explicitly delegate PII/IAM concerns to Safe Space 4.3/4.4 via cross-reference, preventing scorer double-count. Added “What scoring requires” note explaining that Actions provides the operational access scoring itself needs. No new criterion, no count change.
- v0.15 — Clarified 2.8 (human taste validation: research services or canary/A-B on real users) and 1.10 (bug reports as enriched real-world signal, belonging in Focus not Validation).
- v0.13 — Major reorganisation. Focus reordered as task-level → external → internal → signal, and gained 1.9 Client/stakeholder context. Validation reordered by frequency of execution (load test last). Actions reordered read → write → meta, with the `a/b` numbering artifacts removed. Cross-references updated throughout. 49→50 criteria, 143→146 max.
- v0.12 — Merged 5.1 (task inbox) and 5.2 (webhooks) into a single Pipeline reliability criterion (they double-counted). Added 5.2 CI/CD pipeline health as a distinct concern from what CI runs.
- v0.11 — Added 3.10 Skill library health (project-level). Distinct from 5.10 which is portfolio-level. Pillar symmetry hit — four of five pillars at 10 criteria.
- v0.10 — Added 1.9 Primary source access. Scores whether upstream docs (Apple, OCPI/OCPP, vendor SDKs) are in agent-consumable form. Addresses the silent-failure mode of training-cutoff fallback.
- v0.9 — Pillar gap analysis. Focus gained 1.7 (ADRs) and 1.8 (design intent). Actions gained 3.6 (communication), 3.7 (source control), 3.8 (browser/web, favouring deterministic scripts over runtime AI). 42→47 criteria.
- v0.8 — Pillar rename to final form: Focus, Validation, Actions, Safe Space, Workflow.
- v0.7 — Tightened 2.2 to specify colocation. Added 2.2b test quality verification (mutation testing).
- v0.6 — Dropped risk-weighted scoring. Replaced with mandatory per-pillar reporting + categorical Risk Floor rule (Exposed / Capable / Mature). Eliminates second-number overhead and arithmetic awkwardness (37.5 rounding) while preserving the diagnostic signal.
- v0.5 — Scale change: 0/1/2 → 0/1/2/3, where 3 = “Compounding” (improves with use). Embeds memory and learning into every criterion rather than a separate pillar. Added `(max 2)` tag for criteria where compounding isn’t structurally meaningful. Added Compounding Index as fourth meta-metric.
- v0.4 — Major loop-closure pass. Added outcome→input loop (5.8), PR self-presentation (5.5), real-world signal ingestion (5.7), cost governance (4.8), agent action audit trail (2.9). Pillar 5 split into 5a pipeline mechanics and 5b compounding loop.
- v0.3 — Gentari added as distinct pilot (client-deployment scoring of same codebase as SinarAI/Surge). Divergence between the two scores is diagnostic.
- v0.2 — Resolved observability placement: split across Actions (emission + queryability) and Safe Space (cost-governed logging + PII-safe telemetry).
- v0.1 — Initial draft synthesised from Agentic Engineering discussion (Granola, 16 Apr 2026).