
Agentic Engineering Rubric

The rubric's pillar bodies — the criteria themselves — each live on their own page so they can be browsed without one scroll to rule them all:

  • 1. Focus — Narrow the agent’s world to what matters; what remains is the right context.
  • 2. Validation — Hard, deterministic rules that catch non-deterministic output.
  • 3. Actions — The agent’s ability to act externally in the real world.
  • 4. Safe Space — Blast-radius containment, so “going wrong” has bounded cost.
  • 5. Workflow — The meta-layer that ties 1–4 together, including periodic and proactive loops.

Version: 0.27
Date: 19 April 2026
Source: Synthesised from Agentic Engineering discussion (Granola, 16 Apr 2026)
Purpose: Score any codebase / project on its readiness for agentic software engineering. The goal is to move from “AI can write code” to “AI can safely, reliably, and continuously ship production value with minimal human shepherding.”


Philosophy

The gap between “AI can do the task” and “AI actually ships production value with trust” is not a model-capability problem — it is an engineering environment problem. Agentic engineering is the discipline of constructing that environment.

Five pillars, working together:

  1. Focus — narrow the agent’s world to what matters; what remains is the right context. Focus excludes, context includes — the two are inseparable
  2. Validation — hard, deterministic rules that catch non-deterministic output
  3. Actions — the agent’s ability to act externally in the real world
  4. Safe Space — blast-radius containment, so “going wrong” has bounded cost
  5. Workflow — the meta-layer that ties 1–4 together, including periodic and proactive loops

Guiding principle: What is good for humans is good for the AI. A tidy, well-instrumented, well-guarded codebase scores well on this rubric whether the next contributor is a senior engineer or an agent. “Agentic engineering” is arguably just engineering done properly.

Corollary: Structural enforcement over procedural gating. Where a concern can be enforced by a mechanism — IaC for infrastructure, branch protection for source control, credential tenancy for identity, policy-as-code for IAM — the rubric scores the mechanism. Humans as judges of a mechanism’s correctness remain load-bearing; humans as executors of a procedural step are flagged as sub-level-2. Agents stress-test at scale and speed what would have broken under human load too.

Scope and boundary

The rubric is focused on engineering environments — it scores the readiness of a codebase to host agentic work. It is not a compliance framework; it does not replace organisational governance, formal attestation standards (SOC 2, ISO 27001, NIST), or third-party-risk programs. Where the rubric’s concerns coincide with those frameworks — most often in access control, change management, monitoring, availability — applying the rubric naturally builds toward compliance readiness on the overlapping dimensions. Where the rubric has not yet addressed a compliance-adjacent concern, open questions about which concerns to absorb, refine, or leave to complementary instruments are captured in the Open Questions section. Coexistence with compliance frameworks is the current stance; convergence is neither the goal nor foreclosed.

The rubric holds one additional boundary explicitly: engineering does not need PII. PII lives on the production-data side of the boundary between production and engineering systems. Logs, memory, caches, git history, CI artefacts, and agent tool surfaces are PII-free by design, not by layered masking. The criteria that implement this — PL4-pii-masking, PL4-memory-safety, PL4-prompt-injection-defence, and the ingestion discipline in PL1-real-world-feedback and PL3-emission-quality — realise a single bright line, not parallel defences.


Scoring

Each criterion is scored 0 / 1 / 2 / 3:

Score | Anchor | Meaning
0 | Absent | Not in place, or in name only
1 | Present | Exists but with meaningful gaps, inconsistent coverage, or high friction
2 | Effective | Consistently in place, low friction, agent-usable
3 | Compounding | Improves with use — outcomes are captured, fed back, and demonstrably make the criterion cheaper or better over time

How to read the scale

  • 2 is the realistic operational target for most criteria. A project that hits 2 on every line is a well-engineered codebase.
  • 3 is the bar for criteria where compounding is structurally possible and high-leverage. Reaching 3 requires building learning infrastructure: instrumentation, retrieval, hygiene, decay protocols.
  • Some criteria are tagged (max 2) — compounding isn’t structurally meaningful (e.g. lint either passes or it doesn’t). For these, 2 is the ceiling; the rubric doesn’t penalise the absence of a “3.”
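The scale and the (max 2) ceiling can be sketched as a small scoring helper. A minimal sketch — the criterion scores shown are hypothetical, and only the clamping rule comes from the rubric:

```python
# Minimal sketch of the 0-3 scale with the "(max 2)" ceiling applied.
# The scores assigned below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class Criterion:
    slug: str
    score: int          # raw score: 0, 1, 2, or 3
    max_score: int = 3  # 2 for "(max 2)" criteria, where compounding isn't structurally meaningful

    def effective(self) -> int:
        # A "(max 2)" criterion can never earn a 3; clamp rather than penalise.
        return min(self.score, self.max_score)

criteria = [
    Criterion("PL2-hard-validation-gates", score=2, max_score=2),  # lint either passes or it doesn't
    Criterion("PL1-real-world-feedback", score=3),                 # compounding is structural here
]
total = sum(c.effective() for c in criteria)
print(total)  # → 5
```

The clamp, rather than a penalty, is what makes “the rubric doesn’t penalise the absence of a ‘3’” concrete: a capped criterion at 2 contributes its full available weight.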

Why this matters

This single-scale design embeds memory and learning into every criterion rather than treating them as a separate concern. A codebase that scores 2s everywhere has capability; a codebase that scores 3s has a compounding system — one that gets cheaper to operate the longer it runs. The gap between the two is the gap between “AI-assisted engineering” and “agentic engineering.”

What scoring requires

Scoring a project needs more than codebase access. The rubric assumes the project’s Actions pillar already provides agent-readable operational access — structured state (PL3-structured-state-read), observability (PL3-emission-quality / PL3-agent-queryability), source control metadata (PL3-source-control), CI/deploy results (PL3-deployment-cicd). That same access is what makes scoring feasible: a scorer queries the agent’s own read surfaces rather than chasing dashboards by hand. A project with weak Actions is simultaneously harder to use and harder to audit.

For criteria that can’t be fully scored from agent-readable sources alone (e.g. PL2-taste-validation human taste validation, PL2-secret-hygiene secret rotation confirmation), expect to supplement with brief process interviews.

Maximum total: 146 points. (Calculated below in the Scoring Summary.)

A project should aim to reach the maximum on at least one flagship codebase before attempting to scale the methodology across the portfolio.


Meta-Metrics

Beyond the rubric score, track four operational signals. The rubric measures capability and compounding; these measure whether the loop actually runs and improves.

Glance Threshold — median time to approve a PR

  • > 15 min — something upstream failed (planning, actions, or validation)
  • 5–15 min — acceptable, but PR is doing too much
  • < 5 min — target state: PR is glanceable because trust has compounded

If you have to read a PR for an hour, you might as well have written it yourself.

Cost per merged PR

  • All-in cost (agent inference + CI minutes + canary infrastructure + log retention) divided by merged PRs in the period
  • Tracks whether agentic engineering is actually cheaper than the alternative
  • A high rubric score with runaway cost-per-PR means the rubric is being gamed

Signal-to-deploy time — median hours from user signal received to fix deployed

  • User signal = review, support ticket, production alert, meeting note, canary metric breach
  • Captures whether the full loop (PL1-real-world-feedback → PL5-signal-driven-tasks → PL5-outcome-input-loop → release) actually closes
  • This is the metric that proves the “month-long holiday and the app has grown 30 features” vision is real, not aspirational

Compounding Index — fraction of compounding-eligible criteria scored at 3

  • Numerator: criteria scored at 3
  • Denominator: criteria where 3 is structurally achievable (i.e. excluding (max 2) criteria) — currently 46 of 50
  • Tracks whether the project is building learning infrastructure or just static capability
  • A high Compounding Index is the rubric’s strongest signal that agentic engineering is actually compounding, not just present
  • Target: > 0.3 within 12 months of starting; > 0.6 indicates a mature compounding system

Scoring Summary

Pillar | Criteria | Max
1. Focus | 10 | 30
2. Validation | 10 | 28
3. Actions | 10 | 30
4. Safe Space | 10 | 28
5. Workflow | 10 | 30
Totals | 50 | 146

Per-pillar reporting is mandatory

A single total hides where a project is weak. Always report the per-pillar breakdown alongside the total:

Example: (P1: 20/30, P2: 20/28, P3: 20/30, P4: 20/28, P5: 24/30) → Total 104/146

Pillar weaknesses become visible by inspection; teams can’t hide a P4 problem inside a flattering total.
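The mandatory reporting format can be produced mechanically. A minimal sketch — the pillar maxima come from the Scoring Summary; the example scores are the ones used above:

```python
# Render the mandatory per-pillar breakdown in the rubric's reporting format.
# Pillar maxima are taken from the Scoring Summary table.
PILLAR_MAX = {"P1": 30, "P2": 28, "P3": 30, "P4": 28, "P5": 30}

def report(scores: dict[str, int]) -> str:
    parts = ", ".join(f"{p}: {scores[p]}/{m}" for p, m in PILLAR_MAX.items())
    total, grand_max = sum(scores.values()), sum(PILLAR_MAX.values())
    return f"({parts}) → Total {total}/{grand_max}"

print(report({"P1": 20, "P2": 20, "P3": 20, "P4": 20, "P5": 24}))
# → (P1: 20/30, P2: 20/28, P3: 20/30, P4: 20/28, P5: 24/30) → Total 104/146
```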

The Risk Floor rule

Pillars 2 (Validation) and 4 (Safe Space) are categorically different from the others — failures here are catastrophic (data loss, prod incidents, PII leaks), not merely slow. So the project’s reported maturity level is capped by its weakest score on P2 or P4:

P2 or P4 score | Reported maturity ceiling
< 50% of pillar max | Exposed — high risk debt; no maturity claim possible regardless of total
50–75% of pillar max | Capable — total score is meaningful but qualified
≥ 75% of pillar max | Mature — total score stands on its own

Categorical, not arithmetic. A project at 118/146 with P4 at 12/28 (43%) is Exposed, not “almost mature” — regardless of total.
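The categorical nature of the rule is easiest to see in code: the classification depends only on the weaker of the two pillar ratios, never on the total. A minimal sketch of the ceiling table:

```python
# Risk Floor: the reported maturity ceiling is set by the weaker of P2 and P4.
# Pillar maxima are taken from the Scoring Summary; the total never enters.
P2_MAX, P4_MAX = 28, 28

def risk_floor(p2: int, p4: int) -> str:
    weakest = min(p2 / P2_MAX, p4 / P4_MAX)
    if weakest < 0.50:
        return "Exposed"   # no maturity claim possible, regardless of total
    if weakest < 0.75:
        return "Capable"   # total is meaningful but qualified
    return "Mature"        # total stands on its own

print(risk_floor(p2=20, p4=12))  # → Exposed (12/28 ≈ 43%, below the 50% floor)
```

This is why 118/146 with P4 at 12/28 still reports as Exposed: the function never looks at the 118.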


Application Workflow

  1. Baseline — score the target project across all 50 criteria. Report the per-pillar breakdown (P1/30, P2/28, P3/30, P4/28, P5/30), the total (/146), the Risk Floor classification, and the Compounding Index. Most real codebases baseline in the 36–62 / 146 range, with Compounding Index near 0.
  2. Prioritise — if Risk Floor is Exposed, fix P2/P4 gaps first, full stop. Otherwise, pick the three lowest-scoring criteria across the rubric. For criteria already at 2, pursuing 3 is a separate strategic choice — pick 2–3 high-leverage ones per quarter rather than spreading thin.
  3. Sprint — dedicate an infrastructure sprint to closing those gaps.
  4. Re-score — quarterly cadence. Track the four meta-metrics alongside the rubric score.
  5. Extract — whenever a gap is closed with a reusable mechanism (e.g. pg_columnmask setup, dynamic logging skill, agent action audit middleware, memory substrate), extract it into a shared library so the next project starts at a higher baseline. This is the compounding effect at portfolio level.

Candidate pilot projects

  • SinarAI / Surge — platform-side scoring. Active, production, known gaps around load testing and virtual-charger actions. Scores the agentic-readiness of the platform itself, independent of any client deployment.
  • Gentari (CEP Phase 2 / OCPI roaming hub) — client-deployment scoring of the same codebase under GMOB-specific constraints: IP boundaries (SinarAI vs. GMOB work product), ECSGF cybersecurity compliance, PDPA / PII requirements on Aurora PostgreSQL, Cloudflare WAF for OCPI polling. Divergence between this score and the SinarAI / Surge score is itself diagnostic — it tells you which gaps are platform-level vs. client-context-level.
  • Tumpang — currently frozen; rebuilding would be a clean test of whether a high-scoring rubric allows “any new project” to be picked up cheaply.

Open Questions

  • Observability placement. (Resolved in v0.2.) Split across Pillar 3 / Actions (PL3-emission-quality, PL3-agent-queryability) and Pillar 4 / Safe Space (PL4-dynamic-debug-logging dynamic/cost-governed logging, PII-safe telemetry folded into PL4-pii-masking). Revisit after first scoring pass to check whether emission vs. queryability scores are too correlated to justify separation.
  • Learning / Memory placement. (Resolved in v0.5.) Embedded into the scoring scale itself — every criterion has a level-3 “compounding” descriptor that captures the learning dimension. Two dedicated infrastructure criteria: PL3-memory-substrate and PL4-memory-safety (hygiene, access control, and write-path validation; merged in v0.17 from the former 4.9 + 4.10). Compounding Index added as fourth meta-metric. Revisit after first scoring pass to check whether maturing criteria from 2→3 is too uniformly hard, suggesting need for substrate investment as a prerequisite.
  • Agent deployment safety (PL3-deployment-cicd / PL4-dynamic-debug-logging intersection). (Resolved in v0.17.) Closed by the PL4-release-strategy refinement — level-2 now requires platform-enforced constraints on the agent’s deployment actions (parameter caps, immutable pipeline stages, metric-gated promotion). PL3-deployment-cicd’s deployment capability is qualified by PL4-release-strategy’s enforcement.
  • Prompt injection / jailbreak surface at ingestion boundaries. (Resolved in v0.17; scope clarified in v0.19.) Closed by new PL4-prompt-injection-defence, which requires a unified sanitization layer applied consistently across all ingestion surfaces (PL1-real-world-feedback feedback loop, PL5-signal-driven-tasks signal-driven task generation, PL4-memory-safety memory write-path). v0.19 narrowed scope to persistent agent context — interactive ingestion in user-supervised sessions is out of scope for PL4-prompt-injection-defence; blast radius there is contained by Pillar 4 substrate. Recipe realisation: Ingestion as PR.
  • Taste / UX validation. Sitting in Pillar 2 as PL2-taste-validation, but it’s the only qualitative criterion in a quantitative pillar. May deserve its own mini-pillar if it proves load-bearing.
  • Brownfield-readiness scoring. The Tumpang test (“can a high-scoring rubric revive a frozen project cheaply?”) is currently not operationalised — the rubric scores current state, not gap-to-viable. Worth considering for v0.6.
  • Does this rubric generalise beyond software? The five-pillar framing (focus / validation / actions / safe space / workflow) may apply to agentic product management, agentic HR, etc. The 0-1-2-3 scale with embedded compounding generalises particularly well — every domain has the same “set up vs. learning over time” dimension. Worth testing in the HELP University conversation next week.
  • Are some level-3 descriptors aspirational rather than actionable? Some level-3 anchors (e.g. PL5-multi-agent-delegation “underperforming roles auto-flagged for prompt / skill refinement”) describe outcomes that may require significant infrastructure to even measure. After first scoring pass, audit which level-3 descriptors are practically achievable vs. which need refinement.
  • Does IaC for infrastructure-writ-large warrant a standalone criterion? (Opened in v0.27.) The Corollary’s structural-enforcement principle currently routes IaC concerns into existing anchors: PL3-deployment-cicd (deploy-target infra), PL4-least-privilege (IAM policy-as-code), PL5-cicd-pipeline-health (CI config as code). Whether this routing is sufficient — or whether IaC has its own maturity curve (declaration → drift detection → reconciliation → policy-as-code composition) that deserves a standalone criterion — should be revisited after a few projects have scored against the v0.27 refinements. A standalone criterion would most likely land under Pillar 4 as PL4-infra-as-code; purpose-tight criteria discipline (see also rubric-vs-recipe reframings in v0.24–v0.26) argues against premature extraction.

Criteria index

Stable slug references for every criterion. Downstream artefacts (recipes, case studies, integrations, reviews) should reference criteria by slug rather than by current number — numbers reshuffle when criteria are added, merged, or renumbered; slugs survive. Format: PL<pillar>-<semantic-slug>.
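A downstream artefact can cheaply reject numeric-style references before they rot. A minimal sketch — the pattern below is inferred from the index, not specified by the rubric:

```python
# Validate a PL<pillar>-<semantic-slug> reference before using it downstream.
# The pattern is inferred from the Criteria index; it is an assumption, not spec.
import re

SLUG_RE = re.compile(r"^PL[1-5]-[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_valid_slug(slug: str) -> bool:
    return bool(SLUG_RE.match(slug))

print(is_valid_slug("PL4-prompt-injection-defence"))  # → True
print(is_valid_slug("4.5"))                           # → False: numeric IDs rot on renumbering
```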

Pillar 1 — Focus

Slug | # | Criterion
PL1-corpus-taxonomy | 1.1 | Corpus taxonomy, filing, indexing
PL1-codebase-scoping | 1.2 | Codebase-aware scoping
PL1-task-decomposition | 1.3 | Task decomposition
PL1-tech-research | 1.4 | Tech research precedes implementation
PL1-primary-source-access | 1.5 | Primary source access
PL1-decision-records | 1.6 | Decision records (ADRs)
PL1-design-intent | 1.7 | Design intent accessible
PL1-documentation-loop | 1.8 | Documentation loop — operational and product docs
PL1-stakeholder-context | 1.9 | Client / stakeholder context
PL1-real-world-feedback | 1.10 | Real-world feedback loop

Pillar 2 — Validation

Slug | # | Criterion
PL2-hard-validation-gates | 2.1 | Hard validation gates
PL2-test-colocation-coverage | 2.2 | Test colocation and coverage
PL2-test-quality | 2.3 | Test quality verification
PL2-ui-test-coverage | 2.4 | UI test coverage on mobile / frontend
PL2-sast-dast | 2.5 | SAST / DAST present
PL2-secret-hygiene | 2.6 | Secret hygiene
PL2-external-pr-review | 2.7 | External PR review
PL2-taste-validation | 2.8 | Qualitative taste validation
PL2-agent-audit-trail | 2.9 | Agent action audit trail
PL2-load-stress-testing | 2.10 | Load / stress testing

Pillar 3 — Actions

Slug | # | Criterion
PL3-structured-state-read | 3.1 | Structured state read access
PL3-emission-quality | 3.2 | Emission quality
PL3-agent-queryability | 3.3 | Agent queryability
PL3-memory-substrate | 3.4 | Memory substrate exists
PL3-source-control | 3.5 | Source control interaction
PL3-domain-action-skills | 3.6 | Domain-specific action skills
PL3-deployment-cicd | 3.7 | Deployment and CI/CD interaction
PL3-browser-web | 3.8 | Browser / web interaction
PL3-communication | 3.9 | Communication actions
PL3-skill-library-health | 3.10 | Skill library health

Pillar 4 — Safe Space

Slug | # | Criterion
PL4-environment-isolation | 4.1 | Environment isolation
PL4-least-privilege | 4.2 | IAM scoped read-only by default
PL4-branch-protection | 4.3 | Branch protection and source-control write scoping
PL4-pii-masking | 4.4 | PII masking at data-access and telemetry layers
PL4-prompt-injection-defence | 4.5 | Prompt injection defence at ingestion boundary
PL4-egress-capability-scoping | 4.6 | Egress capability scoping at emission boundary
PL4-release-strategy | 4.7 | Canary / blue-green / partial release
PL4-agent-invokable-rollback | 4.8 | Rollback is trivial and agent-invokable
PL4-cost-governance | 4.9 | Operating cost observable, capped, and attributed
PL4-memory-safety | 4.10 | Memory safety

Pillar 5 — Workflow

Slug | # | Criterion
PL5-pipeline-reliability | 5.1 | Pipeline reliability
PL5-cicd-pipeline-health | 5.2 | CI/CD pipeline health
PL5-change-sets | 5.3 | Change sets / release management
PL5-multi-agent-delegation | 5.4 | Multi-agent delegation
PL5-spec-first-loop | 5.5 | Spec-first agent loop
PL5-pr-reviewability | 5.6 | PR reviewability
PL5-signal-driven-tasks | 5.7 | Signal-driven task generation
PL5-outcome-input-loop | 5.8 | Outcome → input loop
PL5-experiment-tracking | 5.9 | Experiment tracking
PL5-portfolio-skill-reuse | 5.10 | Reusable skills extracted across projects

Slugs introduced v0.20. Historical changelog entries keep numeric references; new entries, new cross-references, and all downstream artefacts use slugs.


Changelog

Pre-release. One-line entries for minor changes; 2–3 lines for structural ones. Cosmetic edits and wording-only fixes not logged.

  • v0.27 — Structural-enforcement-over-procedural-gating principle codified; anchors tightened on PL4-least-privilege and PL3-deployment-cicd. Surfaced by applying the rubric to the canon-web-pass-1 feature’s blockers (CF Pages project creation, domain attachment, Web Analytics enablement) — all “a human clicks the right button” steps that the rubric currently permits at level-2 despite being procedural-not-structural. No new criteria added; no renumbering; PL3 / PL4 / rubric totals unchanged.
    • Philosophy — new Corollary added after the existing guiding principle: Structural enforcement over procedural gating. Where a concern can be enforced by a mechanism (IaC, branch protection, credential tenancy, policy-as-code), the rubric scores the mechanism. Humans as judges of a mechanism’s correctness remain load-bearing; humans as executors of a procedural step are flagged as sub-level-2. Extension of “What is good for humans is good for the AI” — agents stress-test at scale and speed what would have broken under human load too.
    • PL4-least-privilege — level-2 anchor tightened. Previous “write requires explicit elevation” permitted a human-ticketed approval that then executed with unscoped credentials as level-2. Reframed to structurally-enforced elevation: platform-gated (IAM policy-as-code + JIT, credential tenancy, GitOps-triggered grants), not procedural. Cross-references recipes/gitops-jit-privilege-elevation.md as known-good shape. Level-0, level-1, level-3 anchors unchanged.
    • PL3-deployment-cicd — level-2 anchor adds a structural requirement for the deploy target itself to be declared in-repo as IaC (cloud project, DNS, TLS, CDN / proxy, WAF, edge config). Dashboard-only provisioning of the deploy surface is explicitly level-1 regardless of trigger capability — a fully agent-driven CI pipeline whose target project was created by clicking in a vendor console still has out-of-band state in the critical path. Level-1 anchor extended to name this failure shape; level-0 and level-3 unchanged.
    • PL4-branch-protection not edited — already structural (“structurally impossible, not merely discouraged”); it’s the gold standard the Corollary generalises from.
    • PL5-pipeline-reliability not edited — purpose-tight to trigger / webhook / transition reliability; IaC for pipeline config is covered by PL5-cicd-pipeline-health, IaC for deploy target is now covered by PL3-deployment-cicd. Adding IaC language here would duplicate coverage.
    • New Open Question — whether IaC warrants a standalone criterion (PL4-infra-as-code) once projects have scored against the refined anchors, or whether the current routing into PL3-deployment-cicd + PL4-least-privilege + PL5-cicd-pipeline-health is sufficient. Premature extraction rejected for v0.27 under purpose-tight discipline.
  • v0.26 — PL3-browser-web reframed to diagnostic; deterministic browser automation extracted to recipe. Continues the rubric-vs-recipe hygiene pass. The criterion had prescribed Playwright dev-time codegen at level-2 (a specific tool and mechanism); reframed to the three architectural properties that matter — deterministic, inspectable, version-controlled. Playwright dev-time codegen is one concrete realisation, now living in recipes/deterministic-browser-automation.md (status: proposed). Level-1 retains the runtime-AI-DOM-parsing (Browseruse, Stagehand) call-out as a concrete example of the failure shape — these products are named as illustrations of non-deterministic browser use, not prescribed against at the criterion level (that’s the recipe’s domain). No criterion added or removed; no renumbering; PL3 max unchanged; rubric total unchanged; criteria count unchanged.
  • v0.25 — Prescriptive-mechanism language reframed across three Pillar 2 criteria. Continues the rubric-vs-recipe hygiene pass from v0.24. Three criteria had prescribed specific mechanisms where diagnostic concerns would have sufficed; reframes preserve the evaluable state while removing mechanism prescription. Scores on these criteria should not shift materially for any project already at level-2.
    • PL2-hard-validation-gates — reframed from “lint, typecheck, format via pre-commit hooks + CI” to the outcome concern: “violations cannot reach the protected branch undetected; enforcement at multiple checkpoints with consistent ruleset across layers; bypass at any earlier checkpoint caught by a later one.” Pre-commit + CI is now one example realisation; merge queue + CI, pre-push + CI, pre-receive server-side hook + CI are equally valid.
    • PL2-test-colocation-coverage — reframed from prescribing colocation as the convention to prescribing discipline in choice: “a single test-location convention (colocated with source, parallel-tree, or other), applied repo-wide without exception.” Coverage requirements (global + per-PR differential) unchanged. Platform-support notes for colocation removed from the criterion body; they would move to a colocation-specific recipe if one is written. The slug PL2-test-colocation-coverage no longer perfectly reflects the reframed concern — candidate for rename (e.g. PL2-test-location-coverage) in a future bump; not acted on here to avoid rotting downstream slug references.
    • PL2-sast-dast — “at minimum Aikido; SonarQube if compliance demands” product-floor language dropped. Reframed to category-only: “static and dynamic application security testing with agent-actionable findings; suppression accountability.” Aikido and SonarQube remain named examples but are no longer the prescribed floor.
    • Audit surface: PL3-browser-web is the remaining Pillar-3 criterion still prescribing mechanism (Playwright dev-time codegen at level-2). Held for separate discussion.
  • v0.24 — Dynamic debug logging extracted from criterion to recipe. PL4-dynamic-debug-logging removed from the rubric; mechanism extracted to new recipe recipes/dynamic-debug-logging.md. Driver: the criterion’s text (“per-device, time-boxed, cost-aware”) was prescriptive mechanism language, which per canon’s rubric-vs-recipe distinction (rubric is diagnostic — criteria + level anchors only; recipe is prescriptive — known-good patterns) belongs in recipes, not in the evaluation instrument. PL4-cost-governance level-2 expanded to name that cost-prone domains with runaway characteristics (verbose logging, inference tokens, canary spin-up) require domain-specific containment mechanisms in addition to global observability; the dynamic-debug-logging recipe is the first such mechanism formalised. PL4 contracts from 11 to 10 criteria; renumbering 4.8 → 4.7 through 4.11 → 4.10 (PL4-release-strategy, PL4-agent-invokable-rollback, PL4-cost-governance, PL4-memory-safety). PL4 max 31 → 28; rubric total 149 → 146; criteria count 51 → 50 — balancing the v0.23 PL4-egress-capability-scoping addition with a v0.24 extraction and restoring the pre-v0.23 numeric shape. Slugs remain stable for retained criteria (v0.20 discipline). Historical reference to PL4-dynamic-debug-logging in research/soc2-tsc-rubric-mapping.md (conducted against v0.20) annotated in place rather than edited — the v0.20 mapping remains accurate for that point in time.
  • v0.23 — Lethal-trifecta coverage pass: new PL4-egress-capability-scoping criterion, PL5-multi-agent-delegation refined for trifecta-leg separation. Closes two structural gaps in the rubric’s coverage of Willison’s lethal trifecta threat model (see research/rubric-stance-on-lethal-trifecta.md): leg 3 (external communication / egress) and the combinatorial “break the trifecta” architectural principle.
    • NEW PL4-egress-capability-scoping at 4.6 (renumbering 4.6 → 4.7 through 4.10 → 4.11 for PL4-dynamic-debug-logging, PL4-release-strategy, PL4-agent-invokable-rollback, PL4-cost-governance, PL4-memory-safety). Egress gate at emission boundary for unsupervised agent outbound paths — chat posts, webhook calls, email, HTTP, image-rendering URLs, link-preview fetches. Scope is application-layer egress from automated / scheduled / unattended agent action; interactive responses in user-supervised sessions are out of scope (symmetric with PL4-prompt-injection-defence’s v0.19 scope narrowing). Level-2 enforces per-destination allowlists, rate limits per destination, elevation gates on novel destinations. Content-based output scanning is defence-in-depth, not primary. Max 3. PL4 max moves 28 → 31; rubric total 146 → 149; criteria count 50 → 51. Surfaced by the lethal-trifecta coverage research during conversations on exfiltration defence and agent-invocable scheduling; slugs remain stable across renumbering (v0.20 discipline).
    • PL5-multi-agent-delegation level-2 refined — trifecta-leg separation added alongside the existing segregation-of-approval-authority clause (v0.22). Role partitioning now serves not only parallelism and specialisation but also lethal-trifecta separation: no single role simultaneously holds access to private data, exposure to untrusted content, and ability to externally communicate. Concrete realisations named: sparse-checkout context scoping, differentiated MCP / tool scopes per role, segregated credential tenancy per role. “Segregation of incompatible duties” clause extended to cover both approval duties and trifecta-leg duties. Level-0, level-1, level-3 anchors unchanged.
    • Out of scope for v0.23: R3 from the research — a level-3 clause on PL4-prompt-injection-defence binding it as defence-in-depth paired with the two architectural criteria (PL4-egress-capability-scoping + PL5-multi-agent-delegation) — held for separate confirmation. Level-3 addition to PL5-multi-agent-delegation (auto-audit of trifecta-leg assignment per role on MCP / tool-surface changes) also held. Both candidates for a v0.24 bump if and when the level-3 anchor change is desired.
  • v0.22 — SOC 2 coexistence pass. Five refinements from the SOC 2 TSC coexistence research (see research/soc2-tsc-rubric-mapping.md), all anchor tightenings or documentation additions — no new criteria, no renumbering, no hard caps.
    • Philosophy — new Scope and boundary subsection. Two stance-claims made explicit: (a) the rubric is engineering-environment focused and does not replace compliance frameworks — coexistence, not convergence, with “not yet addressed” rather than “out of scope” framing for compliance-adjacent concerns; (b) engineering does not need PII — PII stays on the production-data side; engineering surfaces are PII-free by design, not by layered masking. Names the five criteria that implement the PII boundary (PL4-pii-masking, PL4-memory-safety, PL4-prompt-injection-defence, PL1-real-world-feedback, PL3-emission-quality).
    • PL3-memory-substrate — criterion description clarifies “customer context” as “customer-context references (pseudonymous; raw PII does not enter the memory substrate)”. Inline duplication of the PII boundary in the memory criterion; no cross-references added.
    • PL4-memory-safety — criterion description and level-2 add retention discipline as a fourth concern alongside hygiene, access control, and write-path validation. Retention framing is relevance-primary with time-bound backstops — the primary disposal trigger is relevance decay; time-bound floors and ceilings (from customer contracts, privacy obligations, regulated data) operate as backstops. Preserves the rubric’s freshness-over-time stance while accommodating commitment-driven retention where applicable.
    • PL5-multi-agent-delegation — level-2 adds segregation-of-approval-authority clause. Approval authority for material changes remains human; agent-to-agent approval permitted only within a platform-codified low-risk policy with audit trail. Extends PL4-branch-protection’s human-approval requirement to approval surfaces outside git (elevation requests, release promotions, production-impacting actions).
    • PL5-portfolio-skill-reuse — level-2 adds tenant-boundary discipline. Extracted skills operate on the abstract pattern; tenant-specific context (client names, negotiated decisions, proprietary patterns) stays within its tenant.
    • Compliance-adjacent questions surfaced by the research — not added to the rubric’s Open Questions (the rubric does not assume stakeholders have SOC 2 as a goal); captured in the alignment recipe instead. If any of them (e.g. incident response as a first-class capability) proves genuinely universal via independent motivation, it can be raised as a rubric open question from the rubric’s own perspective without SOC 2 citations. Rubric stays agentic-engineering-universal.
    • Out of scope for v0.22: no new criteria (recommendations R1–R3 from the research’s earlier draft, proposing criterion additions for risk assessment / third-party risk / incident response, were explicitly withdrawn — adoption rationale would have been SOC 2 alignment, which is not a valid reason to grow the rubric). Composition-rule additions to Pillar 4 intro considered and withdrawn — Risk Floor already carries that discipline categorically.
  • v0.21 — Engineering-surface PII discipline made explicit at ingestion and emission. Two criterion refinements surfacing a principle the rubric already held implicitly — engineering does not need PII; PII stays on the production-data side of the boundary between production and engineering surfaces. PL1-real-world-feedback level-2 now explicitly requires ingestion sanitisation against PII alongside the existing structure and instruction-shape requirements; cross-references PL4-pii-masking and PL4-prompt-injection-defence as the two layers of ingestion discipline. PL3-emission-quality criterion description and level-2 now require correlation identifiers to be pseudonymous tokens (user-ID, session-ID, request-ID), not PII-derived (email, phone, name); log payloads are scrubbed of PII at emission per PL4-pii-masking. No new criteria, no renumbering — these are cross-references and anchor clarifications that make the existing principle auditable at the criterion level. Surfaced by the SOC 2 TSC coexistence research (see research/soc2-tsc-rubric-mapping.md) while working through how PII retention interacts with rubric memory criteria; the research is also being updated to reflect the refined principle.
  • v0.20 — Stable slug references introduced. Every criterion now carries a PL<pillar>-<semantic-slug> reference (e.g. PL4-prompt-injection-defence for 4.5). Slugs are inline in each criterion row and listed in the new Criteria index section. Downstream artefacts (recipes, integrations, case studies, reviews) now reference criteria by slug rather than by number, so criterion renumbering no longer silently rots references. Motivated by dogfooding: recipe frontmatter used "4.5"-style numeric IDs, and the rubric’s own changelog (“renumbering ripples through Pillars 1, 4, 5”) surfaces the failure mode we wanted to avoid downstream. Pillar-prefix (PL1–PL5) chosen as the stability anchor — pillars themselves have been stable since v0.8, and a pillar-level renumbering would force a broader rewrite anyway. Historical changelog entries retain numeric references; all new cross-references and artefacts use slugs.
  • v0.19 — 4.5 Prompt injection defence scope narrowed. Criterion text changed from “all external content entering agent context” to “all external content entering persistent agent context.” Scope is durable ingestion paths (memory writes, indexed knowledge, unsupervised scheduled ingestion); interactive turn context in user-supervised sessions is out of scope — cooperative user is the defence layer, and blast radius of compromise is contained by Pillar 4 substrate (4.2 IAM scoping, 4.3 branch protection). Level anchors unchanged; cross-references to 1.10 / 5.7 / 4.10 still hold (all persistent-context surfaces). Surfaced while drafting the Ingestion as PR recipe — the recipe’s scope boundary needed a rubric anchor to match.
  • v0.18 — 5.7 Signal-driven task generation tightened. Level-2 now explicitly requires an agent-invokable scheduler (agent can create, edit, and cancel scheduled jobs through the project’s own tool surface, not just observe ops-configured cron); absence of the primitive caps the criterion at level 1 regardless of reactive-source coverage. Criterion intro also flags that scheduling is load-bearing for 2.3, 2.4, 2.10, 4.7, 5.1, and 5.8 — made explicit rather than scored as a standalone criterion, to avoid rubric-growth regression. Surfaced by dogfooding: attempting to schedule a Slack reminder from this repo exposed that the project has no agent-invokable scheduler, and that the rubric had been silently presupposing one across multiple criteria. Parallel to v0.17’s 4.7 refinement (agent-bounding platform constraints made explicit at level-2).
  • v0.17 — Major pass integrating Li Theen’s v0.16 review (see reviews/rubric-review-v0.16.md) plus structural refinements from triage. 50 criteria preserved; renumbering ripples through Pillars 1, 4, 5.
    • Pillar 1 Focus — NEW 1.1 Corpus taxonomy, filing, indexing (substrate for retrieval-dependent criteria). 1.2 renamed monorepo-aware → Codebase-aware scoping; grouped with 1.1 as substrate; strengthened — level-2 now requires structurally-enforced boundaries (sparse-checkout, MCP tool scoping, or equivalent), not declarative-only. 1.3 Task decomposition refined to require type-specific templates with acceptance-criterion fields (load-bearing for 5.5). 1.6/1.7 swapped so ADRs precede Design intent in the internal-knowledge group. 1.7 Runbooks + 1.8 User/admin/dev docs merged into 1.8 Documentation loop (ops + product). 1.10 refined to own the signal-quality dimension; ingestion automation routes to 5.7. Pillar intro now reads “substrate → task-level → external → internal → signal.”
    • Pillar 2 Validation — 2.1 renamed Hard pre-merge rules → Hard validation gates; level-2 requires pre-commit hooks alongside CI with matching ruleset, hook bypass caught by CI. 2.2 level-2 adds per-PR differential coverage threshold. 2.5 level-2 adds suppression accountability (rationale + named reviewer; expiry required only on high-severity suppressions). 2.9 level-3 adds audit-to-gate feedback loop.
    • Pillar 3 Actions — 3.9 level-2 requires at least one structural safety layer on outbound communication (allowlist of recipients/channels, content filter, dry-run default, rate limiting, OR human approval for sensitive categories). 3.1 cross-reference updated for Pillar 4’s renumbering.
    • Pillar 4 Safe Space — pillar intro adds the safety-composition principle (“Safety is a composition of mechanisms, not a single gate”). 4.1 Staging-isolation + 4.2 Load-testing replica merged into 4.1 Environment isolation. NEW 4.3 Branch protection and source-control write scoping (max 2), including platform-enforced freshness-at-merge. NEW 4.5 Prompt injection defence at ingestion boundary — unified sanitisation policy applied consistently across 1.10, 5.7, 4.10. 4.7 Canary / blue-green refined with agent-bounding platform constraints (parameter caps, immutable pipeline stages, platform-verified metric gates). 4.9 Memory hygiene + 4.10 Memory access merged into 4.10 Memory safety, now also covering write-path validation. Max-2 criteria become 4.3 and 4.8 (were 4.1 and 4.7).
    • Pillar 5 Workflow — NEW 5.5 Spec-first agent loop — implementation tasks enter the loop with an executable acceptance criterion before code generation (depends on 1.3 type-specific templates). 5.4 Multi-agent delegation strengthened — level-2 now requires differentiated full-stack roles (context scope per 1.2, tools, permissions, skills, prompts); prompt-only role separation drops to level-1. 5.5 PR self-presentation renamed to 5.6 PR reviewability; level-2 adds branch-currency requirement; level-3 reframed from process claim (“presentation quality improves”) to outcome (Glance Threshold trends down) with PR freshness across the review lifecycle. 5.6 Periodic loops + 5.7 Signal ingestion merged into 5.7 Signal-driven task generation (proactive + reactive sources both contribute). 5a now has 6 criteria, 5b has 4.
    • Cross-pillar structure — Compounding Index denominator unchanged at 46/50 (max-2 criteria: 2.1, 2.6, 4.3, 4.8). Total max unchanged at 146.
  • v0.16 — Broadened three Actions criteria to dissolve artificial read/write splits. 3.1 (DB access) → Structured state read access, covering app DB and infra-as-code state. 3.5 → Source control interaction, covering both git write ops and PR metadata reads. 3.7 → Deployment and CI/CD interaction, covering both CI invocation and result reading. 3.1 also tightened to explicitly delegate PII/IAM concerns to Safe Space 4.3/4.4 via cross-reference, preventing scorer double-count. Added “What scoring requires” note explaining that Actions provides the operational access scoring itself needs. No new criterion, no count change.
  • v0.15 — Clarified 2.8 (human taste validation: research services or canary/A-B on real users) and 1.10 (bug reports as enriched real-world signal, belonging in Focus not Validation).
  • v0.13 — Major reorganisation. Focus reordered as task-level → external → internal → signal, and gained 1.9 Client/stakeholder context. Validation reordered by frequency of execution (load test last). Actions reordered read → write → meta, with the a/b numbering artefacts removed. Cross-references updated throughout. 49→50 criteria, 143→146 max.
  • v0.12 — Merged 5.1 (task inbox) and 5.2 (webhooks) into a single Pipeline reliability criterion (they double-counted). Added 5.2 CI/CD pipeline health as a distinct concern from what CI runs.
  • v0.11 — Added 3.10 Skill library health (project-level). Distinct from 5.10 which is portfolio-level. Pillar symmetry hit — four of five pillars at 10 criteria.
  • v0.10 — Added 1.9 Primary source access. Scores whether upstream docs (Apple, OCPI/OCPP, vendor SDKs) are in agent-consumable form. Addresses the silent-failure mode of training-cutoff fallback.
  • v0.9 — Pillar gap analysis. Focus gained 1.7 (ADRs) and 1.8 (design intent). Actions gained 3.6 (communication), 3.7 (source control), 3.8 (browser/web, favouring deterministic scripts over runtime AI). 42→47 criteria.
  • v0.8 — Pillar rename to final form: Focus, Validation, Actions, Safe Space, Workflow.
  • v0.7 — Tightened 2.2 to specify colocation. Added 2.2b test quality verification (mutation testing).
  • v0.6 — Dropped risk-weighted scoring. Replaced with mandatory per-pillar reporting + categorical Risk Floor rule (Exposed / Capable / Mature). Eliminates second-number overhead and arithmetic awkwardness (37.5 rounding) while preserving the diagnostic signal.
  • v0.5 — Scale change: 0/1/2 → 0/1/2/3, where 3 = “Compounding” (improves with use). Embeds memory and learning into every criterion rather than a separate pillar. (max 2) tag for criteria where compounding isn’t structurally meaningful. Added Compounding Index as fourth meta-metric.
  • v0.4 — Major loop-closure pass. Added outcome→input loop (5.8), PR self-presentation (5.5), real-world signal ingestion (5.7), cost governance (4.8), agent action audit trail (2.9). Pillar 5 split into 5a pipeline mechanics and 5b compounding loop.
  • v0.3 — Gentari added as distinct pilot (client-deployment scoring of same codebase as SinarAI/Surge). Divergence between the two scores is diagnostic.
  • v0.2 — Resolved observability placement: split across Actions (emission + queryability) and Safe Space (cost-governed logging + PII-safe telemetry).
  • v0.1 — Initial draft synthesised from Agentic Engineering discussion (Granola, 16 Apr 2026).