Agent Workflow Kit
Chapter 10
Technical engraving of a mechanical verification gate moving paper through inspection stations
Fig. 10The verification gate.

10 — Verification gates

When to read this: When you want to understand why the kit ships four read-only review subagents instead of one big "review my code" command. Conceptual + reference. Read it once. Return to it when you're configuring or extending the gates.

The core idea: gates aren't substitutable#

A verification gate is a check that runs before work is considered done. The kit's central claim is that different kinds of work need different kinds of gate, and they aren't interchangeable.

Specifically:

  • Tests pass verifies that the code does what its tests say. It does not verify that the UI looks right, that a migration is safe at production scale, or that an API endpoint correctly checks tenant ownership.
  • A clean visual rubric verifies that UI matches DESIGN.md. It does not verify that the API behind the UI is sound or that a schema change is safe.
  • An architecture review verifies that the system shape stays coherent. It does not verify that the per-line SQL in a migration won't lock a hot table.
  • A migration review verifies the per-line DDL safety. It does not verify that the architectural strategy was right at all.

You can't ship a UI feature on the strength of a passing test suite. You can't ship a schema migration on the strength of a clean architectural review. Each layer needs its own check.

The kit ships four read-only review subagents, each owning a different gate:

SubagentLayerAnchored against
design-reviewerUI vs DESIGN.md, multi-viewport visualDESIGN.md
architecture-reviewerSystem shape, cross-cutting concernsdocs/ARCHITECTURE.md
migration-reviewerPer-line DDL safetydocs/ARCHITECTURE.md §1 (stack), §4 (migration strategy), §5 (bets)
api-reviewerPer-endpoint completeness and safetydocs/ARCHITECTURE.md §3 (API conventions), §4 (auth, errors, rate limiting)

Plus a fifth subagent — agents-memory-updater — that handles continual memory rather than verification (covered in Chapter 12).

A few rules apply to all four reviewers:

  • They are read-only. They never edit code. They produce reports with specific diff suggestions; you (or the parent agent) apply the diffs.
  • They anchor against written-down rules. Findings cite specific DESIGN.md or ARCHITECTURE.md sections. Without those docs, the reviewers fall back to a code-only sanity pass and announce that fact at the top of the report.
  • They never approve work that violates a 🔴 BLOCKING finding, even if you push back. Either the code is wrong or the rule is wrong. Both require a deliberate update. Not silent acceptance.
  • They use a consistent severity taxonomy:
SeverityMeaningAction
🔴 BLOCKINGDirect conflict with a written ruleCannot ship. Fix the code or fix the rule.
🟡 NEEDS DECISIONDeviates from the rule but the deviation might be intentionalYou decide: revert, or update the rule + log the trade-off
🟢 ADVISORYFine, but a related improvement is recommendedOptional

How to invoke the reviewers#

All four reviewers are invoked the same way: via the Agent tool, either by you or by the parent agent.

By you, in chat:

have the design-reviewer check this

review the migration with migration-reviewer

ask the api-reviewer to check the new endpoints

By the parent agent during a workflow: /craft-ui Phase 8 invokes design-reviewer automatically. The kit's CLAUDE.md template instructs the parent agent to invoke architecture-reviewer / migration-reviewer / api-reviewer after relevant changes without asking. No behavioral difference between user-triggered and agent-triggered invocations.

Run reviewers in parallel when their concerns don't overlap. A feature that adds both a migration and a new endpoint can run migration-reviewer and api-reviewer at the same time.

design-reviewer#

File: .claude/agents/design-reviewer.md after bootstrap.

What it checks (the rubric):

  • Aesthetic match. Does the work reflect the named direction in DESIGN.md? Specific lines from DESIGN.md are cited when calling regression toward generic.
  • Forbidden defaults. Inter / Roboto / Open Sans / Lato / Arial / system stacks; purple-on-white gradients; generic SaaS blue/teal; three-column feature grids with identical card heights and centered icons; Material default easing cubic-bezier(0.4, 0, 0.2, 1); hover scales for decoration; animations over 400ms for state change. Plus every entry in DESIGN.md's project-specific forbidden section.
  • Token discipline. Greps the changed files for hex literals in className strings, bg-{color}-{n} Tailwind utilities (red, blue, zinc, etc.), text-{color}-{n}, border-{color}-{n}, hardcoded font stacks. Each is a violation.
  • Type contrast. Display vs body distinguishable at a glance. Weight extremes used per DESIGN.md (typically 200/800, not 400/600).
  • Spatial rhythm. Spacing values come from the scale. Vertical rhythm clean.
  • State coverage. Every interactive element has visible hover, focus-visible at 3:1 contrast minimum, active, disabled. Loading, error, and empty states present where applicable.
  • Motion sanity. Durations under 400ms for state. Only transform and opacity animated. prefers-reduced-motion respected.

Stress tests for interactive components:

  • Click rapidly 10x on the primary action. Anything shift unexpectedly?
  • Tab through with keyboard. Focus order logical? Indicators visible at every stop?
  • Trigger an error state. Does the UI degrade locally or block everything?
  • Resize to 320px width. Layout survives?
  • Throttle to slow 3G in browser devtools. Is the loading state better than no state?

Visual capture: if Playwright MCP is available and a dev server is running, the reviewer captures screenshots at 375px, 768px, 1280px, and 1920px. If not, it produces a code-only review based on the diff and announces that at the top of the report.

To enable Playwright MCP:

claude mcp add playwright -- npx @playwright/mcp@latest

Output format:

## Design Review: [feature or route name]
 
### Verdict
[PASS | NEEDS CHANGES | FAIL]
 
### Aesthetic match
[1–2 sentences. Quote specific DESIGN.md lines.]
 
### Violations
- [file:line — specific violation, with the rule it breaks]
 
### Diff suggestions
- `path/to/file.tsx:23` — replace `bg-zinc-900` with `bg-[--color-bg-elevated]`
- `app/page.tsx:88` — body leading is 1.4; DESIGN.md specifies 1.6 for body
- Hero animation duration is 600ms; DESIGN.md ceiling is 400ms
 
### State coverage
[List of states verified or missing per component touched.]
 
### Stress test results
[If interactive. Otherwise omit.]
 
### Screenshots captured
[List of viewports successfully captured, or note if Playwright was unavailable.]

architecture-reviewer#

File: .claude/agents/architecture-reviewer.md after bootstrap.

Anchor: docs/ARCHITECTURE.md. If the file doesn't exist, the reviewer:

  • Says so at the top of the report.
  • Asks whether the project is genuinely architecture-light (static site, simple CLI, library) or whether ARCHITECTURE.md should exist.
  • For architecture-light projects, produces a code-only sanity pass. Flags any obvious cross-cutting concerns introduced (auth, persistent storage, payments, queues) that should trigger a real ARCHITECTURE.md.

What it checks (the rubric):

  • Stack alignment. Do imports / config / dependencies match §1? Flag any new library that wasn't authorized. Verify against the spec's Allowed packages line.
  • Data model. Do new entities / FKs / columns / indexes / migrations match §2? Identity strategy consistent, multi-tenancy column present, soft-delete handling correct, time fields correct, ON DELETE behavior matches.
  • Service shape. Module boundaries (§3) respected? No reaching across boundaries. New API surface introduced without an ARCHITECTURE.md update is a violation.
  • Cross-cutting concerns. Auth at the documented enforcement layer. Errors thrown from defined classes. Logging matches the format. Caching obeys §4. Background work registered with the documented queue + idempotency strategy. Secrets fetched via the documented mechanism.
  • Migration & rollout. Forward-only vs expand-contract pattern matches §4 strategy. Backfill stated and reversible. Rollback story described. Migration is idempotent.
  • Evolution / bets. Does the change conflict with a stated bet in §5? E.g. "users won't exceed 10k rows per tenant" — does the new query break if they do?
  • Trade-off log freshness. Does an architecturally material change have a §6 entry? If not, the deviation is undocumented even if it's right.

Output format: standard verdict + sections + 🔴 / 🟡 / 🟢 findings + suggested diffs or trade-off log entries.

A representative finding:

🔴 BLOCKING
- src/server/queries/getUserDashboard.ts:14 — query lacks `tenant_id` filter.
  ARCHITECTURE.md §2 commits to row-level multi-tenancy on every query path.
  Fix: add `where(eq(users.tenantId, ctx.tenantId))` or use the project's
  scoped-query helper.

migration-reviewer#

File: .claude/agents/migration-reviewer.md after bootstrap.

Lane: This reviewer's sole concern is production safety of this migration. Strategic questions ("should we even use Postgres?") belong to architecture-reviewer. Code quality belongs to pre-commit-review. Visual review belongs to design-reviewer. The migration-reviewer stays in lane.

Anchor: docs/ARCHITECTURE.md §1 (stack — what database, what version, what host), §4 (Migration & deploy strategy), §5 (Bets — especially scale assumptions). If absent, the reviewer asks for the database engine + version explicitly. Lock-impact rules vary across Postgres, MySQL, SQLite, Cloud Spanner.

Per-statement DDL review:

  • Lock acquisition. For each DDL, name the lock acquired (ACCESS EXCLUSIVE, SHARE, ROW EXCLUSIVE, none) and whether it blocks reads, writes, or both.
  • Backfill cost. Row count estimate, batch strategy, idempotency, default-value cost on column add (volatile vs constant defaults have different lock profiles in modern Postgres).
  • NOT NULL adds. Direct SET NOT NULL on a populated column = 🔴. Safe pattern: nullable → backfill → CHECK NOT VALID → VALIDATE CONSTRAINT → optionally SET NOT NULL.
  • FK adds. NOT VALID then separate VALIDATE CONSTRAINT. Index on referencing column required. ON DELETE behavior matches §2.
  • Index hygiene. Every FK has an index. CONCURRENTLY on hot Postgres tables. No redundant indexes. Composite index column order matches actual query patterns.
  • Rename safety. Single-step renames are deploy-time race conditions unless the application is offline. Safe pattern: expand-contract.
  • Transaction wrapping. CREATE INDEX CONCURRENTLY cannot be in a transaction. Long backfills shouldn't hold a transaction open.
  • Rollback story. Reversible / forward-only / irreversible — must match what the spec declares.
  • Data integrity at the boundary. CHECK constraints with NOT VALID then VALIDATE. Triggers idempotent. Enum changes safe.
  • Cross-tenant exposure. Multi-tenant projects: every new table needs tenant_id + index, every new query path needs tenant_id in the WHERE clause, RLS policies (if used) added in the same migration.

The output is shaped to be actionable:

🔴 BLOCKING
- db/migrations/20250320_add_org_id.sql:7 — `ALTER TABLE users ADD COLUMN org_id UUID NOT NULL DEFAULT gen_random_uuid()` will rewrite the table under ACCESS EXCLUSIVE for the duration. ARCHITECTURE.md §5 bet: "we expect 100k+ rows in users by P3" — this will lock for several minutes.
 
Suggested rewrite (split into 4 migrations):
  20250320_a_add_org_id_nullable.sql
    ALTER TABLE users ADD COLUMN org_id UUID;
  20250320_b_backfill_org_id.sql
    -- batch in 5k chunks via app code or pg_cron
  20250320_c_add_check_not_valid.sql
    ALTER TABLE users ADD CONSTRAINT users_org_id_not_null CHECK (org_id IS NOT NULL) NOT VALID;
  20250320_d_validate_check.sql
    ALTER TABLE users VALIDATE CONSTRAINT users_org_id_not_null;

The reviewer always provides copy-paste-ready safer DDL for each 🔴 finding, including the multi-step pattern split into separate migrations if needed.

api-reviewer#

File: .claude/agents/api-reviewer.md after bootstrap.

Lane: Per-endpoint completeness and safety. Strategic API design ("REST or RPC?") belongs to architecture-reviewer. Schema changes to migration-reviewer. Code quality to pre-commit-review. Visual review to design-reviewer. The api-reviewer stays in lane.

Anchor: docs/ARCHITECTURE.md §3 (API style, error envelope, pagination, idempotency, versioning) and §4 (auth & authorization, error handling, rate limiting, logging). If absent, the reviewer falls back to universal rules (authorization granularity, input validation, mass-assignment safety, webhook signature verification, open-redirect/SSRF safety, status-code correctness) and recommends running architecture-md-builder.

The most common 🔴 finding in AI-generated APIs isn't "missing auth." It's broken access control with granularity. The endpoint authenticates the user, but it doesn't check whether the user belongs to the resource being queried. Example:

// 🔴 BLOCKING — missing granularity check
export async function GET(req: Request, { params }: { params: { orgId: string } }) {
  const user = await getCurrentUser();
  if (!user) return new Response("Unauthorized", { status: 401 });
  const org = await db.org.findUnique({ where: { id: params.orgId } });
  return Response.json(org);
}

This authenticates the user, then returns any org the user requests by ID. The fix is verifying membership:

// ✅ Safe
const membership = await db.orgMember.findFirst({
  where: { orgId: params.orgId, userId: user.id },
});
if (!membership) return new Response("Forbidden", { status: 403 });

The full rubric:

  • Authorization granularity. Auth check verifies the user belongs to the specific resource. Not just that the user exists.
  • Multi-tenant filtering. Every query in the handler filters by tenant_id (or equivalent) if the project commits to row-level multi-tenancy.
  • Role checks. Admin-only endpoints verify role, not just authentication.
  • Internal-only endpoints rejected at the platform layer, not by URL obscurity.
  • Input validation. Schema-validated (Zod / Yup / Valibot). Mass-assignment safe — no db.update({ ...req.body }). String length limits. Array bounds.
  • Idempotency. POST endpoints with external side effects (charging, email, third-party API) need idempotency keys.
  • Rate limiting. Per-route, per-user, scoped where ARCHITECTURE.md §4 says.
  • Error envelope. Matches the shape documented in §3 / §4.
  • Status codes. 201 on create, 422 (or 400) on validation, 401 vs 403, 429 with Retry-After, 409 on conflict.
  • Response shape consistency. Pagination format consistent. No leaked internal fields (password_hash, internal_notes).
  • URL safety. Open redirect — any user-supplied URL in Location or redirect() validated against an allowlist. SSRF — any user-supplied URL fetched server-side validated; private IP ranges disallowed.
  • Webhook handlers — signature verification before parsing, replay protection, event-ID idempotency, async-safe (queue work >5s).
  • Audit / logging. Mutation handlers log actor + action + resource ID per §4.

🔴 findings block the kickoff verification gate. The reviewer produces copy-paste-ready safer code in the project's existing style.

When to skip a reviewer#

The reviewers exist to catch specific failure modes. Some changes don't trigger any of them:

  • Pure refactor with no behavior change. No design, architecture, migration, or API surface to review. Tests pass + pre-commit-review covers it.
  • Doc edits. None of the four reviewers apply.
  • Config tweaks that don't touch the runtime. Same.

Don't run a reviewer just because the kit ships one. Run the reviewer whose lane the change is in.

When in doubt, the kit's CLAUDE.md template gives default rules:

After any UI change, invoke the design-reviewer subagent before considering the work done.

After any architecturally-loaded change (schema, service boundaries, auth, caching, queues, migrations, public APIs, new cross-cutting libraries), invoke the architecture-reviewer subagent before considering the work done.

After any database schema migration is added or modified, invoke the migration-reviewer subagent before considering the work done.

After any HTTP endpoint, server action, RPC handler, or webhook receiver is added or modified, invoke the api-reviewer subagent before considering the work done.

These rules ship in the bootstrapped CLAUDE.md, so the parent agent reads them every session.

Why a reviewer might say PASS when the work is actually bad#

A reviewer can only catch what's anchored against a written rule. If your DESIGN.md is sparse, design-reviewer has nothing to anchor regression against. If your ARCHITECTURE.md doesn't commit to multi-tenancy, architecture-reviewer can't flag missing tenant filters as 🔴.

When a reviewer says PASS but the work feels off:

  • The doc is too vague. Sharpen via design-md-builder / architecture-md-builder / /forbid / architecture-review.
  • The reviewer's rubric is missing a check that matters for your project. Edit the subagent's markdown file directly. They're just files. Chapter 13: Extending the kit covers extending review rubrics.

Building your own reviewer#

The four shipped reviewers are templates. If your project has a domain-specific concern that's not covered — a permissions model, a billing pipeline, a particular library's pitfalls — write your own.

A reviewer subagent is a markdown file in .claude/agents/<name>.md with:

  • A frontmatter blockname, description (when to invoke it), tools (typically read-only: Read, Grep, Glob, Bash for git diff), model.
  • A role statement — "You are a senior X doing read-only review."
  • A lane statement — what's in scope and what isn't.
  • A workflow — read these files, identify what changed, apply this rubric.
  • A rubric — specific checks with severity rules.
  • An output format — typically verdict + section breakdowns + 🔴 / 🟡 / 🟢 findings + suggested diffs.
  • A what-you-never-do list — guardrails: never edit, never approve 🔴 findings, never use vague feedback, stay in lane.

Copy one of the shipped reviewers (api-reviewer.md is the most thorough template) and adapt the rubric. Chapter 13 walks through this.

Continue#

Next: Chapter 11: Token discipline goes deep on one specific check that runs across all UI work: semantic-token enforcement.