by FXBITS

kata: a governed model for AI software delivery


型 (kata) means a form perfected through repetition: deliberate, correct, in order. It is the discipline we wrap around AI delivery. Five layers, each owning one responsibility and shipping one customer-readable artifact. Together they form a loop that compounds, each cycle better than the last.

This is the model, explained the way we present it to a room.

Most teams bolt AI onto the typing step

It feels fast. But velocity was never the constraint. Cost of change is.

Clarification, rework, review, traceability: that is where delivery effort actually sits, and where ungoverned AI quietly adds risk. Scope wanders with no one watching. Decisions get discovered late, during the build. And you are left with nothing you can put a fixed price on.

A faster way to type does not fix any of that. It just produces the wrong thing sooner.

Govern the whole workstream, not just the code generation

Speed is not the lever. Clarification, alignment, guardrails and telemetry are.

kata makes each of them a deliberate, auditable step, so AI work becomes predictable, defensible engineering instead of a slot machine. One form every piece of work flows through.

One responsibility each. One artifact each.

Five layers. Each owns a single job and ships a single thing a non-engineer can read.

L1 · Engagement Context · /context

Ground the work in business intent. One shared project space: product overview, KPIs, glossary, a short constitution. Plain documents the team already trusts. The AI reads intent and tickets on demand, read-only. No database, no embeddings on day one. It owns business intent, constraints, and the KPIs you will be measured on, and ships a context doc anyone in the room can read.

L2 · Spec Engineering · spec + PR

Lock alignment before any code. A signal arrives. The AI self-grills against existing context, resolves what it can, flags the rest. The PM curates; tech and compliance approve; the spec freezes as a contract in Git. Build refuses to start without it. It owns acceptance criteria, written and reviewed like code, and ships a merged PR whose signatures are the deliverable.
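The "build refuses to start without it" rule can be pictured as a small gate check. What follows is a minimal sketch, not the kit's implementation: the `Spec` fields, the `REQUIRED_APPROVERS` set and the `can_start_build` function are all hypothetical names chosen for illustration, and the only facts taken from the text are that the spec is merged in Git and that tech and compliance approve it.

```python
from dataclasses import dataclass

# Assumption: the approver roles named in the text, modelled as plain strings.
REQUIRED_APPROVERS = frozenset({"tech", "compliance"})


@dataclass(frozen=True)
class Spec:
    """A spec as it might look at build time (all fields are illustrative)."""
    spec_id: str
    merged: bool                       # True once the spec PR is merged in Git
    approvals: frozenset = frozenset()
    open_questions: tuple = ()         # ambiguities flagged but not yet resolved


def can_start_build(spec):
    """Return (allowed, reason). Build starts only on a merged, approved spec."""
    if not spec.merged:
        return False, "spec PR is not merged"
    missing = REQUIRED_APPROVERS - spec.approvals
    if missing:
        return False, "missing approvals: " + ", ".join(sorted(missing))
    if spec.open_questions:
        return False, f"{len(spec.open_questions)} open question(s) remain"
    return True, "spec is frozen"
```

Returning a reason alongside the verdict keeps a refusal readable by a non-engineer, which fits the one-artifact-per-layer idea.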

L3 · Agentic Execution · provenance log

Build it tests-first, at the trust level you choose. One agent reads the acceptance examples, writes failing tests, makes them pass, refactors. It owns red → green → refactor with tests leading the build, and ships human-approved PRs behind a feature flag, every action traced to an acceptance criterion.

L4 · Runtime Guardrails · audit pack

Safety wired once, firing on every action. Four guards, configured at setup, invisible during clean builds: a tool allowlist (deny by default), a PII/PHI scanner before any LLM call, an append-only action log, and a feature flag with a kill switch that takes effect in seconds. It ships evidence a regulator could check, generated, not authored, at near-zero cost per release.
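Three of the four guards can be sketched in a few lines. The example below is illustrative only and assumes its own names throughout: `GuardedToolRunner`, the `enabled` attribute standing in for the feature flag, and the JSON-lines log format are not taken from the kit. The PII/PHI scanner is left out to keep the sketch short.

```python
import json
import time


class GuardedToolRunner:
    """Hypothetical wrapper: deny-by-default allowlist, kill switch, append-only log."""

    def __init__(self, allowed_tools, log_path):
        self.allowed_tools = frozenset(allowed_tools)  # anything not listed is denied
        self.log_path = log_path
        self.enabled = True  # stand-in for the feature flag / kill switch

    def _log(self, record):
        # Mode "a" only appends; this code never rewrites earlier entries.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")

    def run(self, tool_name, criterion_id, action):
        """Run `action` (a zero-argument callable) if every guard allows it."""
        if not self.enabled:
            decision = "blocked:kill_switch"
        elif tool_name not in self.allowed_tools:
            decision = "blocked:not_allowlisted"
        else:
            decision = "allowed"
        # Every attempt is logged, including the blocked ones.
        self._log({"ts": time.time(), "tool": tool_name,
                   "criterion": criterion_id, "decision": decision})
        if decision != "allowed":
            raise PermissionError(decision)
        return action()
```

Logging the decision before acting means a blocked attempt still leaves a trace, and tagging each record with an acceptance-criterion ID is one way the log could double as audit evidence.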

L5 · Outcome Telemetry · variance brief

Close the loop on what you can see. App telemetry, feature usage, and delivery KPIs on one dashboard. A weekly 30-minute review on an AI-drafted brief feeds variance back into the next cycle. It owns the L1 KPIs, measured live against a real baseline, and ships a weekly variance brief that catches drift early.

The line is always drawn: intent → L1 → L2 → L3 → L4 → L5 → outcome. The audit trail is not a deliverable you add at the end. It is the path itself.

Autonomy is earned, not assumed

Trust is opt-in on a five-rung ladder, per workstream.

M0 Manual · M1 Assisted · M2 Augmented · M3 Orchestrated · M4 Autonomous

M3 is the day-one default: the agent orchestrates, a human reviews every PR. M4 unlocks only after the telemetry earns it. You move a workstream up the ladder when the numbers say it is safe, not because a slide said so.
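The ladder and its promotion rule can be written down as a tiny state model. This is a sketch under stated assumptions: `TrustLevel`, `next_level` and `human_review_required` are hypothetical names, `telemetry_clean` is a placeholder boolean for whatever threshold a team agrees on, and treating every rung below M4 as human-reviewed is an assumption (the text only states it for M3).

```python
from enum import IntEnum


class TrustLevel(IntEnum):
    M0_MANUAL = 0
    M1_ASSISTED = 1
    M2_AUGMENTED = 2
    M3_ORCHESTRATED = 3
    M4_AUTONOMOUS = 4


DAY_ONE_DEFAULT = TrustLevel.M3_ORCHESTRATED


def next_level(current, telemetry_clean):
    """Move up one rung only when telemetry supports it; otherwise stay put."""
    if telemetry_clean and current < TrustLevel.M4_AUTONOMOUS:
        return TrustLevel(current + 1)
    return current


def human_review_required(level):
    # Assumption: every rung up to and including M3 keeps a human on each PR.
    return level <= TrustLevel.M3_ORCHESTRATED
```

Because the level is tracked per workstream, one such value would be stored for each workstream rather than once for the whole engagement.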

The outcome: the Triple Thirty

Three targets we commit to and measure against your own baseline over a 12-week pilot. Not a claim about past clients.

+30% features: 20–30% more features shipped per team, per quarter.

−30% defects: a 20–35% reduction in defect escape.

−30% variance: estimate vs actual within ±15%.
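The ±15% band is simple arithmetic. A minimal sketch follows, assuming variance is measured per feature as a fraction of the estimate and that both numbers share a unit such as hours; the function names are hypothetical.

```python
def estimate_variance(estimate, actual):
    """Signed variance of actual against estimate, as a fraction of the estimate."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return (actual - estimate) / estimate


def within_band(estimate, actual, band=0.15):
    """True when the actual lands inside +/- band of the estimate."""
    return abs(estimate_variance(estimate, actual)) <= band
```

For example, a feature estimated at 40 hours that takes 45 has a variance of 0.125 and sits inside the band, while one that takes 50 has a variance of 0.25 and does not.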

Non-negotiable baseline underneath the targets: 100% audit-pack coverage, 100% rollback-drill success. The hygiene is not optional. The percentages are what we are aiming at, on your work, in the open.

What we don't build on day one

Day one runs on the tools your team already trusts. No vector store. No embeddings. No policy engine. No multi-agent orchestration.

We add each one only when a specific pain shows up, not when it is fashionable. Deferring the heavy thing until it earns its place is the same discipline the methodology preaches. Restraint is the credibility signal.

Provable delivery for high-stakes work

Fintech, health, energy, public sector. Provenance, not promises.

Data sovereignty. The model runs in your cloud, in-region. Personal data is blocked before it reaches the LLM; only hashed IDs and codes leave the network.
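One way to picture "only hashed IDs and codes leave the network" is an outbound filter that drops every field it does not recognise. The sketch below is an assumption-heavy illustration: the field names, `SAFE_FIELDS`, `ID_FIELDS` and both functions are hypothetical, and the keyed hash (HMAC-SHA256) is our choice for the example, since the text only says "hashed".

```python
import hashlib
import hmac

# Hypothetical field lists: codes that may leave as-is, and IDs that must be hashed.
SAFE_FIELDS = frozenset({"booking_status", "currency", "error_code"})
ID_FIELDS = frozenset({"user_id", "booking_id"})


def pseudonymize(value, secret_key):
    """Keyed hash (HMAC-SHA256) so the same ID always maps to the same token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def outbound_payload(record, secret_key):
    """Keep safe codes, hash IDs, and drop every other field (deny by default)."""
    out = {}
    for key, value in record.items():
        if key in SAFE_FIELDS:
            out[key] = value
        elif key in ID_FIELDS:
            out[key] = pseudonymize(str(value), secret_key)
        # Anything else, such as names or emails, is silently dropped.
    return out
```

A keyed hash keeps the mapping stable across requests, so records can still be joined, while the key itself stays inside the network.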

Tool composability. Reference integrations ship in the kit: wiki, tracker and scanner, ready on day one.

Trust. Opt-in M0–M4 ladder. Human review on every PR by default; autonomy only after clean telemetry.

Audit overhead. Generated, not authored. Near-zero cost per release once the action log is wired.

Where a request is actually decided

Two requests, two very different shapes, the same five layers. The work that decides whether either ships well happens long before the typing.

The clean ask. "Let people cancel a booking and get their deposit back." Looks like one payment-provider call. It is not. Eight decisions surface in L1, and not one is a coding problem: cancellation tiers, no-show handling, who eats the processing fee, and force majeure among them. L2 turns every ambiguity into a testable line; owner, finance, legal and ops sign off; merging the PR locks the rulings. L3 builds tests-first, one per criterion, inventing zero policy. L4 keeps personal data away from the LLM, the provider on an allowlist, every refund logged. L5 watches resolution rate, time-to-refund, dispute rate and revenue retained against the L1 baseline. Once the contract is locked, there is nothing left to get wrong.

The ambiguous ask. "Move every app onto one federated login, so a user signs in once and reaches the whole suite." Looks like swapping a login button. The request changes its mind four times. Existing users cannot be assumed to exist in the new provider, so L1 reopens for a reconciliation pass. The invisible cutover gets revised into a scheduled window with comms and one forced re-auth. QA finds logout leaves the provider session alive: bug → refine → fix → re-test, inside the locked contract. Telemetry later shows a locked-out cohort, and the owner reopens L1 for a self-serve account-claim path.

Eleven exchanges, four of them going backward. Each a decision revised before it could cost a rebuild.

The cost of change is lowest before the first line of code

A clean ask runs almost straight. An ambiguous one changes its mind four times, and the layers absorb it for the cost of a conversation, not a rebuild. L5 feeds the next L1. The model is a loop, not a line.

From here to validated, in 12 weeks

01 · Diagnostic (wk 0–2). Kit forked, L1 wired, baseline captured. You walk away with the kit either way.

02 · First feature (wk 3–6). One feature through all five layers, behind a flag.

03 · Validation (wk 7–12). Three to five features. The Triple Thirty, measured against baseline.

04 · Expand (wk 13+). Deepen, broaden, or extend. Your call.

Diagnostic and pilot are separate fixed-price contracts. You only proceed if week 2 earns it.

That is kata. A form every piece of work flows through, so AI delivery becomes something you can commit to.
