How we built the kata site (a field report from the token mines)

So I built a one-page site. One. Page.

It cost nearly 5 million output tokens.

Two days of wall-clock time, about twenty hours of it actually hands-on, and a level of stubbornness I am not proud of. This is the retrospective for the people who will actually enjoy it: CTOs who want to know what building-with-AI costs in practice, and fellow devs and designers who have also screamed at a ScrollTrigger at 3am. No framework evangelism. Just the receipts.

One clarification before those receipts, because it changes how you read them. "I" did the deciding; an AI coding agent did the typing. Two days, one human pointing, one model editing, 485 edits deep. When I say "we" below, that is the pair I mean. (Everything here is measured from the actual session logs. I went and counted. It was worse than I thought.)

First, the number that broke my brain

Output tokens, the stuff the model actually wrote: ~4.9 million. At the usual rule of thumb of roughly three-quarters of a word per token, that is around 3.7 million words, or more than 7,000 printed pages at about 500 words each, to produce a landing page you can read in ninety seconds.

But output is the cute number. The real one is total token traffic: over 425 million. Almost all of it, around 400 million, was cache reads: the model re-ingesting the growing codebase and our entire rambling conversation on nearly every single turn. Cache reads ran about 83 times the volume of new output. (Those figures are a snapshot. Writing this very post nudged them higher, which is its own kind of funny.)

If you take one technical thing from this post, take that ratio. When people say AI coding is cheap, they are looking at output. The actual cost center is context: the model paying rent to remember what it already built, over and over, every turn. The bill is not what it types. It is what it has to re-read to type it.

A detour, because most people only know one kind of token

When people say "tokens," they picture one number. There are actually four on the bill, and the gap between them is the whole story. Quickly, with the real counts:

Output, ~4.9M. What the model actually wrote: every line of code, every reply, every command. The work it produced. This is the number everyone quotes, and it is one of the smallest on the bill.

Input, ~2.4M. Fresh text sent in that the model had never seen before: my new messages, a file opened for the first time.

Cache creation, ~16.7M. Here is where it gets interesting. A language model has no memory between turns. To keep going, it has to be re-shown everything so far, every single turn. Re-sending all of that as fresh input would bankrupt you, so the system stashes it in a cache. This number is the one-time cost of stashing it.

Cache read, ~404M. The monster. Every turn, the model re-reads the entire stashed pile: the system prompt, the whole conversation, every file of the growing codebase. It is cheap per token, about a tenth the price of fresh input, which is the only reason any of this is affordable. But it is paid again on every turn, and the pile grows all day.

The picture that makes it click: imagine writing a novel where, before you add each sentence, you must re-read the entire book from page one. The sentence you add is the output, tiny. The re-reading is the cache, enormous, and it gets heavier with every page. By the three-thousandth turn, the model is re-reading a whole codebase just to append a button.
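If the novel picture is still fuzzy, a toy model makes the shape obvious. This is a sketch, not a reconstruction of the real sessions: the function name, the turn counts, the starting context, and the per-turn growth are all made-up illustration values, and it treats everything appended to the context as output, which real sessions do not (tool results and my messages land there too).

```typescript
// Toy model: each turn re-reads the whole context so far, then appends new tokens.
// All parameters are illustrative assumptions, not figures from the session logs.
interface ReReadTotals {
  totalOutput: number;    // tokens written across all turns
  totalCacheRead: number; // tokens re-read across all turns
}

function simulateReReads(
  turns: number,
  startingContext: number, // e.g. system prompt plus initial files (assumed)
  growthPerTurn: number    // tokens appended to the context each turn (assumed)
): ReReadTotals {
  let context = startingContext;
  let totalOutput = 0;
  let totalCacheRead = 0;
  for (let turn = 0; turn < turns; turn++) {
    totalCacheRead += context; // re-read everything stashed so far
    totalOutput += growthPerTurn;
    context += growthPerTurn;  // the pile grows
  }
  return { totalOutput, totalCacheRead };
}

const hundredTurns = simulateReReads(100, 10_000, 1_000);
const twoHundredTurns = simulateReReads(200, 10_000, 1_000);
console.log(hundredTurns, twoHundredTurns);
```

In that toy run, doubling the turns doubles the output but multiplies the re-reading by about 3.7. Writing grows in a straight line; re-reading grows closer to the square of the session length.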

So when I say output was about 1% of the traffic, this is why. You are not mostly paying an AI to write. You are paying it to re-read what it already wrote, so it remembers where it is.
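For anyone who wants to check my arithmetic, here it is as a few lines of TypeScript. The four counts are the rounded figures from this post, and the one-tenth price ratio for cache reads is the approximate figure quoted above; the variable names are mine, and actual pricing depends on the provider and model.

```typescript
// Token counts in millions, as rounded in the post.
const tokensM = { output: 4.9, input: 2.4, cacheCreation: 16.7, cacheRead: 404 };

const totalM =
  tokensM.output + tokensM.input + tokensM.cacheCreation + tokensM.cacheRead;
const outputSharePct = (tokensM.output / totalM) * 100;
const readsPerOutput = tokensM.cacheRead / tokensM.output;

// Cache reads cost about a tenth of fresh input (approximate figure from the post),
// so express them as the volume of fresh input that would cost the same.
const CACHE_READ_PRICE_RATIO = 0.1;
const cacheReadAsFreshInputM = tokensM.cacheRead * CACHE_READ_PRICE_RATIO;

console.log(`total traffic: ~${totalM.toFixed(0)}M tokens`);
console.log(`output share: ~${outputSharePct.toFixed(1)}%`);
console.log(`cache reads per output token: ~${readsPerOutput.toFixed(0)}`);
console.log(`cache reads priced like ~${cacheReadAsFreshInputM.toFixed(0)}M fresh input tokens`);
```

From the rounded figures this lands at roughly 428M total, output a little over 1%, and a ratio of about 82 (the 83 I quoted presumably reflects the unrounded log values). The last line is the affordability point: 404M re-read tokens bill like roughly 40M fresh ones.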

Why a one-pager justified any of this

Fair pushback: it is a brochure site. Why not a Tailwind template and a Tuesday afternoon?

Because of what it is selling. kata is a methodology for disciplined, auditable AI software delivery, pitched at CTOs in fintech, health, and public sector who cannot afford to get AI wrong. The whole value prop is "we are the grown-ups who do this with rigor." For kata, we needed a site that screams professionalism before anyone reads a word.

You cannot make that argument on a site that looks like everyone else's. The medium is the message. If we want a regulated-industry CTO to believe we can execute, the site itself has to be an artifact of execution. Taste is the proof of capability. So no, we could not ship the template. The polish was not vanity; it was the pitch.

That is the product justification. Now here is what that justification cost.

The glyph that ate one hundred edits

The hero is a 3D Japanese kanji, 型, rendered in WebGL. It is the single most-edited file in the entire project: KanjiGL.tsx, one hundred separate edits. For comparison, the entire homepage layout took 96. We spent more time art-directing one character than building the page it sits in. (I directed. The agent re-rendered. One hundred times.)

Here is the lesson, and it is a real one. You cannot specify a material in words. The brief was "make it glass." Simple, right? It is not. "Glass" sent us through a parade of wrong: it looked like flat plastic, then like cheap chrome, then like a gummy bear, then like nothing at all. The instruction "this is how it looks, but I'm expecting something like that reference" is not a spec. It is a vibe. And a renderer does not do vibes.

What actually got us there was abandoning adjectives and chasing physics: the final glyph is polished chrome at full metalness, lit by an HDR studio environment with specific reflective panels, run through AgX tone mapping. None of those words were in the original ask. We discovered them one disappointed render at a time. The gap between "I'll know it when I see it" and a deterministic machine is where one hundred edits go to die.
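In hindsight, the brief I should have written looks less like a sentence and more like a typed object. The sketch below is hypothetical: the type, the field names, and the validator are illustrations loosely modeled on common physically based material parameters, not the contents of KanjiGL.tsx. Only full metalness, the HDR studio environment, and AgX tone mapping come from the actual build; the roughness value is a placeholder I made up.

```typescript
// Hypothetical look spec. Field names are illustrative, not the project's real code.
type ToneMapping = "AgX" | "ACESFilmic" | "Linear";

interface GlyphLookSpec {
  metalness: number;                  // 0 = dielectric, 1 = fully metallic
  roughness: number;                  // 0 = mirror polish, 1 = fully diffuse
  environment: "hdr-studio" | "none"; // what the surface has to reflect
  toneMapping: ToneMapping;
}

const finalLook: GlyphLookSpec = {
  metalness: 1,              // "full metalness" (from the post)
  roughness: 0.05,           // assumed: "polished" means low; exact value unknown
  environment: "hdr-studio", // from the post
  toneMapping: "AgX",        // from the post
};

// Cheap sanity checks that catch a bad brief before a render does.
function validateLook(spec: GlyphLookSpec): string[] {
  const problems: string[] = [];
  if (spec.metalness < 0 || spec.metalness > 1) {
    problems.push("metalness must be in [0, 1]");
  }
  if (spec.roughness < 0 || spec.roughness > 1) {
    problems.push("roughness must be in [0, 1]");
  }
  if (spec.metalness === 1 && spec.environment === "none") {
    problems.push("a fully metallic surface needs an environment to reflect");
  }
  return problems;
}

console.log(validateLook(finalLook));
```

The point is not this particular shape. It is that every field is a number or a named choice, so "wrong" becomes something you can point at instead of something you feel.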

Scroll choreography, where dreams go to die

If the glyph was the most-edited, scroll was the most aggravating. By a measure I find very funny in hindsight, scroll-related work drew more "no, that's wrong" notes from me than anything else in the build.

The site has a tunnel section: you scroll, and you travel through the five layers of the methodology as gates. Conceptually gorgeous. In practice, three separate fights:

The tunnel rendered vertically when it needed to be horizontal. "The tunnel is not displayed horizontally" is a sentence I typed with great calm.
Snap points refused to land centered or even legible. Pinning a section and making its content arrive in a readable spot are two different problems, and both fought back.
Scroll did not know when to stop. After the last gate left the screen, the tunnel kept scrolling into the void, and reining that in took embarrassingly long.

Scroll-linked animation is a genuinely hard problem and the AI is not magic at it, because the spec lives entirely in your eyes and your scroll wheel. You feel that it is wrong half a second before you can say why. Translating that flinch into a config is the actual job, and it is slow no matter who, or what, is typing.
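Stripped of the animation library, the three fights reduce to three small pieces of math: map vertical scroll progress to a horizontal offset, pick snap points where a gate is centered, and clamp so nothing moves after the last gate. The sketch below is not the site's code. The function names and the layout model (gates side by side, each one viewport wide) are assumptions for illustration.

```typescript
// Hypothetical helpers; the names and layout model are assumptions.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Horizontal offset (px) of the tunnel track for a scroll progress in [0, 1].
// With gates laid side by side, the track travels (gateCount - 1) viewport widths.
// Clamping the progress is what stops the tunnel from scrolling into the void.
function tunnelOffsetX(progress: number, gateCount: number, viewportWidth: number): number {
  return -clamp01(progress) * (gateCount - 1) * viewportWidth;
}

// Progress values at which each gate sits centered in the viewport.
function snapPoints(gateCount: number): number[] {
  if (gateCount < 2) return [0];
  return Array.from({ length: gateCount }, (_, i) => i / (gateCount - 1));
}

// The snap point closest to the current progress.
function nearestSnap(progress: number, points: number[]): number {
  const p = clamp01(progress);
  return points.reduce((best, point) =>
    Math.abs(point - p) < Math.abs(best - p) ? point : best
  );
}

const gates = 5; // the five layers of the methodology
const points = snapPoints(gates);
console.log(points, nearestSnap(0.3, points), tunnelOffsetX(1.4, gates, 1000));
```

None of this is hard on paper. The slow part was finding out which of these three was the one currently lying to us.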

The gate wizard and the opacity hell

Honorable mention to the interactive gate wizard, which taught the most transferable lesson of the build.

We tried a clever thing where text faded in and out by opacity as you moved through each gate. It was a nightmare of timing: the text would go transparent at exactly the wrong moment, one beat off on every interaction. We chased opacity timing for a while before the actual fix, which was to stop being clever and put the text in solid cards, so there was nothing to fade at all.

That is the whole lesson. Half of building with AI fast is recognizing when the elegant approach is fighting you and a dumber, more robust one is sitting right there. The model will happily help you polish the wrong approach forever. Knowing when to bail is still a human job.

What the machine was genuinely great at

Lest this read as a complaint, the honest other half.

Volume and stamina. 1,299 tool calls. 485 edits, 360 reads, 339 shell commands, across 38 sessions in two days. No human pair holds that pace without the wheels coming off.

Delegation. I fanned out 28 sub-agents for parallel chunks of work, and that pattern genuinely held up. The build-this-while-I-think-about-that loop is real leverage.

The tight visual loop. We edited more than we read (485 edits to 360 reads), which is exactly the rhythm you want when iterating on something visual: change, look, change, look. The model is a phenomenal "try it and show me" partner. It is a weaker "decide what good looks like" partner. That is still us.

The division of labor that emerged: the machine is fast hands, the human is taste and the stop button. Anyone selling you the machine as the taste is just selling you something.

The actual takeaways

For the CTOs:

Output tokens are the wrong thing to budget. Context is the cost. Long sessions on a growing codebase get expensive on re-reading, not writing.
AI does not collapse "I'll know it when I see it" into a spec. Taste-driven work like materials, motion, and scroll stays slow and iterative. Budget for it.
The speed is real, but it is leverage on your judgment, not a replacement for it. Point it at the wrong target and it sprints there beautifully.

For the devs and designers:

You still cannot describe a material or a motion in prose. Bring references, bring physics, bring numbers.
The bravest refactor is deleting the clever approach. (See: cards, opacity.)
Your job moves up the stack. Less typing, far more deciding what good looks like and saying when to stop.

Would I do it again? In a heartbeat. Two days, one stubborn human, one tireless model, and a single Japanese character that finally looks expensive.

Go look at the thing we suffered for →