A Month of Agentic Delivery

A month ago a small product team — a PM and two engineers, me being one of them — shipped an internal AI tool. A vision model looks at a returned leased vehicle, spots the damage, and drafts a natural-language description that the business-operations team reviews before the customer sees a quote.

The MVP is live, the next phase is scoped, and I wrote maybe a dozen lines of code by hand. Everything else — the FastAPI service, the Next.js operator UI, the Terraform, the CI/CD pipeline, the eval harness — was directed. Claude Code did the work. I reviewed diffs.

A project shaped like this is three months minimum if you're lucky with scope, and six to twelve in a Japanese-enterprise IT cycle by JUAS's standard-duration benchmark (2.4 × ∛(person-months)). We committed to one. That worked because three other things were already in motion.

How we got here

The month-to-MVP wasn't a magic trick. It was three other teams having quietly done the unglamorous work that lets a fourth team move fast.

Three streams converging made one month feel possible. None of them happened for this project's sake. They happened, and this project got to be the visible one.

The brief

Spot the damage on a returned leased vehicle, draft a description the business-operations team reviews before the customer sees a quote, and demo it in a month: vision model + LLM + operator UI + AWS infra + multi-env CI/CD.

The stack

Vision: Claude (multimodal)
Description: Claude (text)
Frontend: Next.js + Tailwind
Backend: FastAPI + Pydantic
Infrastructure: Terraform on AWS
Source + CI/CD: AWS CodeCommit + CodePipeline → dev / qa / prod
Dev environment: Self-hosted Coder on AWS
Security baseline: AWS Security Hub findings, audited
Observability: Langfuse
Build agent: Claude Code (primary) + Codex (second opinion)

Boringly correct. If the agent is directing construction, you want the stack with the deepest training signal. Anything exotic costs the agent traction.

Most of these layers — the AWS landing zone, the Coder dev environment, the CodeCommit + CodePipeline pattern, the Security Hub baseline — were already in place. We weren't choosing them; we were inheriting them and building on top.

Where the agent was doing the work

Requirements. I described the problem in plain language. Claude Code asked clarifying questions and turned them into a spec. The spec lived in docs/product.md — the product artifact, edited by both of us. CLAUDE.md sat alongside it with the agent's instructions: style, conventions, what to avoid. Two documents doing two jobs.
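To make that concrete, a CLAUDE.md of roughly this shape (an illustrative sketch, not the real file; the conventions are examples of the kind of thing it held):

```markdown
# CLAUDE.md (illustrative sketch, not the real file)

## Product
- The spec lives in docs/product.md. Treat it as the source of truth; propose edits there rather than restating requirements here.

## Conventions
- Backend: FastAPI, with Pydantic models for every request and response body.
- Frontend: Next.js + Tailwind; no new UI libraries without asking.
- Infra: Terraform only; never change resources directly in the console.

## Avoid
- Don't touch the prod Terraform workspace without an approved plan file.
- Don't add dependencies outside the existing lockfiles without flagging it.
```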

Design. Plan mode for anything non-trivial. Claude would explore the code with Explore subagents in parallel, sketch two or three approaches, name the trade-offs, and write the result to ~/.claude/plans/<feature>.md. I'd read the plan, push back, iterate. Nothing was approved without a review pass. The plan file, not a chat thread, was the alignment artifact.

Code. Describe the unit of work, let Claude Code implement, review the diff, run tests, commit. On small features I'd approve every change. On scaffolding I'd let it run until it reported back. For anything risky I'd send the same design to Codex and compare — different error modes.

Tests. Generated alongside the features. Integration tests ran against a real staging AWS environment, not mocks. Claude set up the ephemeral fixtures on its own.
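As an illustration of the shape (module names, env vars, and endpoints here are hypothetical, not the project's real ones), an integration test against the staging stack looked something like this:

```python
# Illustrative sketch only; names, env vars, and endpoints are assumptions.
import os
import uuid

import boto3
import httpx
import pytest

STAGING_API = os.environ["STAGING_API_URL"]        # base URL of the qa environment
UPLOAD_BUCKET = os.environ["STAGING_UPLOAD_BUCKET"]


@pytest.fixture
def staged_photo():
    """Upload a sample damage photo to the real staging bucket, clean it up afterwards."""
    s3 = boto3.client("s3")
    key = f"integration-tests/{uuid.uuid4()}.jpg"
    with open("tests/fixtures/scratched_door.jpg", "rb") as f:
        s3.put_object(Bucket=UPLOAD_BUCKET, Key=key, Body=f.read())
    yield key
    s3.delete_object(Bucket=UPLOAD_BUCKET, Key=key)


def test_detection_produces_reviewable_draft(staged_photo):
    """End to end: photo in staging S3 -> detection -> draft description for the operator."""
    resp = httpx.post(f"{STAGING_API}/inspections", json={"photo_key": staged_photo}, timeout=120)
    resp.raise_for_status()
    body = resp.json()
    assert body["damage_items"], "expected at least one detected damage item"
    assert body["draft_description"].strip(), "operator needs a non-empty draft to review"
```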

Infrastructure. Terraform modules were maybe 90% Claude, 10% me fixing patterns that didn't fit a multi-account setup. Internal product, scoped IP — but the bar was still enterprise: dev/qa/prod isolation, full audit trail, Security Hub findings triaged and remediated by Claude before merge. Claude had full AWS CLI access through a scoped IAM role, so it could plan, apply, and verify without me brokering credentials. First terraform apply in prod was around week three.

CI/CD. AWS CodeCommit for source, CodePipeline for builds and deploys, written by Claude, reviewed by Codex. Multi-env deploy (dev → qa → prod) working before any business logic was written — a deliberate choice to never ship undeployable code.

Deployment. I didn't deploy. Claude did, through CI/CD.

The harness, not just the model

The agent is only half of it. The part that made this work was the harness around it.

A model alone is a junior engineer. A model in a good harness is a team.

settings.json permissions. Read-only commands, test runners, terraform plan, gh pr view — all on the allow-list. Destructive commands always prompted. Long runs stopped asking "can I list files?" every ninety seconds.
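The shape of that allow-list, roughly (a sketch; the real file had more entries and project-specific commands, and anything not listed still prompts):

```json
{
  "permissions": {
    "allow": [
      "Bash(git status)",
      "Bash(git diff:*)",
      "Bash(pytest:*)",
      "Bash(terraform plan:*)",
      "Bash(gh pr view:*)"
    ]
  }
}
```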

Skills. Commit-and-push, deploy, bootstrap a new module, run the eval dataset. Once a task had been done twice, it became a skill. The agent invoked them like an engineer runs a Makefile.
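A skill here is essentially a markdown file with a bit of frontmatter. The deploy one might have looked something like this (an illustrative sketch, not the real file):

```markdown
---
name: deploy
description: Push the current branch through CodePipeline and verify the target environment is healthy.
---

1. Confirm the working tree is clean and all tests pass locally.
2. Push to the target environment branch (dev, qa, or prod); prod requires an approved plan.
3. Watch the pipeline execution until it succeeds or fails; paste the failing stage's logs if it fails.
4. Hit the environment's health endpoint and report the deployed version back.
```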

Hooks. Light touch this round — a couple of approval pings on long runs, nothing fancy. The lever I'm planning to lean on harder next phase.

MCP servers. Langfuse for pulling eval traces into context. The cloud provider for logs and deploy status. The issue tracker for reading tickets and closing them. The agent's hands extended to the operational systems, not just the repo.

Subagents. Explore for codebase survey, Plan for design review. Both ran in parallel, both kept the main context window clean. Half of the "two-agent sanity check" was Explore + Plan running against each other; the other half was Codex on the outside.

This is the lever that gets underweighted. A model alone asks you to sit at the terminal. A model in a good harness runs overnight.

How the month went

Week 0: Brief

Vision model + LLM + operator UI + AWS infra + multi-env CI/CD. Demo in a month.

Week 1: Infra and pipeline first

A hello endpoint reached prod through the full CI/CD pipeline before any business logic existed. Killed late-stage integration pain.

Week 2: Vision and description models wired

Detection → draft description loop running end-to-end on staging. Eval hooks captured operator confirmations from day one.

Week 3: First terraform apply in prod

Operator UI tightened, multi-account Terraform settled, the daily director-mode loop felt natural.

Week 4: MVP live, next phase scoped

Roughly a dozen lines of hand-written code. Everything else was directed and reviewed.

What worked

Shipping infrastructure first. The first thing deployed was a hello endpoint reaching prod through the full pipeline. Every feature after that was additive. Killed a whole class of late-stage integration pain.
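The first deployable unit really was that small; something on the order of this sketch (the real handler and names may have differed):

```python
# Minimal "prove the pipeline" service: an illustrative sketch of the week-1 deliverable.
import os

from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
def health():
    # Returning the environment name makes it obvious which stage (dev/qa/prod) answered.
    return {"status": "ok", "env": os.environ.get("APP_ENV", "unknown")}
```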

Letting the agent ask. The instinct is to over-specify. The better move is to give Claude Code enough to start and let it surface ambiguity as questions. You learn what's actually unclear in your own requirements — usually quite a lot.

Two agents for anything risky. Codex as a second opinion on architecture and security caught two things Claude missed, and one thing Claude caught that Codex missed. Marginal cost near zero, error modes different enough to matter.

Eval hooks from day zero. The operator confirms the LLM's description before it goes out. That confirmation is captured into Langfuse. By week three there was a dataset to fine-tune against. Build the measurement into the product, not beside it.
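The capture itself is a few lines wherever the operator's decision lands. Here is a sketch assuming the Langfuse v2 Python SDK's score call; the route, payload fields, and score name are hypothetical:

```python
# Illustrative sketch; route, payload fields, and score names are assumptions, not the real code.
from fastapi import FastAPI
from langfuse import Langfuse
from pydantic import BaseModel

app = FastAPI()
langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment


class OperatorDecision(BaseModel):
    trace_id: str               # the Langfuse trace that produced the draft description
    confirmed: bool             # True if the operator clicked confirm without editing
    edited_text: str | None = None


@app.post("/drafts/decision")
def record_decision(decision: OperatorDecision):
    # The confirmation becomes a score on the generation trace, so every draft
    # is labeled automatically and the eval dataset grows as the product is used.
    langfuse.score(
        trace_id=decision.trace_id,
        name="operator_confirmed",
        value=1.0 if decision.confirmed else 0.0,
        comment=decision.edited_text,
    )
    return {"recorded": True}
```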

Boring infra choices. Next.js, FastAPI, Terraform, CodePipeline. Whenever I was tempted by something newer I asked whether the gain was worth the agent losing traction. Almost always no.

What didn't

Visual design. The first UI was usable and ugly. A twenty-minute pass from me fixed it. Agents are excellent at structure, bad at taste. Plan for a human design pass; don't expect the agent to own the look.

Long refactors in one session. I once asked Claude Code to rework the description-generation module twice in a single session. The second attempt lost context from the first. Commit between logical units. Don't let the session history get long.

Prompts in the chat. The right prompts for the vision model belong in code, not in Claude Code's scratch space. Moving them into version-controlled YAML with snapshot tests made everything noticeably better.
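The pattern is simple enough to sketch; file names and fields below are illustrative, not the project's real ones:

```python
# Illustrative sketch: prompts live in YAML under version control, and a snapshot
# test fails whenever the rendered prompt drifts from the committed reference.
from pathlib import Path

import yaml

PROMPTS_DIR = Path("prompts")
SNAPSHOT_DIR = Path("tests/snapshots")


def render_prompt(name: str, **variables) -> str:
    spec = yaml.safe_load((PROMPTS_DIR / f"{name}.yaml").read_text())
    # e.g. prompts/damage_description.yaml:
    #   template: |
    #     You are describing vehicle damage for a leasing operator.
    #     Damage items: {damage_items}
    #     Write two sentences in plain business style.
    return spec["template"].format(**variables)


def test_damage_description_prompt_snapshot():
    rendered = render_prompt("damage_description", damage_items="scratch, front-left door")
    snapshot = SNAPSHOT_DIR / "damage_description.txt"
    if not snapshot.exists():
        snapshot.write_text(rendered)  # first run records the reference
    assert rendered == snapshot.read_text(), "prompt changed: review the diff and update the snapshot deliberately"
```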

A day, roughly

By week three the loop had settled. The shape barely changed day to day.

Not a heavy-touch role. Closer to product director than engineer.

If you're starting

The month is not a stunt. It's what the job looks like when the foundation is already there. Scaffolding, CI, tests, infra — no longer where the time goes. What's left is product judgment, which is what the job should have been all along.

Why this one made it to production

Most AI PoCs stop at PoC — MIT's NANDA initiative put the figure at 95% in 2025, against $30–40B of enterprise spend. Not because the model can't do the work — the model can almost always do the work. They stop because nobody on the business side was waiting for them. They were tech-led demos; the org admired them; nobody had them in their workflow on Monday morning.

This one shipped because the brief didn't come from engineering. It came out of the sales-cooperation workshops where junior staff were brainstorming where AI could actually move a business outcome. The business-operations team — the people doing this work today, who own the operational cost — were waiting for the result. The eval test wasn't "did the model output JSON" — it was "did the operator click confirm without editing." When that's your scoreboard, you can't fake it.

The numbers aren't mine to share precisely, but the directional shape matters: hundreds of millions of yen in annual business value sit on top of this product, and tens of millions of yen of annual operations cost run through the new flow. The eval tells us whether the operator confirmed the draft. The P&L tells us whether the company felt it.

The goal was never "make use of an AI model." The goal was "make the business better, with AI as the lever." That sentence sounds obvious written down. In practice it's the line most PoCs cross in the wrong direction.

What's next

The framework we're rolling out next is built around that principle — outcome before model. Same model, same harness, different verbs. Product, security, legal/HR, ops — all of them have repeatable work that an agent can run if you frame it as a business outcome first.

Six pillars from outcome through delivery, each with a clear owner, supporting methods, and a small library of skill.md files so a non-engineer can run their own work through Claude Code or Codex without writing code.

That's the next post — and probably the one after it. The framework, the pillars, the ownership model, and the skills library that makes it portable across the business.


Related: Building AgentHive — the control plane I'm building toward overnight, hands-off runs. Claude Code Channels — the API behind permission relay.