Silicon Team S2E02: 40 Ticks to Production

Three in the morning, my terminal still flickering. The OPC loop (automated AI build pipeline) wakes up every 20 minutes, reads loop-state.json, sees status: pipeline_complete, prints “No-op.”, goes back to sleep.

This was the 50th time it did this. The pipeline was done, but the loop wouldn’t stop — the most ironic bug in the entire project.

Last episode’s family calendar was a direction problem — AI picked the most common UI pattern from training data, human had to override. That project wrapped in 23 hours, $92, modest scope. This episode is a heavier product — 47 hours, 40 execution units, five deploys to get it live — and it exposed a different class of problem: infrastructure migration and the boundaries of autonomous loops. When a project is complex enough and runs long enough, AI hits a wall: context blowouts, environment gaps, and the inability to stop itself.

From Personal Tool to SaaS

The product started as a single-user vertical tool — functional but built for exactly one person. The tech stack matched — SQLite single-file database, Flask backend, content hardcoded as JSON, no user system. Code scattered across four GitHub repos, nobody sure which was current. Classic “personal tool” state: it works for its creator, but the architecture can’t scale. No user system means you can’t track different users’ data. No database migration means schema changes require manual SQL surgery. No deployment pipeline means it only runs on the developer’s own machine.

The goal: turn it into sellable SaaS. PostgreSQL multi-tenancy, OAuth authentication, Stripe payments, AI Coach (LLM-powered personalized content), internationalization, admin dashboard, Fly.io production deployment.

When the OPC loop received the task, Claude looked at the scope and pushed back: “i18n, multi-provider OAuth, Stripe — these are SaaS infrastructure. A single-user tool doesn’t need them.”

I pushed back harder: “I’m planning to commercialize.”

The agent has its own judgment, but it can’t guess the user’s strategic intent.

Same class of problem as S2E01’s Calendar grid — AI reasons from the current code state (“this is a single-user tool, skip the SaaS infrastructure”), but it doesn’t know your business plan. Direction must come from humans.

47 Hours, 40 Ticks

14 SCOPEs (feature areas), broken into 40 units (each unit = one tick = one execution step: plan → build → test → review), across 5 phases. The phase ordering was deliberate — foundation first (database, auth), then core product, then security and deployment. Reverse the order — say, build the frontend before the database — and later schema changes cascade through already-completed frontend code, breaking everything downstream.

Phase	Content
Foundation	PG schema → migration → multi-tenancy → Auth → security hardening
Core Product	Core product → frontend rebuild → AI Coach → content management
Monetization	Stripe Billing → i18n → Admin dashboard
Security + Deploy	Security audit → Fly.io deployment
E2E Verification	End-to-end acceptance

Every unit’s plan included verify: (how to run tests) and eval: (who reviews). For example:

verify: pytest tests/test_stripe.py -v && curl localhost:8000/api/healthz
eval: security, backend

These two lines look trivial, but they’re the key to surviving amnesia — more on that next.
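To make that concrete, here is a minimal sketch of how a loop runner might pull those two fields out of a unit’s plan text. The `UnitChecks` class and `parse_unit_checks` function are hypothetical names for illustration; the only thing taken from the project is the `verify:` / `eval:` line format shown above.

```python
from dataclasses import dataclass, field


@dataclass
class UnitChecks:
    """Hypothetical container for one unit's validation instructions."""
    verify: str = ""                                      # shell command that runs the tests
    eval_roles: list[str] = field(default_factory=list)   # reviewer roles


def parse_unit_checks(plan_text: str) -> UnitChecks:
    """Extract the verify: and eval: lines from a unit's plan text."""
    checks = UnitChecks()
    for raw in plan_text.splitlines():
        line = raw.strip()
        if line.startswith("verify:"):
            checks.verify = line[len("verify:"):].strip()
        elif line.startswith("eval:"):
            roles = line[len("eval:"):].split(",")
            checks.eval_roles = [r.strip() for r in roles if r.strip()]
    return checks


example = """\
verify: pytest tests/test_stripe.py -v && curl localhost:8000/api/healthz
eval: security, backend
"""
checks = parse_unit_checks(example)
print(checks.verify)
print(checks.eval_roles)
```

Because both fields live in the plan file rather than in the conversation, an agent with no memory of the unit can still recover what to run and who should review it.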

Three Context Blowouts

The first blowout hit at 1,189 messages, 37M input tokens. A session that would run for 47 hours was never going to fit in Claude’s context window, so automatic compaction kicked in.

“This session is being continued from a previous conversation that ran out of context.”

Claude forgot the code details it had written, forgot why decisions were made, forgot which tests had already failed. But it didn’t stop — because loop-state.json was still there.

Loop state is the single source of truth. Context blows up, the agent forgets, but as long as loop-state.json + plan.md survive, the pipeline continues.

The loop protocol wasn’t designed around “how agents execute” — it was designed around “how agents recover after amnesia.” Every tick starts by force-reading loop-state and plan, never relying on conversation history. That’s why every unit has verify: and eval: lines — a post-amnesia agent reads those two lines and knows exactly how to validate its own work.
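A rough sketch of what “force-reading” at the start of a tick could look like. The file names loop-state.json and plan.md and the pipeline_complete status come from the project; the `start_tick` function, the `current_unit` field, and the `in_progress` status are assumptions made up for this example, not the actual OPC schema.

```python
import json
import tempfile
from pathlib import Path


def start_tick(workdir: Path) -> dict:
    """Rebuild everything the tick needs from disk, never from chat history."""
    state = json.loads((workdir / "loop-state.json").read_text(encoding="utf-8"))
    plan = (workdir / "plan.md").read_text(encoding="utf-8")
    if state.get("status") == "pipeline_complete":
        return {"action": "noop"}
    # "current_unit" is a hypothetical field name for the next unit to run.
    return {"action": "run_unit", "unit": state.get("current_unit"), "plan": plan}


# Demo with throwaway files standing in for the real project directory.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "loop-state.json").write_text(
        json.dumps({"status": "in_progress", "current_unit": 17}), encoding="utf-8"
    )
    (workdir / "plan.md").write_text("verify: pytest -v\neval: backend\n", encoding="utf-8")
    decision = start_tick(workdir)

print(decision["action"], decision["unit"])
```

Nothing in `start_tick` takes conversation history as input, which is the point: a compacted or brand-new session produces the same decision as the original one.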

Three blowouts, three recoveries, pipeline unbroken. The cost: token consumption exploded — 270M of 309M total tokens were cache reads, mostly from post-blowout context rebuilding.

Deployment: A Different Kind of Debugging

Everything passing locally. Time to push to Fly.io. Five attempts to get it running.

Attempt 1: psycopg2-binary won’t compile on Alpine Linux. Switch to Debian slim + libpq-dev.

Attempt 2: Fly.io’s database URL uses postgres://, but SQLAlchemy needs postgresql+asyncpg://. Write normalize_database_url() — solve it once, solve it forever.

Attempt 3: Initial SQL dump already contains tables from later migrations. alembic upgrade head tries to create them again. Write migrate.sh to detect fresh databases — only run initial migration + alembic stamp head.

Attempt 4: Add /api/healthz endpoint + fly.toml health check config.

Attempt 5: Finally running. But CORS only allows localhost:5173 — forgot to add the production domain. I fixed it with a one-line config change after the pipeline ended, but the bug itself reveals something: AI tests on localhost and considers the task done, never proactively checking production domain configuration.
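Attempts 2 and 3 are small enough to sketch. What follows is an illustration under assumptions, not the project’s actual code: only the name normalize_database_url() and the two Alembic commands appear in the original; the `driver` parameter, the `plan_migration` helper (standing in for the logic of migrate.sh), the empty-table-set test for “fresh,” and the `initial.sql` file name are hypothetical.

```python
def normalize_database_url(url: str, driver: str = "asyncpg") -> str:
    """Rewrite postgres:// or postgresql:// to postgresql+<driver>://."""
    for prefix in ("postgres://", "postgresql://"):
        if url.startswith(prefix):
            return f"postgresql+{driver}://{url[len(prefix):]}"
    return url  # already normalized, or not a Postgres URL


def plan_migration(existing_tables: set[str]) -> list[str]:
    """Pick migration commands based on what the database already contains."""
    if not existing_tables:
        # Fresh database: load the initial dump (hypothetical file name), then
        # mark the schema as current instead of re-creating tables it contains.
        return ['psql "$DATABASE_URL" -f initial.sql', "alembic stamp head"]
    # Existing database: apply whatever migrations are still pending.
    return ["alembic upgrade head"]


print(normalize_database_url("postgres://user:pw@db.internal:5432/app"))
print(plan_migration(set()))
print(plan_migration({"users", "alembic_version"}))
```

The URL helper is idempotent, so it is safe to call on every startup regardless of which scheme the platform hands over.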

The deploy failures share a single pattern: the gap between local development and cloud production. Different base images (Alpine vs Debian), different URL schemes (postgres:// vs postgresql+asyncpg://), different database states (fresh vs pre-seeded), different network configs (localhost vs public domain). AI passes all local tests and concludes “done,” but deployment isn’t “passing tests.” Deployment is making code survive in an entirely different environment. That’s why it deserves its own iteration loop.

Deployment isn’t “the last step” — it’s a different kind of debugging. Treat it as its own iteration loop.

The Loop That Wouldn’t Stop

Pipeline complete. Production environment running. But the OPC loop refused to stop.

Tried every known stop method: CronList → empty (dynamic mode doesn’t use cron); CronDelete → no job ID; even tried /ralph-loop:cancel-ralph → “No active Ralph loop found.”

The only solution: I manually pressed Ctrl+C.

The root cause is clear: the loop’s start condition (“there’s a new task”) and its termination condition (“the task is done”) are asymmetric. Starting is an explicit trigger — I gave it a task. But stopping has no corresponding explicit mechanism. pipeline_complete is a pipeline-level status, not a loop-level termination signal. The loop only asks “should I execute the next tick?” — it never asks “should I stop existing?”

This isn’t something you fix with an if-statement. An autonomous system that doesn’t know when to stop needs an external mechanism — human or machine — to tell it. In this project, that “external mechanism” was my finger on Ctrl+C.

Autonomous loops need a termination guard. “Pipeline complete” does not equal “loop stopped.” Any autonomous system needs explicit termination conditions and mechanisms.
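One possible shape for such a guard, sketched under the assumption that the scheduler (not the agent) owns the decision to keep waking up. `TerminationGuard`, `run_loop`, the idle-tick limit, and the `in_progress` status are hypothetical; only the pipeline_complete status comes from the project.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TerminationGuard:
    """Loop-level stop condition, kept separate from pipeline status."""
    max_idle_ticks: int = 3   # hypothetical limit on consecutive no-op ticks
    idle_ticks: int = 0

    def should_stop(self, status: str, did_work: bool) -> bool:
        if status == "pipeline_complete":
            return True  # finished pipeline: stop scheduling, don't poll
        self.idle_ticks = 0 if did_work else self.idle_ticks + 1
        return self.idle_ticks >= self.max_idle_ticks


def run_loop(tick: Callable[[], tuple[str, bool]],
             guard: TerminationGuard,
             max_ticks: int = 1000) -> int:
    """Run ticks until the guard says stop; return how many ticks ran."""
    for n in range(1, max_ticks + 1):
        status, did_work = tick()
        if guard.should_stop(status, did_work):
            return n
    return max_ticks  # hard ceiling as a last resort


# Simulated ticks: two units of real work, then the pipeline completes.
statuses = iter([("in_progress", True), ("in_progress", True), ("pipeline_complete", False)])
ticks_run = run_loop(lambda: next(statuses), TerminationGuard())
print(ticks_run)  # 3
```

The guard sits outside the tick itself, so it does not depend on the agent remembering to stop; the idle counter and the hard ceiling cover the case where the status never reaches pipeline_complete at all.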

That 3 AM “No-op.” from the opening was pure overhead: cache_read tokens spent for zero output. Every 20 minutes, the loop woke up, re-read the entire project context, confirmed there was nothing to do, and went back to sleep, 50 times over. Those wake-ups piled onto the same 270M cache_read total that the post-blowout context rebuilds dominated. The project’s biggest waste, and its best lesson.

The Ledger

Metric	Data	Notes
Total time	47 hours	Including idle polling waste
API cost	~$347	309M tokens, cache read 87%
Ticks	40	14 SCOPEs, 5 phases
Subagents	76	Each with independent context, high token cost
Context blowouts	3	Auto-recovered, pipeline unbroken
Deploy iterations	5	Final: Fly.io production
Remaining bugs	CORS (manually fixed)	One-line config fix post-pipeline

$347 for a full-stack SaaS MVP implementation + deployment. This scope (PG multi-tenancy + OAuth + Stripe + AI Coach + i18n + Admin + deployment) would take a solo developer familiar with Python but new to full-stack SaaS infrastructure roughly 1–3 weeks of full-time work. But a significant chunk of that $347 was wasted on idle polling and context rebuilding — with a termination guard, costs could have been cut by at least a third.
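A quick back-of-the-envelope check on those ledger numbers. All inputs come from the article; the one assumption is that the 50 no-op wake-ups were spaced the full 20 minutes apart and all fall inside the 47-hour total. The idle share below is wall-clock time, not cost, so it is only a rough proxy for what a termination guard would have saved.

```python
total_tokens_m = 309      # total tokens, in millions
cache_read_m = 270        # cache-read tokens, in millions
cost_usd = 347            # approximate API cost
noop_polls = 50           # no-op wake-ups after pipeline_complete
poll_interval_min = 20    # minutes between wake-ups
total_hours = 47          # total wall-clock time

cache_share = cache_read_m / total_tokens_m
idle_hours = noop_polls * poll_interval_min / 60
idle_fraction = idle_hours / total_hours
blended_usd_per_m = cost_usd / total_tokens_m

print(f"cache-read share: {cache_share:.1%}")                    # 87.4%
print(f"idle polling: {idle_hours:.1f} h")                       # 16.7 h
print(f"idle share of wall clock: {idle_fraction:.1%}")          # 35.5%
print(f"blended cost: ${blended_usd_per_m:.2f} per 1M tokens")   # $1.12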

76 subagents are a double-edged sword. Reviewers genuinely don’t know implementation details, which makes their feedback more objective: they won’t go easy on a code section just because “I spent two hours writing it.” But every subagent has to rebuild project context from scratch, consuming massive tokens. Forty ticks means at least forty context rebuilds, plus each reviewer’s independent rebuild on top of that, which can easily exceed the main loop’s own consumption. This is the price of “the agent that does the work can’t evaluate it”: a tradeoff between independence and efficiency, with no way yet to get both.

From personal tool to production SaaS, 40 ticks, five deploys. No termination guard, CORS fixed with a manual one-liner — but it’s running in production.

If you take one thing from this article: every autonomous loop needs an explicit exit condition. Without a termination guard, an autonomous system that finishes its task becomes an idling engine — it won’t stop, it’ll just burn money.
