Silicon Team S1E09: Don't Make AI Better, Make Bad Outcomes Smaller

The first eight episodes each solved a specific problem — tests you can’t trust, architecture too rigid, memory that vanishes. But step back and you’ll notice every solution follows the same design principle. This episode makes that principle explicit.

Gates Are Floors, Not Sources

A dam doesn’t make the rain fall better. It stops the flood from getting through.

This line appeared in episode two and again in episode three. By episode nine, I can state it more precisely:

A Gate isn’t the source of quality. A Gate is the floor of quality.

What is the source of quality? The coding AI’s capability, the precision of prompts, the richness of training data. These all happen before the Gate, and here’s the key: none of them is fully within your control.

You can’t make Claude suddenly better at CSS. You can’t make GPT suddenly understand your project’s architecture. What you can do is make sure that when they make mistakes, the mistakes don’t get through.

That’s the meaning of OPC’s 60 enforcement rules. They’re not designed to make AI write better code — they’re designed to make bad code unable to pass acceptance.

This is a counterintuitive design choice. Most people, when AI writes bad code, instinctively try to fix the prompt, switch models, add few-shot examples — trying to make AI better. OPC doesn’t take that path. OPC assumes AI will make mistakes, then ensures those mistakes get caught.

Why? Because “make AI better” is a road with no end. You never know if it’s good enough. But “make bad outcomes fail the gate” is verifiable — either the gate catches it or it doesn’t. Gates count. They don’t judge.

What AI Can Evaluate

In EP01, we learned a lesson: AI can’t evaluate how much a product is worth. Asking 5 virtual buyers “how much they’d pay” only produced random samples from the price range given in the prompt.

This isn’t an AI bug — it’s a fundamental capability boundary.

Judgments AI can make (ordinal — relative ranking):

“This is better than that” — relative comparison
“There’s a broken link here” — pattern recognition
“Loading feedback is missing” — checklist verification

Judgments AI cannot make (cardinal — absolute quantity):

“This is worth $50” — absolute valuation
“Users would pay for this” — purchase decisions
“This bug will cause X% user churn” — economic impact

The distinction: ordinal judgments only require comparing two things, or comparing one thing against a known pattern or checklist, and AI is great at this. Cardinal judgments require an anchor point, and that anchor comes from real-world constraints (budgets, markets, competitors) that AI doesn’t have.

OPC’s design strictly respects this boundary. It lets AI find red flags (pattern recognition), lets AI compare which of two versions is better (relative judgment), lets AI check whether acceptance criteria are met (checklist verification). But it never asks AI for absolute evaluations — no total scores, no “is this product good,” no “is it worth releasing.”

The Gate’s verdict isn’t made by AI. The Gate counts.

When Scoring Fails

This boundary was validated in practice.

During a cover image review, I spot-checked a few reviewer scores. One reviewer gave a cover 8.5/10 — “harmonious color palette, clean layout.” I opened the cover: the title was cropped outside the safe zone, completely unreadable on mobile. My score? 4/10 at best.

This wasn’t isolated. Multiple review rounds showed the same pattern: some reviewers rubber-stamped everything, others hallucinated nonexistent problems and scored low. AI reviewers’ absolute scores — cardinal judgments — were systematically unreliable.

But when you only ask for ordinal judgments, such as “is this version better or worse than the last one,” accuracy is much higher. The AI could see that the title had shrunk, the avatar had gotten smaller, and the visual effects had disappeared. It just couldn’t tell you accurately how bad that was.
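One way to use ordinal answers without falling back on scores is to collect a "better / worse / same" vote from each reviewer and take the majority. The sketch below is only an illustration of that idea, not OPC's actual mechanism: the function name `aggregate_votes`, the three vote labels, and the strict-majority rule are all assumptions.

```python
from collections import Counter

# Hypothetical vote labels; the article only says reviewers answer
# "better or worse than the last one".
VALID_VOTES = {"better", "worse", "same"}


def aggregate_votes(votes: list[str]) -> str:
    """Return the majority ordinal verdict, or 'inconclusive' without a strict majority."""
    unknown = set(votes) - VALID_VOTES
    if unknown:
        raise ValueError(f"unexpected votes: {sorted(unknown)}")
    if not votes:
        return "inconclusive"
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of the reviewers to agree.
    return label if count * 2 > len(votes) else "inconclusive"


print(aggregate_votes(["worse", "worse", "better", "worse", "same"]))  # worse
```

No reviewer is asked how much worse the new version is. The code only counts how many of them said "worse."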

Figure: AI’s capability boundary: can make relative judgments, can’t make absolute valuations

This is why OPC’s gates don’t use scores — they use counts. Are there red flags? How many? Which ones are blocking-level? These are all ordinal judgments. The gate doesn’t ask “is this output good” — it asks “does this output have known problems.” The number and type of problems determine pass or fail, not a score.
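A count-based gate of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OPC's implementation: the `RedFlag` type, the two severity levels, the rule identifiers, and the `max_warnings` threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKING = "blocking"
    WARNING = "warning"


@dataclass(frozen=True)
class RedFlag:
    rule_id: str        # hypothetical rule identifier
    severity: Severity
    note: str


def gate_verdict(flags: list[RedFlag], max_warnings: int = 3) -> bool:
    """Return True (pass) only when the flag counts stay within fixed limits."""
    blocking = sum(1 for f in flags if f.severity is Severity.BLOCKING)
    warnings = sum(1 for f in flags if f.severity is Severity.WARNING)
    # No score anywhere: the verdict depends only on how many flags
    # of each type were raised.
    return blocking == 0 and warnings <= max_warnings


flags = [
    RedFlag("cover-title-safe-zone", Severity.BLOCKING, "title cropped on mobile"),
    RedFlag("loading-feedback", Severity.WARNING, "no loading indicator"),
]
print(gate_verdict(flags))  # False
```

The reviewers, human or AI, only produce the list of flags. The pass/fail decision is plain arithmetic on that list.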

One Principle

Back to the title: Don’t make AI better, make bad outcomes smaller.

This isn’t a slogan — it’s an actionable design principle. It tells you how to make every design decision:

Mechanical gate (deterministic check) or LLM gate (AI judgment)? Mechanical.
Have reviewers score or find red flags? Red flags.
Spend time optimizing prompts or add an enforcement rule? Add the rule.
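To make "mechanical" concrete, here is how the cropped cover title from the scoring example could be caught by a deterministic check instead of a reviewer. It is a sketch under assumptions: the `Box` type, the pixel values, and the idea that a layout tool reports the title's bounding box are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int


def inside_safe_zone(element: Box, safe: Box) -> bool:
    """True when the element's bounding box lies fully inside the safe zone."""
    return (
        element.left >= safe.left
        and element.top >= safe.top
        and element.right <= safe.right
        and element.bottom <= safe.bottom
    )


# Assumed numbers for a 1200x630 cover; real safe zones depend on the platform.
safe = Box(left=80, top=80, right=1120, bottom=550)
title = Box(left=60, top=200, right=1180, bottom=320)  # overflows on both sides
print(inside_safe_zone(title, safe))  # False
```

A check like this cannot be talked into an 8.5/10 by a harmonious color palette. It either finds the overflow or it doesn't.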

Every time you choose “make AI better,” you’re betting on AI’s capability ceiling. Every time you choose “make bad outcomes smaller,” you’re reinforcing a deterministic floor. The former’s returns are unpredictable. The latter’s returns are verifiable.

But this principle has a prerequisite: you have to be there. Gates catch known error patterns. They don’t catch directional drift. When AI executes efficiently in the wrong direction — gates all green, output solving the wrong problem — that’s when you need a human.

The next episode is about exactly that: if you run OPC in a 125-hour autonomous loop, when should a human step in?
