Silicon Team S2E03: When Everyone Says PASS, It's Time to Worry


The first two episodes built two products. The family calendar exposed a direction problem; the education tool exposed a shutdown problem. But a quieter issue had been running in the background all along: in every code review, every reviewer said PASS.

S1 established the principle of “the builder doesn’t evaluate itself” — AI writes code, then a separate group of independent AI reviewers examines it. The security reviewer doesn’t know implementation details; the test reviewer doesn’t see the implementation plan. Each reviewer works in isolated context — no rubber-stamping out of politeness.

The mechanism was right. But after two products, a pattern surfaced: security checks security, backend checks backend, frontend checks frontend — everyone running great in their own lane. When all lanes show green, who’s watching the intersection?

The Unanimous PASS Trap

While building the product screening tool, OPC ran a standard review — security, backend, and frontend reviewers. The tool’s job was to mine pain points from Hacker News and generate product probes. All three reviewers examined the code. All three: PASS.

The code was genuinely fine. No injection risks on the security side, backend logic correct, frontend rendering normal. But I looked at the output myself — 25 probes, every single one a SQLite + argparse CLI variant. The tool was faithfully executing the same technical template 25 times.

This wasn’t a code problem. From a code perspective, every probe was correct. It was an imagination problem — the tool was stuck on the most common tech stack combination in its training data, the same class of failure as EP01’s calendar grid. But the security reviewer wouldn’t care about “why always SQLite,” and the backend reviewer wouldn’t care about “is probe diversity sufficient.”

A unanimous PASS is a review result that should trigger alarm. Not because the reviewers were wrong, but because they might all share the same blind spot. Each reviewer fulfilled its responsibilities, but nobody stepped back to ask the holistic question: does this thing actually achieve its purpose?

The Tenth Man Rule

Israeli intelligence is often credited with an institutional rule: if the first nine analysts reach consensus on an intelligence assessment, the tenth analyst must argue the opposing case. The point is not that the tenth person actually disagrees, but that the counterargument gets seriously considered.

The rule reportedly originated from lessons learned in the 1973 Yom Kippur War. Before the war, intelligence agencies received abundant signals that Egypt was about to attack, but analysts formed a consensus that “Egypt won’t attack” — everyone assumed someone else had already considered the counterargument. In reality, no one had.

The danger of consensus isn’t that everyone is wrong. It’s that everyone assumes someone else has already checked the part they didn’t check.

OPC’s multi-role review faces the same structural risk. Each reviewer works in isolated context, which is both an advantage (it prevents mutual influence) and a weakness (nobody sees the full picture). When the security, backend, and frontend reviewers each say PASS, the orchestrator aggregates three PASSes into one PASS. What nobody does is aggregate the three reviewers’ blind spots and check whether they stack into one larger blind spot.
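The gap between aggregating verdicts and aggregating blind spots can be sketched in a few lines. This is a minimal illustration, not OPC’s actual orchestrator: the Review class, its scope field, and the concern names are all hypothetical, and it assumes each reviewer can declare which concerns it examined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    role: str
    verdict: str        # "PASS" or "FAIL"
    scope: frozenset    # concerns this reviewer examined (hypothetical field)


def aggregate_verdict(reviews):
    """Verdict aggregation: PASS only if every reviewer says PASS."""
    return "PASS" if all(r.verdict == "PASS" for r in reviews) else "FAIL"


def uncovered_concerns(reviews, required):
    """Blind-spot aggregation: required concerns that no reviewer examined."""
    covered = set().union(*(r.scope for r in reviews))
    return set(required) - covered


reviews = [
    Review("security", "PASS", frozenset({"injection"})),
    Review("backend", "PASS", frozenset({"logic"})),
    Review("frontend", "PASS", frozenset({"rendering"})),
]
# Assumed list of things the product as a whole must get right.
required = {"injection", "logic", "rendering", "output-diversity"}

print(aggregate_verdict(reviews))             # PASS
print(uncovered_concerns(reviews, required))  # {'output-diversity'}
```

Three green lanes still produce a PASS here, while the second function shows the concern that fell between them.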

skeptic-owner: The Mandatory Tenth Man

I added a new role to OPC’s role system: skeptic-owner (the skeptic).

It differs from other roles in two fundamental ways. First, it’s marked mandatory: true — it automatically joins every review, and the orchestrator cannot remove it. Other roles can be selectively included based on task type; skeptic-owner cannot be excluded. Second, it doesn’t look at code details — it reviews the overall direction from a product owner’s perspective.
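One way to picture the mandatory flag is as a filter the orchestrator cannot bypass. The sketch below is an assumption about how such a rule might look; the Role class, the ROLES registry, and select_reviewers are hypothetical names, and only the mandatory: true marker comes from the actual role definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    mandatory: bool = False


# Hypothetical registry; only skeptic-owner carries mandatory=True.
ROLES = [
    Role("security"),
    Role("backend"),
    Role("frontend"),
    Role("skeptic-owner", mandatory=True),
]


def select_reviewers(roles, requested):
    """Return the requested roles plus every mandatory role, in registry order."""
    wanted = set(requested)
    return [r.name for r in roles if r.mandatory or r.name in wanted]


print(select_reviewers(ROLES, ["backend"]))  # ['backend', 'skeptic-owner']
print(select_reviewers(ROLES, []))           # ['skeptic-owner']
```

Because the mandatory check sits inside the selection function rather than in the caller’s request, leaving skeptic-owner out of the request has no effect.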

skeptic-owner cares about three categories of questions:

Direction challenge: Does this implementation actually solve the user’s problem? Or does it only solve a problem that was easy to solve technically?

Coverage blind spots: Do the tests cover real usage scenarios? Or do they only cover the easy-to-test happy path?

Consensus challenge: When every other reviewer says PASS, under what circumstances should that PASS be questioned?

Its very first review after going live validated its value. Same product screening tool — frontend, backend, security reviewers all PASS. skeptic-owner’s feedback:

“The 25 probes generated by the tool are all SQLite + argparse CLI variants. Not one attempted a different tech stack, a different delivery format, a different target user segment. This isn’t a technical problem — it’s a failure of imagination.”

This finding was suggestion-level severity — it wouldn’t trigger a gate FAIL. But it pointed to a problem invisible to all the technical reviewers: technical reviewers care whether the code is correct; skeptic-owner cares whether the direction is correct. Code can be perfectly correct while executing a wrong direction — EP01’s calendar grid was exactly that.
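The skeptic’s observation boils down to a check no technical reviewer was asked to run. As a rough sketch, assuming each probe records its tech stack under a hypothetical "stack" key, the check could be as small as counting distinct values:

```python
def distinct_stacks(probes):
    """Count how many different tech stacks appear across the probes."""
    return len({p["stack"] for p in probes})


# Illustrative data shaped like the 25-probe run described above.
probes = [{"id": i, "stack": ("SQLite", "argparse CLI")} for i in range(25)]

print(len(probes), distinct_stacks(probes))  # 25 1
```

A count this low says nothing about code correctness, which is why it never surfaced in the security, backend, or frontend reviews.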

devil-advocate: Auto-Injected When Consensus Forms

skeptic-owner addresses perspective blind spots during the review phase. There’s another class of scenario that occurs during the discussion phase — when multiple agents discuss approaches and reach consensus too quickly.

OPC’s discussion nodes run multiple rounds. In auto mode, each role presents its viewpoint in round one and responds to the other roles’ viewpoints in round two. I added a rule: when discussion enters round two and all roles trend toward agreement, the devil-advocate role is injected automatically.

devil-advocate’s sole duty is to argue the opposing case. Not to negate — but to ensure counterarguments exist. There’s a hard rule in the prompt: it must raise at least one “if this approach fails, the most likely failure mode is X” argument.

For irreversible decisions — data deletion, public API contracts, destructive migrations — devil-advocate participation is mandatory regardless of consensus level. The reasoning is straightforward: the correction cost for these decisions is too high. Even with 99% confidence, it’s worth spending extra tokens to have one role seriously consider the remaining 1%.
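The two injection rules (consensus in round two, and irreversible decisions regardless of consensus) can be sketched as a single predicate. This is a simplification under stated assumptions: the function name, the decision labels, and the stances mapping are hypothetical, and "trending toward agreement" is reduced here to every role holding the same position label.

```python
# Hypothetical labels for the irreversible decision types named above.
IRREVERSIBLE = {"data_deletion", "public_api_contract", "destructive_migration"}


def needs_devil_advocate(round_number, stances, decision_kind=None):
    """Decide whether to inject devil-advocate into a discussion.

    stances maps role name -> position label after the current round.
    """
    # Irreversible decisions always get a dissenting voice.
    if decision_kind in IRREVERSIBLE:
        return True
    # Otherwise inject only once round two shows full agreement.
    consensus = len(stances) > 1 and len(set(stances.values())) == 1
    return round_number >= 2 and consensus


agree = {"pm": "approach-A", "backend": "approach-A", "designer": "approach-A"}
split = {"pm": "approach-A", "backend": "approach-B"}

print(needs_devil_advocate(2, agree))                   # True
print(needs_devil_advocate(1, agree))                   # False
print(needs_devil_advocate(2, split))                   # False
print(needs_devil_advocate(1, split, "data_deletion"))  # True
```

A real implementation would need a softer notion of agreement than exact label equality, since roles rarely phrase the same position identically.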

The cost needs to be faced directly. Each additional role adds one independent agent invocation — roughly 30-60 extra seconds and a few thousand tokens. skeptic-owner is mandatory, adding one role to every review. devil-advocate is conditionally triggered, joining only when consensus forms. Together they increase each review’s cost by 15-25%.

Is it worth it? For a single-person tool, the extra cost buys at least one role that cares about “is the direction right,” not just “is the code right.” That trade is worthwhile. But if OPC were used in a high-frequency, low-risk scenario, say a lint pipeline running dozens of times daily, skeptic-owner’s direction challenges would be meaningless. The value of a role scales with the importance of the review.

Product Owner: The Human Doesn’t Leave

16 AI roles — security reviewer, backend, frontend, tester, skeptic-owner, devil-advocate, PM, designer, new-user advocate… the roster keeps growing. But one role isn’t AI: Owner.

Owner is me.

EP01’s lesson was already clear: AI executes plans, it doesn’t generate plans. Product direction comes from the Owner: the calendar should be a smart push feed rather than a month-view grid, and the education tool needs multi-tenant architecture rather than a single-user design. AI cannot make these decisions autonomously.

Every layer of the role system compensates for an AI blind spot: technical reviewers check code quality, skeptic-owner checks product direction, devil-advocate checks for counterarguments. But the judgment framework for all these roles — what constitutes good product direction, what technical trade-offs are acceptable — ultimately comes from the Owner’s settings.

The Owner doesn’t participate in every review. But the Owner defines each role’s behavioral boundaries. This is indirect control: not telling each role what to say, but telling them what to care about.

What Role Separation Can and Cannot Do

After two products, the role system genuinely changed review quality. Before, it was “multiple reviewers looking at the same thing” — everyone checking for code bugs. Now it’s “multiple perspectives looking at different things” — someone checks security, someone checks direction, someone’s job is to argue the opposing case.

But role separation has one problem it cannot solve: roles added more eyes, not more teeth.

skeptic-owner can raise “a failure of imagination,” but that finding is suggestion-level severity and doesn’t affect the gate. devil-advocate can point out “the most likely failure mode if this approach fails is X,” but that argument doesn’t automatically become a FAIL verdict. The role system improved the breadth and depth of review, but the ultimate enforcement mechanism — the gate — still runs on the same emoji counting rules.

If reviewers unanimously agree there are no critical issues, the gate PASSes. If skeptic-owner’s direction challenge is only marked as a suggestion, the gate still PASSes. The role system made reviews more comprehensive without changing the rules for “under what conditions code gets blocked.”
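The gate behavior described above can be reduced to a small sketch. It assumes findings carry a plain severity label and that only "critical" blocks; the actual gate counts emoji markers, which are not reproduced here, and the function and label names are hypothetical.

```python
# Assumption: only critical findings block. Severity names are illustrative.
BLOCKING = {"critical"}


def gate(findings):
    """findings is a list of (role, severity) pairs; FAIL only on a blocking severity."""
    return "FAIL" if any(severity in BLOCKING for _, severity in findings) else "PASS"


findings = [
    ("security", "none"),
    ("backend", "none"),
    ("skeptic-owner", "suggestion"),  # the failure-of-imagination finding
]

print(gate(findings))                               # PASS
print(gate(findings + [("security", "critical")]))  # FAIL
```

Adding roles changes what ends up in findings, but the verdict depends only on BLOCKING, which is the sense in which roles added eyes rather than teeth.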

EP08 confronts this limitation directly: after eight episodes, the gate’s FAIL path had never been triggered. Roles added more perspectives, but blocking never happened. Were the roles too soft, or was the code genuinely good enough?

When everyone agrees, it’s not a signal to proceed — it’s a signal to challenge. The Tenth Man isn’t necessarily right — but they ensure the opposing case has been seriously considered.
