Silicon Team S2E07: If You Can't See the Process, You Can't Trust the Result

Silicon Team S2E07

Over the past six episodes, OPC built products, added extensions, studied peers, screened ideas, and digested knowledge — the machine kept getting stronger. But one question had been shelved the entire time: can a human actually read this machine’s output?

From S1 through now, all OPC review results have been JSON files. Hundreds of JSON files sitting in ~/.opc/reports/ — each report containing the coordinator’s triage, each agent’s verdict, all findings with severity/file/line/issue/fix/reasoning, and the reasons for dismissed and downgraded findings.

The data was comprehensive. But nobody read it.

Not because the data wasn’t important — because the cognitive cost of reading JSON was too high. Open a 2,000-line JSON file, collapse to the top-level structure, find the findings array, expand the first element, read severity, file, message, then collapse back to find the second one. No narrative thread, no causality, no sense of “this finding led to that decision.” JSON is a format for machines, not for humans.

I didn’t read it myself. After a pipeline run, I’d glance at whether the gate said PASS or FAIL. PASS meant continue; FAIL meant go fix — but which specific agent said what, why the coordinator dismissed a finding, whether the security and backend reviewers’ opinions conflicted — all this information drowned in three levels of nested JSON.

S1E07 covered the value of mechanical gates: count, don’t judge. But the gate only gives a PASS/FAIL — it doesn’t explain the process. When the gate PASSes, you don’t know “it almost FAILed”; when it FAILs, you don’t know which of the 7 findings are real and which are misjudgments.

Observability isn’t optional decoration — it’s a prerequisite for trust.

From JSON to Conversation

opc-viewer’s core design decision was a metaphor: present the AI team’s review process as a Slack conversation.

Not a dashboard. Not a table. Not a log viewer. A conversation — with avatars, names, and chronologically ordered messages.

The reasoning is straightforward: a review is inherently a conversation. The coordinator assigns tasks, domain experts offer opinions, the coordinator synthesizes judgment, dismisses unreasonable findings, downgrades excessive severities, and delivers a verdict. Present this as a table and you see a flat list of findings; present it as a conversation and you see who said what, why they said it, and how it was handled afterwards.

Why not a dashboard? Because a dashboard assumes you know what to look for — those metrics, those charts, all presuppose questions the designer anticipated. The essence of a review is discovering problems you didn’t expect. A dashboard can show “critical findings: 3,” but it can’t show that the security reviewer’s concern and the frontend reviewer’s suggestion both point to the same API endpoint. The conversation format doesn’t presuppose questions; it presents what happened and lets the reader discover connections themselves.

In implementation, the frontend synthesizes a timeline from the raw JSON — coordinator initiates triage, dispatches agents, each agent replies with their verdict and findings, the coordinator validates (dismiss/downgrade), then synthesizes the final result. Each message is chronologically ordered, appearing with 60ms intervals of animation.

16 Virtual Team Members

The most “unnecessary” thing opc-viewer did: giving each agent role a virtual persona.

security is Sarah Chen, Security Engineer, with a red-orange gradient avatar. backend is Daniel Okafor, Backend Engineer, blue gradient. frontend is Maya Lin, Frontend Engineer, purple-pink gradient. new-user is Alex Rivera, First-time User, green gradient. pm is Priya Mehta, Product Manager, blue-purple gradient.

16 virtual personas, each with a name, title, initials, and color. Apple Contacts-style gradient circular avatars.

This looks purely decorative. But after a week of use, I found it changed how I read review reports.

Before, reading JSON: "role": "security", "verdict": "..." was just data. Now, reading a Slack-style replay: Sarah Chen said something, Daniel Okafor said something else — and while reading, I unconsciously started reading by person, not by finding. Reading by finding is checking items off a list; reading by person is listening to a team discussion. The former is faster; the latter makes you notice opinion conflicts between different roles — the security reviewer says “this endpoint must have auth,” the new-user advocate says “this flow is too complex.” The two findings are each correct in isolation; only when placed together do you realize they represent a trade-off.

Reading by person has another benefit: you can assess reasoning quality. The same critical finding, if backed by a three-step causal chain (“because A, therefore B, which allows C to be exploited”), is more credible than a bare assertion of “this is insecure.” In flat JSON, all criticals look equally serious. In conversation format, deep reasoning and shallow reasoning are immediately distinguishable — you naturally take Sarah Chen’s three-paragraph analysis more seriously and skim past the one-liner assertion.

Another benefit of virtual personas: memory anchors. “What did Sarah say last time in that auth report?” is far more memorable than “the security role’s finding #3 in report-2026-05-15-auth.json.” Human brains are naturally good at remembering people, not IDs.

Role-to-persona mapping supports prefix matching — new-user-genz falls back to the new-user persona. Unknown roles use a generic “Specialist” fallback. This ensures custom roles added by extensions don’t crash the UI.

Flow Replay: Frame-by-Frame Pipeline Playback

opc-viewer’s second core feature: visualizing the entire pipeline’s execution as a directed acyclic graph (DAG) with frame-by-frame playback.

On the left: the flow graph — each node’s status (pending/current/completed), connections between nodes, loopbacks (gate FAIL cycling back to the build node) marked with “LOOP.” On the right: the current node’s detailed timeline — which agents were dispatched, each agent’s verdict, the gate’s judgment result.

At the bottom: playback controls — play/pause, forward/backward, progress bar. Keyboard shortcuts: → next frame, ← previous frame, Space next frame, P play/pause, R restart.

Data comes from .harness/flow-state.json (flow metadata) and .harness/nodes/[nodeId]/handshake.json (each node’s verdict and findings). If the opc-harness replay command is available, additional template-filled data is pulled.

Flow replay solves this problem: when a pipeline has 14 nodes, you can’t tell from JSON the execution order, which gate triggered a loopback, or how the re-execution results after a loopback differ from the first run. Tables and logs are flat; pipelines have topological structure. Visualization must match the shape of the data.

A concrete example: in one pipeline run, the build node produced code, the gate node found two critical findings, and the pipeline automatically looped back — Claude modified the code based on the findings and re-entered the gate. The second review passed. But if you only look at the final JSON report, you see “PASS, 0 critical” — you don’t know the first pass had 2 criticals, you don’t know how Claude fixed them, you don’t know whether the fixes introduced new issues. Flow replay shows you both complete rounds: the first round’s findings, Claude’s fixes, and the second round’s results. Only by comparing both rounds can you judge the quality of the fix.

Two Views, One Purpose

opc-viewer provides two entry points:

Replay view (ChatView): Follow the chronological story of “how did this review happen” — how the coordinator divided work, how each agent responded, which findings were dismissed. Best for understanding the review process.

Summary view: Findings sorted by severity — critical on top, suggestion at bottom, with filters (All / Critical / Warning / Suggestion / Dismissed). Each finding card expands to show fix suggestions, reasoning, dismissal notes. Best for quickly scanning “what needs fixing.”

Same data, two reading modes. Both process and result need to be visible.

In practice, switching between the two views creates a natural workflow: start with the summary view for a quick triage — “does this review have any serious issues?” If everything is suggestions and dismissed findings, no need to dig deeper. If criticals or warnings appear, switch to the replay view for context — why didn’t the coordinator dismiss this finding? What did other reviewers say about the same file? Is this finding’s reasoning chain credible? The summary view is for quick triage; the replay view is for deep analysis.

Seeing Is Believing

opc-viewer’s technical details aren’t complex — React 19 + Vite + a Node server, reading JSON and rendering UI, Playwright for E2E verification. The codebase is modest relative to OPC itself.

But it solved a problem that had silently persisted throughout S1 and S2: the toolchain’s output was unreadable.

OPC’s mechanical gates guarantee “bad things don’t get through.” The extension system guarantees “domain capabilities are extensible.” But both guarantees are for the machine. For me — the sole human user — a guarantee only exists when I can see it. A PASS gate, if I don’t know it almost FAILed, means I don’t know this review’s quality was teetering on a cliff edge. A FAIL gate, if I don’t know 4 of the 7 findings were emoji parsing misjudgments (S2E08 will cover this bug), means I’ll think the problems are genuinely severe.

Observability changed my relationship with the tool — from “trusting PASS/FAIL” to “understanding why PASS/FAIL.” This isn’t an efficiency tool. This is a trust tool.

Trust isn’t a one-time event. Every pipeline run either builds or erodes trust. If across ten consecutive reviews you see the coordinator making reasonable dismissal decisions, your confidence in the system naturally grows. If in one review you spot the coordinator dismissing an obviously valid critical finding, you now know the system has a blind spot in that scenario — and that knowledge is far more valuable than an opaque PASS, because you’ve learned where the system’s boundaries lie.

If the toolchain’s output is unreadable, the output doesn’t exist.

Silicon Team S2: Evolving the Toolchain Through Real Products ← S2E06: Three Reviewers and Not One Noticed the Wrong Color | S2E08: Letting the Tool Audit Itself — and Finding It Never Blocked Anyone →