Touchskyer's Thinking Wall
Ch 6
20 min read
Engineering Field Notes

This chapter isn’t about methodology. It’s about incidents.

Of the 227 cards, at least half are records of things going wrong. Each one represents real time lost — anywhere from 20 minutes to 2 days. I’ve distilled the most universally applicable lessons and organized them into four themes.

These lessons share one trait: the surface symptom is several layers away from the root cause. A port conflict looks like an OAuth failure. A regex’s stateful behavior looks like “random” match loss. A build cache looks like “deployed but nothing changed.”

If you remember one thing from this chapter: never trust the first error message. Dig one layer deeper. Then dig again.


Git and Deployment Traps

You think Git and deployment are deterministic, but environment differences, hook side effects, and cache state make them probabilistic.

Port Conflict Causes Silent OAuth Failure

What happened: The OAuth login flow suddenly stopped working. No errors — the callback just never fired. Spent nearly two hours checking OAuth provider config, redirect URIs, cookie settings — all correct. Turned out another local process had grabbed the same port. The dev server silently fell back to a different port, but the OAuth redirect URI was hardcoded to the original port.

Why: OAuth redirect URI validation requires an exact match — the port number is part of the URL. Most dev servers (Vite, Next.js dev) auto-increment the port when it’s taken, but they don’t tell you what that means for OAuth. Worse, OAuth failure doesn’t throw an error — it redirects to a port nobody’s listening on, and then… nothing happens. Silent failure is the most expensive kind of failure.

Rule: After starting the dev server, always verify the actual port number. OAuth projects must include a port assertion in the startup script — wrong port, crash immediately, no fallback.

// Add this to your startup script — don't wait until the OAuth callback to discover the port is wrong.
// Note: when you pass an explicit port, Node's listen() emits EADDRINUSE instead of
// silently falling back, so fail loudly on that error too — it's the dev servers
// (Vite, Next.js dev) that auto-increment, and they need the assertion below.
const EXPECTED_PORT = parseInt(process.env.PORT || '3000', 10);
server.on('error', (err) => {
  if (err.code === 'EADDRINUSE') {
    console.error(`FATAL: Port ${EXPECTED_PORT} is already in use. OAuth will break.`);
    process.exit(1);
  }
  throw err;
});
server.listen(EXPECTED_PORT, () => {
  const actualPort = server.address().port;
  if (actualPort !== EXPECTED_PORT) {
    console.error(`FATAL: Expected port ${EXPECTED_PORT}, got ${actualPort}. OAuth will break.`);
    process.exit(1);
  }
});

Pre-push Hooks Cause SSH Timeout

What happened: git push to remote timed out randomly. Sometimes it worked, sometimes it hung for 30 seconds then disconnected. First instinct: network issue. But ssh -T git@github.com succeeded every time. Turned out the pre-push hook was running the full test suite, which exceeded the SSH idle timeout.

Why: Git hooks execute after the SSH connection for a push is established but before data transfer begins. If the hook takes too long, the connection can be killed by multiple parties: the SSH server’s ClientAliveInterval, intermediate network devices (routers/firewalls) doing idle connection cleanup, or the client’s own timeout settings. Local testing never triggers this because there’s no network layer involved.

Rule: Pre-push hooks must have a time limit. Any check that takes over 10 seconds belongs in CI, not in a hook. If you absolutely must run tests in the hook, configure ServerAliveInterval and ServerAliveCountMax in your ~/.ssh/config to keep the connection alive.
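The keepalive settings can be sketched concretely. This is an illustrative ~/.ssh/config fragment; the 30-second interval and probe count are example values, not from the source:

```
# ~/.ssh/config — keep long-running pushes alive with keepalive probes
Host github.com
  ServerAliveInterval 30   # send a keepalive probe after 30s of silence
  ServerAliveCountMax 4    # give up after 4 unanswered probes (~2 min total)
```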


The Danger of Checking Out main in a Worktree

What happened: Tried git checkout main inside a Git worktree. Git 2.5+ refuses this by default (fatal: 'main' is already checked out), but if you bypass the protection with --force or via git checkout <main's commit hash> (detached HEAD), the main repo’s working directory state gets corrupted. Two worktrees sharing the same ref trigger cascading failures — file lock conflicts, index inconsistencies, corrupted state. Recovery requires manually editing files under .git/worktrees/.

Why: Git worktrees are designed with the assumption that each worktree corresponds to an independent branch. Git has safeguards to prevent the same branch from being checked out by two worktrees, but --force bypasses this, and detached HEAD isn’t covered. Once two worktrees share the same ref, any commit from either side moves the ref and leaves the other side in an inconsistent state.

Rule: Only use feature branches inside worktrees. If you need to see code from main, use git show main:path/to/file or go back to the main repo. Don’t use --force to bypass checkout protection. Consider writing a post-checkout hook as a second line of defense.


Production Build Cache Trap

What happened: Deployed a new version, but production behavior didn’t change. Confirmed the deploy script completed, confirmed the new code was on the server, confirmed the process restarted. But users were still seeing the old version. Reason: the build tool (Vite, in this case) cache wasn’t cleared. The output bundle hash was identical to the previous build, so the CDN assumed nothing changed and didn’t purge.

Why: Modern build tools cache intermediate artifacts for speed. Most of the time this is the right optimization. But when you change config files, environment variables, or dependencies the build tool can’t track, the cache doesn’t invalidate. The sneakier case: the code did change, but after tree-shaking, the actual output chunk content is identical (because the changed code path was never imported) — hash doesn’t change, CDN doesn’t update.

Rule: Deploy scripts must include rm -rf dist/ .cache/ node_modules/.vite/ or equivalent. The build step in your CI/CD pipeline should be a clean build — from scratch every time. In production, correctness always wins over speed. (Exception: if your build time exceeds 10 minutes, use a content-addressable cache like Turborepo — but only if you understand what’s in the cache key and what isn’t.)
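A minimal clean-build prelude, assuming a Vite project layout (the cache paths are the common defaults; adjust for your tool):

```shell
#!/bin/sh
# Clean-build step for a deploy script: remove every cache the build tool
# might reuse before invoking the build, so production output is always
# built from scratch and the bundle hash reflects the actual code.
set -e
rm -rf dist/ .cache/ node_modules/.vite/
echo "build caches cleared"
```

Run this before `npm run build` (or your equivalent) in the CI build step, never only on the developer machine.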


npm Scoped Package Scope Must Match Exactly

What happened: Created an npm scoped package @myorg/tool-name. Local development worked fine. npm publish returned 403. Checked the npm token, 2FA, publishConfig in package.json — all correct. Turned out the scope name didn’t match the actual organization name on npm.

Why: npm’s scope system maps @scope/package to an organization or user namespace on the registry. This mapping is an exact match. Note: npm v7+ does lowercase normalization on scope names, so case issues are less of a problem on newer versions, but the spelling must still match the registry exactly. npm install doesn’t trigger permission checks locally (it’s a read operation) — only npm publish verifies your write access to the scope. The problem is completely invisible during development.

Rule: Before creating a scoped package, confirm the exact spelling of your scope name with npm org ls or npm whoami. The scope in package.json must match the registry character for character.


Git and deployment will burn you with invisible state. But at least you can reason about them mechanically. Testing failures are worse — they create a false sense of safety.

Testing and Verification Traps

The common thread here: tests passing doesn’t mean the code is correct. Tests themselves can lie to you.

Smoke Test False Positives

What happened: Ran smoke tests after deployment — all green. But users reported a core feature was broken. Investigation revealed the smoke test was checking “page returns 200” and “page contains a specific keyword.” The actual feature depended on a backend API that was down. The page was SSR’d — the 200 and the keyword existed in the SSR output, but when client-side hydration tried to call the API, it failed. Smoke test passed at the SSR layer, failed at the client layer.

Why: A smoke test’s purpose is “quickly confirm the system isn’t completely dead.” But the gap between “not completely dead” and “functioning correctly” is enormous. HTTP 200 just means the server didn’t crash. Keyword on the page just means the SSR template rendered. Real functional verification requires exercising the user’s critical path — click a button, submit a form, see the result.

Rule: Smoke tests must include at least one end-to-end functional check — not just HTTP status and static content. If full E2E isn’t feasible, at minimum hit the critical API endpoints and verify response shape.
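If full E2E is out of reach, even a shape check on the critical API response catches the "SSR fine, API dead" case above. A sketch — the `/api/session` endpoint and its fields are hypothetical:

```javascript
// Verify an API response actually has the shape client-side hydration depends on.
// HTTP 200 alone proves nothing — assert on the fields the client will read.
function hasShape(body, requiredFields) {
  return (
    body !== null &&
    typeof body === 'object' &&
    requiredFields.every((field) => field in body)
  );
}

// Example smoke check against a (hypothetical) critical endpoint:
async function smokeCheck(baseUrl) {
  const res = await fetch(`${baseUrl}/api/session`); // endpoint name is an assumption
  if (!res.ok) throw new Error(`smoke: /api/session returned ${res.status}`);
  const body = await res.json();
  if (!hasShape(body, ['userId', 'features'])) {
    throw new Error('smoke: /api/session response missing required fields');
  }
}
```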


Mocks Lie: The Gap Between Unit Tests and Integration Tests

What happened: Refactored a data processing pipeline. Unit tests for each stage passed. After deployment, the entire pipeline output was wrong. Reason: stage A’s output schema changed, but stage B’s unit test mocked stage A’s output using the old schema. Each stage looked “correct” in isolation, but chained together, everything broke.

Why: Unit testing’s core assumption is “if every part is correct, the whole is correct.” That assumption is wrong when the interface contracts between parts aren’t explicitly verified. A mock is a frozen assumption — it says “I assume upstream looks like this.” But once upstream changes, the mock doesn’t auto-update. Tests stay green while the real system is already broken.

What we ended up doing: mock only the storage layer, run everything else for real. The DB uses an in-memory implementation (e.g., SQLite in-memory instead of PostgreSQL), but API handlers, middleware, validation, and serialization all run for real. This is practical for small-to-medium services; at larger scale you’ll need more selective mocking — but the principle holds. This approach caught 3 bugs the unit tests missed: middleware ordering issues, validation schema mismatches, and serialization dropping fields.

Rule: Any pipeline-style architecture must have at least one integration test that runs real data through the full pipeline. Use mocks only in unit tests to isolate the heaviest external dependencies (DB, third-party APIs) — not at component interfaces. If you mock an upstream component’s output, you’re responsible for updating the mock when upstream changes — but the better move is to not mock it at all.
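A sketch of the "mock only the storage layer" idea in miniature — the two stages, their schema, and the in-memory store are invented for illustration:

```javascript
// Integration-test style: stages run for real, only storage is faked.
// Stage A's real output feeds stage B, so a schema change in A breaks this
// test immediately — exactly what per-stage mocks fail to catch.
const store = new Map(); // in-memory stand-in for the database

function stageA(raw) {
  // Real transformation — if this output schema changes, stageB sees it.
  return { id: raw.id, total: raw.items.reduce((sum, n) => sum + n, 0) };
}

function stageB(record) {
  // Real persistence logic, running against the fake store.
  store.set(record.id, { ...record, savedAt: 'test-run' });
  return store.get(record.id);
}

function runPipeline(raw) {
  return stageB(stageA(raw)); // real data flows through the full chain
}
```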


Runtime Spy to Verify Test Honesty

What happened: Found a set of tests that passed, but the coverage report showed the corresponding code lines were never executed. On closer inspection, the tests were asserting against mock return values — they weren’t testing any real code, just testing their own setup. 100% test pass rate, 0% actual coverage.

// This test is fake — it's testing the mock, not the code
jest.mock('./userService');
test('getUser returns user', async () => {
  getUserById.mockResolvedValue({ id: 1, name: 'Alice' });
  const user = await getUserById(1); // calling the mock, not the real function
  expect(user.name).toBe('Alice');   // always passes, tests nothing
});

Why: This is the classic degeneration of mock-heavy testing. A developer writes the mock first, then writes assertions that verify the mock’s behavior — completely bypassing the code under test. These tests never fail (because mocks always return what they’re told to), creating a false sense of safety while actually verifying nothing.

Rule: Periodically use runtime coverage tools (like jest --coverage or c8) to check whether tests actually execute the code under test. If a test file’s coverage is near 0%, either the test is fake or the code under test has been deleted. Both cases need action.


Delete the Component, Delete the Test

What happened: Deleted a deprecated React component. CI broke: test file was importing a nonexistent module. The fix was simple — delete the test file. But this exposed a deeper problem: the codebase had dozens of test files whose corresponding components had been deleted or renamed. Those tests had been silently skipped all along.

Why: How a test runner handles a test file that fails to compile depends on its configuration. In some setups (e.g., the test file glob no longer matches after source files are renamed), the test is silently excluded from the execution list. Result: you think the test is running, but it hasn’t existed for months. Jest’s --passWithNoTests flag addresses the “don’t error when no test files match” case, but the more common trap is drift between the test file matching rules and the source file naming conventions.

Rule: When you delete a component, you must also delete the corresponding test file and story file. In CI, periodically run a “test file to source file correspondence check” — every .test.ts should have a corresponding source file that exists.
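The correspondence check can be a few lines of script. A sketch of the core logic (file discovery via your CI's glob tool of choice is left out; the naming convention assumed here is `Foo.test.ts` next to `Foo.ts`):

```javascript
// Given the repo's test files and source files, report test files whose
// corresponding source no longer exists — these are the tests that have
// been silently skipped, or are pure dead weight importing nothing real.
function orphanedTests(testFiles, sourceFiles) {
  const sources = new Set(sourceFiles);
  return testFiles.filter((testFile) => {
    const source = testFile.replace(/\.test\.(ts|tsx|js|jsx)$/, '.$1');
    return !sources.has(source);
  });
}
```

Fail the CI job if the returned list is non-empty, the same way you'd fail on a lint error.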


Testing traps are about misplaced trust. Frontend traps are about misplaced assumptions — you assume the platform works one way, and the spec says otherwise.

Frontend Traps

What makes frontend bugs uniquely frustrating: they’re often spec-compliant behavior — it’s not broken, it was designed that way. You just didn’t know.

Astro Slot Content Ends Up Outside body

What happened: Built a layout component in Astro that uses slots for page content. After rendering, some content appeared after the </body> tag. The browser rendered it anyway (browsers have astonishing tolerance for malformed HTML), but JS hydration failed because the DOM structure didn’t match expectations.

Why: Astro slots (in static build mode) do string insertion at build time, not DOM manipulation. If your layout component’s HTML structure is incomplete — say an unclosed <div> — slot content gets inserted in the wrong place. The sneakier case: if the slot content itself contains certain HTML elements (like a <tr> not wrapped in a <table>), the browser’s HTML parser will “fix” the DOM structure by moving elements to where it thinks they’re legal — which might be outside <body>.

Rule: Astro layout HTML must be valid and fully closed. Slot content must not contain “naked” table elements. Run an HTML validator against the build output before shipping. If content appears outside <body>, check the HTML structure first — don’t go debugging the Astro framework.


JS Global Regex lastIndex Trap

What happened: A regex with the g flag used in a loop — first match succeeds, second call with the same input returns null. Third call succeeds again. Behavior looks completely random, but it’s actually deterministic — you just don’t know its state.

const pattern = /foo/g;

pattern.test('foobar'); // true  — lastIndex becomes 3
pattern.test('foobar'); // false — starts from index 3, no match, lastIndex resets to 0
pattern.test('foobar'); // true  — starts from 0 again, and so on

Why: JavaScript RegExp objects with the g (global) or y (sticky) flag are stateful. After each exec() or test() call, the regex object updates its lastIndex property, and the next call starts matching from lastIndex. This is correct behavior per the ECMAScript spec.

Rule: If you don’t need to find multiple consecutive matches within the same string, don’t use the g flag. If you must use g, manually reset lastIndex = 0 before each use, or create a new RegExp instance every time.


Vite Proxy URL Matching Must Use Parsed Pathname

What happened: Configured Vite’s dev server proxy using string startsWith to match URL paths for request forwarding. Most requests worked fine, but one request’s URL contained a query parameter whose value included an encoded / (%2F), which broke the string match and prevented the request from being forwarded.

// ❌ Wrong: string matching against raw URL
if (req.url.startsWith('/api')) { proxy(req); }

// ✅ Right: parse first, then match on pathname
const { pathname } = new URL(req.url, 'http://localhost');
if (pathname.startsWith('/api')) { proxy(req); }

Why: A URL is structured data, not a plain string. Using string operations (startsWith, includes, regex) to match the pathname portion of a URL will be derailed by query strings, fragments, and encoded characters. The correct approach is to parse the URL first, then match against the parsed pathname. This isn’t just a correctness issue — it’s a security issue. ?path=%2Fapi%2Fhack can fool string matching.

Rule: Any URL matching logic must first parse using new URL() or an equivalent parser, then match against .pathname. Never do startsWith or regex matching on a raw URL string.


Frontend traps at least confine their damage to the browser. Architecture traps propagate across the entire system — and they’re the hardest to diagnose because the symptom and the cause live in different layers.

Architecture and Engineering Practice Traps

Every lesson in this section shares a pattern: architecture problems surface as far from the root cause as possible.

Auth and Data Access Must Be Separated

What happened: An API handler had auth checks and data queries mixed together. The code roughly did: query data → check if the current user has permission to access this record → return. Looks reasonable. The problem: when the data doesn’t exist, the code returns 403 (because it falls into the “no permission” branch) instead of 404. An attacker can probe which resource IDs exist by differentiating 403 vs 404 responses.

// ❌ auth and data access mixed — information leakage
async function getResource(userId, resourceId) {
  const resource = await db.find(resourceId);
  if (!resource || resource.ownerId !== userId) return 403; // nonexistent also returns 403
}

// ✅ separated — consistent behavior
async function getResource(userId, resourceId) {
  const resource = await db.find(resourceId);
  if (!resource) return 404;                    // doesn't exist means doesn't exist
  if (resource.ownerId !== userId) return 403;  // no permission means no permission
}

Why: When auth and data access are mixed, error handling semantics get entangled. “User has no permission” and “resource doesn’t exist” become different branches in the same code block, and branch ordering determines externally observable behavior. This is an information leakage problem (OWASP has a dedicated entry for it). At a deeper level, it’s a separation of concerns issue — auth is “who can do what,” data access is “where is the thing.” These are two independent questions.

Rule: Auth checks must happen before data access, as an independent middleware or guard. If you need content-based auth (“only the owner can edit”), query the data first, then check with a dedicated auth function — don’t mix it into the same if-else. (Exception: in certain security scenarios, you intentionally unify 403/404 responses to prevent timing attacks — but that should be a conscious security design decision, not an accidental code structure.)


Inconsistent Sandbox Fallback

What happened: The system had a sandbox mode (for dev/test) that fell back to local mocks when external services were unavailable. Problem: different modules had different understandings of “sandbox mode.” Module A enabled sandbox when NODE_ENV=development. Module B enabled it when SANDBOX=true. Module C auto-fell back when the external service connection failed. Result: in the same environment, some modules used real services while others used mocks. Behavior was unpredictable.

Why: Sandbox/fallback is a cross-cutting concern, but it’s typically implemented independently by each module. Without a unified “system-level sandbox state,” each module decides on its own when to fall back. This creates inconsistency: you think you’re testing real integrations, but half is real and half is fake. Worse, the inconsistency is invisible — no logs tell you “module C is using mocks.”

Rule: Sandbox state must be global, explicit, and controlled by a single source. One environment variable (e.g., SANDBOX_MODE=full|partial|off), all modules read the same variable. Every module logs its sandbox state at startup. No “automatic fallback” — either use the real thing or the fake thing, but it must be a deliberate choice.
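A sketch of the single-source switch — the variable name and modes follow the rule above; the startup log format is an assumption:

```javascript
// One environment variable, read once, validated once. Every module imports
// this instead of inventing its own fallback heuristic.
const VALID_MODES = ['full', 'partial', 'off'];

function readSandboxMode(env) {
  const mode = env.SANDBOX_MODE ?? 'off';
  if (!VALID_MODES.includes(mode)) {
    // Deliberately loud: a typo in the variable must not silently mean "real services".
    throw new Error(`Invalid SANDBOX_MODE "${mode}", expected one of: ${VALID_MODES.join(', ')}`);
  }
  return mode;
}

// Each module announces its state at startup, so inconsistency becomes visible.
function logSandboxState(moduleName, mode) {
  console.error(`[startup] ${moduleName}: sandbox=${mode}`);
}
```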


Error Surface Is Far from Root Cause

What happened: User reports “save button doesn’t respond.” Frontend shows no error (because a catch block swallowed it), network request returns 500. Backend logs show a database constraint violation. Root cause: a migration from two weeks ago was never run in production — a new NOT NULL column was added in code but not in the database.

What the user saw: button didn’t work. Root cause: migration wasn’t run. In between: frontend error handling → API response → backend exception → ORM error → database constraint. Five layers.

Why: Modern applications have too many abstraction layers. Each layer has its own error handling strategy, and most strategies are “catch it, log it, return a generic error.” With each layer, root cause information degrades. By the time it reaches the user, all that’s left is “doesn’t work.”

Rule: Error messages must carry a correlation ID that spans all layers. The frontend error toast must include the request ID. The backend error response must include an error code (not a message — a machine-readable code). Each layer’s job is to translate and propagate errors, not swallow them.
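A sketch of "translate and propagate": each layer wraps the error with its own context but keeps the machine-readable code and request ID intact (the field names here are assumptions, not a standard):

```javascript
// Wrap an error without destroying what the layers below attached.
// `code` is machine-readable, `requestId` spans all layers, and `cause`
// preserves the original for the logs.
function translateError(err, { code, requestId, layer }) {
  const wrapped = new Error(`[${layer}] ${err.message}`);
  wrapped.code = code ?? err.code;                // keep the deepest code if none given
  wrapped.requestId = requestId ?? err.requestId; // propagate the correlation ID
  wrapped.cause = err;                            // original error survives for logging
  return wrapped;
}
```

The API response then serializes only `code` and `requestId`; the frontend toast shows the `requestId`, so a user report can be joined to the backend logs in one query.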


Dev and Prod API Handler Drift

What happened: Everything worked in the local dev environment. After deploying to production, API behavior was different. Not a bug — dev and prod were running different versions of the handler. Dev was using src/api/handler.dev.ts, prod was using src/api/handler.ts. Six months earlier, someone created the dev version for debugging, added some logging and mocks, and the build config excluded .dev.ts files from production builds. Six months later, the dev handler had a dozen feature patches while the prod handler was still the six-month-old version.

Why: Any “environment-specific code branching” will drift. This is inevitable because developers test in dev, so they naturally fix bugs and add features to the dev version. You only discover prod is behind when you deploy — if you’re lucky.

Rule: No .dev.ts / .prod.ts file naming conventions. Environment differences are injected through configuration (environment variables), not through separate code files. If some logic is only needed in dev (like debug logging), guard it with if (isDev), but the code must live in the same file and go through the same review process.
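The same-file pattern in miniature (a sketch; the handler and its debug line are invented for illustration):

```javascript
// One file, one handler — the environment difference is a runtime flag,
// not a separate .dev.ts copy that can drift for six months.
const isDev = process.env.NODE_ENV !== 'production';

function handler(req) {
  if (isDev) {
    console.error(`[debug] handling ${req.path}`); // dev-only diagnostics
  }
  return { ok: true, path: req.path }; // shared logic, reviewed once
}
```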


CLI stdout Must Be Clean

What happened: Built a CLI tool that outputs JSON to stdout for consumption by other scripts via pipe. One day the downstream script errored: Unexpected token in JSON. Reason: the CLI tool printed a warning line at startup (from a dependency), and that warning mixed into stdout alongside the JSON, breaking the JSON structure.

Why: The Unix design philosophy is stdout for data, stderr for diagnostics. But too many tools and libraries don’t follow this — they dump warnings, progress bars, deprecation notices to stdout. When your CLI outputs structured data, any unexpected stdout content corrupts the output. And this problem is intermittent — it only appears when specific conditions trigger the warning.

Rule: CLI stdout should only contain business data. All diagnostic output (warnings, logs, progress) goes to stderr. The default behavior should be clean stdout — a --quiet flag isn’t enough because it requires the caller to know about and remember to use it. If a dependency you rely on dumps garbage to stdout, redirect its output to stderr at startup.
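A sketch of the redirect trick: before anything else runs, reroute console noise to stderr so stdout carries only your data (this assumes the noisy dependency prints via `console.log`; dependencies that write to `process.stdout` directly need the stream itself wrapped):

```javascript
// Run this before importing noisy dependencies: anything they print via
// console.log lands on stderr, leaving stdout clean for machine-readable output.
console.log = (...args) => console.error(...args);

// Business data goes to stdout explicitly — and nothing else does.
function emit(data) {
  process.stdout.write(JSON.stringify(data) + '\n');
}
```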


Summary: The Structure of Lessons

Looking back at these lessons, they aren’t 17 isolated pitfalls. They’re different instances of four failure modes:

  1. Silent failure is the most expensive. Port conflicts that don’t error, tests that are silently skipped, errors swallowed by catch blocks — you don’t know it’s broken, so you don’t fix it.
  2. State is a breeding ground for bugs. Regex’s lastIndex, build tool caches, SSH connection timeout timers — any implicit state will bite you eventually.
  3. The distance between surface symptom and root cause is proportional to system complexity. More layers, more painful debugging.
  4. “Different behavior in different environments” is the hardest category of problems to debug. Because you can’t reproduce one environment’s bug in another environment.

There’s a paradox worth reflecting on: AI excels at eliminating known categories of errors — give it a linting rule, give it a checklist, and it’s ten times more reliable than a human. But almost every lesson above is one that only hurts the first time you encounter it. This kind of “creative friction” — hitting walls in unknown territory, developing intuition, growing judgment — is precisely what AI can’t replace. Efficiency and friction aren’t opposites: efficiency without friction just means running faster in the wrong direction.

What 227 cards taught me boils down to one sentence: make your system scream when something goes wrong, not stay silent. Fail loud, fail fast, fail with context. Every time you choose “catch it, return a default” instead of “throw with context,” you’re burying a landmine for your future self.

Landmines don’t go away. They just wait for the worst possible moment to explode.
