Pivot Moments
The cuts that mattered. The roadblocks I hit. The places I changed my mind about what blacklight had to be.
A 1000-line bash Runner backed by a Managed Agents curator did not fall out of one design session. The shape settled across four pivots and one late-stage roadblock that nearly derailed the build. None of these were planned. Each one cut something I thought was holding the design up and replaced it with something simpler.
I'm writing this with the codebase still warm. The opinions here belong to the developer who built it; they aren't retrospective polish.
Pivot 1: I cut the hunters
I started where most agentic-IR designs start: a curator agent for high-level reasoning, plus a fleet of "hunter" agents (cheaper Sonnet model) dispatched per evidence stream. Hunters would parallelize across Apache logs, ModSec audits, filesystem surveys; they'd return compressed summaries to the curator for cross-stream correlation. Standard multi-agent fan-out. I had the dispatch graph drawn.
The first end-to-end run with that shape was the moment I knew it was wrong. A hunter reading Apache transfer records alone could not see the ModSec rule that fired on the same client IP twenty seconds later. By the time the hunter's summary reached the curator, the temporal correlation was already a heuristic, not a fact. I was rebuilding the chunked-context retrieval problem inside the multi-agent graph.
I ripped out the hunters. One curator. 1M context. The full evidence bundle goes in directly. Cross-stream correlation became a property of the context window, not a property of how I serialized hunter summaries. The Sonnet path stayed in the codebase, but only for one-off messages.create calls outside the Managed Agents surface, deferred to future work.
What it cost: concurrent investigation across many cases. blacklight today is one curator session per case. What it bought: the correlation guarantee I wanted, and a simpler workspace bootstrap with one agent record instead of N.
Pivot 2: Three agents collapsed into one
The next mistake was symmetric. I had separate Managed Agent records for each task: bl-curator for investigation, bl-synthesizer for authoring defensive payloads, bl-intent-reconstructor for analyzing samples. Three agents, three system prompts, three sets of tool bindings. Each one tuned to its narrow surface.
I shipped that and tried it on a synthetic case. Watching the traces, I realized the synthesizer was spending most of its first turn re-establishing context the curator already had: the case hypothesis, the recent evidence, the correlation that justified writing a rule in the first place. I was paying tokens to serialize state out of one session, and tokens again to reconstruct it in another. The specificity gains from per-agent prompt tuning were nowhere near worth the round-trip cost.
I collapsed the three records into one. The "synthesizer" and "intent reconstructor" became custom tools the curator invokes when ready (synthesize_defense, reconstruct_intent). They specialize the curator's emit surface, not its reasoning. The Runner still runs the sandbox-side validation (apachectl -t for ModSec rules, false-positive corpus scan for signatures, CDN safe-list for firewall blocks) but the act of authoring the payload happens inside the same session as the case investigation.
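As a rough illustration of that Runner-side gate, here is what the validation dispatch might look like. This is a sketch only: the function name, the staged-include trick, and the corpus and safe-list paths are assumptions, not blacklight's actual layout.

```bash
# Illustrative Runner-side validation gate. Paths and staging details are
# assumptions; the real Runner's layout may differ.
validate_payload() {
  local kind="$1" payload="$2"
  case "$kind" in
    modsec_rule)
      # Stage the rule into a throwaway include, let Apache parse the whole
      # config, then remove the staged file regardless of the outcome.
      local staged=/etc/httpd/conf.d/zz-blacklight-staged.conf
      install -m 0644 "$payload" "$staged"
      apachectl -t
      local rc=$?
      rm -f "$staged"
      return "$rc"
      ;;
    signature)
      # One regex pattern per line; any hit against the known-clean corpus
      # means the signature is too broad to ship.
      ! grep -rEq -f "$payload" /var/lib/blacklight/fp-corpus/
      ;;
    firewall_block)
      # Never emit a block for an address on the CDN safe-list.
      ! grep -qxF "$(cat "$payload")" /etc/blacklight/cdn-safelist.txt
      ;;
    *)
      echo "unknown payload kind: $kind" >&2
      return 2
      ;;
  esac
}
```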
This was the pivot I was most reluctant to make. Three agents felt cleaner architecturally. One agent was right.
Pivot 3: I gave up on synchronous tool-use
I built the first agent loop synchronously, the way every messages-API client works: open an SSE stream, agent emits tool_use events, Runner consumes events live, replies with tool_result. Standard.
It didn't survive contact with a bash Runner. SSE in bash is a fragile thing: reconnect logic, partial-event recovery, and multi-turn state across bl invocations all wanted to be a Python or Go client. Worse, the operator workflow this had to support is not synchronous. An operator runs bl observe in the morning. Pulls more evidence in the afternoon. Comes back the next day to apply the cleanup. SSE assumes the Runner is the same process holding the same socket. It isn't.
I rewrote the loop async. The agent writes proposed steps to memory store paths; the Runner polls those paths on a 3-second tick, dedupes against what it's already seen, executes, writes results back. Two consumption modes: continuous polling for foreground REPL, on-demand single-fetch for batched workflows.
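A minimal sketch of that tick, assuming hypothetical mem_list, mem_get, and mem_put wrappers around the memory-store API and an execute_step helper standing in for the Runner's step dispatch; the pending/results layout and the MODE switch are illustrative, not the real blacklight paths.

```bash
# Sketch of the async Runner loop. Helper names and memory-store paths are
# illustrative assumptions.
SEEN_FILE="$CASE_DIR/steps.seen"
touch "$SEEN_FILE"

poll_once() {
  # 1. Fetch step documents the curator has written under the pending prefix.
  mem_list "$STORE_ID" "/cases/$CASE_ID/steps/pending/" |
    jq -r '.data[].path' |
    while read -r step_path; do
      step_id="${step_path##*/}"
      # 2. Dedupe against steps this or any earlier invocation already ran.
      grep -qxF "$step_id" "$SEEN_FILE" && continue
      echo "$step_id" >> "$SEEN_FILE"
      # 3. Execute the proposed step and write the result back for the curator.
      mem_get "$STORE_ID" "$step_path" | execute_step |
        mem_put "$STORE_ID" "/cases/$CASE_ID/steps/results/$step_id"
    done
}

case "$MODE" in
  follow) while true; do poll_once; sleep 3; done ;;  # foreground REPL tick
  once)   poll_once ;;                                # on-demand batched fetch
esac
```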
This pivot bought three things I hadn't expected. The Runner stays short-lived per invocation, so there's no daemon to manage. The case memory becomes a self-documenting audit log; every step the agent emitted, every result the Runner wrote, is readable by cat. And cross-day resumption is free: Tuesday's bl run reads the same pending queue Monday's bl consult populated, with no session continuation and no re-prompt.
Polling latency is invisible against agent reasoning time. A typical curator turn on a realistic bundle is 8-40 seconds; a 3-second poll tick disappears into that.
Pivot 4: The "thinking" pitch died at the API
Early framing pitched Opus 4.7's thinking as a distinguishing feature. The README said "Opus 4.7 thinking on a 1M-context evidence bundle." The system prompt referenced thinking depth. I had positioning slides written around it.
POST /v1/agents rejected the field. 400 invalid_request_error: thinking is not a known input. Same with output_config. Reasoning on the Managed Agents surface is model-internal; it isn't operator-configurable. The platform's SSE stream surfaces reasoning content via dedicated event types at runtime, but I cannot set thinking depth from the client. It isn't a parameter I have access to.
That killed the pitch. I rewrote the README, the design doc, and the system prompt to drop the framing, then sat with the project for an evening to figure out what the actual three pillars were. They turned out to be:
- Full evidence bundle in one context. No chunking, no retrieval; correlation as a property of the window.
- Persistent case state across days. The curator already holds the hypothesis next time you call.
- Three-tier model routing. Opus where reasoning matters, Sonnet for step execution, Haiku for the false-positive gate.
These were what blacklight actually depended on. The "thinking" line was decoration on top of them. Killing the decoration made the pitch sharper, not weaker.
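The third of those pillars reduces to a few lines in the Runner. A sketch, with tier names and environment variables that are illustrative rather than blacklight's actual identifiers:

```bash
# Three-tier model routing sketch. Tier names and env vars are illustrative.
model_for() {
  case "$1" in
    curator) echo "${BL_MODEL_OPUS:?set BL_MODEL_OPUS}"     ;;  # cross-stream reasoning, payload authoring
    step)    echo "${BL_MODEL_SONNET:?set BL_MODEL_SONNET}" ;;  # one-off step execution
    fp_gate) echo "${BL_MODEL_HAIKU:?set BL_MODEL_HAIKU}"   ;;  # cheap false-positive screen
    *)       echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```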
The roadblock: Managed Agents surface migrated mid-build
This wasn't a pivot. It was a wall I walked into.
I had been building against the managed-agents-2026-04-01 beta from the first commit. Memory store calls. Agent CRUD. Workspace seeding. Most of the work was done. Then one evening, my mock test suite stayed green while a live-trace run failed with 4xx responses I hadn't seen before.
The beta surface had changed underneath me. Memory stores no longer used key and key_prefix; they used path and path_prefix with leading slashes. The PATCH-with-if_content_sha256 semantics had been removed in favor of delete-then-post last-write-wins. The server-side ?name= filter on /v1/agents was gone, so name filtering had to happen client-side. I had memory-store calls in nine source files. Every one of them broke.
I had two paths. Patch each call site. Or build the adapter layer I should have built from the first commit, route every call through it, and let the migration become one diff next time.
I built the adapter. Five new functions wrapped every memory-store operation. Every call site got rewritten to use them. While I was in there, I caught a quieter bug: the response-body trace log had been silently writing to the wrong file path for who knows how many commits, so my cost-cap log was empty. Fixed that too.
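The shape of such an adapter, sketched under assumptions: the function names, endpoint paths, and headers here are stand-ins, not the actual five functions or the documented surface, but the delete-then-post write and path-prefix listing follow the behavior described above.

```bash
# Hypothetical memory-store adapter: every call site goes through these, so the
# next surface drift is a one-file diff. Endpoints and headers are assumptions.
: "${BL_API:=https://api.anthropic.com}"

_mem_curl() {   # auth + beta header live in exactly one place
  curl -fsS \
    -H "x-api-key: ${ANTHROPIC_API_KEY:?}" \
    -H "anthropic-beta: managed-agents-2026-04-01" \
    "$@"
}

mem_get() {     # fetch a single document by absolute path
  _mem_curl "${BL_API}/v1/memory_stores/${1}/documents?path=${2}"
}

mem_list() {    # list documents under a path prefix (leading slash required)
  _mem_curl "${BL_API}/v1/memory_stores/${1}/documents?path_prefix=${2}"
}

mem_delete() {  # remove a document; swallow 404s so deletes are idempotent
  _mem_curl -X DELETE "${BL_API}/v1/memory_stores/${1}/documents?path=${2}" || true
}

mem_put() {     # overwrite on the new surface: delete, then post (stdin = content)
  mem_delete "$1" "$2" >/dev/null
  jq -Rs --arg path "$2" '{path: $path, content: .}' \
    | _mem_curl -X POST "${BL_API}/v1/memory_stores/${1}/documents" \
        -H 'content-type: application/json' --data-binary @-
}
```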
When I finished, I reran the live trace. Most of it worked. A scene late in the trace still failed because the curator session-creation surface had drifted in a similar way and I had to chase that separately. But the cost of the migration after the adapter went in was contained. The next surface drift will be one file.
The lesson I took out of this is simple. Mocks lie at the boundary. They tell you your code is right; they don't tell you the surface is still there. Live verification belongs in the cadence, not at the end.
What I'd do differently
If I started over tomorrow:
Adapter layer on day one. Even when it feels redundant. The cost of the abstraction is one file; the cost of skipping it is the migration above.
Live verification per checkpoint, not per release. I caught the API drift weeks after it had landed. With a live trace at every meaningful checkpoint I would have caught it the day it happened, while the relevant code was still warm.
Adversarial corpus before the defensive grammar, not after. I wrote the curator system prompt's untrusted-content fence taxonomy first, then assembled an injection corpus to verify it. The right order is the other way: assemble the corpus from public injection examples, design the taxonomy to cover every class in it, then write the prompt. I got there in two passes when one would have done it.
Skill bundle as its own repo. Skills should ship at a different cadence than the Runner. Right now adding a defensive domain bumps the Runner version, which is wrong. The cleaner shape is a separate rfxn/blacklight-skills repo, surface-versioned, referenced by the Runner at setup. Future work.
What stayed unchanged the whole way through
Three constraints anchored every other decision. They were in the first commit and they're in the last commit:
- bash 4.1 or newer, plus curl and jq. No Python on the host. No daemons. No services.
- Quarantine, not delete. No clean operation has ever unlinked a file in any commit.
- The substrate stays untouched. No new rule engine, no new manifest format, no new wire format. ModSec, APF, iptables, LMD, used natively, every time.
Every time a feature wanted to break one of these, I cut the feature.