Harness Engineering: Putting Reins on Your Agent

Starting with a Real Incident

Last week I was using Claude Code to modify the deployment script for my blog project. The deployment logic was straightforward: build the frontend, initialize git, force push to GitHub Pages. I had written it clearly in CLAUDE.md: "Deployment only operates on the site/dist directory, don't touch other files."

Claude read it, acknowledged it, and then ran git init with the working directory set to the project root instead of the dist directory. By the time I saw git add -A adding every single file in the project, I had broken out in a cold sweat. Luckily I interrupted manually before the push, or the force push would have overwritten the entire repository history.

After that, I bolded the rule, increased the font size, and added exclamation marks. On the next deployment it made a different but equally absurd mistake: it ran rm -rf .git on Windows, where that command isn't available in the default shell, and the deployment blew up.

That made me seriously start thinking about one question: Do rules written in markdown actually count as rules?

The Whole Industry Is Thinking About the Same Thing

Recently the term "Harness Engineering" has suddenly caught fire. OpenAI published an article about three engineers using Codex to write a million lines of code from scratch in five months—not a single line hand-written. A question on Zhihu sparked several long-form essays: someone proposed a five-layer architecture, someone argued from cybernetics that this is the third closed-loop, and someone said the essence is "migrating control from conversations to the runtime."

From different angles, everyone is saying the same thing: Agents can't run naked.

This resonates completely with my own experience. My blog project isn't big—a few thousand lines of code—but even on a project this small, I went through the full evolution from "vibe coding all the way" to "I need to build something to rein it in."

My Three-Layer Harness

Some people built five layers—boundary layer, memory layer, handoff layer, cognition layer, skill layer. That's the scale of a serious B2B SaaS project. My project is much smaller, so my Harness is only three layers, but the thinking is the same: turn constraints from "reminders" into "physical boundaries."

Layer One: Files as Rules

At first, my approach to CLAUDE.md was pretty raw—I wrote down whatever came to mind, and the rules kept piling up until it was hundreds of lines. Claude would read it, but reading doesn't mean following. Once the context got long or the task got urgent, rules at the bottom might as well not exist.

Later I made an important reorganization: I turned CLAUDE.md from an "encyclopedia" into a "navigation map." I only wrote the most critical constraints and pointers, and moved detailed information into dedicated files—blog validation rules went into .claude/commands/check-post.md, deployment processes went into standalone scripts under scripts/. CLAUDE.md became just an index and quick reference.

This is exactly what the OpenAI team did. They tried writing a massive AGENTS.md and found that too much information equals no information: the Agent started doing local pattern matching instead of truly understanding the constraints. They later cut AGENTS.md down to roughly 100 lines, using it only as a "directory," with deep knowledge placed in the docs/ directory for on-demand loading.
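One way to keep the map from growing back into an encyclopedia is to check it mechanically. The sketch below is only an illustration under assumptions: the function name checkMap, the 100-line budget, and the convention that pointers are backticked relative paths ending in .md or .js are all hypothetical, not something either team describes.

```javascript
const fs = require("fs");
const path = require("path");

// Assumed budget, borrowed from the "roughly 100 lines" figure above.
const MAX_LINES = 100;

// Returns a list of problems found in an index file such as CLAUDE.md.
// `markdown` is the file content, `rootDir` is the project root.
function checkMap(markdown, rootDir, maxLines = MAX_LINES) {
  const problems = [];

  // Budget check: an index that keeps growing stops being an index.
  const lineCount = markdown.trimEnd().split("\n").length;
  if (lineCount > maxLines) {
    problems.push(`too long: ${lineCount} lines (limit ${maxLines})`);
  }

  // Pointer check: assume pointers are written as `relative/path.md` or `.js`.
  const pointers = markdown.match(/`[\w.\-\/]+\.(?:md|js)`/g) || [];
  for (const raw of pointers) {
    const rel = raw.slice(1, -1); // strip the surrounding backticks
    if (!fs.existsSync(path.join(rootDir, rel))) {
      problems.push(`missing pointer: ${rel}`);
    }
  }
  return problems;
}
```

Run from a pre-commit hook or CI job, a check like this fails as soon as the index passes its budget or points at a file that has been moved.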

A map, not an encyclopedia. That phrase is worth tattooing on your forehead.

Layer Two: Scripts as Guardrails

After the deployment incident, I did the smartest thing I could: I changed the deployment logic from "let Claude piece together commands on its own" to "write a Node.js script, and Claude just calls it."

The old approach was writing "the deployment steps are xxx" in CLAUDE.md, then letting Claude execute based on the description. The problem is that the cwd parameter for git init, the absence of rm -rf on Windows, the error from running git remote add twice—these edge cases are impossible to fully enumerate in markdown.

After the change, the deployment script uses fs.rmSync instead of rm -rf, execSync with the cwd option instead of cd &&, and try/catch to handle duplicate git remote add calls. Claude doesn't need to know these details—it just needs to know node scripts/deploy.js.

LangChain has a formula that puts it well: Agent = Model + Harness. The model is the engine; the Harness is the whole car. No matter how good the engine is, you wouldn't dare hit the road without brakes.

I applied the same thinking to blog validation. I used to manually check whether frontmatter formatting was correct, whether image paths existed. Later I wrote it into the /check-post skill and new-post workflow, turning validation rules into executable scripts. The model can say "I checked," but it can't fake the script's check results.
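As a rough sketch of what an executable validation rule can look like, the function below checks the two things mentioned above: frontmatter and image paths. The required fields (title, date), the function name checkPost, and the simplified frontmatter parsing are assumptions for illustration, not the real /check-post rules.

```javascript
const fs = require("fs");
const path = require("path");

// Assumed required fields; a real blog would define its own list.
const REQUIRED_FIELDS = ["title", "date"];

// `source` is the post's markdown text, `postDir` the directory it lives in.
// Returns a list of error strings; an empty list means the post passed.
function checkPost(source, postDir) {
  const errors = [];

  // The frontmatter block must open the file, fenced by --- lines.
  const match = source.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return ["missing frontmatter block"];

  // Simplified parsing: treat everything before the first colon as the key.
  const keys = match[1].split(/\r?\n/).map((line) => line.split(":")[0].trim());
  for (const field of REQUIRED_FIELDS) {
    if (!keys.includes(field)) errors.push(`frontmatter missing "${field}"`);
  }

  // Every local image reference must resolve to a file on disk.
  for (const [, target] of source.matchAll(/!\[[^\]]*\]\(([^)\s]+)\)/g)) {
    if (/^https?:\/\//.test(target)) continue; // remote images are not checked
    if (!fs.existsSync(path.resolve(postDir, target))) {
      errors.push(`image not found: ${target}`);
    }
  }
  return errors;
}
```

The output is a plain list that either is empty or is not, which is what makes the result hard to talk around.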

Layer Three: Skills as SOPs

This was the last layer I recognized, but it has the highest ROI.

After using Claude Code for a while, you notice that some operations repeat every single day: after writing a new post you need to validate format and style, Zhihu articles need to be scraped for source material, before deploying you need to confirm the build succeeded. Prompting from scratch each time wastes tokens and makes it easy to miss steps.

The essence of a Skill is the automation of an SOP (Standard Operating Procedure). A markdown file tells Claude "when the user invokes this command, do these things in this order." No code to write, no plugins to install—Claude reads markdown and executes.

For example, my /check-post skill validates blog post formatting and writing style. My /fetch-zhihu skill uses Puppeteer to bypass Zhihu's anti-scraping and fetch articles. These skills turned me from "teaching Claude how to do it from scratch every time" to "hit one command and it already knows."

OpenAI's article mentioned the same pattern. They encoded code cleanup standards into "golden principles," letting Codex automatically scan for deviations and file refactoring PRs. From manual cleanup to automated cleanup—that's closing the feedback loop.

More bluntly: the skill layer has the highest ROI of the entire Harness. Boundary layers, memory layers, cognition layers all require actual code (thousands of lines). Only the skill layer is pure markdown—near-zero code in exchange for a whole set of workflow automation.

Looking back after building all three layers, I realized the underlying logic is really just cybernetics.

The Cybernetics View: Closed Loops Are What Matter

Someone offered a particularly insightful perspective: Harness Engineering is cybernetics showing up for the third time.

The first time was the 1780s with Watt's centrifugal governor—when the flyball spun too fast it automatically closed the valve, when it slowed down it opened the valve, closing the feedback loop at the physical level. The second time was the 2010s with Kubernetes—you declare the desired state, the controller continuously monitors the actual state, and automatically corrects any deviation. The third time is now—you write architectural constraints as Linter rules, the Agent validates code on every commit, and rejects anything that doesn't pass.

The common pattern across all three: someone built good enough sensors and actuators to close the feedback loop at that level.

The sensors of Harness Engineering are tests, linters, and observability; the actuators are LLMs. Compilers can detect syntax errors, tests can verify behavior, linters can check style, but these are all low-level loops. At the architectural level there used to be neither sensors nor actuators. That changed when LLMs arrived: they can read code for intent (a sensor) and generate new code (an actuator).
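Stripped to its skeleton, a closed loop of this kind is just "sense, compare, act, repeat." The sketch below is a toy under stated assumptions: closeLoop, sense, and actuate are hypothetical names, and the loop is synchronous for brevity, whereas real linters and agents would be asynchronous.

```javascript
// sense():    returns the current list of violations (the sensor).
// actuate(v): attempts to fix the given violations (the actuator).
// The loop stops as soon as the sensor reports nothing, or after maxRounds fixes.
function closeLoop({ sense, actuate, maxRounds = 3 }) {
  for (let round = 1; round <= maxRounds; round++) {
    const violations = sense();
    if (violations.length === 0) {
      return { ok: true, rounds: round - 1, remaining: [] };
    }
    actuate(violations);
  }
  // Out of budget: measure one last time and report what is still wrong.
  const remaining = sense();
  return { ok: remaining.length === 0, rounds: maxRounds, remaining };
}

// Toy stand-ins: a "codebase" with two violations and a fixer that resolves one per round.
const pending = ["unused import", "layer violation"];
const result = closeLoop({
  sense: () => [...pending],
  actuate: () => pending.shift(),
});
console.log(result); // { ok: true, rounds: 2, remaining: [] }
```

The round budget matters as much as the loop itself: when the actuator cannot converge, the loop ends with an explicit failure report instead of spinning.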

My own experience confirms this. Before the deployment incident, my "sensor" was me watching terminal output, and my "actuator" was me typing commands. The feedback loop was broken—errors happened before I could see them, and by the time I saw them it was already too late. Writing the deployment logic into scripts and adding hooks was like installing a governor between the Agent and production: the script validates before executing the next step, and stops if validation fails. The feedback loop closes at the script level—no need for a human to watch.
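The "validate before the next step, stop on failure" behavior can be sketched in a few lines. This is an illustrative assumption rather than the actual script: runGated and the { name, run, check } step shape are hypothetical.

```javascript
// Each step is { name, run, check }: run() performs the action,
// check() returns true only if the result is safe to build on.
function runGated(steps) {
  const done = [];
  for (const step of steps) {
    step.run();
    if (!step.check()) {
      // Fail closed: later steps (such as a push) never execute.
      throw new Error(
        `Step "${step.name}" failed validation; completed: ${done.join(", ") || "none"}`
      );
    }
    done.push(step.name);
  }
  return done;
}
```

For a deployment, the steps might be a build whose check confirms that site/dist/index.html exists, followed by a push that is unreachable unless that check passed.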

The Stronger the Model, the Less You Should Trust It

The stronger the model, the more it needs an equally strong external system to rein it in, because the stronger it is, the more clever ways it finds to bypass constraints and still deliver results. Your only countermeasure is to turn constraints into physical boundaries rather than verbal agreements. Put differently: rules in a prompt are, at their core, still tokens. They get pushed out by new information and diluted by long contexts. Reminders help, but reminders aren't boundaries.

My own experience bears this out completely. Claude had read all my rules. It knew that deployment shouldn't run git init in the project root. It simply rationalized the decision under pressure: the current approach wasn't working, it needed to try something else, and rules are just tokens competing for attention against other tokens.

So boundaries need to be written into the runtime. You want consistent code style? Writing it in CLAUDE.md helps, but Lint is more reliable. You want changes not to break interfaces? Reminding in a prompt helps, but CI is more reliable. You want the Agent not to mess with files? Relying on "don't delete important files" is fragile—what you should really do is set file permissions so it simply can't.
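File permissions are enforced by the operating system. Inside a script that an Agent calls, a comparable boundary is a path guard, which is a different technique from permissions but follows the same idea. The sketch below is an assumption-laden illustration: assertInside is a hypothetical helper, and it does not resolve symlinks.

```javascript
const path = require("path");

// Resolves `target` against `allowedRoot` and throws if the result escapes it.
// Returns the absolute path when the target is inside the allowed directory.
function assertInside(allowedRoot, target) {
  const root = path.resolve(allowedRoot);
  const resolved = path.resolve(root, target);
  const rel = path.relative(root, resolved);

  // A relative path that climbs upward (or is absolute, e.g. another drive
  // on Windows) means the target lies outside the allowed directory.
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`Refusing to touch path outside ${root}: ${resolved}`);
  }
  return resolved;
}
```

A delete or write helper that calls assertInside("site/dist", file) first cannot be talked into touching the project root, however the request is phrased.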

Documentation is for giving the model direction; the runtime is for setting boundaries.

Will Harness Become Obsolete?

There's an interesting debate: Anthropic's Claude Code team says their philosophy is "the thinnest possible wrapper above the model," while OpenAI built a fairly heavy system. Both are using Agents for production development, but chose completely different paths.

I tend to think Harness will persist long-term, though its form will evolve. The reasoning is simple: validation and constraints are needs independent of tool capability. Compilers have gotten more powerful, and we haven't gotten rid of CI/CD. Testing frameworks have gotten better, and we haven't deleted our tests. Your architectural standards don't become unnecessary just because models get stronger; your deployment safety checks don't become skippable just because Agents write better code.

Someone raised a sharp question: do those million lines of code actually work correctly? That's indeed the key—the value of Harness isn't just making Agents write faster, it's making Agents write correctly. IBM Research has a data point: pure LLM code review catches only about 45% of bugs; combined LLM and deterministic tools jump to 94%. Model capability is only half the equation; the other half comes from the deterministic toolchain within the Harness.

Back to My Project

Looking back after writing this, my three-layer Harness really only does one thing: pull out everything that doesn't need to be in the prompt, and let deterministic systems handle it.

Validation rules were pulled out into skill scripts, deployment logic was pulled out into a Node.js program, repetitive workflows were pulled out into slash commands. Claude's context is left with only the information that genuinely requires its judgment, while rule execution is handed off to scripts and tools it can't see.

After three months I can distill it into one line: Documentation writes direction, scripts write boundaries. Direction is negotiable; boundaries are not.

If you're also transitioning from vibe coding to real projects, I'd suggest starting with two things: write your most dangerous operations (deployment, deletion, resets) as scripts instead of letting the Agent improvise; and crystallize your most repetitive workflows (validation, commits, handoffs) into skills instead of re-prompting from scratch each time. The investment for these two things is small, but enough to take you from "watching the Agent work with your heart in your throat" to "letting it run with confidence."

Not learning better prompts. Building a system that makes prompts matter less.