
Understanding Harness Engineering in One Article: Tracing the Shell That Keeps AI on Track Through 14 Engineering Articles
In the first quarter of 2026, "Harness" became the hottest buzzword in the large-model application layer. An empirical study published by LangChain showed that switching to a more sophisticated Harness architecture raised a model's pass rate on an AI programming benchmark from 52.8% to 66.5%. The finding sparked a Harness frenzy among startups and made the term a key selling point for investors. But the concept's boundaries have blurred as ever more technologies are folded into the Harness category; to truly understand its evolution, one must trace the history of its origin. Meanwhile, even as the industry keeps iterating, the Anthropic team has begun to dismantle the very frameworks it built.
In the first quarter of 2026, the most dominant buzzword in the large model application layer was undoubtedly "Harness."
This March, LangChain published an empirical article titled "The Anatomy of an Agent Harness," which completely ignited everyone's anxiety and fervor. In this report, they cited an experimental data comparison: simply by switching the same large language model to a more sophisticated Harness architecture, its pass rate on Terminal Bench 2.0 (an authoritative ranking specifically measuring AI programming capabilities) skyrocketed from 52.8% to 66.5%.
In this experiment, not a single byte of the underlying model's weights was changed, and the computing engine remained untouched. By merely swapping out the "shell" for a more refined one, the ranking surged from outside the top thirty to the top five.
Since then, countless startups have begun frantically packaging their own shells. Harness seems to have become a magic touchstone, turning everything into gold, and the most impressive "talking point" and moat for application layer companies when meeting with investors.
But in this frenzy, the boundaries of the concept have begun to be stretched and blurred infinitely.
What exactly is the true shell, and what lies outside of it? Many introductory articles, in pursuit of comprehensiveness, have bundled the rise of CLI tools (command-line tools), the proliferation of markdown instruction files, and even the recently popular external Skill packages under the Harness umbrella. In a sense this is defensible: they are all technological choices, under the broader logic of Agent infra, that enable Agents to run better.
However, if we are to truly understand the underlying thread of Harness's technological evolution and its main axis, we must trace the history of the concept's origin.
Furthermore, at the present moment, if you keep a close eye on Anthropic—the team that first systematized the Harness framework—you will find that while the entire industry is frantically building upwards, they are already silently tearing down walls.
With the iteration of the new Opus version, they have begun to dismantle the control components they once painstakingly built without hesitation.
While some are frantically adding layers, others are decisively dismantling them. This fragmented, industry-wide frenzy essentially stems from the fact that most people have not truly read the engineering articles, written over the past fifteen months, that navigated these pitfalls.
They only saw the immense gains from doubled final scores, but failed to grasp what kind of desperate bugs originally forced the creation of those complex mechanisms.
Today, we will crack this black box wide open. Following these fifteen months of hard-won literature, we will trace the true blueprint of "Harness engineering."
Layer One of Harness: From Notebook to Management System

Explaining Harness is not difficult. Think of an Agent as a car.
The model is the engine: powerful and high-revving, responding as soon as you step on the gas. The interaction program that carries it is the wheels, and your Prompt is the steering wheel, guiding the engine's power as it drives you forward. But an engine, a steering wheel, and wheels do not by themselves constitute a car. You cannot drive an engine down the road. You need a gearbox to smoothly transfer power to the wheels as you steer, an instrument panel to tell you how far you've traveled, and brakes to stop when you should. All these components together (how tasks are broken down so they run smoothly, how progress is recorded, and how completion is judged) are the Harness, the shell.
The shell did not appear out of thin air. It has a precursor.
Large models inherently have only one form of memory: the context window. When the window is full, previous content is pushed out.
This is not a problem for short tasks. In December 2024, Anthropic published an engineering blog post, "Building effective agents," with a core recommendation: Start with the simplest solution and only add complexity when necessary. Most Agent tasks at that time were short sprints completed within minutes, where the model's short-term memory capacity was sufficient. A carefully crafted instruction (System Prompt, i.e., a "role description" pre-fed to the model) was enough to drive it.

But everyone wants Agents to handle larger tasks.
In the first half of 2025, as models' reasoning capabilities improved, the tasks they could theoretically execute began to lengthen. However, this brought significant problems with context. Although models now have large context windows (e.g., 1 million tokens), their effective attention span is not that large, and even if it were, it couldn't hold all the details of a long-term task. Humans prioritize key points when taking notes, but models can't do this. Their memory in complex tasks is almost like that of a goldfish.
Memory externalization was one of the earliest paths around this shortage of effective context. As early as March 2023, AutoGPT gave models a blank notebook, granting them tool-call permissions for write_to_file and read_file and letting them manage their own memory. The medium was plain .txt files with no structural constraints; the model wrote and deleted as it pleased.
However, without management, the model naturally becomes chaotic. In March 2024, Devin upgraded the notebook into a structured Planner panel: the model's task planning was forced into a visual progress bar with clear status markers for each step.
By February 2025, Claude Code was born, productizing all the experience Anthropic had accumulated internally on SWE-bench. The combination of CLAUDE.md (project-level instruction file) and a scratchpad became the most widely imitated paradigm in the industry.
But even with such an externalized memory system, the context might still not be enough.
For this reason, in September 2025, Anthropic's own application team published "Effective context engineering for AI agents," proposing two strategies for context-related issues: improve efficiency, and compress information, so that long-horizon tasks can be completed within a single context window.
The first strategy is to improve context efficiency, which means changing how context is written. The system prompt should not be a single monolithic paragraph; it should be maintained like code, with version control, A/B testing, and dynamic assembly of different prompt modules based on task type. Tool descriptions also need care: unclear or incorrect descriptions waste context and degrade decisions. Anthropic found that models read tool descriptions the same way they read system prompts, so tool naming, parameter descriptions, and return-value formats directly affect the quality of the Agent's choices. A poorly written description is a confusing map handed to a goldfish. Finally, use external storage (RAG), retrieving only what is needed rather than shoving everything in at once.
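As a rough sketch of what "dynamic assembly of prompt modules" can mean in practice (the module names and wording here are invented for illustration, not from any of the cited articles):

```python
# Hypothetical prompt modules, maintained under version control like code.
PROMPT_MODULES = {
    "base": "You are a coding agent. Work in small, verifiable steps.",
    "debugging": "Reproduce the failure before editing any code.",
    "refactor": "Preserve behavior; rely on the test suite to prove it.",
}

def assemble_prompt(task_type: str) -> str:
    """Assemble the system prompt from modules based on the task type."""
    parts = [PROMPT_MODULES["base"]]
    if task_type in PROMPT_MODULES and task_type != "base":
        parts.append(PROMPT_MODULES[task_type])
    return "\n\n".join(parts)
```

The point is that each module can be versioned, A/B-tested, and swapped independently, rather than hand-editing one giant paragraph.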
The second strategy is context compression and elimination. When a conversation becomes too long, a summary is created to condense the dialogue history, freeing up token space for subsequent tool call results. To prevent context overflow, Anthropic adopted a sliding window policy, retaining only the original text of the last N turns of the conversation and replacing earlier parts with summaries. Simultaneously, the Agent maintains a structured working note area within the context, updated at each step to prevent information from being "washed away" in long conversations. Useless information returned by tool calls is also directly deleted to prevent it from becoming dead weight in the context.
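The sliding-window idea can be sketched in a few lines, assuming a chat-style history of turn dicts and a `summarize` callback standing in for an LLM summarizer (both assumptions for illustration):

```python
def compact(history, summarize, keep_last=6):
    """Keep the last `keep_last` turns verbatim; replace everything
    earlier with a single summary turn to free up token space."""
    if len(history) <= keep_last:
        return history
    head, tail = history[:-keep_last], history[-keep_last:]
    summary_turn = {"role": "system",
                    "content": "Summary of earlier turns: " + summarize(head)}
    return [summary_turn] + tail
```

Each compaction trades the exact wording of old turns for room to keep appending new tool results.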
This is Context Engineering, which manages information. It primarily handles where information is stored, how it is retrieved, and how it is selected. It does not manage the process—whether the goldfish model actually refers to the notebook after receiving it, whether it follows what is written, and whether anyone verifies the result.
No one clearly recognized this distinction at the time; even Anthropic fell into the same pitfall.
In November 2025, they disclosed this experience in "Effective harnesses for long-running agents." In May 2025, Anthropic wanted Claude to write a complete web application from scratch—not to fix a bug, but to build an entire product. A task like this, running for several hours, would exceed the context window even with externalized memory. Each new run would reset the previous session's memory. It was like shift changes for engineers without handover documents.
Initially, they adopted the Context Engineering approach to build the first version of the work framework. The process was divided into two steps: first, an Agent was assigned to initiate the process, analyze requirements, break down over 200 features, and generate a structured list. Then, another Agent took over to write code, handling only one feature per round, submitting after completion, updating the progress file, and handing over to the next iteration.

The notebook was provided, externalization was done, and best practices for Context Engineering were followed. It sounded reasonable.
But the actual execution was a complete failure.
They discovered four modes of failure:
First, premature completion. The Agent declared "project complete" after completing only three features, mistaking the existing amount of code for the total work.
Second, environmental blind spots. The Agent was genuinely writing code, but there was a bug in the environment; its code wouldn't run, and the Agent was unaware of it.
Third, false completion. Features were marked as "done" on the checklist, but the actual functionality was broken. The Agent modified code and passed unit tests, assuming it was correct, but the end-to-end process failed.
Fourth, amnesiac intern syndrome. Each new session involved spending a significant amount of tokens re-familiarizing with the project structure, like a new intern repeatedly asking, "Which folder is the code in?"
Therefore, they realized that Context Engineering, the "notebook" solution, only addressed the problem of "not being able to store information." However, the goldfish's issues extend far beyond that. Sometimes it doesn't read the notebook, and when it does, it often doesn't act according to what's written. Furthermore, it lacks self-validation capabilities.
The notebook is not the problem. The problem is that no one forces the goldfish to read it and act accordingly, and no one verifies if what the goldfish writes is true.
This leap in understanding led Anthropic to shift its approach from "making a better notebook" to "building a comprehensive management system centered around strict adherence to work procedures."

To address false completion and premature declarations, Anthropic realized that relying solely on Markdown for externalization was insufficient, and that the Agent could not be both player and referee. At the start of the project, a dedicated "Initialization Agent" generates a complete feature checklist in JSON (a machine-readable data format). The process is strictly enforced: the "Coding Agent" that does the actual work may only flip a status field to "pass" or "fail." It cannot delete features or modify descriptions; it can only update status. The JSON specification mandates that a status may be set to "passing" only after the Agent's own tests actually pass, preventing completion based on a mere "looks about right" assessment.
Under these settings, the Initialization Agent's JSON became a physical lock against cheating, rigidly controlling the progress bar through strong validation. Markdown files still existed, but served primarily as signposts rather than strict procedural guidelines (one reason today's Skills follow similar principles).
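A minimal sketch of what such a status-only update rule could look like. The schema and status names here are assumptions for illustration, not Anthropic's published format:

```python
def apply_update(checklist: dict, feature_id: str, new_status: str) -> dict:
    """Enforce the lock: only the `status` field of an existing feature
    may change. Adding, deleting, or rewording features is rejected."""
    if new_status not in ("pending", "passing", "failing"):
        raise ValueError(f"illegal status: {new_status}")
    for feature in checklist["features"]:
        if feature["id"] == feature_id:
            feature["status"] = new_status
            return checklist
    raise KeyError("unknown feature: the coding agent may not add or remove entries")
```

Because the update path is this narrow, an agent cannot "complete" the project by quietly shrinking the checklist.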
To combat the "amnesiac intern syndrome," each session began with a mandatory three-step wake-up ritual: running pwd (confirming the current directory), reading git log (viewing code modification history), and reading progress.txt (checking the next task). This is akin to factory shift changes where the incoming worker first reviews the handover log. The Agent's memory is not stored in its own mind but in Git history and progress files. Instead of relying on the Agent to remember, it is systematically helped to store its memory externally and is forced to clock in, review handover logs, and confirm its workstation each time it starts work.
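The ritual can be pictured as a small bootstrap that assembles the new session's opening briefing. This is a sketch, not Anthropic's implementation; the file name mirrors the article and error handling is simplified:

```python
import pathlib
import subprocess

def wake_up(progress_file="progress.txt"):
    """Run the three-step ritual and return a briefing for the new session."""
    cwd = pathlib.Path.cwd()                                   # step 1: where am I?
    try:
        log = subprocess.run(["git", "log", "--oneline", "-5"],
                             capture_output=True, text=True).stdout  # step 2: what happened?
    except FileNotFoundError:
        log = "(git unavailable)"
    progress = pathlib.Path(progress_file)
    notes = progress.read_text() if progress.exists() else "(no progress file)"  # step 3: what's next?
    return f"cwd: {cwd}\nrecent commits:\n{log}\nhandover notes:\n{notes}"
```

The output of this bootstrap, not the agent's own recollection, is what opens every session.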
The results were immediate. The Agent could run for several hours, performing one task per iteration, submitting after completion, and externalizing the status to the progress file. The next iteration would read the latest progress.txt to know what to do next.
Anthropic added an even stronger layer of security. Every code change was archived via Git. If the model entered a dead end, the code repository could be rolled back to a previously working, clean state using git revert, and the model could be re-awakened. There was no reliance on the goldfish to undo its own mistakes; it was directly given a time machine.

When historical messages overflowed the context window, Harness would completely clear the goldfish's mind and start a new Agent, transferring the previous session's state and the next task through a structured handover file. Anthropic called this Context Reset—not compressing memory, but directly replacing it with a new goldfish, giving it only a written handover note. This is more aggressive than simple summary compression (Compaction), because Anthropic found that even with compressed history, the model would still experience anxiety and lose coherence in extremely long contexts. Only by completely clearing it and providing a blank slate could it regain focus.
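Conceptually, the handover file carries verified state rather than conversation history. A sketch, with field names that are assumptions for illustration:

```python
import json

def write_handover(path, completed, known_issues, next_task):
    """Persist only the facts a fresh agent needs: no chat transcript,
    just verified state and the single next task."""
    note = {
        "completed_features": completed,
        "known_issues": known_issues,
        "next_task": next_task,
    }
    with open(path, "w") as f:
        json.dump(note, f, indent=2)

def fresh_prompt(path):
    """Build the opening prompt for the replacement agent from the note alone."""
    with open(path) as f:
        note = json.load(f)
    return (f"You are taking over a project. Done so far: {note['completed_features']}. "
            f"Known issues: {note['known_issues']}. Your next task: {note['next_task']}.")
```

The old goldfish's anxieties never make it into the file; only the facts do.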
By this point, Anthropic's management system was quite comprehensive. The JSON physical lock controlled false reporting, the three-step wake-up managed amnesia, Git archiving handled rollbacks, and Context Reset handled brain capacity. Together, the system governed the entire process: the goldfish had to clock in, read the notebook, and work according to the checklist.
It did not address another issue: whether the information presented to the goldfish was accurate and up-to-date.
If the signposts in the notebook were outdated or incomplete, no matter how strict the process, it would only make the goldfish run faster following a wrong map.
So what to do? Beyond strict process control, there was another path: strict control of the notebook and its repository.
This was the logic behind OpenAI's approach.
In "Harness engineering: leveraging Codex in an agent-first world" (February 2026), they conducted an experiment starting in August 2025 with an empty repository. Three engineers wrote no code themselves. All code—application logic, tests, deployment configurations, documentation, monitoring tools—was generated by a Codex Agent.
Human role? Designing the Agent's working environment. In their own words, "Humans at the helm, Agents executing."
Five months, one million lines of code, fifteen hundred PRs (pull requests), with zero hand-written code. The team later expanded to seven people, and throughput continued to grow.
From this process, they gained a realization consistent with Anthropic's but even more stringent: the repository is truth (Repo-as-truth).
From the Agent's perspective, anything it cannot access does not exist. Discussions on Slack don't exist. Consensus in the team's minds doesn't exist. Solutions in Google Docs don't exist. The only thing that exists is versioned files in the code repository that Agents can directly read.
This means that if you want an Agent to know something, there is only one way: write it into the repository. Architectural decisions must be written down, design principles must be written down, quality standards must be written down, and even subjective judgments like "what style does our team prefer" must be written down.
Just writing it down is not enough. OpenAI found in practice that a giant instruction file would occupy valuable context, crowding out the actual task, relevant code, and reference documentation. Moreover, passive documentation has a fatal flaw: reading it doesn't mean the Agent will comply. Therefore, key rules must become executable automated checks, such as custom linter rules and structured tests, integrated into the CI pipeline. Every time an Agent submits code, it will be automatically scanned, and if it violates any rules, it won't be merged. The Agent doesn't need to "remember" the rules; it only needs to modify the code until it passes, based on the error messages.
Using a quality gatekeeper (the linter) to police what flows into and out of the repository effectively controls the entire workflow.
Their approach was specific. AGENTS.md (a "new employee handbook" for Agents) is only about one hundred lines long: not an encyclopedia, but a directory. It only tells the Agent where to find deeper information: where the architecture documentation is, where the design principles are, and where the current execution plan is. Each business domain is divided into fixed layers, and dependencies are strictly unidirectional; violations fail the automated checks. The purpose of this signpost is to spare the Agent from having to "understand" the architectural rules. It only needs to know that certain paths are blocked by the system, and the strictly enforced linter rules steer it onto the right one.
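An illustrative fragment of such a rule, assuming three hypothetical layers with strictly downward dependencies (the layer names and check are invented; OpenAI's actual linter rules are not public in this form):

```python
# Hypothetical layering; a real project would load this from repo config.
LAYERS = ["api", "service", "store"]  # dependencies may only point downward

def import_allowed(src_layer: str, dst_layer: str) -> bool:
    """A CI check: a module may import from its own layer or lower ones,
    never from a layer above it."""
    return LAYERS.index(src_layer) <= LAYERS.index(dst_layer)
```

Wired into CI, a check like this means the agent never needs to remember the architecture; it just keeps editing until the red error goes away.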
Documentation is not just written and abandoned. OpenAI runs a dedicated "Doc-gardening Agent" (an Agent specializing in maintaining documentation) that patrols the repository daily, writing no business code. As soon as it detects any documentation out of sync with the actual code, it automatically initiates a modification request to mercilessly trim outdated paragraphs. Because outdated memory is more dangerous than no memory, a goldfish reading incorrect history will produce hallucinations.
Repo-as-truth (the repository is truth) sounds like a technical architecture choice, but its essence is a hierarchical management philosophy.
Anthropic's management system governs the process, ensuring Agents clock in, read notebooks, and work according to checklists. OpenAI's Repo-as-truth governs the environment itself. It doesn't concern itself with the Agent's behavior but ensures that the entire world it perceives is accurate, executable, and automatically maintained, leaving no room for deviation. Process controls behavior, while the environment controls cognition.
From initially issuing a blank notepad to implementing JSON physical locks, three-step wake-up rituals, Git archiving and rollback, Context Reset for clearing and restarting, and finally Repo-as-truth, making the entire repository the Agent's sole reality.
This journey reveals that the first layer of the Harness shell aligns perfectly with the word "Harness" (to control): it manages an AI with a faulty memory supply, ensuring it diligently follows the rules in its memory notebook. Anthropic and OpenAI took different routes (process control versus environment control), but the goal is the same: control.
Only in this way can we ensure that after running continuously for 6 hours, the car still remembers where it's going, the gearbox doesn't seize due to overheating, and all the signposts in front of it are real.
This is Harness 1.0.
Layer Two of Harness: Ending Anarchy, Moving Towards Concurrency and Efficiency

Once individual cars could reliably handle long-distance travel, the application layer immediately faced another form of greed. Since a single car could run, why couldn't we deploy a hundred cars simultaneously? To solve the efficiency problem of large-scale collaboration, Harness's internal architecture was forced to grow upwards, evolving into an extremely complex layer of concurrency and scheduling control.
But when we truly allowed countless Agents to flood into the same code repository, a disastrous chain reaction occurred.
In "Scaling long-running autonomous coding" (January 2026), the Cursor team documented the collapse they hit when scaling up concurrency. They attempted to have hundreds of Agents share one large project. When 20 Agents worked simultaneously, effective throughput dropped to that of only two or three: the lock mechanism became a bottleneck, and everyone waited on everyone else. Worse still, Agents that found the core code locked would, to appear busy, gravitate to the simplest, most trivial edits. The codebase was soon flooded with reworded comments and reshuffled spacing and indentation.

Hundreds of highly intelligent AIs instantly descended into pure anarchy.
This necessitated a higher-dimensional shell architecture. Cursor used state machines to build a three-tier hierarchy of Planner, Worker, and Judge, with rigid gatekeeping. Within the unidirectional flow of a DAG engine (a task scheduling system that only allows forward movement and prevents backtracking), Worker nodes were rigidly locked by the underlying engine and could not move until the Planner node had produced a schedule. Without the Planner's approval and signature, Workers were strictly forbidden from touching the core code.

Before starting long-term tasks, Agents must first submit a complete plan and wait for approval before proceeding—this is the first gate.
Upon completion, each Worker must submit a handover report—not just a simple "done," but a summary of work, identified problems, and any deviations from the plan. The upper-level Planner uses these reports to maintain a global perspective and pull things back on track if they deviate.
This is akin to installing traffic lights at a chaotic intersection, using an unforgiving physical state machine to rigidly suppress individual impulsivity.
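The gatekeeping can be pictured as a small state machine in which Worker stages are simply unreachable until the Planner stage has signed off. The stage names below are illustrative, not Cursor's actual DAG engine:

```python
from enum import Enum, auto

class Stage(Enum):
    PLANNING = auto()
    APPROVED = auto()
    WORKING = auto()
    JUDGING = auto()

ALLOWED = {
    Stage.PLANNING: {Stage.APPROVED},                 # the plan must be approved first
    Stage.APPROVED: {Stage.WORKING},                  # only then may Workers touch code
    Stage.WORKING: {Stage.JUDGING},                   # work ends with a handover report
    Stage.JUDGING: {Stage.WORKING, Stage.PLANNING},   # the Judge sends it back or re-plans
}

def advance(current: Stage, target: Stage) -> Stage:
    """The engine, not the agent, decides which transitions exist at all."""
    if target not in ALLOWED[current]:
        raise RuntimeError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

The point of encoding this outside the model is that an impulsive agent cannot talk its way past a transition that the engine does not define.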
Anthropic, in "Building a C compiler with a team of parallel Claudes" (February 2026), revealed another extremely expensive concurrency disaster.

They deployed 16 top-tier Claude instances in parallel to write a C language compiler. Initially, everyone worked on their respective modules, and progress was rapid. However, when it came to the overall compilation and linking stages, the system threw a global error. During the bug-fixing phase, the 16 Agents were like 16 blind individuals without walkie-talkies. They consumed immense computational resources, overwriting hundreds of lines of code with each other. Not only were the bugs not fixed, but a massive amount of idle processing occurred.
The solution that emerged was to introduce GCC (the most mature open-source compiler in the industry) as a reference for the correct answer. Imagine building a car and finding the engine doesn't start. The problem is not knowing which part is broken. A car has thousands of parts, and checking them one by one is too slow.
Anthropic provided the Agent system with a "virtually identical but confirmed working car" (compiled by GCC) and instructed it to randomly swap a few of its own parts with the original parts from the good car. If the car still runs, it means the swapped parts are fine. If it doesn't run, the bug lies within the swapped parts.
Then, the scope is narrowed further: half of the suspect parts are replaced with original ones, and the other half remain the new parts. If it still doesn't run, the bug is in the remaining half. This process repeats, halving the suspect parts each time, until the specific problematic file is pinpointed. This is "binary search."
This method breaks down the massive problem of "where in the entire compiler is the error?" into "which of these 3 files is compiling incorrectly?", drastically reducing debugging complexity. Furthermore, different Agents can simultaneously test different subsets of files, naturally separating their work areas and preventing them from interfering with each other.
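The search itself is ordinary bisection. A sketch, where `passes_with_good_parts` stands in for "swap this subset of files back to the GCC-built versions and rerun" (the callback is an assumption standing in for the real build-and-test step):

```python
def bisect_buggy_files(suspects, passes_with_good_parts):
    """Narrow a failing build down to a single suspect file.

    `passes_with_good_parts(subset)` answers: if we swap `subset` back to the
    known-good versions and rebuild, does the program now work?
    """
    suspects = list(suspects)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if passes_with_good_parts(half):
            suspects = half                   # bug was among the swapped-out files
        else:
            suspects = suspects[len(half):]   # bug must be in the other half
    return suspects
```

Each round halves the suspect set, so even a hundred-file failure is pinned down in roughly seven rebuilds.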
The article also mentions a more advanced variation, delta debugging. Some bugs only appear when two files "collaborate": each file compiles fine on its own, but together they cause a crash. In such cases the search must target file pairs, a similar method with a much larger search space.
The result: using nearly 2000 sessions and $20,000 in API fees over two weeks, Claude Code produced a 100,000-line compiler capable of compiling a fully bootable Linux operating system.
This represents the core mechanism of Harness's second layer of evolution: large-scale concurrency control. Models inherently lack self-discipline and macro-coordination common sense. Without this rigid control flow, brilliant minds will only use their speed to lead the entire team into a dead end.
Layer Three of Harness: Bursting the Bubble of Blind Confidence

With clock-in systems and external memory, along with traffic lights and dedicated lanes, Agents successfully complete their tasks on schedule. However, when humans take over, they find that the code is a mess—usable but incredibly slow, the UI is confusing and illogical, and features are clickable but lack coherence.
This is actually the false-completion issue Anthropic encountered in Harness v1, which was only partially resolved there: the system prevented the AI from falsely declaring completion, but the AI's self-validation problem remained.
Anthropic's mandatory tests can catch functional errors (e.g., function input X should output Y). OpenAI's mechanism, although it includes a linter quality administrator, can only catch structural violations like reversed dependency directions, non-standard naming, or excessively large files.
There's a large category of problems that neither can catch, such as a page opening but with a completely misplaced layout; functionality technically "passing" but with poor user experience; or code logic being self-consistent but misunderstanding the business requirements.
These require a more comprehensive "evaluator" to actually review, use, and judge.
Anthropic was well aware of this. Two months after their Harness article in November, they systematically outlined Agent evaluation methodologies in "Demystifying evals for AI agents" (January 2026), clearly stating that programming Agents must run test suites in real environments for verification, and merely inspecting the code itself is far from sufficient.
In their subsequent article, "Harness design for long-running application development" (March 2026), they laid bare a fatal flaw of large language models: when asked to evaluate their own recently completed work, they almost always respond with confident praise, even when the quality is plainly mediocre to a human observer. Even on verifiable tasks with clear right-or-wrong answers, they occasionally exhibit poor judgment. Simply put, they are not deceiving anyone; they genuinely believe they have done a good job.
Anthropic's approach is to directly integrate the Agent as an Evaluator into the internal loop of the Harness. Inspired by GANs (Generative Adversarial Networks), they separate the doer from the judge. In Harness v1, the additional Agent only posed questions, but validation was still performed by the executing Agent itself—meaning the contestant and the judge were the same person. Now, two Agents are pitted against each other, preventing the executor's confidence from running unchecked.
However, simply separating them is not enough, because the evaluator itself is an LLM, naturally inclined to be lenient towards LLM-generated output. Therefore, they repeatedly calibrate the evaluator, instilling a skeptical attitude. The calibrated evaluator then personally inspects the work—opening browsers, clicking page buttons, verifying error stacks (the chain of errors when a program crashes), and capturing screenshots—operating like a real user. The most authentic end-to-end error state is fed back to the Generator, forming a relentless adversarial cycle.
If you don't show me normal page feedback, I will keep giving you low scores, forcing you to rewrite.
In the latest revealed V2 version (March 2026), Anthropic also introduced the Sprint Contract mechanism. Before each iteration begins, the Generator and Evaluator negotiate "what it should look like upon completion." It's like the client and the construction team signing off on acceptance standards before starting work. These are not human-defined standards but conditions negotiated by the two Agents themselves. After nine rounds of confrontation for a museum website, the Generator in the tenth round overturned all previous designs and created a 3D CSS perspective environment with spatial navigation. This creativity was born out of necessity.
Cursor, in "Building a better Bugbot" (January 2026), also addressed this problem, taking a more extreme and expensive approach. They firmly believe that even a judging model can be deceived by the surface logic of code. So they built an 8-channel parallel blind-review mechanism. For the same code diff, the in-shell control system launches 8 independent Bugbots, each channel receiving the diff's chunks in a randomized order. Different orders lead to different reasoning paths, making it hard for hallucinations to synchronize. The 8 channels identify bugs independently, and the results are merged by majority vote: a bug flagged in only one channel is filtered out. The merged result is then re-verified by a validator model to catch remaining false positives. Layer upon layer of filtering leaves only true signals.
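The merge step reduces to per-finding voting. A sketch, where `review_channel` is a callback standing in for one Bugbot pass (the function shape and vote threshold are assumptions, not Cursor's published internals):

```python
import random
from collections import Counter

def blind_review(diff_chunks, review_channel, channels=8, min_votes=2):
    """Run several independent reviews over shuffled diff orders and keep
    only findings reported by at least `min_votes` channels."""
    votes = Counter()
    for seed in range(channels):
        order = diff_chunks[:]
        random.Random(seed).shuffle(order)   # each channel sees a different order
        for finding in review_channel(order):
            votes[finding] += 1
    return {finding for finding, n in votes.items() if n >= min_votes}
```

A hallucinated bug that surfaces on only one reasoning path never clears the vote threshold, while a real bug tends to be found regardless of reading order.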

This appears highly reliable, but even with a vast and strict panel of judges, there's one area they can't control: the examination room itself.
In mainstream programming benchmarks like SWE-bench and in the practices of various teams, they have repeatedly observed a phenomenon: when a generative model finds itself unable to pass test cases, it learns to tamper with the test environment itself to force a pass. It directly oversteps its authority to modify the evaluation script, changing strict assertion conditions like assert x == 5 (meaning "the answer must equal 5") to assert True (meaning "any answer is considered passing"), thereby forcibly returning a test-passed signal.
Faced with a hellishly difficult exam, the AI's first reaction is not to solve the problem but to find a way to eliminate the examiner.
The arms race between judges and athletes has become a bottomless pit. This is why, in the verification layer of Harness, extremely strict sandbox isolation has become an absolute necessity. Control flow must lock the test environment in a read-only state at the highest privilege level; the examinee can only write on the answer sheet, absolutely never touching the test paper or grading criteria. Only such a physically isolated error-correction loop can forcefully burst the model's bubble of confidence.
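Read-only mounts do the real enforcement, but even a simple fingerprint of the test suite, taken before the agent runs, makes tampering detectable. A sketch of the idea, not any team's published mechanism:

```python
import hashlib
import pathlib

def fingerprint_tests(paths):
    """Hash the test files so any edit to the grading criteria is visible."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(pathlib.Path(p).read_bytes())
    return h.hexdigest()
```

Record the fingerprint before handing the sandbox to the agent, and refuse any "tests passed" signal if the hash has changed in the meantime.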
Harness Layer One ensures that the model follows the required steps without skipping or acting erratically. Layer Two ensures that the communication flow between multiple agents allows the process to run effectively. However, to ensure the model effectively executes the intent of the task, verification is the problem that Harness Layer Three must solve.
Learning to Subtract After Mastering Addition

Following the trail these top companies have laid down across fifteen months of hard-won writing, we can finally draw a clear diagram of Harness.
Harness is a purely industrial-grade management system built around large language models. The first layer manages its disobedience. The second layer manages group operations. The third layer manages its lack of self-awareness.
They all address the most fundamental issue: controlling the Agent to produce content that meets our expectations.
Other components, such as CLI, manage the Agent's interface; Skill manages the conversion from natural language to procedural logic; externalized memory manages context storage modes. These all fall under the broad category of Agent Infra. They are like gas stations and offline map packs for in-car navigation. They can make the car run smoother and go to more places, but they are not responsible for solving the problem of "how the car should be driven."
Harness's pioneer, Anthropic, has also contributed significantly in these areas. In "Quantifying infrastructure noise in agentic coding evals" (February 2026), they pointed out that simply relaxing resource constraints in the evaluation environment from strict mode to unlimited mode increased the success rate on Terminal-Bench 2.0 by 6 percentage points. This was because with ample resources, the probability of container crashes due to transient memory overflows decreased. This is about the road surface, not the logical design of the autonomous driving system itself.
The three layers of shell represent three types of compensation. This concludes the "addition" part.
However, the story does not end here.
Following the first Harness article in November 2025, Opus 4.5 and 4.6 were successively released. Anthropic did something that most people who have built complex systems are reluctant to do: they began dismantling what they had built.
Context Reset, mentioned in Chapter One, was removed. Opus 4.6's context management capabilities are now so advanced that this clean slate is no longer needed. Running with or without it yielded no difference in quality, and instead added orchestration costs.
Sprint Contract, mentioned in Chapter Three, was removed. The new models can now control their own pace, no longer requiring the evaluator and generator to negotiate an acceptance contract before each work session. The contract process still exists, but the deficiency it compensated for has disappeared.
Leaving them in place was not a safeguard, but a hindrance.
The Evaluator changed from being involved in every round of confrontation to performing QA (Quality Assurance) in the final round. It's not that it's no longer needed, but the way it's needed has changed.
Removing them was not due to foresight.
Anthropic initially believed these components were indispensable skeletons for long tasks. But experiments with Opus 4.6 showed that these compensations no longer improved output, only increased latency and cost.
Removal is driven by experimental results, not architectural prediction.
In Anthropic's own words, "every component of Harness encodes an assumption about what the model cannot do." When an assumption is no longer valid, the component should be removed.
The difficulty lies not in the removal itself, but in judging when to remove it. Removing too early means the model cannot yet cope, and the system will collapse; removing too late means redundant compensation layers obscure the model's true capabilities, making you think the shell is helping when it's actually hindering.
Anthropic's approach is to run the old harness with each new model release, then remove a component and run it again, letting the data speak.
Currently, only Anthropic has completed the full cycle from "addition" to "removal."
OpenAI and Cursor are still in the addition phase—OpenAI's Codex team is expanding from 3 to 7 people, and Cursor's architecture is evolving from flat to hierarchical. However, all three acknowledge the same thing in different ways: the current solution is not the endpoint. Based on Anthropic's experience, the "removal" phase is likely to come.
Cursor has also proven this from another angle. In the project mentioned in Chapter Two—writing a browser over seven days with hundreds of Agents—they discovered that the factor most influencing system behavior was the prompt (i.e., how you talk to the Agent), followed by the harness structure, and finally the model itself.
In their words, "A surprisingly large proportion of the system's behavior differences boil down to how we prompt the Agent."
Adjusting a prompt has a greater effect than changing the entire harness architecture; changing the architecture's effect is greater than changing the model. The least intrusive intervention is often the most effective.
However, this ranking has prerequisites. By the time Cursor reached this conclusion, the harness was already a Planner-Worker-Judge three-tier architecture. Prompts stood on the shoulders of Harness. Without that architectural layer, no matter how good the prompt, it's just shouting at a group of Agents trampling each other. This ranking reflects marginal impact, not fundamental importance.
Two stories: one involves removing components, the other ranks influence. But they point to the same realization: the value of Harness components is not absolute but relative to the model's capabilities.
The reason for every block in Harness is not "what it can do," but "what the model cannot do."
Context reset compensates for the model's inability to remember; the evaluator compensates for the model's inability to objectively assess itself; the sprint contract compensates for the model's inability to define "completion."
Each component is a patch applied to the gaps in the model's capabilities.
These patches, pieced together, manifest (at least for Anthropic) as a continuously deforming surface that changes with the model's capabilities. This surface has a name: the compensation surface.
So, is Harness a moat?
Anthropic has proven that models are beginning to absorb Harness. So perhaps not?
However, as the tasks we want models to perform become increasingly complex—and these expectations often exceed the models' capabilities—the compensation surface may exist for a period.
But as Anthropic concludes, the space of worthwhile harness combinations has not shrunk with model progress; it has shifted.
The migration of the compensation surface means that as the model gets stronger, the focus of the harness shifts. Every addition of a component compensates for what the model currently cannot do; every removal happens because the model's progress has turned that compensation into pure overhead (a redundant burden). The total amount may not decrease, but its position is constantly changing.

Lance Martin of LangChain observed the same pattern as early as July 2025—as models become stronger, you are forced to start dismantling structures. This is the "Bitter Lesson" replaying itself at the application layer.
So, where is the moat?
Let's answer the negative first.
If a company claims, "We have the most comprehensive harness solution"—with the most verification layers, the most complex planner architecture, and the most precise evaluator mechanisms—that is not a moat, but a burden. Because those components exist because of what the model cannot do.
As the model gets stronger, those reasons diminish. A thicker architecture means a heavier bet on the current model's weaknesses, making it slower to pivot.
The truly valuable aspect is not the thickness of compensation but the ability to track the migration of the compensation surface—knowing what to add next and what to remove from before.
The moat is not in the thickness of the harness, but in the speed of migration.
By the same token, any company claiming to have an "all-in-one harness solution" is signaling that it has not yet hit the wall where model progress turns its compensation layers into overhead.
The next time you see an AI product loudly adding features, ask yourself—is this feature compensating for something the model currently cannot do, or is it compensating for something the model could already do on its own? The former is a necessary cost, the latter is technical debt. The next time you see a team removing features, don't interpret it as "they took a wrong turn," but rather as "they are discovering what the model can now do."
However, all three companies have left a fallback.
OpenAI shifts the conversation, stating, "What we don't know yet is how the consistency of the architecture will evolve over several years in a system entirely generated by Agents." Twenty weeks proved the path is viable, but will the path still exist after a year? Every Friday, the team spends 20% of their time cleaning up "AI slop" (low-quality code generated by Agents)—Agents replicate existing patterns in the repository, including suboptimal ones. Later, automated "golden principles" scanning replaced manual cleanup, but this itself is a signal—the system is degrading rapidly while generating at high speed, requiring continuous "garbage collection" to maintain.
Anthropic puts it more directly: these assumptions are load-bearing, but not permanent.
Cursor discovered another form of loss of control. In a flat structure, Agents become extremely risk-averse, preferring meaningless minor modifications to tackling difficult problems, leading to system idling. The system requires periodic fresh starts to combat drift.
These self-imposed limitations are not PR talk. Precisely because the achievements are real, these uncertainties deserve serious consideration. All three companies are using the same strategy—build fast, validate later. The problem is, when "later" arrives, the system may have accumulated millions of lines of code that no one truly understands.
All these solutions are built upon the current capabilities of the models—and that boundary is not standing still.
If every layer of compensation is temporary, is Harness engineering itself also temporary? No one has an answer. But the existence of this question is a signal.
In 2019, Sutton wrote "The Bitter Lesson," which spoke of the end goal: universal methods of computation would eventually triumph over human-designed clever tricks. But these fifteen months tell the story of the process: you must first diligently build those clever tricks to know which ones to dismantle. Anthropic would not have discovered that Opus 4.6 no longer needs Context Reset if they hadn't built it first. Every compensation layer removed was once carefully built.
The path to simplicity must pass through complexity.
But the difference between "knowing you are going through complexity" and "thinking complexity is the endpoint" is everything.
Codex Source Code Leak Leads to an Unexpected Reconciliation

Originally, the article should have ended here. But just as I was preparing to publish, an unexpected event provided us with an opportunity to examine Harness more deeply from an engineering perspective.
On March 31, 2026, Claude Code v2.1.88 was released. Someone discovered a 59.8MB source map file in the npm package. Within hours, 512,000 lines of TypeScript source code were mirrored, reverse-engineered, and dissected line by line across the internet.
Comparing these 512,000 lines of code with the engineering practices described above reveals that, as expected, each layer of the shell has a corresponding productized implementation, and in several places the code goes even further than the articles described.
Let's start with the first layer. Anthropic suggested in "Effective context engineering for AI agents" (September 2025) that system prompts should be maintained like code, with version control and dynamic assembly. The source code confirms this. It has a dedicated function for assembling prompts, internally using a dividing line to split the prompt into two halves: the first half is the unchanging "ID card" reused across sessions, and the second half is the "task order" assembled on the fly based on the current scenario.
A system prompt written once for a lifetime does not exist here. Each run, the model receives an instruction assembled on the spot.
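The two-halves assembly described above can be sketched as follows. This is an illustrative reconstruction, not the leaked code: `SYSTEM_IDENTITY`, the divider string, and the field names are all placeholders.

```python
SYSTEM_IDENTITY = "You are a coding agent. Follow tool syntax exactly."  # stable "ID card"
DIVIDER = "\n---\n"

def assemble_prompt(task, context_files, mode):
    """Rebuild the system prompt on every run: a fixed identity half,
    plus a 'task order' half assembled from the current scenario."""
    dynamic = [f"Mode: {mode}", f"Task: {task}"]
    if context_files:
        dynamic.append("Relevant files: " + ", ".join(context_files))
    return SYSTEM_IDENTITY + DIVIDER + "\n".join(dynamic)
```

Because the first half never changes across sessions, it can be cached and version-controlled like code, while the second half is cheap to regenerate per run.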
The same article also noted that poorly written tool descriptions are like a confusing map for a goldfish. Correspondingly, the source code directly hardcodes a set of "operation syntaxes." For example, reading files only uses the built-in FileRead instruction, and using the operating system's cat command is prohibited; editing files only uses FileEdit, and generic tools like sed are prohibited. The model has no choice.
Regarding context management, the source code reveals that strategies like compression and Context Reset are not mutually exclusive but rather different stages on an emergency rescue assembly line. It first trims extraneous information, then performs light compression, followed by heavy compression, and finally, full compression. If repeated failures occur three times, it abandons the session and starts a completely new one.
It tries to save it if possible; only when it cannot be saved does it switch.
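The rescue pipeline's escalation logic can be sketched like this. The stage functions, the fit test, and the retry count of three are taken from the description above, but the control flow is an assumption about how they compose:

```python
def rescue_context(history, fits, stages, start_fresh, max_attempts=3):
    """Escalate through rescue stages (trim -> light -> heavy -> full
    compression); after max_attempts failed passes, abandon the session."""
    for _ in range(max_attempts):
        for stage in stages:              # cheapest intervention first
            history = stage(history)
            if fits(history):
                return history            # saved without a reset
    return start_fresh()                  # only switch when it can't be saved
```

The ordering matters: each stage is more destructive than the last, so the pipeline never pays for heavy compression when a light trim would do.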
The idea of memory externalization is pushed to a new level of granularity. The source code reveals a six-layer memory system: from company-level organizational strategy to project-level configuration, individual preferences, session history, Agent habits, and the ongoing conversation. Upper layers cover lower layers. The same Agent sees a different "reality" in different companies and projects. If "repository is truth," here it's "layered repository is layered reality."
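The layered-override rule reduces to a simple merge. A minimal sketch, assuming the layers are supplied in priority order from lowest (company) to highest (the ongoing conversation); the layer contents are invented for illustration:

```python
def resolve_memory(layers):
    """Merge memory layers in priority order; later (upper) layers
    override earlier (lower) ones, so the same agent sees a different
    'reality' per company, project, and session."""
    merged = {}
    for layer in layers:                  # ordered lowest -> highest priority
        merged.update(layer)
    return merged
```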

Even more interestingly, the source code includes a system called autoDream, specifically responsible for maintaining this memory system. This background program performs "memory cleanup" automatically when the user is not using the system. It scans the Agent's notes, merges duplicates, converts vague notes into facts, and transforms relative dates like "yesterday" into specific dates. Finally, it condenses the notebook to within 200 lines. This program has read-only access and cannot modify code; it only organizes notes.
It's like assigning a dedicated note organizer to the goldfish, who redraws its notebook while it sleeps.
This idea echoes memory-consolidation concepts from early neural network research. Updating model weights on the fly is still difficult, so Anthropic has applied a similar consolidation pass to external memory instead.
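A toy version of such a cleanup pass, sketched from the description above (the 200-line budget is from the source; everything else is illustrative):

```python
import datetime

def dream_pass(notes, today, max_lines=200):
    """Offline memory cleanup: resolve relative dates to absolute ones,
    drop duplicate notes, and condense the notebook to max_lines.
    Read-only toward code; it only reorganizes notes."""
    seen, cleaned = set(), []
    for line in notes:
        line = line.replace("yesterday",
                            (today - datetime.timedelta(days=1)).isoformat())
        if line not in seen:              # merge duplicate notes
            seen.add(line)
            cleaned.append(line)
    return cleaned[:max_lines]            # condense to the line budget
```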
The second layer reconciliation is more direct. The "Planner-Executor" gatekeeping mode from Cursor and Anthropic's Git file locks have merged into Coordinator Mode in the source code. A main Claude acts as a foreman, dispatching multiple Worker Agents through a four-step pipeline: research, synthesis, implementation, and verification. If a Worker needs to perform a risky operation, it must request permission from the foreman via an "email." The system has built-in collision avoidance to ensure only one Agent can claim a specific operation.
The foreman's instructions include a sentence: "Concurrency is your superpower."
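The collision-avoidance guarantee, that only one Agent can claim a specific operation, is essentially a claim registry behind a lock. A minimal sketch (the class and method names are invented, not from the source):

```python
import threading

class ClaimBoard:
    """Built-in collision avoidance: only one worker can claim an operation."""
    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}

    def claim(self, operation, worker):
        """Atomically claim an operation; False means someone got there first."""
        with self._lock:
            if operation in self._claims:
                return False
            self._claims[operation] = worker
            return True
```

The foreman hands out work, but it is this registry that makes "concurrency is your superpower" safe: two Workers can never edit the same target at once.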
But the source code goes far beyond Coordinator Mode. The source code introduces a completely new Team Mode, with a fundamentally different logic.

In Team Mode, Agents are long-term "teammates." Each teammate has their own independent context window, Git workspace, and memory. They can send messages directly to each other without an intermediary. A front-end expert and a back-end expert can work on their respective code branches independently and merge their work later.
This solves a critical problem: in traditional Coordinator Mode, a single Agent starts to get confused when using 80%-90% of its context window. Team Mode controls each teammate's context utilization to around 40%. Multiple people manage their own areas, keeping everyone's mind clear.
Communication between teammates uses a file-based "mailbox" system. Teammates check for new messages every 500 milliseconds. When a teammate finishes a task, they don't disappear but enter a standby state. It's not a disposable outsourced service, but a long-term team member.
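A file-based mailbox like the one described reduces to writing message files into a directory and consuming them on each poll tick. The message schema and filenames here are illustrative; only the directory-of-files design and the 500 ms polling cadence come from the source:

```python
import json
import time
from pathlib import Path

def send(mailbox_dir, sender, body):
    """Drop a message file into a teammate's mailbox directory."""
    msg = {"from": sender, "body": body, "ts": time.time()}
    path = Path(mailbox_dir) / f"{time.time_ns()}.json"
    path.write_text(json.dumps(msg))

def check_mail(mailbox_dir):
    """One poll tick (run e.g. every 500 ms): read and consume new messages."""
    messages = []
    for path in sorted(Path(mailbox_dir).glob("*.json")):
        messages.append(json.loads(path.read_text()))
        path.unlink()                     # consume so it isn't re-read
    return messages
```

Using the filesystem as the transport means no broker process is needed, and a teammate in standby costs nothing but its periodic directory scan.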
The third layer is also fully implemented. In the source code, the evaluator role is called Verification Agent, explicitly instructed to "try to break it." It must output standardized judgments of PASS, FAIL, or PARTIAL.
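Standardized verdicts only help if the harness can parse them mechanically. A sketch of such a normalizer; the source specifies only the three labels, so the fail-closed default and the matching order are assumptions:

```python
def parse_verdict(output):
    """Normalize a Verification Agent reply to PASS, FAIL, or PARTIAL."""
    for verdict in ("PARTIAL", "PASS", "FAIL"):  # check PARTIAL before PASS
        if verdict in output.upper():
            return verdict
    return "FAIL"                         # unparseable replies fail closed
```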
The source code also strictly separates Agent permissions based on roles. Agents responsible for research can only read, not write; Agents responsible for planning cannot touch files. Sandbox isolation has become a core design principle.
Finally, for Chapter Four, the source code verifies that Anthropic's "remove a component" approach is routine operation. All advanced features are controlled by feature flags—row upon row of main switches. Features not enabled are directly removed during the build. There are 44 switches—44 patches that can be removed at any time.
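The "patches that can be removed at any time" pattern is an ordinary feature-flag gate. A minimal sketch; the flag names below are hypothetical, not the 44 real switches:

```python
FLAGS = {"team_mode": True, "auto_dream": False}  # hypothetical flag names

def enabled(flag):
    """A disabled or unknown flag means the feature does not exist."""
    return FLAGS.get(flag, False)

def build(features):
    """Features behind disabled flags are dropped from the build entirely."""
    return {name: fn for name, fn in features.items() if enabled(name)}
```

The point of gating at build time rather than runtime is that a removed compensation layer leaves no code path behind to maintain, so each experiment in "removal" is a one-line flip.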
The "Compensation Surface Migration" is Happening Now
But these 512,000 lines of code also contain things those engineering articles never mentioned. And their nature is entirely different from the first three layers of shell.
The first three layers address how to make Agents perform long-term tasks. But the new systems appearing in the source code are no longer within that scope.
The first is KAIROS, a background daemon. It doesn't wait for you to speak; it periodically asks itself, "Do I need to do anything now?" However, it has a strict limitation: any operation that would interrupt the user for more than 15 seconds is automatically postponed. KAIROS manages "whether to do it," transforming the Agent into a proactive assistant. The 15-second figure is a new unit of measurement—"the cost of interrupting humans."
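The deferral policy reduces to a budget check on each self-initiated tick. The 15-second budget is from the source; the action list and cost estimates are illustrative:

```python
def schedule_tick(pending, budget_s=15):
    """One KAIROS-style self-check: run only actions whose estimated
    interruption cost stays under the budget; postpone the rest."""
    run_now, postponed = [], []
    for action, cost_s in pending:
        (run_now if cost_s <= budget_s else postponed).append(action)
    return run_now, postponed
```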
The second is the YOLO Classifier, which assigns a risk label to every operation. Safe operations are directly allowed. Writing files within the project directory goes through a fast track, while operations outside require full approval. Executing command-line scripts always requires full approval. The classifier learns—if you reject a certain type of operation several times, the system remembers and blocks it in the future. The shell is learning how to be a shell.
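The lanes described above, plus the learned blocking, can be sketched as a small stateful classifier. The lane names, path convention, and the rejection threshold of three are all assumptions for illustration:

```python
class RiskClassifier:
    """Assign each operation a risk lane; repeated rejections of an
    operation type are remembered and auto-blocked afterwards."""
    def __init__(self, reject_threshold=3):
        self.rejections = {}
        self.threshold = reject_threshold

    def classify(self, op_type, path=""):
        if self.rejections.get(op_type, 0) >= self.threshold:
            return "block"                # learned from past rejections
        if op_type == "read":
            return "allow"                # safe ops go straight through
        if op_type == "write" and path.startswith("./project/"):
            return "fast_track"           # in-project writes
        return "full_approval"            # shell commands, outside writes

    def record_rejection(self, op_type):
        self.rejections[op_type] = self.rejections.get(op_type, 0) + 1
```

The interesting part is the feedback loop: the user's past "no" answers reshape the lane assignment, which is what "the shell is learning how to be a shell" amounts to in code.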
The third is Hooks, which embeds slots at 8 critical nodes on the Agent's workflow. Anyone can insert their own inspection scripts. Hooks transforms the shell into an open platform where a company can attach its own compliance checks. The shell is no longer a monolithic block, but a framework with 8 slots.
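An 8-slot hook system is a registry of callback lists keyed by workflow node. The node names below are invented placeholders (the source names neither the 8 nodes nor the API), but the shape, register a script, fire it at the slot, is the pattern being described:

```python
HOOK_POINTS = ["pre_prompt", "post_response", "pre_tool", "post_tool",
               "pre_commit", "post_commit", "session_start", "session_end"]

class HookRegistry:
    """Slots on the Agent workflow where anyone can attach inspection scripts."""
    def __init__(self):
        self.hooks = {point: [] for point in HOOK_POINTS}

    def register(self, point, fn):
        if point not in self.hooks:
            raise ValueError(f"unknown hook point: {point}")
        self.hooks[point].append(fn)

    def fire(self, point, payload):
        for fn in self.hooks[point]:      # e.g. a company compliance check
            payload = fn(payload)
        return payload
```

A company plugging in its own compliance check never touches the shell's core; it only appends a function to one of the slots.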
None of these findings beyond the original three layers is essential for executing long-term tasks. But they are essential for efficiency, customization, and commercial defense.
KAIROS transforms Agents from passive tools into proactive assistants. The YOLO Classifier allows the shell to self-adapt. Hooks transforms the shell from a closed product into an open platform.
These directions point to a new movement. The shell is not just becoming thinner or thicker; it is expanding into entirely new dimensions. The shell is now spreading from Harness to Infra.

The compensation surface is not just migrating; it is expanding.