The Harness Engineering Compendium

Full Version | Video: AI Engineering Deep Dive | 2026-04-04
Analysis model: Gemma 4 Local Ollama | Voice transcription: mlx-whisper

Same model, same prompt: improving only the runtime system around the model raised the Agent success rate from 70% to 95%. That's the power of Harness Engineering.

Table of Contents

Core Insight
Three-Stage Evolution (P -> C -> H)
The Six-Layer Harness Architecture
Leading Company Practices
Historical Parallels
Business Insights
Summary

1. Core Insight

The central challenge of deploying AI is shifting from "making the model appear smarter" to "making the model work reliably in the real world."

Whether an AI application can operate stably in the real world and handle complex multi-step tasks is determined not by the model's IQ but by the reliable, controllable, and traceable "runtime environment" and "control system" built around it.

Agent = Model + Harness
Harness = Everything besides the model that ensures stable delivery

2. Three-Stage Evolution (P -> C -> H)

Over the past two years, AI engineering has undergone three distinct shifts in focus. The three stages do not replace one another; they are nested, and each has a wider boundary than the last.

Stage 1: Prompt Engineering

Core question: Did the model understand what you're saying?

The essence of prompt engineering isn't commanding the model -- it's shaping a local probability space. Give it a role, and it answers along that role; give it examples, and it completes along that pattern.

Good at: Defining tasks, constraining output, activating the model's existing capabilities.

Not good at: Compensating for missing knowledge, managing dynamic information, handling state across long task chains.

The most important capability at this stage isn't system design -- it's language design.

Stage 2: Context Engineering

Core question: Does the model have enough of the right information?

When an Agent needs to operate in the real world -- multi-turn conversations, tool calls, passing intermediate results between steps -- the challenge is no longer "is one answer correct" but "can the entire chain run to completion."

Context isn't just a few background paragraphs -- it represents the total sum of all information influencing the model's current decision: user input, conversation history, retrieval results, tool returns, task state, system rules, safety constraints, and more.

Key approach: Context optimization isn't about "giving more" but about providing on demand, in layers, at the right moment. (Like Agent Skills: provide only minimal metadata upfront, dynamically load detailed SOPs when needed.)
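
The "minimal metadata upfront, detailed SOP on demand" idea can be sketched as follows. This is only an illustration: the `Skill` class, the `SKILLS` registry, and the `build_context` function are hypothetical names, not part of any real Agent Skills API.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str      # short identifier shown to the model upfront
    summary: str   # one-line description, always placed in context
    sop: str       # detailed procedure, loaded only on demand


# Hypothetical registry; names and contents are illustrative only.
SKILLS = {
    "refund": Skill("refund", "Handle customer refund requests.",
                    "1. Verify the order. 2. Check the refund window. 3. Issue the refund."),
    "invoice": Skill("invoice", "Generate or correct invoices.",
                     "1. Look up the billing record. 2. Regenerate the PDF."),
}


def build_context(user_input: str, active_skills: set[str]) -> str:
    """Always include the skill index; include full SOPs only for active skills."""
    lines = ["Available skills:"]
    for skill in SKILLS.values():
        lines.append(f"- {skill.name}: {skill.summary}")
    for name in sorted(active_skills):
        lines.append(f"\n[SOP for {name}]\n{SKILLS[name].sop}")
    lines.append(f"\nUser: {user_input}")
    return "\n".join(lines)
```

With no active skills the context carries only the two-line index; the refund SOP appears only once `"refund"` is passed in `active_skills`.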

Stage 3: Harness Engineering

Core question: Can the model keep performing correctly during real execution?

"Harness" originally means reins, horse tack, a restraint device. Even when information is perfectly supplied, the model may not execute stably -- it plans well but drifts during execution, misinterprets tool results, or slowly veers off course over long chains with nobody noticing.

Prompts optimize intent expression, context optimizes information supply, and Harness addresses: when the model starts taking continuous actions, who monitors it, constrains it, and corrects it?

Analogy: Sending a new hire to visit a client -- Prompt is explaining the task clearly, Context is preparing all the materials, Harness is sending them with a checklist, requiring real-time check-ins at key milestones, correcting deviations immediately, and validating results against standards.

3. The Six-Layer Harness Architecture

Layer 1: Goal & Role Definition

The model must know who it is, what the task is, and what success looks like. The Harness's first job is to keep the model thinking within the correct information boundaries.

Layer 2: Tool System

Plugging in tools is not enough. Three problems must be solved: which tools to provide (too few limits capability, too many cause confusion), when to invoke them (don't query when unnecessary, don't hallucinate an answer when a query is needed), and how to refine tool results before feeding them back (raw results should not be dumped straight into context).
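
As a rough sketch of the third problem, a tool result can be trimmed before it goes back to the model. The function name `refine_tool_result` and the assumption that the tool returns a JSON array of objects are illustrative choices, not something the source specifies.

```python
import json


def refine_tool_result(raw: str, keep_fields: list[str], max_items: int = 3) -> str:
    """Reduce a raw JSON tool response to the fields and rows the model needs."""
    records = json.loads(raw)  # assumed shape: a JSON array of objects
    trimmed = [
        {key: record[key] for key in keep_fields if key in record}
        for record in records[:max_items]
    ]
    # Tell the model how much was left out so it can ask for more if needed.
    omitted = max(len(records) - max_items, 0)
    return json.dumps({"results": trimmed, "omitted": omitted})
```

Reporting the number of omitted rows keeps the summary honest: the model sees a compact result without being misled into thinking it has seen everything.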

Layer 3: Execution Orchestration

Addresses "what the model should do next." Establish a clear execution track: understand the goal -> assess if information is sufficient -> supplement data -> execute -> check output -> retry if unsatisfactory. This closely resembles how humans work; the difference is that humans rely on experience, while Agents rely on the Harness.
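
The execution track above can be written as a small control loop. This is a minimal sketch: `run_task` and its callable parameters are hypothetical stand-ins for model and tool calls, and counting a data-gathering pass as one attempt is a simplification chosen here.

```python
from typing import Callable


def run_task(
    goal: str,
    has_enough_info: Callable[[str, list[str]], bool],
    gather: Callable[[str], str],
    execute: Callable[[str, list[str]], str],
    check: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Drive one task along a fixed track: assess, supplement, execute, check, retry."""
    notes: list[str] = []
    for _ in range(max_attempts):
        if not has_enough_info(goal, notes):
            notes.append(gather(goal))        # supplement data
            continue                          # reassess before executing
        output = execute(goal, notes)         # execute
        if check(output):                     # check output
            return output
        notes.append(f"rejected: {output}")   # keep the failure visible for the retry
    raise RuntimeError(f"gave up on {goal!r} after {max_attempts} attempts")
```

The loop, not the model, decides what happens next, and the attempt cap guarantees the task ends either with a checked output or an explicit failure.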

Layer 4: Memory & State

An Agent without state is amnesiac every turn. At minimum, three categories must be distinguished: current task state, in-session intermediate results, and long-term memory with user preferences. Mix them together and the system grows increasingly chaotic.
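
One way to keep the three categories apart is to give each its own field and lifetime. The class and field names below (`TaskState`, `AgentMemory`, `scratch`, `long_term`) are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    goal: str
    step: int = 0
    done: bool = False


@dataclass
class AgentMemory:
    task: TaskState                                           # current task state
    scratch: dict[str, str] = field(default_factory=dict)     # in-session intermediate results
    long_term: dict[str, str] = field(default_factory=dict)   # preferences that outlive the session

    def end_session(self) -> None:
        """Drop task and session data; long-term memory is left untouched."""
        self.scratch.clear()
        self.task = TaskState(goal="")
```

Because each category has a separate container, ending a session cannot accidentally erase user preferences, and stale intermediate results cannot leak into the next task.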

Layer 5: Evaluation & Observability

The most easily overlooked layer. Many systems don't fail to generate output -- they just can't tell if their output is any good. This includes: output validation, environment verification, automated testing, logging and metrics, and error attribution.
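
A minimal sketch of the logging, metrics, and error-attribution pieces might look like this. The `Trace` class and its methods are hypothetical, and "the earliest failing step" is only one simple attribution heuristic.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trace:
    """Append-only record of what each step did and whether it passed validation."""
    events: list[dict] = field(default_factory=list)

    def record(self, step: str, ok: bool, detail: str = "") -> None:
        self.events.append({"step": step, "ok": ok, "detail": detail, "ts": time.time()})

    def success_rate(self) -> float:
        """Fraction of recorded steps that passed; 0.0 when nothing was recorded."""
        if not self.events:
            return 0.0
        return sum(1 for e in self.events if e["ok"]) / len(self.events)

    def first_failure(self) -> Optional[dict]:
        """Error attribution heuristic: the earliest failing step is the first place to look."""
        return next((e for e in self.events if not e["ok"]), None)
```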

Layer 6: Constraints, Validation & Recovery

This is the layer that truly determines whether a system can go to production. In real environments, failure is the norm (inaccurate search results, API timeouts, messy document formats). The layer must include constraints (what can and can't be done), validation (how to check before and after output), and recovery (how to roll back to a stable state after failure).
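
The three parts can be shown together in one small sketch. The account-style `state` dictionary, the `apply_step` function, and the `ConstraintError` class are invented for illustration; the point is the order of operations: constrain, snapshot, act, validate, and restore on failure.

```python
import copy


class ConstraintError(Exception):
    """Raised when an action is outside what the agent is allowed to do."""


def apply_step(state: dict, action: str, amount: int, allowed: set[str]) -> dict:
    """Apply one action to state; roll back to the snapshot if anything fails."""
    if action not in allowed:                      # constraint: checked before acting
        raise ConstraintError(f"action {action!r} is not permitted")
    snapshot = copy.deepcopy(state)                # stable state to return to
    try:
        state["balance"] += amount if action == "credit" else -amount
        if state["balance"] < 0:                   # validation: checked after acting
            raise ValueError("balance would go negative")
        state["history"].append((action, amount))
        return state
    except Exception:
        state.clear()
        state.update(snapshot)                     # recovery: restore the snapshot
        raise
```

The error is re-raised after the rollback so the caller still learns that the step failed, but the state it sees is the last known good one.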

4. Leading Company Practices

Anthropic -- Production and Validation Must Be Separated

Problem 1: Context fatigue. Over time, the context fills up. The model starts dropping details and key points, and even seems to "know it's running out of room and rushes to wrap up." The traditional approach is to compress historical context, but Anthropic found that compression only makes things shorter -- the sense of burden doesn't disappear.

Solution: Context Refresh -- Instead of compressing within the same context, spin up an entirely new Agent and hand off the work. Like dealing with a memory leak in engineering: don't keep clearing the cache -- restart the process and restore state.
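
A hedged sketch of the handoff idea follows. The `Agent` class, the character-count size proxy, and the `refresh_if_needed` function are assumptions for illustration and do not describe Anthropic's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    system_prompt: str
    messages: list[str] = field(default_factory=list)

    def context_size(self) -> int:
        # Rough proxy: character count. A real system would count tokens.
        return len(self.system_prompt) + sum(len(m) for m in self.messages)


def refresh_if_needed(agent: Agent, handoff_note: str, limit: int) -> Agent:
    """Return a fresh agent seeded with a handoff note once the context exceeds the limit."""
    if agent.context_size() <= limit:
        return agent
    # The new agent gets none of the old history, only the restored state.
    return Agent(
        system_prompt=agent.system_prompt,
        messages=[f"Handoff from previous agent: {handoff_note}"],
    )
```

The handoff note plays the role of the restored state after a process restart: it should carry what is done, what is pending, and where the artifacts live, and nothing else.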

Problem 2: Self-scoring bias. When the model does the work and grades itself, it tends toward optimism.

Solution: Separate production from validation -- The Planner breaks down requirements, the Generator implements, and the Evaluator tests like QA (not just reading code, but actually operating pages and checking interaction results). As long as the evaluator is sufficiently independent, the system forms an effective loop.
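
The Planner / Generator / Evaluator split can be sketched with three injected callables. The function `build_with_review` and its parameter shapes are hypothetical; in a real system each callable would wrap a separate agent, and the Evaluator would exercise the running product rather than inspect a string.

```python
from typing import Callable


def build_with_review(
    requirement: str,
    plan: Callable[[str], list[str]],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], tuple[bool, str]],
    max_rounds: int = 3,
) -> dict[str, str]:
    """Planner splits the work, Generator builds, a separate Evaluator accepts or rejects."""
    accepted: dict[str, str] = {}
    for task in plan(requirement):
        feedback = ""
        for _ in range(max_rounds):
            artifact = generate(task, feedback)
            # The Evaluator sees only the task and the artifact, never the
            # Generator's own assessment of its work.
            passed, feedback = evaluate(task, artifact)
            if passed:
                accepted[task] = artifact
                break
        else:
            raise RuntimeError(f"{task!r} not accepted after {max_rounds} rounds")
    return accepted
```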

OpenAI -- Redefining the Engineer's Role

Core approach: Humans don't need to write code -- they only need to "design the environment." The engineer's job becomes three things:

Break product goals into tasks the Agent can understand
When the Agent fails, don't tell it to "try harder" -- ask what capability is missing from the environment
Build feedback loops so the Agent can see the results of its own work

AGENTS.md lesson: Early on, cramming all specifications into one massive AGENTS.md actually made the Agent more confused (the context window is a scarce resource -- stuffing it full equals saying nothing). They switched to a directory-page structure: keep only the core index, break detailed content into sub-files, and drill in only when needed.

Automated governance system: Senior engineers' experience was encoded into system rules (how modules are layered, which layers can't depend on which). Rules don't just flag errors -- they return "how to fix it" to the Agent, forming a continuously running automated governance system.
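
A rule that returns "how to fix it" rather than just an error could look like the sketch below. The three-layer order in `LAYERS` and the `check_dependency` function are invented for illustration; the source does not describe OpenAI's actual rules.

```python
from typing import Optional

# Hypothetical layer order, lowest first. A layer may depend on itself or on
# lower layers, never on a higher one.
LAYERS = ["domain", "service", "api"]


def check_dependency(importer: str, imported: str) -> Optional[str]:
    """Return None if the dependency is allowed, else a message that says how to fix it."""
    if LAYERS.index(importer) >= LAYERS.index(imported):
        return None
    return (
        f"'{importer}' must not depend on '{imported}'. "
        f"Fix: move the shared code into '{importer}' or a lower layer, "
        f"or define an interface in '{importer}' and implement it in '{imported}'."
    )
```

Returning the remedy as text means the Agent can act on the violation in its next step instead of guessing at what the rule wants.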

5. Historical Parallels

(The following was analyzed and produced by the Gemma 4 local model)

Case 1: Shu Han's Institutional Governance -- Embodying "Memory/State" and "Constraints/Recovery"

Zhuge Liang (181-234 AD, the legendary strategist of China's Three Kingdoms era) was a supergenius (the Model), and his strategic objectives (Prompt) were brilliant. However, Shu Han's survival and development ultimately depended not on battlefield miracles but on the complete institutional system Zhuge Liang built: clear resource allocation mechanisms, systematized tax collection processes, and a logistics system ensuring stable supply chains.

When front-line commanders (Agents) lost battles, this mature "system" and "memory" (the stable operation of the bureaucracy, clear chains of command) ensured the government wouldn't collapse due to a single individual's failure. This maps onto the Harness's "state management" and "recovery mechanism."

Case 2: Warring States Legalism in Practice -- Embodying "Goal Definition" and "Tool System"

During China's Warring States period (475-221 BC), various schools of thought offered attractive "theoretical models" (Models) and provided rulers with "concepts" (Prompts). For example, Legalist philosophy's "strict laws and heavy punishments" appeared perfect on paper, but without a systematized execution framework, it easily led to excessive cruelty or lack of sustainability.

What truly allowed these theories to scale into national governance was embedding them into institutional systems: establishing comprehensive bureaucratic selection (Tool System), defining specific permissions and operating procedures for each level of officials in different scenarios (Execution Orchestration), and breaking national governance into standard operating procedures (SOPs). This transformed academic concepts from "idealized descriptions" into "executable governance tools."

6. Business Insights

(The following was analyzed and produced by the Gemma 4 local model)

Competition in the AI era has already shifted from "who has the best model" to "who has the most stable operating system." All business strategies should center on "how to build a reliable system layer."

1. Vertical Industry "Intelligent Workflow" SaaS Platforms

Positioning: Don't build "AI chatbots" -- build "fully automated business process engines."

Approach: Break complex business processes (contract review, content production, customer service) into dozens of precise atomic steps. The platform's core is the "orchestration logic" connecting these steps. This way, no matter how the underlying model updates or iterates, as long as the workflow is stable, product value remains consistent -- a powerful moat.

2. AI System "Quality Assurance" Verification Services

Positioning: Sell "reliability," not "generation capability." Be the QA service provider for AI systems.

Approach: For highly regulated industries like finance and healthcare, provide systematic testing environments. The core: automatically capture logs after each AI operation, record execution paths, and score against business constraints. When errors occur, don't just report them -- provide "rollback" functionality, turning failure handling itself into a high-value product.

3. Knowledge-Graph-Driven State Management Layer

Positioning: Go beyond traditional RAG into "structured knowledge management."

Approach: Automatically clean, refine, and inject enterprise data scattered across documents and legacy systems into the Agent's runtime memory as a knowledge graph. This way the Agent doesn't merely "read" information but can "understand the relationships between pieces of information" -- referencing not just "Report A" but the intersection of Report A with current market conditions, historical policies, supply chain constraints, and all related nodes.

7. Summary

Prompt Engineering solves "how to explain the task clearly"

Context Engineering solves "how to provide the right information"

Harness Engineering solves "how to keep the model performing correctly during real execution"

Harness doesn't replace the first two -- it encompasses both within a larger system boundary. What truly determines whether AI can be deployed and deliver reliably is the Harness.