Full Version | Video: AI Engineering Deep Dive
A true story: a team built its Agent on the best available model, rewrote the prompt more than a hundred times, and tuned every parameter imaginable, yet the task success rate never broke 70% -- sometimes brilliant, sometimes inexplicably off-track.
They eventually brought in someone to diagnose the system. The biggest changes touched neither the model nor the prompt: they were in how tasks were decomposed, how state was managed, how key steps were validated, and how failures were recovered. The result? Same model, same prompt -- success rate pulled above 95%.
This reveals a First Principles truth:
Over the past two years, AI engineering has undergone three major shifts in focus. Each shift originated from the structural ceiling hit by the previous stage. The three are nested, not sequential -- Prompt is a subset of Context, and Context is a subset of Harness.
Problem it solves: Did the model understand what you're saying?
Essence: Large models are probability-generation systems extremely sensitive to context. Give it a role, and it answers along that role; give it examples, and it completes along that pattern. So prompt engineering isn't about commanding the model -- it's about shaping a local probability space.
Ceiling: It can only activate capabilities the model already has. No matter how well you phrase things, you can't compensate for missing facts, manage state, or handle complex multi-step tasks. Prompting solves the "expression" problem, not the "information" problem.
Problem it solves: Does the model have enough of the right information?
Essence: Context isn't just a few background paragraphs -- it's the sum of all information influencing the model's current decision: user input, conversation history, retrieval results, tool returns, task state, intermediate outputs, system rules, safety constraints, and results passed from other agents. The prompt is actually just a small part of the context.
Key insight: The context window is a scarce resource. More information isn't always better -- too much dilutes attention. The mature approach is to provide information on demand, in layers, at the right moment. (A canonical practice: Agent Skills start with minimal metadata and only dynamically load the full SOP when needed.)
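The "minimal metadata first, full SOP on demand" idea can be sketched roughly as follows. This is an illustration under stated assumptions, not how Agent Skills is actually implemented: `Skill`, `SkillRegistry`, `load_sop`, and `build_context` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Skill:
    """Hypothetical skill: cheap metadata up front, full SOP loaded lazily."""
    name: str
    summary: str                  # always placed in the context (a few tokens)
    load_sop: Callable[[], str]   # called only when the skill is selected
    _sop: Optional[str] = field(default=None, repr=False)

    def sop(self) -> str:
        # Load the full procedure at most once, on first use.
        if self._sop is None:
            self._sop = self.load_sop()
        return self._sop


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def index(self) -> str:
        """The minimal listing that is always part of the context."""
        return "\n".join(f"- {s.name}: {s.summary}" for s in self._skills.values())

    def build_context(self, task: str, selected: Optional[str] = None) -> str:
        """Assemble the context; include a full SOP only for the selected skill."""
        parts = ["Available skills:", self.index(), f"Task: {task}"]
        if selected is not None:
            parts.append(f"SOP for {selected}:\n{self._skills[selected].sop()}")
        return "\n".join(parts)
```

In this sketch the index costs one line per skill, and the SOP text only enters the context once something (the model or a router, not shown here) picks that skill.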
Ceiling: Even when all the information is supplied perfectly, models on long execution chains can still plan well but execute poorly, call tools but misinterpret what they return, and drift slowly off course without anyone noticing.
Problem it solves: Can the model keep performing correctly during real execution?
Essence: "Harness" originally means reins, horse tack, a restraint device. The first two generations of engineering focused on making models "think better." Harness focuses on keeping models "on track" -- and when they veer off or make mistakes, pulling them back.
Prompts optimize "intent expression," context optimizes "information supply," but the hardest problem in complex tasks is: when the model starts taking continuous actions, who monitors it, constrains it, and corrects it?
Sending a new hire to visit an important client:
Prompt = Explain the task clearly (greet them, present the proposal, ask about needs, confirm next steps)
Context = Prepare all the materials (client background, past records, pricing, competitors, meeting objectives)
Harness = Send them with a checklist, require real-time check-ins at key milestones, verify notes against the recording afterwards, correct deviations immediately, and validate results against standards
During long autonomous tasks, the context fills up, and the model starts dropping details, losing focus, and even exhibiting a curious behavior -- it seems to "know it's running out of room" and starts rushing to wrap up.
The common approach is to compress the historical context and keep going, but Anthropic found that compression only makes things shorter -- the model's "sense of burden" doesn't go away. So they did something more radical: spin up an entirely new Agent and hand off the work. It's like dealing with a memory leak in software engineering -- instead of clearing the cache, you restart the process and restore state.
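The "restart the process and restore state" analogy can be made concrete with a small sketch. It is not Anthropic's implementation; `Handoff`, `AgentSession`, `run_with_resets`, the character-count budget, and the 0.7 threshold are all assumptions made for illustration, and the model call is replaced by a fake transcript entry.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Tuple


@dataclass
class Handoff:
    """Structured state that survives a restart (field names are illustrative)."""
    goal: str
    done: List[str] = field(default_factory=list)
    todo: List[str] = field(default_factory=list)


class AgentSession:
    """Stand-in for one agent instance with a bounded context budget."""

    def __init__(self, handoff: Handoff, budget: int) -> None:
        self.handoff = handoff
        self.budget = budget
        # A fresh session starts with only the serialized handoff in context.
        self.context: List[str] = [json.dumps(asdict(handoff))]

    def used(self) -> int:
        # Crude proxy for token usage: total characters in the context.
        return sum(len(chunk) for chunk in self.context)

    def step(self) -> None:
        """Pretend to do one task; a real agent would call a model here."""
        task = self.handoff.todo.pop(0)
        self.context.append(f"worked on {task}: " + "x" * 200)  # fake transcript
        self.handoff.done.append(task)


def run_with_resets(goal: str, tasks: List[str], budget: int = 500,
                    threshold: float = 0.7) -> Tuple[Handoff, int]:
    """Run all tasks, starting a fresh session when the context is nearly full.

    Returns the final handoff state and the number of sessions used.
    """
    session = AgentSession(Handoff(goal=goal, todo=list(tasks)), budget)
    sessions = 1
    while session.handoff.todo:
        if session.used() > threshold * budget:
            # Restart: discard the transcript, keep only the structured state.
            session = AgentSession(session.handoff, budget)
            sessions += 1
        session.step()
    return session.handoff, sessions
```

The design point is that the new session never sees the old transcript, compressed or not. It sees only the explicit handoff record, so whatever must survive a restart has to be written into that record.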
When a model does the work and grades itself, it's naturally biased toward optimism. The solution: split "doing" and "verifying" across different roles --
Core principle: As long as the evaluator is sufficiently independent, the system can form an effective "generate -> check -> fix -> re-check" loop.
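A minimal sketch of the "generate -> check -> fix -> re-check" loop, assuming the generator and evaluator are separate callables that share nothing except the task and the draft. `run_loop`, `Verdict`, and the two toy functions are hypothetical names; in a real system each callable would wrap its own model call or test suite.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str = ""


def run_loop(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str, str], Verdict],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Alternate generation and independent evaluation until pass or budget.

    Returns (last draft, whether it passed, rounds used).
    """
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)   # "do" (or "fix" when feedback is set)
        verdict = evaluate(task, draft)    # "check", by a separate role
        if verdict.passed:
            return draft, True, round_no
        feedback = verdict.feedback        # feeds the next correction round
    return draft, False, max_rounds


# Toy stand-ins so the loop can be run without any model.
def toy_generate(task: str, feedback: Optional[str]) -> str:
    return "1. a\n2. b" if feedback is None else "1. a\n2. b\n3. c"


def toy_evaluate(task: str, draft: str) -> Verdict:
    count = len(draft.splitlines())
    if count == 3:
        return Verdict(True)
    return Verdict(False, f"expected 3 items, got {count}")
```

The evaluator here only receives the task and the draft, never the generator's reasoning, which is one simple way to keep it independent. The `max_rounds` cap keeps a generator that never satisfies the evaluator from looping forever.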
Humans don't write code -- they only "design the environment." The engineer's job becomes three things:
Early on, they crammed all specifications into one massive AGENTS.md file, which actually made the Agent more confused -- the context window is a scarce resource, and stuffing it full is the same as saying nothing. They switched to a directory-page structure: keep only a core index, break detailed content into sub-files, and drill in only when needed. This is fundamentally the same idea as Agent Skills' "load on demand" approach.
Agents submit code so fast that human code review can't keep up. So they encoded senior engineers' experience into system rules (module layering, dependency restrictions, interception conditions), and the rules don't just flag errors -- they return "how to fix it" to the Agent, feeding into the next correction cycle. This is no longer traditional code standards; it's a continuously running automated governance system.
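A rule that returns "how to fix it" rather than only flagging an error might look like the sketch below. The layer names in `LAYERS`, the `Violation` record, and the wording of the hints are assumptions for illustration; the source does not describe the actual rules or tooling used.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical layering: a module may import only from its own layer or lower.
LAYERS: Dict[str, int] = {"ui": 2, "service": 1, "domain": 0}


@dataclass
class Violation:
    module: str
    imported: str
    message: str
    fix_hint: str   # actionable guidance returned to the Agent


def layer_of(module: str) -> str:
    """Treat the first dotted component as the layer name."""
    return module.split(".", 1)[0]


def check_imports(imports: Dict[str, List[str]]) -> List[Violation]:
    """Flag imports that point from a lower layer up to a higher one."""
    violations: List[Violation] = []
    for module, deps in imports.items():
        src = layer_of(module)
        for dep in deps:
            dst = layer_of(dep)
            if src not in LAYERS or dst not in LAYERS:
                continue  # unknown layers are out of scope for this rule
            if LAYERS[dst] > LAYERS[src]:
                violations.append(Violation(
                    module=module,
                    imported=dep,
                    message=f"'{src}' layer must not depend on '{dst}' layer.",
                    fix_hint=(f"Move the needed logic from {dep} into the "
                              f"'{src}' layer or below, or invert the "
                              f"dependency with an interface defined in '{src}'."),
                ))
    return violations


def feedback_for_agent(violations: List[Violation]) -> str:
    """Render violations as text the Agent can act on in its next attempt."""
    return "\n".join(f"{v.module}: {v.message} Fix: {v.fix_hint}" for v in violations)
```

The output of `feedback_for_agent` is meant to be appended to the Agent's next turn, so each rejected change arrives with a concrete repair direction instead of a bare failure.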
Without changing the underlying model at all, solely by redesigning and optimizing the Harness, they pulled their agent from outside the top 30 to the top 5 on the leaderboard. This is the most intuitive proof of Harness Engineering's value.
Zhuge Liang (181-234 AD) was the most powerful "model" of China's Three Kingdoms era -- a peerless strategist, equivalent to a top-tier LLM. His "Longzhong Plan" (a famous strategic blueprint presented to warlord Liu Bei) was his "Prompt" -- clearly defining the strategic goal of dividing the realm into three. But the survival of Shu Han (the weakest of the three kingdoms) was never sustained by one brilliant move after another.
What truly kept this weakest kingdom alive for decades was the institutional governance system Zhuge Liang built: strict rule of law (the Shu Code) served as the constraint layer; military farming programs and a stable logistics supply chain served as execution orchestration; the chancellery's bureaucratic system ensured that even front-line failures wouldn't cause systemic collapse -- this was state management and recovery.
Contrast this with Guan Yu (a legendary warrior general): his individual combat ability was top-tier (a strong Model), but his loss of the strategic city of Jingzhou was essentially a Harness failure. He had no intelligence verification mechanism (a missing evaluation/observability layer), no rollback plan for failure (a missing recovery layer), and chaotic state management (he was completely unaware when his rear was attacked). A top-tier Model without a Harness -- one failure means permanent collapse.
The Shang Yang Reforms (4th century BC, which transformed the state of Qin into a military superpower) were essentially a complete Harness Engineering exercise.
Before the reforms, Qin's "Model" (military potential) wasn't much weaker than that of the other six warring states. What was lacking? The operating system. What Shang Yang did maps with striking precision:
Result: the same Qin people, who before the reforms were beaten by the state of Wei and lost the Hexi lands (the territory west of the Yellow River), within just one generation became the most powerful force in the realm. The Model didn't change. The Harness changed. The entire system's output was transformed. This closely parallels the story in the video: "same model, same prompt, success rate from 70% to 95%."
Competition in the AI era is shifting from "who has the smartest model" to "who has the most stable system." Here are three business directions born from Harness thinking:
Business logic: Models are commoditizing through open-sourcing and homogenization; moats are getting shallower. But the Harness layer (orchestration logic, validation rules, recovery mechanisms, state management) is highly scenario-specific and hard to replicate.
How to do it: For specific verticals (legal contract review, medical report generation, financial risk control reports), don't sell model API access -- sell "a complete workflow engine guaranteeing 95%+ success rate." Customers don't need to care what model runs underneath; they're buying reliable delivery.
Moat: Every industry's Harness requires deep domain knowledge to design validation rules, error recovery paths, and quality standards -- things that can't be replaced by simply swapping models.
Business logic: The fifth layer "Evaluation & Observability" -- the most overlooked in the video -- happens to be the most commercially valuable. Because every enterprise customer's core anxiety is the same question: "How do I know if the AI got it right?"
How to do it: Provide independent "AI auditing" services -- don't generate, only verify. Like Anthropic's Evaluator role, build an independent evaluation layer outside the customer's Agent system: automated testing, log analysis, error attribution, compliance checks.
Pricing model: Charge per verification or by "guaranteed accuracy rate." In highly regulated industries like finance, healthcare, and law, willingness to pay for "reliability" is extremely high.
Business logic: OpenAI's practice reveals a key shift -- an engineer's value is no longer in writing code but in "designing environments where Agents can succeed." This means the market needs an entirely new type of professional service.
How to do it: Like the real case at the beginning of the video -- don't help clients swap models or rewrite prompts. Instead, diagnose their Harness: Is task decomposition reasonable? Are there blind spots in state management? Is the validation mechanism complete? Do failure recovery paths actually work?
Market timing: Right now, countless teams are stuck in the "we got the best model but our Agent is still unstable" trap -- exactly the scenario described at the opening of this case. People who can diagnose and fix Harness problems will be the most sought-after experts of the AI deployment era.
Three stages, three questions:
Prompt Engineering -> How do you explain the task clearly? (Engineering the expression)
Context Engineering -> How do you provide the right information? (Engineering the input environment)
Harness Engineering -> How do you keep the model performing correctly during real execution? (Engineering the entire runtime system)
Harness doesn't replace Prompt and Context -- it encompasses both within a larger system boundary.
When your Agent performs inconsistently, the problem is almost never "the model isn't trying hard enough." It's missing some structural capability. Find that gap, add that layer of Harness -- that's the most valuable work in AI engineering right now.