Full Version | Video: AI Engineering Deep Dive | 2026-04-04
Analysis model: Gemma 4 Local Ollama | Voice transcription: mlx-whisper
The central challenge of deploying AI is shifting from "making the model appear smarter" to "making the model work reliably in the real world."
What determines whether an AI application can operate stably in the real world and handle complex multi-step tasks is not the model's IQ, but the reliable, controllable, and traceable "runtime environment" and "control system" built around it.
Over the past two years, AI engineering has undergone three distinct shifts in focus: prompt engineering, context engineering, and harness engineering. The three do not replace one another; they nest, each with a wider boundary than the one before.
The essence of prompt engineering isn't commanding the model -- it's shaping a local probability space. Give it a role, and it answers along that role; give it examples, and it completes along that pattern.
Good at: Defining tasks, constraining output, activating the model's existing capabilities.
Not good at: Compensating for missing knowledge, managing dynamic information, handling state across long task chains.
The most important capability at this stage isn't system design -- it's language design.
When an Agent needs to operate in the real world -- multi-turn conversations, tool calls, passing intermediate results between steps -- the challenge is no longer "is one answer correct" but "can the entire chain run to completion."
Context isn't just a few background paragraphs -- it represents the total sum of all information influencing the model's current decision: user input, conversation history, retrieval results, tool returns, task state, system rules, safety constraints, and more.
Key approach: Context optimization isn't about "giving more" but about providing on demand, in layers, at the right moment. (Like Agent Skills: provide only minimal metadata upfront, dynamically load detailed SOPs when needed.)
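The "minimal metadata upfront, detailed SOP on demand" idea can be sketched roughly as follows. This is an illustrative sketch, not the actual Agent Skills implementation; `Skill`, `SkillRegistry`, and the sample skills are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    name: str
    summary: str                  # short metadata, always placed in context
    load_body: Callable[[], str]  # detailed SOP, fetched only on demand


class SkillRegistry:
    """Hypothetical registry: exposes an index upfront, bodies lazily."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}
        self.loaded: List[str] = []  # records which bodies were actually fetched

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def index(self) -> str:
        """Minimal metadata shown to the model at the start."""
        return "\n".join(f"- {s.name}: {s.summary}" for s in self._skills.values())

    def load(self, name: str) -> str:
        """Full SOP, pulled into context only when the model asks for it."""
        self.loaded.append(name)
        return self._skills[name].load_body()


registry = SkillRegistry()
registry.register(Skill("refund", "Handle refund requests",
                        lambda: "Step 1: verify order. Step 2: check policy."))
registry.register(Skill("invoice", "Generate invoices",
                        lambda: "Step 1: collect line items."))

upfront_context = registry.index()  # two short lines, no SOP text
detail = registry.load("refund")    # SOP enters context only now
```

The index costs a line per skill no matter how long the SOPs grow, which is the point of supplying context in layers.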
"Harness" originally means reins, horse tack, a restraint device. Even when information is perfectly supplied, the model may not execute stably -- it plans well but drifts during execution, misinterprets tool results, or slowly veers off course over long chains with nobody noticing.
Prompts optimize intent expression, context optimizes information supply, and Harness addresses: when the model starts taking continuous actions, who monitors it, constrains it, and corrects it?
Analogy: Sending a new hire to visit a client -- Prompt is explaining the task clearly, Context is preparing all the materials, Harness is sending them with a checklist, requiring real-time check-ins at key milestones, correcting deviations immediately, and validating results against standards.
1. Goal & Role Definition
The model must know who it is, what the task is, and what success looks like. The Harness's first job is to keep the model thinking within the correct information boundaries.
2. Tool System
It's not enough to simply plug in tools. Three problems must be solved: which tools to provide (too few limits capability, too many cause confusion), when to invoke them (don't query when unnecessary, don't hallucinate when a query is needed), and how to refine tool results for feedback (can't dump raw results back into context).
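For the third problem, refining tool results before feeding them back, one possible shape is a small filter that keeps only the needed fields and rows and tells the model how much was dropped. The function name, field names, and sample data below are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def refine_tool_result(raw: List[Dict[str, Any]],
                       keep_fields: List[str],
                       max_items: int = 3) -> str:
    """Keep only the fields and rows the model needs, and note what was dropped."""
    trimmed = [{k: row[k] for k in keep_fields if k in row}
               for row in raw[:max_items]]
    omitted = max(0, len(raw) - max_items)
    return json.dumps({"items": trimmed, "omitted": omitted}, sort_keys=True)


# Hypothetical raw search results carrying a bulky field the model does not need.
raw_rows = [{"id": i, "title": f"doc {i}", "html": "<div>...</div>" * 50}
            for i in range(10)]
summary = refine_tool_result(raw_rows, keep_fields=["id", "title"])
```

Reporting the `omitted` count lets the model decide whether to ask for more rather than assume it has seen everything.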
3. Execution Orchestration
Addresses "what the model should do next." Establish a clear execution track: understand the goal -> assess if information is sufficient -> supplement data -> execute -> check output -> retry if unsatisfactory. This closely resembles how humans work; the difference is that humans rely on experience, while Agents rely on the Harness.
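The execution track described above can be sketched as a bounded loop. This is a minimal sketch under stated assumptions: the four callables stand in for model and tool calls, and all names are hypothetical.

```python
from typing import Callable, Optional


def run_task(goal: str,
             has_enough_info: Callable[[dict], bool],
             gather: Callable[[dict], None],
             execute: Callable[[dict], str],
             check: Callable[[str], bool],
             max_attempts: int = 3) -> Optional[str]:
    """Understand goal -> assess info -> supplement -> execute -> check -> retry."""
    state = {"goal": goal, "facts": [], "attempts": 0}
    while state["attempts"] < max_attempts:
        state["attempts"] += 1
        if not has_enough_info(state):
            gather(state)           # supplement data before acting
        output = execute(state)
        if check(output):           # accept only output that passes the check
            return output
    return None                     # give up explicitly instead of looping forever


# Stub behaviour: the task needs two facts, and each gather call adds one.
result = run_task(
    goal="answer the question",
    has_enough_info=lambda s: len(s["facts"]) >= 2,
    gather=lambda s: s["facts"].append(f"fact {len(s['facts']) + 1}"),
    execute=lambda s: f"answer using {len(s['facts'])} facts",
    check=lambda out: "2 facts" in out,
)
```

The attempt cap matters as much as the retry: without it, an unsatisfiable check would keep the agent running indefinitely.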
4. Memory & State
An Agent without state is amnesiac every turn. At minimum, three categories must be distinguished: current task state, in-session intermediate results, and long-term memory with user preferences. Mix them together and the system grows increasingly chaotic.
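One way to keep the three categories apart is to give each its own type and its own lifetime. The class and field names below are assumptions chosen for the sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TaskState:
    goal: str
    current_step: int = 0


@dataclass
class SessionScratch:
    intermediate: Dict[str, str] = field(default_factory=dict)


@dataclass
class LongTermMemory:
    preferences: Dict[str, str] = field(default_factory=dict)


@dataclass
class AgentMemory:
    task: TaskState
    session: SessionScratch = field(default_factory=SessionScratch)
    long_term: LongTermMemory = field(default_factory=LongTermMemory)

    def finish_task(self, next_goal: str) -> None:
        """Reset task state only; session results and preferences survive."""
        self.task = TaskState(goal=next_goal)

    def end_session(self) -> None:
        """Drop session scratch data; long-term memory survives."""
        self.session = SessionScratch()


memory = AgentMemory(task=TaskState(goal="draft report"))
memory.task.current_step = 2
memory.session.intermediate["outline"] = "intro, findings, risks"
memory.long_term.preferences["tone"] = "concise"

memory.finish_task("review report")  # task state resets, the rest stays
kept_outline = memory.session.intermediate.get("outline")
memory.end_session()                 # only long-term memory remains
```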
5. Evaluation & Observability
The most easily overlooked layer. Many systems don't fail to generate output -- they just can't tell if their output is any good. This includes: output validation, environment verification, automated testing, logging and metrics, and error attribution.
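A small illustration of output validation paired with a structured log entry, assuming the agent is expected to return a JSON object. The required keys and the `record` helper are hypothetical.

```python
import json
import logging
from typing import List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("harness")


def validate_output(raw: str, required_keys: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    return [f"missing key: {k}" for k in required_keys if k not in data]


def record(step: str, raw: str, problems: List[str]) -> dict:
    """Emit one structured log line per step so failures can be attributed later."""
    entry = {"step": step, "ok": not problems, "problems": problems, "chars": len(raw)}
    log.info(json.dumps(entry))
    return entry


raw = '{"answer": "42"}'
problems = validate_output(raw, ["answer", "sources"])
entry = record("draft", raw, problems)
```

Because each entry names the step and the specific problem, a bad final answer can be traced back to the step that first went wrong.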
6. Constraints, Validation & Recovery
The layer that truly determines whether a system can go to production. In real environments, failure is the norm (inaccurate search results, API timeouts, messy document formats). Must include: constraints (what can and can't be done), validation (how to check before and after output), and recovery (how to roll back to a stable state after failure).
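Recovery can be as simple as remembering the last state that passed validation and restoring it when a step fails. The sketch below assumes the state is a plain dict that can be deep-copied; `Checkpointed` and the sample steps are hypothetical.

```python
import copy
from typing import Callable


class Checkpointed:
    """Keeps the last validated state and rolls back to it on any failure."""

    def __init__(self, state: dict) -> None:
        self.state = state
        self._stable = copy.deepcopy(state)

    def attempt(self,
                step: Callable[[dict], None],
                validate: Callable[[dict], bool]) -> bool:
        try:
            step(self.state)
            if not validate(self.state):
                raise ValueError("validation failed")
        except Exception:
            self.state = copy.deepcopy(self._stable)  # recover to stable state
            return False
        self._stable = copy.deepcopy(self.state)      # promote to new checkpoint
        return True


def withdraw_30(s: dict) -> None:
    s["balance"] -= 30


def flaky_api(s: dict) -> None:
    s["balance"] -= 500                  # partial mutation before the failure
    raise TimeoutError("API timeout")


def overdraw(s: dict) -> None:
    s["balance"] -= 500                  # succeeds but violates the constraint


def non_negative(s: dict) -> bool:
    return s["balance"] >= 0


account = Checkpointed({"balance": 100})
ok_first = account.attempt(withdraw_30, non_negative)
ok_timeout = account.attempt(flaky_api, non_negative)
ok_overdraw = account.attempt(overdraw, non_negative)
```

Both failure modes, an exception mid-step and a step that completes but breaks a constraint, end in the same known-good state.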
Problem 1: Context fatigue. Over time, the context fills up. The model starts dropping details and key points, and even seems to "know it's running out of room and rushes to wrap up." The traditional approach is to compress historical context, but Anthropic found that compression only makes things shorter -- the sense of burden doesn't disappear.
Solution: Context Refresh -- Instead of compressing within the same context, spin up an entirely new Agent and hand off the work. Like dealing with a memory leak in engineering: don't keep clearing the cache -- restart the process and restore state.
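A rough sketch of the hand-off pattern: the old session exports a compact state record, and a new session starts with an empty transcript. This illustrates the idea only and is not Anthropic's implementation; the transcript list is a stand-in for the context window, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Handoff:
    goal: str
    completed: List[str]
    remaining: List[str]


@dataclass
class AgentSession:
    goal: str
    remaining: List[str]
    completed: List[str] = field(default_factory=list)
    transcript: List[str] = field(default_factory=list)  # stand-in for context

    def work(self, budget: int) -> None:
        """Do steps until the transcript reaches the (hypothetical) budget."""
        while self.remaining and len(self.transcript) < budget:
            step = self.remaining.pop(0)
            self.transcript.append(f"did {step}")
            self.completed.append(step)

    def hand_off(self) -> Handoff:
        """Export state only; the accumulated transcript is left behind."""
        return Handoff(self.goal, list(self.completed), list(self.remaining))


def resume(h: Handoff) -> AgentSession:
    """Start a fresh session: empty transcript, state restored from the handoff."""
    return AgentSession(goal=h.goal, remaining=list(h.remaining),
                        completed=list(h.completed))


first = AgentSession("ship feature", ["a", "b", "c", "d", "e"])
first.work(budget=3)               # context fills up after three steps
second = resume(first.hand_off())  # restart the process, restore state
fresh_transcript_len = len(second.transcript)
second.work(budget=3)
```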
Problem 2: Self-scoring bias. When the model does the work and grades itself, it tends toward optimism.
Solution: Separate production from validation -- The Planner breaks down requirements, the Generator implements, and the Evaluator tests like QA (not just reading code, but actually operating pages and checking interaction results). As long as the evaluator is sufficiently independent, the system forms an effective loop.
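The separation of roles might look like the loop below, where the evaluator runs the artifact instead of reading it and its feedback flows back to the generator. The `plan`, `generate`, and `evaluate` functions are stubs standing in for model calls, and the deliberately buggy first draft is an assumption of the demo.

```python
from typing import Callable, List, Tuple

Artifact = Callable[[int], int]


def plan(requirement: str) -> List[str]:
    """Planner: break the requirement into tasks."""
    return [t.strip() for t in requirement.split(",")]


def generate(task: str, feedback: str) -> Artifact:
    """Generator stub: the first draft of 'double' is buggy, fixed after feedback."""
    if task == "double":
        return (lambda x: x * 2) if feedback else (lambda x: x + 2)
    return lambda x: x * x  # 'square'


def evaluate(task: str, fn: Artifact) -> Tuple[bool, str]:
    """Evaluator: actually executes the artifact, like QA operating the page."""
    arg, expected = {"double": (3, 6), "square": (3, 9)}[task]
    got = fn(arg)
    if got == expected:
        return True, ""
    return False, f"{task}({arg}) returned {got}, expected {expected}"


def build(requirement: str, max_rounds: int = 3) -> List[Tuple[str, bool, int]]:
    report = []
    for task in plan(requirement):
        feedback, passed, rounds = "", False, 0
        while not passed and rounds < max_rounds:
            rounds += 1
            artifact = generate(task, feedback)
            passed, feedback = evaluate(task, artifact)
        report.append((task, passed, rounds))
    return report


report = build("double, square")
```

The generator never grades its own work here: a task passes only when the independent check observes the expected behaviour.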
Core approach: Humans don't need to write code -- they only need to "design the environment." The engineer's job becomes three things:
AGENT.MD lesson: Early on, cramming all specifications into one massive AGENT.MD actually made the Agent more confused (context window is a scarce resource -- stuffing it full equals saying nothing). They switched to a directory-page structure: keep only the core index, break detailed content into sub-files, and drill in only when needed.
Automated governance system: Senior engineers' experience was encoded into system rules (how modules are layered, which layers can't depend on which). Rules don't just flag errors -- they return "how to fix it" to the Agent, forming a continuously running automated governance system.
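A layering rule that returns a fix hint along with the error could be sketched like this. The layer names, their ordering, and the wording of the hint are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layering, top to bottom; lower layers must not import higher ones.
LAYERS = ["ui", "service", "data"]


def check_dependencies(imports: Dict[str, List[str]]) -> List[str]:
    """Return one message per violation, each including how to fix it."""
    rank = {name: i for i, name in enumerate(LAYERS)}
    messages = []
    for module, deps in imports.items():
        for dep in deps:
            if rank[dep] < rank[module]:
                messages.append(
                    f"{module} must not import {dep}: lower layers cannot "
                    f"depend on higher ones. Fix: move the shared code into "
                    f"'{module}' or pass it in from '{dep}' as an argument."
                )
    return messages


messages = check_dependencies({
    "ui": ["service"],       # allowed: higher layer depends on lower
    "service": ["data"],     # allowed
    "data": ["ui"],          # violation: lowest layer imports the top layer
})
```

Returning the remedy in the message means the Agent can act on it directly in its next step instead of guessing at the architecture.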
(The following was analyzed and produced by the Gemma 4 local model)
Zhuge Liang (181-234 AD, the legendary strategist of China's Three Kingdoms era) was a supergenius (the Model), and his strategic objectives (Prompt) were brilliant. However, Shu Han's survival and development ultimately depended not on battlefield miracles but on the complete institutional system Zhuge Liang built: clear resource allocation mechanisms, systematized tax collection processes, and a logistics system ensuring stable supply chains.
When front-line commanders (Agents) lost battles, this mature "system" and "memory" (the stable operation of the bureaucracy, clear chains of command) ensured the government wouldn't collapse due to a single individual's failure. This perfectly maps to the Harness's "state management" and "recovery mechanism."
During China's Warring States period (475-221 BC), various schools of thought offered attractive "theoretical models" (Models) and provided rulers with "concepts" (Prompts). For example, Legalist philosophy's "strict laws and heavy punishments" appeared perfect on paper, but without a systematized execution framework, it easily led to excessive cruelty or lack of sustainability.
What truly allowed these theories to scale into national governance was embedding them into institutional systems: establishing comprehensive bureaucratic selection (Tool System), defining specific permissions and operating procedures for each level of officials in different scenarios (Execution Orchestration), and breaking national governance into standard operating procedures (SOPs). This transformed academic concepts from "idealized descriptions" into "executable governance tools."
(The following was analyzed and produced by the Gemma 4 local model)
Competition in the AI era has already shifted from "who has the best model" to "who has the most stable operating system." All business strategies should center on "how to build a reliable system layer."
Positioning: Don't build "AI chatbots" -- build "fully automated business process engines."
Approach: Break complex business processes (contract review, content production, customer service) into dozens of precise atomic steps. The platform's core is the "orchestration logic" connecting these steps. This way, no matter how the underlying model updates or iterates, as long as the workflow is stable, product value remains consistent -- a powerful moat.
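To illustrate why the orchestration layer outlives any particular model, the sketch below runs the same step list against two interchangeable stub models. The step names, prompts, and `model_v1`/`model_v2` are hypothetical placeholders, not a real contract-review pipeline.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # any text-in, text-out model
Step = Tuple[str, Callable[[Dict[str, str], Model], str]]


def run_workflow(steps: List[Step], model: Model,
                 inputs: Dict[str, str]) -> Dict[str, str]:
    """Run atomic steps in order; each reads prior outputs and writes its own."""
    data = dict(inputs)
    for name, fn in steps:
        data[name] = fn(data, model)
    return data


CONTRACT_REVIEW: List[Step] = [
    ("clauses", lambda d, m: m(f"List the clauses in: {d['contract']}")),
    ("risks", lambda d, m: m(f"Flag risky clauses in: {d['clauses']}")),
    ("summary", lambda d, m: m(f"Summarize the risks: {d['risks']}")),
]


def model_v1(prompt: str) -> str:
    return f"[v1] {prompt[:20]}"


def model_v2(prompt: str) -> str:
    return f"[v2] {prompt[:20]}"


contract = {"contract": "Payment due in 30 days."}
out_v1 = run_workflow(CONTRACT_REVIEW, model_v1, contract)
out_v2 = run_workflow(CONTRACT_REVIEW, model_v2, contract)  # same workflow, new model
```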
Positioning: Sell "reliability," not "generation capability." Be the QA service provider for AI systems.
Approach: For highly regulated industries like finance and healthcare, provide systematic testing environments. The core: automatically capture logs after each AI operation, record execution paths, and score against business constraints. When errors occur, don't just report them -- provide "rollback" functionality, turning failure handling itself into a high-value product.
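A minimal sketch of recording an execution path, scoring it against business constraints, and suggesting a rollback point. The constraint, the scoring formula (share of steps without violations), and the sample trace are assumptions invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Trace:
    steps: List[Dict[str, Any]] = field(default_factory=list)

    def log(self, action: str, **details: Any) -> None:
        self.steps.append({"action": action, **details})


def audit(trace: Trace,
          constraints: Dict[str, Callable[[Dict[str, Any]], bool]]) -> Dict[str, Any]:
    """Score a recorded path and point at the last step before the first violation."""
    violations = [(i, name)
                  for i, step in enumerate(trace.steps)
                  for name, ok in constraints.items()
                  if not ok(step)]
    bad_steps = {i for i, _ in violations}
    first_bad = violations[0][0] if violations else None
    return {
        "path": [s["action"] for s in trace.steps],
        "violations": violations,
        "score": 1 - len(bad_steps) / len(trace.steps) if trace.steps else 1.0,
        # -1 would mean "roll back to before the first step".
        "rollback_to": None if first_bad is None else first_bad - 1,
    }


trace = Trace()
trace.log("lookup_account")
trace.log("transfer", amount=500)
trace.log("transfer", amount=5000)
trace.log("notify")

report = audit(trace, {"max_transfer_1000": lambda s: s.get("amount", 0) <= 1000})
```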
Positioning: Go beyond traditional RAG into "structured knowledge management."
Approach: Automatically clean, refine, and inject enterprise data scattered across documents and legacy systems into the Agent's runtime memory as a knowledge graph. This way the Agent doesn't merely "read" information but can "understand the relationships between pieces of information" -- referencing not just "Report A" but the intersection of Report A with current market conditions, historical policies, supply chain constraints, and all related nodes.
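The difference between reading one document and following its relationships can be shown with a tiny in-memory graph. This is a toy sketch; the node names and relations are made up, and a production system would presumably use a real graph store.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class KnowledgeGraph:
    """Toy undirected graph: nodes are strings, edges carry a relation label."""

    def __init__(self) -> None:
        self.edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def link(self, a: str, relation: str, b: str) -> None:
        self.edges[a].add((relation, b))
        self.edges[b].add((relation, a))

    def related(self, start: str, depth: int = 2) -> Set[str]:
        """Breadth-first walk: every node within `depth` hops of `start`."""
        seen = {start}
        frontier = [start]
        for _ in range(depth):
            next_frontier = []
            for node in frontier:
                for _, other in self.edges.get(node, ()):
                    if other not in seen:
                        seen.add(other)
                        next_frontier.append(other)
            frontier = next_frontier
        seen.discard(start)
        return seen


graph = KnowledgeGraph()
graph.link("Report A", "covers", "Q3 market")
graph.link("Report A", "mentions", "Supplier X")
graph.link("Supplier X", "affected_by", "Tariff policy")

direct = graph.related("Report A", depth=1)
wider = graph.related("Report A", depth=2)
```

At depth 2 the Agent reaches the tariff policy through the supplier, a connection that retrieving Report A alone would not surface.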
Prompt Engineering solves "how to explain the task clearly"
Context Engineering solves "how to provide the right information"
Harness Engineering solves "how to keep the model performing correctly during real execution"
Harness doesn't replace the first two -- it encompasses both within a larger system boundary. What truly determines whether AI can be deployed and deliver reliably is the Harness.