Jim Fan's Physical AGI Blueprint
NVIDIA's Head of Robotics Research Jim Fan proposed "The Great Parallel" at Sequoia AI Ascent 2026: robotics will fully replicate the four-stage success path of large language models — pre-training, fine-tuning, reinforcement learning, and automated research. He introduced three key technical breakthroughs: Dream Zero (a World Action Model that lets robots "dream" the future before acting), EgoScale (using human first-person video instead of teleoperation, discovering a Neural Scaling Law for robot dexterity), and Dream Dojo (a neural simulator that uses GPUs instead of real robots for reinforcement learning). He predicts robots will complete the final unlock of the tech tree by 2040, and the "Physical Turing Test" — where you can't tell if a human or robot is performing a task — is just 2-3 years away. This article deconstructs the talk's technical architecture, data strategy, business logic, and implications for Taiwan from first principles.
Sci-fi movies always give us the same picture: a single humanoid robot, like the Terminator, that can do everything. But Jim Fan says this is entirely the wrong direction.
Looking back at LLMs' path to success, he identified four phase-transition leaps, each separated by roughly six years:
Pre-training: next-token prediction teaches the model the "shape" of language (grammar, logic, how code unfolds).
Fine-tuning: aligning the model to "useful work," converging from vast possibilities to the outputs humans need.
Reinforcement learning: surpassing imitation learning, so the model begins to "think" rather than just recite.
Automated research: accelerating the entire loop beyond human capability, as AI begins doing AI research itself.
Jim Fan's core insight: these four stages can be directly mapped to robotics. He calls it "The Great Parallel."
The Great Parallel: If LLMs learned language by predicting "the next word," then robots can learn actions by predicting "the next physical world state." The underlying mathematical structure is identical — both are sequence prediction problems. The only difference: LLMs predict discrete tokens; robots predict continuous pixels and joint angles.
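To make the parallel concrete, here is a minimal sketch of the two training signals side by side. It assumes cross-entropy for the discrete case and mean squared error for the continuous case; the talk does not specify the objective used in practice, and real world models may use something quite different (diffusion-style losses, for example). The function names and numbers are illustrative only.

```python
import math

def next_token_loss(probs, target_index):
    """Cross-entropy for one discrete prediction: -log p(correct token)."""
    return -math.log(probs[target_index])

def next_state_loss(predicted, actual):
    """Mean squared error for one continuous prediction (e.g. joint angles)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Language: the model assigns probabilities over a toy 3-word vocabulary.
token_loss = next_token_loss([0.1, 0.7, 0.2], target_index=1)

# Robotics: the model predicts the next joint angles (radians) directly.
state_loss = next_state_loss([0.50, 1.20], [0.40, 1.00])

print(f"token loss: {token_loss:.4f}, state loss: {state_loss:.4f}")
```

In both cases the model is scored on how well it predicted the next element of a sequence; only the output space and the distance measure change.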
| LLM Path | Robotics Parallel Path | Core Technology |
|---|---|---|
| Pre-training (predict next word) | Pre-training (predict next physical state) | World Model / Cosmos |
| Supervised fine-tuning (align to useful output) | Action fine-tuning (align to real robots) | GR00T / Dream Zero |
| RL reasoning (surpass imitation) | RL training in simulation | Newton / Dream Dojo |
| Automated research | Physical automated research | Robots design and build the next generation of themselves |
"So as any self-respecting scientist would do, I copy homework and I give it a new name. I call it the Great Parallel."
This is not a metaphor. It's an actionable engineering roadmap. Jim Fan isn't saying "robots will one day be like ChatGPT" — he's saying "we're already walking the same path, and we know where every turn is."
Why not one super robot? Because LLMs' success was never about a single super model breaking through alone. It came from scaled parallel training: billions of parameters, trillions of tokens, thousands of GPUs. Robotics is the same: the future isn't one omnipotent robot, but millions of robots learning simultaneously in simulated environments, then deploying learned capabilities to the real world. Quantity defeats quality. That's the true meaning of "The Great Parallel."
To make The Great Parallel work, three core components are needed. Jim Fan's team happens to have built all of them.
GR00T (Generalist Robot 00 Technology) is NVIDIA's foundation model built for humanoid robots. Over the past three years, the robotics field has been dominated by VLAs (Vision-Language-Action models) — essentially a language model with an action output head bolted on top.
Jim Fan zeroed in on the problem:
"These models are really LVAs, because the most amount of parameters are dedicated to language. Language is first-class citizen, followed by vision and action. By design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs."
He gave a classic example: the original VLA paper demonstrated "move the Coke can next to Taylor Swift's photo." Yes, the robot recognized Taylor Swift, but that's a "noun capability," not a "verb capability." What you need is a robot that understands gravity, friction, and the deformation of flexible objects, not celebrity recognition.
Replacing VLAs is an entirely new architecture: the World Action Model (WAM).
Dream Zero is WAM's first implementation. Its core capability is "dreaming" — simulating the next few seconds of a scene in its mind before executing any action, then deciding what to do based on the simulation results.
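The "dream first, then act" loop can be sketched as simple model-based planning: imagine the outcome of each candidate action, then pick the one whose imagined future looks best. This is not Dream Zero's architecture, which the talk does not detail at this level. The `world_model` below is a hypothetical hand-written 1-D function standing in for a learned model, and all names are assumptions for illustration.

```python
def world_model(position, action):
    """Hypothetical 1-D dynamics: the action is a velocity applied for one step."""
    return position + action

def dream(position, action, horizon):
    """Imagine the state after repeating `action` for `horizon` steps."""
    for _ in range(horizon):
        position = world_model(position, action)
    return position

def choose_action(position, goal, candidates, horizon=3):
    """Pick the candidate whose imagined outcome lands closest to the goal."""
    return min(candidates, key=lambda a: abs(dream(position, a, horizon) - goal))

# Imagine five candidate actions from position 0.0 and keep the best one.
best = choose_action(position=0.0, goal=3.0, candidates=[-1.0, 0.0, 0.5, 1.0, 2.0])
print(best)
```

The design point is that no action is executed until its consequences have been simulated; a better world model directly yields better decisions.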
"A moment of silence for our dear friend VLAs. They've served us well. Rest in peace. Long live World Action Models."
Where does Dream Zero "dream"? It needs a world model to provide the raw material for dreams.
Compute = Environment = Data. In traditional robot training, you need real robots (hardware) in real environments (scenes) collecting real data (teleoperation). All three face physical bottlenecks. Dream Dojo's breakthrough: using GPU compute to directly generate training environments and data. Buying more GPUs equals having more robots, more environments, more data. This is why Jensen says "the more you buy, the more you save" — in robotics, this statement becomes literally true for the first time.
The biggest pain point in robotics has always been data. Jim Fan used a single chart to clearly illustrate the evolution of data strategies:
| Method | Ceiling | Problem |
|---|---|---|
| Teleoperation | 24 hrs/robot/day (realistically ~3 hrs) | Expensive, slow, robots frequently "throw tantrums"; NVIDIA Chief Scientist Bill Dally personally operated the controls, making it possibly "the most expensive teleoperation trajectory in history" |
| Data Wearable Devices (UMI/DexUMI) | Hundreds of thousands of hours | Strap the robot hand directly onto a human hand for data collection, eliminating the robot body entirely; spawned two unicorns |
| Human Egocentric Video | Tens of millions of hours | Like Tesla FSD, automatically collecting in the background — human daily activities themselves become training data |
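A quick back-of-the-envelope calculation shows why teleoperation cannot reach the scales in the table. The sketch below uses the table's ~3 realistic hours per robot per day, and assumes 100,000 and 10,000,000 hours as lower bounds for "hundreds of thousands" and "tens of millions"; those two figures are illustrative assumptions, not numbers from the talk.

```python
def robot_days_needed(target_hours, hours_per_robot_day):
    """Robot-days of teleoperation required to collect `target_hours` of data."""
    return target_hours / hours_per_robot_day

REALISTIC_HOURS = 3   # from the table: ~3 usable hours per robot per day

# Assumed lower bounds for "hundreds of thousands" and "tens of millions" of hours.
WEARABLE_SCALE = 100_000
EGO_VIDEO_SCALE = 10_000_000

for label, hours in [("wearables", WEARABLE_SCALE), ("egocentric video", EGO_VIDEO_SCALE)]:
    days = robot_days_needed(hours, REALISTIC_HOURS)
    print(f"{label} scale: {days:,.0f} robot-days at {REALISTIC_HOURS} h/day")
```

Under these assumptions, matching the egocentric-video scale by teleoperation alone would take over three million robot-days, which is roughly nine thousand robot-years.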
Jim Fan's EgoScale system is remarkable:
The result: using 99.9% everyday human video + 0.1% teleoperation, they trained a high-dexterity robot policy capable of sorting cards, manipulating syringes, and folding clothes. This is the key to compressing training costs by over 10x.
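To get a feel for how lopsided that mix is, the sketch below splits a training budget using the 99.9% / 0.1% ratio quoted above. The 10,000-hour total is an illustrative assumption, not a figure from the talk, and `split_hours` is a hypothetical helper.

```python
def split_hours(total_hours, teleop_fraction=0.001):
    """Split a training budget into (human_video_hours, teleop_hours)."""
    teleop_hours = total_hours * teleop_fraction
    return total_hours - teleop_hours, teleop_hours

# 10,000 total hours is an illustrative figure, not a number from the talk.
video_hours, teleop_hours = split_hours(10_000)
print(f"human video: {video_hours:,.0f} h, teleoperation: {teleop_hours:,.0f} h")
```

At that scale, only about ten hours would need to come from expensive teleoperation; everything else is ordinary human video.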
The most stunning discovery from the EgoScale paper:
Robot dexterity exhibits a Neural Scaling Law: validation loss falls as a clean log-linear function of pre-training hours. This comes exactly six years after the original Neural Scaling Law for language models. The implication: as long as you keep adding pre-training hours of human video, robot dexterity should improve predictably and continuously, with each tenfold increase in data buying a roughly constant drop in loss. Once the data flywheel starts spinning, progress becomes a question of how fast data can be collected.
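A log-linear relationship means the loss is a straight line when plotted against the logarithm of training hours. The sketch below fits such a line by ordinary least squares and extrapolates one decade further. The data points are synthetic and made up for illustration; they are not EgoScale results, and the base-10 form `loss = a + b * log10(hours)` is an assumed parameterization.

```python
import math

def fit_log_linear(hours, losses):
    """Least-squares fit of loss = a + b * log10(hours); returns (a, b)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = mean_y - b * mean_x
    return a, b

# Synthetic, made-up points (NOT EgoScale results): loss drops 0.1 per 10x hours.
hours = [100, 1_000, 10_000]
losses = [0.80, 0.70, 0.60]

a, b = fit_log_linear(hours, losses)
predicted = a + b * math.log10(100_000)  # extrapolate one more decade of data
print(f"a={a:.2f}, b={b:.2f}, predicted loss at 100,000 h: {predicted:.2f}")
```

The practical value of such a fit is forecasting: if the line holds, the benefit of collecting ten times more video can be estimated before the data is collected.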
Jim Fan also demonstrated an elegantly simple pipeline:
The significance: the iPhone essentially becomes a pocket world scanner. Anyone can scan their work environment and have a robot learn how to work in that environment through simulation.
Jim Fan used the Civilization tech tree to describe the end game for robotics. He said his research is like unlocking game achievements. Three achievements remain, and then he can retire.
Across a broad range of activities, you cannot tell whether a human or robot is performing the task. The key metric is "unit energy input vs. unit labor output" — it doesn't need to outperform a drunk person, but must reach the efficiency level of a normal human.
An entire robot fleet can be configured like software, through APIs and command lines. Jim Fan joked that "one day it will be orchestrated by Opus 9.0." This will enable "dark factories" — input a product design as a Markdown file, output a fully assembled product, entirely unattended; and automated wet labs to accelerate scientific discovery in chemistry, biology, and pharmaceuticals.
Robots begin designing, improving, and building the next generation of themselves — far beyond human capability. This is the end game.
"Our generation was born too late to explore the earth, and too early to explore the stars. But we are born just in time to solve robotics."
The essential definition of Physical AGI: A system capable of learning "any" physical task. Not an industrial robotic arm optimized for specific tasks, but a general-purpose physical intelligence that can learn new tasks through language instructions and a few demonstrations. This is perfectly symmetrical with the LLM AGI definition — LLM AGI means "handling any cognitive task," Physical AGI means "executing any physical task." Combined, they form complete AGI.
Jim Fan demonstrated a seemingly mundane scene in his talk: a robot using 22-DoF bimanual hands to fold different types of clothing. And it only needed a single demonstration to learn different folding techniques.
Why does this matter? Because folding laundry is one of the "holy grail" problems in robotics.
If a robot can fold laundry, it can also:
"Folding laundry" is not the destination — it's a proof of capability. It represents robots crossing the chasm from "rigid manipulation" to "deformable manipulation." Once deformable manipulation is unlocked, 90% of physical tasks in daily human life are within range. The "one-shot demonstration" learning shown in Jim Fan's talk is even more critical — it means deployment cost approaches zero. You don't need a programmer; you just need to "show it once."
Jim Fan's talk never directly mentioned Taiwan, but every argument points straight at Taiwan's core competitive advantages.
Jim Fan's timeline delivers a clear message for Taiwan:
Every major automation revolution follows the same pattern:
Extremely inefficient, only usable in coal mines for pumping water (because fuel was free on-site). No one believed it could replace horses.
3x efficiency improvement, began entering factories. Still expensive — only large enterprises could afford it.
Costs kept falling; trains, ships, and factories adopted it universally. 99% of physical labor was eventually performed by machines.
The steam engine didn't become "smarter" — it became "cheaper." The steep decline of the cost curve is what triggers the revolution.
Training cost: millions of dollars per run. Inference was prohibitively expensive. Only research labs could afford it.
Cost per conversation dropped to a few cents. For the first time, ordinary people could use AI directly.
Inference costs have dropped to less than 1/1000th of 2020 levels. AI has become infrastructure, not a luxury.
Jim Fan's EgoScale is replicating this curve in robotics:
The steam engine took 57 years from Newcomen to Watt. LLMs took 10 years from AlexNet to ChatGPT. Robotics took less than 3 years from teleoperation to EgoScale. Each automation revolution arrives faster than the last, because each one stands on the shoulders of the previous one: robot training directly uses LLM architectures and methodologies, and LLMs in turn used deep learning infrastructure. Jim Fan's "Great Parallel" isn't just a metaphor; it's an engineering prediction built on the historical law of acceleration.
Jim Fan's talk reveals one thing most clearly: NVIDIA is becoming the "shovel seller" of the robot era. They don't make robot bodies — they make:
"Compute = Environment = Data" means: every GPU used to train robots is NVIDIA revenue. When robot companies worldwide race to train models, NVIDIA isn't selling robots — they're selling the "water and electricity" for training robots.
EgoScale's lesson: the competitive advantage of future robot companies won't be in hardware, but in the speed of their data flywheel.
Jim Fan's "Physical API" world means:
The takeaway for investors: don't just invest in robot hardware companies — invest even more in robot AI software and platform companies.
Jim Fan specifically mentioned the UMI (Universal Manipulation Interface) paper — a beautifully simple idea of "strap the robot hand directly onto a human hand" — which spawned two unicorns. This echoes a timeless entrepreneurial truth: the most valuable innovations are often the simplest. Not a more sophisticated teleoperation system, but "just skip the teleoperation entirely."
| Industry | Robot Impact | Timeline |
|---|---|---|
| Warehousing & Logistics | Fully automated picking, sorting, and packaging | 1-2 years |
| Electronics Assembly | Precision component assembly, GPU production lines | 2-3 years |
| Home Care | Elderly care, household automation | 3-5 years |
| Agriculture | Harvesting, grading, packaging | 3-5 years |
| Scientific Research | Automated wet labs, drug synthesis | 5-10 years |
The roadmap is clear. The Great Parallel is not a hypothesis — it's something already happening. Every step LLMs took — pre-training, fine-tuning, RL, automated research — robotics will follow. The only variable is time.
The data bottleneck is being broken. From teleoperation to data wearables to human video, each generation increases data scale by 100-1000x. The Neural Scaling Law result suggests that as long as the data keeps coming, robots will keep getting better.
End game by 2040, but the tipping point is 1-2 years away. The Physical Turing Test may be achieved within 2-3 years. The "ChatGPT moment" for robotics — the first time an ordinary person is stunned by "I had no idea robots could do that" — may be just 1-2 years out.
Three articles, three perspectives, one conclusion:
| Article | Speaker | Core Argument | Action Window |
|---|---|---|---|
| Part 1: Overview | Sequoia Partners | AI is a computing revolution, AGI has arrived, $10T services market | 18 months |
| Part 2: Software 3.0 | Andrej Karpathy | LLM as computer, verifiability determines automation speed, understanding can't be outsourced | 12 months |
| Part 3: The End Game for Robotics (this article) | Jim Fan | The Great Parallel, Physical AGI blueprint, training costs compressed 10x | 1-3 years |
AGI is not the future — it's now. Digital AGI is rewriting software (Karpathy), Physical AGI is rewriting manufacturing (Jim Fan), and Sequoia's partners are already placing their bets.
What does this mean for you? You don't need to understand Dream Zero's architecture or the math behind Neural Scaling Laws. What you need to understand is: everything you do today — writing code, managing a factory, caring for the elderly, organizing a warehouse — has an AI/robot version being trained right now. The question isn't "will it happen," but "what is your role in that version."
18 months. That's the window Sequoia has given. This isn't a scare tactic — it's an invitation to start thinking and acting now.
"If you believe in robotics, robotics will believe in you."
Series Theme: AGI Is Not the Future — It's Now, and You Have Only 18 Months