Sequoia AI Ascent 2026 Deep Dive 3/3

The End Game for Robotics — Nvidia's Great Parallel Theory

Jim Fan's Physical AGI Blueprint

Independent Research | LittleX Research Lab | 2026-05-02
Series Theme: AGI Is Not the Future — It's Now, and You Have Only 18 Months

In the summer of 2016, a stocky man in a leather jacket walked into the OpenAI office carrying a massive metal slab. Engraved on it: "To Elon and the OpenAI team, for the future of computing and humanity." It was the world's first DGX-1. An intern named Jim Fan rushed over and signed his name on it.

Ten years later, that intern stood on the Sequoia AI Ascent stage, declaring that robotics has entered the "end game."
His argument is stunning: not building one super robot, but having millions of robots learn simultaneously. Just as LLMs learned language from the internet's text, robots will learn every physical action from everyday human videos.

And this time, training costs have been compressed by 10x. The "ChatGPT moment" for robotics may be just 1-2 years away.

Abstract

NVIDIA's Head of Robotics Research Jim Fan proposed "The Great Parallel" at Sequoia AI Ascent 2026: robotics will fully replicate the four-stage success path of large language models — pre-training, fine-tuning, reinforcement learning, and automated research. He introduced three key technical breakthroughs: Dream Zero (a World Action Model that lets robots "dream" the future before acting), EgoScale (using human first-person video instead of teleoperation, discovering a Neural Scaling Law for robot dexterity), and Dream Dojo (a neural simulator that uses GPUs instead of real robots for reinforcement learning). He predicts robots will complete the final unlock of the tech tree by 2040, and the "Physical Turing Test" — where you can't tell if a human or robot is performing a task — is just 2-3 years away. This article deconstructs the talk's technical architecture, data strategy, business logic, and implications for Taiwan from first principles.

Keywords: The Great Parallel, Physical AGI, GR00T, Dream Zero, World Action Model, EgoScale, Neural Scaling Law, Sim-to-Real, Dream Dojo, Cosmos, Newton, Humanoid Robots
Table of Contents
  1. Why Not One Super Robot? (The Great Parallel Theory)
  2. Nvidia's Trifecta: GR00T + Cosmos + Newton
  3. Sim-to-Real: Training Costs Compressed 10x
  4. Physical AGI: Definition and Timeline
  5. When Robots Can Fold Laundry — What Does It Mean?
  6. Opportunities and Threats for Taiwan's Manufacturing
  7. Historical Parallels — From Steam Engines to Robots
  8. Business Insights — Investment Logic of the Robot Economy
  9. Conclusion + Series Guide
  10. References

1. Why Not One Super Robot? (The Great Parallel Theory)

Sci-fi movies always give us the same picture: a single humanoid robot, like the Terminator, that can do everything. But Jim Fan says this is entirely the wrong direction.

Looking back at the path LLMs took to success, he identified four phase-transition leaps, each separated by roughly two years:

2020 — GPT-3 Pre-training

Next-token prediction = learning the "shape" of language — grammar, logic, how code unfolds

2022 — InstructGPT Supervised Fine-tuning

Aligning the model to "useful work" — converging from vast possibilities to human-needed outputs

2024 — Reasoning Models (o1)

Using reinforcement learning to surpass imitation learning — the model begins to "think," not just recite

2026 — Automated Research

Accelerating the entire loop beyond human capability — AI begins doing AI research itself

Jim Fan's core insight: these four stages can be directly mapped to robotics. He calls it "The Great Parallel."

First Principle

The Great Parallel: If LLMs learned language by predicting "the next word," then robots can learn actions by predicting "the next physical world state." The underlying mathematical structure is identical — both are sequence prediction problems. The only difference: LLMs predict discrete tokens; robots predict continuous pixels and joint angles.
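
To make the "same structure, different output type" point concrete, here is a minimal sketch of the two training signals side by side. The function names and the toy numbers are illustrative assumptions, and mean squared error is only one plausible choice for the continuous case; the talk does not say which objective NVIDIA's models use.

```python
import math

def next_token_loss(probs, target_index):
    """Cross-entropy for one discrete prediction (the LLM case)."""
    return -math.log(probs[target_index])

def next_state_loss(predicted, actual):
    """Mean squared error for one continuous prediction (the robot case)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# LLM: a distribution over a toy 4-token vocabulary; the true next token is index 2.
token_loss = next_token_loss([0.1, 0.2, 0.6, 0.1], target_index=2)

# Robot: predicted vs. observed joint angles (radians) at the next time step.
state_loss = next_state_loss([0.50, -0.20, 1.10], [0.52, -0.25, 1.00])

print(f"token loss: {token_loss:.4f}, state loss: {state_loss:.4f}")
```

In both cases the model is scored on how well it predicted the next element of a sequence; only the shape of that element changes.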

LLM Path | Robotics Parallel Path | Core Technology
Pre-training (predict next word) | Pre-training (predict next physical state) | World Model / Cosmos
Supervised fine-tuning (align to useful output) | Action fine-tuning (align to real robots) | GR00T / Dream Zero
RL reasoning (surpass imitation) | RL training in simulation | Newton / Dream Dojo
Automated research | Physical automated research | Robots design and build the next generation of themselves

"So as any self-respecting scientist would do, I copy homework and I give it a new name. I call it the Great Parallel."

— Jim Fan, Sequoia AI Ascent 2026

This is not a metaphor. It's an actionable engineering roadmap. Jim Fan isn't saying "robots will one day be like ChatGPT" — he's saying "we're already walking the same path, and we know where every turn is."

Key Insight

Why not one super robot? Because the success of LLMs was never about a single super model breaking through alone. It came from scaled parallel training: billions of parameters, trillions of tokens, thousands of GPUs. Robotics is the same: the future isn't one omnipotent robot, but millions of robots learning simultaneously in simulated environments, then deploying what they learn to the real world. Quantity defeats quality; that's the true meaning of "The Great Parallel."

2. Nvidia's Trifecta: GR00T + Cosmos + Newton

To make The Great Parallel work, three core components are needed. Jim Fan's team happens to have built all of them.

1. GR00T — The Foundation Model for Humanoid Robots

GR00T (Generalist Robot 00 Technology) is NVIDIA's foundation model built for humanoid robots. Over the past three years, the robotics field has been dominated by VLAs (Vision-Language-Action models) — essentially a language model with an action output head bolted on top.

Jim Fan zeroed in on the problem:

"These models are really LVAs, because the most amount of parameters are dedicated to language. Language is first-class citizen, followed by vision and action. By design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs."

— Jim Fan

He gave a classic example: the original VLA paper demonstrated "move the Coke can next to Taylor Swift's photo." Yes, the robot recognized Taylor Swift, but that's a "noun capability," not a "verb capability." What you need is a robot that understands gravity, friction, and the deformation of flexible objects, not celebrity recognition.

2. Dream Zero — World Action Model (WAM)

Replacing VLAs is an entirely new architecture: the World Action Model (WAM).

Dream Zero is WAM's first implementation. Its core capability is "dreaming" — simulating the next few seconds of a scene in its mind before executing any action, then deciding what to do based on the simulation results.

Core Mechanism
Simultaneously decodes "next world state" and "next action" — vision and action are both first-class citizens
Key Breakthrough
Zero-shot generalization — can solve action tasks never seen during training
Validation Method
Visualize the robot's "dreams": if the predicted video is correct, the action is correct; if the video hallucinates, the action fails
Historical Analogy
Like the GPT-2 era — the shape is right but not yet precise enough; scaling will bring a qualitative leap
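
The "dream first, then act" idea can be sketched in a few lines. This is a toy under stated assumptions, not Dream Zero's architecture: `world_model` is a hypothetical hand-written stand-in for a learned video predictor, the world is a single number, and prediction and action selection are kept as separate steps for readability, whereas the talk describes them as decoded jointly.

```python
def world_model(position, action):
    """Hypothetical stand-in for a learned model: predicts the next position."""
    return position + action

def dream(position, action, horizon):
    """Imagine repeating one action for `horizon` steps, without moving."""
    for _ in range(horizon):
        position = world_model(position, action)
    return position

def choose_action(position, goal, candidates, horizon=3):
    """Pick the action whose imagined outcome lands closest to the goal."""
    return min(candidates, key=lambda a: abs(dream(position, a, horizon) - goal))

best = choose_action(position=0.0, goal=3.0, candidates=[-1.0, 0.0, 0.5, 1.0])
print(best)
```

The sketch also shows why visualizing the dream is a useful check: if `world_model` predicted the wrong future, the chosen action would be wrong for the same reason.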

"A moment of silence for our dear friend VLAs. They've served us well. Rest in peace. Long live World Action Models."

— Jim Fan

3. Cosmos + Newton — World Model and Physics Engine

Where does Dream Zero "dream"? It needs a world model to provide the raw material for dreams.

First Principle

Compute = Environment = Data. In traditional robot training, you need real robots (hardware) in real environments (scenes) collecting real data (teleoperation). All three face physical bottlenecks. Dream Dojo's breakthrough: using GPU compute to directly generate training environments and data. Buying more GPUs equals having more robots, more environments, more data. This is why Jensen says "the more you buy, the more you save" — in robotics, this statement becomes literally true for the first time.
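
As a rough illustration of "compute instead of robots," the sketch below improves a policy using nothing but rollouts inside a simulator. Everything here is an assumption for illustration: `simulator` is a hand-written one-dimensional stand-in for a learned neural simulator, and random search stands in for whatever reinforcement learning algorithm a real system would use.

```python
import random

def simulator(position, action):
    """Hypothetical stand-in for a learned neural simulator (1-D toy world)."""
    return position + 0.5 * action

def episode_return(gain, goal=1.0, steps=20):
    """Roll out a proportional policy entirely inside the simulator."""
    position, total = 0.0, 0.0
    for _ in range(steps):
        action = gain * (goal - position)
        position = simulator(position, action)
        total -= abs(goal - position)  # reward: negative distance to goal
    return total

def improve_policy(iterations=200, seed=0):
    """Random-search policy improvement: keep a perturbed gain if it scores better."""
    rng = random.Random(seed)
    gain = 0.1
    best = episode_return(gain)
    for _ in range(iterations):
        candidate = gain + rng.gauss(0, 0.2)
        score = episode_return(candidate)
        if score > best:
            gain, best = candidate, score
    return gain, best

initial_return = episode_return(0.1)
gain, final_return = improve_policy()
print(f"return before: {initial_return:.2f}, after: {final_return:.2f}")
```

Every one of the rollouts is a function call, so the number of training episodes is limited by compute rather than by how many physical robots are available.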

3. Sim-to-Real: Training Costs Compressed 10x

The biggest pain point in robotics has always been data. Jim Fan used a single chart to clearly illustrate the evolution of data strategies:

Three Generations of Data Collection

Method | Ceiling | Problem
Teleoperation | 24 hrs/robot/day (realistically ~3 hrs) | Expensive, slow, robots frequently "throw tantrums"; NVIDIA Chief Scientist Bill Dally personally operated the controls, making it possibly "the most expensive teleoperation trajectory in history"
Data Wearable Devices (UMI/DexOoi) | Hundreds of thousands of hours | Strap the robot hand directly onto a human hand for data collection, eliminating the robot body entirely; spawned two unicorns
Human Egocentric Video | Tens of millions of hours | Like Tesla FSD, automatically collecting in the background; human daily activities themselves become training data
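
A quick back-of-the-envelope calculation shows why the teleoperation ceiling matters. The hours-per-day figures come from the table above and the 21,000-hour target is the EgoScale pre-training volume cited later in this article; the fleet size of 100 robots is a hypothetical number chosen only for illustration.

```python
REALISTIC_HOURS_PER_ROBOT_PER_DAY = 3   # from the table above
CEILING_HOURS_PER_ROBOT_PER_DAY = 24    # from the table above
TARGET_HOURS = 21_000                   # EgoScale's pre-training volume
HYPOTHETICAL_FLEET_SIZE = 100           # assumption, not from the talk

def robot_days_needed(target_hours, hours_per_robot_per_day):
    """Robot-days required to collect the target volume by teleoperation."""
    return target_hours / hours_per_robot_per_day

realistic_days = robot_days_needed(TARGET_HOURS, REALISTIC_HOURS_PER_ROBOT_PER_DAY)
ceiling_days = robot_days_needed(TARGET_HOURS, CEILING_HOURS_PER_ROBOT_PER_DAY)
calendar_days = realistic_days / HYPOTHETICAL_FLEET_SIZE

print(f"robot-days at 3 hrs/day: {realistic_days:.0f}")
print(f"robot-days at 24 hrs/day: {ceiling_days:.0f}")
print(f"calendar days with {HYPOTHETICAL_FLEET_SIZE} robots: {calendar_days:.0f}")
```

At the realistic rate, matching that volume by teleoperation alone would take 7,000 robot-days, each with a human operator attached.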

EgoScale: 99.9% Human Video + 0.1% Teleoperation

Jim Fan's EgoScale system is remarkable:

21,000 hours: Human egocentric video pre-training (zero robot data)
50 hours: High-precision data glove fine-tuning
4 hours: Teleoperation data (less than 0.1%)
22 DoF: End-to-end policy for high-dexterity bimanual robots

The result: using 99.9% everyday human video + 0.1% teleoperation, they trained a high-dexterity robot policy capable of sorting cards, manipulating syringes, and folding clothes. This is the key to compressing training costs by over 10x.
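
The headline split can be checked directly from the three hour counts above. This is plain arithmetic on the article's figures; the dictionary keys are just labels.

```python
hours = {
    "human egocentric video": 21_000,
    "data glove": 50,
    "teleoperation": 4,
}

total = sum(hours.values())
shares = {source: 100 * h / total for source, h in hours.items()}

for source, pct in shares.items():
    print(f"{source}: {pct:.2f}%")
```

Teleoperation comes to about 0.02% of the total, comfortably under the stated 0.1%. The "99.9%" figure is a rounded headline: video alone is about 99.7% of the hours, and video plus data-glove recordings (everything collected from humans rather than robots) is about 99.98%.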

Neural Scaling Law for Robot Dexterity

The most stunning discovery from the EgoScale paper:

Major Discovery

Robot dexterity exhibits a Neural Scaling Law: pre-training hours and validation loss show a clean log-linear relationship. This comes exactly six years after the original Neural Scaling Law for language models. The implication: as long as you keep adding human-video pre-training hours, robot dexterity should improve predictably and continuously. Note what log-linear means, though: each fixed drop in loss requires multiplying the amount of data, so what the data flywheel buys is predictability, not exponential growth in capability.
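
To show what a log-linear relationship looks like in practice, the sketch below fits loss = a + b * log10(hours) by ordinary least squares and extrapolates one decade further. The three data points are made up for illustration and are NOT EgoScale results; the functional form is an assumption based on the article's "log-linear" description.

```python
import math

def fit_log_linear(hours, losses):
    """Least-squares fit of loss = a + b * log10(hours)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b = covariance / variance
    a = mean_y - b * mean_x
    return a, b

# Made-up points for illustration only; these are NOT EgoScale results.
hours = [100, 1_000, 10_000]
losses = [0.9, 0.7, 0.5]

a, b = fit_log_linear(hours, losses)
predicted_at_100k = a + b * math.log10(100_000)

print(f"slope per 10x data: {b:.2f}, predicted loss at 100k hours: {predicted_at_100k:.2f}")
```

With these toy numbers, every tenfold increase in data lowers the loss by the same 0.2. That constancy is what makes the curve useful for planning, and also why each further gain costs ten times more data than the last.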

Real-to-Sim-to-Real: iPhone as a Pocket World Scanner

Jim Fan also demonstrated an elegantly simple pipeline:

  1. Scan a real scene with an iPhone
  2. Extract all objects through a 3D scanning pipeline
  3. Automatically reconstruct them in a physics simulator (all objects are interactive)
  4. Infinitely augment with variations in simulation (he calls them "Digital Cousins")
  5. Transfer the trained policy back to the real robot

The significance: the iPhone essentially becomes a pocket world scanner. Anyone can scan their work environment and have a robot learn how to work in that environment through simulation.
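
Step 4 of the pipeline, generating variations of a scanned scene, can be sketched as simple parameter randomization. This is an assumption-heavy toy: `SceneObject`, its fields, and the jitter ranges are hypothetical, and the "Digital Cousins" in the talk may vary far more than position, size, and friction (for example, swapping in different object geometry).

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SceneObject:
    name: str
    x: float          # position on the table, metres
    y: float
    scale: float      # relative size
    friction: float   # surface friction coefficient

def make_cousin(scene, rng, jitter=0.05):
    """Return a perturbed copy of a scanned scene."""
    return [
        replace(
            obj,
            x=obj.x + rng.uniform(-jitter, jitter),
            y=obj.y + rng.uniform(-jitter, jitter),
            scale=obj.scale * rng.uniform(0.9, 1.1),
            friction=obj.friction * rng.uniform(0.8, 1.2),
        )
        for obj in scene
    ]

# A hypothetical two-object scan standing in for the output of steps 1-3.
scanned = [
    SceneObject("mug", 0.30, 0.10, 1.0, 0.6),
    SceneObject("plate", 0.55, 0.25, 1.0, 0.4),
]

rng = random.Random(0)  # fixed seed so the variations are reproducible
cousins = [make_cousin(scanned, rng) for _ in range(1000)]
print(len(cousins), cousins[0][0])
```

A policy trained across all 1,000 variants has to cope with objects that are never in quite the same place or of quite the same size, which is the property that helps it survive the transfer back to the real scene.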

4. Physical AGI: Definition and Timeline

Jim Fan used the Civilization tech tree to describe the end game for robotics. He said his research is like unlocking game achievements. Three achievements remain, and then he can retire.

Three Milestones

Milestone 1: Physical Turing Test (within 2-3 years)

Across a broad range of activities, you cannot tell whether a human or robot is performing the task. The key metric is "unit energy input vs. unit labor output" — it doesn't need to outperform a drunk person, but must reach the efficiency level of a normal human.

Milestone 2: Physical API

An entire robot fleet can be configured like software, through APIs and command lines. Jim Fan joked that "one day it will be orchestrated by Opus 9.0." This will enable "dark factories" — input a product design as a Markdown file, output a fully assembled product, entirely unattended; and automated wet labs to accelerate scientific discovery in chemistry, biology, and pharmaceuticals.

Milestone 3: Physical Automated Research (by 2040)

Robots begin designing, improving, and building the next generation of themselves — far beyond human capability. This is the end game.

14 years: From AlexNet (2012) to AI Ascent 2026, the digital AI journey
14 years: Jim Fan's estimate from 2026 to the physical AI end game (2040)
95%: Jim Fan's confidence in reaching the end game by 2040
Exponential: Technology doesn't advance linearly; it accelerates exponentially

"Our generation was born too late to explore the earth, and too early to explore the stars. But we are born just in time to solve robotics."

— Jim Fan
First Principle

The essential definition of Physical AGI: A system capable of learning "any" physical task. Not an industrial robotic arm optimized for specific tasks, but a general-purpose physical intelligence that can learn new tasks through language instructions and a few demonstrations. This is perfectly symmetrical with the LLM AGI definition — LLM AGI means "handling any cognitive task," Physical AGI means "executing any physical task." Combined, they form complete AGI.

5. When Robots Can Fold Laundry — What Does It Mean?

Jim Fan demonstrated a seemingly mundane scene in his talk: a robot using 22-DoF bimanual hands to fold different types of clothing. And it only needed a single demonstration to learn different folding techniques.

Why does this matter? Because folding laundry is one of the "holy grail" problems in robotics.

Why Is Folding Laundry So Hard?

If a robot can fold laundry, it can also:

Home Scenarios
  • Tidying rooms, organizing belongings
  • Meal prep and ingredient handling
  • Home care (operating syringes, taking blood pressure)
  • Cleaning, dishwashing, sorting clutter
Industrial Scenarios
  • Assembling precision electronic components
  • Warehouse picking and sorting
  • Quality inspection and packaging
  • GPU assembly (an actual case Jim Fan demonstrated)
Deeper Significance

"Folding laundry" is not the destination — it's a proof of capability. It represents robots crossing the chasm from "rigid manipulation" to "deformable manipulation." Once deformable manipulation is unlocked, 90% of physical tasks in daily human life are within range. The "one-shot demonstration" learning shown in Jim Fan's talk is even more critical — it means deployment cost approaches zero. You don't need a programmer; you just need to "show it once."

6. Opportunities and Threats for Taiwan's Manufacturing

Jim Fan's talk never directly mentioned Taiwan, but every argument points straight at Taiwan's core competitive advantages.

Why Is Taiwan a Critical Node in the Robot Revolution?

TSMC
The world's advanced chip manufacturing center — robot "brains" are produced here
Manufacturing GDP 30%+
Taiwan remains a manufacturing powerhouse — the sector most susceptible to robotic automation
Labor Shortage Crisis
Declining birth rate + aging population = continuously widening labor gap
Complete Supply Chain
From chips to precision machinery to electronics assembly — the entire value chain on one island
Opportunities
  • Chip demand explosion: Dream Dojo-style neural simulators require massive GPU fleets — every robot training farm is a TSMC customer
  • Labor shortage solution: Taiwan's manufacturing labor shortage is exactly what robots are best at solving
  • Precision manufacturing upgrade: Taiwan's precision machinery industry can pivot to become a robot hardware supplier
  • First-mover advantage: If Taiwanese factories are first to adopt Physical AGI, they can maintain manufacturing competitiveness
Threats
  • Manufacturing reshoring: If robots drive labor costs to zero, manufacturing doesn't need to stay in low-cost regions — the US can manufacture domestically
  • Middle-layer disappearance: Taiwan's contract manufacturing model is built on "labor + management" — when robots replace both...
  • China catching up: China is investing heavily in humanoid robots, with a larger market and more application scenarios
  • Technology dependence: Core AI models and training frameworks are controlled by NVIDIA/Google/OpenAI
Action Plan for Taiwan

Jim Fan's timeline delivers a clear message for Taiwan:

  • 18-month window: The Physical Turing Test arrives within 2-3 years, meaning deployment planning must start now
  • From "contract manufacturing" to "intelligent manufacturing": Adopt NVIDIA's Omniverse and Cosmos platforms to build digital twin factories
  • Training data is already in your hands: Production line footage and worker operation videos from Taiwanese factories are exactly the "human egocentric video" that the EgoScale approach needs
  • Precision machinery pivot: Companies like Hiwin and Delta should invest in critical humanoid robot hardware components (actuators, sensors, dexterous hands)

7. Historical Parallels — From Steam Engines to Robots

Every major automation revolution follows the same pattern:

Historical Pattern: From "Too Expensive" to "Too Cheap"
1712 — Newcomen Steam Engine

Extremely inefficient, only usable in coal mines for pumping water (because fuel was free on-site). No one believed it could replace horses.

1769 — Watt's Improved Steam Engine

3x efficiency improvement, began entering factories. Still expensive — only large enterprises could afford it.

1800s — Steam Engine Proliferation

Costs kept falling; trains, ships, and factories adopted it universally. 99% of physical labor was eventually performed by machines.

Tipping Point

The steam engine didn't become "smarter" — it became "cheaper." The steep decline of the cost curve is what triggers the revolution.

Historical Pattern: LLM Cost Collapse
2020 — GPT-3

Training cost: millions of dollars per run. Inference was prohibitively expensive. Only research labs could afford it.

2022 — ChatGPT

Cost per conversation dropped to a few cents. For the first time, ordinary people could use AI directly.

2026 — Today

Inference costs have dropped to less than 1/1000th of 2020 levels. AI has become infrastructure, not a luxury.

Jim Fan's EgoScale is replicating this curve in robotics:

Teleoperation Era
Requires 100% robot data = thousands of dollars per hour
EgoScale Era
Requires under 0.1% robot data = the robot-data requirement shrinks roughly 1000x (the overall training-cost compression cited earlier is over 10x)
Historical Pattern

The steam engine took 57 years from Newcomen to Watt. LLMs took 10 years from AlexNet to ChatGPT. Robotics took less than 3 years from teleoperation to EgoScale. Each automation revolution accelerates faster than the last, because each one stands on the shoulders of the previous one: robot training directly reuses LLM architectures and methodologies, and LLMs in turn were built on deep learning infrastructure. Jim Fan's "Great Parallel" isn't just a metaphor; it's an engineering prediction built on the historical law of acceleration.

8. Business Insights — Investment Logic of the Robot Economy

1. The Shovel Sellers Win

Investment Logic #1: Infrastructure Layer

Jim Fan's talk reveals one thing most clearly: NVIDIA is becoming the "shovel seller" of the robot era. They don't make robot bodies — they make:

  • Training infrastructure: GPU + Omniverse + Cosmos = the complete platform for robot training
  • Model layer: GR00T + Dream Zero = the foundation model every robot company will use
  • Simulation environments: Dream Dojo + Newton = virtual training grounds replacing millions of real robots

"Compute = Environment = Data" means: every GPU used to train robots is NVIDIA revenue. When robot companies worldwide race to train models, NVIDIA isn't selling robots — they're selling the "water and electricity" for training robots.

2. Data Is the Moat

Investment Logic #2: Data Flywheel

EgoScale's lesson: the competitive advantage of future robot companies won't be in hardware, but in the speed of their data flywheel.

  • Tesla model: Millions of cars automatically collecting driving data daily. Jim Fan explicitly used this as the benchmark for robot data strategy
  • Deployment = Training: Every deployed robot is a data collector — more deployment, more data, better models, more deployable
  • Massive first-mover advantage: The first company to spin up the flywheel will pull ahead at exponential speed

3. Software Eats Hardware — Yet Again

Investment Logic #3: Software-Defined Robots

Jim Fan's "Physical API" world means:

  • Robot hardware will commoditize (just like today's server hardware)
  • Value concentrates in the software/model layer (just like today's cloud services)
  • "Dark Factories" = the robotics version of "serverless architecture" — input instructions, output products, everything in between is AI

The takeaway for investors: don't just invest in robot hardware companies — invest even more in robot AI software and platform companies.

4. UMI Lesson: The Simplest Ideas Can Spawn Unicorns

Investment Logic #4: Innovation Isn't About Complexity

Jim Fan specifically mentioned the UMI (Universal Manipulation Interface) paper — a beautifully simple idea of "strap the robot hand directly onto a human hand" — which spawned two unicorns. This echoes a timeless entrepreneurial truth: the most valuable innovations are often the simplest. Not a more sophisticated teleoperation system, but "just skip the teleoperation entirely."

Five Industries That Benefit Most

Industry | Robot Impact | Timeline
Warehousing & Logistics | Fully automated picking, sorting, and packaging | 1-2 years
Electronics Assembly | Precision component assembly, GPU production lines | 2-3 years
Home Care | Elderly care, household automation | 3-5 years
Agriculture | Harvesting, grading, packaging | 3-5 years
Scientific Research | Automated wet labs, drug synthesis | 5-10 years

9. Conclusion + Series Guide

Three Core Messages from Jim Fan's Talk

Message One

The roadmap is clear. The Great Parallel is not a hypothesis — it's something already happening. Every step LLMs took — pre-training, fine-tuning, RL, automated research — robotics will follow. The only variable is time.

Message Two

The data bottleneck is being broken. From teleoperation to data wearables to human video, each generation increases data scale by 100-1000x. The discovery of the Neural Scaling Law suggests that, as long as the data keeps coming, robots will keep getting better.

Message Three

End game by 2040, but the tipping point is 1-2 years away. The Physical Turing Test may be achieved within 2-3 years. The "ChatGPT moment" for robotics — the first time an ordinary person is stunned by "I had no idea robots could do that" — may be just 1-2 years out.

Sequoia AI Ascent 2026 Series Summary

Three articles, three perspectives, one conclusion:

Article | Speaker | Core Argument | Action Window
Part 1: Overview | Sequoia Partners | AI is a computing revolution, AGI has arrived, $10T services market | 18 months
Part 2: Software 3.0 | Andrej Karpathy | LLM as computer, verifiability determines automation speed, understanding can't be outsourced | 12 months
Part 3: The End Game for Robotics (this article) | Jim Fan | The Great Parallel, Physical AGI blueprint, training costs compressed 10x | 1-3 years
Series Conclusion

AGI is not the future — it's now. Digital AGI is rewriting software (Karpathy), Physical AGI is rewriting manufacturing (Jim Fan), and Sequoia's partners are already placing their bets.

What does this mean for you? You don't need to understand Dream Zero's architecture or the math behind Neural Scaling Laws. What you need to understand is: everything you do today — writing code, managing a factory, caring for the elderly, organizing a warehouse — has an AI/robot version being trained right now. The question isn't "will it happen," but "what is your role in that version."

18 months. That's the window Sequoia has given. This isn't a scare tactic — it's an invitation to start thinking and acting now.

"If you believe in robotics, robotics will believe in you."

— Jim Fan, Sequoia AI Ascent 2026
Sequoia AI Ascent 2026 Deep Dive Series

  1. Part 1: AGI Has Arrived — Sequoia's Triple Declaration (Sequoia Keynote Deep Dive)
  2. Part 2: Software 3.0 — When LLMs Become Computers (Karpathy Talk Deep Dive)
  3. Part 3: The End Game for Robotics — Nvidia's Great Parallel Theory (this article)

10. References

  1. Jim Fan, "Nvidia's Jim Fan on the End Game for Robotics," Sequoia AI Ascent 2026, April 2026 (YouTube).
  2. NVIDIA, "Project GR00T: Foundation Model for Humanoid Robots," NVIDIA Research, 2024-2026.
  3. NVIDIA, "Cosmos: World Foundation Models," NVIDIA Research, 2025.
  4. NVIDIA, "Newton: Physics Engine for Robotics Simulation," NVIDIA, 2025.
  5. NVIDIA, "Dream Zero: World Action Models for Robotics," NVIDIA Research, 2026.
  6. NVIDIA, "EgoScale: Egocentric Video Pre-training for Dexterous Manipulation," NVIDIA Research, 2026.
  7. NVIDIA, "Dream Dojo: Neural Simulator for Robot Reinforcement Learning," NVIDIA Research, 2026.
  8. Chi et al., "Universal Manipulation Interface (UMI): In-The-Wild Robot Teaching Without Robot," RSS 2024.
  9. Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," Google DeepMind, 2023.
  10. Kaplan et al., "Scaling Laws for Neural Language Models," OpenAI, 2020.
  11. Sequoia Capital, "AI Ascent 2026 Keynote," April 2026 (YouTube).
  12. Andrej Karpathy, "From Vibe Coding to Agentic Engineering," Sequoia AI Ascent 2026 (YouTube).