Sequoia AI Ascent 2026 Deep Dive 3/3

The End Game for Robotics — Nvidia's Great Parallel Theory

Jim Fan's Physical AGI Blueprint

Independent Research | LittleX Research Lab | 2026-05-02
Series Theme: AGI Is Not the Future — It's Now, and You Have Only 18 Months

In the summer of 2016, a stocky man in a leather jacket walked into the OpenAI office carrying a massive metal slab. Engraved on it: "To Elon and the OpenAI team, for the future of computing and humanity." It was the world's first DGX-1. An intern named Jim Fan rushed over and signed his name on it.

Ten years later, that intern stood on the Sequoia AI Ascent stage, declaring that robotics has entered the "end game."
His argument is stunning: not building one super robot, but having millions of robots learn simultaneously. Just as LLMs learned language from the internet's text, robots will learn every physical action from everyday human videos.

And this time, training costs have been compressed by 10x. The "ChatGPT moment" for robotics may be just 1-2 years away.

Abstract

NVIDIA's Head of Robotics Research Jim Fan proposed "The Great Parallel" at Sequoia AI Ascent 2026: robotics will fully replicate the four-stage success path of large language models — pre-training, fine-tuning, reinforcement learning, and automated research. He introduced three key technical breakthroughs: Dream Zero (a World Action Model that lets robots "dream" the future before acting), EgoScale (using human first-person video instead of teleoperation, discovering a Neural Scaling Law for robot dexterity), and Dream Dojo (a neural simulator that uses GPUs instead of real robots for reinforcement learning). He predicts robots will complete the final unlock of the tech tree by 2040, and the "Physical Turing Test" — where you can't tell if a human or robot is performing a task — is just 2-3 years away. This article deconstructs the talk's technical architecture, data strategy, business logic, and implications for Taiwan from first principles.

Keywords: The Great Parallel, Physical AGI, GR00T, Dream Zero, World Action Model, EgoScale, Neural Scaling Law, Sim-to-Real, Dream Dojo, Cosmos, Newton, Humanoid Robots

Table of Contents

Why Not One Super Robot? (The Great Parallel Theory)
Nvidia's Trifecta: GR00T + Cosmos + Newton
Sim-to-Real: Training Costs Compressed 10x
Physical AGI: Definition and Timeline
When Robots Can Fold Laundry — What Does It Mean?
Opportunities and Threats for Taiwan's Manufacturing
Historical Parallels — From Steam Engines to Robots
Business Insights — Investment Logic of the Robot Economy
Conclusion + Series Guide
References

1. Why Not One Super Robot? (The Great Parallel Theory)

Sci-fi movies always give us the same picture: a single humanoid robot, like the Terminator, that can do everything. But Jim Fan says this is entirely the wrong direction.

Looking back at the LLM path to success, he identified four phase-transition leaps, each roughly two years apart:

2020 — GPT-3 Pre-training

Next-token prediction = learning the "shape" of language — grammar, logic, how code unfolds

2022 — InstructGPT Supervised Fine-tuning

Aligning the model to "useful work" — converging from vast possibilities to human-needed outputs

2024 — Reasoning Models (o1)

Using reinforcement learning to surpass imitation learning — the model begins to "think," not just recite

2026 — Automated Research

Accelerating the entire loop beyond human capability — AI begins doing AI research itself

Jim Fan's core insight: these four stages can be directly mapped to robotics. He calls it "The Great Parallel."

First Principle

The Great Parallel: If LLMs learned language by predicting "the next word," then robots can learn actions by predicting "the next physical world state." The underlying mathematical structure is identical — both are sequence prediction problems. The only difference: LLMs predict discrete tokens; robots predict continuous pixels and joint angles.
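The shared structure can be made concrete with a small sketch. It averages a per-step loss over a sequence, and only the per-step loss changes between language and robotics. The function names and toy numbers are illustrative assumptions, and mean squared error is only a stand-in for whatever objective a production world model actually uses; the talk does not specify one.

```python
import math

def token_loss(probs, target_index):
    """Cross-entropy for one discrete next-token prediction."""
    return -math.log(probs[target_index])

def state_loss(predicted, target):
    """Mean squared error for one continuous next-state prediction (assumed objective)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def sequence_loss(step_loss, predictions, targets):
    """Average a per-step loss over a sequence; this part is the same in both domains."""
    losses = [step_loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Language: a probability distribution over a 3-word vocabulary at each step.
text_preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
text_targets = [0, 1]
text_seq_loss = sequence_loss(token_loss, text_preds, text_targets)

# Robotics: a predicted joint-angle vector (radians) at each step.
robot_preds = [[0.10, 0.50], [0.20, 0.40]]
robot_targets = [[0.00, 0.50], [0.20, 0.60]]
robot_seq_loss = sequence_loss(state_loss, robot_preds, robot_targets)
```

The outer loop is the same in both cases. What differs is the output space and therefore the loss, which is where most of the practical difficulty of continuous prediction lives.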

LLM Path	Robotics Parallel Path	Core Technology
Pre-training (predict next word)	Pre-training (predict next physical state)	World Model / Cosmos
Supervised fine-tuning (align to useful output)	Action fine-tuning (align to real robots)	GR00T / Dream Zero
RL reasoning (surpass imitation)	RL training in simulation	Newton / Dream Dojo
Automated research	Physical automated research	Robots design and build the next generation of themselves

"So as any self-respecting scientist would do, I copy homework and I give it a new name. I call it the Great Parallel."

— Jim Fan, Sequoia AI Ascent 2026

This is not a metaphor. It's an actionable engineering roadmap. Jim Fan isn't saying "robots will one day be like ChatGPT" — he's saying "we're already walking the same path, and we know where every turn is."

Key Insight

Why not one super robot? Because the success of LLMs never came from a single super model breaking through alone. It came from scaled parallel training: billions of parameters, trillions of tokens, thousands of GPUs. Robotics is the same: the future isn't one omnipotent robot, but millions of robots learning simultaneously in simulated environments, then deploying learned capabilities to the real world. Quantity defeats quality; that is the true meaning of "The Great Parallel."

2. Nvidia's Trifecta: GR00T + Cosmos + Newton

To make The Great Parallel work, three core components are needed. Jim Fan's team happens to have built all of them.

1. GR00T — The Foundation Model for Humanoid Robots

GR00T (Generalist Robot 00 Technology) is NVIDIA's foundation model built for humanoid robots. Over the past three years, the robotics field has been dominated by VLAs (Vision-Language-Action models) — essentially a language model with an action output head bolted on top.

Jim Fan zeroed in on the problem:

"These models are really LVAs, because the most amount of parameters are dedicated to language. Language is first-class citizen, followed by vision and action. By design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs."

— Jim Fan

He gave a classic example: the original VLA paper demonstrated "move the Coke can next to Taylor Swift's photo." Yes, the robot recognized Taylor Swift, but that's a "noun capability," not a "verb capability." What you need is a robot that understands gravity, friction, and the deformation of flexible objects, not celebrity recognition.

2. Dream Zero — World Action Model (WAM)

Replacing VLAs is an entirely new architecture: the World Action Model (WAM).

Dream Zero is WAM's first implementation. Its core capability is "dreaming" — simulating the next few seconds of a scene in its mind before executing any action, then deciding what to do based on the simulation results.

Core Mechanism

Simultaneously decodes "next world state" and "next action" — vision and action are both first-class citizens

Key Breakthrough

Zero-shot generalization — can solve action tasks never seen during training

Validation Method

Visualize the robot's "dreams": if the predicted video is correct, the action is correct; if the video hallucinates, the action fails

Historical Analogy

Like the GPT-2 era — the shape is right but not yet precise enough; scaling will bring a qualitative leap
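The "both first-class citizens" idea can be sketched as one shared encoder feeding two output heads, trained with a joint loss. This is a toy with hand-set linear weights, not Dream Zero's architecture. The class name, the `action_weight` parameter, and the use of mean squared error are all assumptions made for illustration.

```python
def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def mse(predicted, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

class ToyWorldActionModel:
    """Shared encoder with two heads: next world state and next action."""

    def __init__(self, encoder, state_head, action_head):
        self.encoder = encoder
        self.state_head = state_head
        self.action_head = action_head

    def forward(self, observation):
        latent = linear(self.encoder, observation)
        next_state = linear(self.state_head, latent)    # the "dream"
        next_action = linear(self.action_head, latent)  # the motor command
        return next_state, next_action

    def joint_loss(self, observation, true_next_state, true_action, action_weight=1.0):
        """Penalize errors in the predicted state and the predicted action together."""
        pred_state, pred_action = self.forward(observation)
        return mse(pred_state, true_next_state) + action_weight * mse(pred_action, true_action)

model = ToyWorldActionModel(
    encoder=[[1.0, 0.0], [0.0, 1.0]],
    state_head=[[1.0, 0.0], [0.0, 1.0]],
    action_head=[[0.5, 0.5]],
)
dream, action = model.forward([0.2, 0.4])
loss = model.joint_loss([0.2, 0.4], [0.2, 0.4], [0.3])
```

Because both heads read the same latent, a wrong "dream" and a wrong action tend to come from the same internal error. That is the intuition behind validating actions by inspecting the predicted video.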

"A moment of silence for our dear friend VLAs. They've served us well. Rest in peace. Long live World Action Models."

— Jim Fan

3. Cosmos + Newton — World Model and Physics Engine

Where does Dream Zero "dream"? It needs a world model to provide the raw material for dreams.

Cosmos: NVIDIA's video world model that learns physics by predicting the next frame's pixels. Jim Fan demonstrated that Cosmos (V3) taught itself gravity, buoyancy, light reflection and refraction — no one wrote in any physics equations; physical laws "emerged" from pixel prediction
Newton: A classical physics simulation engine for scenarios requiring precise collision detection and rigid body dynamics
Dream Dojo: Turns Cosmos into a complete "neural simulator" — input action signals, output next-frame RGB images and sensor states, entirely data-driven with no physics equations or graphics engines whatsoever

First Principle

Compute = Environment = Data. In traditional robot training, you need real robots (hardware) in real environments (scenes) collecting real data (teleoperation). All three face physical bottlenecks. Dream Dojo's breakthrough: using GPU compute to directly generate training environments and data. Buying more GPUs equals having more robots, more environments, more data. This is why NVIDIA CEO Jensen Huang says "the more you buy, the more you save"; in robotics, the statement becomes literally true for the first time.
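A minimal sketch of the "Compute = Environment = Data" idea: if the simulator is just a function from (state, action) to next state, then the amount of training data is set by how many rollouts you can afford to compute. Everything here is hypothetical. `ToyNeuralSimulator` uses a hand-written rule where a real neural simulator would use a trained network emitting frames and sensor readings, and `toward_zero` is a placeholder policy.

```python
import random

class ToyNeuralSimulator:
    """Stand-in for a learned simulator: (state, action) -> next state."""

    def step(self, state, action):
        # A trained model would predict this; here it is a fixed rule.
        return state + action

def rollout(simulator, policy, start_state, horizon):
    """Generate one imagined trajectory without touching a real robot."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        next_state = simulator.step(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory

def collect(simulator, policy, num_environments, horizon, seed=0):
    """More compute means more simulated environments, hence more transitions."""
    rng = random.Random(seed)
    starts = [rng.uniform(-1.0, 1.0) for _ in range(num_environments)]
    return [rollout(simulator, policy, s, horizon) for s in starts]

def toward_zero(state):
    """Hypothetical policy: move halfway toward the origin each step."""
    return -0.5 * state

dataset = collect(ToyNeuralSimulator(), toward_zero, num_environments=8, horizon=5)
transitions = sum(len(trajectory) for trajectory in dataset)
```

Doubling `num_environments` doubles `transitions` with no extra hardware beyond compute, which is the scaling property the section describes.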

3. Sim-to-Real: Training Costs Compressed 10x

The biggest pain point in robotics has always been data. Jim Fan used a single chart to clearly illustrate the evolution of data strategies:

Three Generations of Data Collection

Method	Ceiling	Problem
Teleoperation	24 hrs/robot/day (realistically ~3 hrs)	Expensive, slow, robots frequently "throw tantrums"; NVIDIA Chief Scientist Bill Dally personally operated the controls, making it possibly "the most expensive teleoperation trajectory in history"
Data Wearable Devices (UMI/DexUMI)	Hundreds of thousands of hours	Strap the robot hand directly onto a human hand for data collection, eliminating the robot body entirely; spawned two unicorns
Human Egocentric Video	Tens of millions of hours	Like Tesla FSD, automatically collecting in the background — human daily activities themselves become training data

EgoScale: 99.9% Human Video + 0.1% Teleoperation

Jim Fan's EgoScale system, by the numbers:

21,000 hours

Human egocentric video pre-training (zero robot data)

50 hours

High-precision data glove fine-tuning

4 hours

Teleoperation data (less than 0.1%)

22 DoF

End-to-end policy for high-dexterity bimanual robots

The result: using 99.9% everyday human video + 0.1% teleoperation, they trained a high-dexterity robot policy capable of sorting cards, manipulating syringes, and folding clothes. This is the key to compressing training costs by over 10x.
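The data mix can be checked with a few lines of arithmetic using the hour counts quoted above. Treating data-glove hours as human-sourced data is an assumption of this sketch; the source does not say how the 99.9% figure groups the three sources.

```python
# Hour counts as quoted in this section.
HOURS = {
    "human_egocentric_video": 21_000,
    "data_glove": 50,
    "teleoperation": 4,
}

total_hours = sum(HOURS.values())
share = {source: hours / total_hours for source, hours in HOURS.items()}

teleop_percent = 100 * share["teleoperation"]
# Assumption: glove data is counted as human-sourced, since a person wears the glove.
human_percent = 100 * (share["human_egocentric_video"] + share["data_glove"])

print(f"total hours: {total_hours}")
print(f"teleoperation share: {teleop_percent:.3f}%")
print(f"human-sourced share: {human_percent:.2f}%")
```

With these inputs the teleoperation share comes out near 0.019%, comfortably under the 0.1% headline, and the human-sourced share is about 99.98%.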

Neural Scaling Law for Robot Dexterity

The most stunning discovery from the EgoScale paper:

Major Discovery

Robot dexterity exhibits a Neural Scaling Law: pre-training hours and validation loss show a clean log-linear relationship. This comes exactly six years after the original Neural Scaling Law for language models. The implication is that as long as pre-training hours of human video keep increasing, validation loss should keep falling predictably. A log-linear curve means each 10x increase in data buys a roughly constant improvement, so sustained progress depends on the data flywheel growing the corpus exponentially.
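A log-linear relationship has the form loss = a + b * log10(hours), and it can be fitted with ordinary least squares after taking the logarithm of the hours. The sketch below does that on synthetic points chosen for easy arithmetic. They are not measurements from the EgoScale paper, and the extrapolation is only as trustworthy as the assumption that the trend continues.

```python
import math

def fit_log_linear(hours, losses):
    """Least-squares fit of loss = intercept + slope * log10(hours)."""
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic points for illustration only.
hours = [100, 1_000, 10_000]
losses = [0.9, 0.7, 0.5]

intercept, slope = fit_log_linear(hours, losses)

# Extrapolate one more decade of data under the fitted trend.
predicted_at_100k = intercept + slope * math.log10(100_000)
```

In this toy fit the slope is -0.2 per decade, so every 10x increase in hours lowers the loss by the same fixed amount. Improvement is steady on a logarithmic axis, which in raw hours means each further step costs ten times more data than the last.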

Real-to-Sim-to-Real: iPhone as a Pocket World Scanner

Jim Fan also demonstrated an elegantly simple pipeline:

Scan a real scene with an iPhone
Extract all objects through a 3D scanning pipeline
Automatically reconstruct them in a physics simulator (all objects are interactive)
Infinitely augment with variations in simulation (he calls them "Digital Cousins")
Transfer the trained policy back to the real robot

The significance: the iPhone essentially becomes a pocket world scanner. Anyone can scan their work environment and have a robot learn how to work in that environment through simulation.
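The augmentation step in the pipeline above can be sketched as randomized copies of a scanned scene. This is plain parameter jitter used as a stand-in. The `SceneObject` fields, jitter ranges, and function names are hypothetical, and the actual "Digital Cousins" method may vary scenes in richer ways, for example by swapping in different object assets.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SceneObject:
    name: str
    x: float         # position in metres
    y: float
    scale: float     # 1.0 means the scanned size
    friction: float

def make_cousin(scene, rng, position_jitter=0.05, scale_jitter=0.1, friction_jitter=0.2):
    """Return a perturbed copy of a scanned scene (a hypothetical 'cousin')."""
    return [
        replace(
            obj,
            x=obj.x + rng.uniform(-position_jitter, position_jitter),
            y=obj.y + rng.uniform(-position_jitter, position_jitter),
            scale=obj.scale * (1 + rng.uniform(-scale_jitter, scale_jitter)),
            friction=max(0.0, obj.friction * (1 + rng.uniform(-friction_jitter, friction_jitter))),
        )
        for obj in scene
    ]

# A made-up scan result: two objects on a table.
scanned_scene = [
    SceneObject("mug", x=0.30, y=0.10, scale=1.0, friction=0.6),
    SceneObject("tray", x=0.00, y=0.25, scale=1.0, friction=0.4),
]

rng = random.Random(42)
cousins = [make_cousin(scanned_scene, rng) for _ in range(100)]
```

A policy trained across all 100 variants has to cope with objects that are slightly moved, resized, or more slippery than the scan suggested, which is the point of augmenting before transferring back to the real robot.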

4. Physical AGI: Definition and Timeline

Jim Fan used the Civilization tech tree to describe the end game for robotics. He said his research is like unlocking game achievements. Three achievements remain, and then he can retire.

Three Milestones

Milestone 1: Physical Turing Test (within 2-3 years)

Across a broad range of activities, you cannot tell whether a human or robot is performing the task. The key metric is "unit energy input vs. unit labor output" — it doesn't need to outperform a drunk person, but must reach the efficiency level of a normal human.

Milestone 2: Physical API

An entire robot fleet can be configured like software, through APIs and command lines. Jim Fan joked that "one day it will be orchestrated by Opus 9.0." This will enable "dark factories" — input a product design as a Markdown file, output a fully assembled product, entirely unattended; and automated wet labs to accelerate scientific discovery in chemistry, biology, and pharmaceuticals.
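What "configured like software" might look like can be illustrated with a toy dispatcher. No such API has been published; the `Robot` and `Fleet` classes, the skill names, and the `dispatch` method are invented for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Robot:
    robot_id: str
    skills: set
    busy: bool = False

@dataclass
class Fleet:
    robots: list = field(default_factory=list)

    def dispatch(self, task, required_skill):
        """Assign a task to the first idle robot that has the required skill."""
        for robot in self.robots:
            if not robot.busy and required_skill in robot.skills:
                robot.busy = True
                return {"task": task, "robot": robot.robot_id}
        return None  # No capable robot is free.

fleet = Fleet([
    Robot("arm-01", {"pick", "place"}),
    Robot("humanoid-07", {"pick", "place", "fasten"}),
])

first = fleet.dispatch("attach heatsink", required_skill="fasten")
second = fleet.dispatch("attach fan", required_skill="fasten")
```

Here `first` is assigned to `humanoid-07`, while `second` comes back as `None` because the only robot that can fasten is already busy. An orchestrating layer, whether a scheduler or a language model, would handle exactly this kind of decision.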

Milestone 3: Physical Automated Research (by 2040)

Robots begin designing, improving, and building the next generation of themselves — far beyond human capability. This is the end game.

14 years

From AlexNet (2012) to AI Ascent 2026 — the digital AI journey

14 years

Jim Fan's estimate from 2026 to the physical AI end game (2040)

95%

Jim Fan's confidence in reaching the end game by 2040

Exponential

Technology doesn't advance linearly — it accelerates exponentially

"Our generation was born too late to explore the earth, and too early to explore the stars. But we are born just in time to solve robotics."

— Jim Fan

First Principle

The essential definition of Physical AGI: A system capable of learning "any" physical task. Not an industrial robotic arm optimized for specific tasks, but a general-purpose physical intelligence that can learn new tasks through language instructions and a few demonstrations. This is perfectly symmetrical with the LLM AGI definition — LLM AGI means "handling any cognitive task," Physical AGI means "executing any physical task." Combined, they form complete AGI.

5. When Robots Can Fold Laundry — What Does It Mean?

Jim Fan demonstrated a seemingly mundane scene in his talk: a robot using 22-DoF bimanual hands to fold different types of clothing. And it only needed a single demonstration to learn different folding techniques.

Why does this matter? Because folding laundry is one of the "holy grail" problems in robotics.

Why Is Folding Laundry So Hard?

Deformable objects: Clothing has no fixed shape; every grasp starts from a different initial state
High-dimensional manipulation: Requires precise coordination of 44 joints across both hands
Multi-strategy generalization: Different garments (T-shirts vs. pants vs. towels) require different folding methods
Tactile feedback: Too much force tears the fabric; too little drops it

If a robot can fold laundry, it can also:

Home Scenarios

Tidying rooms, organizing belongings
Meal prep and ingredient handling
Home care (operating syringes, taking blood pressure)
Cleaning, dishwashing, sorting clutter

Industrial Scenarios

Assembling precision electronic components
Warehouse picking and sorting
Quality inspection and packaging
GPU assembly (an actual case Jim Fan demonstrated)

Deeper Significance

"Folding laundry" is not the destination — it's a proof of capability. It represents robots crossing the chasm from "rigid manipulation" to "deformable manipulation." Once deformable manipulation is unlocked, 90% of physical tasks in daily human life are within range. The "one-shot demonstration" learning shown in Jim Fan's talk is even more critical — it means deployment cost approaches zero. You don't need a programmer; you just need to "show it once."

6. Opportunities and Threats for Taiwan's Manufacturing

Jim Fan's talk never directly mentioned Taiwan, but every argument points straight at Taiwan's core competitive advantages.

Why Is Taiwan a Critical Node in the Robot Revolution?

TSMC

The world's advanced chip manufacturing center — robot "brains" are produced here

Manufacturing GDP 30%+

Taiwan remains a manufacturing powerhouse — the sector most susceptible to robotic automation

Labor Shortage Crisis

Declining birth rate + aging population = continuously widening labor gap

Complete Supply Chain

From chips to precision machinery to electronics assembly — the entire value chain on one island

Opportunities

Chip demand explosion: Dream Dojo-style neural simulators require massive GPU fleets — every robot training farm is a TSMC customer
Labor shortage solution: Taiwan's manufacturing labor shortage is exactly what robots are best at solving
Precision manufacturing upgrade: Taiwan's precision machinery industry can pivot to become a robot hardware supplier
First-mover advantage: If Taiwanese factories are first to adopt Physical AGI, they can maintain manufacturing competitiveness

Threats

Manufacturing reshoring: If robots drive labor costs to zero, manufacturing doesn't need to stay in low-cost regions — the US can manufacture domestically
Middle-layer disappearance: Taiwan's contract manufacturing model is built on "labor + management" — when robots replace both...
China catching up: China is investing heavily in humanoid robots, with a larger market and more application scenarios
Technology dependence: Core AI models and training frameworks are controlled by NVIDIA/Google/OpenAI

Action Plan for Taiwan

Jim Fan's timeline delivers a clear message for Taiwan:

18-month window: The Physical Turing Test arrives within 2-3 years, meaning deployment planning must start now
From "contract manufacturing" to "intelligent manufacturing": Adopt NVIDIA's Omniverse and Cosmos platforms to build digital twin factories
Training data is already in your hands: Production line footage and worker operation videos from Taiwanese factories are exactly the "human egocentric video" that the EgoScale approach needs
Precision machinery pivot: Companies like Hiwin and Delta should invest in critical humanoid robot hardware components (actuators, sensors, dexterous hands)

7. Historical Parallels — From Steam Engines to Robots

Every major automation revolution follows the same pattern:

Historical Pattern: From "Too Expensive" to "Too Cheap"

1712 — Newcomen Steam Engine

Extremely inefficient, only usable in coal mines for pumping water (because fuel was free on-site). No one believed it could replace horses.

1769 — Watt's Improved Steam Engine

3x efficiency improvement, began entering factories. Still expensive — only large enterprises could afford it.

1800s — Steam Engine Proliferation

Costs kept falling; trains, ships, and factories adopted it universally. Most heavy physical work was eventually performed by machines.

Tipping Point

The steam engine didn't become "smarter" — it became "cheaper." The steep decline of the cost curve is what triggers the revolution.

Historical Pattern: LLM Cost Collapse

2020 — GPT-3

Training cost: millions of dollars per run. Inference was prohibitively expensive. Only research labs could afford it.

2022 — ChatGPT

Cost per conversation dropped to a few cents. For the first time, ordinary people could use AI directly.

2026 — Today

Inference costs have dropped to less than 1/1000th of 2020 levels. AI has become infrastructure, not a luxury.

Jim Fan's EgoScale is replicating this curve in robotics:

Teleoperation Era

Requires 100% robot data = thousands of dollars per hour

EgoScale Era

Requires 0.1% robot data = robot-data requirement cut roughly 1000x

Historical Pattern

The steam engine took 57 years from Newcomen to Watt. LLMs took 10 years from AlexNet to ChatGPT. Robotics took less than 3 years from teleoperation to EgoScale. Each automation revolution accelerates faster than the last, because each one stands on the shoulders of the previous one: robot training directly uses LLM architectures and methodologies, and LLMs used deep learning infrastructure. Jim Fan's "Great Parallel" isn't just a metaphor; it's an engineering prediction built on this historical pattern of acceleration.

8. Business Insights — Investment Logic of the Robot Economy

1. The Shovel Sellers Win

Investment Logic #1: Infrastructure Layer

Jim Fan's talk reveals one thing most clearly: NVIDIA is becoming the "shovel seller" of the robot era. They don't make robot bodies — they make:

Training infrastructure: GPU + Omniverse + Cosmos = the complete platform for robot training
Model layer: GR00T + Dream Zero = the foundation model every robot company will use
Simulation environments: Dream Dojo + Newton = virtual training grounds replacing millions of real robots

"Compute = Environment = Data" means: every GPU used to train robots is NVIDIA revenue. When robot companies worldwide race to train models, NVIDIA isn't selling robots — they're selling the "water and electricity" for training robots.

2. Data Is the Moat

Investment Logic #2: Data Flywheel

EgoScale's lesson: the competitive advantage of future robot companies won't be in hardware, but in the speed of their data flywheel.

Tesla model: Millions of cars automatically collecting driving data daily. Jim Fan explicitly used this as the benchmark for robot data strategy
Deployment = Training: Every deployed robot is a data collector — more deployment, more data, better models, more deployable
Massive first-mover advantage: The first company to spin up the flywheel will pull ahead at exponential speed

3. Software Eats Hardware — Yet Again

Investment Logic #3: Software-Defined Robots

Jim Fan's "Physical API" world means:

Robot hardware will commoditize (just like today's server hardware)
Value concentrates in the software/model layer (just like today's cloud services)
"Dark Factories" = the robotics version of "serverless architecture" — input instructions, output products, everything in between is AI

The takeaway for investors: don't just invest in robot hardware companies — invest even more in robot AI software and platform companies.

4. UMI Lesson: The Simplest Ideas Can Spawn Unicorns

Investment Logic #4: Innovation Isn't About Complexity

Jim Fan specifically mentioned the UMI (Universal Manipulation Interface) paper — a beautifully simple idea of "strap the robot hand directly onto a human hand" — which spawned two unicorns. This echoes a timeless entrepreneurial truth: the most valuable innovations are often the simplest. Not a more sophisticated teleoperation system, but "just skip the teleoperation entirely."

Five Industries That Benefit Most

Industry	Robot Impact	Timeline
Warehousing & Logistics	Fully automated picking, sorting, and packaging	1-2 years
Electronics Assembly	Precision component assembly, GPU production lines	2-3 years
Home Care	Elderly care, household automation	3-5 years
Agriculture	Harvesting, grading, packaging	3-5 years
Scientific Research	Automated wet labs, drug synthesis	5-10 years

9. Conclusion + Series Guide

Three Core Messages from Jim Fan's Talk

Message One

The roadmap is clear. The Great Parallel is not a hypothesis — it's something already happening. Every step LLMs took — pre-training, fine-tuning, RL, automated research — robotics will follow. The only variable is time.

Message Two

The data bottleneck is being broken. From teleoperation to data wearables to human video, each generation increases data scale by 100-1000x. The Neural Scaling Law result suggests that as long as the data keeps growing, robot dexterity will keep improving predictably.

Message Three

End game by 2040, but the tipping point is 1-2 years away. The Physical Turing Test may be achieved within 2-3 years. The "ChatGPT moment" for robotics — the first time an ordinary person is stunned by "I had no idea robots could do that" — may be just 1-2 years out.

Sequoia AI Ascent 2026 Series Summary

Three articles, three perspectives, one conclusion:

Article	Speaker	Core Argument	Action Window
Part 1: Overview	Sequoia Partners	AI is a computing revolution, AGI has arrived, $10T services market	18 months
Part 2: Software 3.0	Andrej Karpathy	LLM as computer, verifiability determines automation speed, understanding can't be outsourced	12 months
Part 3: The End Game for Robotics (this article)	Jim Fan	The Great Parallel, Physical AGI blueprint, training costs compressed 10x	1-3 years

Series Conclusion

AGI is not the future — it's now. Digital AGI is rewriting software (Karpathy), Physical AGI is rewriting manufacturing (Jim Fan), and Sequoia's partners are already placing their bets.

What does this mean for you? You don't need to understand Dream Zero's architecture or the math behind Neural Scaling Laws. What you need to understand is: everything you do today — writing code, managing a factory, caring for the elderly, organizing a warehouse — has an AI/robot version being trained right now. The question isn't "will it happen," but "what is your role in that version."

18 months. That's the window Sequoia has given. This isn't a scare tactic — it's an invitation to start thinking and acting now.

"If you believe in robotics, robotics will believe in you."

— Jim Fan, Sequoia AI Ascent 2026

Sequoia AI Ascent 2026 Deep Dive Series

Series Theme: AGI Is Not the Future — It's Now, and You Have Only 18 Months

Part 1: AGI Has Arrived — Sequoia's Triple Declaration (Sequoia Keynote Deep Dive)
Part 2: Software 3.0 — When LLMs Become Computers (Karpathy Talk Deep Dive)
Part 3: The End Game for Robotics — Nvidia's Great Parallel Theory (this article)

References

Jim Fan, "Nvidia's Jim Fan on the End Game for Robotics," Sequoia AI Ascent 2026, April 2026. YouTube
NVIDIA, "Project GR00T: Foundation Model for Humanoid Robots," NVIDIA Research, 2024-2026.
NVIDIA, "Cosmos: World Foundation Models," NVIDIA Research, 2025.
NVIDIA, "Newton: Physics Engine for Robotics Simulation," NVIDIA, 2025.
NVIDIA, "Dream Zero: World Action Models for Robotics," NVIDIA Research, 2026.
NVIDIA, "EgoScale: Egocentric Video Pre-training for Dexterous Manipulation," NVIDIA Research, 2026.
NVIDIA, "Dream Dojo: Neural Simulator for Robot Reinforcement Learning," NVIDIA Research, 2026.
Chi et al., "Universal Manipulation Interface (UMI): In-The-Wild Robot Teaching Without Robot," RSS 2024.
Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," Google DeepMind, 2023.
Kaplan et al., "Scaling Laws for Neural Language Models," OpenAI, 2020.
Sequoia Capital, "AI Ascent 2026 Keynote," April 2026. YouTube
Andrej Karpathy, "From Vibe Coding to Agentic Engineering," Sequoia AI Ascent 2026. YouTube
