
NVIDIA's Jim Fan: The robotics field is still in a chaotic state, and even the development direction may be wrong

Jim Fan said that unreliable robot hardware has become the biggest obstacle to software iteration, and that the lack of industry standards has left evaluation in disarray. He feels the current mainstream vision-language-action (VLA) approach "feels wrong," because pre-training on vision-language models (VLMs) is fundamentally misaligned with what robots actually need. He said he is betting on video world models as an alternative.
Recently, Jim Fan, head of NVIDIA's robotics business and co-director of the GEAR Lab, published a lengthy post on social media sharply criticizing the current state of the robotics industry. He believes that, despite significant advances in hardware, the industry remains in disarray on software iteration, standard setting, and choice of technical approach.
Jim Fan pointed out that the current mainstream vision-language-action (VLA) approach "feels wrong," because pre-training on vision-language models (VLMs) is fundamentally misaligned with the actual needs of robotics. He stated that he is betting on video world models as an alternative.
The post has drawn industry attention. While other areas of artificial intelligence are advancing rapidly, these fundamental problems suggest that robotics is still far from commercialization, which may affect investors' valuation expectations for related companies.
Jim Fan summarized three lessons he learned in robotics in 2025, covering hardware reliability, industry standards, and technical approach. Together they offer a frontline view of the bottlenecks the robotics industry currently faces.

Hardware Reliability Becomes the Biggest Obstacle to Software Iteration
Jim Fan pointed out that although robots such as Optimus, e-Atlas, Figure, Neo, and G1 demonstrate superb engineering, hardware reliability severely limits the speed of software development. He stated that even the most advanced artificial intelligence has not yet fully used the capabilities of this cutting-edge hardware: "the physical capabilities exceed the command capabilities of the brain."
Unlike humans, robots cannot heal from damage. Overheating, motor failures, and firmware anomalies occur daily, and mistakes are irreversible and unforgiving. Looking after these robots requires an entire operations team.
Jim Fan lamented, "The only thing that can grow with scale is my patience." The remark points to the high labor costs and slow iteration that come with robot development.

Lack of Industry Standards Leads to Confused Evaluation Systems
Jim Fan described the state of benchmarking in robotics as an "epic disaster." He pointed out that, unlike the large language model field, which has converged on consensus benchmarks such as MMLU and SWE-Bench, the robotics industry has no agreed standards for hardware platforms, task definitions, scoring criteria, simulators, or real-world setups.
It is common for a company to define its own benchmark on the spot when making an announcement and then claim "state-of-the-art" (SOTA) results on it. More seriously, demonstration videos often show the best result out of 100 attempts.
Jim Fan calls out: "We must do better in 2026 and stop treating reproducibility and scientific rigor as second-class citizens." The criticism points directly at the industry's lack of scientific rigor.
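To see why a best-of-100 demo says little about reliability, consider a rough sketch. It assumes each attempt is independent and succeeds with the same probability; the 5% success rate and the function name `prob_at_least_one_success` are illustrative assumptions, not figures or terms from Jim Fan's post.

```python
def prob_at_least_one_success(success_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent trials succeeds.

    Assumes every trial has the same success probability (a simplification).
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # P(at least one success) = 1 - P(every trial fails)
    return 1.0 - (1.0 - success_rate) ** attempts


# Hypothetical policy that succeeds only 5% of the time.
single_try = prob_at_least_one_success(0.05, 1)
best_of_100 = prob_at_least_one_success(0.05, 100)

print(f"chance of a clean run in 1 attempt:    {single_try:.3f}")
print(f"chance of a clean run in 100 attempts: {best_of_100:.3f}")
```

Under these assumptions, a policy that fails 95% of the time would still almost certainly yield at least one clean take in 100 tries, which is why reporting the success rate over all attempts is more informative than a single video.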

Mainstream Technology Route Faces Fundamental Doubts
Jim Fan has raised fundamental doubts about the currently dominant VLA approach. A VLA model is typically built by grafting an action module onto a pre-trained vision-language model (VLM), but this approach has two core problems.
First, most of a VLM's parameters serve language and knowledge rather than physics. Second, to achieve high-level understanding, visual encoders actively discard low-level details, yet those fine details are crucial for dexterous robot manipulation.
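The second problem can be illustrated with a toy sketch. It is not how any real VLM encoder works: it uses plain average pooling over 2x2 patches as a stand-in for "summarizing" an image, and the helper names `average_pool` and `image_with_dot` are hypothetical. The point is only that two frames differing in a fine detail can map to identical coarse features.

```python
def average_pool(image, patch):
    """Downsample a square 2D grid by averaging each patch x patch block."""
    size = len(image)
    pooled = []
    for top in range(0, size, patch):
        row = []
        for left in range(0, size, patch):
            block = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def image_with_dot(size, row, col):
    """Blank size x size frame with one bright pixel, e.g. a contact point."""
    grid = [[0.0] * size for _ in range(size)]
    grid[row][col] = 1.0
    return grid


# The bright pixel moves by one pixel diagonally, within the same 2x2 patch.
frame_a = image_with_dot(4, 0, 0)
frame_b = image_with_dot(4, 1, 1)

features_a = average_pool(frame_a, patch=2)
features_b = average_pool(frame_b, patch=2)

print("frames differ:      ", frame_a != frame_b)
print("features identical: ", features_a == features_b)
```

In this toy setting, a downstream action module that sees only the pooled features cannot tell the two frames apart, even though the difference might matter for a precise grasp.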
Jim Fan believes that VLMs are heavily optimized for benchmarks such as visual question answering, so their pre-training objectives are misaligned with what robotics needs. He stated, "There is no reason to believe that the performance of VLA will scale with the increase of VLM parameters." He is betting on video world models as a more suitable pre-training objective for robot policies.

Jim Fan's views have sparked discussion within the industry. Commenter Stewart Alsop asked why, if video world models are superior, released models such as Helix, GR00T N1, and π0 are still built on VLM foundations, and why world models are currently used mainly for policy evaluation and synthetic data rather than direct motion control.
Jim Fan responded that those are 2025 models and that he is looking ahead to the next generation of large models in 2026.

