
China's "Four AI Giants" Rarely Share the Stage: Alibaba, Tencent, Kimi, and KNOWLEDGE ATLAS Discuss the Next Steps for Large Models and the Possibility of China's Catch-Up

The competition among large models has shifted from the "Chat" phase to the "Agent" phase, with the focus moving from leaderboard scores to the execution of complex tasks in real environments. The industry expects 2026 to be the year when commercial value is realized, with the technological path evolving toward verifiable reinforcement learning (RLVR). On the much-discussed question of "China surpassing," the leaders remained calm, putting the probability of taking the lead at no more than 20% and pointing to essential differences between China and the United States in the structure of computing-power investment, leadership in new paradigms, and the toB ecosystem.
Key Points:
- Shift in Competitive Coordinates: The engineering problems of the Chat era have been largely resolved. The future key to victory will no longer be a smarter "search box," but rather the ability to complete complex, long-chain real tasks. The core value of AI is shifting from "providing information" to "delivering productivity."
- Core Threshold Evolution: The bottleneck of agents lies not in the depth of thinking, but in environmental feedback. Future training paradigms will shift from manual labeling to RLVR (Reinforcement Learning with Verification). Only by allowing models to self-iterate within a "level system" that has clear right and wrong judgments (such as code, mathematics, and real business flows) can practical implementation be achieved.
- Efficiency as the New Leverage: High-quality data is about to run out, and future competition will be a race for "energy conversion efficiency." Achieving higher Token Efficiency (learning effectiveness per unit of data) through second-order optimizers and linear architectures is key to breaking through the intelligence ceiling under limited computing power.
- Awareness of Probability: The industry consensus is that China has a high probability of surpassing in the old paradigm (engineering replication, local optimization, toC implementation), but the probability of leading in the new paradigm (underlying architecture innovation, long-term memory, etc.) may not exceed 20%, as the U.S. invests several orders of magnitude more in basic research.
- Opportunity Window for Overtaking: The opportunity for surpassing lies in two variables: first, when the Scaling Law encounters diminishing marginal returns and the world is forced into a "smart efficiency" competition, China's frugal innovation may break through; second, with improvements in computing conditions in academia, a paradigm shift driven by academia may occur around 2026.
- Ultimate Variable for Success: What China lacks most is not ranking scores, but tolerance for uncertainty. True surpassing depends on whether we dare to step away from the "pressure of deterministic delivery" and allocate resources to new paradigms that may fail but can define the future, rather than just brushing rankings on the old track.
Recently, the AGI-Next Frontier Summit, initiated by the Beijing Key Laboratory of Foundation Models at Tsinghua University, brought together a significant portion of the AI community. Four key figures in foundation models were all present: Zhipu's Tang Jie, Kimi's Yang Zhilin, Alibaba's Lin Junyang, and Yao Shunyu, who made a surprise appearance.

They believe that the competition for large models has shifted from merely "chatting" capabilities and ranking scores to the stage of agents that can enter real environments, be verified, and complete complex tasks. The industry generally expects that 2026 will no longer be the "year of stronger models," but rather a critical year for models to truly run through business processes and create commercial value.
Regarding the topic most concerning to investors, "Can China surpass?", the sentiment conveyed at the summit was calm and realistic. Although China has a strong capability for catching up under engineering-replication and manufacturing logic, several core figures assessed the probability of China leading in the next-generation paradigm as not exceeding 20%. This caution stems from essential differences in the structure of computing-power investment between China and the United States: while the U.S. tends to invest in high-risk exploration of "next-generation research," China's computing power is currently concentrated on delivery and productization.
From "Better at Chatting" to "Able to Get Things Done": A Fundamental Shift in Evaluation Coordinates
The evaluation coordinate system of the large model industry has undergone a fundamental shift. Tang Jie stated that the issues of this generation of Chat have "almost been resolved," and the industry's focus is shifting from "better at chatting" to "able to complete tasks." In the past, the market pursued the "scores" of models on exams, but now the core metric has become the "success rate" in real-world environments.
For enterprises, if AI is merely a smarter search box, its value is limited; however, if AI can turn the success rate of task execution from chance to certainty, it signifies a qualitative change in productivity. Therefore, industry leaders such as Tang Jie and Yang Zhilin have reached a consensus: AI is transitioning from Chat to Thinking, Coding, and Agent.
At this stage, RLVR (Reinforcement Learning with Verification) has become a key technological path. Tang Jie emphasized that in fields where results can be determined, such as mathematics and programming, models can explore through closed-loop self-exploration; however, in complex tasks like web interaction, "verifiable space" is scarce. The future competitive threshold is not about making models think a few more steps, but rather establishing a sufficiently complex, realistic, and scoreable "level system" that allows models to iterate through "experience grinding."
Commercialization Divergence: High Premiums in ToB and Vertical Stratification
As the technological focus shifts to Agents, the commercialization paths have also shown significant divergence. Yao Shunyu pointed out that the logic of toC and toB will gradually drift apart. In the toC market, improvements in user experience do not necessarily lead to enhanced retention; however, in the toB market, what enterprises fear most is not slowness, but rather "errors that are uncontrollable."
Moreover, the industry's view on "vertical integration" is also being revised. Yao Shunyu observed that in the toB field, the model layer tends to lean towards "hardcore industrialization," competing on pre-training and computing power; while the application layer leans towards "business engineering," competing on processes and delivery. This may lead the future toB market towards a stratified structure: the strongest models paired with the most scenario-aware application teams, rather than a simple "model as product." This serves as a warning for Chinese companies: they should not only focus on rankings but also pay attention to the implementation and iteration capabilities within specific business chains.
The Probability of China's Lead: Structural Bottlenecks Under Optimistic Expectations
Regarding the discussion of "China's probability of taking the lead," the summit presented a kind of structural calm. Although the market is keen to discuss "rise" and "rankings," industry insiders such as Lin Junyang capped the probability of China leading in the new paradigm at 20%.
This cautious assessment is based on the structural differences in computing power usage between China and the U.S.:
- Differences in Investment Direction: The U.S. invests a large amount of computing power into "next-generation research," which has a high tolerance for error and aims to bet on the future; China, on the other hand, allocates a large amount of computing power for delivery and productization, aiming to "survive first."
- Paradigm Discourse Power: Yao Shunyu pointed out that China is very strong in replication and engineering; once a path is proven feasible, it can quickly do it better (the same logic as manufacturing and electric vehicles). The real challenge, however, is whether it can lead new paradigms such as long-term memory and autonomous learning frameworks, rather than merely "ranking" within old paradigms. The computing-power bottleneck, the completeness of the software and hardware ecosystem, and the willingness to pay in the toB market constitute the "three thresholds" constraining the development of models in China. If the ecosystem only rewards certain numerical rankings and squeezes the spirit of adventure out of organizations, then surpassing will be difficult to achieve.
The following is the full text of the speech, organized by Quantum Bit:
Tang Jie
My topic is "Let Machines Think Like Humans."
In 2019, with the support of Tsinghua University, we completed the transfer of our research results into a company and established Zhipu AI.
During the same period, we also continued to promote open source, with projects at both the model and tool levels, as well as a large model API system aimed at developers.
I spent nearly twenty years at Tsinghua.
Looking back, what I did was actually quite simple, mainly two things:
One is AMiner in the early years; the other is large models.
There is a concept that has deeply influenced me, which I call "Doing Research Like Drinking Coffee." This is closely related to one of the guests present today—Professor Yang Qiang.
When I first graduated, I went to Hong Kong University of Science and Technology, where almost all spaces were in one building: classrooms, laboratories, conference rooms, and cafes were all together.
One time I ran into Professor Yang in the café, and I said I had been drinking a bit too much coffee lately and might need to cut back.
He first said, "Yes, you should cut back," and then added that if we could become addicted to research like we do to coffee, then we would probably be able to do research really well.
This statement had a significant impact on me and has influenced me since 2008.
Doing research essentially requires long-term focus and continuous investment. AGI is precisely such a matter; it does not pursue short-term results but is a project that requires years of investment.
In 2019, our laboratory had already gained some international influence in the fields of graph neural networks and knowledge graphs, but at that time, we made the firm decision to pause and almost everyone shifted to research related to large models. Today, we have achieved a little bit of results.
If we look at the development trajectory of large models, it would be more intuitive to describe it in terms of "intelligence level."
Around 2020, models mainly solved relatively simple problems such as MMLU-style knowledge questions and QA; by 2021 and 2022, they began to enter the stage of mathematical calculation and basic reasoning, and through post-training these capabilities were gradually filled in.
By 2023 and 2024, models transitioned from knowledge memorization to complex reasoning, even being able to handle graduate-level problems, and began to show usability in real-world programming tasks like SWE-bench.
This process is very similar to human growth: from reading and arithmetic to more complex reasoning, and then moving into real work scenarios.
Starting this year, everyone has also seen HLE (Humanity's Last Exam), where many questions cannot be answered directly by search engines and require models to possess stronger generalization capabilities.
How to solve this remains an open question, but what can be confirmed is that in 2025 the overall capabilities of models are still improving rapidly.
From another perspective, a core issue is: How do models transition from scaling to true generalization capabilities?
Humans have always hoped that machines would possess generalization: teach them a few examples, and they can extrapolate and solve more problems, even ones they have never encountered before.
This aligns with our expectations for teaching children: by learning three questions, they can solve the fourth, the tenth, and even go beyond the original teaching scope.
The current path hopes to enhance this generalization ability through Scaling. Objectively speaking, the model's level of generalization still has significant room for improvement, and we can only continue to advance at different levels.
The earliest stage involved training models using Transformers, leveraging large-scale data and computing power to "memorize" a vast amount of knowledge.
The second stage is to align the model and enhance its reasoning capabilities, allowing it to better understand human intentions and complete more complex reasoning tasks.
This requires continuously scaling SFT and even introducing reinforcement learning: by using large amounts of human feedback data and continuously expanding the scale of that feedback, the model becomes more accurate and reliable.
An important change this year is RLVR.
In the past, reinforcement learning was difficult to scale due to its reliance on human feedback, which suffers from high noise and limited coverage of scenarios. If a verifiable environment is introduced, the model can explore autonomously and automatically obtain feedback, continuously growing in a closed loop.
However, the challenges here are also very apparent. The term "verifiable" is relatively easy to define in fields like mathematics and programming; however, once expanded to broader tasks, such as whether a webpage is aesthetically pleasing or whether interactions are reasonable, human judgment is still required.
Therefore, the current challenge facing RLVR is that verifiable scenarios are gradually depleting. Whether we can move into semi-automated verification or even unverified task spaces, allowing the model's capabilities to continue generalizing, is a key question.
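To make "verifiable" concrete: for code tasks, the environment can be as simple as executing the model's output against unit tests and returning a binary reward, with no human in the loop. Below is a minimal, illustrative sketch of such a reward function; the harness and task are invented for illustration and do not reflect any particular lab's production setup.

```python
import subprocess
import sys
import tempfile
import textwrap

def verifiable_reward(generated_code: str, unit_tests: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the model's code passes the given tests, else 0.0.

    This is the kind of automatically checkable signal RLVR relies on: because
    no human labeler is needed, the explore-and-feedback loop can be closed and
    scaled up.
    """
    program = textwrap.dedent(generated_code) + "\n\n" + textwrap.dedent(unit_tests)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# Hypothetical usage: score a single rollout from the policy model.
code = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verifiable_reward(code, tests))  # 1.0 -> positive reward for the RL update
```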
Looking further ahead, when machines begin to enter the physical world and execute real tasks, how to construct the environment for intelligent agents and how to design feedback mechanisms will present more challenges. It can be seen that the development of AI is no longer limited to a single model or Transformer structure, but is evolving into a complex, systematic intelligent system.
From a capability structure perspective, the model initially focused on reasoning tasks in mathematics and science, gradually advancing from elementary, middle, and high school levels to high-difficulty physics and chemistry problems like GPQA, and approaching Olympiad gold medal levels. This year, the extremely challenging HLE intelligent assessment benchmark has also begun to show significant progress.
In real-world environments, coding ability is another typical example. In 2021, code models already existed, and at that time, there were collaborations with Junyang, Kimi, and others. During that phase, the models had basic programming capabilities, but their success rates and stability were limited, often requiring ten programs to successfully run one.
Today, the situation has changed significantly; models can often run complex tasks successfully in one go and have begun to substantially assist senior engineers in completing more complex engineering work.
Many people will ask, as intelligence continues to enhance, is it enough to just keep training the model?
DeepSeek emerged, and at that time, we repeatedly discussed a question internally:
The issues with Chat have basically been resolved. Continued optimization will likely only achieve performance that is close, or make some improvements in personalization and emotional aspects. From an overall paradigm perspective, the space is rapidly converging, leaving more challenges at the engineering and implementation levels.
This forces us to think about the next direction. Our judgment is that the new paradigm is no longer just "dialogue," but rather enabling everyone to truly accomplish a specific task using AI.
Moving from Chat to doing tasks marks a clear turning point.
At that time, we had two main approaches in front of us: one was to focus on Thinking capabilities, combining Coding and Agent;
the other was to allow the model to interact more deeply with the environment, using AI to directly assist research, such as DeepResearch, to generate complex research reports. This was a trade-off.
Ultimately, we prioritized the former path, enhancing Thinking capabilities and introducing Coding scenarios, while not completely abandoning the direction of interacting with the environment.
On July 28, we made an attempt to integrate Coding, Agentic, and Reasoning capabilities into the same model.
In the 4.5 version released on July 28, we conducted systematic evaluations using 12 benchmarks, achieving relatively leading results in agent, reasoning, and coding tasks at that time.
Subsequently, we quickly opened 4.5 for user use, allowing everyone to program in real scenarios.
Problems soon emerged. For example, some users wanted to generate a playable Plants vs. Zombies game in one sentence, including a complete interface, interaction logic, scoring mechanism, and backend system. Version 4.5 frequently encountered bugs in such real complex environments, making it difficult to complete tasks.
This pointed directly to the value of RLVR verifiable reinforcement learning. We built a large number of real programming environments as verifiable feedback sources for reinforcement learning, while combining SFT data for bidirectional optimization, allowing the model to gradually improve stability in real interactions.
Similar methods were also introduced in the Web scenario, enhancing verifiability through feedback from the Web environment.
Under this strategy, we achieved good results in real-world evaluations such as SWE-bench, and have maintained decent performance recently.
However, benchmark results do not equate to the capabilities of the main model. How to reliably feed these capabilities back into the main model remains a huge challenge. Many models perform outstandingly on individual benchmarks, but the actual user experience may not necessarily improve.
Another challenge lies in the training system itself. RL tasks are diverse, with significant differences in sequence length and time scale, making it difficult to schedule uniformly. To address this, we developed a fully asynchronous reinforcement learning training framework, allowing different tasks to run in parallel and converge dynamically. This framework was also open-sourced this year.
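The scheduling idea can be illustrated with a toy asyncio sketch, offered purely as an illustration rather than the open-sourced framework itself: rollouts of very different durations run concurrently, and the trainer consumes trajectories as soon as they finish instead of waiting for a synchronous batch.

```python
import asyncio
import random

async def run_rollout(task_id: int) -> dict:
    """Simulate one agent rollout; real tasks differ wildly in wall-clock time."""
    duration = random.uniform(0.1, 2.0)      # e.g. a short math proof vs. a long web task
    await asyncio.sleep(duration)            # stands in for environment interaction
    return {"task": task_id, "seconds": duration, "reward": random.random()}

async def trainer(queue: asyncio.Queue, total: int) -> None:
    """Consume finished trajectories one by one; no lockstep batching."""
    for done in range(1, total + 1):
        traj = await queue.get()
        # A real system would compute advantages and apply a policy update here.
        print(f"update {done}: task={traj['task']} reward={traj['reward']:.2f}")

async def main(num_tasks: int = 8) -> None:
    queue: asyncio.Queue = asyncio.Queue()

    async def produce(i: int) -> None:
        queue.put_nowait(await run_rollout(i))

    await asyncio.gather(trainer(queue, num_tasks), *(produce(i) for i in range(num_tasks)))

asyncio.run(main())
```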
On this basis, the Agent and Coding capabilities have improved significantly. The recently released version 4.7 has made notable progress in these two dimensions compared with versions 4.6 and 4.5.
Sensory evaluation is equally crucial. Real users do not care about model scores; they care about whether their programs can run smoothly and whether the results are reliable. To this end, we organized a large number of manual evaluations, inviting experienced engineers to conduct subjective assessments of real programming tasks. There are still many issues to be resolved, but the direction is gradually becoming clear.
After integrating these capabilities, by the end of 2025 we achieved a relatively good overall score on the Artificial Analysis leaderboard, which can be considered a milestone result.
Taking a step further, when the model truly enters the Agent environment and attempts large-scale implementation, the problems will become more complex.
The most basic capability of an Agent can be understood as programming. Once the program is written, it can be executed, corresponding to one or several actions in the Agent. However, as the complexity of tasks continues to increase, completely different forms will emerge.
On the left is the computer use proposed by Claude, in the middle is Doubao's mobile Agent, and on the right is the asynchronous, ultra-long link tasks done by Manus.
If you want AI to complete tasks involving dozens or hundreds of steps, such as monitoring discussions about Tsinghua University on Xiaohongshu around the clock, automatically organizing themes and generating documents, these tasks are essentially completely asynchronous and extremely complex. They cannot rely on humans to monitor devices; they are closer to a capability at the Device use level.
The greater challenge posed by these types of problems does not solely lie in the scale of data. Many application scenarios have almost no ready-made data; they are more about code logic, which is a typical cold start problem.
In the early stages, we did collect and integrate a large amount of data, achieving good results in some scenarios through SFT and domain-specific reinforcement learning, but we soon discovered a practical issue: traditional phone use or mobile interaction is essentially about pressing buttons designed for people, while the entity doing the interacting, AI, is not a person.
From a system perspective, AI does not need to operate the mobile interface; the ideal way is to directly call APIs. However, the reality is that devices cannot be fully API-ized, and GUIs still exist.
This requires a hybrid solution. In AI-friendly scenarios, APIs should be prioritized; in human-friendly scenarios, AI should simulate humans to complete GUI operations. By combining APIs with GUIs, we collect interaction data in a large number of real environments and conduct fully asynchronous reinforcement learning, allowing the model to gradually acquire a certain degree of generalization ability.
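A minimal sketch of the dispatch logic this implies: if a device exposes an API matching the intent, call it directly; otherwise fall back to simulating GUI operations. The `Action` type, registry, and fallback below are hypothetical stand-ins, not a real agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Action:
    intent: str            # e.g. "send_message", "open_settings"
    payload: dict

def execute(action: Action,
            api_registry: Dict[str, Callable[[dict], str]],
            gui_fallback: Callable[[Action], str]) -> str:
    """Prefer a direct API call when one exists; otherwise simulate GUI operation."""
    api = api_registry.get(action.intent)
    if api is not None:
        return api(action.payload)             # AI-friendly path: structured and verifiable
    return gui_fallback(action)                # human-friendly path: simulated taps/clicks

# Hypothetical usage
registry = {"send_message": lambda p: f"API: sent '{p['text']}' to {p['to']}"}
fallback = lambda a: f"GUI: performing '{a.intent}' via simulated taps"
print(execute(Action("send_message", {"to": "Alice", "text": "hi"}), registry, fallback))
print(execute(Action("open_settings", {}), registry, fallback))
```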
It is important to emphasize that this generalization ability is still very limited and has a significant gap from the ideal state, but it already possesses preliminary transfer and adaptation capabilities.
Another issue brought about by cold starts is the risk of reinforcement learning itself. If data is insufficient, the model is prone to getting stuck in local optima during reinforcement, manifesting as policy solidification and narrowed paths, ultimately leading to a deviation in overall performance.
To address this issue, we introduced an alternating mechanism during training, periodically inserting SFT into the reinforcement learning process to correct direction and restore diversity, giving the model a degree of fault tolerance and the ability to pull back, thus forming a scalable training paradigm.
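Schematically, the alternation looks something like the sketch below; the 500-step period, batch count, and the `rl_step`/`sft_step` callables are placeholders rather than the actual GLM training configuration.

```python
from typing import Callable, Iterable

def train_with_alternation(
    rl_step: Callable[[], None],
    sft_step: Callable[[dict], None],
    sft_batches: Iterable[dict],
    total_steps: int = 10_000,
    sft_every: int = 500,
) -> None:
    """Interleave short SFT phases into an RL run to restore diversity.

    Pure RL from sparse, cold-start data tends to collapse onto narrow policies;
    the periodic SFT phase pulls the model back toward broad, curated behavior.
    """
    sft_batches = list(sft_batches)
    for step in range(total_steps):
        rl_step()                              # one policy update from verifiable reward
        if (step + 1) % sft_every == 0:
            for batch in sft_batches[:50]:     # brief supervised phase on diverse data
                sft_step(batch)
```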
In the mobile environment, this strategy has achieved relatively significant performance improvements in the Android scenario.
Additionally, we have also put effort into reinforcement learning for multi-task large models, mainly using multi-round reinforcement learning on the algorithm side and scaling it up on the engineering side.
Around December this year, we open-sourced AutoGLM, making everything inside it open. It is a 9B model that can act particularly quickly in human-computer interaction.
We introduced a large amount of Agent-related data into the 9B scale model, significantly enhancing the model's capabilities in Agent tasks, but some of its original general language abilities and reasoning capabilities may decline. It is no longer a completely general model but is more oriented towards Agents.
In future larger-scale Agent models, how to enhance Agent capabilities while avoiding damage to general capabilities is a problem that needs to be solved.
2025 is also the year of open-source GLM. From January to December we successively open-sourced multiple model lines, covering language models, agent models, and multimodal models, including a series of versions such as GLM-4.6, 4.6V, and 4.5V.
On the Artificial Analysis leaderboard, almost all the blue models in the top five come from China, indicating that China has formed a very significant influence in the field of open-source large models.
The next question is, can we continue scaling? What might the next AGI paradigm be? At the same time, we also face more real challenges.
After making progress in open-sourcing, it is easy to develop an optimistic sentiment, believing that Chinese large models have surpassed the United States in certain dimensions. However, the gap may not be narrowing and could even be widening.
What should we do next?
From the development path of large models, it has essentially been borrowing from the learning process of human brain cognition. In the earliest stages, it is about memorizing long-term knowledge of the world as much as possible, just like children read extensively; then gradually learning reasoning, mathematics, abstraction, and deduction.
This main line still holds, and there are several types of capabilities where humans significantly outperform current models, which may be new breakthrough directions.
First, multimodal.
Humans form overall cognition through various inputs such as vision, hearing, and touch, and insufficient sensory integration ability directly affects judgment and action.
How models can establish a similar multimodal "sensory integration" mechanism, that is, native multimodality, is a key direction.
Second, memory and continuous learning.
Humans possess a multi-level memory structure, including short-term memory, working memory, and long-term memory.
Furthermore, an individual's long-term memory is not equivalent to "knowledge." Only when knowledge is recorded does it truly enter the long-term memory system of human civilization.
How to expand from individual memory to group-level and civilization-level memory structures, and to incorporate them into a model's framework for continual learning, is an important question.
Third, reflection and self-awareness ability.
The current model has a preliminary ability for reflection, but deeper self-awareness still remains highly controversial. There are significant divisions in academia, with some supporting it and others opposing it. Personally, I tend to believe that it is possible and worth exploring.
Human cognition operates on a dual system, System 1 and System 2.
System 1 completes 95% of tasks, such as casually answering "yes" to "Are you having dinner tonight?"—these responses are memorized by System 1.
System 2 is only activated in more complex situations, accounting for about 5%.
The same principle applies to large models. In 2020, we created a reference AI system structure diagram based on human cognition: System 1, System 2, plus a self-learning module.
The idea of introducing "self-learning" is based on three points.
First, System 1 can correspond to a large-scale model, allowing it to cover a wide range of common Q&A and routine tasks through pattern matching and knowledge extraction.
Second, System 2 can correspond to a stronger knowledge integration and reasoning mechanism, such as instruction fine-tuning and chain of thought, enabling the model to handle more complex reasoning and decision-making.
Third, the human brain undergoes unconscious integration and consolidation during sleep; without sleep, humans do not become smarter.
Corresponding to today's path, we can categorize it into three types of Scaling.
First, Scaling data and model size to enhance the upper limit of intelligence.
Second, Scaling reasoning to extend thinking time, using more computation and search to find better solutions.
Third, Scaling self-learning environments to provide the model with more opportunities for interaction with the external world and to receive feedback from the environment.
Through these three Scalings, machines can reference human learning paradigms and learn more.
For System 1, since we already have Transformers, does it mean that simply adding more data and parameters is sufficient?
However, we currently face a problem: the computational complexity of Transformer attention is O(N²), so as the context length increases, the memory overhead grows and inference efficiency drops noticeably.
Recently, there have been some new models, such as those that handle long sequences with linear complexity, attempting to carry larger amounts of knowledge with a smaller "capacity," similar to the human brain.
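A back-of-the-envelope comparison makes the motivation concrete: the per-layer cost of full attention and its KV cache grow with the context length, while a linear-attention layer keeps a fixed-size state. The dimensions below are assumptions for illustration, and the arithmetic ignores multi-head/GQA details.

```python
def full_attention_cost(seq_len: int, d_model: int = 4096, n_layers: int = 60):
    """Rough per-forward costs of standard attention at a given context length."""
    attn_flops = 2 * seq_len * seq_len * d_model             # QK^T plus (softmax)V, ~O(N^2 * d)
    kv_cache_bytes = 2 * seq_len * d_model * 2 * n_layers    # K and V, fp16, all layers
    return attn_flops, kv_cache_bytes

def linear_attention_cost(seq_len: int, d_model: int = 4096, n_layers: int = 60):
    attn_flops = 2 * seq_len * d_model * d_model              # state updates, ~O(N * d^2)
    state_bytes = d_model * d_model * 2 * n_layers            # fixed-size state, independent of N
    return attn_flops, state_bytes

for n in (1_000, 100_000, 1_000_000):
    _, m_full = full_attention_cost(n)
    _, m_lin = linear_attention_cost(n)
    print(f"N={n:>9}: full-attention KV cache {m_full/1e9:7.1f} GB vs linear state {m_lin/1e9:5.1f} GB")
```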
But recently I have also been reflecting on whether we can find better compression methods to compress knowledge into a smaller space. This raises two types of questions.
First, can it be done from an engineering perspective?
Second, can it be achieved from a methodological perspective?
Recently, many people have been discussing that large models need to return to research and cannot simply rely on Scaling. Scaling is a good method, but it is also a lazy approach.
The key is to find a new paradigm that allows the machine to scale on its own, defining its own reward functions, interaction methods, and even its own training tasks for scaling.
After having the above two points, we also need to face the ultra-long tasks of the real world. We need to enable this machine to plan like a human, do something, check it, and then provide feedback.
There have already been some attempts along these lines: the idea is generated by the model, the experiments are run by the model, and the report is written by the model, ultimately producing something at roughly workshop-paper level; but in fact this has not yet been truly realized.
Here are some of our thoughts:
Before large models, most machine learning was about learning a mapping F: X → Y, that is, learning a function that maps samples X to labels Y.
With the arrival of large models, the problem has turned into learning a mapping F: X → X (not strictly X itself), which relies heavily on self-supervised learning for multi-task self-learning.
In the second layer, we introduce more data to help the model learn reasoning and how to activate lower-level intelligent capabilities.
Further on, we aim to enable the model to have self-reflection and self-learning abilities. Through continuous self-evaluation and self-criticism, the model can gradually discern which behaviors are effective and which paths have room for optimization.
In the future, we hope the model can further develop higher-level capabilities, such as self-awareness.
We also need to teach this machine to learn more, such as self-awareness, allowing it to explain its own actions. For example, if AI generates a large amount of content, it should be able to explain why it generated this content, what it is, what its goals are, and ultimately, perhaps one day, AI will have consciousness.
We roughly define five layers of thinking.
Computers have three capabilities: computation, programming, and searching. The combination of these three capabilities may lead to what is called "superintelligence."
I often think back to an event in 2019. At that time, I collaborated with Alibaba, and they wanted me to describe the future direction in one page of PPT. The page I provided was called "AGI-Next30," discussing what we should do in the next 30 years.
Looking back today, reasoning capabilities have reached a certain consensus and progress; memory capabilities are beginning to show early forms, but they are still limited; consciousness is still in the exploratory stage. This is also the direction we continue to invest in.
Looking further ahead, if we continue to reference human cognition, future AI may need to answer more fundamental questions: What is "I," why is it "I"; how to construct a meaning system for the model; what is the goal of a single intelligent agent; how to coordinate goals when multiple intelligent agents act as a group. Through these questions, AI may have the potential to embark on a continuous exploration of the unknown.
Some may think these questions are too distant or even impossible. But from the perspective of humanity itself, the ultimate driving force of civilization is the continuous exploration of the unknown. Those seemingly impossible directions are often the exploration goals that deserve serious consideration on the road to AGI.
For me personally, in 2026, the focus is more on concentration and doing some truly new things.
First, Scaling will continue, but two different directions need to be distinguished. One is scaling known paths, continuously pushing the limits of capability by adding data and computing power; the other is scaling unknown paths, that is, seeking new paradigms that have not yet been clearly defined.
Second, technological innovation will become more critical. We will advance the exploration of entirely new model architectures, focusing on solving issues such as ultra-long context and efficient knowledge compression, and further achieve knowledge memory and continuous learning capabilities.
Third, multimodal sensory integration will become a key direction this year. With this capability, AI will be able to perform long-chain and long-duration tasks in real work environments, such as continuous collaboration on devices like smartphones and computers.
At the same time, I also judge that this year is likely to become an important breakthrough year for AI for Science. With the enhancement of multiple foundational capabilities, the range of scientific research tasks that AI can participate in will significantly expand, opening up more new possibilities.
Yang Zhilin
From 2019 to now, all large models are basically based on the same first principle, the Scaling Law, which is also a perspective on converting energy into intelligence.
If there are better methods or better chips, it is actually possible to convert energy into higher-level intelligence more effectively and in greater quantities.
With more computing power, data, and model parameters, the loss of your model can decrease linearly, which is the foundation of the entire technological development.
The earliest paper proposing the Scaling Law compared Transformer and LSTM from a scaling perspective, which is very interesting.
Regardless of the parameter count, the loss of the Transformer is always lower than that of the LSTM; in Scaling Law terms, it scales better, reaching a lower loss at the same size or the same loss with fewer parameters.
The core reason why Transformer became the mainstream architecture later is because it performs better on the Scaling Law.
The iteration of all model architectures today is actually aimed at finding a line that can get closer to the lower left corner. If your network architecture is closer to the lower left corner, you actually have a better network architecture.
In the current situation this becomes even more meaningful. The stock of data on the internet is limited, a finite set, and the growth of high-quality data cannot keep up with the speed of model iteration. So when your curve sits closer to the lower-left corner, your intelligence ceiling is higher.
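To make the "lower-left corner" picture concrete, here is a minimal sketch of fitting a power-law scaling curve to loss measurements. All numbers, including the irreducible-loss floor, are invented for illustration.

```python
import numpy as np

# Hypothetical measurements: parameter count N vs. validation loss.
N = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss = np.array([3.10, 2.85, 2.62, 2.43, 2.27])

# Scaling-law form L(N) = a * N^(-alpha) + c.  With an assumed irreducible floor c,
# log(L - c) is linear in log(N), so a straight-line fit recovers the exponent.
c = 1.7
slope, intercept = np.polyfit(np.log(N), np.log(loss - c), 1)
alpha = -slope
print(f"fitted exponent alpha ~ {alpha:.3f}")

# A "better architecture" in this sense is one whose fitted curve sits lower-left:
# a smaller loss at the same N, or the same loss at a smaller N.
```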
However, many people may overlook why Transformer is better. The key lies in Token efficiency.
What is token efficiency? For example, given a context of, say, 100K tokens, you measure the loss at the first, second, third token, and so on up to the last one. This is a per-position loss: the horizontal axis is the position of the token in the sequence, and the value tells you how well the model predicts at that position.
You can see that over the first hundred or so tokens, Transformer and LSTM are essentially identical; the two curves are intertwined. This means that when predicting the next token from a very short context, the two are basically equivalent.
So within a very short context of around a hundred tokens, the Transformer is not the better architecture. Its advantage only becomes evident when the context is very long, where it significantly outperforms the LSTM.
This is another perspective to break it down, and it is an important metric.
How much advantage do you have at different context lengths? This question will become very important in the Agentic era, as many agent tasks require very long contexts, and you need to handle very complex tasks. Therefore, when an architecture has lower position loss, it indicates that it has much more technical potential when performing agent tasks.
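The per-position loss curve described here can be measured directly with any off-the-shelf causal language model; the sketch below uses a small public model and a single document purely as placeholders, whereas a real architecture comparison averages this curve over many long documents.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                   # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

text = open("long_document.txt").read()         # assumed long evaluation document
ids = tok(text, return_tensors="pt").input_ids[:, :1024]

with torch.no_grad():
    logits = model(ids).logits
# Loss at position t = cross-entropy of predicting token t+1 given tokens 0..t.
per_position_loss = torch.nn.functional.cross_entropy(
    logits[0, :-1], ids[0, 1:], reduction="none"
)
# An architecture with better long-context behavior keeps driving this curve down
# as the position grows; a weaker one flattens out after the first ~100 tokens.
for t in (0, 63, 255, ids.shape[1] - 2):
    print(f"position {t:>4}: loss {per_position_loss[t]:.3f}")
```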
Our pre-training strategy or model design strategy revolves around these two dimensions.
The first is Token efficiency. What we hope to achieve is to shift this line as far left as possible. The further left you move, the higher your token efficiency, meaning you can achieve the same effect with as few tokens as possible.
When your entire pre-training tokens are insufficient, tokens are a constant. When you exhaust all tokens, your intelligence ceiling is higher because your loss is lower. This is an important metric and optimization direction for our pre-training.
The second direction is Long context.
Today, very complex tasks must be completed under ultra-long contexts. Extending the context will inevitably reduce the loss, but only a good architecture can keep reducing it; with architectures like LSTM, CNN, or RNN, the benefit stalls after roughly a hundred tokens.
You can perform simple translation tasks, but you will never be able to accomplish a programming task, as there is no way to create a codebase from scratch. This is our overall optimization; multiplying token efficiency by long context will ultimately lead to very good agent intelligence.
So there are two main pieces of work here. The first is the Muon optimizer, an industrial-grade second-order optimizer. Traditionally, the first-order Adam optimizer, proposed back in 2014, has been the default; for almost ten years, mainstream large models have been trained with Adam.
However, we found that the Muon second-order optimizer performs very well, showing roughly a twofold improvement in token efficiency. Looking at the two curves, you can reach the same test loss using only 50% of the data; equivalently, with the same amount of data your loss is lower, which amounts to a 2x scaling effect.
On the right is the architecture from our latest research, Kimi Linear. When you stretch the context out, the reduction in loss is very significant, meaning performance on long-context tasks improves markedly. Finally, when these two factors are multiplied together, we believe the best agent performance can be achieved in terms of model training strategy.
All of this is aimed at building better agents. Why care so much about token efficiency? Essentially, agent reasoning, and agent RL training, is a search process. For example, if you want to develop Linux from scratch, you are fundamentally facing a search problem.
If you had unlimited resources, you could enumerate every possibility and check which ones are good operating systems. If you want AI to develop Linux efficiently, what helps is the well-informed prior from the model: during the process there is no need to enumerate every possible token combination, since most combinations are meaningless or wrong. Better pre-training and better foundation models shrink the search space by providing a better prior.
Today, many people are researching how to reduce priors, and it may ultimately be possible to achieve AGI with very few or almost no priors. However, I believe AGI built on priors will still arrive sooner. The field should first achieve AGI based on priors, and then explore how to achieve it with lower and lower priors.
Token efficiency here corresponds to a stronger prior: you are working with limited data, but given the same amount of data, a larger brain capacity and higher learning efficiency mean more intelligence, that is, an agent with a better prior. Context is the other dimension: an agent's behavior requires working memory, so longer context gives stronger perception of the environment and supports longer tasks. Ultimately you combine these two.
Based on this, Kimi's entire iteration in 2025 has explored and practiced along the two directions just mentioned. First is the Muon optimizer; we ran many experiments and discovered several important techniques.
For example, it turned out to be necessary to incorporate weight decay. The baseline optimizer was Adam; with Muon, the results are clearly better at comparable scale. Through these improvements we arrived at a genuinely effective optimizer that holds up in all respects, achieving a 2x improvement in token efficiency.
So note that the efficiency here is not merely efficiency; it is effectively the ceiling of intelligence, because your token budget is limited. We have also run many fair comparisons, and essentially all tasks improve, which is equivalent to training on twice as many tokens as everyone else.
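For readers who have not seen Muon, the sketch below shows the core of a Muon-style update on a single weight matrix, following the publicly released reference implementation: accumulate momentum, orthogonalize it with a Newton-Schulz iteration, then apply a decayed step. It is a simplified illustration, not Moonshot's production optimizer, and all hyperparameters are illustrative.

```python
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of a 2D momentum matrix.

    Coefficients follow the public Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = m / (m.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.01):
    """One simplified Muon-style update: momentum -> orthogonalize -> decayed step.

    The decoupled weight decay here is the kind commonly paired with Muon in
    practice; the specific values are illustrative, not Kimi's settings.
    """
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
    return weight, momentum

# Tiny usage example on a random weight matrix.
w, g, m = torch.randn(256, 512), torch.randn(256, 512), torch.zeros(256, 512)
w, m = muon_step(w, g, m)
```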
While improving this optimizer, some issues were observed. In a medium-scale experiment, the Muon optimization process ran into trouble: in the left chart, the horizontal axis is the number of training steps and the vertical axis is the maximum logit value, which grows explosively, and that is unhealthy.
As reflected on the right, when the logits reach very high values, training may fail to converge and the loss can explode, causing instability; the final performance of such a model would not be good.
An important point here is solving the Muon logit-explosion issue with a new method. We tried many approaches, and QK-clip showed very promising results. The detail is that when you compute the QK projection, it is multiplied by a factor determined by the current maximum QK logit, so the logits are dynamically clipped to within a specified value.
The effect is as follows: one with Clip and one without.
The two curves on the left overlap completely; you might not even notice, but they are fully overlapping. This shows that adding the clip has no impact on the results, so any previous result can still be reproduced, while the logits become much healthier.
On the right, once the logits rise to around one hundred, QK-clip takes effect; when it detects that clipping is no longer needed, it automatically backs off. This stabilizes training effectively and enabled the new optimizer to train stably at the trillion-parameter scale of Kimi K2; otherwise it would have blown up as before.
This chart is the most beautiful thing seen in 2025; it is the most beautiful thing in the world.
It shows a completely smooth descending Loss curve, with no issues throughout the training of 15T Tokens, successfully compressing all logits and converging smoothly to a very good point. When you have an elegant method, you can achieve an elegant result.
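A simplified sketch of the QK-clip idea just described: track the maximum attention logit, and whenever it exceeds a threshold, shrink the query and key projection weights so the product comes back under the threshold; when logits are healthy, nothing happens. The even split of the rescaling between Q and K and the per-matrix (rather than per-head) bookkeeping below are simplifying assumptions.

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             max_logit: float, tau: float = 100.0) -> None:
    """Rescale query/key projection weights in place when attention logits overshoot.

    If the observed maximum logit exceeds tau, both projections are multiplied by
    sqrt(tau / max_logit), so q . k is pulled back to roughly tau. When the logits
    stay below tau the clip is a no-op, which is why the loss curve is unaffected
    while the logits remain bounded.
    """
    if max_logit > tau:
        scale = (tau / max_logit) ** 0.5
        w_q.mul_(scale)
        w_k.mul_(scale)

# Hypothetical usage inside a training loop, after measuring the step's max logit.
wq, wk = torch.randn(512, 512), torch.randn(512, 512)
qk_clip_(wq, wk, max_logit=180.0)   # logits overshot -> both matrices scaled down
qk_clip_(wq, wk, max_logit=40.0)    # healthy logits  -> nothing happens
```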
On top of the excellent Kimi K2 base model, we also did a great deal of reinforcement learning and post-training, but that is not today's focus. The key point is that we have comprehensively improved its capabilities across various agent tasks and can benchmark against the leading companies in the United States.
One core point: on HLE, for example, I don't know how to solve 99% of the problems, but the model can now reach 45% accuracy, even higher than OpenAI. That means on this most critical benchmark we perform better than the American companies, which is a significant highlight.
In addition, it is a fully agentic model; Kimi K2 is China's first agentic model. After the K2 Thinking upgrade, it can carry out tool calls across two hundred steps, writing programs while solving hard problems. After two or three hundred steps it can tackle problems that I completely do not understand, yet its answers are correct.
Thanks to these developments, I believe many Chinese open-source models are gradually becoming new standards. Recently, EDA released new products, and many Chinese open-source models are now being used in standard testing. This is also a major benefit of open source; we hope to see more Chinese open-source strength and to see Chinese models gradually become standard-setters.
After K2, we have been exploring what the next generation of models might look like. The open-source work on Kimi Linear that I just mentioned is an early attempt on our part; we will continue to optimize and improve on this foundation to train the K3 model.
One of the most important improvements is Kimi Delta Attention, a new linear attention mechanism. Linear attention has been around for a while, but it has not become mainstream or been adopted by frontier models.
The main reason is that it struggles with long-distance tasks. When your context becomes longer, linear attention does not perform as well as full attention or the original Transformer.
This is a significant issue because many tasks now require long-range capabilities. As the context lengthens, performance deteriorates, making it difficult to switch.
The most important aspect of Kimi Linear is that it makes this linear attention mechanism perform better on long-range tasks, even better than full attention, while also being faster. Because it is linear, its efficiency is much higher: at a context of a million tokens it may offer a 6 to 10x advantage in end-to-end speed.
At the same time, it fixes many of linear attention's existing shortcomings, such as insufficient expressive power leading to subpar performance. Kimi Linear is therefore the first linear attention architecture that outperforms full attention on both short-range tasks and long input/output tasks, so it will play a very important role in practice.
Let's take a brief look at what it looks like. S_t is the current state of the linear attention, and you can see the update is entirely linear: the recurrence from S_{t-1} to S_t is what makes it linear attention.
A key point is the diagonal matrix in the middle: each dimension of the state is multiplied by a gate value f_t, meaning that for each dimension of the state we can precisely control how much memory is carried over from S_{t-1} to S_t.
This is crucial, because it greatly increases the expressive power. However, a naive implementation would be very inefficient, so we made many optimizations; after a series of transformations of the formula you can derive the following form.
In terms of engineering implementation this brings many benefits. Compared with DPLR, we reduce the number of matrix operations, so overall efficiency is very high. To build a good architecture, you have to combine many low-level optimizations with the model architecture itself; you cannot just tweak the architecture, because without an efficient implementation it is hard to get good results.
Compared with previous linear attention architectures, however, its major advantage is stronger expressive power.
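In recurrent form, the per-dimension gating described above can be written as S_t = diag(f_t) S_{t-1} + k_t v_t^T with output o_t = S_t^T q_t. The sketch below is a readable reference implementation of that gating idea only; Kimi Delta Attention additionally uses a delta-rule correction and heavily optimized chunked kernels, which are omitted here.

```python
import torch

def gated_linear_attention(q, k, v, f):
    """Recurrent form of per-dimension-gated linear attention.

    q, k: [T, d_k]; v: [T, d_v]; f: [T, d_k] forget gates in (0, 1).
    The state S is a d_k x d_v matrix; each key dimension decays independently,
    so the model controls exactly how much memory carries over from step to step.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(T):
        S = f[t].unsqueeze(1) * S + torch.outer(k[t], v[t])  # per-channel decay + write
        outputs.append(S.T @ q[t])                           # read the state with the query
    return torch.stack(outputs)                              # [T, d_v]

# Tiny usage example with random inputs and sigmoid gates.
T, d_k, d_v = 16, 8, 8
q, k, v = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
f = torch.sigmoid(torch.randn(T, d_k))
print(gated_linear_attention(q, k, v, f).shape)  # torch.Size([16, 8])
```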
Take a look at the effects in this image. On the left is the performance comparison, where we will examine two types of tasks: one is a short-range task, MMLU, which are all fair comparisons using exactly the same data and models of the same size. In the short-range, it performs significantly better, and in long-range tasks, it shows better results compared to many previous linear attention and full attention architectures.
At the same time, the speed shown in the right image has also significantly increased, basically as fast as the previous linear attention, but much faster than full attention.
Next, we will do more scaling based on K2. Of course, this scaling is not just about adding computing power. It involves many technical improvements, which will equivalently translate into scaling advantages. An important point here is that, in addition to challenges like architecture and optimizers, better data is crucial.
A very important point is that the upcoming models will have more taste, more style, and aesthetics.
The process of creating models is essentially about creating a worldview—what you think is good, what kind of performance a good AI should have, and what values it should pursue. It's somewhat like what Jobs said about taste. This is something we strongly believe in because intelligence is different from many other things; each token generated by a model is not an interchangeable item.
If you look at many things today, the electricity generated in Shenzhen is the same as that in Beijing, and the last penny in a bank account is completely equivalent; it is an equivalent exchange. But intelligence is not like that; the intelligence generated by a CEO is different from that generated by a designer or a musician.
In the dimension of intelligence, there is a vast space for taste, which increases exponentially, and you will have more new tastes emerging. It’s not that this model will converge; this is a very important goal for us moving forward.
I often have conversations with Kimi and share an interesting dialogue we had before. Now we are all working on AGI/ASI, which may lead to a better future where we can explore the universe together, but it could also pose a threat to humanity.
Its capabilities are already very strong: it can already accomplish many automated tasks, and there will be even bigger improvements later. The answer it gave was very enlightening.
It may not just be an ordinary tool but something that can elevate the limits of human civilization.
It is an extension of human cognition: today there are many problems we cannot solve, many cancers we cannot conquer, many energy issues that need to be addressed, and even many social problems that require better design. I think, judging from Kimi's answer, it is a very important key for us to explore the unknown world.
So, although there are risks, my answer is that I would still choose to continue development, because giving up this development means giving up on the limits of human civilization. We should not fear the risks of technology but strive to break through further, and in the process we can manage the risks well; all technological breakthroughs come with risks, and we cannot stagnate out of fear.
We hope to continue improving K4, K5 to K100 in the next ten to twenty years.
Lin Junyang
Both Teacher Tang and Zhilin are from Tsinghua, and I represent Peking University here. I haven't been back to Haidian District for a long time; I am from Chaoyang District.
Today, I will give an overall introduction to Qwen's progress in 2025. Some of the information may be relatively old, since in recent months we have been working on the next generation of things; I will try to share what I can.
The title "Towards a Generalist Agent" has actually gone through many rounds of changes; it was originally called "Towards a Generalist Model," but later I felt that "model" is too broad a term.
After some thought, I realized that "agent" might be a larger concept, similar to how humans can autonomously use tools. A significant difference between humans and animals is the ability to use tools independently. Thus, it became "Towards a Generalist Agent."
Moreover, the training paradigm has changed significantly. In the past, whatever we did, we would label the inputs and outputs, which you can think of as traditional annotation. With the new approach, as long as I solve the reasoning and the evaluation, the results can be impressive and can be applied to almost anything, which lets me unleash my imagination.
For example, today, data intelligence and model intelligence are both possible, which is also a small reason why I, someone who works with language models, dare to boldly claim that I want to work on VLA and robots.
If everyone wants to use our models, the easiest way to experience our open-source and closed-source models is quite interesting. We have been doing open-source for a long time, and everyone is relatively clear about it, so I won't elaborate or boast.
However, netizens kept criticizing us, saying our tools were hard to use and they always had to hunt for our models. So we took down the Open WebUI-based page and turned it into an aggregated entry point that looks like ChatGPT. Initially the algorithm team did not have strong product awareness, but as we worked on it we developed the sense that the model is the product, which led to some fun outcomes, so we will put everything on this platform.
Generally, we can find good search results in qwen.ai. Posting blogs is relatively simple for us. Recently, our new model architecture Qwen Next has become popular, and many colleagues have had trouble citing it, so please forgive us.
We have been doing open-source for quite a while, starting on August 3, 2023. Many people ask us why we are doing open-source.
There are many coincidences involved in this matter. Anyway, after doing open-source for a while, we have accomplished a lot, and at least it is still relatively industrial.
There isn't much content; basically, there are some scripts that everyone can view. We have a relatively large number of models. Why is that? In the past, many people didn't understand why we were making small models, but today everyone understands that small models are quite valuable.
The small model ultimately originated from our internal 1.8B model used for experiments. We conducted pre-training, and resources are limited after all. When doing experiments, you can't use the 7B model for everything; you have to use the 1.8B model for testing. At that time, my junior told me that we should open source this model, and I didn't understand it at all.
I said this model is almost in an unusable state in 2023; why should we open source it?
He told me: The 7B model consumes too much machine resources, and many master's and doctoral students do not have the machine resources to conduct experiments. If the 1.8B model is open-sourced, many students will have the opportunity to graduate, which is a very good original intention.
As we continued, mobile phone manufacturers came to us saying the 7B model is too large, and the 1.8B model is too small. Could we create a 3 to 4B model for them? This is easy; there’s nothing particularly difficult about it.
As we progressed, the types of models increased, which is somewhat related to serving everyone.
However, our inner pursuit is not just to serve developers or researchers; we are looking to see if we can create a Multimodal Foundation Agent, and I strongly believe in this.
If we trace back further, as Professor Tang mentioned, when we were collaborating back then, we were heavily focused on multimodal work. Looking back, it was a passionate time.
In 2023, large models were something everyone wanted to do; there was a bit of a mass-steelmaking flavor to it. Multimodality is something we have always wanted to pursue.
If you want to create something intelligent, it should naturally be Multimodal. Of course, there are different opinions, and various scholars have their views on whether multimodality can drive intelligence.
Humans have eyes and ears so they can do more things. My consideration is whether foundation models with these abilities are more productive and can better assist humanity; undoubtedly, we should work on vision and speech.
Ideally, I remember back in 2022, we designed a system with a brain in the middle. We didn't know what that brain was, but we knew that different modalities and tasks should enter this brain and output from it. This is the true vision of AGI.
Today, it seems very possible. I don't know whether your research directions have reached unified understanding and generation; this matter is quite complex.
Currently, Google has not achieved unified understanding and mutual generation, but I still believe in these things. If we look at GPT, after unifying many things today, it seems more perfect. Back then, there was still debate about which one was better.
The biggest progress this year is Qwen3, whose mascot looks a bit like a bear but is actually a capybara. While working on it, I felt our students were working too hard; I didn't want them to struggle too much. In today's competitive era, being a bit more relaxed is not a bad thing. We are working on relatively more directions.
But you can see that each direction has its own coherent logic.
For example, we have been working on Text, VL, and Omni for a longer time, covering visual, text, and voice generation. One thing unique to us is that we are backed by Alibaba Cloud and have many businesses tied to Alibaba Cloud's customers. The cloud business has a very diverse range of clients, so we also provide offerings such as embedding and guard models as services to everyone.
Today, I will introduce the main areas of Text, VL, and Omni; Coder-related work is introduced as part of Text. This year, Text mainly means the Qwen3 series, which has now reached 3.5, with version 3 having been developed over a longer period.
One of the biggest features is the overall capability enhancement.
The interesting part this year is that the reasoning capability needed to be improved. Let me add my personal understanding: reasoning here is somewhat different from purely task-oriented models.
The second point is the languages and dialects we support; there are a total of 119 languages including dialects.
Why did we work on multilingual capability? There was some serendipity involved. In 2023 we thought that doing Chinese and English well would be enough to serve our target users. But then I met a Korean friend who asked why they couldn't use our model when building theirs.
He said the model didn't understand anything at all, which hurt my feelings. I took a look, found the fix was actually quite simple, and we quickly made the adjustments.
Later, I realized that our global user base was growing. I remember some friends from Pakistan kept telling me to support Urdu quickly because they really had no large models available. I thought this was indeed a good idea, so we supported more languages. We haven't finished yet; collecting data from Africa is indeed a bit challenging, and African languages are not covered.
Today, I talked with some mobile phone manufacturers, and many people in Africa still use feature phones. While we have entered the era of smartphones, they are still working on this. Therefore, if we want to help all of humanity, it is indeed a long way to go. If your intention is not to help all of humanity, I think it would be better not to do it, so we will continue.
The third point is long context; today's long documents and long videos are one example.
However, I find this very interesting. If you really want to create a model with self-awareness, the context must be long enough. Previously, there was a discussion about whether it is necessary to include a lot of irrelevant information in long contexts, but having this capability allows for better understanding.
So we can now handle over 1M tokens of context, and internally we have already reached several million, which may still not be enough. What I want to emphasize is that this is a very long process. Returning to the earlier question, a significant difference between this generation of models and the 2024 generation is reasoning. Broadly speaking, reasoning means inferring through a problem to reach a better solution.
We still had research to do on making reasoning more native. During Qwen3, the version we released in April had some shortcomings, particularly in data, and there were issues with how the capabilities were integrated.
Over 90% of customers no longer use the Thinking model. A big reason our QwQ series was used so widely is that its users enjoy chatting with the machine, but everyone quickly went back to Instruct. In the chart, the blue bars are the April version and the red bars the July version.
Besides better data quality, the important point is that the Instruct model can now score around 70 on AIME (with Thinking you can reach 90). After adding this capability, customer feedback clearly said the model felt much smarter than before; at just over 20 points it struggled with almost any question, even simple math problems in education scenarios. This is a model we are quite proud of, and it is not very large; many people are using this series.
However, there is also a regret that this model still has many unfinished aspects, which is a matter of trade-offs.
For example, integrating Coding and Agent capabilities is quite challenging. Given our technical strength and situation, including our ongoing work on the Coder series, we launched this model.
Today's Coder is quite different from before. Last year and the year before, we were solving simple competition problems, just checking whether the answer was correct.
What are we doing today? Software engineering. In 2024 everyone was surprised: can AI really work like a programmer? Sustaining that today is genuinely hard; if you can do it, great. In practice the process is complex. The simplest part is that the model can at least open folders and, from the folder names, decide which one to click into. This is actually a multi-turn interaction process.
A very important aspect of today's Agents is why everyone emphasizes multi-turn interaction with the environment. Put simply, opening folders and inspecting them is also a way of interacting with the environment. This is important and exciting to us because it can genuinely generate productivity: we want today's Coding model to be productive, and a surprising amount of code can already be written this way.
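To make this loop concrete, here is a minimal sketch of such a multi-turn interaction: the model proposes an action (for example, listing a folder), the environment executes it, and the observation is fed back as context for the next turn. The helper names (`call_model`, `run_tool`) are placeholders of our own, not Qwen's actual tooling.

```python
# Minimal sketch of a multi-turn agent loop: the model proposes a tool call,
# the environment returns an observation, and the observation becomes context
# for the next turn. call_model / run_tool are illustrative placeholders.
import json
import subprocess

def call_model(messages):
    """Placeholder for a chat call to any coding model; expected to return a dict
    like {"tool": "ls", "args": {"path": "src"}} or {"final": "..."}."""
    raise NotImplementedError

def run_tool(tool, args):
    # A toy, read-only environment over the local repository.
    if tool == "ls":
        out = subprocess.run(["ls", args["path"]], capture_output=True, text=True)
        return out.stdout
    if tool == "cat":
        with open(args["path"]) as f:
            return f.read()[:4000]  # truncate long files
    return f"unknown tool: {tool}"

def agent_loop(task, max_turns=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(messages)
        if "final" in action:  # the model decides it is done
            return action["final"]
        observation = run_tool(action["tool"], action["args"])
        # Environment feedback is appended to the next turn's context.
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": f"Observation:\n{observation}"})
    return None
```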
Of course, there are differences between China and the U.S. I just returned from the Bay Area and felt the gap between the two sides; it is quite stark. Whether that is because the model is not yet good enough or because web coding has not been pushed further, I think there is a difference in perception. But what we want to achieve is a common goal: generating productivity.
At that time we focused on two benchmarks in particular. One is SWE-bench: can you submit a PR that actually fixes the issue? A score of 70 is considered a fairly high bar, and the frontier now feels like it is above 75. In July we reached 67 and 69, which we felt was acceptable.
Terminal-Bench is also quite challenging. Everyone is using this class of products now, and people find that it connects to your productivity in a way that is different from before. What we are doing today are practical tasks. Perhaps we only have one or two benches today, but can we make them closer to real environments and real production tasks? That is what we want to achieve.
When it first came out it was quite popular, but the competition now is fierce. Measured by token consumption, our Coder has consistently ranked second; allow me a little bragging.
The most interesting part of this work is something that had not been done before: scaling model training together with agent scaffolds. A scaffold, simply put, is this: the model works against real machines on Alibaba Cloud ECS, so the challenge is not only algorithmic; environments are provisioned to complete tasks and then torn down.
There are many real engineering challenges. The parts in the upper right of the slide I can feel directly myself, while the upper left requires collaboration with other partners. Algorithm-infra work today is a genuine collaboration; to accomplish tasks this difficult, we need a great deal of infra support.
This was about coding, and we want to see if coding capabilities can be integrated into our larger model. One regret is that I have not pushed to open-source the largest model, which is over 1T, although I really want to open-source it.
But that's how it is. We finally integrated these capabilities, and you can see our SWE-bench score reaching 70. Previously it was hard to score well without integrating this properly. It also shows that once you reach a strong enough level you can assemble a strong model, and that requires the corresponding accumulation.
Qwen3-Max is also ranked in the top five overall, though of course that reflects human preference. Will future evaluations be dynamic? For example, letting the model trade stocks in a real production environment. There is in fact a company doing stock-trading evaluations; there is a lot of randomness, but it is a good start, letting everyone see whether AI performs well or poorly in the real world.
When doing language models, we also need to consider whether it can have eyes to see the world. For example, we just mentioned wanting to create a Coding Agent to enhance productivity. I must let it control the computer and see the computer screen; without eyes, it cannot see. Therefore, we are unhesitatingly pursuing this, which is a huge difference. Visual Understanding is what we should focus on.
Yet today many models can see things more clearly than humans. I have both myopia and astigmatism, so it is hard for me to see clearly, but I can still tell up from down and left from right. AI, interestingly, sees very fine details clearly, yet when asked about front, back, left, and right it surprisingly gets confused.
We have long kept one evaluation case, about the orientation of living things. At the time I even asked our evaluators what counts as a living thing. The model could not tell whether something was on the left or the right, which I found strange, but that is exactly the problem we need to solve.
Beyond that, we also had to ensure one thing: that its intelligence does not decrease. We do not expect the IQ to rise significantly, but at the very least it should not become less intelligent, because building VL models very often causes a drop in intelligence. This time we finally prevented that drop, reaching a state comparable to our 235B language model.
Here, I will briefly discuss the main improvements we made this time.
First, we are all working on enhancing its ability to operate mobile phones and control computers.
Second, its language intelligence: whether the VL model can also serve as an LRM, so that it catches up with native multimodal models, at least allowing the language intelligence to reach a comparable level.
Third, Coding, which is very important, but the input for Coding can also be images or videos.
For example, today I want to create an APP or a webpage; I can draw it out. It doesn't necessarily have to be written in words, as that tests human expression ability. Many times, people do not express themselves clearly; you can draw a picture. Understanding videos may be the next opportunity for VL.
Video is a broader form of expression; images can be understood as single-frame videos, and understanding long videos is a very interesting task.
I have been thinking: if we had smart glasses receiving more information every day, could we construct our own Matrix? Glasses give a first-person perspective, while the videos we usually collect online are third-person, and our understanding of the first-person perspective is still very limited. More generally, the question is whether the model can build a good understanding of the physical world.
When we were working on this, we realized that it is really necessary to know whether it can understand spatial elements. This motivates us to do something: can we create a VLA? Perhaps we need to integrate all this data, and is it possible to connect hardware to create a VLA model, allowing it to gain some generalization?
There is also the strengthening of basic capabilities. For example, with OCR today a lot of effort goes into reading very wrinkled documents, yet our models often cannot handle them; if the paper is badly wrinkled, can we still make it understandable? That is a problem we need to solve.
Then there are seals with very unusual, tiny fonts, and low-resolution images; recognizing those is a specialized task. Can multimodal models also reason while understanding images? For instance, when we analyze a math problem step by step, can we combine that with the image and zoom in on small details?
A smaller example: given a photo with 50 people, can the model count them? At a single glance it cannot, but with reasoning it can look at the image bit by bit and possibly arrive at the number. Combined with concrete applications, there is actually a lot of room for what can be done.
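As an illustration of "looking at it bit by bit", here is a toy sketch that splits a crowded photo into tiles, asks a vision-language model to count each tile, and sums the results; `vlm_count` is a hypothetical helper, not any released API, and people straddling tile borders would need extra handling.

```python
# Toy sketch: counting people by reasoning over small crops instead of one glance.
from PIL import Image

def vlm_count(tile: Image.Image) -> int:
    """Placeholder: ask a vision-language model how many people appear in this tile."""
    raise NotImplementedError

def count_people(path: str, grid: int = 4) -> int:
    img = Image.open(path)
    w, h = img.size
    total = 0
    for i in range(grid):
        for j in range(grid):
            box = (i * w // grid, j * h // grid,
                   (i + 1) * w // grid, (j + 1) * h // grid)
            total += vlm_count(img.crop(box))  # each small crop is easier to count
    return total  # note: people crossing tile borders may be double-counted
```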
We can now basically reach a 2.5-Pro level, and what makes me happy is that the language intelligence has not declined much, which you could call the first time this problem has been solved.
What do we want to do further? Besides understanding images and videos, is it possible to generate images and videos simultaneously? We even have a bigger imagination; if we think about whether it is possible to implement our foundational model to imagine these things.
I have some images in my mind, and this imagination is meaningful to me. These things can all be realized through image generation and video generation, which will also be linked to this year's vision model.
This year we only just started on generation, spending a few months on the Qwen-Image series, and we just updated it again in December.
This is a blind test conducted by our internal staff, and the ranking is still acceptable, basically slightly worse than the best open-source and closed-source models. However, when I saw some actual images, I was quite excited.
For example, against other models the difference didn't feel dramatic, but compare our own August and December versions: the images generated in August had a very strong AI feel, while the December ones are already approaching realism; they are not that beautiful or polished, but they are close to looking real.
In fact there is another image in our blog, of a girl taking a photo in a dormitory, which really looks like a girl who has just woken up in a dorm. It didn't feel appropriate to show here, so I put up a prettier one. There are also more natural scenes, like a lighthouse: the water splashes are particularly exaggerated, but the water on the right reaches a very natural state.
Another point is that text generated inside images needs to be very accurate. The storyboard is not stitched together; it is actually a single image composed of 12 panels, with all the text generated in one pass. The model now has some capabilities that exceed our expectations; sometimes, when training it ourselves, we did not expect it to become this strong.
But besides generation, we also need to do more important things. After we did the generation, users told us that editing is a bigger demand because everyone needs to edit photos to make themselves look better.
There is also an Image-Edit version, and we will combine editing and generation together. I use it every day; recently, while traveling, I wanted to capture an "American pastoral" feeling, but there were many people in the frame, so I removed most of them and adjusted the style until I got it. That is something I do every day.
I want to share a more interesting case, about a question people often ask me: how does the open-source community help us improve this model? If the open-source community had not told us, we would never have thought of this problem ourselves.
There was an image to edit: move the person on the right side of the image lower. You would find that after the edit, when the two images are overlaid, things become blurry; the person has shifted a little and is no longer in the original position.
For people who work with Photoshop, this has to be very precise; you cannot let things drift. So version 2511 focused heavily on solving this problem: when I overlay the two images, the person stays exactly in the original position. I think the team put in a lot of effort to build something that genuinely helps these users.
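A simple way to picture the consistency requirement described here is a pixel-drift check: outside the edited region, the output should match the input almost exactly. The sketch below is our own illustration of that check, not part of the model's training recipe.

```python
# Sketch of a pixel-drift check for image editing: unedited pixels should stay put.
import numpy as np
from PIL import Image

def unedited_region_drift(original_path, edited_path, edit_mask):
    """edit_mask: boolean (H, W) array, True where the edit was requested."""
    a = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.float32)
    b = np.asarray(Image.open(edited_path).convert("RGB"), dtype=np.float32)
    keep = ~edit_mask  # pixels that should remain untouched
    return float(np.abs(a - b)[keep].mean())  # mean absolute drift per channel

# A value close to 0 means the untouched region did not shift or blur.
```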
Editing can do many things, such as adjusting the lighting to make it softer and more vivid. It is essential that our users and product teams tell us whether the lighting looks reasonable; those of us working on algorithms often sense this, but the bar for images is sometimes higher than expected.
So, when everyone talks about world models, whether we can truly build something that conforms to physical laws or the real world is still a very important matter.
There are other examples, such as translating a shot or rotating it by 30 degrees; these are very common requests. Today this model can even be combined with reasoning. One thing we have always wanted to do: tutoring children can be painful for parents, and there are problems AI cannot teach well, such as drawing auxiliary lines in geometry, which really requires a generative model. To genuinely solve such a math problem, drawing the auxiliary line, I may need to go through generation to convey the understanding.
Next, going a step further, if the image problems we are looking at are mostly solved, can we even generate things that can listen and speak like humans? Because voice interaction is also very important. Today, when everyone uses various apps, they find that having voice interaction is indeed very convenient.
Omni also needs to be very intelligent, and I am willing to believe in this. Understanding the ambient sounds of an event and what people are saying cannot be solved simply with ASR.
So we built something called Talker. This model has been in development for a long time; it can both listen and speak, and its quality is stable. Omni keeps moving in this direction; the overall model loses a little intelligence, but the reduction is not large.
In text our model reaches roughly the 2.5 level, and in speech it can basically match 2.5 Pro. There are many interesting details here, but due to time constraints I cannot share them all.
Today TTS can switch between many voices, including cloning your own: as long as you describe what the voice should sound like, the AI can present it in that form. I think there is still a lot of fun to explore in whether the foundation model and foundation agent can interact better with the real human world, including virtual worlds.
What is the next step? After doing so much, we certainly hope to bring it all together; a unified multimodal model is what we aim for.
There is something very important that I think is also a shared goal, namely doing something similar to what Kimi has done. After running various experiments in parallel, we ultimately chose a linear-attention route for long context, combined in a hybrid structure of roughly three linear layers for every full-attention layer.
The next generation of models will be built on this new architecture. What we want to see is whether the new architecture can solve the problems just mentioned and save a lot of steps; there may be even more capability hidden in it.
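As a rough illustration of such a hybrid stack, and only under the assumption that it interleaves linear-attention layers with occasional full-attention layers, the sketch below builds a stack in which every fourth layer uses full softmax attention and the rest use kernelized linear attention; it is not the released architecture.

```python
# Hedged sketch of a hybrid linear / full attention stack (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Kernelized (non-causal, for brevity) linear attention:
    softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V), which costs O(n) in sequence length."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1       # positive feature map phi(.)
        kv = torch.einsum("bsd,bse->bde", k, v)             # (dim, dim) summary
        z = 1 / (torch.einsum("bsd,bd->bs", q, k.sum(1)) + 1e-6)
        return self.out(torch.einsum("bsd,bde,bs->bse", q, kv, z))

class FullAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

def build_hybrid_stack(dim=512, depth=12, full_every=4):
    # depth=12, full_every=4 -> three linear layers for every full-attention layer
    return nn.ModuleList(
        FullAttention(dim) if (i + 1) % full_every == 0 else LinearAttention(dim)
        for i in range(depth)
    )
```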
What further things do we want to do? The Omni model is not just about understanding text, visuals, and audio; we may also enable it to generate text and audio. We have already achieved this today, but we have not yet integrated visual generation. If we can achieve three inputs and three outputs, I think it would be something I personally like.
The second point is that today's paradigm has changed significantly. It is no longer like before, training on labeled data with one input and one output; today we need to bring interaction, real experiments and environments, into training.
If you follow xAI's messaging you will see this too. I do think RL data is somewhat wasteful, but on the other hand that also means RL has a lot of room for imagination. And it is not about the model having a dialogue with itself; I am not so concerned about whether our model becomes the strongest mathematical brain. I care more about it contributing to society like a real person. If it can do that, I think that would be quite good.
Therefore, multi-turn RL with environment feedback, toward long-horizon reasoning, is necessary, because many tasks take a long time and have to be done step by step.
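Here is a hedged sketch of what such multi-turn RL with environment feedback can look like: the reward is not a human label but a verifiable outcome, for example whether a test suite passes after the agent's edits, and a group-relative baseline is used for the update. The `policy` and `env` interfaces are placeholders, not any particular framework's API.

```python
# Sketch of multi-turn RL with a verifiable environment reward (illustrative only).
def rollout(policy, env, max_turns=16):
    """One multi-turn episode: act, observe, repeat; return trajectory and outcome."""
    obs, trajectory = env.reset(), []
    for _ in range(max_turns):
        action = policy.act(obs)
        obs, done = env.step(action)
        trajectory.append((obs, action))
        if done:
            break
    # Verifiable reward: 1.0 if the environment's checker (e.g. test suite) passes.
    return trajectory, float(env.verify())

def train(policy, env, iterations=1000, group_size=8):
    for _ in range(iterations):
        episodes = [rollout(policy, env) for _ in range(group_size)]
        rewards = [r for _, r in episodes]
        baseline = sum(rewards) / len(rewards)        # group-relative baseline
        for trajectory, reward in episodes:
            advantage = reward - baseline
            policy.update(trajectory, advantage)      # e.g. a policy-gradient step
```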
However, AI can accelerate many things. For example, something that takes humans two months can be done by AI in two days. Although there are many tokens involved, two days can indeed save us a lot of time.
Agents can move toward both the virtual world and the physical world, which is why we talk about Embodied Reasoning. We discussed this internally: whether you are building VLA models or coding models, it is essentially about turning a language model into one embodied in an environment, which is very exciting from this perspective.
Thus, we feel like we should go all out and see if we can move towards Digital Agents, with GUI operations while also being able to use APIs. This would be a perfect Digital Agent. If we move towards the physical world, can it pick up a microphone or pour tea?
Roundtable Discussion
The most exciting part of this summit is undoubtedly the roundtable session.
It started off very dramatically: there were supposed to be four guests, but only three were on stage.
Amid the confusion, Yao Shunyu's face suddenly appeared, huge, on the screen:
Am I now just a big face on the screen?
The whole audience was stunned for a moment, then burst into laughter.
The host took this opportunity to directly start the topic with Yao Shunyu.
Q1: Route Differentiation
Host: I am Guangmi, the host of the next panel.
Let's start with the theme of differentiation. Competition in Silicon Valley is fierce, and some players, rather than following everything, have focused on enterprise, coding, and agents.
I'm also thinking about what directions the models in China will differentiate into. I find the theme of differentiation quite interesting.
Shunyu, please start by sharing with everyone, and also tell us what you have been doing recently.
Yao Shunyu: Hello everyone; am I a huge face at the venue right now? Sorry I couldn't come to Beijing in person today, but I'm very happy to take part. Recently I've been busy with models, products, and AI in general, which is a very normal state. Being back in the country feels good; the food is much better.
I have two major feelings. One is that there is a clear differentiation between toC and toB, and the other is that the path of vertical integration, as well as the layering of models and applications, is also beginning to show differentiation.
First, I think it's obvious that when people think of AI, two things come to mind: ChatGPT and Claude Code, which stand for toC and toB respectively.
An interesting point is that using ChatGPT today does not feel very different from last year.
Coding, however, to exaggerate a little, is already reshaping how the entire computer industry works; people no longer write code but communicate with computers in English.
I think a core point is that for toC, most people most of the time do not need such strong intelligence. Perhaps today, compared to last year, the ability to write algebra and Galois theory has improved, but most people do not feel it most of the time.
Most people, especially in China, use it more like an enhanced search engine, and often do not know how to use it to unleash its intelligence.
But for toB, one obvious point is that the higher the intelligence, the higher the productivity, and the more valuable it becomes, and these things are all related.
Another obvious point for toB is that most of the time many people are willing to use the strongest model: the top model costs $200/month, the second strongest or slightly weaker ones cost $50/month or $20/month.
Many Americans are willing to pay a premium for the best model. Perhaps their annual salary is $200,000 and they need to complete 10 tasks a day; a very strong model might get eight or nine out of ten right, while a weaker one gets five or six. The problem is you don't know which five or six are wrong, and you have to spend extra effort monitoring for that.
Whether for people or for models, there is an interesting phenomenon in the toB market: the differentiation between strong models and slightly weaker ones is becoming increasingly pronounced.
The second observation is the difference between vertical integration and layering models and applications. A good example is ChatGPT Agent versus applications like Manus built on top of Claude or Gemini. It used to be assumed that vertical integration would necessarily lead to better outcomes, but at least today that is not necessarily the case.
First of all, the capabilities required at the model layer and the application layer are quite different, especially for toB or productivity scenarios, where larger pre-training is still a very critical factor. This is indeed difficult for product companies to achieve, but to effectively utilize such a good model or for such a model to have its spillover capabilities, a lot of corresponding work needs to be done on the application side or the environment side.
We find that vertical integration still holds in toC applications. Whether it's ChatGPT or Doubao, the model and product are tightly coupled and iterated closely. However, for toB, this trend seems to be the opposite; models are becoming stronger and better, but there are also many application-level elements that apply good models in different productivity segments.
Tencent is definitely a company with a stronger toC gene. We think about how today's large models and the progress of AI can provide more value to users. A core observation is that in our setting, whether the model is strong or very strong, it often needs additional context.
An example I often give lately: suppose I ask what I should eat today. The answer I should get today is really quite different from last year's or tomorrow's, and the model on its own cannot know that.
Improving this is not just a matter of a larger model, stronger pre-training, stronger reinforcement learning, stronger agent environments, or a stronger search engine; the problem needs more additional input, what we call context.
toB is genuinely hard, and the productivity revolution, including the many Chinese companies working on Coding Agents today, needs to tap heavily into overseas markets.
We think about serving ourselves well first. The difference between a startup doing coding and a large company doing coding is that a large company already has all kinds of application scenarios and all kinds of places where productivity needs to improve.
If our model performs better there, the model gains its own unique advantage and the company itself develops well. Importantly, capturing data from real-world scenarios will be a very interesting endeavor.
For example, Claude and the startups around it want to find more data vendors to label data for Coding Agents, and they have to mobilize all kinds of software engineers to think about what data to label.
Only a handful of data companies are involved, and even though they hire many people, you are ultimately limited. If you are a company with 100,000 employees, however, there may be interesting ways to make effective use of real-world data, rather than relying only on labelers and contractors.
Lin Junyang: Today, whether toB or toC, we are addressing real problems; what we think about is how to make the human world better. Even within toC there will be differentiation. OpenAI today is more like a platform, but toC ultimately has to serve the real users in each group.
Many AI applications today lean toward sectors like healthcare and logistics. I think the Coding companies are genuinely impressive; I visited some of them because I know they talk to customers constantly. That is an area where we are still lacking, even though we have significant advantages. The Chinese SaaS market is quite different from the U.S.; over there they communicate with clients so frequently that substantial opportunities are easy to discover.
When I talk to API vendors, the coding consumption they see in China really is not that large, at least from my perspective; in the U.S., however, it is basically all coding, and I feel not everyone has grasped this.
Some of the related things being done today are based on their own observations of opportunities with clients. I believe that the differentiation among everyone is a natural one. I prefer to believe in AGI and let AGI do what it should do, following the natural course; this is what we should be doing.
Yang Qiang: I would like to discuss the differentiation between industry and academia, which may span both the U.S. and China.
Historically, academia has been an observer while industry has been leading the charge. This has led many in academia to engage in industrial activities, like Professor Tang Jie, which is a good thing. It's similar to when astrophysics began, focusing on observation with Galileo's telescope, and then Newton emerged.
Therefore, I believe that in the next phase, when we have numerous stable large models and enter a steady state, academia should catch up.
What problems should academia address? These are issues that industry may not have had time to solve yet, and this is something I have been considering: where is the upper limit of intelligence? For example, given certain resources, whether computational or energy resources, how well can you perform?
We can be more specific: how do we allocate these resources, how much to training and how much to inference? I have worked on AI for a long time; in the early 1990s I ran a small experiment: if we invest a certain amount in memory, how much does that memory help reasoning? Can the help turn counterproductive, meaning that if you remember too much, the noise from memory interferes with reasoning? Is there a balance point? I believe these questions are still relevant today.
I have also been thinking about another question recently. Everyone who studies computer science takes theory courses, where there is an important theorem, Gödel's incompleteness theorem. Applied loosely here, the idea is that a large model cannot certify its own correctness from within; some hallucinations can never be fully eliminated. Giving it more resources eliminates more hallucinations, but not all of them.
So the scientific question becomes: how many resources can be exchanged for how much reduction in hallucinations or error rate? There is a balance point, very much like the balance between risk and return in economics; we might call it a no-free-lunch theorem. I think these questions are particularly suited to collaboration among the mathematics, algorithms, academic, and industrial communities, and that collaboration is nurturing a huge breakthrough.
Teacher Tang Jie also mentioned continuous learning. I think continuous learning is a particularly good question; it involves a concept of time, as you are continuously learning over time.
However, you will find that if you chain different agents together, and no single agent achieves 100% accuracy, then after N steps the capability declines exponentially. How do you keep it from declining? Humans face the same issue: you learn on day one, and on day two you learn on top of the noise from day one; done naively, your capability would decay just like such a chained model.
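The decline is easy to make concrete: if each step in the chain is correct with probability p < 1 and errors are never corrected, the whole pipeline degrades geometrically, for example:

```latex
P(\text{correct after } N \text{ steps}) = p^{N},
\qquad \text{e.g. } p = 0.95,\ N = 20 \ \Rightarrow\ 0.95^{20} \approx 0.36 .
```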
But humans have a mechanism for this: sleep. I recommend the book "Why We Sleep," written by a neuroscience professor at UC Berkeley. It is very interesting; it argues that sleeping each night cleans up the noise, so that the next day you keep improving your accuracy instead of simply piling noise on top of noise.
Research on these theories is nurturing a new model of computation. Today we are mostly focused on Transformer-style computation, but I think it is necessary to explore some new avenues, and that is something industry and academia need to align on.
Tang Jie: In the early days it was still about the foundational model. In 2023 we were the first to build Chat. The first idea at the time was to get Chat online quickly; under the regulatory process there was a batch of approvals and launches in August and September.
My first impression was that about ten large models launched at once, and none of them had that many users. Today, of course, the differentiation is very pronounced.
After a year of reflection, I felt that this did not really solve the problem. My first prediction was that it would replace search.
By now many people have indeed started using these models instead of searching, but they have not replaced Google; instead, Google revolutionized its own search and improved it.
From this perspective, I think this battle has ended since DeepSeek emerged.
After DeepSeek, we should think about what the next battle will be.
Our team debated this for a long time; the next battle must be about AI actually doing something, and what that something is can be discussed. At that time Guangmi came to talk with us. Guangmi has particularly deep knowledge and thinks hard about problems; the conversation was very enlightening for me, opening up an angle I had never considered.
Later, our team debated for many nights, and in the end, we could call it our luck; on the other hand, we also put all our energy into coding.
Q2: Autonomous Learning
Host: Next, a rather interesting question. Today is a special point in time: pre-training has run for three years, and many say it may have captured 70-80% of its returns. Reinforcement learning has also become a consensus, with perhaps 40-50% of its space explored, and the space of data and environments beyond that is vast.
The next new paradigm, as mentioned by Teacher Tang, is autonomous learning and self-learning, because the theme of today's meeting is the outlook for the future. I think this is a particularly worthwhile topic to discuss.
Yao Shunyu: Autonomous learning is now a very popular term; in every corner of Silicon Valley, people are talking about it, forming a consensus. From my observation, everyone has a different definition and perspective on this. I will mention two points:
First, autonomous learning is not a single methodology; it is defined by the data and the tasks.
When we talk about autonomous learning, we have to ask: in what scenarios, and against what reward functions, is it being done?
Becoming increasingly personalized in your conversations is a form of autonomous learning; becoming more familiar with each company's unique environment or documentation while coding is a form of autonomous learning; exploring new sciences, and in this process, like a PhD, going from not understanding what organic chemistry is to becoming an expert in that field is also a form of autonomous learning. Each challenge or methodology of autonomous learning is quite different.
Second, is ChatGPT continuously adapting to user data to close the gap in conversational style a form of self-learning?
Today, Claude has already written 95% of the code for the Claude Code project; it is helping make itself better. Is that a form of self-learning?
Back in 2022 and 2023, when I was promoting this line of work in Silicon Valley, I wrote on the first page that the most important ingredient of ASI is autonomous learning. Today's AI systems essentially have two parts: a model, and a codebase for how you use that model, whether for inference or as an agent. The Claude system is the same.
One part is the large body of code around deployment, such as what the GPU environment looks like.
The other part is how the model is used: its frontend, its environment, and the code around that.
In other words, people may not have realized that these examples of autonomous learning are still confined to specific scenarios, and do not yet give a sense of great power.
This situation is already happening, and there are various issues related to efficiency or limitations. In my personal opinion, it resembles a gradual change.
Many people say we will see signals in 2026, but I believe signals were already visible in 2025.
Cursor learns from the latest user data every few hours, including for new models, and uses this real-world data for training. People may not find that particularly groundbreaking, because Cursor is limited by its lack of pre-training capability and its model is indeed not as good as OpenAI's, but it is clearly a signal.
The biggest problem is imagination. It is easy to imagine what the reinforcement-learning or reasoning paradigm looks like once realized: we could envision o1, where a math problem that used to score 10 points now scores 80, thanks to reinforcement learning with a very strong chain of thought.
But if in 2026 or 2027 there is a paradigm shift and I announce a new model or system that has achieved self-learning, what tasks should it be tested on, and what results must it achieve, for you to believe it is real?
Is it a profitable trading system that earns a lot of money, or something that truly solves scientific problems humans could not solve before? I think we first need to imagine what that looks like.
Lin Junyang: From a more practical perspective, the paradigm mentioned earlier is still in a relatively early stage. We have not fully explored reinforcement learning, and much potential remains untapped.
Today, we also see many issues arising in this area, and I believe similar problems exist globally.
If we talk about the next generation of paradigms, one is autonomous learning. I once discussed with a friend how humans interacting with an AI do not actually make it more powerful; the interaction only makes the context longer, and the AI becomes less and less sharp, which is quite frustrating.
Could it go the other way? That is worth pondering. Could generating more tokens make it stronger, the way I can finish a very difficult task if I genuinely work on it for 30 hours? Today it is hard for anyone to achieve such breakthroughs. Is it possible to realize this through coding?
From this perspective, AI certainly needs to evolve autonomously, but whether you need to update parameters is subjective; everyone has different technical means to achieve this.
The second point is whether AI can achieve stronger proactivity. The environment is my input signal. Currently, my AI needs human assistance to start, but is it possible for it to think autonomously and do some things? This raises a new issue: safety. I am very concerned about safety, not just worried about it saying things it shouldn't, but more about it doing things it shouldn't.
For example, today it might actively generate some ideas and throw a bomb into a conference hall. We certainly do not want unsafe things to happen. Just like raising a child, we need to instill some correct directions, but active learning is a very important paradigm
It may soon be possible for AI to train AI; I watch our colleagues doing this work every day, and I feel they could be replaced quite quickly.
A more continuous understanding of users is quite important. For example, in the past when we were building recommendation systems, user information was continuously inputted, making the system stronger and its algorithms simpler. In the era of AI, can it understand you better? Can this input of information truly become a tool to help us?
If we talk about autonomous learning, it might be achievable through interaction with people. But what metrics should we use to measure it? It's hard to say.
In the era of recommendations, the better you perform, the more others may click and buy. However, in the AI era, when it covers all aspects of human life, what the real metrics are, we don't really know. I feel that the bigger technical challenge today is that we don't know how to proceed, which may be a more worthy research question for us.
Many so-called technological breakthroughs are really a matter of observation: the technology develops linearly, and it is human perception that registers it as a leap.
That includes the emergence of ChatGPT; for those of us working on large models it was linear growth. Now everyone is working on Memory. Is this technology right or wrong?
Many approaches are neither right nor wrong, but judging by the results, at least for us, our own Memory knows what I have done in the past; yet merely remembering past events and calling my name each time does not really make the system seem smart.
Could Memory reach a critical point where, combined with everything it knows about you, it starts to resemble a person, the way people used to describe it in the movies? The moment it truly understands your memory is the moment human feeling suddenly bursts forth.
I think it will take at least a year; many times, technology doesn't develop that quickly.
Everyone is very competitive, with something new every day, but the technology itself develops linearly; it is the perceived value that looks exponential. For example, a slight improvement in coding ability can bring a great deal of productive value.
Every day, what we did yesterday already feels outdated, and the bugs are honestly too embarrassing to share. Even so, we have achieved these results; if algorithms and infrastructure can be combined better in the future, I think there is far greater potential.
Yang Qiang: I have always been working on federated learning, and the main idea of federated learning is that multiple centers collaborate.
I am increasingly seeing many places with insufficient local resources, but there are many privacy and security requirements for local data. So we can imagine that the capabilities of large models are becoming stronger, and how this general-purpose large model collaborates with local specialized small models or domain expert models is becoming increasingly possible.
For example, in the U.S., Zoom's AI system, developed under Xuedong Huang and others, has built a shared base that everyone can plug into. It can operate in a decentralized state, protecting privacy while communicating and collaborating effectively with general-purpose large models.
I think this open model is particularly good: one aspect is the open sharing of knowledge, the other is openness at the level of code and models.
Especially in scenarios like healthcare and finance, we will see more and more of this phenomenon occurring.
Tang Jie: I am very confident that there will be a significant paradigm shift this year. I won't go into too much detail, but as I mentioned earlier, continuous learning, memory, and even multimodality may all lead to new paradigm changes.
Why will such a paradigm emerge?
I think the industry has been running far ahead of academia. I remember when I returned to Tsinghua last year and the year before, chatting with many professors about whether we could build large models. Many professors first said they didn't have GPUs, or rather, the number of GPUs was almost zero.
The industry has 10,000 GPUs, while schools have 0 or 1, a factor of 10,000. But now, many schools already have a lot of GPUs, and many professors have started doing a lot of research related to large models. Many professors in Silicon Valley have also begun research on model architecture and continuous learning.
We used to think that the industry was dominating these areas, but I believe that by the end of 2025 to early 2026, this phenomenon will not exist much anymore. There may still be a factor of 10 difference, but it has already incubated seeds. I believe there is this innovative gene in academia, and there is this possibility, which is the first point.
Secondly, I believe breakthroughs emerge when something has received massive investment and its efficiency becomes the bottleneck. Investment in large models is already enormous, but the efficiency is not high. Of course, if we keep scaling, there will still be returns.
Data that in early 2025 might have been around 10T tokens is now around 30T, and could even be scaled to 100T. But after scaling to 100T, how much return do you get, and at what computation cost? That has become the problem. Without innovation, it could end up costing one or two billion while the returns stay small, which is not worthwhile.
On the other hand, for each new gain in intelligence we have to retrain a base model and retrain many reinforcement-learning models. When the RL paradigm took off in 2024, many people thought they could just keep training, and there were returns left on the table; today, continuing to train with RL may still yield returns, but not as significant ones. It remains an efficiency problem. Perhaps in the future we can distinguish two levers. The first is simply scaling: since we must scale up, scaling will yield returns and will certainly raise the ceiling of intelligence.
The second lever is to define intelligence efficiency: how much investment is required to achieve a given increment of intelligence. That efficiency has now become the bottleneck; whoever can achieve the same improvement in intelligence with a less demanding paradigm breaks through it.
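One hedged way to write down this notion of intelligence efficiency is as the capability gained per unit of compute or data spent to obtain it; the symbols below are our own shorthand, not a formula given in the talk:

```latex
\eta = \frac{\Delta I}{\Delta C},
\qquad \Delta I = \text{capability gain}, \quad \Delta C = \text{compute or tokens invested}.
```

Scaling raises capability by spending more of the denominator; an efficiency-driven paradigm raises the ratio itself.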
So I believe that there will definitely be such a paradigm shift in 2026. We are also working hard, and we hope it happens to us, but it may not necessarily be the case.
Q3: The Year of the Agent
Host: The third topic is about the Agent strategy, which is no longer just a Chat, but really about automating an entire day's or even a week's workflow. The year 2026 may be a key year for Agents to create economic value.
Shunyu has spent a lot of time researching Agents. What do you think about the possibility of Agents, such as Long Agent, really being able to perform 1-2 weeks of human work in 2026? How will you think about the Agent strategy, including the starting point from model companies?
Yao Shunyu: I think, as I just mentioned, toB and toC are quite different. Currently, it seems that the toB situation has reached a continuously rising curve, and there seems to be no sign of slowing down.
An interesting point is that there has not been much fundamental innovation; it mostly feels like pre-training has become larger. As long as pre-training keeps growing and post-training handles these real-world tasks well, the model gets smarter and brings greater value.
To some extent, doing toB makes all the goals more consistent. The higher the intelligence of the model, the more tasks it can solve, and the more tasks it solves, the greater the benefits it brings under toB.
The issue with toC is that, as we all know, DAU and other product metrics are often unrelated to model intelligence, or even inversely related. That is another important reason to focus on toB: as long as the model improves, the benefits increase, and everything stays aligned.
Right now, toB or productivity Agents are only just beginning. Beyond the model, there are two next problems: the environment problem and the deployment problem.
Before OpenAI, I interned at a company that was toB. I think working at a toB company has many rewards, and the biggest takeaway is that even if today's models do not improve and all model training stops, we can still deploy these models in various companies around the world, which can already bring 10x or 100x returns today, potentially impacting GDP by 5%-10%. However, today its impact on GDP is still less than 1%.
Additionally, I think education is very important. I observe that the gap between people is very large. More often than not, it is not that humans replace human jobs, but that those who can use these tools are replacing those who cannot. Just like when computers came out, if you turned around and learned programming while others continued to use a calculator and algorithms, the gap would be enormous.
The most meaningful thing that can be done in China today is better education: teaching everyone how to make better use of products like Claude or ChatGPT. Of course, Claude may not be usable in China, but we can use domestic models such as Kimi or those from KNOWLEDGE ATLAS.
Lin Junyang: This touches on product philosophy. Manus is indeed very successful, and whether wrapper companies are the future is a topic in itself. At this stage I tend to agree with the view that the model is the product.
I talked with TML, and they call it Research. I actually quite like this concept, and it matches how I see OpenAI: a lot of researchers can act as product managers and bring these things to life, including our own internal research, which can also produce things aimed at the real world.
I am willing to believe that the coming Agents can achieve what was just described, and this is strongly related to the proactive learning mentioned earlier. An Agent can run for such a long time only because it evolves along the way and decides for itself what to do, since the instructions it receives are very general. Our Agents have now become delegated Agents, rather than the kind I have to keep going back and forth with.
From this perspective, the requirements for the model are very high. The model is the Agent itself, and the Agent is the product itself. If they are integrated, then creating a foundational model today is essentially creating a product.
From this perspective, if we continuously enhance the upper limit of model capabilities, including scaling, it can indeed achieve this.
I think another point relates to environmental interaction. The environments we interact with are still not very complex; they are still computer environments. I have friends working on AI for Science: even something like AlphaFold has not yet reached that stage in what it ultimately delivers.
For instance, in terms of pharmaceuticals, even with today's AI, it may not help you that much because you need to conduct experimental trials to get feedback. Is it possible that in the future, AI environments become complex enough to mimic real human world environments, directing robots to conduct experimental trials to increase efficiency?
Currently, human efficiency is very low. We still need to hire many outsourced workers to conduct experiments in experimental environments. If we can reach this point, it might be what I envision as Agents being able to work for a long time, rather than just writing files on a computer. These tasks can be completed quickly this year, and in the next 3-5 years, this will become even more interesting. This may need to be combined with embodied intelligence.
The most interesting thing about building a general Agent is that the long tail deserves more attention; indeed the greater charm of AI today lies in the long tail. Under the Matthew effect, the head cases are relatively easy to solve.
When we were doing recommendation back then, recommendations were very concentrated on head products. We wanted to push tail items forward, and I suffered greatly for it: as someone working on multimodal systems, running into a recommender system and trying to fight the Matthew effect was basically a dead end.
Today, the so-called AGI is solving this problem: can you create a general agent that can address long-tail issues? Today, I have a user who has searched everywhere and really cannot find anyone to help solve this problem. But at that moment, I felt the power of AI; in any corner of the world, no one could help, but you could solve it for me. This is the greatest charm of AI.
Should you build a general Agent yourself? That is a matter of judgment. If you are an expert at building on top of models and can do it better than the model companies, go for it; if you do not have that confidence, this task may be better left to model companies building the model as the product, because when they hit a problem, all they need to do is train the model or spend some compute, and the problem might be solved.
The most interesting aspect of RL today is that fixing problems is easier than before.
Fixing problems used to be very hard. Let me give an example from an enterprise client. They said they wanted to do SFT themselves and asked whether I could tell them how to mix in general data. This always troubled us, because the other side was not really equipped to do SFT and their data was quite poor, though they might believe it was very useful.
But now with RL, even a very small amount of data, which does not even need labels, just a query, can be trained on a little, and merging it in becomes very easy. That may be the charm of today's technology.
Yang Qiang: The emergence of agents has four stages, along two dimensions.
One dimension is goal definition: is the goal defined by humans or automatically? The other is planning, the actions in between: is the plan defined by humans or by the AI?
We are currently at a fairly primitive stage, where the goals and much of the planning are defined by humans; today's agent software systems have moved only slightly beyond this. But I anticipate that in the future a large model will observe human work and, in particular, learn to make use of the data.
Ultimately, the goals can also be defined by large models, and planning can also be defined by large models, so agents should be a native system that is endogenous to large models.
Tang Jie: Several aspects determine the future trajectory of agents.
First, does the agent solve a real human problem, and how valuable is that problem? For example, the early GPTs let people create many agents, and you found that those agents were very simple; eventually it turned out a prompt solved the problem, and most of those agents gradually disappeared. So the first factor is how valuable the problem the agent solves is and whether it truly helps people.
Second, what is the cost of doing it? If the cost is particularly high, that is also a problem. As Junyang just said, perhaps calling an API could solve the problem; but conversely, if an API call can solve it, the API provider may consider it valuable and absorb it themselves, which creates a contradiction. The relationship between the foundation layer and the application layer is always in tension.
Finally, the speed of application development. If I have a time window, I can open up a six-month time window to quickly meet the application requirements. After six months, whether to iterate or how to proceed is also a consideration.
Right now, large models are largely competing on speed and timing. If our bet on coding is right, we can push further in this area; if we fail, six months can be lost. This year we have only done a little on Coding and Agents; our coding call volume is quite good, and I think it is more a question of direction. Agents are also a direction for the future.
Q4: Can China Surpass?
Host: The fourth question is, what is the probability that the world's leading AI companies in three to five years will be Chinese teams?
Yao Shunyu: I think the probability is quite high, and I remain optimistic. It seems that once something is discovered, it can be quickly replicated in China, often doing better in many aspects, including previous examples in manufacturing and electric vehicles.
I think there are a few key points. One is whether China can break through in lithography machines. If computing power ultimately becomes a bottleneck, can we solve the computing power issue?
Currently, we have a good advantage in electricity and infrastructure. The main bottlenecks are production capacity, including lithography machines, and the software ecosystem. If these issues are resolved, I believe it will be a significant help.
Another issue is whether there can be a more mature or better B2B market beyond the consumer market, or whether there are opportunities to compete in the international business environment.
Today, we see many productivity or B2B models or applications still emerging in the United States, as there is a stronger willingness to pay and a better culture. It is quite difficult to do this domestically, so everyone tends to choose to go overseas or internationalize; these are two significant objective factors.
More importantly, on a subjective level, I have been chatting with many people recently, and our feeling is that there are many very strong talents in China. Once something is proven to be doable, many people will actively try and want to do it better.
I think there may not be enough people in China willing to break through new paradigms or take very risky ventures. This involves economic, business, and cultural factors. If we can increase the number of people with entrepreneurial or adventurous spirit who genuinely want to explore the frontier or break through new paradigms, that would be beneficial.
Currently, once a paradigm has emerged, we can do better in specific areas with very few resources and high efficiency. The question is whether we can lead a new paradigm, and that may be the one issue China still needs to address today, because in business, industrial design, and engineering we are in many respects already doing better than the United States.
Research culture varies greatly from place to place; the differences between American labs may be even larger than those between Chinese and American labs, and the same is true among Chinese labs.
In China, people still prefer to do safer things. For example, pre-training has been proven to be feasible today, and this is indeed very difficult, with many technical issues to solve. However, once this is proven to be doable, we are confident that we can clarify this issue within a few months or a certain period.
But if today someone is asked to explore long-term memory or continuous learning, people do not know how to approach it or whether it can be achieved, making this quite challenging.
It may not just be that people prefer to do certain things and are reluctant to engage in innovative endeavors; a significant point is that cultural accumulation or overall cognition actually requires time to settle.
OpenAI started this in 2022, while domestically it began in 2023. There will be some differences in understanding this, or rather, China is not as advanced in this regard.
I think it may largely be a matter of time. When you have accumulated deeper culture or foundation, the subtlety of influence may affect how people work, but it is very nuanced and difficult to reflect through rankings.
China places more emphasis on rankings or numbers. For example, one of the strengths of DeepSeek is that they may not focus as much on the numbers in the rankings but rather on, first, what is the right thing to do;
Second, what you can personally experience as good or bad. I find this quite interesting: the Claude model may not rank highest on programming or software-engineering leaderboards, but everyone knows it is the best one to actually use. I think it is essential to break free from the constraints of these rankings and persist in what you believe is the right path.
Lin Junyang: U.S. compute may be one to two orders of magnitude larger than ours, and whether at OpenAI or elsewhere, a great deal of that compute is going into next-generation research. We are comparatively constrained today: delivery alone may already occupy the vast majority of our compute, and that is a significant difference.
Does innovation happen in the hands of the rich or of the poor? The poor are not without opportunities. We sometimes feel the rich are really wasting resources: they train so much, and much of it may not be useful. But if you are poor today, you are pushed toward things like joint optimization of algorithms and infrastructure; if you are really rich, there is little motivation to pursue that.
Going a step further, as Shunyu mentioned the lithography machine issue, there may be another point in the future: from the perspective of hardware-software co-design, is it really possible to create something new, for example to develop our next-generation model and the chip it runs on together?
Back in 2021, when I was working on large models, Alibaba was looking for me to predict whether this model would still be a Transformer three years later, or whether it would be multimodal. Why three years?
He said we need three years to tape out.
My response at the time was that three years later, I don't even know if I will still be at Alibaba!
But today I am still at Alibaba, the model is indeed still a Transformer, and it is indeed multimodal. I really regret not pushing him to do it back then. Our communication was like a chicken talking to a duck: he explained a lot of things to me and I completely didn't understand, and when I explained things to him he didn't know what we were doing either, so we missed that opportunity. Is there a possibility for this opportunity to come again? Although we are a group of poor people, could it be that poverty forces change, and that the opportunity for innovation appears here?
Today, our education is improving. I belong to the earlier part of the 90s, while Shunyu belongs to the later part of the 90s. Our team has many post-00s, and I feel that everyone's spirit of adventure is becoming stronger and stronger.
Americans naturally have a very strong spirit of adventure. A typical example is when electric vehicles first came out: even when the roof leaked and a drive could end in a fatal accident, many wealthy people were still willing to buy and drive them.
Today, everyone's spirit of adventure is starting to improve, and with the business environment in China also getting better, I think it is possible to bring about some innovation.
The probability is not that high, but it is indeed possible. I think it's about 20%, which is already very optimistic.
Today, if you are in this industry, you cannot be fearful; you must have a very strong mindset. For our mindset, being able to work in this industry is already quite good, and being able to work on large models is already very fortunate.
I think it still depends on what your original intention is. Shunyu just made the point that your model does not have to be the strongest for the consumer side to be fine. I look at it from another angle: what value does our model bring to human society? As long as I believe it brings real value and helps humanity, I can accept it even if it is not the strongest.
Yang Qiang: We can look back at the development of the internet; it started in the United States, but China quickly caught up, and applications like WeChat are number one in the world.
I think AI is a technology, not an end product, but China has many smart people who can push this technology to its limit in products, whether toB or toC. I am more optimistic about toC because of its diversity and the way Chinese teams brainstorm together; toB may face constraints such as willingness to pay and the pace of change in corporate culture.
I have also been observing discussions with some classmates from business schools about commercial directions. For example, there is a company in the U.S. called Palantir, whose philosophy is that no matter what stage AI is at, I can always find good applications for it in enterprises. There is definitely a gap in between, and we need to bridge it. They have a method called ontology, which uses an ontological approach.
I observed that the general idea is similar to what we did with transfer learning: applying a general solution to a specific practice, using an ontology for knowledge transfer. This method is very clever. In practice it is delivered through an engineering role, the Forward Deployed Engineer (FDE).
Anyway, I think this is very worth learning from. I believe that Chinese companies, especially AI Native companies, should develop such B2B solutions, and I believe they will. So I think B2C will definitely flourish, and B2B will catch up quickly.
Tang Jie: First of all, I think we must acknowledge that there is indeed a gap between China and the United States, especially in AI Labs in the corporate sector. This is the first point.
However, I believe that in the future, China is gradually getting better, especially with the post-90s and post-00s generations of enterprises, which are far better than before. Once, I said at a conference that our generation is the most unfortunate; the previous generation is still working, and we are also working, so we have not yet had our day in the sun. Unfortunately, the next generation has already emerged, and the world has been handed over to them, seamlessly skipping over our generation. This is a joke.
Perhaps the opportunities in China are:
First, a group of smart people are really willing to take special risks.
The post-00s generation, including the post-90s generation, has some individuals like Junyang, Kimi, and Shunyu who are very willing to take risks to do such things.
Second, our environment may be a bit better.
Whether it is the national environment, such as competition between large and small enterprises and the situation facing startups, or the business environment Junyang just mentioned, where he is still tied up with delivery, if we can improve this environment further and give a group of smart people who dare to take risks more time to innovate, for example giving Junyang more time to innovate, that would help. This is the second point, and perhaps something our government and country can help improve.
Third, returning to each of us, can we persist?
Are we willing to keep daring to take risks on a chosen path, now that the environment is reasonably good?
Of course, the environment will certainly never be the best, and we should never assume it is; we are lucky precisely because we get to live through an era in which the environment has gradually improved from not being so good.
We are the witnesses, perhaps the wealthiest in terms of experience and gains. If we stubbornly persist, maybe we will be the ones who reach the end.
Academician Zhang Bo: In the AI era, entrepreneurs will bear more missions.
After listening to this report, I have been thinking and feel that I can say a few more words. In fact, I am not qualified to say these words. First of all, I am much older than everyone else. Just now, Teacher Tang Jie mentioned how the next generation will replace him; I have long been replaced.
For enterprises, I am an outsider, but I recall what Teacher Yang Qiang mentioned earlier about Gödel's incompleteness theorem.
That is to say, it is difficult for people inside a system or circle to discover its problems and errors. As an outsider, my vantage point may allow me to see problems that you have not yet noticed.
During my break, I made a PPT. I didn't dare to start too early, as I didn't hear how everyone else was doing it, so I hesitated to begin.
First, let's address a question: what are we currently doing?
From everyone's introductions just now, we are all working on large language models. In fact, the initial goal was to create a chatbot, meaning we hope machines can communicate with humans. What has been the outcome of this effort? The result is that, under external prompts, machines can generate diverse, semantically coherent language similar to that of humans in open domains.
Does achieving this mean we have mastered human language? It can be said that we have, but it is not thorough enough. We find that there are many aspects where it differs from human language.
What can we do about this? What causes these differences? To what extent can we achieve this in the future? Ultimately, can machines understand their tasks like humans do, reflect on their issues, and possess consciousness? Philosophically, this is referred to as self-reflexivity.
Starting from this point, what principle do today's large language models actually use? They employ distributional semantics, which reduces meaning to what Firth described: a word's meaning is defined by the words that most frequently co-occur around it.
From this, we have the conditions to move from co-occurring words to learning semantics from co-occurrence. This is what we are currently doing: transforming the originally discrete, sparse space of co-occurring words in high dimensions into a dense vector space with geometric structure. This is a significant advance because it makes language computable: the original sparse co-occurrence space was not computable, but once it becomes a dense vector space it can be computed. The problem of language processing thus turns entirely into a mathematical computation problem.
It can be proven that as long as the amount of data you use is sufficient and the context is long enough, this space will exhibit semantic relationships. If we have enough data and sufficiently long texts, it will bring us closer to this.
Everyone is now working hard in this area; the longer the length, the better, and the more data, the better. We are now basically approaching semantic relationships. From this perspective, Professor Tang Jie also mentioned that to a certain extent, it is entirely possible to achieve understanding and self-reflexivity, as well as to reflect on one's own thoughts. In fact, this phenomenon has already been observed in large language models.
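As a toy illustration of this idea (not from the talk; the corpus, window size, and embedding dimension are arbitrary assumptions), the sketch below builds a word-by-context co-occurrence matrix and factors it with a truncated SVD into dense vectors, in which semantic similarity becomes a simple computation:

```python
# Co-occurring context words define meaning (Firth); factoring the sparse
# co-occurrence counts yields a dense vector space where similarity is computable.
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks rose as markets rallied",
]
window = 2

tokens = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(tokens)}
counts = np.zeros((len(tokens), len(tokens)))

# Build the sparse word-by-context co-occurrence matrix.
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                counts[idx[w], idx[words[j]]] += 1

# Truncated SVD turns the sparse count space into dense embeddings.
u, s, _ = np.linalg.svd(counts, full_matrices=False)
dim = 5
emb = u[:, :dim] * s[:dim]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(cosine(emb[idx["cat"]], emb[idx["dog"]]))     # words sharing contexts
print(cosine(emb[idx["cat"]], emb[idx["stocks"]]))  # words with disjoint contexts
```

Words that share contexts ("cat", "dog") end up closer in the dense space than words that do not, which is the computability the passage describes; modern models learn such representations implicitly rather than through an explicit SVD.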
Where does the problem currently lie? It lies in the fact that the model is an approximation, not the true model of human language. Why? Because the definition we use defines a word's meaning by its co-occurring words.
Regarding the definition of semantics, there are seven or eight different philosophical schools of thought in the world. We do not have a scientific definition of semantics, so the definitions currently used are incomplete and approximate. This tells us that even if you make this model very large, you are still operating under the semantic relationships defined by this semantics.
Therefore, many people have pointed out the strange phenomena that occur in these models and attributed them to issues with the semantics itself, which is not accurate.
Many things now are caused by model approximation, and our definition itself is very incomplete and inaccurate because science currently cannot find an accurate definition. This leads to five deficiencies:
The deficiency of reference, the deficiency of true knowledge and causality, the deficiency of pragmatics, the deficiency of polysemy and dynamic context, and the deficiency of closed-loop behavior.
These five deficiencies will inevitably affect your application of language models. So what we need to do now is to work on this.
Many reports have clearly stated that through the architecture and algorithms involved, we will help ourselves continuously approach this semantic relationship. However, this semantic relationship is the best we can achieve at present; it is impossible to obtain the accurate definitions we truly need. Therefore, these five deficiencies must exist.
Next, let's talk about another question: what do we need to do now?
In fact, what everyone is doing now is enabling LLMs, as agents, to execute complex tasks in real-world environments. When you take the language model into applications, many issues arise, and the further step everyone wants is to turn the language model into an agent capable of executing complex tasks.
Everyone uses a concept to frame this goal, called general artificial intelligence. In fact, there are many misunderstandings with this concept. Our goal is this, but to make it sound better, everyone says AGI because AGI is very attractive.
There is a misconception here; people think that doing AGI must be general. In fact, AGI does emphasize generality, but it is not the same as what we currently want to do. However, since everyone uses it this way, we have to use it this way too. Therefore, many definitions based on this goal are definitely not feasible and will lead to significant misunderstandings.
For example, Musk says that machines will be able to perform over 70% of the tasks that humans can do, and reach or exceed human levels. Such a definition is completely unexecutable and unverifiable, which will inevitably lead to many misunderstandings. Some people say it is easy to achieve, while others say it is impossible. Why?
Because this definition is very vague. What does it mean to reach human level? If a system exceeds humans on some measures, does that count as reaching human level? Some say it does, while others say it does not count at all, especially if robustness and other aspects are lacking. Therefore, I believe we must give an executable and verifiable definition.
I believe an executable and verifiable definition of AGI should meet the following five requirements and achieve the following five key capabilities. In fact, what everyone is doing now is these five things. Earlier, Professor Tang mentioned four levels, which actually include several levels here, but one is missing. I emphasize that the adjectives describing these issues are very important:
1. Spatiotemporally Consistent Multimodal Understanding and Implementation.
Everyone is working on this. Where is the key? The key is spatiotemporal consistency, which is a significant challenge. Everyone knows the timing of the modalities is not synchronized: video comes frame by frame, while text arrives a sentence at a time on a much coarser timescale, and aligning the two is very difficult. If you cannot align them, you cannot achieve multimodal understanding.
2. Controllable Online Learning and Adaptation.
In the past, we mainly focused on offline learning. As mentioned earlier, reinforcement learning (RL) is primarily about controllability. Teacher Tang mentioned verifiability, which relates to the controllability of reasoning.
The biggest issue with reinforcement learning is its uncontrollability. Although you have a goal, whether that goal can converge is uncertain; the entire learning process is uncontrollable. If the controllability issue is not resolved, the effectiveness of online learning will not be particularly good.
3. Verifiable Reasoning and Long-Term Execution and Planning.
Reasoning must be verifiable; in many large models a great deal of reasoning is unverifiable, which makes its correctness hard to judge (a minimal sketch of such a verifiable reward check appears below, after this list). Planning here mainly means long-term planning and execution, so these qualifiers are what capture the key issue.
4. Calibratable Reflection and Metacognition.
Currently, all reflections are based on feelings, lacking traceability, verifiability, and the ability to transform them into accurate signals.
5. Strong Cross-Task Generalization.
As we know, large language models generalize well across domains, but if we want them to execute real tasks we must address cross-task generalization. The biggest challenges here are out-of-distribution inputs, differing task structures, and long-tail generalization.
Therefore, I believe that if we set this as our goal, we will have an executable and verifiable definition. According to this definition, it should guide us forward.
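As promised above, here is a minimal sketch of what the "verifiable" part of RL with verification (RLVR) might look like, in the spirit of the verifiable-reasoning requirement: the reward comes from a programmatic checker rather than a human judgment. The answer-extraction regex and the toy tasks are assumptions for illustration, not any lab's actual pipeline.

```python
# Rewards with a clear right/wrong judgment, as RLVR requires.
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Reward 1.0 only if the last number in the output matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == reference_answer else 0.0

def code_reward(candidate_source: str, tests: str) -> float:
    """Reward 1.0 only if the candidate code passes the accompanying tests."""
    scope = {}
    try:
        exec(candidate_source, scope)   # define the candidate function
        exec(tests, scope)              # assertions raise on failure
        return 1.0
    except Exception:
        return 0.0

# Usage: such checkers give the RL loop an unambiguous signal to iterate against.
print(math_reward("... so the total is 42", "42"))                                # 1.0
print(code_reward("def add(a, b):\n    return a + b", "assert add(1, 2) == 3"))  # 1.0
```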
The next question is about forming such an entity, which is the Agent.
What’s the next step?
In fact, the several things we are currently doing are aimed at solving these five problems: multimodality, embodiment and interaction implementation, retrieval and evidence grounding, structured knowledge alignment, and tool and execution grounding.
Specifically, we are working on these tasks, all aimed at achieving the five capabilities mentioned earlier.
Fourth, what is our goal?
In the past, artificial intelligence was merely a tool. Now we are in a very contradictory state; on one hand, we hope AI can do more complex tasks, and on the other hand, we fear that AI will surpass us and become a new subject.
This has caused a lot of anxiety. In the past we had only one kind of subject, humans, and even that was hard to manage, because "human" is plural rather than singular and each subject has different demands. Now that subjects beyond humans are emerging, what should we do? How do we coexist with AI? How do we address these concerns?
In fact, the future subjects can be divided into three levels:
1. Functional-Action Subject.
We have already reached this level and hope it continues to develop, as it can assist us.
2. Normative-Responsibility Subject.
This level has not yet been achieved. The biggest challenge is how to ensure machines can also bear responsibility. This is what we hope to accomplish, but given the current situation it presents real difficulties, and the technical challenges are quite high. However, I believe everyone will strive to achieve it.
3. Experiential-Consciousness Subject.
What everyone fears the most is this: once machines have consciousness, what should we humans do?
If we are people in actual enterprises, we may not need to think too far ahead; we can focus on the first and second issues, but these two questions must be considered: alignment and governance.
The issue of alignment has been discussed a lot. Must machines align with humans? This is worth discussing. Not all humans are virtuous; among humans there is greed and deception, which machines did not originally have. Is aligning with humans therefore the highest standard? Not necessarily; we ourselves are part of the problem.
How do we govern? I believe the main governance is not about governing machines, but about governing humans, namely researchers and users.
This involves what responsibilities enterprises and entrepreneurs should bear in the era of artificial intelligence.
Fifth, Entrepreneurs in the AI Era.
Before the emergence of large language models, I strongly opposed my students starting businesses. Some students' parents sought my advice, and some shared my views, saying not to start enterprises. However, after the advent of large models, I believe the most outstanding students should engage in entrepreneurship.
Because artificial intelligence redefines what it means to be an entrepreneur. As mentioned earlier, artificial intelligence will define everything, including future entrepreneurs. Future entrepreneurs should possess six aspects of responsibilities.
I will briefly mention the redefinition of value creation. Artificial intelligence does not simply provide products and services; it transforms knowledge, ethics, and applications into reusable tools to benefit humanity. This completely changes the landscape, and artificial intelligence should be treated as a universal technology like water and electricity, handed over to humanity. Therefore, the requirements for entrepreneurs are high, including governance issues.
Entrepreneurs must also take on social responsibilities, so entrepreneurs in the AI era have many new missions.
In the AI era, entrepreneurs will become one of the honorable and sacred professions.

