Domestic Video Model Evaluation: ByteDance's Seedance 2.0 vs Kuaishou's Keling 3.0

Wallstreetcn
2026.02.11 02:01


Recently, Kuaishou and ByteDance released Keling 3.0 and Seedance 2.0 respectively, attracting significant market attention. Compared with the previous generation (Keling 2.6 and ByteDance's Seedance 1.5), both new models make breakthroughs in consistency, stability, and scene segmentation, but the biggest advance is video input. The previous generation supported only text-to-video, image-to-video, and some simple motion-control functions; the new generation can take an uploaded video and generate a new video based on its content, achieving multimodal input with video output and completing the last piece of the native-multimodality puzzle.

With Seedance 2.0's performance significantly improved over version 1.5, the market wants to know how to read the current competitive landscape of video models: what differentiated advantages does Keling hold over Seedance?

With these questions in mind, we conducted seven sets of tests covering two styles: animation and realistic live-action. We chose these two directions because, on one hand, current AI applications are concentrated in AI animated dramas; on the other hand, if AI can gradually penetrate real-human performance scenarios (live-action short dramas, mid-length videos, and longer videos and films), the entire market for AI video generation opens up. The core dynamic is that demand exists but the technology is not yet good enough, so penetration remains limited to animated-style content.

Here are our seven sets of test prompts and the capabilities we want to test:

Set 1: Makoto Shinkai style Japanese animation

Set 2: Cyberpunk city night scene

Set 3: Animated crying scene

Set 4: Real human crying scene

Set 5: Sports performance

Set 6: Video-to-Video

We uploaded a live-action video to each model and asked it to convert the footage into Disney animation style. Keling produced output, but the result was average: somewhat stiff and less vivid than its pure text-to-video generations, with the background music taken directly from the original video. Seedance failed to generate a result.

Set 7: Lip-Sync Ability

Model Positioning Differences and Pricing Comparison

Overall, Seedance focuses more on helping users tell a story, while Keling emphasizes professional-grade content production. Keling has a stronger cinematic quality, including lighting, detailed facial expressions, skin detail, motion control during running shots, and richer background details such as rain.

In terms of pricing: generating a 5-second 720P video costs about 4 RMB with Keling and about 2.3 RMB with Seedance, a little more than half Keling's price. For a 15-second video, Seedance's pricing advantage is even more apparent. However, Seedance does not yet support 1080P, so 1080P output currently requires Keling 3.0. Google's video model is priced far higher than the domestic ones, but it is currently the only model capable of generating 4K video and targets a different customer group.
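The cost gap implied by these figures can be sketched with a quick calculation. The 5-second prices come from the article; scaling linearly to 15 seconds is an illustrative assumption, not quoted pricing:

```python
# Prices from the article: ~4 RMB (Keling 3.0) vs ~2.3 RMB (Seedance 2.0)
# per 5-second 720P clip. Linear scaling with duration is assumed here
# purely for illustration.
KELING_PER_5S = 4.0      # RMB
SEEDANCE_PER_5S = 2.3    # RMB

def clip_cost(per_5s_price: float, seconds: int) -> float:
    """Cost in RMB, assuming price scales linearly with clip length."""
    return per_5s_price * seconds / 5

for seconds in (5, 15):
    k = clip_cost(KELING_PER_5S, seconds)
    s = clip_cost(SEEDANCE_PER_5S, seconds)
    print(f"{seconds}s: Keling {k:.2f} RMB vs Seedance {s:.2f} RMB "
          f"({s / k:.0%} of Keling's price)")
```

Under the linear assumption the ratio stays constant (~57.5%), so Seedance's advantage at 15 seconds shows up in absolute savings per clip rather than a lower percentage.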

Competitive Landscape

We tested multiple video models using the same prompt (a man running in the rain). The conclusion is that currently, Keling 3.0 and Seedance 2.0 are at the strongest level globally.

Alibaba's Wanxiang 2.6 leans cartoonish and lacks detail. MiniMax's Hailuo 2.3 generates relatively realistic videos, but cannot produce synchronized audio, so dubbing must be added in post. Google's Veo 3.1 has all the basic elements, but the characters look slightly off. OpenAI's Sora 2 performs poorly, with a noticeable game-engine feel reminiscent of "The Sims", possibly because it was trained on large amounts of game-engine data, leaving its generations less realistic.

In terms of pricing, several domestic companies charge around $0.40 for a 5-second video, while overseas models are far more expensive: Google charges about $5 for 5 seconds, and OpenAI's Sora 2 about $2.50 (Gemini subscribers and Sora users get a small daily generation quota).

Market Space and Growth Logic

Currently, the ARR (Annual Recurring Revenue) of the major AI video models is growing rapidly, on the order of 1-3x per year, and there has been no case of Company A's growth coming at the expense of Company B's revenue. As of January this year, the combined ARR of the major players is under 1 billion USD, still a very early, blue-ocean market. By comparison, OpenAI's ARR is 20 billion USD and Anthropic's is 9 billion USD, nearly 30 billion USD combined; the video model companies' combined ARR of under 1 billion USD underscores the gap.

From the downstream market perspective, the domestic box office revenue is about 40-60 billion RMB annually, while overseas it is around 10-20 billion USD. Additionally, considering social short videos, advertisements, and live-action short dramas, the current penetration rate of the AI video industry is still very low.

The market space for AI video models equals the sum, across vertical scenarios, of each scenario's market size multiplied by its AI penetration rate. The scenarios AI has unlocked so far are mainly AI comic dramas, where AI has been fully rolled out and has replaced much of the traditional animation production labor. In live-action short dramas, mid-to-long videos, and film, AI has not yet penetrated effectively, mainly because the technology is not mature enough: film, for example, demands higher resolution than today's 720P or 1080P output can deliver.
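The sizing logic here, total addressable market as the penetration-weighted sum over vertical scenarios, can be written out directly. All scenario figures below are hypothetical placeholders for illustration, not estimates from the article:

```python
# Market space = sum over scenarios of (scenario size x AI penetration rate).
# Every number below is a made-up placeholder, not sourced data.
scenarios = {
    # name: (market size, billions USD, assumed AI penetration rate)
    "comic dramas":             (2.0, 0.50),
    "advertisements":           (5.0, 0.10),
    "live-action short dramas": (8.0, 0.02),
    "film / long video":        (15.0, 0.00),  # blocked on resolution today
}

ai_market = sum(size * rate for size, rate in scenarios.values())
print(f"Implied AI video market: ~{ai_market:.2f}B USD")
```

The structure makes the article's point concrete: even large verticals contribute nothing until their penetration rate moves off zero, which is why maturing the technology for live-action and film is the main unlock.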

Moreover, AI video models will also go through a "supply creates demand" phase, much like AI programming: the ARR of AI coding companies such as Cursor, Lovable, and Replit is likewise growing several-fold per year, because the lowered barrier to programming lets product managers, salespeople, and even complete novices build software with AI. Video models will follow a similar logic.

As for the China-US gap: in text models it is roughly 3 to 6 months, but in video, China has already surpassed Veo 3.1 and Sora 2 (partly because those two models shipped earlier and have not been updated recently). In the short term the China-US gap in AI video is very small, and China has even taken the lead. The core reason is China's abundance of multimodal data: ByteDance and Kuaishou each sit on their own video-platform data (just as Google's Veo 3.1 leans on data from YouTube and Waymo), and the labeling of this data during training greatly helps video generation models.

Comic-drama market data show viewership growing fivefold from January to July last year, then another four to five times from July to December. The comic-drama market itself is expanding rapidly, and spending on AI production in this sector will rise accordingly.

According to estimates from the third-party agency Mayor Research, China's video production market is about 20 billion USD and the global market about 160 billion USD, spanning long, medium, and short videos, comics, and live action. The segments AI can currently penetrate are mainly comic dramas, plus some advertisements, KTV background videos, and other short content with low consistency requirements. By user scale, comic dramas reach over 100 million users, micro short dramas nearly 700 million, and long, short, and online video 800 million to 1 billion; the user base AI has actually reached remains comparatively small.

Technical Route Outlook

Video models trail text models by about a year, and world models trail video models by another year. 2023 brought chatbots, 2024 brought reasoning models, and by 2025 Agent, Coding, and multimodality all had working prototypes. In 2026 the text market will not be abandoned and remains an important direction, but AI Agent, Coding, and multimodality may see major changes this year (in technology, token consumption, and revenue).

For the upstream, multimodal scenarios drive token consumption, compute demand, and storage needs; for the downstream, comic dramas, live-action short dramas, and mid-to-long videos stand to benefit from technological iteration.

In terms of technical routes, the mainstream route for current video models is DiT (Diffusion + Transformer), with the Sora series and Veo 3 validating the feasibility of this route. However, it cannot be ruled out that some companies will explore autoregressive routes, which may outperform DiT in terms of the length of generated content but at a higher cost.

Additionally, multimodal may interact with world models. Google's recently released Genie project is a world model that can stably generate 1-2 minutes of content with better physical understanding capabilities. The team led by Fei-Fei Li also launched a commercial world model product at the end of last year, moving from the research phase to commercialization. We expect many new changes in the world model field this year.

Risk Warning and Disclaimer

Markets carry risk; invest with caution. This article does not constitute personal investment advice and does not take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinion, view, or conclusion herein fits their particular circumstances. Any investment made on this basis is at your own risk.