Chinese version of Sora: KUAISHOU-WR Keling

On June 6, 2024, Kuaishou launched Kuaishou Keling, a large AI model for video generation. The model combines several self-developed technical innovations and can generate videos up to 2 minutes long at 30fps and 1080p resolution, in a variety of aspect ratios. Keling can simulate physical-world characteristics, render large yet plausible motion, and combine concepts imaginatively, turning users' ideas into concrete imagery. With these video generation capabilities, users can complete artistic video creation easily and efficiently.

Kuaishou Keling is an AI video generation model released by Kuaishou on June 6, 2024. Keling was developed independently by Kuaishou's AI team. It builds on Kuaishou's years of accumulated video technology, adopts a technology route similar to Sora's DiT, and combines it with several self-developed innovations; its results are comparable to Sora's.
From a technical perspective, Kuaishou Keling adopts a native text-to-video technology route in place of the combination of image generation plus a temporal module, which lets it generate long, high-frame-rate videos and handle complex motion accurately. In terms of completeness, it can simulate physical-world characteristics and render large-scale, plausible motion. In terms of innovation, it has strong conceptual combination ability and imagination, and can turn users' rich imagination into concrete imagery. In terms of specifications, it supports videos up to 2 minutes long at 30fps and 1080p resolution, as well as a variety of aspect ratios.
From a functional experience perspective, the Keling large model has powerful video generation capabilities that let users complete artistic video creation easily and efficiently. From a text description, videos generated by Keling can achieve:
1) Large-scale, plausible motion: Keling uses a 3D spatiotemporal joint attention mechanism to better model complex spatiotemporal motion, so it can generate content with large movements that still conform to the laws of motion.
2) Videos up to 2 minutes long: thanks to efficient training infrastructure, extreme inference optimization, and scalable infrastructure, the Keling large model can generate videos up to 2 minutes long at 30fps.
3) Simulation of physical-world characteristics: based on a self-developed model architecture and the modeling capability brought out by scaling laws, Keling can simulate real-world physical characteristics and generate videos that conform to physical laws.
4) Strong conceptual combination ability: based on a deep understanding of text-video semantics and the capability of the Diffusion Transformer architecture, Keling can turn users' rich imagination into concrete imagery, creating a fictional yet realistic world.
5) Movie-level image generation: based on a self-developed 3D VAE, Keling can generate movie-level videos at 1080p resolution, vividly presenting both vast, grand scenes and delicate close-ups.
6) Free choice of output aspect ratio: Keling adopts a variable-resolution training strategy, so the same content can be output in various aspect ratios at inference time, meeting the needs of a wider range of video use cases.
From an industry perspective, Kuaishou, as a leading short-video company, is actively investing in AI. The product's performance demonstrates both Kuaishou's deep accumulation in large AI model technology and the technical capabilities of domestic AI video models. We are optimistic about the continued iteration of AI technology, the accelerating technological catch-up of Chinese vendors, faster development and commercial exploration of AI video applications, and lower barriers to content creation.

I. Product Introduction

Kuaishou Keling is an AI video generation model released by Kuaishou on June 6, 2024. Keling was developed independently by Kuaishou's AI team. It builds on Kuaishou's years of accumulated video technology, adopts a technology route similar to Sora's DiT, and combines it with several self-developed innovations. Its results are comparable to Sora's, and it offers a series of advantages: 1) it can generate large-scale, plausible motion; 2) it can simulate physical-world characteristics; 3) it has strong conceptual combination ability and imagination; 4) its videos reach 1080p resolution and up to 2 minutes in length (30fps), with free choice of aspect ratio. (Some functions are not yet open for external testing.)

II. Functional Experience

According to the Keling official website, the Keling large model has powerful video generation capabilities that let users complete artistic video creation easily and efficiently. From a text description, videos generated by Keling can achieve:

（I）Large-Scale, Plausible Motion

Keling uses a 3D spatiotemporal joint attention mechanism to better model complex spatiotemporal motion, generating video content with large movements that still conform to the laws of motion. Complex, large-scale spatiotemporal motion can be depicted accurately.

Figure 1: An astronaut running on the moon. The motion is smooth, and the movement of the steps and shadows is plausible.

Source: Kuaishou Keling Official Website
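
To make the idea of joint spatiotemporal attention more concrete, the sketch below shows one common reading of the term: every latent token attends to every other token across both space and time in a single attention pass, instead of running spatial and temporal attention separately. This is an illustration only, not Kuaishou's implementation. The function name, tensor sizes, and random projection weights are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_spacetime_attention(latents, w_q, w_k, w_v):
    """Single-head self-attention over all T*H*W tokens at once.

    latents: array of shape (T, H, W, C), a hypothetical video latent.
    Returns the attended latents and the (T*H*W, T*H*W) attention weights.
    """
    t, h, w, c = latents.shape
    # Flatten time and space into one token axis, so a token in frame 0
    # can attend directly to a token at any position in any other frame.
    tokens = latents.reshape(t * h * w, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    out = weights @ v
    return out.reshape(t, h, w, -1), weights

# Toy sizes chosen for illustration (assumptions, not Keling's real shapes).
rng = np.random.default_rng(0)
T, H, W, C = 4, 3, 3, 8
latents = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

out, weights = joint_spacetime_attention(latents, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The attention matrix grows with the square of T*H*W, which is one reason long, high-resolution generation depends heavily on latent compression and inference optimization.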

（II）Video Generation up to 2 Minutes

Thanks to efficient training infrastructure, extreme inference optimization, and scalable infrastructure, the Keling large model can generate videos up to 2 minutes long at 30fps.

Figure 2: The prompt "A little boy riding a bike in the garden experiencing the changing seasons of autumn, winter, spring, and summer" alone is enough to generate a video one and a half minutes long.

Source: Kuaishou Keling Official Website
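
For a sense of scale, the short calculation below works out how many frames and raw pixels a maximum-length clip contains. It assumes that 1080p means a 1920x1080 (16:9) frame; the article gives only "1080p", so the exact frame size is an assumption, and the helper name is made up for the example.

```python
def clip_stats(duration_s, fps, width, height):
    """Return frame and pixel counts for an uncompressed clip."""
    frames = duration_s * fps
    pixels_per_frame = width * height
    return {
        "frames": frames,
        "pixels_per_frame": pixels_per_frame,
        "total_pixels": frames * pixels_per_frame,
    }

# 2 minutes at 30fps; 1920x1080 is an assumed frame size for "1080p".
stats = clip_stats(duration_s=120, fps=30, width=1920, height=1080)
print(stats["frames"])            # 3600
print(stats["pixels_per_frame"])  # 2073600
print(stats["total_pixels"])      # 7464960000
```

A full-length clip is therefore 3,600 frames, or roughly 7.5 billion pixel positions before any compression.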

（III）Simulating Physical-World Characteristics

Based on a self-developed model architecture and the modeling capability brought out by scaling laws, Keling can simulate the physical characteristics of the real world and generate videos that adhere to physical laws.

Figure 3: In the generated video of a little boy eating a hamburger, the eating process is rendered realistically: the position of each bite matches the bite marks on the hamburger, and the facial muscle movements are lifelike.

Source: Kuaishou Keling Official Website

（IV）Strong Conceptual Combination Ability

Based on a deep understanding of text-video semantics and the capability of the Diffusion Transformer architecture, Kuaishou Keling can turn users' rich imagination into concrete imagery, creating a fictional yet realistic world.

Figure 4: An imaginative scene of a cat driving a car, vividly rendered.

Source: Kuaishou Keling Official Website

（V）Movie-level Image Generation

Based on a self-developed 3D VAE, Kuaishou Keling can generate movie-level videos at 1080p resolution, vividly presenting both vast, grand scenes and delicate close-ups.

Figure 5: A generated video at a resolution of up to 1080p.

Source: Kuaishou Keling Official Website

（VI）Support for Free Output Video Aspect Ratios

Kuaishou Keling adopts a variable-resolution training strategy, so the same content can be output in multiple aspect ratios at inference time, meeting the needs of a wider range of video use cases.

Figure 6: The same video output freely in various aspect ratios, including Kuaishou's native vertical format.

Source: Kuaishou Keling Official Website
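
As a rough illustration of what supporting several aspect ratios can involve, the sketch below picks a width and height for a requested ratio while keeping the pixel count at or below a fixed budget. This is a generic bucketing approach, not Keling's documented method. The function name, the 1920x1080 pixel budget, and the rule that sides are multiples of 16 are all assumptions made for the example.

```python
import math

def resolution_for_aspect(aspect_w, aspect_h,
                          pixel_budget=1920 * 1080, multiple=16):
    """Pick (width, height) near a pixel budget for a given aspect ratio.

    Each side is rounded down to a multiple of `multiple` (an assumed
    patch-alignment constraint), so the result never exceeds the budget.
    """
    ratio = aspect_w / aspect_h
    ideal_height = math.sqrt(pixel_budget / ratio)
    ideal_width = ideal_height * ratio

    def snap_down(x):
        return max(multiple, int(x // multiple) * multiple)

    return snap_down(ideal_width), snap_down(ideal_height)

# Landscape, square, and vertical outputs under the same pixel budget.
for aw, ah in [(16, 9), (1, 1), (9, 16)]:
    w, h = resolution_for_aspect(aw, ah)
    print(f"{aw}:{ah} -> {w}x{h}")
```

Keeping the pixel budget roughly constant means each aspect ratio costs about the same to generate, which is one plausible reason to group training samples into resolution buckets of this kind.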

Author: Liu Xin, Huachuang Securities. Source: Huachuang Securities. Original title: "Kuaishou Keling: China's First DiT-Architecture Video AI Large Model Benchmarked Against Sora"