A 3D Counterpart to SORA Is Here! DreamTech Launches Direct3D, the World's First Native 3D-DiT Large Model

DreamTech has announced Direct3D, the world's first native 3D-DiT large model, which uses a 3D Diffusion Transformer (3D-DiT) to tackle the challenge of generating high-quality 3D content. Because it avoids the shortcomings of 2D-to-3D lifting, the technology shows the potential to produce high-quality, undistorted, defect-free, commercially usable 3D content. It is an important commercial innovation that will meet the demand for high-quality 3D content across business scenarios.

In May 2024, DreamTech announced its high-quality 3D large model Direct3D and released the related academic paper "Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer".

Direct3D is the first publicly released 3D large model to follow the native 3D generation route. It uses a 3D Diffusion Transformer (3D-DiT) to address the industry's long-standing difficulty in generating high-quality 3D content.

Staying the Course on the Native 3D Route and Breaking Through

Until now, the approach most commonly used in 3D AIGC has been 2D-to-3D lifting, which derives 3D models from 2D image models. Representative methods include Score Distillation Sampling (SDS), exemplified by Google's early DreamFusion, and the Large Reconstruction Model (LRM), exemplified by Adobe's Instant3D. Although 3D data has gradually been introduced into training to improve quality, the problems inherent in 2D-to-3D lifting, such as duplicated heads and faces, holes, and occlusion artifacts, make it hard for existing solutions to meet the requirements of commercial, general-purpose 3D generation.

Last year, some practitioners began to explore the native 3D route, which produces 3D models directly, without intermediate multi-view 2D images or multi-view iterative optimization. This route avoids the defects of 2D-to-3D lifting and shows the potential to yield high-quality, undistorted, defect-free, commercially usable 3D content. In principle, it has significant advantages over 2D-to-3D lifting. In practice, however, model training and algorithm development have faced many challenges, the most critical being:

Efficient 3D model representation: Images and videos are 2D/2.5D arrays that can be compressed directly into latent features, whereas 3D data has complex topology and a higher-dimensional representation. Compressing 3D data efficiently, and then analyzing and learning the distribution of the resulting 3D latent space, has long been a hard problem for practitioners.

Efficient 3D training architecture: The DiT architecture was first applied to image generation, with great success; Stable Diffusion 3 (SD3) and Hunyuan-DiT both adopt it. In video generation, OpenAI SORA used the DiT architecture to achieve results far superior to those of Runway and Pika. In 3D generation, however, complex topology and the variety of 3D representations mean that the original DiT architecture cannot be applied directly to 3D mesh generation.

High-quality, large-scale 3D training data: The quality and scale of the 3D training data directly determine the quality and generalization ability of the generative model. The industry generally believes that at least tens of millions of high-quality 3D assets are needed to train a large 3D model. Yet 3D data is extremely scarce worldwide. Large 3D datasets such as Objaverse-XL do exist, but the vast majority of their contents are low-quality, simple structures, and less than 5% is usable high-quality 3D data. Obtaining enough high-quality 3D data is a global challenge.
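To make the training-architecture challenge above more concrete, the following back-of-the-envelope sketch compares how many patch tokens a Transformer would see if a dense 3D grid were patchified the same way as a 2D image, and how the pairwise self-attention cost grows as a result. The resolution of 256, the patch size of 8, and the helper names are illustrative assumptions, not figures or code from DreamTech or the Direct3D paper.

```python
def num_tokens(resolution: int, patch: int, dims: int) -> int:
    """Patch tokens for a `dims`-dimensional grid with edge length `resolution`."""
    if resolution % patch != 0:
        raise ValueError("resolution must be divisible by patch")
    return (resolution // patch) ** dims


def attention_pairs(tokens: int) -> int:
    """Self-attention compares every token with every other token: tokens^2 pairs."""
    return tokens * tokens


if __name__ == "__main__":
    res, patch = 256, 8  # hypothetical values, chosen only for illustration
    tokens_2d = num_tokens(res, patch, dims=2)  # image-like input
    tokens_3d = num_tokens(res, patch, dims=3)  # naive dense voxel input
    print("2D tokens:", tokens_2d)  # 1024
    print("3D tokens:", tokens_3d)  # 32768
    growth = attention_pairs(tokens_3d) // attention_pairs(tokens_2d)
    print("attention cost growth:", growth)  # 1024
```

Under these assumptions, moving from a 2D grid to a naively patchified 3D grid multiplies the token count by 32 and the attention cost by about a thousand, which is one way to see why a compact 3D latent representation matters before a DiT can be trained at scale.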

To address the core challenges described above, DreamTech has proposed Direct3D, the world's first native 3D-DiT large model. Extensive experiments show that the quality of the 3D models Direct3D generates significantly surpasses that of current mainstream 2D-to-3D lifting methods, mainly thanks to the following three components:

D3D-VAE: Direct3D proposes a 3D VAE (Variational Auto-Encoder), similar in role to the one in OpenAI SORA, to extract latent features from 3D data. It reduces the representation complexity of 3D data from the original N^3 to a compact 3D latent space of n^2 (n << N), and its decoder network recovers the original 3D mesh almost losslessly. By working on 3D latent features, Direct3D cuts the compute and memory needed to train 3D-DiT by more than two orders of magnitude, which makes large-scale 3D-DiT training feasible.

D3D-DiT: Direct3D adopts the DiT architecture and improves on the original design by introducing semantic-level and pixel-level alignment modules for the input image, so that the output model aligns closely with any input image.

DreamTech 3D Data Engine: Direct3D is trained on a large amount of high-quality 3D data, most of it produced by DreamTech's self-developed data synthesis engine. The engine provides a fully automated pipeline for data cleaning, labeling, and related processing, and has accumulated more than 20 million high-quality 3D assets, supplying the last piece of the puzzle for putting native 3D algorithms into practice. It is worth noting that OpenAI experimented with millions of synthetic 3D assets when training Point-E and Shap-E in 2022 and 2023. Compared with OpenAI's data synthesis approach, DreamTech's synthesized 3D data is larger in scale and higher in quality.
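The N^3 to n^2 reduction mentioned in the D3D-VAE item above can be illustrated with simple arithmetic. The sketch below is only a rough estimate under stated assumptions: the values N = 256 and n = 32, and the optional `planes` and `channels` factors (included to show that a latent may carry several feature maps), are hypothetical and are not Direct3D's actual dimensions.

```python
def dense_size(N: int) -> int:
    """Number of cells in a dense N x N x N grid."""
    return N ** 3


def latent_size(n: int, planes: int = 1, channels: int = 1) -> int:
    """Number of values in a latent made of `planes` feature maps of size n x n.

    `planes` and `channels` are hypothetical knobs for illustration only.
    """
    return planes * channels * n * n


def compression_ratio(N: int, n: int, planes: int = 1, channels: int = 1) -> float:
    """How many times smaller the latent is than the dense grid."""
    return dense_size(N) / latent_size(n, planes, channels)


if __name__ == "__main__":
    # Bare N^3 versus n^2, as stated in the text.
    print(compression_ratio(256, 32))  # 16384.0
    # The same grid against a richer latent with 3 planes of 16 channels each.
    print(round(compression_ratio(256, 32, planes=3, channels=16), 1))  # 341.3
```

With these made-up numbers, the latent remains more than a hundred times smaller than the dense grid even after several feature channels are accounted for, which is the scale of saving the text describes.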

Adopting the DiT Architecture, Validating the Scaling Law in the 3D Field Again

In terms of technical architecture, Direct3D adopts a Diffusion Transformer (DiT) similar to that of OpenAI SORA. DiT is the most advanced architecture for large AIGC models. It combines the advantages of diffusion models and Transformers and scales well: given more data and more parameters, DiT can reach or even surpass human-level generation quality. Current real-world applications of DiT include Stable Diffusion 3 (Stability AI, February 2024) and Hunyuan-DiT (Tencent, May 2024) in image generation, and SORA (OpenAI, February 2024) in video generation. DreamTech's Direct3D is the world's first publicly available application of DiT to 3D content generation.

The DiT architecture follows the Scaling Law and has validated it repeatedly.

In large language models, the Scaling Law has been thoroughly validated: as the number of parameters and the amount of training data grow, model capability improves substantially. In image generation, the progression from SD1 with 0.8B parameters to SD3 with 8B parameters, and DALL-E 3 with 12B parameters, demonstrates the same effect. In video generation, SORA is speculated to differ from Runway, Pika, and similar systems mainly by switching the model architecture to DiT and increasing model parameters and training data by an order of magnitude; the result is a striking improvement in video resolution, video length, and generation quality.

In the 3D field, Direct3D-1B has shown the industry the first feasible native 3D-DiT architecture. Using its self-developed high-quality data synthesis engine, DreamTech has increased both the training data volume and the number of model parameters, and the generated results have improved steadily. DreamTech expects Direct3D (or architectures derived from it) to completely replace the existing LRM and SDS solutions in 3D generation. The team is now steadily scaling Direct3D up and plans to launch Direct3D-XL, with 15 billion parameters, by the end of the year, while increasing its high-quality 3D training data more than fivefold. This will be a milestone for 3D generation.

Commercial-level Quality Achieved in 3D Content Generation

With the introduction of Direct3D, the 3D generation field has taken a big step into the commercial era. Taking 3D printing as an example, models generated using SDS, LRM, and other technologies often face the following issues:

Distorted geometric structures with multiple heads and tails;

Many sharp burrs on the model;

Overly smooth surfaces lacking details;

Low mesh face count, unable to guarantee fine structures.

Because of these issues, models generated by earlier solutions cannot be printed directly on a 3D printer and require manual adjustment and repair. Direct3D follows the native 3D route and uses only 3D data in its training set, so the models it generates are closer in quality to the original 3D assets. It addresses the core issues of geometric structure, model accuracy, surface detail, and mesh face count. The precision of the models Direct3D generates already exceeds what household printers can reproduce; fully reproducing it requires higher-spec commercial or industrial printers.

Earlier solutions such as SDS and LRM were limited by how they represent 3D model features and generally produced meshes of about 50,000 to 200,000 faces, a range that is hard to push higher. Commercial use, however, often requires 1 to 5 million faces. Direct3D proposes a finer-grained 3D feature representation that places no upper limit on the face count of generated models, which can reach and exceed 10 million faces, meeting the requirements of a wide range of commercial scenarios.

As Direct3D's parameter count and training data grow, 3D generation can be applied to more industries, including the trillion-scale game and animation industries. By the end of 2025, 3D generation is expected to take over most modeling work in games, animation, and film, and to be widely adopted across industries.

Direct3D in Practice

Building on the Direct3D large model, DreamTech has launched two cutting-edge products that are currently open for testing.

The first is Animeit!, aimed at consumers. Animeit! converts any image or text input into a high-quality, anime-style 3D character. The generated characters come with skeleton joints for motion binding, and users can interact with their personalized 3D AI companions directly in Animeit! through dialogue and motion.

The anime characters generated by Animeit! are extremely precise, with clear facial contours and detailed hands in which individual fingers are distinct. This level of quality was previously out of reach for 3D generation technology, and the characters can now be used for MMD (MikuMikuDance) production in the anime community.

The second is a 3D content creation platform for creators. From a text description, users can obtain a high-quality 3D model within 1 minute, an experience similar to platforms like Midjourney, with no long refinement step. Users can also upload a single image and, after a short wait, receive a high-quality 3D model that faithfully reconstructs it.

About DreamTech

DreamTech is deeply rooted in 3D AI technology and is dedicated to improving the experience of AIGC creators and consumers worldwide with innovative products and services. The company's vision is to use advanced AI to create a seamless, real-time, interactive 4D spatiotemporal experience, and to achieve Artificial General Intelligence (AGI) by simulating the complexity and diversity of the real world.

DreamTech has gathered top AI talent from around the world. Its founding team includes academicians of two prestigious UK academies, national-level young talents, and high-level talents from Shenzhen. Core members graduated from world-renowned universities such as the University of Oxford, the Chinese University of Hong Kong, and the Hong Kong University of Science and Technology, and previously worked at industry-leading companies such as Apple, Tencent, and Baidu. Members of the founding team have also established several benchmark companies in the 3D field that were later acquired by industry giants such as Apple, Google, and Bosch.