A 10,000-word analysis: Will "end-to-end" bring a qualitative change to Tesla's FSD V12?

Wallstreetcn
2024.08.15 12:28

Tesla's FSD V12, built on an "end-to-end" model, has undergone a comprehensive update and is expected to adapt to China's road conditions within six months to a year. Despite Musk's confidence in Tesla's lead in autonomous driving, the rollout of FSD has been delayed multiple times. The latest version has drawn positive reviews from the industry, with executives from NVIDIA and Dell praising FSD V12 as revolutionary and saying it performs like a human driver, a significant technological advance.

During the Q4 2022 earnings call, Musk confidently declared that Tesla was the absolute leader in autonomous driving: "You can't even find the second place with a telescope." At that point, Tesla's full self-driving had already been delayed for six years, and The Wall Street Journal tactfully expressed its disbelief in Musk...

One year later, in early 2024, Tesla began rolling out FSD V12 to a limited set of users, and in March of the same year renamed FSD Beta to FSD Supervised. Ashok Elluswamy, head of Tesla's Autopilot team, posted on X (formerly Twitter) that, thanks to "end-to-end" training, FSD V12 had surpassed everything accumulated in V11 within just a few months.

Figure 1. Ashok Elluswamy's post on X (formerly Twitter)

At the same time, the launch of FSD V12 quickly drew positive responses from the industry. NVIDIA CEO Jensen Huang offered high praise: "Tesla is far ahead in autonomous driving. The truly revolutionary aspect of version 12 of Tesla's Full Self-Driving is that it is an end-to-end generative model."

Michael Dell (Chairman and CEO of Dell Technologies) stated on X, "The new V12 version is impressive, just like a human driver." Brad Porter (former CTO of Scale AI and former Vice President of Amazon Robotics) also commented, "FSD V12 is like the arrival of ChatGPT 3.5: it's not perfect, but it is impressive. You can see that it's completely different. I'm eagerly looking forward to its evolution toward GPT-4."

Even He Xiaopeng, Chairman of XPeng Motors, who has had a tense relationship with Tesla in the past, wrote on Weibo after test-driving FSD V12: "FSD V12.3.6 performs extremely well; I need to learn from it." He added, "This year's FSD and Tesla's previous autonomous driving capabilities are completely different. I highly appreciate it."

Figure 2. NVIDIA CEO Jensen Huang expressing that Tesla's autonomous driving is far ahead during an interview

What kind of changes allowed FSD V12 to surpass years of accumulation in just a few months? All of this can be attributed to the introduction of "end-to-end". To systematically understand the sweeping changes between Tesla's FSD V12 and its previous versions, we need to start from the basic framework of autonomous driving and the history of FSD. So that everyone can get something out of this article, I have tried to keep it professional while explaining the basic concepts of autonomous driving and the evolution of FSD V12 simply enough that even readers without any background knowledge can follow along.

After reading this article, you will have a clear understanding of the concepts the autonomous driving industry currently treats as consensus, such as "end-to-end", as well as the once-popular concepts of "modularization", "BEV (bird's-eye view) + Transformer", the "Occupancy Network", and more.

Furthermore, you will learn why Tesla's V12 is groundbreaking, why a "ChatGPT moment" for autonomous driving may be approaching, and you will be able to form a preliminary judgment about the current stage of the industry's development.

The article is a bit long, but after patiently reading it, you will definitely gain something.

1. Introduction to Autonomous Driving: From Modularization to End-to-End

1.1 Autonomous Driving Levels

Before we dive in, we need an overview of the framework of autonomous driving. The widely accepted classification of autonomous driving levels is based on the SAE (Society of Automotive Engineers) standard, which ranges from L0 to L5, six levels in total. As the level increases, the driver's need to take over in emergencies decreases and the autonomous driving system's capabilities become more comprehensive. At L4 and L5, the driver no longer needs to take over at all (in theory, at these two stages the steering wheel and pedals do not even need to be installed).

Figure 3. SAE J3016 Autonomous Driving Levels

L0: No Automation

L1: "Driver Assistance" (Partial Automation)

L2: "Partial Automation" (Partial Hands-Free) Current Development Stage

L3: "Conditional Automation" (Partial Eyes-Free) Current Development Stage

L4: "High Automation" (Driver's Brain-Free)

L5: "Full Automation" (Driverless)

1.2 Autonomous Driving Design Concepts: Modularization vs End-to-End

After understanding the basic framework of autonomous driving levels, we need to further understand how vehicles achieve autonomous driving. The design concepts of autonomous driving can be divided into two categories: traditional modular design and end-to-end design.

Under Tesla's influence, in 2023 end-to-end autonomous driving gradually became the consensus in both industry and academia. (UniAD, winner of the Best Paper Award at CVPR 2023, adopts an end-to-end design, reflecting the academic community's recognition of this concept; in industry, following Tesla, many intelligent-driving companies such as Huawei, Li Auto, XPeng, and NIO have successively embraced end-to-end, reflecting the industry's recognition as well.)

1.2.1 Modularization

Figure 4. Simplified diagram of modular architecture

Before comparing the advantages and disadvantages of the two design concepts, let's first break down what modular design is: it consists of perception, decision-making and planning, and execution/control modules (as shown in Figure 4). Researchers can adjust the parameters of each module to adapt the vehicle to various scenarios.

Perception Module: responsible for collecting and interpreting information about the vehicle's surroundings, detecting and identifying nearby objects (such as other traffic participants, traffic lights, and road signs) through various sensors (cameras, LiDAR, millimeter-wave radar, and so on). The perception module is the core of autonomous driving, and most of the technological iteration before end-to-end arrived was focused on it. The goal is to bring the vehicle's perception up to human level, so that your car notices red lights, merging vehicles, or even a dog on the road just as you do when driving.

Note: positioning also feeds information to the perception side. For example, some companies use high-precision maps to determine the vehicle's precise location in the environment (but high-precision maps are costly and hard to keep accurate, which makes them difficult to roll out widely).

Decision-Making and Planning Module: based on the output of the perception module, it predicts the behavior and intentions of other traffic participants and formulates the vehicle's driving strategy so that the vehicle reaches its destination safely, efficiently, and comfortably. This module is like the vehicle's brain (its frontal lobe), constantly working out the best driving path: when to overtake or change lanes, whether to let a merging vehicle in, whether to go or stop at a traffic light, whether to pass a delivery rider occupying the lane, and so on. In this part, the vehicle makes decisions according to coded rules. For example, if the vehicle's code includes instructions to stop at red lights and yield to pedestrians, then in the corresponding scenario our car will decide and plan according to those pre-written rules. However, if a situation arises that the rules do not cover, our car will not know how to respond.

Control Module: executes the driving strategy output by the decision-making module, controlling the vehicle's throttle, brakes, and steering. If the decision-making module is the brain's strategist, the control module is the soldier who follows orders and "hits where directed".

Figure 4. Detailed modular architecture diagram. Source: Guosen Securities
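To make the hand-off between the three modules concrete, here is a minimal Python sketch of a modular pipeline. All class names, thresholds, and the placeholder perception output are invented for illustration; this is not Tesla's or any vendor's actual code.

```python
from dataclasses import dataclass
from typing import List

# Illustrative types only; real perception output is far richer than this.
@dataclass
class DetectedObject:
    kind: str            # e.g. "traffic_light_red", "pedestrian"
    distance_m: float

@dataclass
class DrivingPlan:
    target_speed_mps: float
    steering_angle_deg: float

def perceive(camera_frames, lidar_points) -> List[DetectedObject]:
    """Perception module: turn raw sensor data into a list of detected objects.
    In reality this is where the detection/segmentation networks run."""
    return [DetectedObject(kind="traffic_light_red", distance_m=35.0)]  # placeholder output

def plan(objects: List[DetectedObject]) -> DrivingPlan:
    """Decision/planning module: hand-written rules over the perception output."""
    if any(o.kind == "traffic_light_red" and o.distance_m < 50 for o in objects):
        return DrivingPlan(target_speed_mps=0.0, steering_angle_deg=0.0)   # stop for red lights
    return DrivingPlan(target_speed_mps=13.9, steering_angle_deg=0.0)      # otherwise cruise ~50 km/h

def control(p: DrivingPlan) -> dict:
    """Control module: translate the plan into throttle/brake/steering commands."""
    braking = p.target_speed_mps == 0.0
    return {"throttle": 0.0 if braking else 0.3,
            "brake": 1.0 if braking else 0.0,
            "steering": p.steering_angle_deg}

# Information flows one way through the chain; each hand-off can lose detail.
commands = control(plan(perceive(camera_frames=None, lidar_points=None)))
```

Note how each module only sees the summary produced by the previous one, which is exactly where the "telephone game" losses discussed below come from.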

Pros and Cons of Modularization

Pros: Explainable, verifiable, easy to debug

■ Because each module is relatively independent, when our vehicle runs into a problem we can trace it back to the module where it occurred, and then we only need to adjust the corresponding parameters in the existing code rules. In simple terms: if our autonomous vehicle brakes too hard when another vehicle merges in, we only need to adjust how the vehicle's speed and acceleration should change under merging conditions.

Cons: information loss during transmission, inefficiency from stacking many tasks, compounding errors, and rules that cannot be exhaustive, leading to high construction and maintenance costs.

■ Information loss during transmission: as sensor information passes from the perception module through several intermediate links before reaching the control module, some loss is inevitable. Besides the drop in efficiency, the information itself degrades; as in a game of telephone, the "hello" said by the first person may turn into something completely unrelated by the time it reaches the last person.

Figure 5. Telephone Game Illustration

■ Rules cannot be exhaustive, so construction and maintenance costs are high: once you understand the basic logic of modularization, you can see that it is rule-based. Every decision a vehicle makes on the road comes from a set of rules, and behind those rules are lines of code. Programmers write the rules of the road in code ahead of time, and in a given situation the vehicle traverses the options allowed by those rules, selects the optimal one, makes a decision, and acts accordingly.

At this point, you might think it is simple enough to write rules like "stop at red, go at green" directly into the system. But engineers cannot account for every situation on the road, because the real physical world keeps changing, with countless permutations and combinations. We can only anticipate the common cases and write them into the rules, yet rare extreme events do happen (such as a monkey suddenly appearing on the road and fighting with a person). Relying on piling up rules in code can therefore only end in a sigh of "human effort has its limits."
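To make the "stacking rules in code" problem concrete, here is a toy rule table of the kind such a planner traverses. The scenario keys and actions are made up for illustration; the point is only that anything the engineers never anticipated falls through to a default.

```python
def rule_based_decision(scene: dict) -> str:
    """A toy rule table. Every branch below had to be anticipated and written by an engineer."""
    if scene.get("traffic_light") == "red":
        return "stop"
    if scene.get("traffic_light") == "green":
        return "go"
    if scene.get("pedestrian_crossing"):
        return "yield"
    if scene.get("vehicle_merging"):
        return "slow_down"
    # Anything the rules never anticipated ends up here:
    # the vehicle has no idea what to do and falls back to a safe stop.
    return "freeze"

print(rule_based_decision({"traffic_light": "green"}))                 # -> "go"
print(rule_based_decision({"monkey_fighting_person_on_road": True}))   # -> "freeze"
```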

1.2.2 End-to-End

After discussing modularization, let's look at what the industry currently recognizes as end-to-end. End-to-end means that information enters at one end and exits at the other, without modules passing data back and forth; everything is done in one pass.

It is based on a unified neural network that goes from raw sensor data input directly to control command output in one continuous process of learning and decision-making, with no explicit intermediate representations or hand-designed modules, eliminating the need for engineers to write endless code. Another core idea is lossless information transmission (what used to be a game of telephone becomes direct, face-to-face communication).
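As a rough illustration of "sensor data in one end, controls out the other", here is a minimal PyTorch-style sketch (assuming PyTorch; Tesla has not published its architecture, and every layer size here is arbitrary).

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    """One network from raw camera pixels to control commands. Purely illustrative."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(             # "eyes": compress the image
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
        )
        self.head = nn.Sequential(                # "brain + hands": decide and act
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 3),                    # [steering, throttle, brake]
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(image))

frame = torch.rand(1, 3, 240, 320)                # one camera frame (batch, C, H, W)
controls = EndToEndDriver()(frame)                # -> tensor of shape (1, 3)
```

There is no hand-written rule anywhere in the path from pixels to controls; everything the "driver" knows has to come from training data.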

Figure 6. Simplified illustration of modularization vs. end-to-end architecture

Let me give two examples of the difference between modularization and end-to-end. Under the modular design concept, a vehicle is like a student driver at a driving school with no autonomy and no inclination to learn by imitation. It does exactly what the instructor says (the written code rules): if the instructor says to stop at a red light or give way to pedestrians, it follows the instruction. But if it encounters something the instructor never mentioned, it simply freezes and does not know what to do (like the "Shaoluobo" robotaxis in Wuhan).

Under the end-to-end design concept, by contrast, a vehicle is like a student driver with autonomy and the ability to learn by imitation. It learns by watching how others drive. At first it knows nothing, but it is a quick study: after watching millions of videos of excellent veteran drivers, it gradually becomes a real veteran itself. Its performance can be summed up in one word: "steady"!

Figure 7. Modularization vs. end-to-end. Source: Li, Xin, et al., "Towards Knowledge-Driven Autonomous Driving"; Huaxin Securities Research

As shown in Figure 7, a vehicle built on the modular, rule-driven design concept hits its ceiling after "college" and cannot advance further, while the data-driven end-to-end approach (the veteran-driver videos shown to the vehicle are the data) may start out at "elementary school" but has strong capacity for growth and learning (reinforcement learning and imitation learning) and can quickly advance to a "doctoral" level.
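The simplest form of the imitation learning mentioned here is behavior cloning: show the network what an expert did and train it to reproduce those controls. Below is a minimal sketch with made-up data and a deliberately tiny policy network; real systems train far larger models on millions of curated clips.

```python
import torch
import torch.nn as nn

# A deliberately tiny stand-in policy network; in practice this would be
# a large vision model like the EndToEndDriver sketched earlier.
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Fake "veteran driver" data: camera frames plus the controls the human applied.
frames = torch.rand(256, 3, 64, 64)
expert_controls = torch.rand(256, 3)          # [steering, throttle, brake] per frame

for step in range(100):                       # behavior cloning: imitate the expert
    idx = torch.randint(0, 256, (32,))
    pred = policy(frames[idx])
    loss = loss_fn(pred, expert_controls[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```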

(As Richard Yu put it, "FSD has a low floor and a high ceiling"; but as long as you have enough data and feed it enough veteran-driver videos, it will not stay at a low level for long.)

Of course, there is still controversy over the basic definition of end-to-end. "Technical purists" argue that much of what companies promote as "end-to-end" is not truly end-to-end (for example, modular end-to-end). They believe true end-to-end should be global end-to-end, where every step from sensor input to final control output is traceable end to end and can be optimized globally.

On the other hand, "pragmatists" believe that as long as the basic principles are met and can improve the performance of autonomous vehicles, it is sufficient.

Three Major Divisions of End-to-End

Some readers may be confused at this point: end-to-end also comes in different flavors? Yes. Currently, end-to-end can be divided into three main categories (there are several ways to divide it; for ease of understanding, this article uses the division presented at NVIDIA's GTC conference). As shown in Figure 8, these are explicit end-to-end, implicit end-to-end, and end-to-end based on large language models.

Explicit End-to-End

Explicit end-to-end autonomous driving replaces the original algorithm modules with neural networks and chains them together into an end-to-end algorithm. The algorithm still contains visible modules that can output intermediate results, so when troubleshooting it can be treated as a partial white box for adjustment. Engineers no longer need to write rules line by line, and the decision-making and planning module shifts from hand-written rules to deep-learning-based patterns.

This may sound abstract. In simple terms, it is end-to-end but not completely end-to-end (also known as modular end-to-end). The "white box" is defined in contrast to the black box; I will explain the implicit end-to-end part later using the student-driver example, so if this is unclear you can skip ahead.

The UniAD model, which won the Best Paper Award at CVPR 2023, adopts explicit end-to-end. As shown in the figure below, the various perception, prediction, and planning modules are connected by vectors.
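To sketch the idea (these are not UniAD's actual interfaces, and the dimensions are arbitrary): each stage remains a separate neural module, but the stages exchange learned feature vectors rather than hand-written data structures, so a single loss can be back-propagated through the whole chain while each intermediate output stays inspectable.

```python
import torch
import torch.nn as nn

class PerceptionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(512, 256)        # image features -> scene features
    def forward(self, x): return torch.relu(self.net(x))

class PredictionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(256, 128)        # scene features -> agent motion features
    def forward(self, x): return torch.relu(self.net(x))

class PlanningHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(128, 3)          # motion features -> [steer, throttle, brake]
    def forward(self, x): return self.net(x)

perception, prediction, planning = PerceptionHead(), PredictionHead(), PlanningHead()

image_features = torch.rand(1, 512)           # pretend backbone output
controls = planning(prediction(perception(image_features)))

# Because every stage is a neural module, a single loss on `controls` back-propagates
# through planning, prediction, and perception together, while each module's
# intermediate output can still be inspected (the "white box" property).
loss = controls.abs().sum()
loss.backward()
```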

Note: explicit end-to-end needs to be understood in conjunction with implicit end-to-end; explicit end-to-end can be further divided into perception end-to-end and decision/planning end-to-end.

Implicit End-to-End

An implicit end-to-end algorithm builds a single integrated model that takes the massive external environment data received by the sensors, ignores the intermediate steps, and trains directly against the final control signal. "Technical purists" hold that end-to-end like the one shown in Figure 9, where sensor information goes in one end and control signals come straight out the other with no additional modules in between, is the true end-to-end.

Comparing Figures 8 and 9 with the explicit end-to-end described above, the obvious difference is that the implicit, integrated, global end-to-end has no separate modules in the middle; only the neural network exists (the sensors are how it sees the world, the end-to-end system in the middle is its entire brain, and the steering wheel, brakes, and throttle are its limbs).

In contrast, explicit end-to-end splits the complete brain in the middle into modules. Although it no longer needs hand-written code for every rule and can gradually learn by watching videos of experienced drivers, it still operates in a modular way, which is why critics argue it is not truly end-to-end.

However, this approach also has its advantages. As mentioned above, explicit end-to-end is to some extent a white box: when our vehicle learns some unexpected bad behavior, we can trace it back to the module whose end-to-end component went wrong. With an implicit end-to-end black-box model, by contrast, there is nowhere to start, because it is completely integrated and even its creators do not know why it behaves the way it does (this is roughly the "black box" people often hear about online).

Figure 9. Implicit end-to-end. Source: P. S. Chib et al., "Recent Advancements in End-to-End Autonomous Driving Using Deep Learning: A Survey"

End-to-End Based on Generative AI Large Models

ChatGPT has been a great inspiration for autonomous driving. It is trained on massive amounts of unlabeled, low-cost data and can interact with humans and answer questions. Autonomous driving can mimic this interaction pattern: feed in the environment as the question and directly output the driving decision, completing these tasks end-to-end on top of a large language model.

The main functions of large AI models here are twofold: first, they can cheaply generate massive, realistic, and diverse training video data, including corner cases (rare but potentially dangerous abnormal situations in autonomous driving); second, they use reinforcement learning to achieve the end-to-end effect of going from video perception directly to driving decisions. The core idea is that the model can learn causality from natural data without annotations, greatly improving overall generalization. Like ChatGPT, it predicts the next scene from the previous scene in an autoregressive manner.
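As a toy analogy to that autoregressive objective (not Tesla's actual model), imagine each driving "scene" quantized into a token and a small Transformer trained to predict the next token from the previous ones:

```python
import torch
import torch.nn as nn

VOCAB = 1024   # pretend each driving "scene" is quantized to one of 1024 tokens

class TinySceneGPT(nn.Module):
    """Predict the next scene token from the sequence of previous ones."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 128)
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(128, VOCAB)

    def forward(self, scene_tokens: torch.Tensor) -> torch.Tensor:
        T = scene_tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.transformer(self.embed(scene_tokens), mask=causal_mask)
        return self.out(h)                    # logits for the next token at every position

model = TinySceneGPT()
past_scenes = torch.randint(0, VOCAB, (1, 16))    # 16 observed scene tokens
logits = model(past_scenes)
next_scene = logits[:, -1].argmax(dim=-1)         # the model's guess for what happens next
```

The same objective (predict what comes next) is what lets a world-model-style system learn, from raw video alone, which futures are plausible and which are not.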

Let's simplify the importance of large models for end-to-end scenarios:

Currently, the value density of autonomous driving data is very low. It typically includes two types. One is normal driving, which is repetitive and accounts for about 90% of collected data, such as what Tesla's shadow mode gathers; Musk has acknowledged that this type of data has low value, with perhaps only one useful sample in ten thousand or even fewer. The other type is accident data, i.e., examples of what not to do; training end-to-end on it either adapts the model only to limited conditions or teaches it mistakes. And because end-to-end is a black box, offering correlations without explanations, high-quality, diverse data is needed for training to produce reasonably good results.

End-to-end therefore needs to solve the data problem first. Relying on external collection alone is not very feasible: costs are high, efficiency is low, and the data lacks diversity and interactions (interactions between the ego vehicle, other vehicles, and the environment require expensive manual annotation). Hence the introduction of generative AI large models, which can produce massive, diverse data, reduce manual annotation, and lower costs.

In addition, the core logic of large-language-model end-to-end is predicting how things will unfold, which is essentially learning causal relationships. There is still a gap between neural networks and humans here: neural networks give probabilistic outputs, knowing the result but not the reason behind it, whereas humans can learn common sense about the physical world through observation and unsupervised interaction, judge what is reasonable and what is impossible, pick up new skills from a handful of attempts, and predict the consequences of their own actions.

The goal of generative AI end-to-end large models is to enable neural networks to have the ability to generalize like humans.

For example, human drivers will inevitably encounter situations they have never seen before that may be dangerous. Even without prior experience, they can infer from past experience what to do to stay safe (we have probably never seen a Tyrannosaurus rex appear on the road, but if it happened we would certainly drive away quickly). Inferring and judging what behavior is reasonable on the basis of past experience is exactly what we hope large-language-model end-to-end can do, so that our vehicles can truly drive like humans.

Data source: Guan, Yanchen, et al. "World models for autonomous driving: An initial survey."

Since Tesla has not yet held its third AI Day, the specific network architecture of Tesla's end-to-end system remains unclear. However, based on the 2023 CVPR talk by Ashok, Tesla's head of autonomous driving, and some of Musk's own replies, it can be inferred that Tesla's end-to-end model is likely based on a large-language-model-style end-to-end approach (a world model). (Looking forward to Tesla's third AI Day.)

End-to-end Pros and Cons

Figure 10. End-to-end architecture simplified diagram

Pros: Lossless information transmission, fully data-driven, with learning ability and generalization

■ As the path of end-to-end perception, decision-making, and planning becomes clearer, end-to-end opens up room for imagination on the way to L4 autonomous driving.

Cons: unexplainable, huge parameter counts, insufficient computing power, hallucination problems

■ If you have used a large language model like ChatGPT, you will know that it sometimes spouts nonsense with a straight face (the hallucination problem). Nonsense in a chat is harmless, but if your vehicle earnestly drives nonsense on the road, it can be fatal! And because of the black-box problem, you cannot trace the cause. This is the problem end-to-end most urgently needs to solve, and the common solution at present is to add safety redundancy (a minimal sketch of this idea appears after this list).

Figure 11. Huawei ADS 3.0 instinctive safety network

■ In addition, deploying end-to-end also demands enormous computing power and data. According to a report by Chentao Capital, although most companies claim that around 100 high-end GPUs are enough to support one training run of an end-to-end model, this does not mean that reaching the mass-production stage of end-to-end requires only that order of magnitude of training resources. Most companies developing end-to-end autonomous driving currently have training compute at the thousand-GPU level, and as end-to-end moves toward large models, that training compute will become insufficient.

Behind computing power is money. As Li Xiang, founder of Li Auto, said, "Intelligent driving in the future will require $1 billion just as an entry ticket."
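As promised above, here is a minimal sketch of the safety-redundancy idea: the learned model's output passes through a few hard, hand-written checks before reaching the actuators. The thresholds, field names, and rules are invented for illustration and are not any vendor's actual safety layer.

```python
def safety_filter(proposed: dict, scene: dict) -> dict:
    """Wrap the end-to-end model's output with hard, hand-written safety checks."""
    out = dict(proposed)
    # Rule 1: never accelerate toward an obstacle closer than a hard threshold.
    if scene.get("nearest_obstacle_m", 1e9) < 5.0:
        out["throttle"] = 0.0
        out["brake"] = 1.0
    # Rule 2: clamp steering to physically sane limits.
    out["steering"] = max(-30.0, min(30.0, out.get("steering", 0.0)))
    return out

model_output = {"throttle": 0.6, "brake": 0.0, "steering": 45.0}   # hypothetical network output
safe_output = safety_filter(model_output, {"nearest_obstacle_m": 3.2})
# -> throttle forced to 0, brake applied, steering clamped to 30 degrees
```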

At this point, we have covered the basic framework of autonomous driving (due to space constraints, only a small part of it). Looking back from a historical perspective, the progress of autonomous driving has basically followed the path Tesla laid out (individual manufacturers innovate on top of that path, but the essence is unchanged). To some extent, simply being able to keep up with Tesla is itself a kind of ability.

Next, I will elaborate on the development of modularity and end-to-end for Tesla FSD V12.

2. The Past and Present of Tesla FSD: Is Keeping Up with Tesla Itself an Ability?

2.1 The Past of Tesla FSD V12

The development history of Tesla's autonomous driving to some extent mirrors the development of the most important route in the autonomous driving industry. In 2014, Tesla released its first-generation hardware, Hardware 1.0, with both hardware and software supplied by Mobileye, an Israeli automotive technology company. The partnership, however, ended with the "world's first fatal autonomous driving accident" in 2016 (the core reason being that Mobileye provided a closed black-box solution: Tesla could not modify the algorithms inside and could not share vehicle data with Mobileye).

Figure 12. Tesla's intelligent driving development history Source: Tesla official website, Guosen Securities Research Institute

From 2016 to 2019 was Tesla's transition period toward in-house development. In 2019, the hardware was upgraded to version 3.0 with the first self-developed FSD 1.0 chip, and the shadow mode feature was added, helping Tesla collect a large amount of driving data and laying the foundation for its pure-vision route.

From 2019 until the large-scale rollout of FSD V12.0 in 2024 was the period of full in-house development. In 2019, the algorithm architecture was upgraded to neural networks with the HydraNet algorithm; from 2020 the focus shifted to pure vision; and at the 2021 and 2022 AI Days, the BEV and Occupancy Network architectures were announced in turn, validating the BEV + Transformer + Occupancy perception framework in North America, with domestic manufacturers following suit (with a gap of roughly 1-2 years).

As mentioned earlier, the core of the modular intelligent-driving design is the perception module: how do we get vehicles to better understand the information coming in from the sensors (cameras, LiDAR, millimeter-wave radar, and so on)? Most of the concepts above, and most of what Tesla did before FSD V12, were about making the perception module smarter. To some extent this can be understood as moving the perception module toward end-to-end, because the first step in making a vehicle drive itself is enabling it to perceive the dynamic, changing physical world truthfully and objectively.

The next step is to establish driving rules for it (the decision-making and planning module). This module is more traditional, adopting a Monte Carlo tree search + neural network approach (similar to Google's AlphaGo playing Go) that quickly traverses the possibilities to find the path with the highest success rate. It contains a large amount of manually written code rules; that is, it imagines and selects the best trajectory on the road based on many pre-set human rules (obeying traffic rules and not colliding with other traffic participants). The control module, meanwhile, involves more of the hardware side: throttle, brakes, and steering wheel.

Because the perception module is where the progressive changes are concentrated, I will next try to explain, in plain terms, what these perception concepts do and what problems each of them solves (for reasons of length, some parts are summarized).

2.1.1 Evolution of Tesla FSD Perception Side

In 2017, Andrej Karpathy, who previously taught at Stanford, joined Tesla, marking the beginning of Tesla's end-to-end evolution on the perception side:

(1) HydraNet ("Hydra") Algorithm - Revealed at Tesla AI Day 2021

HydraNet is a complex neural network Tesla developed to help the car "see" and "understand" its surroundings. The name comes from the multi-headed "Hydra" of Greek mythology: like that creature, the network has multiple "heads" that handle different tasks simultaneously, including object detection, traffic light recognition, and lane prediction. Its three main advantages are feature sharing, task decoupling, and more efficient fine-tuning through feature caching.

Feature Sharing: in simple terms, the HydraNet backbone processes the most basic information once and shares the result with the different "heads", so each "head" does not have to re-process the same information and can complete its own task more efficiently.

Task Decoupling: Separating specific tasks from the backbone allows for individual fine-tuning of tasks; each "head" is responsible for a specific task, such as one for lane recognition, another for pedestrian recognition, and so on. These tasks do not interfere with each other and are independently completed.

More Efficient Fine-Tuning via Feature Caching: by limiting the complexity of the information flow and ensuring that only the most important information is passed to each "head", the bottleneck layer can cache important features and speed up fine-tuning.
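Here is a minimal sketch of the shared-backbone, multi-head pattern these three advantages describe. The layer sizes and task heads are invented for illustration; this is not Tesla's actual HydraNet code.

```python
import torch
import torch.nn as nn

class MultiHeadPerception(nn.Module):
    """One shared backbone, several task-specific heads (the 'hydra' pattern)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                 # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # -> 64-dim feature vector
        )
        self.heads = nn.ModuleDict({                   # each head handles one task
            "objects": nn.Linear(64, 10),              # e.g. 10 object classes
            "traffic_light": nn.Linear(64, 3),         # red / yellow / green
            "lane": nn.Linear(64, 4),                  # lane geometry parameters
        })

    def forward(self, image):
        features = self.backbone(image)                # computed once, shared by all heads
        return {name: head(features) for name, head in self.heads.items()}

model = MultiHeadPerception()
features_cache = model.backbone(torch.rand(1, 3, 128, 128)).detach()   # cache backbone features
lane_only = model.heads["lane"](features_cache)        # fine-tune or evaluate one head in isolation
```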

Figure 13. HydraNet Hydra Framework Source: 2021 Tesla AI Day

(2) BEV (Bird's-Eye View) + Transformer - Revealed at Tesla AI Day 2021

From 2D Plane Images to a 3D Bird's-Eye-View Space

While HydraNet handles the recognition tasks, perception of the vehicle's surroundings is completed by BEV (Bird's-Eye View) + Transformer. Together they help Tesla convert the 2D images captured by the eight cameras into a 3D vector space (something that can also be done with LiDAR).
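Here is a rough sketch of the underlying idea, with arbitrary dimensions and without Tesla's unpublished details: a grid of learned bird's-eye-view queries cross-attends to image features from the surround cameras, producing one top-down feature map.

```python
import torch
import torch.nn as nn

NUM_CAMERAS, TOKENS_PER_CAM, DIM = 8, 300, 256
BEV_H, BEV_W = 50, 50                               # 50 x 50 bird's-eye-view grid

image_tokens = torch.rand(1, NUM_CAMERAS * TOKENS_PER_CAM, DIM)   # features from 8 cameras
bev_queries = nn.Parameter(torch.rand(1, BEV_H * BEV_W, DIM))     # one learned query per BEV cell

cross_attention = nn.MultiheadAttention(embed_dim=DIM, num_heads=8, batch_first=True)

# Each BEV cell "asks" the camera features what is located at its position on the ground.
bev_features, _ = cross_attention(query=bev_queries, key=image_tokens, value=image_tokens)
bev_map = bev_features.reshape(1, BEV_H, BEV_W, DIM)               # top-down feature map
```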

(3) Occupancy Network - Unveiled at Tesla AI Day 2022

The addition of the Occupancy Network transformed BEV from 2D into true 3D (as shown in Figure 16), and with the incorporation of temporal information (based on optical flow), it completed the step from 3D to 4D.

The Occupancy Network introduces height information, achieving genuine 3D perception. In previous versions, vehicles could identify objects present in the training dataset but could not recognize unseen objects; and even when an object was recognized, BEV could only say roughly which grid squares it occupied, not its actual shape. The Occupancy Network divides the 3D space around the vehicle into many small cubes (voxels) and determines whether each voxel is occupied (its core task is not to identify what something is, but whether something is there). It is like driving in fog: you cannot see clearly what is ahead, but you know roughly that there are obstacles and that you need to go around them.

The Occupancy Network is also implemented with Transformers and ultimately outputs the occupancy volume (how much space nearby objects take up) and the occupancy flow (how it changes over time), with the temporal component derived via optical flow.
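As a toy illustration of what such an occupancy representation looks like (the grid sizes and the example obstacle are invented), the space around the car is just a 3D array of "occupied or not" flags plus a per-voxel motion vector:

```python
import numpy as np

# Space around the car split into 0.5 m voxels: 160 x 160 x 16 covers roughly
# 80 m x 80 m in plan view and 8 m of height (all sizes are illustrative).
occupancy = np.zeros((160, 160, 16), dtype=bool)
flow = np.zeros((160, 160, 16, 3), dtype=np.float32)    # per-voxel motion vector ("occupancy flow")

# The network does not need to know *what* an object is, only that these voxels are filled,
# e.g. some unknown obstacle a couple of car lengths ahead:
occupancy[84:90, 78:82, 0:4] = True
flow[84:90, 78:82, 0:4] = np.array([0.0, -1.2, 0.0])    # and that it is drifting at ~1.2 m/s

drivable = ~occupancy.any(axis=2)                        # a column with nothing in it is free space
```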

Figure 17. Optical flow method

The optical flow method assumes that the brightness of the pixels making up an object is constant and continuous over time. By comparing the positions of pixels across two consecutive frames, it ultimately yields the 4D projection information.
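That brightness-constancy assumption is the classical optical-flow constraint (a standard result, not something specific to Tesla), which can be written as:

```latex
% A pixel keeps its brightness as it moves by (\Delta x, \Delta y) over \Delta t:
I(x, y, t) = I(x + \Delta x,\; y + \Delta y,\; t + \Delta t)
% A first-order expansion gives the optical flow constraint equation,
% where (u, v) = (\Delta x / \Delta t,\; \Delta y / \Delta t) is the pixel's motion:
\frac{\partial I}{\partial x}\, u + \frac{\partial I}{\partial y}\, v + \frac{\partial I}{\partial t} = 0
```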

Figure 18. Projection information

(4) Tesla leads the convergence of perception technology, and domestic leading manufacturers follow suit

You may not have a very direct sense of this yet, so let me list a few intuitive data points:

In 2021 (FSD V9), Tesla announced the BEV network at its first AI Day; in 2023, domestic BEV architectures began to be deployed.

In 2022, at the second AI Day, Tesla announced the Occupancy Network; in 2023-2024, domestic Occupancy Networks began to be deployed.

In 2023, Tesla announced that FSD V12 adopts end-to-end technology; in 2024, domestic manufacturers followed suit one after another (adopting modular end-to-end).

Figure 19. Domestic manufacturers lag behind Tesla by 1-2 years Source: Tencent Technology, compiled and drawn by Han Qing

BEV + Transformer solves the problem of autonomous vehicles relying on high-precision maps. High-precision maps differ from everyday maps like Amap and Baidu Maps (as shown in Figure 20): they are accurate to the centimeter and include many more data dimensions (roads, lanes, elevated structures, guardrails, trees, road-edge types, roadside landmarks, and so on).

Their cost is very high: the maps must be maintained at centimeter-level accuracy, yet road information constantly changes (temporary construction, for example), so data collection and mapping never end. Relying on high-precision maps to achieve fully autonomous driving in all urban scenarios is unrealistic. By now, everyone should have some sense of what BEV contributes. (Note: Tesla's lane neural network is also a key algorithm for breaking away from high-precision maps; due to space limitations, it is not covered here.)

Figure 20. Comparison between high-precision maps and regular maps

The Occupancy Network addresses the problem of low obstacle-recognition rates: by turning what it perceives into a 4D representation, the vehicle can detect and avoid whatever is around it, familiar or not. Before this, vehicles could only recognize objects that appeared in the training dataset. The Occupancy Network has, to some extent, enabled the perception side of autonomous driving to become end-to-end on neural networks, which is of great significance.

2.2 The Past and Present of Tesla's FSD V12

At the beginning of the article we mentioned that Ashok Elluswamy, head of Tesla's Autopilot team, posted on X (formerly Twitter) that, thanks to "end-to-end", FSD V12 completely surpassed the accumulated V11 after just a few months of training.

Ashok Elluswamy's post on X (formerly Twitter)

Combined with the high praise industry leaders have given FSD V12, it is clear that V12 and V11 are quite different, which is why I divide the story into "past" and "present" at V12.

As Table 1 shows, since the introduction of FSD V12 its iteration pace has been much faster than before, over 300,000 lines of C++ code have been reduced to just a few thousand, and consumers and professionals can frequently be seen on social media saying that Tesla's FSD V12 behaves more like a human.

Table 1. FSD iteration versions Source: Tesla AI Day, Musk's Twitter, Zhongtai Securities, Tencent Technology, compiled and drawn by Hanqing

We do not know exactly how Tesla achieved this transformation, but from Ashok Elluswamy's talk at CVPR 2023 it can be inferred that the end-to-end model is likely built on top of the original Occupancy Network: "The Occupancy model actually has very rich features, which can capture many things happening around us. A large part of the entire network is in building model features."

From an overall perspective, the modular end-to-end systems in China may differ somewhat from the large unified end-to-end model Tesla has built.

Since the previous section has roughly explained what end-to-end is, we will not repeat it here. Next, I would like to discuss why Tesla is currently considered the leader in this autonomous driving race, comparing objectively with data.

In the end-to-end era, a carmaker's level of intelligent driving is mainly determined by three factors: massive high-quality driving data, large-scale computing power reserves, and the end-to-end model itself. Like ChatGPT, end-to-end autonomous driving follows the brute-force aesthetic of massive data times massive compute; with enough of both, astonishing capabilities may suddenly emerge.

Figure 21. Level of intelligent driving in the end-to-end era

Since we do not know how Tesla achieves its end-to-end system, we will only discuss data and computing power here.

2.2.1 Tesla's Constructed Computing Power Barrier

The history of FSD is, in a sense, the history of Tesla's accumulation of computing power. In early 2024, Musk said on X (formerly Twitter) that compute was constraining the iteration of FSD; from March onward, he said compute was no longer a problem.

Figure 22. Musk's tweets on X

After the Dojo chip went into volume production, Tesla's compute quickly grew from the less than 5 EFLOPS of its original A100 cluster to a top-5 level globally, and is expected to reach roughly 100 EFLOPS by October this year, about the equivalent of 300,000 A100s.

Figure 23. Tesla's computing power growth curve. Source: Tesla

Compared with the computing power reserves of domestic manufacturers (as shown in Figure 24), it is clear that under various real-world constraints the gap between China and the United States in intelligent-driving compute remains significant, and domestic manufacturers have a long way to go.

Figure 24. Comparison of computing power between Tesla and domestic intelligent driving enterprises. Source: Autohome, public information, Jiazi Light Year Think Tank. Compiled and drawn by Tencent Technology Hanqing.

Of course, behind the computing power lies a huge capital investment. Musk has stated on X (formerly Twitter) that he will invest over $10 billion in autonomous driving this year. Perhaps it really is as Lang Xianpeng, Vice President of Intelligent Driving at Li Auto, said: "One billion dollars in the next year is just the entry ticket."

Figure 25. Musk announced plans to invest over $10 billion in the field of autonomous driving by 2024.

2.2.2 Tesla's High-Quality Data

End-to-end intelligent driving is like a little genius with enormous potential: you need to feed it a huge volume of high-quality driving videos from experienced drivers to help it quickly grow into a Ph.D. of driving. It is a case of brute force working miracles.

Musk described the data required for training at an earnings call: "Training with 1 million video cases is barely enough; with 2 million, it's slightly better; with 3 million, it's impressive; and with 10 million, it becomes unbelievable." Training also requires high-quality human driving behavior data. Thanks to Tesla's shadow mode, millions of mass-produced vehicles help Tesla collect data, and at the 2022 AI Day Tesla announced that it had established a complete data training pipeline:

It covers data collection, simulation, automatic annotation, model training, and deployment. As of April 6, 2024, FSD users had accumulated more than 1 billion miles of driving, far beyond the cumulative mileage of any domestic manufacturer's users.

The quality and scale of data can determine a model's performance even more than its parameters. Andrej Karpathy once said that Tesla's autonomous driving team spends three quarters of its effort on collecting, cleaning, categorizing, and annotating high-quality data, and only one quarter on algorithm exploration and model building. The importance of data is evident from this.

Tesla is gradually exploring the "no man's land" of autonomous driving, pushing scale and capability to the extreme.

Figure 26. FSD users have accumulated over 1 billion miles of driving.

Conclusion

Of course, the ultimate test is real-world performance on the road. Tesla's V12 currently operates mainly in the United States, where road and traffic conditions are generally better than in China, where pedestrians and electric bikes may dart into the road at any moment. From a technical standpoint, however, a driver who is proficient in the United States should not find it too hard to drive in China, and learning ability is one of the system's core traits. Its performance may initially fall short of what it shows in the United States, but judging by the iteration pace up to FSD V12.5, it may adapt to China's road conditions within half a year to a year.

This will have a significant impact on domestic manufacturers. How the many intelligent-driving companies respond to a Tesla FSD V12 already validated in the United States remains to be seen.

Author: Han Qing, Source: Tencent Technology, Original Title: "Thousands of words of hardcore interpretation: Will 'end-to-end' bring a qualitative change to Tesla FSD V12?"