Wallstreetcn
2023.09.04 13:44

NVIDIA's biggest risk lies in a corner that few people pay attention to!

GPUs and the Transformer architecture's heavy computational demands go hand in hand. But if, in the near future, the Transformer that the AI world relies on is replaced by architectures with lower computational requirements, will that threaten NVIDIA, the "shovel seller"?

NVIDIA has been on a tear this year. Competing on computing power seems to have become a consensus among the technology giants, in part because most large models are built on the Transformer architecture, which is extremely compute-hungry. If the Transformer is gradually displaced by architectures with lower computing requirements as the field iterates, could that become a "hidden risk" for NVIDIA?

Rob Toews, a partner at Radical Ventures and an investor in OpenAI rival Cohere, argued in a column published on September 3 that the Transformer's support for parallelization during training dovetailed with the rise of the GPU. GPUs have many stream processors suited to parallel processing of dense data, making them a natural fit for Transformer-based workloads.

There is no doubt that the Transformer architecture is powerful and has transformed the AI field. But its drawbacks are equally clear: computational cost grows quadratically with sequence length, and as models keep growing, the computation required to train them balloons, driving the Transformer's demand for computing power sharply upward.

Toews notes that, to address these issues, approaches such as Hyena, Monarch Mixer, BiGS, and MEGA have been proposed, all of which use subquadratic methods to cut computational complexity and reduce computing requirements.

Toews concedes that these architectures still have a long way to go before they can challenge the Transformer's "throne," but new ideas keep emerging in AI, and amid constant turnover, perhaps nothing stays unshakable forever.

While computing demand is surging, whoever holds NVIDIA GPUs holds, to some extent, the hardest "hard currency" of the AI era. But if the Transformer is one day replaced by architectures with lower computing requirements, that would pose a real threat to NVIDIA's dominant market position.

The Enormous Computational Cost of the Transformer

On June 12, 2017, the groundbreaking paper "Attention Is All You Need" introduced the Transformer architecture that has since revolutionized large models. As of September 4, the Transformer has been around for over six years, and the paper has been cited a staggering 87,345 times. The ever-larger models built on it come at the cost of enormous compute and power consumption; the potential of artificial intelligence may be boundless, but the physical and cost constraints are real. Why is the Transformer so computationally demanding?

Toews points to two main reasons: first, the computational complexity of the attention mechanism; second, ever-growing model sizes:

The Transformer's core idea is to use a self-attention mechanism to capture dependencies in sequence data, regardless of how far apart the elements are.

The attention mechanism compares every token in the sequence with every other token, so computational cost grows quadratically with sequence length, i.e., the complexity is O(n^2). This quadratic complexity means cost rises sharply as the text gets longer.
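To make the quadratic term concrete, here is a minimal NumPy sketch of self-attention (an illustration only, not the paper's full formulation: it omits the learned query/key/value projections and multiple heads). The score matrix alone has n × n entries, so doubling the sequence length quadruples both compute and memory:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention, stripped to the core for illustration
    (no learned Q/K/V projections, no multi-head split): every token is
    compared with every other token."""
    n, d = x.shape                       # n tokens, d-dimensional embeddings
    scores = x @ x.T / np.sqrt(d)        # (n, n) score matrix -> the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                   # each output mixes all n tokens

# Doubling the sequence length quadruples the number of attention scores:
for n in (1024, 2048, 4096):
    print(f"{n} tokens -> {n * n:,} scores")
```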

At the same time, the Transformer scales up gracefully, so researchers keep training ever-larger models on top of it. Mainstream language models now have billions or even trillions of parameters, and the compute required to train them grows steeply with model size.
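A common back-of-the-envelope rule from the scaling-law literature (an outside reference, not a figure from Toews's column) puts training cost at roughly 6 FLOPs per parameter per token, which makes the scaling pressure easy to quantify:

```python
# Rule of thumb from the scaling-law literature (an assumption here,
# not a figure from the article): training FLOPs ~= 6 * params * tokens.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

# GPT-3's published figures: 175B parameters trained on ~300B tokens.
print(f"{train_flops(175e9, 300e9):.2e} FLOPs")   # ~3.15e+23
```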

Ruth Porat, CFO of Google parent Alphabet, said on an earnings call that capital expenditures would be "slightly higher" than last year's record level, owing to the need to invest in AI infrastructure.

Microsoft's latest report shows that its quarterly capital expenditures exceeded expectations, which CFO Amy Hood attributed to increased investment in AI infrastructure.

Microsoft also invested $10 billion in OpenAI earlier this year to fund the massive computational resources needed to train large language models, while Inflection, a startup barely 18 months old, has raised over $1 billion to build a GPU cluster for training its own large language model.

Amid the market frenzy, NVIDIA GPUs have hit production bottlenecks: the latest H100 chips have sold out, and orders placed now will not be fulfilled until the first or even second quarter of 2024.

Toews argues that all of these examples show how compute-hungry Transformer-based models are: the current AI boom has triggered a global GPU shortage, and hardware makers cannot keep up with the surging demand.

Challenges Faced by Transformer

At the same time, Toews notes that the Transformer is limited in the sequence lengths it can handle; most existing systems simply truncate long inputs, which loses information. How to pre-train on long texts is therefore a major open challenge.

And this AI arms race is bound to continue. As long as OpenAI, Anthropic, or anyone else keeps using the Transformer architecture, the text sequences their models can handle will remain limited. Toews notes that there have been many attempts to update the Transformer so that it keeps the attention mechanism but handles long sequences better. However, these modified architectures, such as Longformer, Reformer, Performer, Linformer, and Big Bird, typically sacrifice some performance and have therefore not been widely adopted; a toy sketch of the underlying idea follows below.
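To illustrate the trade-off these variants make, here is a toy sketch of the sliding-window idea behind Longformer-style local attention (heavily simplified: the real model combines local windows with a few global tokens and uses learned projections). Each token attends only to nearby tokens, so cost grows as O(n·w) instead of O(n²), at the price of losing direct long-range interactions:

```python
import numpy as np

def local_attention(x, window=4):
    """Toy sliding-window attention: each token attends only to its
    `window` neighbours on each side, so cost is O(n * window) rather
    than O(n^2). Distant tokens no longer interact directly -- the
    performance trade-off the article mentions."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = x[lo:hi] @ x[i] / np.sqrt(d)   # at most 2*window+1 scores
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ x[lo:hi]       # mix of local tokens only
    return out

y = local_attention(np.random.randn(512, 32))   # 512 tokens, 32 dims
```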

Toews stresses that nothing is perfect and history does not stand still: the Transformer may hold an absolutely dominant position today, but it is not flawless, and its flaws have opened the door to new architectures.

A Challenger to the "Throne"?

Toews believes the search for a replacement for the Transformer has become one of the most promising research areas, and one direction is to swap the attention mechanism for a new function. Approaches such as Hyena, Monarch Mixer, BiGS, and MEGA all propose subquadratic methods to reduce computational complexity and lower compute requirements.

Toews highlights Hyena, a new architecture from researchers at Stanford and Mila with the potential to replace the Transformer. It is an attention-free, convolution-based architecture that can match the quality of attention models at lower computational cost, and it has performed well on NLP tasks at subquadratic cost:

It is claimed that Hyena can match GPT-level accuracy while using up to 100 times less compute on very long sequences. It is the first attention-free architecture to match GPT quality while cutting total FLOPS by 20%, and it also shows promise as a general-purpose deep learning operator, for example in image classification.
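The key primitive behind Hyena-style models is the long convolution, which mixes information across the entire sequence in O(n log n) via the FFT instead of attention's O(n²). The sketch below shows only that primitive (a simplification: real Hyena learns its filters implicitly and adds data-controlled gating on top):

```python
import numpy as np

def long_conv(x, h):
    """Causal global convolution via FFT: every position receives a
    filtered mix of all earlier positions in O(n log n). This is the
    sequence-mixing primitive in Hyena-style architectures (simplified:
    Hyena itself parameterizes the filter implicitly and adds gating)."""
    n = x.shape[0]
    X = np.fft.rfft(x, n=2 * n, axis=0)    # zero-pad to avoid circular wrap-around
    H = np.fft.rfft(h, n=2 * n)[:, None]   # filter spectrum, broadcast over channels
    return np.fft.irfft(X * H, n=2 * n, axis=0)[:n]

x = np.random.randn(1024, 64)   # 1024 tokens, 64 channels
h = np.random.randn(1024)       # a filter as long as the sequence itself
y = long_conv(x, h)             # shape (1024, 64)
```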

Toews notes that the initial Hyena research was conducted at relatively small scale: the largest Hyena model has 1.3 billion parameters, versus 175 billion for GPT-3 and a rumored 1.8 trillion for GPT-4. A key test for the Hyena architecture, then, is whether it can keep delivering strong performance and efficiency gains when scaled up to the size of today's Transformer models.

Toews sees liquid neural networks as another architecture with the potential to replace the Transformer. Two researchers at MIT, drawing inspiration from the tiny nematode Caenorhabditis elegans, created these so-called "liquid neural networks."

Liquid neural networks are claimed to be not only faster but also exceptionally stable, meaning the system can absorb large amounts of input without its behavior spinning out of control. Toews argues that their much smaller size also makes liquid neural networks more transparent and easier for humans to interpret than Transformers:

After all, it is easier to explain what happens in a network with 253 connections than in one with 175 billion connections.
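For intuition, here is a minimal sketch of a liquid-time-constant-style cell (a simplified reading of the published LTC equations, not the MIT implementation): the state follows an ODE whose effective time constant depends on the current input, which is what makes the dynamics "liquid":

```python
import numpy as np

def ltc_step(x, u, W, Win, b, A, tau, dt=0.05):
    """One Euler step of a liquid-time-constant-style neuron model
    (simplified sketch, not the MIT code). The input-dependent gate f
    modulates how fast each neuron's state relaxes toward A, so the
    effective time constants change with the input -- hence 'liquid'."""
    f = np.tanh(W @ x + Win @ u + b)   # input-dependent nonlinearity
    dx = -x / tau + f * (A - x)        # leak plus gated drive toward A
    return x + dt * dx

rng = np.random.default_rng(0)
n, m = 8, 3                            # 8 neurons, 3 input channels
x = np.zeros(n)
W = 0.1 * rng.normal(size=(n, n))
Win = 0.1 * rng.normal(size=(n, m))
b, A, tau = np.zeros(n), np.ones(n), np.ones(n)
for t in range(100):                   # drive the cell with a slow sine input
    u = np.sin(0.1 * t) * np.ones(m)
    x = ltc_step(x, u, W, Win, b, A, tau)
print(x.round(3))
```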

If these architectures keep improving and gradually reduce AI's reliance on raw computing power, will that eventually weigh on NVIDIA's future revenue?