The fastest large model in history goes viral overnight: Groq becomes an instant sensation, with its self-developed LPU beating NVIDIA GPUs on speed.
NVIDIA's challenger Groq is here! Abandoning GPUs, the company has built its own LPU, and its text generation is faster than the blink of an eye. In inference scenarios it is claimed to be 10 times faster than NVIDIA GPUs, at one-tenth the price and power consumption.
Wake up, and the AI world has changed yet again.
Just as we were still digesting the shock of Sora, another Silicon Valley startup has shot to the top of trending searches with the fastest large model in history and a self-developed chip, the LPU.
Just yesterday, the AI chip startup Groq (not Musk's Grok) opened its products to free trials. Compared with other AI chatbots, Groq's lightning-fast responses quickly set the internet abuzz. In user tests, Groq's generation speed reached nearly 500 tokens/s, crushing GPT-4's roughly 40 tokens/s.
Some netizens were shocked and said:
It responds faster than I can blink.
It is worth emphasizing, however, that Groq did not develop a new model. It is only a model host, running the open-source models Mixtral 8x7B-32k and Llama 2 70B-4k on its homepage.
The lightning-fast responses come from the hardware driving the models: instead of NVIDIA GPUs, Groq uses a new AI chip of its own design, the LPU (Language Processing Unit).
500 tokens per second: writing a paper faster than you can blink
The LPU's most outstanding feature is speed.
According to test results from January 2024, the Meta Llama 2 model running on Groq LPUs delivered far superior inference performance, up to 18 times faster than top cloud computing providers.
(Image source: GitHub)
A previous Wall Street News article noted that a Groq LPU paired with Meta Llama 2 70B can generate as many words as Shakespeare's "Hamlet" in about 7 minutes, roughly 75 times faster than an average person can type.
In one example shared on Twitter, a user asked a professional marketing question, and Groq produced a reply of over a thousand words within about four seconds.
Other netizens fed the same code-debugging problem to Gemini, GPT-4, and Groq at the same time; Groq's output was 10 times faster than Gemini's and 18 times faster than GPT-4's.
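As a quick sanity check on these figures, the sketch below converts the quoted speeds into wall-clock time for a thousand-word answer; the tokens-per-word ratio is an assumed average, and the throughput numbers are simply the claims reported above, not independent measurements.

```python
# Back-of-envelope conversion of the quoted throughput figures.
# All numbers are claims reported above or rough assumptions, not measurements.

TOKENS_PER_WORD = 1.3      # assumed average for English text
ANSWER_WORDS = 1_000       # the "thousand-word" reply mentioned above

throughput_tok_s = {
    "Groq (Mixtral / Llama 2)": 500,   # ~500 tokens/s in user tests
    "GPT-4": 40,                       # ~40 tokens/s as cited above
}

answer_tokens = ANSWER_WORDS * TOKENS_PER_WORD
for name, tps in throughput_tok_s.items():
    print(f"{name:>26}: ~{answer_tokens / tps:5.1f} s for a {ANSWER_WORDS}-word answer")

# -> Groq: ~2.6 s, GPT-4: ~32.5 s, roughly consistent with the
#    "over a thousand words within four seconds" observation.
```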
Groq's crushing speed advantage over other AI models led netizens to exclaim, "The Captain America of AI inference has arrived."
LPU, the challenger to NVIDIA GPUs?
To emphasize again: Groq has not developed a new model; it simply uses a different chip.
According to Groq's official website, the LPU is a chip designed specifically for AI inference. The GPUs that drive mainstream large models such as GPT are parallel processors with hundreds of cores, originally designed for graphics rendering. The LPU departs from the GPU's SIMD (Single Instruction, Multiple Data) architecture, which lets the chip use each clock cycle more effectively, deliver consistent latency and throughput, and do away with complex scheduling hardware:
Groq's LPU inference engine is not an ordinary processing unit; it is an end-to-end system designed to deliver the fastest inference for compute-intensive, sequential workloads such as LLMs. By eliminating external memory bottlenecks, the LPU inference engine achieves performance orders of magnitude higher than traditional GPUs.
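The "external memory bottleneck" point deserves a brief unpacking. In single-stream (batch-1) decoding, generating each new token requires streaming essentially all of the model's weights through the compute units, so throughput is capped by memory bandwidth rather than raw compute. The sketch below illustrates that bound; the H100 bandwidth is a public spec (rounded), while the aggregate SRAM bandwidth for a multi-chip Groq deployment is purely an assumed figure for illustration.

```python
# Why memory bandwidth dominates single-stream LLM inference:
# each generated token must stream (roughly) all model weights through the
# chip, so an upper bound on decode throughput is
#     tokens/s <= memory_bandwidth / weight_bytes

def decode_upper_bound(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Rough upper bound on batch-1 tokens/s for a memory-bound decoder."""
    return bandwidth_bytes_per_s / weight_bytes

LLAMA2_70B_FP16 = 70e9 * 2    # ~140 GB of weights at FP16

# A single H100 SXM has ~3.35 TB/s of HBM3 bandwidth (public spec, rounded):
print(decode_upper_bound(LLAMA2_70B_FP16, 3.35e12))   # ~24 tokens/s

# Many LPUs holding the weights entirely in on-chip SRAM would have far larger
# aggregate bandwidth; an assumed 80 TB/s aggregate (illustrative only) gives:
print(decode_upper_bound(LLAMA2_70B_FP16, 80e12))     # ~570 tokens/s
```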
In simple terms, the most tangible difference users will feel is speed.
Anyone who has used GPT knows how painful it is to wait for a large model to spit out characters one by one. With an LPU driving the model, responses become essentially real-time.
For example, when Wall Street News asked Groq about the difference between the LPU and the GPU, Groq generated its answer in under 3 seconds, without the noticeable lag of GPT or Gemini. Asked in English, it would have been even faster.
Groq's official materials also state that the innovative chip architecture allows multiple Tensor Streaming Processors (TSPs) to be connected together without the traditional bottlenecks of GPU clusters, making it highly scalable and simplifying the hardware requirements for large-scale AI models.
Energy efficiency is another highlight of the LPU. By cutting the overhead of managing multiple threads and avoiding underutilized cores, the LPU can deliver more computation per watt.
In interviews, Groq's founder and CEO Jonathan Ross never misses a chance to take a jab at NVIDIA. He has previously told the media that in large-scale inference scenarios, the Groq LPU chip is 10 times faster than NVIDIA's GPUs, at one-tenth the price and power consumption.
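Taken at face value, those claims compound: if throughput is 10 times higher while power and price are each a tenth, then the energy and hardware cost per generated token both drop by roughly a factor of 100. A small illustrative calculation using nothing but the claimed ratios (vendor statements, not measurements):

```python
# The per-token implications of Ross's claims (vendor figures, not measurements).

speedup     = 10.0   # claimed throughput vs. an NVIDIA GPU
power_ratio = 0.1    # claimed power draw vs. the GPU
price_ratio = 0.1    # claimed hardware price vs. the GPU

energy_per_token = power_ratio / speedup   # 0.01 -> ~1/100 the energy per token
capex_per_token  = price_ratio / speedup   # 0.01 -> ~1/100 the hardware cost per token

print(f"energy per token vs. GPU:        {energy_per_token:.2f}x")
print(f"hardware cost per token vs. GPU: {capex_per_token:.2f}x")

# Note: the cost analysis cited further below reaches a very different
# conclusion once the number of chips required is taken into account.
```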
Real-time inference means running data through a trained AI model and returning results instantly, giving AI applications a smooth end-user experience. As large AI models proliferate, the demand for real-time inference keeps growing.
Ross believes that for companies building artificial intelligence into their products, inference cost is becoming a problem: as the number of customers using those products grows, the cost of running the models climbs rapidly as well. Compared with NVIDIA GPUs, he argues, Groq LPU clusters will offer higher throughput, lower latency, and lower cost for large-scale inference.
He also emphasized that because Groq's chips take a different technological path, they are in more plentiful supply than NVIDIA's and will not be bottlenecked by suppliers such as Samsung or SK Hynix:
"What makes the GroqChip LPU unique is that it does not depend on HBM from Samsung or SK Hynix, nor on TSMC's CoWoS packaging technology for bonding external HBM onto the chip."
However, some AI experts have pointed out on social media that the actual cost of Groq chips is far from low.
For example, an analysis by AI expert Jia Yangqing suggests that Groq's overall cost is more than 30 times that of NVIDIA GPUs.
Because each Groq chip has only 230MB of memory, running the model requires 572 chips, bringing the total cost to about $11.44 million.
By comparison, a system with 8 H100s delivers roughly equivalent performance, but its hardware costs only about $300,000, with an annual electricity bill of around $24,000. Over a three-year horizon, the Groq system's total operating cost is far higher than the H100 system's.
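For readers who want to retrace the arithmetic, the sketch below lays out the comparison. The per-card price is back-calculated from the $11.44 million / 572-card total reported above (an assumption, not an official Groq price), and Groq's own electricity cost is omitted, which only strengthens the conclusion.

```python
# Retracing the cost comparison cited above (figures as reported; the per-card
# price is implied from the totals and is an assumption, not a list price).

GROQ_CARDS      = 572                      # cards needed given 230MB of memory per chip
GROQ_CARD_PRICE = 11.44e6 / GROQ_CARDS     # ~$20,000 per card (implied)

H100_SYSTEM_PRICE   = 300_000              # 8x H100 server with comparable throughput
H100_POWER_PER_YEAR = 24_000               # annual electricity, per the estimate
YEARS               = 3

groq_hardware = GROQ_CARDS * GROQ_CARD_PRICE                      # $11.44M, hardware only
h100_3yr_cost = H100_SYSTEM_PRICE + YEARS * H100_POWER_PER_YEAR   # ~$0.372M over 3 years

print(f"Groq hardware only:           ${groq_hardware / 1e6:.2f}M")
print(f"H100 3-year hardware + power: ${h100_3yr_cost / 1e6:.3f}M")
print(f"ratio (Groq hardware alone / H100 3-year): {groq_hardware / h100_3yr_cost:.0f}x")

# ~31x even before counting Groq's own electricity, in line with the
# "more than 30 times" figure above.
```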
Most critically, the LPU is currently used only for inference; training large models still requires buying NVIDIA GPUs.
Founded by a former Google TPU designer, who believes 1 million LPUs can be sold within 2 years
Before today's overnight internet fame, Groq had been quietly at work for more than seven years.
Public information shows that Groq was founded in 2016 and is headquartered in Mountain View, Santa Clara County, California. Its founder, Jonathan Ross, is a former senior Google engineer and one of the designers of Google's self-developed AI chip, the TPU. Product lead John Barrus previously held product executive roles at Google and Amazon. Estelle Hong, the company's Vice President and the only Chinese face among its senior executives, has been with Groq for four years and previously worked for the U.S. military and Intel.
In August last year, Groq also announced a partnership with Samsung: its next-generation chips will be manufactured on a 4-nanometer process at Samsung's chip plant in Texas, with mass production expected in the second half of 2024.
Looking ahead to the next-generation LPU, Ross expects GroqChip's energy efficiency to improve 15- to 20-fold, allowing more matrix compute and SRAM to be packed into devices within the same power envelope.
In an interview at the end of last year, Ross stated, "Considering the shortage and high cost of GPUs, I believe in Groq's future development potential."
"Within 12 months, we can deploy 100,000 LPUs, and within 24 months, we can deploy 1 million LPUs."