GPT-4 "The Ultimate Revelation": 18 trillion parameters, trained once for $63 million!

Wallstreetcn
2023.07.11 08:32

GPT-4 has been "open-sourced" again. SemiAnalysis has "unveiled" a wealth of information about GPT-4. The parameter scale is more than 10 times that of GPT-3, and it adopts the MoE model architecture. GPT-4 was trained with 13 trillion tokens.

As we all know, OpenAI is not "open"; especially since the release of GPT-4, the entire OpenAI team has remained tight-lipped about almost all information regarding the model.

However, this morning, Dylan Patel and Gerald Wong from the media outlet SemiAnalysis published an article titled "GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE," which exposed all the details of GPT-4, from its model architecture and training to its costs. Has GPT-4 been "open-sourced"?

The article provides a detailed introduction to the architecture, training, and inference infrastructure of GPT-4, as well as specific parameters and information such as the number of parameters, training dataset, token count, costs, and the Mixture of Experts (MoE) model.

It also delves deep into the various trade-offs that OpenAI faces in choosing different routes, and frankly states that the most interesting aspect of GPT-4 is understanding why OpenAI made certain architectural decisions.

https://www.semianalysis.com/p/gpt-4-architecture-infrastructure

It is worth noting that Dylan Patel was also behind the leak of Google's internal document ("We have no moat, and neither does OpenAI").

Recently, DeepMind CEO Hassabis confirmed the authenticity of the leaked Google document in an interview with the media.

Considering that Dylan Patel is the whistleblower, the credibility of this "big revelation" about GPT-4 has been further enhanced.

The article begins by pointing out that the reason OpenAI is not open is not to protect humanity from AI destruction, but because the large models they build are replicable. In the future, major internet companies and leading AI startups in both China and the United States will have the ability to build large models that can rival or even surpass GPT-4. And OpenAI's most enduring moat lies in their real user feedback, top engineering talent in the industry, and the leading position brought by their first-mover advantage.

Wall Street News has compiled the main content of the GPT-4 disclosure:

1.8 trillion parameters and model architecture

The article points out that GPT-4 has a total of about 1.8 trillion parameters across 120 layers, while GPT-3 has only about 175 billion parameters. In other words, the scale of GPT-4 is more than 10 times that of GPT-3.

OpenAI controls costs by using a Mixture of Experts (MoE) model. GPT-4 has 16 expert models, each with approximately 111 billion MLP parameters. Two of these experts are routed to in each forward pass.

The routing algorithm OpenAI uses for GPT-4 is reportedly quite simple. In addition, about 55 billion shared parameters are used for the attention mechanism.

In each forward-pass inference (generating one token), GPT-4 only needs to use about 280 billion parameters and roughly 560 TFLOPs. In comparison, a purely dense model would require about 1.8 trillion parameters and approximately 3,700 TFLOPs of computation for each forward pass.
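
To see how these numbers hang together, here is a quick back-of-the-envelope check, using only the figures quoted in this article (the exact rounding is ours):

```python
# Back-of-the-envelope check of the active-parameter figure from the
# per-expert and attention numbers quoted in the article.
EXPERT_MLP_PARAMS = 111e9      # ~111B MLP parameters per expert
EXPERTS_PER_TOKEN = 2          # two experts are routed to per forward pass
SHARED_ATTN_PARAMS = 55e9      # ~55B shared attention parameters

active = EXPERTS_PER_TOKEN * EXPERT_MLP_PARAMS + SHARED_ATTN_PARAMS
print(f"active parameters per token: ~{active/1e9:.0f}B")   # ~277B, close to the quoted ~280B

# The quoted compute figures scale in the same proportion as the
# active-to-total parameter ratio, as expected for an MoE forward pass.
print(f"param ratio:   {280/1800:.2f}")   # ~0.16
print(f"compute ratio: {560/3700:.2f}")   # ~0.15
```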

Composition of the dataset

OpenAI trained GPT-4 with 13 trillion tokens. Because there were not enough high-quality tokens, the dataset was repeated over multiple epochs.

Number of epochs: 2 epochs for text-based data and 4 epochs for code-based data.

During pre-training, GPT-4 used a context length (seqlen) of 8k, and the 32k version was fine-tuned based on the pre-trained 8K version.

The batch size was ramped up gradually on the cluster over the first few days, eventually reaching 60 million tokens. Of course, since not every expert sees all tokens, this works out to roughly 7.5 million tokens per expert per batch.

Real batch size: divide this token count by the sequence length (seqlen) to get the number of sequences.
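
For concreteness, here is that division, assuming the 8k pre-training context length mentioned above and treating the quoted batch size as a token count (our reading of the leak):

```python
# Convert the quoted token-level batch sizes into sequence counts,
# assuming the 8k pre-training context length described above.
BATCH_TOKENS = 60_000_000          # quoted global batch size, in tokens
TOKENS_PER_EXPERT = 7_500_000      # quoted per-expert share of the batch
SEQ_LEN = 8_192                    # 8k pre-training context length

print(f"global batch: ~{BATCH_TOKENS // SEQ_LEN:,} sequences")           # ~7,300 sequences
print(f"per-expert batch: ~{TOKENS_PER_EXPERT // SEQ_LEN:,} sequences")  # ~900 sequences
```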

OpenAI's parallel strategy

The parallelization strategy is crucial for the A100 GPUs. In order to parallelize across all of their A100 GPUs, OpenAI adopts 8-way tensor parallelism, as this is the limit of NVLink. In addition, OpenAI is said to adopt 15-way pipeline parallelism.

Theoretically, considering data communication and computation time, 15 pipeline stages is quite a lot. But once the KV cache and overhead are added in, the architecture makes sense if OpenAI mostly uses 40GB A100 GPUs. That said, the author states that he does not fully understand how OpenAI avoids huge pipeline "bubbles" with such high pipeline parallelism; it is very likely that OpenAI has simply borne the cost of these bubbles.

Training cost: a single training run costs about 63 million USD

OpenAI trained GPT-4 with approximately 2.15e25 FLOPs of compute, using around 25,000 A100 GPUs for 90 to 100 days, at a model FLOPS utilization (MFU) between 32% and 36%. The large number of failures, which required restarting training from earlier checkpoints, is one reason for the low utilization.

Another reason is that the all-reduce operation between so many GPUs is very expensive.

If the cost of OpenAI's cloud computing is approximately 1 USD per A100 GPU hour, then under these conditions, the cost of this training session alone is approximately 63 million USD.

This does not include all the experiments, failed training sessions, and other costs such as data collection, RLHF, and labor costs.

Taking into account all these factors, the actual cost is much higher.

However, in today's conditions, with a cost of 2 USD per H100 GPU hour, pre-training can be done on approximately 8,192 H100 GPUs in just 55 days, at a cost of 21.5 million USD.
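
These dollar figures can be roughly reproduced from the GPU counts, run lengths, and hourly prices quoted above; a minimal sketch, with the ~95-day midpoint of the quoted 90-100-day range as our own assumption:

```python
# Rough reproduction of the quoted training-cost figures from GPU count,
# duration, and hourly price. All inputs are the article's own numbers,
# except the 95-day midpoint, which is our assumption.
def training_cost(gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Total cost in USD for a cluster of `gpus` running for `days`."""
    return gpus * days * 24 * usd_per_gpu_hour

# A100 run: ~25,000 GPUs for 90-100 days at ~$1 per GPU-hour
print(f"A100 run: ${training_cost(25_000, 95, 1.0)/1e6:.0f}M")   # ~$57M, same ballpark as the quoted $63M
# H100 scenario: ~8,192 GPUs for ~55 days at ~$2 per GPU-hour
print(f"H100 run: ${training_cost(8_192, 55, 2.0)/1e6:.1f}M")    # ~$21.6M, matching the quoted $21.5M
```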

Trade-offs when using the Expert Model

MoE (Mixture of Experts) is a good way to reduce the number of parameters used during inference while increasing the total parameter count.

If OpenAI really wanted to train to a Chinchilla-optimal level, they would need to train on twice as many tokens.

There are many reasons for adopting a relatively small number of expert models. One of the reasons why OpenAI chose 16 expert models is that it is difficult for more expert models to generalize and converge when performing many tasks.

Inference Cost of GPT-4

The cost of GPT-4 is three times that of the Davinci model with 175 billion parameters, even though its feed-forward parameters only increase by 1.6 times. This is mainly because GPT-4 requires a larger cluster and achieves lower utilization.

The author believes that when inferring with 128 A100 GPUs, the cost of GPT-4 for a sequence length of 8k, per 1000 tokens, is 0.0049 USD, while inferring GPT-4 with 128 H100 GPUs for the same sequence length costs 0.0021 USD per 1000 tokens. Please note that this is assuming a relatively high utilization rate and maintaining a high batch size. However, it is evident that OpenAI's utilization rate is sometimes very low.

Multi-Query Attention

OpenAI, like other major companies, also uses MQA.

Simply put, MQA needs only a single key/value head, which can significantly reduce the memory usage of the KV cache. Even so, the 32k-context GPT-4 definitely cannot run on a 40GB A100, and the 8k version is constrained by its maximum batch size.

Continuous Batching

OpenAI has implemented variable batch sizes and continuous batching.

This is done to cap maximum latency to a certain degree while optimizing inference costs.

Speculative Decoding

OpenAI uses "speculative decoding" in the inference process of GPT-4.

The basic principle of "speculative decoding" is to use a smaller, faster draft model to decode multiple tokens in advance, and then input them as a batch into the prediction model. If OpenAI uses speculative decoding, they may only use it in sequences of about 4 tokens.

Visual Multimodality

It is a visual encoder independent of the text encoder, with cross-attention between the two, similar to Flamingo. This adds more parameters to the 1.8 trillion parameters of GPT-4.

The multimodal capability of GPT-4 was fine-tuned with approximately 2 trillion additional tokens after text pre-training. It is said that OpenAI originally intended to train the visual model from scratch, but because that approach was not yet mature, they chose to fine-tune starting from the text-only model.

As for the next-generation model GPT-5, it will start visual training from scratch and be able to generate images and even audio on its own.

The following is the full text, translated by Newin with the help of GPT:

OpenAI keeps the GPT-4 architecture closed, not because it poses some kind of risk to humanity, but because the content they build is replicable. In fact, we expect companies like Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, Baidu, and others to have models with the same or even greater capabilities as GPT-4 in the short term.

Please do not misunderstand, OpenAI has amazing engineering capabilities, and what they have built is incredible, but the solutions they have found are not magic. It is an elegant solution that involves many complex trade-offs. Scaling is just part of the battle. OpenAI's most enduring competitive advantage lies in having the most practical applications, leading engineering talent, and the ability to surpass other companies with future models.

We have gathered a wealth of information about GPT-4 from multiple sources, and today we would like to share some of it. This includes the model architecture, training infrastructure, inference infrastructure, number of parameters, composition of the training dataset, token count, number of layers, parallel strategy, multimodal visual adaptation, the thought process behind different engineering trade-offs, and the unique techniques they have implemented to alleviate some of the major bottlenecks associated with large-scale model inference.

The most interesting aspect of GPT-4 is understanding why they made certain architectural decisions.

Additionally, we will outline the cost of training and inferring GPT-4 on A100, as well as how it scales with H100 in the next generation model architecture.

First, let's take a look at the problem statement. From GPT-3 to GPT-4, OpenAI aims to scale up by a factor of 100, but the problem lies in the cost. Dense Transformer models cannot scale much further. Dense Transformer is the model architecture used by OpenAI GPT-3, Google PaLM, Meta LLaMA, TII Falcon, MosaicML MPT, and other models. We can easily list over 50 companies training LLMs with this same architecture. It's a good architecture, but it has limitations when it comes to scaling.

Prior to the release of GPT-4, we discussed the relationship between training cost and the impending AI brick wall. There, we revealed OpenAI's high-level approach to the architecture and training cost of GPT-4 in relation to various existing models.

Over the past six months, we have come to realize that training cost is irrelevant.

Of course, it may seem crazy to spend tens or even hundreds of millions of dollars in compute time to train a model, but for these companies, it is a negligible expense. It is essentially a fixed capital expenditure that always yields better results when scaled up. The only limiting factor is scaling the compute to a time scale where humans can provide feedback and modify the architecture.

In the coming years, multiple companies like Google, Meta, and OpenAI/Microsoft will train models on supercomputers worth over a trillion dollars. Meta burns $16 billion annually on the "Metaverse," Google wastes $10 billion each year on various projects, Amazon loses over $50 billion on Alexa, and cryptocurrencies squander over $100 billion on worthless things.

These companies, and society as a whole, can and will spend over a trillion dollars on creating supercomputers capable of training single massive models. These gigantic models can then be productized in various ways. This work will be replicated across multiple countries and companies. It is a new space race. Unlike previous wastefulness, artificial intelligence now has tangible value and will be realized in the short term through human assistants and autonomous agents. Expanding artificial intelligence is more important than inference.

The goal is to separate training computation from inference computation. That's why it makes sense to train beyond the optimal range of Chinchilla, regardless of the model to be deployed. That's why sparse model architectures are used; not every parameter needs to be activated during inference.

The real challenge is the high cost of scaling these models for users and agents. The cost of inference is much higher than the cost of training. This is OpenAI's innovation goal in model architecture and infrastructure.

Inference for large models is a multivariate problem, and model size is fatal for dense models. We have discussed the issues related to edge computing in detail here, but the problem statement in data centers is very similar. Simply put, devices can never have enough memory bandwidth to achieve the desired throughput level of large language models. Even if the bandwidth is sufficient, the utilization of hardware computing resources on edge computing devices will be very low.

Utilization is crucial in data centers and the cloud. One of the reasons Nvidia is appreciated for its excellent software is that it constantly updates low-level software to improve the utilization of FLOPS by moving data more intelligently within and between chips and memory.

In most current use cases, the goal of LLM inference is to run as a real-time assistant, which means it must achieve a high enough throughput for users to truly use it. The average human reading speed is about 250 words per minute, but some people can read up to 1,000 words per minute. This means you need to output at least 8.33 tokens per second, but closer to 33.33 tokens per second to handle all cases.
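
The quoted token rates line up with the reading speeds if one assumes roughly two tokens per word; that conversion factor is our assumption (the real ratio depends on the tokenizer and language), but it shows where 8.33 and 33.33 come from:

```python
# How the words-per-minute figures map to the quoted token rates,
# assuming roughly 2 tokens per word (our assumption, not a figure
# from the article; actual ratios depend on the tokenizer).
TOKENS_PER_WORD = 2.0

def tokens_per_second(words_per_minute: float) -> float:
    return words_per_minute * TOKENS_PER_WORD / 60

print(f"{tokens_per_second(250):.2f} tok/s")    # 8.33 - average reader
print(f"{tokens_per_second(1000):.2f} tok/s")   # 33.33 - fast reader
```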

Based on the memory bandwidth requirements, a dense model with one trillion parameters cannot achieve this throughput on the latest Nvidia H100 GPU server.

Each generated token requires loading each parameter from memory to the chip. The generated token is then input into the prompt and generates the next token. In addition, streaming transfer KV cache for attention mechanism requires additional bandwidth.

The chart in the original article assumes that, due to the inability to fuse every operation, the memory bandwidth required by the attention mechanism, and hardware overhead, the achievable efficiency is equivalent to that of parameter reads alone. In reality, even with "optimized" libraries like Nvidia's FasterTransformer, the total overhead is even greater.

That chart shows the memory bandwidth required to serve an LLM at high enough throughput for a single user. It shows that even with 8 H100s, it is impossible to serve a dense model with one trillion parameters at 33.33 tokens per second. Furthermore, at 20 tokens per second, the FLOPS utilization of 8 H100s would still be less than 5%, resulting in a very high inference cost. In effect, today's 8-way tensor-parallel H100 systems have an inference limit of roughly 300 billion feed-forward parameters for dense models.
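
A rough bandwidth calculation, using the article's own ~3 TB/s per-H100 figure and an FP16 weights assumption of ours, illustrates why the claim holds even at perfect efficiency:

```python
# Why a ~1-trillion-parameter dense model cannot hit 33.33 tokens/s on
# 8x H100, bandwidth-wise: every parameter must be streamed from memory
# for each generated token. FP16 weights are our assumption; the per-GPU
# bandwidth figure is the one used later in this article.
PARAMS = 1.0e12            # dense model size
BYTES_PER_PARAM = 2        # FP16
TARGET_TOKENS_PER_S = 33.33
HBM_BW_PER_H100 = 3.0e12   # bytes/s, the ~3 TB/s figure quoted below

required_bw = PARAMS * BYTES_PER_PARAM * TARGET_TOKENS_PER_S   # ~66.7 TB/s
available_bw = 8 * HBM_BW_PER_H100                             # ~24 TB/s
print(f"required:  {required_bw/1e12:.1f} TB/s")
print(f"available: {available_bw/1e12:.1f} TB/s (8x H100, ideal)")
# Even at perfect efficiency the required bandwidth is well over twice
# what the server offers, so the throughput target is unreachable.
```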

However, OpenAI is achieving human reading speed using A100, with model parameters exceeding 1 trillion, and offering it widely at a low price of only $0.06 per 1,000 tokens. This is because it is sparse, meaning not every parameter is used.

Below, we cover the model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token count, layer count, parallel strategy, multimodal visual encoder, the thought process behind different engineering trade-offs, the unique techniques implemented, and how they alleviate some of the major bottlenecks related to large-scale model inference in GPT-4.

1 GPT-4 Model Architecture

GPT-4 is more than 10 times the scale of GPT-3. As far as we know, it has approximately 1.8 trillion parameters distributed across 120 layers, while GPT-3 has approximately 175 billion parameters.

OpenAI has successfully controlled costs by using a mixture of experts (MoE) model. If you are not familiar with MoE, please read our article from six months ago about the general GPT-4 architecture and training costs.

In addition, OpenAI uses 16 experts in its model, with each expert's MLP parameters being approximately 111 billion. Two experts are routed to each forward pass.

Although the literature discusses advanced routing algorithms for determining which expert to route each token to, it is reported that the routing algorithm in OpenAI's current GPT-4 model is quite simple.
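
For illustration, here is a minimal sketch of what simple top-2 routing looks like in a generic MoE layer. This is a textbook-style toy, not OpenAI's actual routing code; the softmax gate, shapes, and toy experts are all our assumptions:

```python
import numpy as np

def top2_route(x, gate_weights, experts):
    """Minimal top-2 MoE routing sketch.

    x            : (d_model,) hidden state for one token
    gate_weights : (d_model, n_experts) learned router matrix
    experts      : list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_weights                      # (n_experts,) router scores
    top2 = np.argsort(logits)[-2:]                 # indices of the 2 best experts
    # Softmax over just the selected experts' scores gives mixing weights.
    w = np.exp(logits[top2] - logits[top2].max())
    w /= w.sum()
    # The token's output is the weighted sum of the two chosen experts'
    # outputs; the other experts are never evaluated for this token.
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

# Toy usage with 16 random linear "experts".
d, n_experts = 64, 16
rng = np.random.default_rng(0)
experts = [lambda v, W=rng.normal(size=(d, d)) / d**0.5: W @ v for _ in range(n_experts)]
gate = rng.normal(size=(d, n_experts))
y = top2_route(rng.normal(size=d), gate, experts)
print(y.shape)   # (64,)
```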

Furthermore, the attention mechanism shares approximately 55 billion parameters.

Each forward pass inference (generating 1 token) only uses approximately 280 billion parameters and 560 TFLOPS. This is in contrast to purely dense models, which require approximately 1.8 trillion parameters and 3700 TFLOPS per forward pass.

2 Dataset Integration

OpenAI trained GPT-4 on approximately 13 trillion tokens. Considering that RefinedWeb's CommonCrawl contains approximately 5 trillion high-quality tokens, this makes sense. For reference, Deepmind's Chinchilla model and Google's PaLM model were trained on approximately 1.4 trillion tokens and 0.78 trillion tokens, respectively. It is even claimed that PaLM 2 was trained on approximately 5 trillion tokens.

The dataset does not consist of 13 trillion unique tokens. Instead, due to the lack of high-quality tokens, the dataset contains multiple epochs: 2 epochs for the text data and 4 epochs for the code data. Interestingly, this is far from Chinchilla-optimal, which would call for training the model on twice as many tokens. This suggests a lack of easily accessible tokens on the web. There are said to be 1,000 times more high-quality text tokens out there, and even more audio and visual tokens, but obtaining them is not as simple as web scraping.

They have millions of lines of instruction fine-tuning data from Scale AI and internally, but unfortunately, we don't have much information about their reinforcement learning data.

The context length during the pre-training phase is 8k. The 32k token length version is fine-tuned based on the 8k base after pre-training.

The batch size gradually increases over a few days, but in the end, OpenAI uses a batch size of 60 million! Of course, since not every expert sees all the tokens, this actually means that each expert processes 7.5 million tokens per batch.

3 Parallel Strategy

The strategy of parallelization across all the A100 GPUs is crucial. They adopt 8-way tensor parallelism because it is the limit of NVLink. In addition, we heard that they use 15-way pipeline parallelism. From the perspective of computation time and data communication, 15-way pipeline parallelism is theoretically a lot, but it makes sense if they are limited by memory capacity.

In pure pipeline + tensor parallelism, each GPU only needs about 30GB of parameters (FP16). Once the KV cache and overhead are added, this theoretically makes sense if most of OpenAI's GPUs are 40GB A100s. They may be using ZeRO Stage 1, and they may be using block-level FSDP or hybrid sharded data parallelism.
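
The ~30GB figure follows from dividing the parameters evenly across the 8 x 15 = 120 GPUs of one tensor-parallel/pipeline-parallel group; a quick check, ignoring embeddings, activations, optimizer state, and the KV cache (a simplification of ours):

```python
# Rough check of the "~30GB of parameters per GPU in FP16" claim under
# 8-way tensor parallelism x 15-way pipeline parallelism, assuming a
# perfectly even parameter split (a simplification).
TOTAL_PARAMS = 1.8e12
TP, PP = 8, 15
BYTES_FP16 = 2

params_per_gpu = TOTAL_PARAMS / (TP * PP)       # ~15B parameters per GPU
print(f"~{params_per_gpu * BYTES_FP16 / 1e9:.0f} GB of FP16 weights per GPU")  # ~30 GB
```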

As for why they didn't use full-model FSDP, it may be because of the high communication overhead. Although most of OpenAI's nodes have high-speed network connections between them, not all nodes do. We believe that the bandwidth between at least some clusters is much lower than others.

We don't understand how they avoid huge bubbles in each batch with such high pipeline parallelism. It is likely that they simply bear this cost.

4 Training Cost

In the training of GPT-4, OpenAI used approximately 25,000 A100 chips and achieved a model FLOPS utilization (MFU) of about 32% to 36% over a period of 90 to 100 days. This extremely low utilization is partly due to a large number of failures that required restarting from checkpoints, and partly due to the aforementioned bubble cost being very high.

Another reason is that the cost of all-reduce operations across so many GPUs is extremely high. If our guess is correct, then the cluster is actually composed of many smaller clusters with very weak network connections between them, i.e., there is non-blocking 800G/1.6T networking within sections of the cluster, but those sections can only connect to each other at 200G/400G.

If their cost in the cloud is about $1 per hour for an A100 chip, the cost of this training alone is about $63 million. This does not take into account all the experiments, failed training runs, and other costs such as data collection, reinforcement learning, and personnel costs. Due to these factors, the actual cost is much higher. Additionally, this means that you need someone to purchase chips/networks/data centers, bear the capital expenditure, and rent them to you.

Currently, using about 8,192 H100 chips at a price of $2 per hour, pre-training can be completed in about 55 days at a cost of about $21.5 million. It should be noted that we believe that by the end of this year, there will be 9 companies that will have more H100 chips. Not all of these companies will use them all for a single training run, but those that do will have larger-scale models. Meta will have over 100,000 H100 chips by the end of this year, but a significant number of chips will be distributed in their data centers for inference. Their largest single cluster will still exceed 25,000 H100 chips.

By the end of this year, many companies will have enough computing resources to train models of a scale comparable to GPT-4.

5 Trade-offs of MoE

MoE is a good way to reduce the number of parameters used during inference while increasing the total parameter count, which is necessary for encoding more information per training token, since obtaining enough high-quality tokens is very difficult. If OpenAI were really trying to be Chinchilla-optimal, they would have to train on twice as many tokens.

Nevertheless, OpenAI has made multiple trade-offs. For example, MoE is very difficult to handle during inference because each part of the model is not used for every token generation. This means that certain parts may be idle while others are in use when serving users. This has a significant negative impact on utilization.

Researchers have shown that using 64 to 128 experts results in smaller losses than using 16 experts, but that is purely a research result. There are multiple reasons for reducing the number of experts. One of the reasons OpenAI chose 16 experts is because more experts are difficult to generalize across many tasks. Using more experts may also make convergence more difficult. In such a large-scale training run, OpenAI chooses to be more conservative in the number of experts.

In addition, reducing the number of experts also helps their inference infrastructure. There are various trade-offs when adopting an MoE inference architecture. Before discussing the trade-offs faced by OpenAI and the choices they made, let's start with the basic trade-offs of LLM inference.

6 Trade-offs in Inference

By the way, before we begin, we would like to point out that everyone we have talked to at every LLM company thinks that Nvidia's FasterTransformer inference library is quite bad, and TensorRT is even worse. Because Nvidia's templates cannot simply be taken and modified, people end up building their own solutions from scratch. If you are an Nvidia employee reading this article, you need to address this issue as soon as possible, otherwise the default choice will become open tools, making it easier to add third-party hardware support. A wave of massive models is coming. If there is no software advantage in inference and kernels still need to be written by hand, then AMD's MI300 and other hardware will have a much larger market.

In the inference of large language models, there are three main trade-offs that occur between batch size (concurrent number of users of the service) and the number of chips used.

  1. Latency - The model must respond with reasonable latency. People don't want to wait several seconds before output starts flowing into the chat application. Prefill (processing input tokens) and decode (generating output tokens) take different amounts of time.
  2. Throughput - The model must output a certain number of tokens per second. Approximately 30 tokens per second are required for human use. Lower and higher throughputs can be acceptable for various purposes.
  3. Utilization - The hardware running the model must achieve high utilization, otherwise the cost will be too high. Although accepting higher latency and lower throughput allows more user requests to be grouped together for higher utilization, doing so makes the problem harder.

LLM inference is all about balancing two main factors: memory bandwidth and computation. In the most oversimplified terms, each parameter must be read, and 2 FLOPs are associated with it. Therefore, the ratio of most chips (e.g., the H100 SXM, with only 3TB/s of memory bandwidth but 2,000 TFLOP/s of FP8) is completely unbalanced for inference at a batch size of 1. If the service is provided for only one user at a batch size of 1, the memory bandwidth required dominates the time to generate each token; the computation time is almost zero. To scale large language models to multiple users efficiently, the batch size must exceed 4, so that multiple users share the cost of reading the parameters. For example, at a batch size of 256 or 512, that is 512 or 1,024 FLOPs per byte of memory read.

This ratio is closer to the proportion between the memory bandwidth and FLOPS of H100. This helps achieve higher utilization, but at the cost of increased latency.
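
A small sketch of the arithmetic-intensity argument above, using the article's H100 figures; treating FP8 parameters as one byte each and two FLOPs per parameter per token is the same simplification the text makes:

```python
# Arithmetic-intensity sketch behind the batch-size argument above.
# Each FP8 parameter (1 byte) is read once per batch and used for
# 2 FLOPs per token in the batch, so intensity = 2 * batch_size FLOPs/byte.
H100_FP8_FLOPS = 2_000e12   # ~2,000 TFLOP/s of FP8, figure used above
H100_HBM_BW = 3e12          # ~3 TB/s, figure used above

def arithmetic_intensity(batch_size: int) -> int:
    return 2 * batch_size            # FLOPs per byte of parameters read

balance = H100_FP8_FLOPS / H100_HBM_BW          # ~667 FLOPs/byte
for b in (1, 256, 512):
    side = "memory-bound" if arithmetic_intensity(b) < balance else "compute-bound"
    print(f"batch {b:4d}: {arithmetic_intensity(b):4d} FLOPs/byte -> {side}")
# Batch 1 is hopelessly memory-bound; batches of a few hundred approach
# the H100's compute/bandwidth balance point of roughly 667 FLOPs per byte.
```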

Many people consider memory capacity as a major bottleneck for LLM inference because large models require multiple chips for inference, and larger memory capacity reduces the number of chips it can accommodate. However, it is actually better to use chips with capacity exceeding the requirement in order to reduce latency, improve throughput, and enable larger batch sizes for higher utilization.

Google demonstrated these trade-offs in their PaLM inference paper. However, it is worth noting that this is for dense models like PaLM, not sparse models like GPT-4.

If an application requires minimal latency, we need to apply more chips and divide the model into as many parts as possible. Smaller batch sizes usually achieve lower latency, but smaller batch sizes also result in poorer utilization, leading to higher overall cost per token (in chip-seconds or dollars). If an application requires offline inference and latency is not an issue, the main goal is to maximize the throughput per chip (i.e., minimize the overall cost per token).

Increasing batch size is the most efficient approach because larger batches generally achieve better utilization. However, certain partitioning strategies that are inefficient for small batch sizes become efficient as the batch size increases. More chips and larger batch sizes are cheaper because they increase utilization, but they also introduce a third variable, network time. Some methods that partition the model across different chips are more efficient for latency but trade off with utilization.

Memory time and non-attention computation time are directly proportional to the model size and inversely proportional to the number of chips. However, for a given partition layout, the time required for chip-to-chip communication decreases slowly (or not at all), so it becomes increasingly important and a bottleneck as the number of chips increases. While we have only briefly discussed it today, it should be noted that as batch size and sequence length increase, the memory requirements for the KV cache increase dramatically. If an application needs to generate text with long attention contexts, the inference time will increase significantly.

For a 500B+ model with multi-head attention, the attention KV cache becomes very large: for a batch size of 512 and a context length of 2048, the total KV cache reaches 3TB, which is three times the size of the model parameters. The KV cache must be streamed from memory onto the chip, and while it is loading, the chip's compute cores are essentially idle. Longer sequence lengths are particularly detrimental to memory bandwidth and capacity. The cost of OpenAI's 16k-sequence-length GPT-3.5 Turbo and 32k-sequence-length GPT-4 is much higher because memory limitations prevent them from using larger batch sizes.

A lower batch size leads to lower hardware utilization. Additionally, as the sequence length increases, the KV cache also becomes larger. The KV cache cannot be shared among users, so it requires separate memory reads, further becoming a bottleneck for memory bandwidth.
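
A generic KV-cache sizing formula makes the scaling explicit. The layer count and hidden size below are hypothetical placeholders rather than any real model's dimensions; the point is that the cache grows linearly in both batch size and context length:

```python
# Generic KV-cache sizing sketch for a multi-head-attention decoder.
# The dimensions below are hypothetical, not GPT-4's or PaLM's real
# configuration; the takeaway is the linear scaling in batch and context.
def kv_cache_bytes(n_layers, d_model, batch, seqlen, bytes_per_elem=2):
    # 2 tensors (K and V), one d_model-wide vector per layer per token.
    return 2 * n_layers * d_model * batch * seqlen * bytes_per_elem

cfg = dict(n_layers=96, d_model=12_288)     # hypothetical large-model dims
for batch, seqlen in [(1, 2_048), (512, 2_048), (512, 8_192)]:
    gb = kv_cache_bytes(batch=batch, seqlen=seqlen, **cfg) / 1e9
    print(f"batch={batch:4d} seqlen={seqlen:5d}: KV cache ~{gb:,.0f} GB")
# Growing the batch 512x or the context 4x grows the cache by the same
# factor, which is why long contexts cap the usable batch size.
```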

7 Trade-offs and Infrastructure of GPT-4 Inference

All of the above is challenging in GPT-4 inference, but the Mixture of Experts (MoE) architecture introduces a whole new set of difficulties. The forward pass for each token generation can be routed to a different set of experts. This poses a challenge for the trade-off between throughput, latency, and utilization at large batch sizes.

OpenAI's GPT-4 has 16 experts, with 2 experts per forward pass. This means that if the batch size is 8, the parameter read for each expert may only be a batch size of 1. What's worse is that one expert may have a batch size of 8, while others may have 4, 1, or 0. With each token generation, the routing algorithm sends the forward pass in different directions, resulting in significant variations in token-to-token latency and expert batch sizes. The choice of fewer experts is one of the main reasons why OpenAI opted for the inference infrastructure. If they had chosen more experts, memory bandwidth would have become a bottleneck for inference.

OpenAI often achieves batch sizes of 4k+ on the inference cluster, which means that even with optimal load balancing between experts, the batch size per expert is only about 500. This requires a significant amount of usage to achieve. We understand that OpenAI runs inference on a cluster consisting of 128 GPUs. They have multiple such clusters in different data centers and locations. Inference is performed on 8-way tensor parallelism and 16-way pipeline parallelism. Each node consisting of 8 GPUs has only about 130B parameters, which is less than 30GB per GPU in FP16 mode and less than 15GB in FP8/int8 mode. This allows inference to run on a 40GB A100 chip, provided that the KV cache size for all batches does not become too large.
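
Two quick checks on these cluster figures, assuming a batch of exactly 4,096, perfectly even routing, and an even split of the 1.8T parameters across 128 GPUs (all simplifications of ours):

```python
# Quick checks on the inference-cluster figures quoted above. Treating
# the "4k+" batch as 4,096 and assuming perfectly even routing and an
# even parameter split are our simplifications; real per-expert batches
# vary from token to token.
GLOBAL_BATCH = 4_096
EXPERTS_PER_TOKEN = 2
N_EXPERTS = 16
print(f"ideal per-expert batch: {GLOBAL_BATCH * EXPERTS_PER_TOKEN // N_EXPERTS}")  # 512, i.e. "about 500"

TOTAL_PARAMS = 1.8e12
TP, PP = 8, 16                                   # 8-way tensor x 16-way pipeline = 128 GPUs
params_per_gpu = TOTAL_PARAMS / (TP * PP)        # ~14B parameters per GPU
print(f"FP16 weights per GPU: ~{params_per_gpu * 2 / 1e9:.0f} GB")   # ~28 GB, under the quoted 30 GB
print(f"int8 weights per GPU: ~{params_per_gpu * 1 / 1e9:.0f} GB")   # ~14 GB, under the quoted 15 GB
```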

A single layer with various experts is not split across different nodes because it would make the network traffic too irregular and the cost of recomputing the KV cache between each token generation too high. For any future MoE model expansion and conditional routing, handling the routing of the KV cache is a major challenge.

The model has 120 layers, so it is straightforward to evenly distribute them across 15 different nodes. However, placing fewer layers on the main node of the inference cluster makes sense because the first node needs to perform data loading and embedding. Additionally, we have heard some rumors about speculative decoding in inference, which we will discuss later, but we are unsure whether to believe these rumors. This can also explain why the main node needs to include fewer layers.

8 Cost of GPT-4 Inference

The cost of GPT-4 is three times that of the Davinci model with 175B parameters, even though its feedforward parameters have only increased by 1.6 times. This is mainly because GPT-4 requires a larger cluster and achieves lower utilization.

We believe that the cost per 1,000 tokens for inferring GPT-4 with 128 A100s at an 8k sequence length is 0.0049 USD, while with 128 H100s it is 0.0021 USD.

It is worth noting that we assume high utilization and maintain a high batch size. This may be an incorrect assumption as it is evident that OpenAI sometimes has very low utilization. We assume that OpenAI shuts down clusters during low periods and reconfigures these nodes to resume training on smaller test models from checkpoints, experimenting with various new techniques. This helps to reduce the cost of inference. If OpenAI does not do this, their utilization will be lower, and our cost estimate will increase by more than double.

9 Multi-Query Attention

MQA is a technique other companies are also using, but we want to point out that OpenAI uses it as well. In short, only one key/value head is needed, which greatly reduces the memory footprint of the KV cache. Even so, GPT-4 with a 32k sequence length definitely cannot run on a 40GB A100 chip, and GPT-4 with an 8k sequence length is limited by the maximum batch size. Without MQA, the maximum batch size of the 8k GPT-4 would be severely restricted, making it economically infeasible.
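
A one-line way to see the effect: with MQA there is a single key/value head instead of one per attention head, so the KV cache shrinks by the head count. The head counts below are hypothetical examples, not GPT-4's actual configuration:

```python
# How MQA shrinks the KV cache: keys and values are stored for a single
# head rather than one per attention head, so the cache shrinks by the
# head count. The head counts below are hypothetical examples.
def kv_cache_ratio(n_heads: int, n_kv_heads: int = 1) -> float:
    return n_heads / n_kv_heads

for heads in (64, 96, 128):
    print(f"{heads} heads -> KV cache ~{kv_cache_ratio(heads):.0f}x smaller with MQA")
```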

10 Continuous Batching

OpenAI has implemented variable batch sizes and continuous batching. This allows for maximum latency to some extent and optimizes the cost of inference. If you are unfamiliar with this concept, this article written by AnyScale is worth reading.

11 Speculative Decoding

We have heard from reliable sources that OpenAI uses speculative decoding in GPT-4 inference. We are not sure if we fully believe this. The widespread variation in token-to-token latency and the differences observed when performing simple retrieval tasks versus more complex tasks suggest that this is possible, but there are too many variables to be certain. Just in case, we will use some text from "Using Segment-Level Speculative Decoding to Accelerate LLM Inference" here and make slight modifications/additions for clarification.

Using LLM is typically divided into two stages. The first stage is pre-filling, where the prompt text is used to generate a KV cache and the logits (probability distribution of possible token outputs) for the first output. This stage is usually fast because the entire prompt text can be processed in parallel.

The second stage is decoding. A token is selected from the output logits and fed back into the model to generate the logits for the next token. This process is repeated until the desired number of tokens is generated. Because decoding must be done sequentially, the weights must be streamed through the compute units to generate each single token. Therefore, the arithmetic intensity (i.e., the ratio of FLOPs to bytes of memory bandwidth) of this second stage is very low when running in small batches.

Therefore, decoding is usually the most expensive part of autoregressive generation. That's why in OpenAI's API calls, input tokens are much cheaper than output tokens.

The basic idea behind speculative decoding is to use a smaller, faster draft model to pre-decode multiple tokens and then feed them as a batch to the oracle model. If the draft model's predictions for these tokens are correct, i.e., agreed upon by the larger model, then multiple tokens can be decoded in one batch, saving a significant amount of memory bandwidth and time per token.

However, if the larger model rejects a token predicted by the draft model, the rest of the batch is discarded and the algorithm naturally falls back to standard token-by-token decoding. Speculative decoding may also involve rejection sampling to sample from the original distribution. Note that this is only useful in small-batch settings where bandwidth is the bottleneck.
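
To make the mechanism concrete, here is a minimal greedy-acceptance sketch of speculative decoding. It is a generic illustration, not OpenAI's implementation; `draft_next`, `oracle_next`, and the toy models are stand-ins we introduce for the example:

```python
# Minimal speculative-decoding sketch: a cheap draft model proposes k
# tokens, the large "oracle" model verifies them (in practice, in one
# batched forward pass), and the longest agreeing prefix is accepted.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     oracle_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Oracle verifies the proposed tokens position by position.
    accepted: List[int] = []
    ctx = list(prefix)
    for t in proposed:
        o = oracle_next(ctx)
        if o != t:                 # first disagreement: keep the oracle's
            accepted.append(o)     # token and discard the rest of the draft
            break
        accepted.append(t)
        ctx.append(t)
    return prefix + accepted       # between 1 and k tokens added per step

# Toy usage: both "models" emit (last token + 1) mod 100, so they always
# agree and every step accepts all k draft tokens.
toy = lambda ctx: (ctx[-1] + 1) % 100
print(speculative_step([7], toy, toy, k=4))   # [7, 8, 9, 10, 11]
```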

Speculative decoding trades computation for bandwidth. It has two key advantages as a performance optimization target. First, it does not degrade model quality at all. Second, the advantages it provides are often orthogonal to other methods, because its performance comes from transforming sequential execution into parallel execution.

Current speculative methods predict a single sequence per batch. However, this does not scale well to large batch sizes or to a poorly aligned draft model. Intuitively, the probability of two models agreeing on a long consecutive sequence decreases exponentially, which means the returns on speculative decoding diminish quickly as arithmetic intensity increases.

We believe that if OpenAI uses speculative decoding, they probably only use it for sequences of about 4 tokens. Incidentally, the whole conspiracy theory about GPT-4's quality being lowered might just be because they let the oracle model accept lower-probability sequences from the speculative decoding model. Another note: some speculate that Bard uses speculative decoding because Google waits for the full sequence to be generated before sending it to the user, but we don't believe this speculation is true.

12 About Visual Multimodality

The visual multimodal capability is the least impressive part of GPT-4, at least compared to leading research. Of course, no company has commercialized the research on multimodal LLM yet.

It is a standalone visual encoder separate from the text encoder, but with cross-attention. We heard that its architecture is similar to Flamingo. It adds more parameters on top of the 1.8T parameters of GPT-4. After pretraining on text only, it is further fine-tuned on an additional 2 trillion tokens.

For the visual model, OpenAI originally intended to train from scratch, but this approach is not mature enough, so they decided to start with text first to mitigate risks.

It is said that the next model, GPT-5, will be trained from scratch on vision and will be able to generate images on its own. Additionally, it will also be capable of handling audio.

One of the main purposes of this visual capability is to enable autonomous agents to read web pages and transcribe content from images and videos. The data they train on includes joint data (rendered LaTeX with text), screenshots of web pages, and YouTube videos (sampled frames, plus Whisper transcriptions of the audio).

One interesting aspect of all this LLM-centric over-optimization is that the cost profile of the visual model differs from that of the text model. In the text model, as described in the "Amazon Cloud Crisis" article, data loading is very cheap. In the visual model, the IO for data loading is about 150 times higher: roughly 600 bytes per token instead of text's 4 bytes. There is ongoing research on image compression.
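
The ~150x figure follows directly from the quoted per-token byte counts:

```python
# The ~150x data-loading IO gap implied by the quoted byte counts.
VISION_BYTES_PER_TOKEN = 600
TEXT_BYTES_PER_TOKEN = 4
print(f"data-loading IO ratio: ~{VISION_BYTES_PER_TOKEN // TEXT_BYTES_PER_TOKEN}x")  # 150x
```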

This is important for hardware vendors who are optimizing their hardware based on the use cases and ratios of LLM in the next 2-3 years. They may find themselves in a world where every model has powerful visual and audio capabilities. They may find that their architectures are ill-suited. Overall, the architecture is sure to evolve beyond the current stage of simplified text-based dense and/or MoE models.