TSMC foundry! Intel's new AI PC chip Lunar Lake released: AI computing power 120 TOPS!

Wallstreetcn
2024.06.05 08:06

Intel has released its new AI PC chip, Lunar Lake, which improves GPU performance, AI computing power, and overall compute. Lunar Lake's compute dies are manufactured entirely by Taiwan Semiconductor Manufacturing Company (TSMC), the first time Intel has fully outsourced their production. This move may be related to TSMC's leading position in process technology and the independent spin-off of Intel's foundry business. The packaging of Lunar Lake is still completed by Intel's foundry business group.

On June 4th, Intel CEO Pat Gelsinger delivered a keynote speech at COMPUTEX 2024, officially announcing Lunar Lake, the next-generation mobile processor for AI PCs. CPU, GPU, and NPU performance all improve comprehensively, energy consumption falls significantly, and overall AI computing power rises to 120 TOPS.

1. Lunar Lake is entirely manufactured by TSMC for the first time, but the next generation, Panther Lake, will return to Intel for manufacturing

Lunar Lake reportedly consists of 7 main parts, with the full package including memory, a stiffener, and the base dies. The base dies use Intel's Foveros interconnect technology to join the compute die and the platform controller die. In terms of process nodes, the Lunar Lake compute die (containing the CPU, GPU, NPU, and so on) is built on TSMC's N3B node, while the platform controller die uses TSMC's N6 node. In other words, the chip's main logic dies are all manufactured by TSMC!

Previously, even when parts of Intel's high-end mobile platform chips were outsourced to TSMC, the CPU cores were always produced by Intel itself. This shift is partly due to TSMC's process lead and may also be related to the spin-off of Intel's foundry business as an independent unit, which lets Intel's design side choose the most competitive external suppliers more freely. Notably, the packaging of Lunar Lake is still completed by Intel's foundry business group.

In response, Gelsinger stated that Lunar Lake went to TSMC because TSMC had the better process technology at the time, a choice that still looks sound today. He thanked TSMC for providing many of the core manufacturing technologies that made Lunar Lake possible, and pointed to the cooperation between TSMC and Intel in the foundry business, including UCIe (Universal Chiplet Interconnect Express).

However, Gelsinger emphasized that the next generation, Panther Lake, will be built almost entirely on Intel's own process, using the Intel 18A node along with hybrid bonding, wafer-to-wafer stacking, advanced packaging, and backside power delivery. He hopes to showcase Intel's wafer fab capabilities by then.

2. CPU Cores: 4 P-cores + 4 E-cores, significantly improved performance and efficiency

According to the introduction, Lunar Lake's CPU retains a hybrid core architecture, pairing 4 Lion Cove performance cores (P-cores) with 4 Skymont efficiency cores (E-cores) in an 8-core design that strikes the best balance between performance and efficiency.

The Lion Cove P-cores make significant improvements to the cache hierarchy. They use multi-level data caching, with each core including a 48KB L0D cache (4-cycle load-to-use latency), a 192KB L1D cache (9-cycle latency), and an enlarged L2 cache (up to 3MB, 17-cycle latency). Overall, this puts 240KB of combined low-level cache within near-core latencies, whereas the previous Redwood Cove could reach only 48KB of cache in comparable time. The 4 P-cores also share a 12MB L3 cache, improving single-thread performance and optimizing the core's PPA (power, performance, area).
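
As a rough illustration of why the enlarged hierarchy matters, the sketch below computes the expected load latency from the cycle counts Intel disclosed. The hit-rate split and the L3 latency are invented placeholders, not Intel figures.

```python
# Expected load-to-use latency across Lion Cove's cache levels.
# Latencies (cycles) for L0D/L1D/L2 are from Intel's disclosure;
# the L3 latency and all hit rates are illustrative assumptions.

LEVELS = [
    # (name, capacity in bytes, load-to-use latency in cycles, assumed hit rate)
    ("L0D", 48 * 1024,        4,  0.70),
    ("L1D", 192 * 1024,       9,  0.20),
    ("L2",  3 * 1024 * 1024,  17, 0.08),
    ("L3",  12 * 1024 * 1024, 40, 0.02),  # latency assumed, not disclosed
]

def expected_latency(levels):
    """Average latency of a load if each level serves the given share of loads."""
    return sum(rate * lat for _, _, lat, rate in levels)

for name, cap, lat, rate in LEVELS:
    print(f"{name}: {cap // 1024} KB, {lat} cycles, serves {rate:.0%} of loads")
print(f"Expected load latency: {expected_latency(LEVELS):.2f} cycles")
```

Under these made-up hit rates, most loads complete at L0D speed, which is the point of stacking a small, fast level in front of the larger ones.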

Intel has added a third address generation unit (AGU)/store unit pair to further boost memory performance. Notably, this balances the load and store pipelines at 3 each; in most previous Intel architectures, load units outnumbered store units.

Overall, in keeping with Intel's long-standing CPU design philosophy, more silicon has gone into cache. As core complexity increases, the cache subsystem must grow with it to keep the execution units fed, and doing so is key to raising performance while preserving energy efficiency.

A deeper look at Lion Cove's compute architecture shows where Intel focused its P-core performance and efficiency work. The front end takes a new approach to instruction handling, with prediction blocks 8 times larger than before, a wider fetch range, higher decode bandwidth, and a uop cache with significantly greater capacity and read bandwidth. The uop queue has also grown, improving overall throughput. For execution, Lion Cove's out-of-order engine is split between integer (INT) and vector (VEC) domains, each with its own renaming and scheduling.

The data translation lookaside buffer (DTLB) has also been enlarged, from 96 entries to 128, to improve its hit rate. The INT/VEC partitioning enables future scalability, letting each domain grow independently, and helps reduce power on domain-specific workloads. The out-of-order engine has been widened as well: allocation/rename goes from 6 to 8, retirement from 8 to 12, instruction window depth from 512 to 576, and execution ports from 12 to 18. These changes make the pipeline more robust and flexible.

The integer execution units in Lion Cove have also been strengthened: integer ALUs increase from 5 to 6, jump units from 2 to 3, and shift units from 2 to 3. The number of 64x64-bit multiply units rises from 1 to 3, adding horsepower for the most complex operations. Another significant change is the migration of the P-core design database from a "sea of fubs" to a "sea of cells": the P-core's substructures move from tiny, latch-based partitions to broader, flip-flop-based partitions, making the development process much more predictable.

The Lion Cove architecture also delivers on performance, with IPC expected to rise by a double-digit percentage over the previous-generation Redwood Cove. The changes around hyper-threading are particularly notable: IPC improves by 30% and dynamic power efficiency by 20% relative to earlier designs, all without increasing core area, reflecting Intel's commitment to better performance within existing physical constraints.

Lion Cove's power management has also improved, including an AI-based self-tuning controller that replaces static thermal guard bands, letting the system adapt dynamically to real-time operating conditions for higher sustained performance. Clock control is finer-grained too, now stepping in 16.67MHz increments; compared with the previous 100MHz steps, this allows more precise power management and performance tuning, extracting the most from the power budget.
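
A small sketch of what finer clock granularity buys: if the controller can only run at multiples of a step size, up to one full step of the available power budget goes unused. The target frequency below is a hypothetical example, not an Intel figure.

```python
# Frequency left on the table with coarse vs. fine clock steps.
# Step sizes (100 MHz vs. 16.67 MHz) are from the article; the
# target frequency is a hypothetical example.

def highest_step(target_mhz: float, step_mhz: float) -> float:
    """Highest step-aligned frequency that does not exceed the target."""
    return int(target_mhz / step_mhz) * step_mhz

target = 3050.0  # MHz the power budget would hypothetically allow
for step in (100.0, 16.67):
    got = highest_step(target, step)
    print(f"step {step:>6.2f} MHz -> run at {got:7.2f} MHz, "
          f"{target - got:5.2f} MHz unused")
```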

At least on paper, Lion Cove looks like a significant step up from Redwood Cove: an improved memory and cache subsystem, better power management, and gains in IPC rather than a reliance on higher frequency.

Lunar Lake's Skymont E-core is designed to reach a new level of performance efficiency. The 4 E-cores share a 4MB L2 cache, delivering over 2x better power efficiency than the previous generation and a 2x improvement in vector and AI throughput.

The Skymont core features a more capable microarchitecture, starting with a 9-wide decode stage, 50% wider than the previous generation. It is backed by a larger micro-op queue that now holds 96 entries, up from 64 in the previous design. "Nanocode" allows more microcode parallelism within each decode cluster.

Skymont's out-of-order execution engine has also been significantly improved. Issue width increases to 8-wide, while the retirement stage doubles to 16-wide. This strengthens the core's ability to issue and execute many instructions at once, and latency is further reduced through dependency-breaking mechanisms.

Skymont deepens the reorder buffer from 256 entries to 416 for more queuing and buffering, and enlarges the physical register files (PRF) and reservation stations. These enhancements let the core keep more instructions in flight, improving instruction-level parallelism.

Notably, the scheduler provides 26 dispatch ports in total, of which 8 feed integer ALUs, 3 handle jump operations, and 3 handle load operations each cycle, enabling flexible and efficient resource allocation. On the vector side, Skymont supports 4x128-bit FP and SIMD pipes, doubling gigaFLOPS/TOPS throughput and reducing floating-point latency. Intel also redesigned the memory subsystem: four cores share the 4MB L2 cache, L2 bandwidth doubles to 128B per cycle, memory access latency falls, and data throughput rises.
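
To make the vector claim concrete, here is a back-of-the-envelope peak-throughput calculation from the stated 4x128-bit pipes. The clock frequency and the FMA assumption are placeholders, not Intel figures.

```python
# Peak FP32 throughput implied by 4 x 128-bit SIMD pipes (from the article).
# Clock and FMA (2 FLOPs per lane per cycle) are illustrative assumptions.

PIPES = 4
SIMD_BITS = 128
FP32_BITS = 32
FLOPS_PER_LANE = 2   # assuming fused multiply-add
CLOCK_GHZ = 3.0      # hypothetical clock

lanes_per_cycle = PIPES * SIMD_BITS // FP32_BITS   # 16 FP32 lanes
gflops = lanes_per_cycle * FLOPS_PER_LANE * CLOCK_GHZ
print(f"{lanes_per_cycle} FP32 lanes/cycle -> {gflops:.0f} GFLOPS per core at {CLOCK_GHZ} GHz")
```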

Intel's disclosed performance figures highlight the Skymont E-core's power-efficiency gains: compared with the previous-generation Meteor Lake LP E-core, single-thread performance rises 1.7x at only one-third the power.

When directly comparing the Skymont E-core cluster with the Meteor Lake LP E-core cluster, multi-thread performance has increased by 2.9 times, while power consumption has decreased overall.

This flexibility serves both mobile and desktop designs. In mobile scenarios the Skymont E-core makes full use of the low-power fabric and system cache, while in desktop compute tiles it optimizes multi-thread throughput. Compared to Raptor Cove, Skymont delivers 2% better integer and floating-point single-thread performance within roughly the same power and thermal envelope as its predecessor.

Skymont represents the next step in Intel's E-core lineage, with major advances in decode, execution, the memory subsystem, and power efficiency, meeting the demand for more energy-efficient compute and delivering IPC gains over the previous Crestmont E-core.

3. GPU performance increased by 50%, with new display, multimedia, and image engines

The GPU of Lunar Lake adopts the new generation Xe2 GPU architecture, with 8 sets of new generation Xe cores, 8 ray tracing units, XMX AI engine, and 8MB of dedicated cache. It can provide 67 GPU TOPS of computing power, real-time ray tracing, AI-based XeSS image quality enhancement, Intel Arc software stack, and other functions, bringing a 50% graphics processing performance improvement compared to the previous generation Meteor Lake.

Lunar Lake also integrates new display, multimedia, and image (IPU) engines that work alongside the GPU. The display engine offers 3 display outputs supporting eDP 1.5, DP, and HDMI 2.1; the multimedia engine supports AV1 and the latest VVC codec; and the IPU provides image enhancements such as temporal noise reduction, multi-frame processing, and dual-exposure staggered HDR.

Specifically, Intel's eDP 1.5 implementation includes panel replay, which combines adaptive sync with selective updates: only the parts of the screen that have changed are refreshed rather than the entire display, reducing power consumption. These innovations save energy and also improve the visual experience by lowering display latency and tightening sync accuracy.

The pixel processing pipeline is one of the foundations of Intel's display engine, with each pipeline supporting six planes for advanced color conversion and composition. It also integrates hardware for color enhancement, display scaling, pixel adjustment, and HDR perceptual quantization, ensuring vivid, accurate on-screen output. The design is flexible and energy-efficient and, at least on paper, supports a wide range of input and output formats; so far, however, Intel has not provided any quantified power metrics, TDP, or other power details.

For compression and encoding, the Xe2 architecture supports visually lossless display stream compression at ratios up to 3:1, including transport encoding for the HDMI and DisplayPort protocols. These features further reduce the data load and sustain high resolutions at the output without compromising visual quality.
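
A quick sketch of what a 3:1 compression ratio means in link-bandwidth terms; the resolution, refresh rate, and color depth below are generic examples, not Intel figures.

```python
# Link bandwidth for an uncompressed vs. 3:1 compressed display stream.
# 4K @ 120 Hz with 10-bit RGB is a generic example, not an Intel spec.

def stream_gbps(width: int, height: int, fps: int, bits_per_pixel: int = 30) -> float:
    """Raw pixel-data rate in Gbps (ignoring blanking overhead)."""
    return width * height * fps * bits_per_pixel / 1e9

raw = stream_gbps(3840, 2160, 120)   # ~29.9 Gbps uncompressed
dsc = raw / 3.0                      # 3:1 ratio from the article
print(f"uncompressed: {raw:.1f} Gbps, with 3:1 compression: {dsc:.1f} Gbps")
```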

Regarding the multimedia engine, Intel's adoption of the VVC codec represents a significant improvement in video compression technology. Compared to AV1, this codec can reduce file sizes by 10% and support adaptive resolution streaming and advanced content encoding for 360-degree and panoramic videos. This will ensure lower bitrates for streaming without compromising quality - a fundamental aspect of modern multimedia applications.

The Windows GPU software stack is robust from top to bottom, supporting D3D, Vulkan, the Intel VPL API, and related frameworks. Together these provide comprehensive coverage of the runtimes and drivers in the market, improving overall efficiency and compatibility across software environments.

4. NPU computing power increased to 48 TOPS

As a new-generation AI PC processor for laptops, Lunar Lake's neural processing unit (NPU) has undergone a major upgrade, integrating the all-new fourth-generation NPU core (NPU 4) with 6 neural compute engines, 12 enhanced SHAVE digital signal processors (DSPs), and 9MB of cache, providing 48 TOPS of AI computing power.

Compared to the previous-generation NPU 3, NPU 4 makes a huge leap in neural processing capability and efficiency, achieved mainly through higher clock frequency, a better power architecture, and a larger number of engines.

In NPU 4 these gains are reinforced in the vector architecture, with more compute blocks and better-optimized matrix math. Such workloads demand enormous neural-processing bandwidth; in other words, this matters most for applications that need ultra-fast data processing and real-time inference.

The architecture supports INT8 and FP16 precision, handling up to 2048 MAC (multiply-accumulate) operations per cycle at INT8 and up to 1024 per cycle at FP16, a significant gain in computational efficiency.
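
These MAC counts are consistent with the headline 48 TOPS figure. The sketch below solves for the implied clock, assuming the 2048 INT8 MACs per cycle are per engine across 6 engines (not stated explicitly) and counting a MAC as 2 operations; Intel has not published the NPU clock here, so the result is an inference.

```python
# Cross-check: clock frequency implied by 48 TOPS, given the stated
# MAC counts. Assumes 2048 INT8 MACs/cycle *per engine* and
# 2 ops per MAC (multiply + accumulate).

ENGINES = 6
INT8_MACS_PER_CYCLE = 2048
OPS_PER_MAC = 2
TARGET_TOPS = 48

ops_per_cycle = ENGINES * INT8_MACS_PER_CYCLE * OPS_PER_MAC   # 24,576
implied_ghz = TARGET_TOPS * 1e12 / ops_per_cycle / 1e9
print(f"{ops_per_cycle} INT8 ops/cycle -> {implied_ghz:.2f} GHz would yield {TARGET_TOPS} TOPS")
```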

A deeper look at the architecture reveals NPU 4's expanded hierarchy. Each neural compute engine embeds an inference pipeline built around MAC arrays plus dedicated DSPs for different kinds of computation; the pipeline is designed for massive parallelism, raising both performance and efficiency. The new SHAVE DSP is optimized to deliver four times the vector compute of the previous generation, allowing it to handle more complex neural networks.

Another major improvement in NPU 4 is the higher clock speed and a new process node, which together double performance at the same power level as NPU 3 and yield a fourfold increase in peak performance, making NPU 4 a powerful engine for demanding AI applications. The new MAC array also gains advanced data-conversion capabilities, supporting dynamic data-type conversion, fused operations, and flexible output data layout, so the data flow stays near optimal with minimal delay.

NPU 4's bandwidth improvements are crucial for larger models and datasets, especially Transformer-based language models. The architecture supports higher data flow, reducing bottlenecks and keeping execution smooth even under heavy load. NPU 4's DMA (direct memory access) engine doubles DMA bandwidth, an important complement for network performance and an effective way to feed heavy neural network models. Added features such as embedded tokenization further expand NPU 4's potential.

Matrix multiplication and convolution also benefit: the MAC array handles up to 2048 INT8 or 1024 FP16 MAC operations in a single cycle, letting the NPU run more complex neural network computations at higher speed and lower power. The difference also shows in the vector register file, which in NPU 4 is 512 bits wide, so more vector operations complete per clock cycle, improving computational efficiency.

NPU 4 also broadens activation-function support, natively handling more activation-function types for arbitrary neural networks, with selectable precision for floating-point calculations that makes computation more precise and reliable. The improved activation functions and optimized inference pipeline let it execute more complex and detailed neural network models at higher speed and accuracy.

The SHAVE DSP upgrade in NPU 4 raises its vector compute to four times that of NPU 3, lifting overall vector performance by 12 times. This is especially useful for transformers and large language models (LLMs), making them faster and more energy-efficient. More vector operations per clock cycle, supported by a larger vector register file, significantly strengthen NPU 4's computational capability.

Overall, NPU 4 performs far better than NPU 3: a 12x increase in overall vector performance, a 4x increase in TOPS, and a 2x increase in IP bandwidth. These improvements make NPU 4 a high-performance, efficient AI engine for the latest AI and machine-learning applications where performance and latency are critical, and its architectural enhancements, data transformation features, and bandwidth gains make it a top choice for highly demanding AI workloads.
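
The 12x vector figure decomposes neatly if one assumes the stated 4x per-DSP gain multiplied by a tripling of engine count (NPU 3 in Meteor Lake had 2 neural compute engines versus NPU 4's 6; the engine ratio is an inference, not an Intel statement):

```python
# Decomposing the claimed 12x overall vector gain.
per_dsp_gain = 4        # stated: SHAVE DSP vector capability vs. NPU 3
engine_ratio = 6 / 2    # assumption: 6 engines in NPU 4 vs. 2 in NPU 3
print(f"overall vector gain ~= {per_dsp_gain * engine_ratio:.0f}x")
```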

5. Better Security Technology and High-Speed Connectivity Technology

The control layer of the Lunar Lake platform also integrates security and next-generation high-speed connectivity technologies.

In terms of security, it features the Intel Partner Security Engine, the Intel Silicon Security Engine, and the Converged Security and Manageability Engine.

For connectivity, the Lunar Lake platform integrates the latest Wi-Fi 7, Bluetooth 5.4, and 1GbE MAC connection technologies.

The integrated Wi-Fi 7 solution supports Multi-Link Operation (MLO), improving reliability and throughput (up to 5.8Gbps), reducing latency, and enabling traffic separation and differentiation. Compared with the BE200 network card, silicon area shrinks by 28%, and the solution adopts the 11Gbps CNVio3 interface. It also employs radio-frequency interference mitigation, dynamically adjusting the DDR clock frequency, which can otherwise significantly degrade Wi-Fi performance.

Intel has also announced further collaboration with Meta to enhance VR experiences using this Wi-Fi 7 technology, further optimizing video latency and reducing interference to make VR applications more seamless and immersive, at least on the wireless side. Wi-Fi 7's enhanced capabilities provide high, reliable speeds and low latency to meet the most demanding VR workloads.

6. 3D Foveros Packaging and Scalable Fabric Gen 2 Interconnect

In terms of I/O, Lunar Lake provides 4 PCIe 5.0 lanes, 4 PCIe 4.0 lanes, 3 integrated Thunderbolt 4 ports (40Gbps), 2 USB 3.0 ports, and 6 USB 2.0 ports. Notably, the Thunderbolt 4 ports, accelerated by Thunderbolt Share, can raise productivity to a new level by linking multiple computers.

All of Lunar Lake's compute cores, memory-side cache, security, connectivity, and I/O modules are co-packaged on the processor substrate using Intel's 3D Foveros multi-chip packaging, with a memory-on-package design: 32GB of memory sits on the package right next to the Lunar Lake dies.

Note that Lunar Lake's 32GB, 2-rank LPDDR5X memory chips are mounted on the substrate alongside the processor. Each chip transfers at 8.5GT/s, supports 4 16-bit channels, cuts PHY power by 40%, and saves 250mm² of board area.
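
The stated per-chip figures imply the package's peak memory bandwidth, sketched below. The two-chip count is an assumption based on the two-rank, on-package description, not an explicit Intel figure.

```python
# Peak LPDDR5X bandwidth implied by the stated figures:
# 8.5 GT/s per pin, four 16-bit channels per chip, two chips assumed.

GT_PER_S = 8.5e9          # transfers per second per pin
CHANNEL_BITS = 16
CHANNELS_PER_CHIP = 4
CHIPS = 2                 # assumption, not an Intel figure

bytes_per_s = GT_PER_S * CHANNEL_BITS * CHANNELS_PER_CHIP * CHIPS / 8
print(f"peak bandwidth ~= {bytes_per_s / 1e9:.1f} GB/s")   # ~136 GB/s
```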

Lunar Lake's CPU, GPU, and NPU compute cores are interconnected via Scalable Fabric Gen 2, which in turn connects through a die-to-die (D2D) link to the platform control tile's Scalable Fabric Gen 2, seamlessly linking compute nodes and chip layers for better scalability and efficiency. In addition, a Home Agent coordinates hierarchical coherency across the memory-side cache, the coherency agent in each core cluster, and the I/O coherency logic in the platform control tile.

7. New power supply design and power management, comprehensive energy consumption can be reduced by 40%

For power delivery, Lunar Lake adopts a new 4-PMIC design, providing more supply rails, dynamic voltage ID, and richer monitoring, and optimizing the SoC's power delivery for the best performance efficiency.

For power management, the integrated Intel Thread Director focuses on efficiency, complemented by power balancers optimized for each load type, improved "sleep"-state power and latency, and ML-based workload (WL) classification and frequency control. Thread Director gauges each workload's demands and uses its per-core energy and performance scoring to help the operating system schedule threads onto the core with the best performance and efficiency.
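
As a purely illustrative toy (this is not Intel's algorithm, and every number is invented), a Thread Director-style decision can be pictured as scoring each core type per workload class and picking by raw performance or by performance per watt:

```python
# Toy sketch of score-based thread placement. Not Intel's implementation;
# all scores and power numbers are invented for illustration.
from dataclasses import dataclass

@dataclass
class Core:
    name: str
    perf: dict      # workload class -> relative performance score
    power_w: float  # rough active power

CORES = [
    Core("P-core", {"vector": 100, "background": 100}, 6.0),
    Core("E-core", {"vector": 55,  "background": 70},  1.5),
]

def pick_core(wl_class: str, prefer_efficiency: bool) -> Core:
    """Pick by perf-per-watt when efficiency matters, raw perf otherwise."""
    if prefer_efficiency:
        return max(CORES, key=lambda c: c.perf[wl_class] / c.power_w)
    return max(CORES, key=lambda c: c.perf[wl_class])

print(pick_core("vector", prefer_efficiency=False).name)      # P-core
print(pick_core("background", prefer_efficiency=True).name)   # E-core
```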

Additionally, Lunar Lake integrates a shared 8MB memory-side cache serving the many cores and blocks on the compute die, which reduces the number of DRAM transfers and saves power. The cache also trims latency between the cores and DRAM while raising effective bandwidth.

According to data released by Intel, thanks to advanced process nodes, new E-core design, Memory Side cache, power supply design, power management, and thread scheduler technology, Lunar Lake's energy consumption can be reduced by 40% compared to the previous generation Meteor Lake.

8. Lunar Lake to ship in the third quarter, Arrow Lake to be launched in the fourth quarter

Lunar Lake has reportedly secured more than 80 designs from 20 OEMs and is expected to start shipping in the third quarter.

Intel also revealed its future roadmap for AI PC processors. Arrow Lake for desktops will launch in the fourth quarter of this year, followed by Panther Lake on Intel 18A next year, with new products continuing to arrive beyond 2026.

Summary: Comprehensive AI computing power up to 120 TOPS

Looking inside Lunar Lake, it undoubtedly brings significant upgrades over the previous-generation Meteor Lake. The CPU combines Lion Cove P-cores and Skymont E-cores, joined by the latest Xe2-LPG GPU architecture and the new-generation NPU 4 core, providing leading AI performance.

Combining the AI computing power provided by CPU, GPU, and NPU, the overall AI computing power of the Lunar Lake platform reaches 120 TOPS, highlighting Intel's investment in AI. The CPU can provide 5 TOPS of computing power through VNNI and AVX instructions to drive light AI work; the GPU provides 67 TOPS of computing power through XMX and DP4a to meet the AI performance needs of gaming and creation; the NPU provides 48 TOPS of computing power to handle dense vector and matrix operations, providing AI assistance and creation functions.
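
The headline figure is simply the sum of the three engines' quoted numbers:

```python
# The 120 TOPS platform figure is the sum of Intel's per-engine numbers.
tops = {"CPU (VNNI/AVX)": 5, "GPU (XMX/DP4a)": 67, "NPU 4": 48}
print(", ".join(f"{k}: {v} TOPS" for k, v in tops.items()))
print(f"platform total: {sum(tops.values())} TOPS")   # 120
```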

By comparison, Qualcomm's Snapdragon X Elite offers 45 TOPS of NPU compute and Apple's M4 NPU only 38 TOPS, while AMD's latest AI PC chip, the Ryzen AI 300 series, integrates AMD's third-generation NPU core at 50 TOPS, slightly ahead of Lunar Lake's 48 TOPS NPU. Even so, Lunar Lake comfortably exceeds the 40 TOPS minimum NPU requirement for Microsoft's Copilot+ PCs. Intel's emphasis is on total platform AI compute: by combining the AI engines of the NPU, CPU, and GPU, overall AI computing power reaches 120 TOPS, nearly three times that of the previous-generation Meteor Lake, a substantial improvement.

It is also worth noting Lunar Lake's major improvements in power delivery and power management. Combined with more advanced process nodes, new CPU compute cores, and broad efficiency gains, they give Lunar Lake significantly lower energy consumption than the previous-generation Meteor Lake, making it better suited to mobile devices.

According to data disclosed by Intel, Lunar Lake delivers a 50% increase in GPU performance, a fourfold increase in NPU AI computing power, a 3.5-fold increase in GPU AI computing power, a 40% reduction in SoC power consumption, and overall SoC AI computing power exceeding 120 TOPS.

In summary, Lunar Lake has brought significant performance improvements compared to the previous generation Meteor Lake, especially in terms of AI capabilities, while also bringing higher energy efficiency and lower power consumption. Compared to other AI PC chip competitors, it still has a significant advantage.

Intel CEO Gelsinger also said in his speech that he is very optimistic about the development of AI PCs. More than 8 million AI PCs equipped with Intel Core Ultra processors have already shipped, signaling that the era of the AI PC has arrived.

Gelsinger also predicts that shipments of AI PCs based on Intel chips will reach 45 million this year, and that by 2028, PCs with AI capabilities will account for 80% of all PCs. Intel has more than 300 AI-accelerated features and over 500 AI models; as AI PCs enter the market, Intel already has a complete AI PC ecosystem in place.

Obviously, with the launch of Lunar Lake, it will help further enhance Intel's competitiveness in the AI PC chip market. However, the specific market performance of Lunar Lake remains to be seen.

On whether Windows on Arm will affect market share, Gelsinger noted that this is not the first launch of a Windows on Arm product, and x86 still holds the leading market share. There is currently no clear incentive for consumers to switch from the x86 platform to Arm, and no comparable product has yet given consumers a reason to abandon the existing x86 architecture. Coupled with the newly launched Lunar Lake and its class-leading graphics, Intel is not worried about losing share.

Asked whether he sees Qualcomm as a competitor, Gelsinger jokingly welcomed Qualcomm to bring its products to market, since that helps build the whole market faster. He remains very confident in Intel's own products, with shipments already at 1 million units, which by that measure outperforms the Snapdragon X Elite Qualcomm presented the day before. Moreover, from Lunar Lake to the next-generation Panther Lake, Intel is building its own ecosystem, opening a chapter in AI that will be hard to displace.

Gelsinger added that customers buying Lunar Lake PCs in the second half of the year will be quite impressed, and he expects more benchmark comparisons against Qualcomm's products in the future.

Intel is also actively expanding its manufacturing footprint, with multiple semiconductor fabs under construction in the United States. Gelsinger believes the U.S. build-outs by Intel, Samsung, and TSMC point to major growth in the American chip industry, and research firms expect the U.S. share of semiconductor manufacturing to rise from 10% to 20% by 2030, showing great potential. Throughout the speech Intel repeatedly praised TSMC's collaboration on Lunar Lake and its work with UMC, underscoring the importance it places on the Taiwanese ecosystem, while also calling for a more balanced global supply chain, which is now taking shape.

Could U.S. export restrictions accelerate chip development in China? Gelsinger candidly admitted that the chip ban is a fine line to walk: excessive restrictions do push China to develop home-grown chips, which in turn hurts the export market, so a careful balance is needed, and Intel must meet the expectations of its global ecosystem partners. Meanwhile, Intel will continue to export products to China; with Chinese technology under restriction while Intel's processes push below 2 nanometers, Intel's offerings remain attractive in the Chinese market.

Another reporter asked why Gelsinger did not visit South Korea. He responded that this trip simply did not include South Korea, but he will visit again in the future, as the country is important to Intel for both its local technology plants and its customers.

Source: Chip Intelligence, Original Title: "Taiwan Semiconductor Manufacturing! Intel's New AI PC Chip Lunar Lake Released: AI Computing Power 120TOPS!"