Intel representative teases the new Ponte Vecchio compute GPU for AI & HPC applications of the future

Intel Details Ponte Vecchio GPU & Sapphire Rapids HBM Performance, Up To 2.5x Faster Than NVIDIA A100

During Hot Chips 34, Intel once again detailed its Ponte Vecchio GPUs running on a Sapphire Rapids HBM server platform.

Intel Showcases Ponte Vecchio 2-Stack GPU and Sapphire Rapids HBM CPU Performance vs. NVIDIA A100

In Intel Fellow & Chief GPU Compute Architect Hong Jiang’s presentation, we get more details on the Blue Team’s upcoming server powerhouses. The Ponte Vecchio GPU comes in three configurations, starting with a single OAM and going all the way up to an x4 subsystem with Xe Links, running either alone or with a dual-socket Sapphire Rapids platform.

The OAM supports all-to-all topologies for 4-GPU and 8-GPU platforms. Complementing the entire platform is Intel’s oneAPI software stack, which includes the Level Zero API, a low-level hardware interface that supports cross-architecture programming (a minimal usage sketch follows the list below). Some of its main features include:

  • Interface for oneAPI and other tools to access accelerator devices
  • Fine-grain control and low-latency access to device capabilities
  • Multi-threaded design
  • For GPUs, shipped as part of the driver
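
To put the cross-architecture claim in concrete terms, here is a minimal SYCL sketch of the kind of code oneAPI’s DPC++ compiler can target at CPUs or GPUs such as Ponte Vecchio without source changes; the vector-add kernel, buffer sizes, and device selection are illustrative choices of ours, not taken from Intel’s presentation.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t N = 1 << 20;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

    // Let the runtime pick a device; on a Ponte Vecchio system this would resolve to the GPU.
    sycl::queue q{sycl::default_selector_v};
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    {
        // Buffers hand the host data to the SYCL runtime for the duration of this scope.
        sycl::buffer<float, 1> buf_a(a.data(), sycl::range<1>(N));
        sycl::buffer<float, 1> buf_b(b.data(), sycl::range<1>(N));
        sycl::buffer<float, 1> buf_c(c.data(), sycl::range<1>(N));

        q.submit([&](sycl::handler& h) {
            sycl::accessor A(buf_a, h, sycl::read_only);
            sycl::accessor B(buf_b, h, sycl::read_only);
            sycl::accessor C(buf_c, h, sycl::write_only, sycl::no_init);
            // The same kernel source can be compiled for CPU, GPU, or FPGA back-ends.
            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }  // Buffer destructors copy the results back to the host vectors.

    std::cout << "c[0] = " << c[0] << " (expected 3)\n";
    return 0;
}
```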

In terms of performance metrics, a 2-stack Ponte Vecchio GPU configuration, such as the one featured in a single OAM, is capable of delivering up to 52 TFLOPs of FP64/FP32 compute, 419 TFLOPs of TF32 (XMX Float32), 839 TFLOPs of BF16/FP16, and 1,678 TOPs of INT8 horsepower.
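
Those figures scale as near-clean multiples of one another, which is the expected signature of the XMX engines roughly doubling throughput with each halving of precision; a quick check of the arithmetic, using only the numbers quoted above:

```cpp
#include <cstdio>

int main() {
    // Peak throughput figures quoted for a 2-stack Ponte Vecchio OAM (TFLOPs / TOPs).
    const double fp64_fp32 = 52.0, tf32 = 419.0, bf16_fp16 = 839.0, int8 = 1678.0;

    // TF32 lands at roughly 8x the classic FP64/FP32 vector rate, and each further
    // halving of precision roughly doubles throughput again.
    std::printf("TF32 / FP64-FP32 : %.1fx\n", tf32 / fp64_fp32);   // ~8.1x
    std::printf("BF16 / TF32      : %.1fx\n", bf16_fp16 / tf32);   // ~2.0x
    std::printf("INT8 / BF16      : %.1fx\n", int8 / bf16_fp16);   // ~2.0x
    return 0;
}
```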

Intel also details its maximum cache sizes and the maximum bandwidth each level offers. The register file on the Ponte Vecchio GPU is 64 MB in size and offers 419 TB/s of bandwidth, the L1 cache is also 64 MB and offers 105 TB/s (4:1), the L2 cache comes in at 408 MB with a bandwidth of 13 TB/s (8:1), while the HBM memory pools up to 128 GB and offers a bandwidth of 4.2 TB/s (4:1). There are a variety of compute efficiency techniques within Ponte Vecchio, such as:

Register file:

  • Register caching
  • Accumulators

L1/L2 cache:

  • Write-through
  • Write-back
  • Write-streaming
  • Uncached

Prefetch:

  • Software prefetch (instruction) to L1 and/or L2 (see the sketch after this list)
  • Command Streamer prefetching into L2 for instructions and data
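
Intel’s presentation does not spell out the GPU-side prefetch instructions, so as a stand-in illustration of the software-prefetch idea, here is a CPU-side C++ sketch using the GCC/Clang `__builtin_prefetch` builtin; the prefetch distance and the summing workload are arbitrary choices for illustration.

```cpp
#include <cstddef>
#include <vector>

// Sums an array while hinting the hardware to pull upcoming data into cache
// before the loop actually touches it, hiding memory latency behind compute.
double prefetched_sum(const std::vector<double>& data) {
    constexpr std::size_t kPrefetchDistance = 64;  // elements ahead; illustrative tuning knob
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i + kPrefetchDistance < data.size()) {
            // Read prefetch (rw = 0) with moderate temporal locality (locality = 1).
            __builtin_prefetch(&data[i + kPrefetchDistance], 0, 1);
        }
        sum += data[i];
    }
    return sum;
}
```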

Intel explains that the larger L2 cache can offer big gains in workloads such as 2D FFTs and DNNs. Some performance comparisons were shown between a full Ponte Vecchio GPU and configurations with the L2 capped at 80 MB and 32 MB.
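
A rough way to see why the 2D FFT case is so cache-sensitive: an N x N single-precision complex FFT keeps a working set of about N x N x 8 bytes, so whether that fits in 408 MB, 80 MB, or 32 MB of L2 determines how often the GPU has to spill out to HBM. A back-of-the-envelope sketch (the 8-byte complex-float size and the square-grid assumption are ours, not Intel’s):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double bytes_per_point = 8.0;  // complex float: 2 x 4 bytes (our assumption)
    const double l2_sizes_mb[] = {408.0, 80.0, 32.0};  // full L2 vs. the capped configs above

    for (double mb : l2_sizes_mb) {
        // Largest square 2D grid whose complex-float working set stays resident in L2.
        double points = mb * 1024.0 * 1024.0 / bytes_per_point;
        double side = std::floor(std::sqrt(points));
        std::printf("%3.0f MB L2 -> roughly a %4.0f x %4.0f grid fits on-die\n", mb, side, side);
    }
    return 0;
}
```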

But that’s not all: Intel also showed performance comparisons between NVIDIA’s Ampere A100 running CUDA and SYCL and its own Ponte Vecchio GPUs running SYCL. In miniBUDE, a computational workload that predicts the binding energy of a ligand to its target, the Ponte Vecchio GPU completes the simulation 2 times faster than the Ampere A100. Another performance metric comes from ExaSMR (small modular reactors based on large nuclear reactor designs); here, the Intel GPU is shown to offer a 1.5x performance advantage over the NVIDIA GPU.

It’s a little interesting that Intel is still comparing its Ponte Vecchio GPUs to the Ampere A100, considering the green team has already announced its next-gen Hopper H100 and is shipping it to customers. But if Chipzilla is that confident in its 2-2.5x performance figures, it shouldn’t have much trouble competing well against Hopper either.

Here’s everything we know about Ponte Vecchio GPUs powered by Intel 7 technology

Moving on to the Ponte Vecchio specs, Intel outlined some key features of its flagship data center GPU, such as 128 Xe cores, 128 RT units, HBM2e memory, and a total of 8 Xe-HPC GPUs that will be interconnected. The chip will feature up to 408 MB of L2 cache in two separate stacks connected via the EMIB interconnect. The chip will feature multiple dies based on Intel’s own ‘Intel 7’ process and TSMC’s N7/N5 process nodes.

Intel also previously detailed the package and die sizes of its flagship Ponte Vecchio GPU based on the Xe-HPC architecture. The chip consists of 2 stacks with 16 active dies per stack. The maximum active size of the top die, the compute tile, is 41mm2, while the base die comes in at 650mm2. All of the chiplets and process nodes that will be used by the Ponte Vecchio GPUs are listed below:

  • Intel 7nm
  • TSMC 7nm
  • Foveros 3D Packaging
  • EMIB
  • 10nm Enhanced SuperFin
  • Rambo Cache
  • HBM2

Here’s how Intel gets to 47 tiles on the Ponte Vecchio chip:

  • 16 Xe HPC (internal/external)
  • 8 Rambo (internal)
  • 2 Xe Base (internal)
  • 11 EMIB (internal)
  • 2 Xe links (external)
  • 8 HBM (external)
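
As a quick sanity check on that arithmetic, the six tile groups listed above do indeed sum to 47; a trivial sketch using only the counts from the list:

```cpp
#include <cstdio>

int main() {
    // Tile counts per group for one Ponte Vecchio package, as listed above.
    const int xe_hpc = 16, rambo = 8, xe_base = 2, emib = 11, xe_link = 2, hbm = 8;
    std::printf("Total tiles: %d\n", xe_hpc + rambo + xe_base + emib + xe_link + hbm);  // prints 47
    return 0;
}
```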

The Ponte Vecchio GPU uses 8 HBM 8-Hi stacks and contains a total of 11 EMIB interconnects. The entire Ponte Vecchio package would measure 4843.75 mm2. It is also mentioned that the bump pitch for Meteor Lake CPUs using high-density Foveros 3D packaging will be 36 microns.

The Ponte Vecchio GPU is not one chip but a combination of several chips. It’s a chiplet powerhouse, containing the most chiplets of any GPU/CPU to date, 47 to be precise. And these are not based on a single process node but on several process nodes, as we detailed a few days ago.

Although the Aurora supercomputer, in which the Ponte Vecchio GPUs and Sapphire Rapids CPUs are to be used, has suffered repeated delays on the Blue Team’s side, it is still good to see the company offer more details. Since then, Intel has teased its next-gen Rialto Bridge GPU as the successor to Ponte Vecchio, which is said to begin testing in 2023. You can read more details about it here.

GPU accelerators for next generation data centers

| GPU Name | AMD Instinct MI250X | NVIDIA Hopper GH100 | Intel Ponte Vecchio | Intel Rialto Bridge |
|---|---|---|---|---|
| Packaging Design | MCM (Infinity Fabric) | Monolithic | MCM (EMIB + Foveros) | MCM (EMIB + Foveros) |
| GPU Architecture | Aldebaran (CDNA 2) | Hopper GH100 | Xe-HPC | Xe-HPC |
| GPU Process Node | 6nm | 4N | 7nm (Intel 4) | 5nm (Intel 3)? |
| GPU Cores | 14,080 | 16,896 | 16,384 ALUs (128 Xe Cores) | 20,480 ALUs (160 Xe Cores) |
| GPU Clock Speed | 1700 MHz | ~1780 MHz | TBC | TBC |
| L2/L3 Cache | 2 x 8 MB | 50 MB | 2 x 204 MB | TBC |
| FP16 Compute | 383 TOPs | 2000 TFLOPs | TBC | TBC |
| FP32 Compute | 95.7 TFLOPs | 1000 TFLOPs | ~45 TFLOPs (A0 silicon) | TBC |
| FP64 Compute | 47.9 TFLOPs | 60 TFLOPs | TBC | TBC |
| Memory Capacity | 128 GB HBM2e | 80 GB HBM3 | 128 GB HBM2e | 128 GB HBM3? |
| Memory Clock | 3.2 Gbps | 3.2 Gbps | TBC | TBC |
| Memory Bus | 8192-bit | 5120-bit | 8192-bit | 8192-bit |
| Memory Bandwidth | 3.2 TB/s | 3.0 TB/s | ~3 TB/s | ~3 TB/s |
| Form Factor | OAM | OAM | OAM | OAM v2 |
| Cooling | Passive / Liquid Cooling | Passive / Liquid Cooling | Passive / Liquid Cooling | Passive / Liquid Cooling |
| TDP | 560W | 700W | 600W | 800W |
| Launch | Q4 2021 | 2H 2022 | 2022? | 2024? |
