NVIDIA NVLink and NVSwitch: The Building Blocks of Advanced Multi-GPU Communication

Tesla V100 with NVLink GPU-to-GPU and GPU-to-CPU Connections

Tesla V100 with NVLink GPU-to-GPU Connections

NVIDIA NVLink Performance Since 2014

All-to-all communication between 16 GPUs

NVIDIA NVLink technology addresses interconnect issues by providing higher bandwidth, more links, and improved scalability for multi-GPU system configurations. A single NVIDIA Tesla V100 GPU supports up to six NVLink connections for a total bandwidth of 300 gigabytes per second (GB/sec)—10X the bandwidth of PCIe Gen 3. Servers like the NVIDIA DGX-1 and DGX-2 take advantage of this technology to give you greater scalability for ultrafast deep learning training. NVLink also supports up to 8 GPUs in a single virtual machine with NVIDIA Quadro Virtual Data Center Workstation (Quadro vDWS) or NVIDIA Virtual Compute Server (vComputeServer) software.
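
Applications reach these GPU-to-GPU links through CUDA peer-to-peer (P2P) access. The sketch below is a minimal illustration of that mechanism, not NVIDIA sample code: it assumes a system with at least two GPUs (device IDs 0 and 1), checks whether one can address the other's memory directly, and copies a buffer between them with cudaMemcpyPeer. Error checking is omitted for brevity; compile with nvcc.

    // p2p_copy.cu -- minimal sketch of CUDA peer-to-peer (P2P) access.
    // Assumes at least two GPUs (devices 0 and 1); error checking omitted.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const int src = 0, dst = 1;            // assumed device IDs
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, dst, src);
        if (!canAccess) {
            printf("Device %d cannot directly access device %d (no P2P path).\n", dst, src);
            return 0;
        }

        const size_t bytes = 256u << 20;       // 256 MiB test buffer
        void *bufSrc = nullptr, *bufDst = nullptr;
        cudaSetDevice(src);
        cudaMalloc(&bufSrc, bytes);
        cudaSetDevice(dst);
        cudaMalloc(&bufDst, bytes);
        cudaDeviceEnablePeerAccess(src, 0);    // let device 1 address device 0's memory

        // GPU-to-GPU copy; with P2P enabled it travels over NVLink when available.
        cudaMemcpyPeer(bufDst, dst, bufSrc, src, bytes);
        cudaDeviceSynchronize();
        printf("Copied %zu bytes from device %d to device %d.\n", bytes, src, dst);

        cudaFree(bufDst);
        cudaSetDevice(src);
        cudaFree(bufSrc);
        return 0;
    }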

First introduced with the NVIDIA Pascal architecture, NVLink on Tesla V100 has increased the signaling rate from 20 to 25 GB/s in each direction. This direct communication link between two GPUs improves the accuracy and convergence of high-performance computing (HPC) and AI workloads and achieves speeds over an order of magnitude faster than PCIe.
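
As a rough consistency check (assuming the six-link Tesla V100 configuration above and an approximately 16 GB/s per-direction PCIe Gen 3 x16 slot), the per-link rate rolls up to the totals quoted earlier:

    \begin{aligned}
    \text{NVLink, V100 total} &= 6~\text{links} \times 25~\text{GB/s per direction} \times 2~\text{directions} = 300~\text{GB/s} \\
    \text{PCIe Gen 3 x16}     &\approx 16~\text{GB/s per direction} \times 2~\text{directions} \approx 32~\text{GB/s} \\
    \text{Ratio}              &\approx 300 / 32 \approx 10\times
    \end{aligned}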

NVLink Connecting Eight Tesla V100 Accelerators in a Hybrid Cube Mesh Topology as Used in the DGX-1V Server

NVLink can bring up to 70 percent more performance to an otherwise identically configured server. Its dramatically higher bandwidth and reduced latency enable even larger deep learning workloads to scale in performance as they grow.
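
A crude way to see the bandwidth difference on a given machine is to time repeated peer copies. The sketch below is such a micro-benchmark under the same assumptions as the earlier example (devices 0 and 1, error checking omitted); it is not the configuration behind the chart that follows.

    // p2p_bandwidth.cu -- rough GPU-to-GPU copy bandwidth measurement.
    // Same assumptions as the earlier sketch: devices 0 and 1, no error checks.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const int src = 0, dst = 1;
        const size_t bytes = 1u << 30;        // 1 GiB per transfer
        const int reps = 20;

        void *a = nullptr, *b = nullptr;
        cudaSetDevice(src);
        cudaMalloc(&a, bytes);
        cudaSetDevice(dst);
        cudaMalloc(&b, bytes);
        cudaDeviceEnablePeerAccess(src, 0);   // direct access from device 1 to device 0

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpyPeer(b, dst, a, src, bytes);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbPerSec = (double)bytes * reps / (ms / 1e3) / 1e9;
        printf("Average GPU%d -> GPU%d bandwidth: %.1f GB/s\n", src, dst, gbPerSec);
        return 0;
    }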

NVLink Delivers Up To 70% Speedup vs PCIe

NVLink benchmark configuration: GPU servers with dual Xeon Gold 6140 @ 2.30 GHz (or E5-2698 v4 @ 3.6 GHz for PyTorch), 8x V100 PCIe vs. 8x V100 NVLink. SW benchmarks: MILC (APEX medium), HOOMD-Blue (microsphere), LAMMPS (LJ 2.5).

The rapid adoption of deep learning has driven the need for a faster, more scalable interconnect, as PCIe bandwidth often creates a bottleneck at the multi-GPU system level.

NVIDIA NVSwitch builds on the advanced communication capability of NVLink to solve this problem. It takes deep learning performance to the next level with a GPU fabric that enables more GPUs in a single server and full-bandwidth connectivity between them.
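
In practice, deep learning frameworks exercise this fabric through a collective communication library such as NCCL rather than raw peer copies. The sketch below is a minimal single-process example (the buffer size, float data type, and one-communicator-per-local-GPU setup are illustrative choices): it all-reduces a gradient-sized buffer across every visible GPU, and NCCL routes the traffic over NVLink or NVSwitch when the hardware provides them.

    // allreduce.cu -- minimal single-process NCCL all-reduce across all visible GPUs.
    // Buffer size and data type are illustrative; link with -lnccl. NCCL selects
    // NVLink/NVSwitch transports automatically when they are available.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main() {
        int nDev = 0;
        cudaGetDeviceCount(&nDev);

        const size_t count = 1 << 24;              // 16M floats per GPU
        std::vector<ncclComm_t> comms(nDev);
        std::vector<cudaStream_t> streams(nDev);
        std::vector<float*> buf(nDev);

        // One communicator per local GPU (single-process mode).
        ncclCommInitAll(comms.data(), nDev, nullptr);

        for (int d = 0; d < nDev; ++d) {
            cudaSetDevice(d);
            cudaStreamCreate(&streams[d]);
            cudaMalloc((void**)&buf[d], count * sizeof(float));
            cudaMemset(buf[d], 0, count * sizeof(float));
        }

        // Sum every GPU's buffer in place -- the core collective behind
        // data-parallel gradient averaging.
        ncclGroupStart();
        for (int d = 0; d < nDev; ++d)
            ncclAllReduce(buf[d], buf[d], count, ncclFloat, ncclSum, comms[d], streams[d]);
        ncclGroupEnd();

        for (int d = 0; d < nDev; ++d) {
            cudaSetDevice(d);
            cudaStreamSynchronize(streams[d]);
            cudaFree(buf[d]);
            ncclCommDestroy(comms[d]);
        }
        printf("All-reduce complete across %d GPUs.\n", nDev);
        return 0;
    }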

NVSwitch is the first on-node switch architecture to support 16 fully connected GPUs in a single server node and drive simultaneous communication between all eight GPU pairs at an incredible 300 GB/s each. These 16 GPUs can be used as a single large-scale accelerator with 0.5 terabytes of unified memory space and 2 petaFLOPS of deep learning compute power. A single HGX-2 or DGX-2 system with NVSwitch delivers up to 2.7X more application performance than 2 HGX-1 or DGX-1 systems connected with InfiniBand.
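
Those headline numbers follow directly from the per-GPU specifications (assuming the 32 GB Tesla V100 used in DGX-2 and its roughly 125 teraFLOPS of Tensor Core deep learning throughput):

    \begin{aligned}
    16 \times 32~\text{GB} &= 512~\text{GB} \approx 0.5~\text{TB of unified memory} \\
    16 \times 125~\text{teraFLOPS} &= 2{,}000~\text{teraFLOPS} = 2~\text{petaFLOPS}
    \end{aligned}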

NVSwitch Delivers a >2X Speedup for Deep Learning and HPC

2 HGX-1V servers have dual-socket Xeon E5-2698 v4 processors and 8x V100 GPUs, connected via 4x 100 Gb IB ports (run on DGX-1). The HGX-2 server has dual-socket Xeon Platinum 8168 processors, 16x V100 GPUs, and NVSwitch (run on DGX-2).

   ECMWF’s IFS: The Integrated Forecasting System (IFS) is a global numerical weather prediction model developed by the European Centre for Medium-Range Weather Forecasts (ECMWF), based in Reading, United Kingdom. ECMWF is an independent intergovernmental organization supported by most of the nations of Europe, and it operates one of the largest supercomputer centers in Europe for frequent updates of global weather forecasts. The IFS mini-app benchmark focuses its work on a spherical harmonics transformation that represents a significant computational load of the full model. The benchmark speedups shown here are better than those for the full IFS model, since the benchmark amplifies the transform stages of the algorithm (by design). However, this benchmark demonstrates that ECMWF’s extremely effective and proven methods for providing world-leading predictions remain valid on NVSwitch-equipped servers such as NVIDIA’s DGX-2, since they are such a good match to the problem.

   Mixture of Experts (MoE): Based on a network published by Google in the Tensor2Tensor GitHub repository, using the Transformer model with MoE layers. The MoE layers each consist of 128 experts, each of which is a smaller feed-forward deep neural network (DNN). Each expert specializes in a different domain of knowledge, and the experts are distributed to different GPUs, creating significant all-to-all traffic due to communications between the Transformer network layers and the MoE layers (a pattern sketched in the example below). The training dataset used is the “1 billion word benchmark for language modeling” from Google. Training uses Volta Tensor Cores and runs for 45,000 steps to reach a perplexity of 34. This workload uses a batch size of 8,192 per GPU.
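
The all-to-all expert dispatch described above maps naturally onto NCCL's grouped point-to-point calls. The sketch below is a schematic of that traffic pattern only, under the same single-process assumptions as the earlier all-reduce example (per-GPU communicators, an arbitrary chunk size, float payloads); it is not the Tensor2Tensor implementation.

    // alltoall.cu -- schematic of MoE-style all-to-all traffic using NCCL grouped
    // send/recv (requires NCCL 2.7 or newer). Chunk size and the single-process
    // communicator setup are illustrative only; link with -lnccl.
    #include <vector>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main() {
        int nDev = 0;
        cudaGetDeviceCount(&nDev);

        const size_t chunk = 1 << 20;              // elements routed per peer (illustrative)
        std::vector<ncclComm_t> comms(nDev);
        std::vector<cudaStream_t> streams(nDev);
        std::vector<float*> sendbuf(nDev), recvbuf(nDev);

        ncclCommInitAll(comms.data(), nDev, nullptr);
        for (int d = 0; d < nDev; ++d) {
            cudaSetDevice(d);
            cudaStreamCreate(&streams[d]);
            cudaMalloc((void**)&sendbuf[d], nDev * chunk * sizeof(float));
            cudaMalloc((void**)&recvbuf[d], nDev * chunk * sizeof(float));
            cudaMemset(sendbuf[d], 0, nDev * chunk * sizeof(float));
        }

        // Every GPU sends one chunk to, and receives one chunk from, every GPU
        // (itself included) -- the pattern a full-bandwidth fabric serves best.
        ncclGroupStart();
        for (int d = 0; d < nDev; ++d) {
            for (int peer = 0; peer < nDev; ++peer) {
                ncclSend(sendbuf[d] + peer * chunk, chunk, ncclFloat, peer, comms[d], streams[d]);
                ncclRecv(recvbuf[d] + peer * chunk, chunk, ncclFloat, peer, comms[d], streams[d]);
            }
        }
        ncclGroupEnd();

        for (int d = 0; d < nDev; ++d) {
            cudaSetDevice(d);
            cudaStreamSynchronize(streams[d]);
            cudaFree(sendbuf[d]);
            cudaFree(recvbuf[d]);
            ncclCommDestroy(comms[d]);
        }
        return 0;
    }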

Explore the world’s most powerful accelerated server platform for deep learning, machine learning, and HPC.
