GPU Disaggregation: A New Paradigm for High-Performance Computing Systems


Is there a way to decouple GPUs from compute nodes for better fault tolerance?

I am a system administrator working with high-performance computing systems that use multiple GPUs per node. Is there a solution that treats GPUs as resources separate from the nodes, so that if one GPU fails, the node can keep running and the GPU can be replaced without taking down the whole system? This would be similar to how storage systems handle disk failures. Does anyone know of a company or project developing such a technology?


GPU-accelerated computing has become increasingly popular in domains such as artificial intelligence, scientific simulation, and gaming. However, GPUs are typically attached to compute nodes as PCIe devices, which limits their flexibility and scalability. If a GPU fails, the whole node can become unavailable, and replacing the GPU requires shutting down the node and disrupting running applications. Moreover, the fixed GPU-to-CPU ratio in a node may not match the demands of different workloads, leading to resource underutilization and inefficiency.
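The stranded-capacity problem is easy to quantify with a back-of-the-envelope calculation. The numbers below are purely illustrative (not taken from any of the systems discussed here): assume nodes ship with 8 GPUs each, but a CPU-heavy workload saturates every node's cores while using only 2 of its GPUs.

```python
# Illustrative numbers only: a cluster of 10 nodes, 8 GPUs per node.
# A CPU-bound job occupies a whole node but actively uses just 2 GPUs,
# so the remaining GPUs on that node are stranded (idle but unallocatable).
nodes = 10
gpus_per_node = 8
gpus_used_per_node = 2

total_gpus = nodes * gpus_per_node
stranded = (gpus_per_node - gpus_used_per_node) * nodes

print(f"{stranded} of {total_gpus} GPUs sit idle ({stranded / total_gpus:.0%})")
# 60 of 80 GPUs sit idle (75%)
```

With a disaggregated pool, those 60 idle GPUs could be allocated to other workloads instead of being locked inside busy nodes; the fixed in-node ratio is exactly what disaggregation removes.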

To address these challenges, some researchers and companies have proposed the idea of GPU disaggregation, which decouples GPUs from compute nodes and allows them to be accessed over the network. GPU disaggregation enables a more cost-effective and adaptive deployment of GPUs, as well as better fault tolerance and load balancing. In this article, we will introduce some of the existing solutions and projects that aim to achieve GPU disaggregation in the cloud.

One of the early attempts to disaggregate GPUs is the DxPU project, which was developed by Alibaba Group and Zhejiang University. DxPU is a large-scale GPU pool system that can flexibly allocate GPU nodes to users according to their demands. DxPU uses a custom hardware design that connects GPUs to a high-speed network switch using PCIe switches and cables. DxPU also employs a software stack that supports API transparency, latency hiding, and load balancing for remote GPU access. DxPU claims to achieve less than 10% performance overhead compared to native GPU servers in most user scenarios.
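"API transparency" in systems like DxPU generally means the client intercepts GPU API calls, marshals them over the fabric, and executes them on a remote GPU server, so applications run unmodified. DxPU's actual stack is not reproduced here; the following is a minimal in-process sketch of the marshal/dispatch/unmarshal round trip, with a made-up handler table standing in for real driver calls.

```python
import json

# Hypothetical server-side handler table; in a real system these entries
# would invoke actual GPU driver/runtime calls. "vector_add" is illustrative
# and not part of any real remoting API.
HANDLERS = {
    "vector_add": lambda a, b: [x + y for x, y in zip(a, b)],
}

def remote_call(api, *args):
    """Client stub: marshal the API call (JSON here, a binary wire format in
    practice), hand it to the 'GPU server', and unmarshal the reply."""
    request = json.dumps({"api": api, "args": args})      # client: marshal
    payload = json.loads(request)                         # server: unmarshal
    result = HANDLERS[payload["api"]](*payload["args"])   # server: execute
    return json.loads(json.dumps(result))                 # reply: round-trip

print(remote_call("vector_add", [1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

The latency-hiding techniques the paragraph mentions typically sit on top of such a stub, for example by batching asynchronous calls so that one network round trip covers many API invocations.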

Another approach to GPU disaggregation is the DGSF project (Disaggregated GPUs for Serverless Functions), developed by researchers at the University of Texas at Austin. DGSF enables GPU acceleration for serverless functions, which are short-lived, stateless computations that run on demand in the cloud. DGSF leverages a disaggregated GPU fabric consisting of a pool of GPU servers and a set of proxy servers that mediate communication between the serverless functions and the GPUs. DGSF supports GPU APIs and frameworks such as CUDA, OpenCL, and TensorFlow, and provides transparent and efficient GPU scheduling and sharing among multiple functions.
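The proxy's core job is to multiplex many short-lived function invocations onto a shared GPU pool. DGSF's real scheduling policies are more involved; as a stand-in, here is a toy proxy using least-connections scheduling (each invocation goes to the GPU server with the fewest outstanding invocations), with the class and GPU names invented for illustration.

```python
import heapq

class GpuPoolProxy:
    """Toy proxy: route each serverless invocation to the least-loaded GPU
    server in the pool. Real systems also track completions, GPU memory,
    and locality, omitted here for brevity."""

    def __init__(self, gpu_ids):
        # Min-heap of (outstanding_invocations, gpu_id) pairs.
        self.load = [(0, g) for g in gpu_ids]
        heapq.heapify(self.load)

    def dispatch(self, fn_name):
        count, gpu = heapq.heappop(self.load)      # least-loaded server
        heapq.heappush(self.load, (count + 1, gpu))
        return gpu

proxy = GpuPoolProxy(["gpu-0", "gpu-1"])
assignments = [proxy.dispatch(f"fn{i}") for i in range(4)]
print(assignments)  # ['gpu-0', 'gpu-1', 'gpu-0', 'gpu-1']
```

Because functions are stateless, the proxy is free to spread successive invocations across the pool like this, which is what makes sharing one GPU pool among many tenants practical.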

A third example of GPU disaggregation comes from NVIDIA's DPU, part of its vision for accelerated data-center infrastructure. A DPU, or data processing unit, is a specialized processor that offloads tasks such as networking, security, storage, and management from the host CPU. A DPU can also act as a GPU disaggregator: GPUs attached to a DPU-equipped server can be exposed to the network using NVIDIA's GPUDirect technology. A DPU can thus enable more flexible and scalable GPU provisioning, along with higher-performance, lower-latency access to remote GPUs.

These are some of the current efforts to achieve GPU disaggregation in the cloud, but they are not the only ones. There are also other projects and products that explore different aspects and challenges of GPU disaggregation, such as GPU virtualization, GPU migration, GPU orchestration, and GPU security. GPU disaggregation is still an active and evolving research area, and we can expect more innovations and solutions to emerge in the near future.
