GPU Multitasking
SOTA GPU Multitasking
Industrial:
- MIG: Nvidia Multi-Instance GPU
- FGPU: Aliyun Fractional GPU
- MPS: Nvidia Multi-Process Service
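Of the industrial mechanisms above, MPS is the one driven from user space: clients share one GPU context, and the real environment variable `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` caps the fraction of SMs a client may use. A minimal sketch (the worker script name is a placeholder, and actually running it needs a GPU plus an MPS control daemon):

```python
import os
import subprocess  # used by the commented-out launch below

# Cap this MPS client at 50% of the GPU's SMs via the documented
# CUDA_MPS_ACTIVE_THREAD_PERCENTAGE knob.
env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE="50")

# Hypothetical worker; launching it requires a GPU and a running MPS daemon:
# subprocess.run(["python", "inference_worker.py"], env=env, check=True)

print(env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"])
```

MIG, by contrast, is configured at the driver level (e.g., via `nvidia-smi`) rather than per-process environment variables.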
Academic:
- (OSDI'22) Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
- (SOSP'23) Paella: Low-latency model serving with software-defined gpu scheduling
- (NSDI'23) Transparent GPU sharing in container clouds for deep learning workloads
- (EuroSys'25) Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing
- (PPoPP'25) SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
Current Limitations
- GPU tasks are forced to share a single execution context (e.g., the CUDA context)
- Temporal sharing incurs context-switch overhead on the order of ~100 microseconds
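The ~100 microsecond context-switch figure above implies a trade-off in temporal sharing: short scheduling quanta give better latency fairness but burn a large fraction of GPU time on switching. A back-of-envelope sketch (the quantum values are illustrative assumptions):

```python
# Model the useful fraction of GPU time under temporal sharing, given
# a ~100 us context-switch cost (figure from the notes above).
SWITCH_US = 100.0  # context-switch cost in microseconds

def efficiency(quantum_us: float) -> float:
    """Fraction of each scheduling quantum spent on useful work."""
    return quantum_us / (quantum_us + SWITCH_US)

# Illustrative quanta: a switch every 100 us wastes half the GPU;
# millisecond-scale quanta amortize the cost.
for q in (100, 1_000, 10_000):
    print(f"{q:>6} us quantum -> {efficiency(q):.1%} useful")
```

This is why microsecond-scale preemption (the OSDI'22 work above) matters: it shrinks the switch cost itself rather than forcing coarser quanta.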
GPU communications
- Intra-node: NVLink
- Inter-node: GPUDirect RDMA
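The two interconnects above differ by roughly an order of magnitude in bandwidth. A rough transfer-time comparison for a 1 GiB tensor, where both bandwidth figures are illustrative assumptions (NVLink at ~450 GB/s per direction, as on H100-class GPUs; GPUDirect RDMA over a 400 Gb/s NIC), not measured numbers:

```python
GIB = 1 << 30  # 1 GiB in bytes

# Bandwidths in bytes/s -- both values are assumptions for illustration.
LINKS = {
    "NVLink (intra-node)": 450e9,         # ~450 GB/s per direction
    "GPUDirect RDMA (inter-node)": 50e9,  # 400 Gb/s NIC = 50 GB/s
}

def transfer_ms(num_bytes: int, bandwidth: float) -> float:
    """Idealized transfer time in milliseconds (ignores latency, protocol overhead)."""
    return num_bytes / bandwidth * 1e3

for name, bw in LINKS.items():
    print(f"{name}: {transfer_ms(GIB, bw):.2f} ms for 1 GiB")
```

Real transfers add link latency and protocol overhead, so these are lower bounds.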