GPU Multitasking

SOTA GPU Multitasking

Industrial:

  • MIG: Nvidia Multi-Instance GPU
  • FGPU: Aliyun Fractional GPU
  • MPS: Nvidia Multi-Process Service

Academic:

  • (OSDI'22) Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences
  • (SOSP'23) Paella: Low-Latency Model Serving with Software-Defined GPU Scheduling
  • (NSDI'23) Transparent GPU sharing in container clouds for deep learning workloads
  • (EuroSys'25) Improving GPU Sharing Performance through Adaptive Bubbleless Spatial-Temporal Sharing
  • (PPoPP'25) SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs

Current Limitations

  • GPU tasks are forced to share a single execution context, such as the CUDA context.
  • Temporal sharing incurs context-switch overhead on the order of ~100 microseconds.
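The ~100-microsecond switch cost matters most when time slices are short. A minimal back-of-envelope sketch (the switch cost and slice lengths are illustrative assumptions, not measurements from any of the papers above):

```python
# Back-of-envelope model of temporal GPU sharing overhead.
# Assumption (illustrative): each switch between tasks idles the GPU
# for ~100 microseconds, and one switch follows each time slice.

SWITCH_US = 100.0  # assumed context-switch cost in microseconds


def effective_utilization(task_slice_us: float, switch_us: float = SWITCH_US) -> float:
    """Fraction of wall-clock time spent on useful work when each
    time slice of `task_slice_us` is followed by one context switch."""
    return task_slice_us / (task_slice_us + switch_us)


for slice_us in (100, 1_000, 10_000):
    print(f"{slice_us:>6} us slice -> {effective_utilization(slice_us):.0%} useful work")
```

With 100 us slices, half of the GPU's time goes to switching; amortizing the same switch cost over 10 ms slices makes it negligible, which is why fine-grained temporal sharing (e.g., microsecond-scale preemption) needs much cheaper switch mechanisms.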

GPU Communications

  • Intra-node: NVLink
  • Inter-node: GPUDirect RDMA