OSDI
2024
OSDI 2024 has the following sessions:
- Memory
- Low-Latency LLM Serving
- Distributed
- Deep Learning
- Operating System
- Cloud Computing
- Formal Verification
- Cloud Security
- Data Management
- Analysis of Correctness (formal verification, reliability, etc.)
- ML Scheduling
A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications
Existing far-memory systems take one of two implementation paths:
- Use the kernel's paging system
- Bypass the kernel and fetch data at object granularity
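The cost difference between the two paths can be sketched as a toy model (sizes and function names are hypothetical, not from the paper): the paging plane transfers whole pages, so small objects suffer read amplification, while the object plane fetches only the bytes requested.

```python
PAGE_SIZE = 4096  # bytes, a typical kernel page size

def paging_fetch(obj_size: int) -> int:
    """Kernel-paging data plane: a page fault pulls in whole pages,
    so even a small object costs at least one full page of transfer."""
    pages = (obj_size + PAGE_SIZE - 1) // PAGE_SIZE
    return pages * PAGE_SIZE

def object_fetch(obj_size: int, header: int = 16) -> int:
    """Kernel-bypass data plane: the runtime fetches exactly the object
    plus a small per-object header, avoiding read amplification."""
    return obj_size + header

# A 64-byte object suffers 64x read amplification under paging:
assert paging_fetch(64) == 4096
assert object_fetch(64) == 80
```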
Sabre: Hardware-Accelerated Snapshot Compression for Serverless MicroVMs
Motivation: cold starts, VM snapshotting.
Backgrounds:
- Firecracker can snapshot the full guest memory or only the dirty pages.
- Working sets of pages can make serverless VM snapshots smaller and faster to fetch.
- Hardware implementations of (de)compression include Intel In-Memory Analytics Accelerator (IAA)
Contribution:
- Characterizes the IAA accelerator on a diverse set of benchmarks, showing its potential for compressing memory pages.
- Builds Sabre and integrates it with the Firecracker virtual machine monitor (VMM) in a serverless environment with snapshotting support.
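The observation behind snapshot compression — guest-memory pages are often highly compressible — can be illustrated in software. Below is a minimal sketch using Python's `zlib` as a software stand-in (Sabre itself offloads Deflate-style compression to IAA hardware); the page contents are made up.

```python
import zlib

PAGE = 4096

zero_page = bytes(PAGE)  # untouched guest memory: all zeros
text_page = (b"GET /index.html HTTP/1.1\r\n" * 160)[:PAGE]  # repetitive data

# Per-page compression, as snapshot compression would apply it.
for name, page in [("zero", zero_page), ("text", text_page)]:
    ratio = len(page) / len(zlib.compress(page))
    print(name, "compression ratio:", round(ratio, 1))

assert len(zlib.compress(zero_page)) < 64    # zero pages collapse to almost nothing
assert len(zlib.compress(text_page)) < PAGE  # repetitive pages shrink substantially
```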
Fairness in Serving Large Language Models
🏫: UCB Skylab
LLM serving scheduling.
The first author (Ying Sheng) will be joining UCLA in Fall 2026.
MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures
🏫: University of Sydney
ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing
⚠️ Industrial paper from Meta Platforms.
Motivation: detect small performance regressions, sometimes as tiny as 0.01%, on a serverless platform with millions of machines.
Contribution:
- (Heterogeneous cloud machines) Performance variance between two machines is comparable when they share the same instance type, CPU architecture, kernel version, and datacenter region, and have CPU turbo disabled.
- (Detect small regressions)
- (Support diverse services) ServiceLab takes the record-and-replay approach for testing.
Performance variance has three sources: 1. accidents, 2. environment, 3. true regressions. After analysis and filtering, the performance invariants are: kernel version, ServerType, CPU architecture, and datacenter region.
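A minimal sketch of how such invariants might be applied (record fields and values are hypothetical, not ServiceLab's actual schema): only runs sharing every invariant attribute are grouped for comparison, so environmental variance does not masquerade as a regression.

```python
from collections import defaultdict

# Invariant attributes: latency is only comparable between machines
# that agree on all of them.
INVARIANTS = ("kernel", "server_type", "cpu_arch", "region")

runs = [
    {"kernel": "5.12", "server_type": "T1", "cpu_arch": "x86", "region": "us", "latency_ms": 10.0},
    {"kernel": "5.12", "server_type": "T1", "cpu_arch": "x86", "region": "us", "latency_ms": 10.1},
    {"kernel": "6.1",  "server_type": "T1", "cpu_arch": "x86", "region": "us", "latency_ms": 9.2},
]

groups = defaultdict(list)
for r in runs:
    key = tuple(r[k] for k in INVARIANTS)  # compare apples to apples only
    groups[key].append(r["latency_ms"])

# Two distinct kernel versions yield two incomparable groups.
assert len(groups) == 2
```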
Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
🏫: Microsoft Research
Performance Interfaces for Hardware Accelerators
🏫: EPFL, Dependable System Group
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
🏫: ETH Zurich, Systems Group
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
🏫: University of Edinburgh
Serverless inference takes a user's model and its parameters as input, stored in a checkpoint storage system. When a request arrives, the scheduler selects available GPUs to load these checkpoints, and a router directs the request to the selected GPUs. This design generally suffers from high latency and cold starts.
Core idea: leverage the multi-tier storage hierarchy for local checkpoint storage, harnessing its significant aggregate bandwidth for efficient checkpoint loading.
Challenges:
- Bandwidth
- Live migration of inference, with two types: (1) token-only (2) full kv-cache
- Predict the resource consumption
Multi-tier storage hierarchy: 1. memory, 2. NVMe SSD, 3. SATA SSD
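The loading strategy can be sketched as a toy model (tier bandwidths, function names, and the remote fallback speed are assumptions, not figures from the paper): pick the fastest local tier that already holds the checkpoint, and fall back to the remote checkpoint store otherwise.

```python
# Hypothetical per-tier bandwidths (GB/s), fastest first.
TIERS = [("memory", 50.0), ("nvme_ssd", 7.0), ("sata_ssd", 0.5)]

def load_time(model: str, size_gb: float, cache: dict) -> float:
    """Estimated checkpoint-load latency: use the fastest local tier
    holding the checkpoint; otherwise fetch from remote storage."""
    for tier, bw in TIERS:
        if model in cache.get(tier, set()):
            return size_gb / bw
    return size_gb / 0.1  # remote checkpoint store (assumed 0.1 GB/s)

cache = {"nvme_ssd": {"llama-13b"}}
assert load_time("llama-13b", 26, cache) == 26 / 7.0   # local NVMe hit
assert load_time("opt-66b", 132, cache) == 132 / 0.1   # cold: remote fetch
```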
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
🏫: UCSD
Llumnix: Dynamic Scheduling for Large Language Model Serving
⚠️ Industrial paper from Alibaba.
DRust: Language-Guided Distributed Shared Memory with Fine Granularity, Full Transparency, and Ultra Efficiency
When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling
🏫: Tufts University
A new scheduler to reach a balance between predictability and practicality.
2023
Honeycomb: Secure and Efficient GPU Executions via Static Validation
TAILCHECK: A Lightweight Heap Overflow Detection Mechanism with Page Protection and Tagged Pointers
2022
Automatic Reliability Testing for Cluster Management Controllers
KSplit: Automating Device Driver Isolation
Combines static analysis and kernel isolation.
RESIN: A Holistic Service for Dealing with Memory Leaks in Production Cloud Infrastructure
Memory leakage, cloud infrastructure
TODO
XRP: In-Kernel Storage Functions with eBPF
TODO
zIO: Accelerating IO-Intensive Applications with Transparent Zero-Copy IO
TODO
Design and Verification of the Arm Confidential Compute Architecture
TODO
CAP-VMs: Capability-Based Isolation and Sharing in the Cloud
TODO
Application-Informed Kernel Synchronization Primitives
TODO
Operating System Support for Safe and Efficient Auxiliary Execution
Auxiliary tasks: tasks for fault detection, performance monitoring, online diagnosis, resource management, etc.
Three protection scenarios:
- application extensibility: protect the main program from untrusted extension code.
- secure partitioning: protect sensitive procedures from a compromised main application.
- maintenance: protect the main application from trusted maintenance code.
BlackBox: A Container Security Monitor for Protecting Containers on Untrusted Operating Systems
Terminology
- TCB: trusted computing base, can be a metric of LOC.
- CSM: container security monitor, serves as the TCB in BlackBox.
BlackBox: fine-grained protection of container data confidentiality and integrity without needing to trust the OS.
2021
NrOS: Effective Replication and Sharing in an Operating System
2020
Do OS abstractions make sense on FPGAs?
TODO
Testing Configuration Changes in Context to Prevent Production Failures
ctest's two targets: (1) misconfigurations, (2) bugs in code exposed by configuration changes.
ctest is parameterized.
ctest uses dynamic analysis, instrumenting the GET and SET APIs of configuration abstractions.
ctest exempts parameters that implicitly assume values.
ctest uses heuristics to automatically generate values for validation.
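The GET/SET instrumentation idea can be sketched as follows (class, method, and parameter names are hypothetical, not ctest's actual API): wrapping the configuration abstraction records which parameters the exercised code reads, and the same test can then be re-run under changed values.

```python
class Conf:
    """Hypothetical configuration abstraction with instrumented GET/SET,
    mirroring how ctest tracks the parameters a test exercises."""
    def __init__(self, values):
        self.values = dict(values)
        self.reads, self.writes = set(), set()

    def get(self, key):
        self.reads.add(key)       # dynamic analysis: record the read
        return self.values[key]

    def set(self, key, value):
        self.writes.add(key)      # record the write
        self.values[key] = value

def flush_if_needed(conf):
    # Code under test: its behavior depends on one parameter.
    return "flush" if conf.get("buffer.size") > 1024 else "buffer"

# Parameterized test: rerun the same logic under a changed value.
conf = Conf({"buffer.size": 512})
assert flush_if_needed(conf) == "buffer"
conf.set("buffer.size", 4096)
assert flush_if_needed(conf) == "flush"
assert conf.reads == {"buffer.size"}  # this test covers this parameter
```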
Toward a Generic Fault Tolerance Technique for Partial Network Partitioning