
SOSP

2025

LithOS: An Operating System for Efficient Machine Learning on GPUs

This paper is funded by Meta.

Current SOTA GPU sharing mechanisms, like MPS and MIG, are not effective, a point also stated in openvgpu.

Reverse engineering is used to maintain API compatibility. Preemption -> kernel atomization. TODO

Q&A

  1. Mine: What is the difference between openvgpu and LithOS?
  2. GPU vendors may have made some modifications.
  3. Does TPC scheduling introduce overhead? The overhead is manageable. Besides, LithOS does not virtualize memory.
  4. How to determine the right kernel size? Use a module to profile and predict what size to slice.

μFork: Supporting POSIX fork Within a Single-Address-Space OS

Use CHERI tag to identify pointers in pages, then change policy from copy-on-write to copy-on-pointer-area.
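The copy-on-pointer-area idea above can be sketched as follows. This is my own minimal reconstruction, not μFork's code: per-word CHERI tags (which the hardware maintains for capability-sized words) decide, page by page, whether fork shares the page copy-on-write or copies it because it contains pointers. All names are illustrative.

```python
# Hedged sketch (not μFork's actual implementation): choose, per page,
# between plain copy-on-write sharing and eager copying of pointer-
# bearing pages, driven by per-word capability tags as CHERI hardware
# would report them.

PAGE_WORDS = 512  # a 4 KiB page of 8-byte words (illustrative)

def page_has_capability(tags):
    """CHERI keeps a 1-bit tag per capability-sized word; a page with
    any tagged word holds at least one pointer."""
    return any(tags)

def fork_page_policy(pages):
    """pages: list of (data, tags). Returns a per-page decision:
    'copy' for pointer-bearing pages, 'share-cow' otherwise."""
    return ['copy' if page_has_capability(tags) else 'share-cow'
            for _data, tags in pages]

plain = ([0] * PAGE_WORDS, [False] * PAGE_WORDS)
with_ptr = ([0] * PAGE_WORDS, [False] * (PAGE_WORDS - 1) + [True])
print(fork_page_policy([plain, with_ptr]))  # ['share-cow', 'copy']
```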

Q&A

  1. Is there a Linux kernel commit of a memory-protection feature that achieves process isolation in a single address space? No response.
  2. Page region overlap issue? No response.

Tock: From Research To Securing 10 Million Computers

Rust is used to interact with syscalls, together with the language's requirements.

Tock's three major syscalls: 1. command 2. subscribe 3. allow (exposes a region of process memory)

Q&A

  1. What is the problem if a user process only messes up itself? That is not the whole story, because a user process may be in privileged mode.

Proto: A Guided Journey through Modern OS Construction

QA

  1. How do students use AI to solve the coding issues? No response.
  2. Why is a more "fun" work needed? No response.

The Design and Implementation of a Virtual Firmware Monitor

Firmware is a security liability. A virtual firmware monitor can de-privilege firmware just like a guest kernel, yet the ISA must be virtualizable to achieve this. After review, RISC-V is a virtualizable ISA.

Translate the RISC-V spec into a Rust model of RISC-V and verify it with a Rust verifier.

Oasis: Pooling PCIe Devices Over CXL to Boost Utilization

PCIe pooling enables multiple hosts to use shared PCIe devices, which can reduce TCO (total cost of ownership) in the datacenter.

A software implementation with RDMA has high latency. RDMA has been used for remote SSDs, yet fully remote SSDs in AWS services were refused by customers due to low performance.

M6i supported local NVMe drives; M7i, introduced in 2023, did not. AWS later introduced I7ie with local NVMe drives.

Current PCIe pooling is carried out by PCIe switches, which are expensive and can cost over $80k.

Challenges: 1. CXL increases access latency by 2-3× compared to local DDR5 memory. 2. CXL memory pool devices available today are not cache-coherent across hosts, so cache coherence has to be implemented in software.
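The second challenge can be illustrated with a minimal write-invalidate scheme. This protocol is my own stand-in, not Oasis's design: each host caches pool lines, and a write invalidates the other hosts' copies in software before updating the shared pool.

```python
# Hedged sketch of software cache coherence over a non-coherent CXL
# pool (a minimal write-invalidate protocol of my own, not Oasis's):
# writers invalidate peer caches before updating the pool, so readers
# re-fetch fresh data on their next miss.

class Pool:
    def __init__(self):
        self.mem = {}      # shared CXL-attached memory
        self.hosts = []

class Host:
    def __init__(self, pool):
        self.cache = {}
        self.pool = pool
        pool.hosts.append(self)

    def read(self, addr):
        if addr not in self.cache:        # miss: fetch from the pool
            self.cache[addr] = self.pool.mem.get(addr)
        return self.cache[addr]

    def write(self, addr, value):
        for h in self.pool.hosts:         # software invalidation step
            if h is not self:
                h.cache.pop(addr, None)
        self.pool.mem[addr] = value
        self.cache[addr] = value

pool = Pool()
a, b = Host(pool), Host(pool)
a.write(0, 'x')
assert b.read(0) == 'x'
b.write(0, 'y')                           # invalidates a's stale copy
print(a.read(0))  # 'y'
```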

QA

  1. Is Oasis transparent to applications? Yes, and the application does not need to be aware of the change to the NIC.
  2. Is the latency acceptable for certain protocols? For some protocols, yes.

Spirit: Fair Allocation of Interdependent Resources in Remote Memory Systems

Shared remote memory fails on fairness, because larger cache capacity needs larger bandwidth. Current schemes like static allocation have low performance. The Symbiosis algorithm models the resources like a free market: if cache demand > supply, increase the cache price and decrease the bandwidth price. Thus Spirit achieves fairness between bandwidth and cache capacity.
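The market-style price update can be sketched as below. The update rule and step size are my assumptions, not Spirit's actual algorithm: when cache is over-demanded, its price rises while the bandwidth price falls, nudging tenants toward the under-demanded resource.

```python
# Hedged sketch of a Symbiosis-style free-market update (step size and
# multiplicative rule are illustrative assumptions): scarce resources
# get more expensive, abundant ones cheaper.

def update_prices(price_cache, price_bw, cache_demand, cache_supply,
                  step=0.1):
    if cache_demand > cache_supply:      # cache is the scarce resource
        price_cache *= (1 + step)
        price_bw *= (1 - step)
    elif cache_demand < cache_supply:    # bandwidth is the scarce one
        price_cache *= (1 - step)
        price_bw *= (1 + step)
    return price_cache, price_bw

pc, pb = 1.0, 1.0
for _ in range(5):                       # cache over-demanded 5 rounds
    pc, pb = update_prices(pc, pb, cache_demand=10, cache_supply=8)
print(round(pc, 5), round(pb, 5))        # cache up, bandwidth down
```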

QA

  1. No solution for evicted memory pages.

Scalable Far Memory: Balancing Faults and Evictions

Far memory reduces TCO, but page-based far memory is challenging. Bottlenecks: 1. Software page eviction is the bottleneck; the solution is to limit the number of eviction cores. 2. Tracking page hotness; eviction scalability matters more than accuracy.

QA

  1. How is ideal performance estimated?
  2. Ideal case of latency?
  3. Batching?

Device-Assisted Live Migration of RDMA Devices

Demeter: A Scalable and Elastic Tiered Memory Solution for Virtualized Cloud via Guest Delegation

Robust LLM Training Infrastructure at ByteDance

Paper funded by ByteDance Seed. Goal: minimal unproductive failure time. Combines real-time and stop-time checks.

QA

  1. Difference from NVIDIA SuperPOD? Yes, it is a similar idea.

Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters

Power grid failures -> move to the cloud + heterogeneous hardware. How does Sailor work? 1. Profiler 2. Planner 3. Simulator
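The profiler -> planner -> simulator pipeline can be sketched as below. The throughput numbers and plan space are invented for illustration: the planner enumerates candidate (GPU type, count) plans and asks the simulator, fed with profiled per-GPU throughput, which one finishes a training step fastest.

```python
# Hedged sketch of a Sailor-style pipeline (all numbers illustrative):
# profiler measures per-GPU throughput, simulator estimates step time
# for a plan, planner picks the minimum.

profile = {'A100': 100.0, 'V100': 40.0}   # profiler: tokens/s per GPU

def simulate(gpu, count, work=10000.0):
    """Simulator: estimated step time (s) for `work` tokens."""
    return work / (profile[gpu] * count)

def plan(available):
    """Planner: pick the fastest (time, gpu, count) option."""
    options = [(simulate(gpu, n), gpu, n) for gpu, n in available]
    return min(options)

best = plan([('A100', 4), ('V100', 16)])
print(best)  # (15.625, 'V100', 16): more slow GPUs beat few fast ones
```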

QA

  1. Failures on heterogeneous clusters? It has a failure-tolerant design.
  2. Slower links? A profiler is used to make the plan.
  3. The planner only sees the next available resources.

TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Funded by MSR. LLM training is costly, and errors manifest as silent parallelization bugs (they do not crash).

Formal verification at the level of the GPU execution plan.

Execution model: - assign - sync - schedule

But false alarms are caused by floating point in deep learning -> numerical drift. Tensor IDs follow SSA, which enables lineage verification.
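The SSA property can be sketched as follows. This is my own minimal reconstruction, not TrainVerify's implementation: because every op result gets a fresh tensor ID (never reassigned), lineage forms an acyclic chain that can be walked back to the inputs and compared across parallelization plans.

```python
# Hedged sketch of SSA-style tensor IDs enabling lineage tracking (my
# own toy model): each emitted tensor gets a unique ID, and ancestry is
# recoverable by walking the recorded (op, inputs) entries.

import itertools

_ids = itertools.count()
lineage = {}  # tensor id -> (op name, input ids)

def emit(op, *inputs):
    tid = next(_ids)          # SSA: every result is a fresh ID
    lineage[tid] = (op, inputs)
    return tid

def ancestors(tid):
    _op, inputs = lineage[tid]
    out = set(inputs)
    for i in inputs:
        out |= ancestors(i)
    return out

x = emit('input')             # id 0
w = emit('input')             # id 1
y = emit('matmul', x, w)      # id 2
z = emit('relu', y)           # id 3
print(sorted(ancestors(z)))   # [0, 1, 2]
```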

Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training

Observability for runtime debugging.

QA

  1. Are there cases where Mycroft cannot find bugs?

Orthrus: Efficient and Timely Detection of Silent User Data Corruption in the Cloud with Resource-Adaptive Computation Validation

SOTA: - offline: 3-month core test cases. - online: replication-based validation (RBV) and instruction-level validation (ILV).

Observation: the codebase of a typical cloud application often exhibits a clear separation between a control path and a data path.

Optimistic Recovery for High-Availability Software via Partial Process State Preservation

Availability recovery should be quick and meaningful. Current approaches either restart everything or preserve everything, and preserving everything could persist faulty state.

Partial state preservation: fall back to restart when state is inconsistent, i.e., when a write is inconsistent. Unsafe region: execute cross-check validation periodically.
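The preserve-then-validate-or-restart flow can be sketched as below. The checksum scheme and state shapes are my assumptions, not the paper's mechanism: preserved state carries a checksum, recovery reuses it only when the cross-check passes, and otherwise falls back to a clean restart.

```python
# Hedged sketch of partial state preservation with a restart fallback
# (checksum-based validation is an illustrative stand-in for the
# paper's cross-checks).

import hashlib
import json

def preserve(state):
    blob = json.dumps(state, sort_keys=True)
    return {'blob': blob,
            'check': hashlib.sha256(blob.encode()).hexdigest()}

def recover(snapshot, fresh_state):
    blob = snapshot['blob']
    if hashlib.sha256(blob.encode()).hexdigest() == snapshot['check']:
        return json.loads(blob), 'preserved'   # state is consistent
    return fresh_state, 'restart'              # inconsistent: restart

snap = preserve({'sessions': 3})
print(recover(snap, {'sessions': 0}))          # preserved path
snap['blob'] = '{"sessions": 99}'              # simulate corruption
print(recover(snap, {'sessions': 0}))          # falls back to restart
```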

QA

  1. The states have to be assigned by application developers and involve modification to applications.
  2. How to waive the pre-defined API? Use static analysis for higher automation.

Fast End-to-End Performance Simulation of Accelerated Hardware–Software Stacks

Goal: run CPU threads natively. NEX sync

Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference

Gaps: 1. GPU and NPU have no parallelization (NPU > GPU > CPU). 2. Decoding is purely memory-bound and has no optimization.

Characterization: 1. NPU stage performance. 2. GPU high sync cost. 3.

Hardware: systolic array. Memory bandwidth is ~62 GB/s on the Snapdragon 8 Gen 3. Parallel matmul runs on NPU and GPU; the NPU can maximize the memory bandwidth.
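The NPU/GPU parallel matmul idea can be sketched by a row split. The device names and the split point are illustrative: each "device" computes a row slice of A @ B, and the slices concatenate to the full result, which is what allows the two accelerators to work concurrently.

```python
# Hedged sketch of partitioning one matmul across two accelerators by
# rows (a toy serial model of what would run concurrently on NPU and
# GPU; split point is illustrative).

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*b)] for row in a]

def split_matmul(a, b, split):
    npu_part = matmul(a[:split], b)   # rows 0..split-1 on the "NPU"
    gpu_part = matmul(a[split:], b)   # remaining rows on the "GPU"
    return npu_part + gpu_part        # concatenate the row slices

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [0, 1]]                  # identity, so result equals a
assert split_matmul(a, b, split=2) == matmul(a, b)
print(split_matmul(a, b, split=2))
```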

Q&A

  1. Compare with FriendliAI? No.
  2. Power supply problem? Performance is better but no surprising result.

IC-Cache: Efficient Large Language Model Serving via In-context Caching

A naive semantic cache is insufficient. Example selector = 2 stages (relevancy + small proxy model).
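The two-stage selector can be sketched as below. Both scoring functions are stand-ins I made up, not IC-Cache's models: stage 1 filters the pool with a cheap relevancy measure, and stage 2 reranks the survivors with a costlier proxy score.

```python
# Hedged sketch of a two-stage example selector (retrieval, then a
# small proxy scorer); token overlap and the proxy heuristic are
# illustrative stand-ins for the real models.

def relevancy(query, example):
    """Stage 1: cheap token-overlap relevancy."""
    q, e = set(query.split()), set(example.split())
    return len(q & e) / max(len(q), 1)

def proxy_score(query, example):
    """Stage 2: pretend small proxy model (a made-up heuristic)."""
    return relevancy(query, example) * len(example.split())

def select(query, pool, k1=3, k2=1):
    stage1 = sorted(pool, key=lambda e: relevancy(query, e),
                    reverse=True)[:k1]            # cheap prefilter
    return sorted(stage1, key=lambda e: proxy_score(query, e),
                  reverse=True)[:k2]              # costly rerank

pool = ["reset my password", "reset password and email",
        "delete account", "change billing plan"]
print(select("how do I reset my password", pool))
```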

QA

  1. Can it be combined with RACK?

PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications

A single output token in recommendation systems (e.g., given the purchase history, buy or not?). The workload can be kept prefill-only. The MLP layer triggers high, temporary GPU memory spikes. Scheduling optimization -> predict job duration.

Pie: A Programmable Serving System for Emerging LLM Applications

There is no clean way to use APIs in serving pipelines. Pie delegates token control to the user level: Pie accepts a program (called an inferlet) rather than a prompt, and then streams the API calls.

One example is putting agent tools into the serving system, lowering latency by reducing prompt round trips between the serving system and the user app.

QA

  1. Other scenarios like prefill-only? Granularity of the API.

DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction

In attention calculation, keys are more impactful than values, so keys get higher precision than values.
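The differentiated-precision effect can be demonstrated with uniform quantization. The bit widths here are illustrative, not DiffKV's exact format: quantizing keys at 8 bits and values at 4 bits makes the keys round-trip with visibly smaller error, matching the observation that keys matter more.

```python
# Hedged sketch of differentiated KV precision (bit widths are
# illustrative): higher-precision keys reconstruct more faithfully
# than lower-precision values under simple uniform quantization.

def quantize(xs, bits):
    """Uniform quantize to 2**bits levels, then dequantize."""
    levels = 2 ** bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels or 1.0      # avoid div-by-zero
    q = [round((x - lo) / scale) for x in xs]
    return [lo + qi * scale for qi in q]

vec = [0.01 * i for i in range(100)]
key_err = max(abs(a - b) for a, b in zip(vec, quantize(vec, 8)))
val_err = max(abs(a - b) for a, b in zip(vec, quantize(vec, 4)))
print(key_err < val_err)  # True: 8-bit keys beat 4-bit values
```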

QA

  1. Memory ordering issue?

Jenga: Effective Memory Management for Serving LLM with Heterogeneity

Four kinds of heterogeneous attention: F, S, M, E.

KNighter: Transforming Static Analysis with LLM-Synthesized Checkers

By Lingming Zhang from UIUC

Two basic elements: bug-fixing commits and static analysis possibilities. KNighter's three stages: bug pattern analysis -> plan synthesis -> checker implementation.

KNighter's limitations: 1. value-related issues 2. locks. Possible solution: abstract complex bug reports into more formal state-machine representations. Evaluated against smatch, which cannot utilize semantics in the Linux kernel.

My QA

  1. Comparison to CodeQL? Since CodeQL has a simpler syntax structure, does it suffer less syntax noise?
  2. One problem with research static analysis tools, compared to other static analysis tools, is that they are largely unusable. Is it valuable to leverage LLMs to fix such tools and evolve from them?
  3. Possibilities of generating patches with LLMs?
  4. How would cross-component systems leverage similar approaches?

eBPF Misbehavior Detection: Fuzzing with a Specification-Based Oracle

Three verifier failure modes: 1. over-approximated state 2. incorrect checks 3. implementation errors

spec = semantics + security checks

WASIT: Deep and Continuous Differential Testing of WebAssembly System Interface Implementations

WASI is hard to implement: - spec interpretation failures - semantic mismatch between WASI and the OS - frequent SDK changes

Define a DSL to filter out invalid inputs.

Moirai: Optimizing Placement of Data and Compute in Hybrid Clouds

This work is funded by Uber. Cross-site data movement is costly. Uber uses a hybrid-cloud data lake, but 90% of Uber's computation involves data movement between the public and private clouds.

Solution: identify the data-job dependencies and schedule the data. - Pre-replication for scale (trade storage for latency)

Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds

Cloud-based I/O still has internal latency. DP services and CP tasks are widely adopted.

DP and CP scheduling on SmartNICs has latency.

Quilt: Resource-aware Merging of Serverless Workflows

High invocation delay. A simple idea is to merge functions to reduce invocations. Quilt focuses on serverless functions that implement REST APIs. Quilt uses LLVM IR to convert RPCs into function calls. How many functions can be merged depends on the container size.
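The merging transformation can be illustrated as below. Quilt actually works at the LLVM IR level; this Python stand-in just shows the effect of the rewrite: an RPC stub between two serverless functions becomes a direct call, removing the invocation hop while preserving the result.

```python
# Hedged sketch of Quilt-style merging (a Python stand-in for an LLVM
# IR transformation): replace an RPC hop between two serverless
# functions with a direct function call.

import time

def compose_review(text):
    return {"review": text}

def store_review(review):
    return {"stored": review["review"]}

def rpc(fn, payload):
    """Stand-in for a network invocation between containers."""
    time.sleep(0.01)              # pretend invocation delay
    return fn(payload)

def workflow_rpc(text):           # original: two remote invocations
    return rpc(store_review, rpc(compose_review, text))

def workflow_merged(text):        # merged: direct calls, no hops
    return store_review(compose_review(text))

assert workflow_rpc("great") == workflow_merged("great")
print(workflow_merged("great"))   # {'stored': 'great'}
```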

Evaluation: 1. Latency reduction on DeathStarBench. 2. Quilt suits many small functions.

QA

  1. What if a function changes a lot? Mark it as not-to-merge.
  2. Isolation between different functions? Basically you have to trust the developer, which aligns with today's developers' assumptions.

Mantle: Efficient Hierarchical Metadata Management for Cloud Object Storage Services

Unlocking True Elasticity for the Cloud-Native Era with Dandelion

Separate computation and communication: computation functions are untrusted and communication functions are trusted. The guest OS is removed.

QA

  1. How to access hardware? Pre-allocate.
  2. A constrained model like WebAssembly?
  3. Rewrite user functions? No.

Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market

This work is by Alibaba and Peking University

2024

Dirigent: Lightweight Serverless Orchestration

If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs

2023

One Simple API Can Cause Hundreds of Bugs: An Analysis of Refcounting Bugs in All Modern Linux Kernels

reference counting, bug, Linux kernel

Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management

Kubernetes, operation, system management, cloud, reliability, operation correctness, operator

state reconciliation: an operation declares a desired system state, and the operator automatically reconciles the system to that declared state.

Acto models operations as state transitions and realizes state-transition sequences.
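The reconciliation pattern that Acto exercises can be sketched as below. The resource model is my simplification of the Kubernetes operator pattern, not Acto's code: the operator diffs declared against actual state and applies fixes until they match.

```python
# Hedged sketch of state reconciliation (a toy operator loop; the
# dict-based resource model is an illustrative simplification).

def reconcile(declared, actual):
    """Drive `actual` toward `declared`, recording each applied fix."""
    steps = []
    for key, want in declared.items():
        if actual.get(key) != want:
            actual[key] = want            # "apply" the missing piece
            steps.append((key, want))
    for key in list(actual):
        if key not in declared:
            del actual[key]               # prune undeclared state
            steps.append((key, None))
    return actual, steps

declared = {'replicas': 3, 'version': 'v2'}
actual = {'replicas': 1, 'debug': True}
actual, steps = reconcile(declared, actual)
print(actual)   # {'replicas': 3, 'version': 'v2'}
```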

motivation study:

  1. Unit tests cannot validate operation correctness; only 34% contain e2e tests.
  2. Existing e2e tests cover only 1.27 - 2.15% of supported properties.
  3. State-based assertions in existing e2e tests cover only 0.24–10.90% of managed systems’ state-object fields.

design:

Acto supports a black-box mode and a white-box mode. Each takes two inputs: 1) a build manifest, 2) a specification of state declaration. White-box mode additionally requires the source code for program analysis.

Black-box mode infers the relationship between interface declarations and resources based on composite types (e.g., classes). However, such a design cannot infer primitive types like int, float, etc. White-box mode builds the relationship with taint analysis.

After building the relationship between interface declarations and resource objects, Acto implements a consistency oracle and a differential oracle to find false values and inconsistent items, respectively.
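The two oracles can be sketched as below. The data shapes are my assumptions: the consistency oracle checks that the declared spec is reflected in the observed system state, while the differential oracle checks that two runs reaching the same declaration end in the same state.

```python
# Hedged sketch of Acto's oracle ideas (dict-based states are an
# illustrative simplification of real Kubernetes state objects).

def consistency_oracle(declaration, system_state):
    """Every declared field must appear with the declared value."""
    return all(system_state.get(k) == v
               for k, v in declaration.items())

def differential_oracle(state_a, state_b):
    """Two paths to the same declaration must converge on one state."""
    return state_a == state_b

decl = {'replicas': 2}
print(consistency_oracle(decl, {'replicas': 2, 'ready': True}))  # True
print(consistency_oracle(decl, {'replicas': 1}))                 # False
print(differential_oracle({'replicas': 2}, {'replicas': 2}))     # True
```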

2021

Boki: Stateful Serverless Computing with Shared Logs

ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling

Exploiting Nil-Externality for Fast Replicated Storage

Syrup: User-Defined Scheduling Across the Stack

TwinVisor: Hardware-isolated Confidential Virtual Machines for ARM

Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware

Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale

Faster and Cheaper Serverless Computing on Harvested Resources

Harvest VM: defined by its minimum size and how many harvested physical cores it is capable of using. The workload running on a Harvest VM can query the number of physical cores assigned to it via /proc in Linux.

SoK of improving the serverless infrastructure (performance up & cost down):

  1. scheduling policies for making serverless platforms cost-effective and performant
  2. performance-aware and cost-effective storage
  3. secure and lightweight container infrastructure
  4. characterization of existing serverless workloads
  5. enabling applications to run in a serverless-native manner

The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems

An Analysis of Performance Evolution of Linux’s Core Operations

2017

Komodo: Using verification to disentangle secure-enclave hardware from software

2014

Eidetic Systems

2011

An Empirical Study on Configuration Errors in Commercial and Open Source Systems