SOSP
2025
LithOS: An Operating System for Efficient Machine Learning on GPUs
This paper is funded by Meta.
Current SOTA GPU-sharing mechanisms such as MPS and MIG are not effective, a point also made in openvgpu.
Reverse engineering is used to maintain API compatibility. Preemption -> Kernel Atomization TODO
Q&A
- Mine: What is the difference between openvgpu and LithOS? GPU vendors may have made some modifications.
- Does TPC scheduling introduce overhead? The overhead is manageable. Also, LithOS does not virtualize memory.
- How is the right kernel size determined? A module profiles and predicts what size to slice.
μFork: Supporting POSIX fork Within a Single-Address-Space OS
Use CHERI tags to identify pointers in pages, then change the policy from copy-on-write to copy-on-pointer-area.
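The copy-on-pointer-area idea can be sketched as a toy simulation (all names are illustrative; a per-word tag bitmap stands in for CHERI's hardware capability tags, and this is not μFork's actual mechanism):

```python
# Toy model: a "page" is a list of words plus a tag bitmap marking which
# words hold capabilities (pointers). On fork, instead of copying the whole
# page on first write (copy-on-write), only the tagged pointer words are
# copied eagerly so the child's capabilities can be rewritten, while
# untagged data stays shared.

def fork_page(page, tags):
    shared = page                                   # untagged data stays shared
    pointer_copy = {i: page[i] for i, t in enumerate(tags) if t}
    return shared, pointer_copy

page = [10, 0xdeadbeef, 30, 0xcafebabe]
tags = [False, True, False, True]                   # words 1 and 3 are pointers
shared, ptrs = fork_page(page, tags)
print(ptrs)                                         # only tagged words copied
```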
Q&A
- Is there a Linux kernel commit for the memory-protection feature used to achieve process isolation in the single address space? No response.
- Page region overlap issue? No response.
Tock: From Research To Securing 10 Million Computers
Rust is used to interact with syscalls and to meet language requirements.
Tock's 3 major syscalls: 1. command 2. subscribe 3. allow (exposes a region of process memory)
Q&A
- What is the problem if a user process only messes up itself? That framing is not quite true, because a user process may be in privileged mode.
Proto: A Guided Journey through Modern OS Construction
QA
- How do students use AI to solve the coding issues? No response.
- Why is a more "fun" work needed? No response.
The Design and Implementation of a Virtual Firmware Monitor
Firmware is a security liability. A virtual firmware monitor can de-privilege firmware just like a guest kernel, yet the ISA must be virtualizable to achieve this. After review, RISC-V is a virtualizable ISA.
Translate the RISC-V spec into a RISC-V Rust model and verify it with a Rust verifier.
Oasis: Pooling PCIe Devices Over CXL to Boost Utilization
PCIe pooling enables multiple hosts to use shared PCIe devices, which can reduce TCO (total cost of ownership) in datacenters.
A software implementation over RDMA has high latency. RDMA has been used for remote SSDs, yet fully remote SSDs in AWS services were rejected by customers due to low performance.
M6i supported local NVMe drives; M7i, introduced in 2023, did not support local NVMe drives. AWS later introduced I7ie with local NVMe drives.
Current PCIe pooling is carried out by PCIe switches, which are expensive and can cost over $80k.
Challenges: 1. CXL increases access latency by 2-3× compared to local DDR5 memory. 2. CXL memory pool devices available today are not cache-coherent across hosts, so software cache coherence has to be implemented.
QA
- Is Oasis transparent to applications? Yes, and the application does not need to be aware of the change to the NIC.
- Is the latency acceptable for a given protocol? For some protocols, yes.
Spirit: Fair Allocation of Interdependent Resources in Remote Memory Systems
Shared remote memory fails on fairness because larger cache capacity needs larger bandwidth. Current schemes like static allocation have low performance. Symbiosis algorithm: model the resources like a free market. If cache demand > supply, then increase the cache price and decrease the bandwidth price. Thus Spirit achieves fairness between bandwidth and cache capacity.
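The free-market price adjustment can be sketched as follows (a minimal sketch; the function names and the multiplicative update rule are assumptions, not Spirit's actual algorithm):

```python
def adjust_prices(cache_demand, cache_supply, cache_price, bw_price, step=0.1):
    """If cache is over-demanded, raise its price and lower the
    bandwidth price, nudging tenants to rebalance their bids."""
    if cache_demand > cache_supply:
        cache_price *= (1 + step)
        bw_price *= (1 - step)
    elif cache_demand < cache_supply:
        cache_price *= (1 - step)
        bw_price *= (1 + step)
    return cache_price, bw_price

# Cache is over-demanded: its price rises while bandwidth gets cheaper.
cp, bp = adjust_prices(cache_demand=120, cache_supply=100,
                       cache_price=1.0, bw_price=1.0)
print(cp > 1.0 and bp < 1.0)
```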
QA
- No solution for evicted memory pages.
Scalable Far Memory: Balancing Faults and Evictions
Far memory reduces TCO, but scaling page-based far memory is challenging. Bottlenecks: 1. Software page eviction is the bottleneck; the solution is to limit the number of eviction cores. 2. Tracking page hotness. For eviction, scalability > accuracy.
QA
- How is ideal performance estimated?
- Ideal case of latency?
- Batching?
Device-Assisted Live Migration of RDMA Devices
Demeter: A Scalable and Elastic Tiered Memory Solution for Virtualized Cloud via Guest Delegation
Robust LLM Training Infrastructure at ByteDance
Paper funded by ByteDance Seed. Goal: minimal unproductive failure time. Combine real-time and stop-time checks.
QA
- Difference from NVIDIA SuperPod? Yes, a similar idea.
Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters
Power grid failures -> on cloud + heterogeneous. How does Sailor work? 1. Profiler 2. Planner 3. Simulator
QA
- Failures on heterogeneous clusters? There is a failure-tolerant design.
- Slower links? A profiler is used to make the plan.
- The planner only sees the next available resources.
TrainVerify: Equivalence-Based Verification for Distributed LLM Training
Funded by MSR. LLM training is costly due to errors -> silent parallelization bugs (which do not crash).
Formal verification at the level of the GPU execution plan.
Execution model: - assign - sync - schedule
But false alarms are caused by floating point in deep learning -> numerical drift. Tensor IDs follow SSA, which enables lineage verification.
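The SSA tensor-ID idea can be sketched in miniature (a toy model, not TrainVerify's implementation; the drift-tolerant `close` check and the names are assumptions):

```python
import itertools

_ids = itertools.count()

class Tensor:
    """Each tensor gets a fresh SSA-style ID and records its lineage."""
    def __init__(self, value, parents=()):
        self.id = next(_ids)          # assigned once, never reused (SSA)
        self.value = value
        self.parents = tuple(parents)

def add(a, b):
    # the op records which tensor IDs it consumed
    return Tensor(a.value + b.value, parents=(a.id, b.id))

def close(a, b, tol=1e-6):
    # tolerate numerical drift instead of demanding bitwise equality
    return abs(a.value - b.value) <= tol

x, y = Tensor(1.0), Tensor(2.0)
z1 = add(x, y)   # plan A
z2 = add(y, x)   # plan B: reordered but semantically equivalent
print(close(z1, z2) and sorted(z1.parents) == sorted(z2.parents))
```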
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
Observability for runtime debugging.
QA
- Is there any case where Mycroft cannot find bugs?
Orthrus: Efficient and Timely Detection of Silent User Data Corruption in the Cloud with Resource-Adaptive Computation Validation
SOTA: - offline: 3-month core test cases. - online: replication-based validation (RBV) and instruction-level validation (ILV).
Observation: the codebase of a typical cloud application often exhibits a clear separation between a control path and a data path.
Optimistic Recovery for High-Availability Software via Partial Process State Preservation
Availability recovery should be quick and meaningful. Current approaches either restart everything or preserve everything; preserving everything could persist false state.
Partial state preservation: fall back to restart when the state is inconsistent, i.e. when a write is inconsistent. Unsafe region: execute cross-check validation periodically.
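The restart fallback can be sketched as follows (a minimal sketch under assumed names; `valid` is a stand-in for the paper's cross-check validation):

```python
def fresh_state():
    # fallback: restart with a clean state
    return {"connections": 0, "sessions": 0}

def valid(state, checksum):
    # stand-in cross-check; a real system would validate invariants
    return sum(state.values()) == checksum

def recover(preserved_state, checksum):
    """Reuse preserved partial state only if the cross-check passes;
    otherwise fall back to a full restart."""
    if preserved_state is not None and valid(preserved_state, checksum):
        return preserved_state
    return fresh_state()

ok = recover({"connections": 3, "sessions": 2}, checksum=5)   # consistent
bad = recover({"connections": 3, "sessions": 9}, checksum=5)  # inconsistent
print(ok != bad)
```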
QA
- The states must be assigned by application developers, which involves modifications to applications.
- How to waive the pre-defined API? Use static analysis for higher automation.
Fast End-to-End Performance Simulation of Accelerated Hardware–Software Stacks
Goal: run CPU threads natively. NEX sync
Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
Gaps: 1. The GPU and NPU have no parallelization (NPU > GPU > CPU). 2. Decoding is purely memory-bound and has no optimization.
Characterization: 1. NPU stage performance. 2. High GPU sync cost. 3.
Hardware: systolic array. Memory bandwidth ~62 GB/s on Snapdragon 8 Gen 3. Parallel matmul on NPU and GPU; the NPU can maximize the memory bandwidth.
Q&A 1. Compared with friendliai? No. 2. Power supply problem? Performance is better, but no surprising result.
IC-Cache: Efficient Large Language Model Serving via In-context Caching
A naive semantic cache is insufficient. Example selector = 2 stages (relevancy + a small proxy model).
QA 1. combined with RACK?
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
A single output token in recommendation systems (e.g., given the purchase history, buy or not?) can keep the workload prefill-only. The MLP layer triggers high, temporary GPU memory spikes. Scheduling optimization -> predict job duration.
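Duration-prediction-based scheduling can be sketched as follows (a toy shortest-predicted-job-first policy; the cost model and names are assumptions, not PrefillOnly's scheduler):

```python
# Prefill-only jobs have no decode phase, so their duration is roughly
# predictable from the prompt length, enabling shortest-job-first
# scheduling to cut average waiting time.

def predicted_duration(prompt_tokens, per_token_cost=0.001):
    # assumed linear cost model: duration grows with prompt length
    return prompt_tokens * per_token_cost

def schedule(jobs):
    # jobs: list of (job_id, prompt_tokens); shortest predicted job first
    return sorted(jobs, key=lambda j: predicted_duration(j[1]))

order = schedule([("a", 4000), ("b", 500), ("c", 1500)])
print([job_id for job_id, _ in order])  # shortest predicted job first
```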
Pie: A Programmable Serving System for Emerging LLM Applications
There is no clean way to use APIs in serving pipelines. Pie delegates token control to the user level.
Pie accepts a program (called an inferlet) rather than a prompt and then streams the API calls.
One example is putting agent tools into the serving system, thus lowering latency by reducing prompt round trips between the serving system and the user app.
QA 1. other scenario like prefill only? granularity of API.
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
In attention calculation, keys are more impactful than values, so key precision > value precision.
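Differentiated precision can be sketched like this (the bit widths and the uniform quantizer are illustrative assumptions, not DiffKV's actual scheme):

```python
# Quantize keys with more bits than values, since keys influence
# attention scores more; the key reconstruction error stays smaller.

def quantize(xs, bits):
    levels = (1 << bits) - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [round((x - lo) / scale) for x in xs]
    return [lo + qi * scale for qi in q]   # dequantized approximation

keys = [0.11, -0.42, 0.73, 0.05]
vals = [0.21, -0.33, 0.57, 0.02]
k8 = quantize(keys, bits=8)    # keys: higher precision
v4 = quantize(vals, bits=4)    # values: lower precision

k_err = max(abs(a - b) for a, b in zip(keys, k8))
v_err = max(abs(a - b) for a, b in zip(vals, v4))
print(k_err < v_err)
```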
QA 1. memory order issue?
Jenga: Effective Memory Management for Serving LLM with Heterogeneity
4 kinds of heterogeneous attention: F S M E.
KNighter: Transforming Static Analysis with LLM-Synthesized Checkers
By Lingming Zhang from UIUC
2 basic elements: bug-fixing commits and static-analysis possibilities. KNighter's 3 stages: bug pattern analysis -> plan synthesis -> checker implementation.
KNighter's limitations: 1. value-related issues 2. locks. Possible solution: abstract complex bug reports into more formal state-machine representations. Evaluation against smatch, which cannot utilize semantics in the Linux kernel.
My QA: 1. Comparison to CodeQL? Since CodeQL has a simpler syntax structure, would there be less syntax noise? 2. One problem with research static-analysis tools, compared to other tools, is that they are largely unusable. Is it valuable to leverage LLMs to fix such tools and evolve from them? 3. Possibilities of generating patches with an LLM? 4. How would cross-component systems leverage similar approaches?
eBPF Misbehavior Detection: Fuzzing with a Specification-Based Oracle
3 verifier failure modes: 1. over-approximated state 2. incorrect checks 3. implementation errors
spec = semantics + security checks
WASIT: Deep and Continuous Differential Testing of WebAssembly System Interface Implementations
WASI is hard to implement: - spec interpretation failures - semantic mismatch between WASI and the OS - frequent SDK changes
Define a DSL to filter out false inputs.
Moirai: Optimizing Placement of Data and Compute in Hybrid Clouds
This work is funded by Uber. Cross-site data movement is costly. Uber uses a hybrid-cloud datalake, but 90% of Uber's computation involves data movement between the public and private clouds.
Solution: identify the data-job dependencies and schedule the data. - Pre-replication for scale (trade storage for latency)
Tai Chi: A General High-Efficiency Scheduling Framework for SmartNICs in Hyperscale Clouds
Cloud-based I/O still has internal latency. DP (data path) services and CP (control path) tasks are widely adopted.
DP and CP scheduling in SmartNICs has latency.
Quilt: Resource-aware Merging of Serverless Workflows
High invocation delay. A simple idea is to merge functions to reduce invocations. Quilt focuses on serverless functions that implement REST APIs. Quilt uses LLVM IR to convert RPCs into function calls. How many functions can be merged depends on the container size.
Evaluation: 1. DeathStarBench latency reduction. 2. Quilt suits many small functions.
QA 1. What if a function changes a lot? Mark it as not-to-merge. 2. Isolation between different functions? Basically you have to trust the developer, which aligns with today's developers' assumptions.
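The RPC-to-call conversion behind Quilt's merging can be illustrated as follows (Quilt does this at the LLVM IR level; this Python version is only a toy analogy with assumed names):

```python
# When caller and callee are merged into one container, the RPC stub can
# be replaced by a direct function call, eliminating invocation overhead
# while producing the same result.

def compose_post(text):                    # callee function
    return {"post": text.upper()}

def rpc(fn, *args):
    # stand-in for a network RPC: serialization + remote invocation
    return fn(*args)

def frontend_unmerged(text):
    return rpc(compose_post, text)         # cross-function RPC

def frontend_merged(text):
    return compose_post(text)              # merged: direct function call

print(frontend_unmerged("hi") == frontend_merged("hi"))
```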
Mantle: Efficient Hierarchical Metadata Management for Cloud Object Storage Services
Unlocking True Elasticity for the Cloud-Native Era with Dandelion
Separate computation and communication: computation functions are untrusted and communication functions are trusted. Remove the guest OS.
QA 1. How to access hardware? Pre-allocate. 2. A constrained model like WebAssembly? 3. Rewrite user functions? No.
Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market
This work is by Alibaba and Peking University
2024
Dirigent: Lightweight Serverless Orchestration
If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems
Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs
2023
One Simple API Can Cause Hundreds of Bugs: An Analysis of Refcounting Bugs in All Modern Linux Kernels
reference counting, bug, Linux kernel
Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management
Kubernetes, operation, system management, cloud, reliability, operation correctness, operator
state-reconciliation: An operation declares a desired system state and the operator automatically reconciles the system to that declared state.
acto models operations as state transitions and realizes state-transition sequences.
motivation study:
- Unit tests cannot validate operation correctness. Only 34% contain e2e tests.
- Existing e2e tests cover only 1.27 - 2.15% of supported properties.
- State-based assertions in existing e2e tests cover only 0.24–10.90% of managed systems’ state-object fields.
design:
acto supports a black-box mode and a white-box mode. Each takes two inputs: 1) a building manifest, 2) a specification of the state declaration. White-box mode additionally requires the source code for program analysis.
Black-box mode infers the relationship between the interface declaration and resources based on composite types (e.g., classes). However, such a design cannot infer primitive types like int, float, etc. White-box mode builds the relationship based on taint analysis.
After building the relationship between interface declarations and resource objects, acto implements a consistency oracle and a differential oracle to find false values and inconsistent items, respectively.
2021
Boki: Stateful Serverless Computing with Shared Logs
ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling
Exploiting Nil-Externality for Fast Replicated Storage
Syrup: User-Defined Scheduling Across the Stack
TwinVisor: Hardware-isolated Confidential Virtual Machines for ARM
Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware
Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale
Faster and Cheaper Serverless Computing on Harvested Resources
Harvest VM: defined by its minimum size and how many harvested physical cores it is capable of using. The workload running on the Harvest VM can query the number of physical cores assigned to it via /proc in Linux.
SoK of improving the serverless infrastructure (performance up & cost down):
- scheduling policies for making serverless platforms cost-effective and performant
- performance-aware and cost-effective storage
- secure and lightweight container infrastructure
- characterization of existing serverless workloads
- enabling applications to run in a serverless-native manner