
Other Papers

arXiv

Barbarians at the Gate: How AI is Upending Systems Research

This paper was presented by Ion Stoica at SOSP 2025. ADRS (AI-Driven Research for Systems) targets system performance research because:

  1. improvements are easy to verify
  2. correctness can be preserved
  3. modifications can be small and interpretable
  4. simulator-based verification is cheap and practical

This paper uses OpenEvolve (an open-source implementation of Google DeepMind's AlphaEvolve) as its running example of ADRS.

Limitations

Current ADRS still has limitations, characterized by three common failure patterns:

  1. runtime errors
  2. search failures
  3. algorithm failures

Best Practices

Best practices for the prompt generator (see the sketch after this list):

  1. Structured problem formulation, e.g. produced by external LLMs.
  2. A suitable baseline program. If the starting program is poor, ADRS can get stuck on trivial, low-value fixes that drain the budget.
  3. Solution hints, used sparingly: too much guidance may induce premature convergence.
  4. The right level of abstraction for access to external APIs, balancing micro-optimization against genuine algorithmic improvement.
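To make the loop concrete, here is my minimal sketch of the evaluator-driven evolutionary search that OpenEvolve-style ADRS systems run. The `llm.complete` interface, the population size, and the sampling scheme are my own assumptions, not the paper's or OpenEvolve's actual API:

```python
import random

def mutate(program: str, hints: str, llm) -> str:
    # Ask the LLM for a modified candidate program; `llm.complete`
    # is a hypothetical completion interface.
    prompt = f"Improve this program.\nHints: {hints}\n\n{program}"
    return llm.complete(prompt)

def adrs_search(baseline: str, evaluate, llm, hints: str = "", budget: int = 100):
    # Evolutionary search: keep a small population of scored programs,
    # mutate a sampled parent with the LLM, score the child on the cheap
    # simulator-based evaluator, and retain the best candidates.
    population = [(evaluate(baseline), baseline)]
    for _ in range(budget):
        _, parent = max(random.sample(population, min(3, len(population))))
        child = mutate(parent, hints, llm)
        try:
            score = evaluate(child)        # verifier: simulator/benchmark run
        except Exception:
            continue                       # runtime errors: discard candidate
        population.append((score, child))
        population = sorted(population, reverse=True)[:16]  # elitist truncation
    return max(population)                 # (best_score, best_program)
```

Note how the `except` branch maps to the first failure pattern above: candidates that crash the evaluator are simply discarded, which also burns budget.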

My question: how do we know why an AI-developed solution is performant? Think of a tricky performance failure in the Linux kernel network subsystem. (This is also mentioned among ADRS-suited problems in Section 7.1.)

ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System

Four requirements:

  1. support multiple CPs (challenge projects)
  2. fail-safe operation
  3. fully utilize the budget
  4. observability

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

Automated Static Vulnerability Detection via a Holistic Neuro-symbolic Approach

By Junfeng Yang; compared with KNighter at SOSP 2025.

Towards Efficient and Practical GPU Multitasking in the Era of LLM

code

Motivation: low GPU utilization wastes expensive hardware. Limitations of the state of the art:

  - lack of memory sharing
  - lack of GPU temporal sharing
  - no orchestration integration

GPUs lack mechanisms like the virtual memory of CPU operating systems. Although CUDA has since released a virtual memory management API, many AI applications still implement their own memory management.
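My toy illustration of why frameworks roll their own: a size-bucketed caching allocator in the style of PyTorch's, which hoards freed blocks instead of returning them to the driver. The `raw_alloc`/`raw_free` callables stand in for cudaMalloc/cudaFree; this is a simplification, not the actual PyTorch allocator:

```python
from collections import defaultdict

class CachingAllocator:
    # Sketch of the caching allocators ML frameworks layer over the driver:
    # freed blocks are kept in per-size free lists and reused, avoiding an
    # expensive driver call (and synchronization) on every allocation.

    def __init__(self, raw_alloc, raw_free):
        self.raw_alloc, self.raw_free = raw_alloc, raw_free
        self.free_blocks = defaultdict(list)  # rounded size -> cached blocks

    @staticmethod
    def _round(size: int) -> int:
        return (size + 511) & ~511            # round up to a 512-byte multiple

    def malloc(self, size: int):
        size = self._round(size)
        if self.free_blocks[size]:
            return self.free_blocks[size].pop()  # reuse: no driver call
        return self.raw_alloc(size)

    def free(self, ptr, size: int):
        self.free_blocks[self._round(size)].append(ptr)  # cache, never release
```

Because cached blocks are never handed back to the driver, memory hoarded by one process is unavailable to others, which is precisely the "lack of memory sharing" limitation above.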

MIG does support fault isolation; MPS compromises it because all client processes share a single CUDA context.

NVLink bandwidth must also be partitioned for compute partitioning to be effective.

Current limitations: side-channel attacks, ...

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

This paper by Ying Sheng reads like a solution to the problems raised in the paper above.

STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

Motivation: human-centric engineering practice is not reliable.

Formalize a specification that guarantees agents will not "make things worse". Multiple factors contribute to such regression: hallucination, incorrect understanding, ...
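One way I read the "not make things worse" guarantee is as a transactional guard around each agent action: snapshot, act, re-check system health, and roll back on regression. A minimal sketch under that assumption; the `env` interface (snapshot/restore/health/apply) is hypothetical, not STRATUS's actual formalism:

```python
def guarded_step(env, action) -> bool:
    # Execute an agent-proposed remediation transactionally: if system
    # health after the action is worse than before, undo it. `env` is a
    # hypothetical harness; health() might be the fraction of passing probes.
    before = env.health()
    checkpoint = env.snapshot()
    env.apply(action)
    if env.health() < before:   # the action made things worse
        env.restore(checkpoint) # roll back: no-regression guarantee
        return False
    return True
```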

CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context

Motivation: single-file completion misses details that live in other files.

Simply concatenating all repository code into the prompt is not feasible, because:

  1. the relevant contextual information differs from case to case.
  2. private code is not available at pre-training time.
  3. context length is limited.

CoCoMIC leverages static analysis to extract cross-file context. The authors also built a benchmark on top of PyPI packages. A toy version of the retrieval step is sketched below.
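As a rough illustration of static cross-file retrieval (not CoCoMIC's actual pipeline, which builds a richer project-level context graph), this sketch parses a file's imports with Python's `ast` module and pulls one-line summaries of the imported definitions from sibling modules:

```python
import ast
from pathlib import Path

def cross_file_context(repo_root: str, current_file: str) -> list[str]:
    # Collect `from X import Y` statements in the current file, then find
    # each imported name's definition in the repo and keep a short summary.
    tree = ast.parse(Path(current_file).read_text())
    wanted = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            wanted.update((node.module, alias.name) for alias in node.names)

    context = []
    for module, name in sorted(wanted):
        path = Path(repo_root) / (module.replace(".", "/") + ".py")
        if not path.exists():
            continue  # stdlib or third-party import: skip
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)) and node.name == name:
                doc = ast.get_docstring(node) or ""
                first = doc.splitlines()[0] if doc else ""
                context.append(f"# {module}.{name}: {first}")
    return context
```

The returned summary lines can then be prepended to the completion prompt, giving the model cross-file knowledge without concatenating whole files.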

Why Do Multi-Agent LLM Systems Fail?

HotOS

Towards Modern Development of Cloud Applications

Thesis Papers

Strengthening memory safety in Rust: exploring CHERI capabilities for a safe language

TODO

Preserving Memory Safety in Safe Rust during Interactions with Unsafe Languages

TODO

Security analysis of hardware-OS interfaces in Linux

A Zero Kernel Operating System: Rethinking Microkernel Design by Leveraging Tagged Architectures and Memory-Safe Languages

Others

HotBPF - An On-demand and On-the-fly Memory Protection for the Linux Kernel

Key idea: relocate the vulnerable object into a separate virtual memory region (the vmalloc region, which is not physically contiguous).

Challenges

  1. identify the vulnerable object among thousands of kernel objects
  2. isolate the potential corruption without recompiling and rebooting the whole system

Understanding and Detecting Cloud-nativeness Vulnerabilities in Distributed Systems

Distributed systems != cloud-native systems.

A distributed system has a cloud-nativeness vulnerability if, under the same configuration, input, and execution sequence, the system fails in the cloud environment but not in the bare-metal server environment.

A failure caused by a cloud-nativeness vulnerability is a cloud-nativeness failure.
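The definition directly suggests a differential-testing oracle: replay the same configuration, input, and execution sequence in both environments and flag failures that appear only in the cloud. A minimal sketch, assuming a hypothetical `run(env, ...)` harness that returns whether the run succeeded:

```python
def is_cloud_nativeness_failure(run, config, inputs, schedule) -> bool:
    # Differential oracle derived from the paper's definition: the same
    # (config, input, execution sequence) is replayed on bare metal and
    # in the cloud; only a cloud-only failure counts as cloud-nativeness.
    ok_bare_metal = run("bare-metal", config, inputs, schedule)
    ok_cloud = run("cloud", config, inputs, schedule)
    return ok_bare_metal and not ok_cloud
```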

JIRA issue format.