Other Papers
arXiv
Barbarians at the Gate: How AI is Upending Systems Research
This paper was presented by Ion Stoica at SOSP 2025. ADRS (AI-Driven Research for Systems) targets system performance research because:
1. improvements are easy to verify
2. correctness can be preserved
3. modifications can be small and interpretable
4. simulator-based verification is cheap and practical
This paper uses OpenEvolve (an open-source implementation of Google DeepMind's AlphaEvolve) as its example of ADRS.
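To make the generate-evaluate-select idea concrete for myself, here is a minimal sketch of an evolutionary search loop scored by a cheap simulator. It is not OpenEvolve's API: `simulate`, `mutate`, `evolve`, the server capacities, and the load-balancing toy problem are all assumptions made up for illustration, and a random perturbation stands in for the LLM proposing a new candidate.

```python
import random

def simulate(weights):
    """Toy simulator (hypothetical): score a load-balancing policy.

    100 requests are split across three servers in proportion to `weights`.
    The score is the negative gap between the most and least loaded server,
    so higher is better.
    """
    capacities = [4.0, 2.0, 1.0]  # assumed server capacities
    total = sum(weights)
    loads = [100 * w / total / c for w, c in zip(weights, capacities)]
    return -(max(loads) - min(loads))

def mutate(weights, rng):
    """Stand-in for the LLM step: perturb one weight of the parent."""
    child = list(weights)
    i = rng.randrange(len(child))
    child[i] = max(0.01, child[i] + rng.uniform(-0.5, 0.5))
    return child

def evolve(baseline, generations=200, population_size=8, seed=0):
    """Keep a small population of scored candidates and refine it."""
    rng = random.Random(seed)
    population = [(simulate(baseline), baseline)]
    for _ in range(generations):
        _, parent = rng.choice(population)
        child = mutate(parent, rng)
        population.append((simulate(child), child))
        # Keep only the best-scoring candidates.
        population.sort(key=lambda p: p[0], reverse=True)
        del population[population_size:]
    return population[0]

best_score, best_weights = evolve([1.0, 1.0, 1.0])
```

Because the baseline is only evicted by better candidates, the returned score can never be worse than the baseline's, which is the "easy to verify the improvement" property in miniature.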
Limitations
Current ADRS still has limitations, characterized by three common failure patterns:
1. runtime errors
2. search failures
3. algorithm failures
Best Practices
Prompt generator:
1. Structured problem formulation by external LLMs.
2. A suitable baseline program. If the original program is imperfect, ADRS can get stuck on trivial, less meaningful fixes, which may drain the budget.
3. Solution hints. Too much guidance may induce premature convergence.
4. Level of abstraction, in terms of access to external APIs. Keep a balance between micro-optimization and genuine algorithmic improvement.
My question: how do we know why an AI-developed solution is performant? Consider, for example, a tricky performance failure in the Linux kernel network subsystem. (This is also mentioned among ADRS's suited problems in 7.1.)
ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System
Four requirements:
1. multiple CPs (challenge projects)
2. fail-safe
3. fully utilize the budget
4. observability
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
Automated Static Vulnerability Detection via a Holistic Neuro-symbolic Approach
By Junfeng Yang; KNighter (SOSP 2025) compares against it.
Towards Efficient and Practical GPU Multitasking in the Era of LLM
Motivation: low GPU utilization is inefficient.
SOTA limitations:
- lack of memory sharing
- lack of GPU temporal sharing
- no orchestration integration
GPUs do not have mechanisms like the virtual memory provided by a CPU operating system. Although CUDA has released virtual memory management APIs, many AI applications still implement their own memory management.
MIG does support fault isolation. MPS will compromise fault isolation because the processes share a single CUDA context.
NVLink bandwidth should also be partitioned to support compute partitioning.
Current limitations: side-channel attacks,
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
This paper reads like the solution to the above paper by Ying Sheng.
STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds
Motivation: human-centric engineering practice is not reliable.
The paper formalizes a specification that guarantees agents will not "make things worse". Multiple factors contribute to such degradation: hallucination, incorrect understanding, ...
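A rough way to picture the "do not make things worse" guarantee is as a guard around every agent action: measure health, apply the action, and roll back if health dropped. The sketch below is my own toy version, not the paper's mechanism. `apply_with_no_regression`, `healthy_replicas`, the two fixes, and the in-memory `cluster` dict are all hypothetical, and a deep copy stands in for whatever undo mechanism a real cloud system would need.

```python
import copy

def apply_with_no_regression(state, action, health):
    """Apply `action` to `state`; undo it if `health` gets worse.

    Assumptions: `state` is a dict that can be snapshotted with deepcopy,
    and `health` returns a number where higher means healthier.
    """
    before = health(state)
    snapshot = copy.deepcopy(state)
    action(state)
    if health(state) < before:
        # Regression detected: restore the snapshot in place.
        state.clear()
        state.update(snapshot)
        return False
    return True

def healthy_replicas(state):
    """Toy health metric: number of running replicas."""
    return sum(1 for s in state["replicas"].values() if s == "running")

def bad_fix(state):
    # For example, a hallucinated command that kills a healthy replica.
    state["replicas"]["web-1"] = "crashed"

def good_fix(state):
    state["replicas"]["web-2"] = "running"

cluster = {"replicas": {"web-1": "running", "web-2": "crashed"}}
kept_bad = apply_with_no_regression(cluster, bad_fix, healthy_replicas)
kept_good = apply_with_no_regression(cluster, good_fix, healthy_replicas)
```

Here the harmful action is rolled back and the helpful one is kept, so the cluster ends with both replicas running. The hard part in practice is that many real actions are not cleanly reversible, which a snapshot of a dict glosses over.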
CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context
Motivation: single-file completion misses details.
Simply concatenating code is not feasible, because:
- contextual information could be different.
- private context is not available at the pre-training stage.
- context length is limited.
CoCoMIC leverages static analysis to extract cross-file context. The authors also built a benchmark based on PyPI.
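As a small illustration of extracting cross-file context with static analysis, the sketch below uses Python's standard `ast` module to find what the current file imports and pulls the matching signatures from other files. This is not CoCoMIC's actual tool. `imported_names`, `signatures`, `cross_file_context`, and the in-memory `repo` dict are hypothetical names, and the sketch only handles `from module import name` at a very shallow level.

```python
import ast

def imported_names(source):
    """Return {module: [names]} for `from module import name` statements."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            result.setdefault(node.module, []).extend(a.name for a in node.names)
    return result

def signatures(source, wanted):
    """Return one-line signatures of the wanted top-level definitions."""
    found = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name in wanted:
            args = ", ".join(a.arg for a in node.args.args)
            found.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef) and node.name in wanted:
            found.append(f"class {node.name}")
    return found

def cross_file_context(current_source, repo):
    """Collect signatures from other files that the current file imports.

    `repo` maps module name -> source text (a stand-in for a real repository).
    """
    context = []
    for module, names in imported_names(current_source).items():
        if module in repo:
            context.extend(signatures(repo[module], set(names)))
    return context

repo = {
    "utils": "def load(path, mode):\n    return open(path, mode)\n\nclass Cache:\n    pass\n",
}
current = "from utils import load, Cache\n\ndef main():\n    pass\n"
context = cross_file_context(current, repo)
```

The resulting `context` holds compact signatures rather than whole files, which is one way to work within the limited context length noted above.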
Why Do Multi-Agent LLM Systems Fail?
HotOS
Towards Modern Development of Cloud Applications
Thesis Papers
Strengthening memory safety in Rust: exploring CHERI capabilities for a safe language
TODO
Preserving Memory Safety in Safe Rust during Interactions with Unsafe Languages
TODO
Security analysis of hardware-OS interfaces in Linux
Others
HotBPF - An On-demand and On-the-fly Memory Protection for the Linux Kernel
Key idea: isolate the vulnerable object in a virtual memory region (the vmalloc region is not physically contiguous).
Challenges
- identify the vulnerable object from thousands of kernel objects
- separate potential corruption without recompiling and rebooting the whole system
Understanding and Detecting Cloud-nativeness Vulnerabilities in Distributed Systems
Distributed systems != cloud native
A distributed system has a cloud-nativeness vulnerability if, under the same configuration, input and execution sequence, the system fails in the cloud environment, but not in the bare-metal server environment
A failure caused by cloud-nativeness vulnerability is a cloud-nativeness failure.
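The definition above is essentially a differential test: hold configuration, input, and execution sequence fixed, and vary only the environment. A minimal sketch of that predicate follows. `is_cloud_nativeness_failure`, `toy_run`, the `peer_by_ip` option, and the "IP changes on restart" behavior are hypothetical and chosen only to give the predicate something to evaluate.

```python
def is_cloud_nativeness_failure(run, config, inputs, schedule):
    """True if the same config/input/schedule fails only in the cloud.

    `run(environment, config, inputs, schedule)` is an assumed test harness
    that returns "ok" or "fail".
    """
    bare = run("bare-metal", config, inputs, schedule)
    cloud = run("cloud", config, inputs, schedule)
    return bare == "ok" and cloud == "fail"

def toy_run(environment, config, inputs, schedule):
    """Toy system under test (hypothetical behavior).

    In the cloud, a restarted node comes back with a new IP, so a system
    that identifies peers by IP address fails after a restart.
    """
    restarted = "restart" in schedule
    if environment == "cloud" and restarted and config.get("peer_by_ip"):
        return "fail"
    return "ok"

schedule = ["start", "restart"]
by_ip = is_cloud_nativeness_failure(toy_run, {"peer_by_ip": True}, [], schedule)
by_name = is_cloud_nativeness_failure(toy_run, {"peer_by_ip": False}, [], schedule)
```

Under these assumptions, only the configuration that identifies peers by IP counts as a cloud-nativeness failure.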
JIRA issue format.