ATC
2023
MELF: Multivariant Executables for a Heterogeneous World
Analysis and Optimization of Network I/O Tax in Confidential Virtual Machines
zpoline: a system call hook mechanism based on binary rewriting
SAGE: Software-based Attestation for GPU Execution
The Hitchhiker's Guide to Operating Systems
2022
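Existing secure containers wrap the container in a microVM for isolation (the microVM is for isolation, and the container is for abstraction). This design cannot meet the high-density and high-concurrency requirements of FaaS. The paper therefore implements a lightweight container runtime, RunD, with the following results:
Contributions:
- Identify the bottlenecks of high density and high concurrency
- A guest-to-host solution
- RunD
Problems with existing solutions:
- The time overhead of creating the rootfs and cgroups prevents high concurrency
- The high memory footprint of the rootfs and guest kernel prevents dense deployment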
Investigations:
- virtio-blk performs best at random/sequential writing, but has duplicated page cache problem
- virtio-fs resolves duplicated page cache problem, yet is slow in writing
- self-modifying code reduces the shareable memory when using a microVM template
- mutex locks introduce critical sections, causing host-side cgroup overhead
- user-provided code/data is read-only for the operating system, and the system-provided runtime files are also read-only for user functions
Therefore, virtio-fs should back the read-only part of the rootfs so that the page cache is shared between host and guests, and virtio-blk should back the writable part of the rootfs for high I/O performance. A solution is also required to further reduce the duplicated writable part of the rootfs.
Design:
- reduce guest kernel size and pre-patch microVM template to eliminate self-modification
- split the rootfs into a read-only layer and a writable layer
- re-implement a lightweight cgroup and cgroup pool
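The cgroup pool idea in the last design item can be sketched as below. This is a toy model, not RunD's implementation: the `CgroupPool` class, the cgroup names, and the on-demand fallback are all assumptions made for illustration. The point is only that creation cost is paid once up front, and a container start merely binds an idle cgroup.

```python
from collections import deque


class CgroupPool:
    """Toy model of a pre-created cgroup pool (all names are hypothetical)."""

    def __init__(self, size):
        # Pay the creation cost once, up front, instead of on every container start.
        self._idle = deque(f"pool-cg-{i}" for i in range(size))
        self._in_use = {}

    def acquire(self, container_id):
        """Bind an idle cgroup to a container; fall back to on-demand creation."""
        if self._idle:
            cgroup = self._idle.popleft()
        else:
            # Slow path: the pool is exhausted, so a new cgroup must be created.
            cgroup = f"ondemand-cg-{container_id}"
        self._in_use[container_id] = cgroup
        return cgroup

    def release(self, container_id):
        """Return the container's cgroup to the pool for reuse."""
        cgroup = self._in_use.pop(container_id)
        self._idle.append(cgroup)
        return cgroup

    def idle_count(self):
        return len(self._idle)


pool = CgroupPool(size=2)
first = pool.acquire("fn-a")
second = pool.acquire("fn-b")
third = pool.acquire("fn-c")   # pool exhausted, falls back to on-demand creation
pool.release("fn-a")
reused = pool.acquire("fn-d")  # reuses the cgroup released by fn-a
```

In a real runtime the slow path would involve the host-side cgroup operations that the investigation identifies as a bottleneck; the pool keeps them off the startup path as long as idle entries remain.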
Faster Software Packet Processing on FPGA NICs with eBPF Program Warping
TODO
Real-world eBPF applications: Katran, Suricata.
hXDP is the state-of-the-art eBPF processor for FPGAs.
In serverless computing, the latency of a cold container startup is more than 10x that of a warm startup. A cold startup includes three time-consuming steps:
- create container from the image
- initialize software environment
- initialize application-specific code
The key idea is to "fork or lend" a runtime checkpoint from an idle container to eliminate another container's cold startup.
By using privilege control in the operating system, a function that uses a helper container cannot obtain any code, data, or package information of other functions.
RRC: Responsive Replicated Containers
TODO
KRCORE: A Microsecond-scale RDMA Control Plane for Elastic Computing
elastic application, RDMA
Zero-Change Object Transmission for Distributed Big Data Analytics
OSD (object serialization and deserialization) adds cost to object transmission.
Privbox: Faster System Calls Through Sandboxed Privileged Execution
Based on NaCl (Google Native Client)
Current system calls carry overhead introduced by mitigations for transient execution attacks (Meltdown, Spectre):
- Flush CPU indirect branch predictor's state to block Spectre v2.
- Linux's page table isolation (PTI) switches page tables during system call entry.
Current mechanisms for reducing system call overhead:
- Asynchronous: FlexSC makes system calls asynchronous, but requires applications to be re-developed.
- Batching: preadv and Cassyopia perform the work of multiple system calls with one kernel entry/exit.
- Asynchronous + batching: io_uring
- Kernel bypassing: DPDK, SPDK
- Transient execution mitigations: Ward
- eBPF for storage:
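As a small illustration of the batching entry in the list above, the sketch below uses Python's `os.preadv` wrapper (available on Linux) to fill two buffers with a single vectored read. The file contents and buffer sizes are made up for the example; it shows the general technique, not anything specific to Privbox.

```python
import os
import tempfile

# Write a small file whose two parts we want in two separate buffers.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"headerPAYLOAD")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
try:
    header = bytearray(6)
    payload = bytearray(7)
    # One kernel entry/exit fills both buffers, starting at file offset 0.
    # Without vectored I/O this would take two separate pread() calls.
    total = os.preadv(fd, [header, payload], 0)
finally:
    os.close(fd)
    os.unlink(path)
```

The saving grows with the per-call cost: the more expensive each kernel entry/exit becomes because of the mitigations listed earlier, the more a single batched call helps.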
BBQ: A Block-based Bounded Queue for Exchanging Data and Profiling
TODO
Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory
TODO
Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA
Two major problems in cloud-native monitoring:
- traditional monitoring systems occupy host resources to collect, process, and upload metrics, causing resource contention
- the quality of service (QoS) of monitoring itself degrades, so it fails to support massive metrics with rapid variations
CRISP: Critical Path Analysis of Large-Scale Microservice Architectures
TODO
Hardening Hypervisors with Ombro
TODO
HyperEnclave: An Open and Cross-platform Trusted Execution Environment
TODO
KSG: Augmenting Kernel Fuzzing with System Call Specification Generation
2019
LXDs: Towards Isolation of Kernel Subsystems
LXDs (Lightweight Execution Domains) provide a lightweight isolation scheme that can isolate existing kernel subsystems with minimal modifications.
Design:
- LXD domains
- Transparent decomposition
- Asynchronous runtime
- Cross-core IPC
Extension Framework for File Systems in User space
TODO
Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries
TODO
2018
The Design and Implementation of Hyperupcalls
Paravirtualization: makes the guest aware of the hypervisor, yet harms the performance.
Paravirtualization drawbacks:
- it introduces context switches between hypervisors and guests
- the requestor of a paravirtual mechanism must wait for it to be serviced in another context, which may be busy, or must wake the guest if it is idle
- paravirtual mechanisms couple the design of the hypervisor and guest: each mechanism needs to be implemented for every guest and hypervisor
Hyperupcalls are built with the guest OS codebase and share the same code, thereby simplifying maintenance while providing the OS with an expressive mechanism to describe its state to underlying hypervisors.
Hyperupcall registration consists of compiling C code, which may reference guest data structures, into verifiable bytecode. The guest registers the generated bytecode with the hypervisor, which verifies its safety, compiles it into native code, and sets it in the VM hyperupcall table. When the hypervisor encounters an event, such as memory pressure, it executes the respective hyperupcall, which can access and update data structures of the guest.
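The register, verify, compile, and invoke flow described above can be modeled with a toy interpreter. Everything in this sketch is an assumption for illustration: the four-opcode "bytecode", the `verify` rules, the `Hypervisor` class, and the event name are hypothetical, and a Python closure stands in for native compilation. The real system uses verifiable bytecode compiled from C.

```python
# All names below (opcodes, event names, Hypervisor class) are hypothetical.
ALLOWED_OPS = {"load", "add", "store", "ret"}
MAX_INSTRUCTIONS = 16


def verify(bytecode):
    """Accept only short, straight-line programs that end in 'ret'."""
    if not bytecode or len(bytecode) > MAX_INSTRUCTIONS:
        return False
    if any(op not in ALLOWED_OPS for op, _ in bytecode):
        return False
    return bytecode[-1][0] == "ret"


def compile_to_native(bytecode):
    """Stand-in for JIT compilation: build a Python closure over the bytecode."""
    def run(guest_state):
        acc = 0
        for op, arg in bytecode:
            if op == "load":
                acc = guest_state[arg]
            elif op == "add":
                acc += arg
            elif op == "store":
                guest_state[arg] = acc
            elif op == "ret":
                return acc
    return run


class Hypervisor:
    def __init__(self):
        self.hyperupcall_table = {}

    def register(self, event, bytecode):
        """Verify guest bytecode, compile it, and install it for an event."""
        if not verify(bytecode):
            raise ValueError("bytecode rejected by verifier")
        self.hyperupcall_table[event] = compile_to_native(bytecode)

    def on_event(self, event, guest_state):
        """Run the hyperupcall in the hypervisor's context, with no guest switch."""
        handler = self.hyperupcall_table.get(event)
        return handler(guest_state) if handler else None


hv = Hypervisor()
guest_state = {"free_pages": 10, "reclaim_hint": 0}
hv.register("memory_pressure", [
    ("load", "free_pages"),
    ("add", -2),
    ("store", "reclaim_hint"),
    ("ret", None),
])
result = hv.on_event("memory_pressure", guest_state)
```

The handler both reads and updates guest state without the guest being scheduled, which is the property that distinguishes this approach from the paravirtual mechanisms listed earlier.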
KylinX: A Dynamic Library Operating System for Simplified and Efficient Cloud Virtualization
TODO
2017
Optimizing the Design and Implementation of the Linux ARM Hypervisor
Hypervisors using the ARM Virtualization Extensions: Xen (Type 1), KVM (Type 2).
The cost of transitioning from a VM to the hypervisor can be many times worse on ARM than x86. The main reason is that ARM VE requires hypercall redirection (guest kernel -> EL2 lowvisor -> EL1 KVM).
To reduce this overhead, the Virtualization Host Extensions (VHE) were introduced so that an unmodified Linux/KVM can run in EL2, avoiding hypercall redirection.
2013
Practical and Effective Sandboxing for Non-root Users
TODO