ATC

2023

MELF: Multivariant Executables for a Heterogeneous World

Analysis and Optimization of Network I/O Tax in Confidential Virtual Machines

zpoline: a system call hook mechanism based on binary rewriting

EPF: Evil Packet Filter

SAGE: Software-based Attestation for GPU Execution

The Hitchhiker's Guide to Operating Systems

2022

RunD: A Lightweight Secure Container Runtime for High-density Deployment and High-concurrency Startup in Serverless Computing

Existing secure containers wrap the container in a microVM for isolation (the microVM provides isolation, the container provides abstraction), but this design cannot meet the high-density and high-concurrency requirements of FaaS scenarios. This paper therefore builds RunD, a lightweight secure container runtime.

Contributions:

  • Identify the bottlenecks to high density and high concurrency
  • A guest-to-host solution addressing them
  • RunD

Problems with existing solutions:

  • The time cost of creating the rootfs and cgroups prevents high-concurrency startup
  • The high memory footprint of the rootfs and guest kernel prevents high-density deployment

Investigations:

  • virtio-blk performs best at random/sequential writes, but suffers from duplicated page caches in host and guest
  • virtio-fs resolves the duplicated page cache problem, yet is slow at writing
  • self-modifying code reduces the shareable memory when using a microVM template
  • mutex locks around cgroup operations create a critical section, causing host-side overhead
  • user-provided code/data is read-only for the operating system, and system-provided runtime files are likewise read-only for user functions

Therefore, RunD uses virtio-fs for the read-only part of the rootfs, sharing the page cache between the host and guests, and virtio-blk for the writable part for high I/O performance (sketched after the design list). A further mechanism is still needed to deduplicate the writable part of the rootfs.

Design:

  • reduce the guest kernel size and pre-patch the microVM template to eliminate self-modification
  • split the rootfs into a read-only layer and a writable layer
  • implement lightweight cgroups and maintain a cgroup pool
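
A minimal sketch of the split-rootfs idea from inside the guest, assuming an overlayfs-style combination; the tags, paths, and mount options below are illustrative, not RunD's actual ones:

```c
/* Hypothetical sketch: mount the shared read-only layer over
 * virtio-fs, the per-VM writable layer over virtio-blk, and
 * combine them with overlayfs. */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* read-only layer: page cache shared with the host (DAX) */
    if (mount("rootfs_ro", "/ro", "virtiofs", MS_RDONLY, "dax") < 0)
        perror("mount virtiofs");

    /* writable layer: per-VM block device for fast writes */
    if (mount("/dev/vda", "/rw", "ext4", 0, NULL) < 0)
        perror("mount virtio-blk");

    /* overlay: reads hit the shared read-only layer,
     * writes are redirected to the block-backed upper layer */
    if (mount("overlay", "/newroot", "overlay", 0,
              "lowerdir=/ro,upperdir=/rw/upper,workdir=/rw/work") < 0)
        perror("mount overlay");
    return 0;
}
```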

Faster Software Packet Processing on FPGA NICs with eBPF Program Warping

TODO

Real-world eBPF applications: Katran, Suricata.

hXDP is the state-of-the-art eBPF processor for FPGAs.

Help Rather Than Recycle: Alleviating Cold Startup in Serverless Computing Through Inter-Function Container Sharing

In serverless computing, a container's cold-startup latency is more than 10x that of a warm startup, and consists of three time-consuming steps:

  • create container from the image
  • initialize software environment
  • initialize application-specific code

The key idea is to "fork or lend" a runtime checkpoint from one function's idle container to eliminate another function's cold startup.

Through privilege control in the operating system, a function that uses a helper container cannot obtain any code, data, or package information of other functions.

RRC: Responsive Replicated Containers

TODO

KRCORE: A Microsecond-scale RDMA Control Plane for Elastic Computing

elastic application, RDMA

Zero-Change Object Transmission for Distributed Big Data Analytics

OSD (object serialization and deserialization) is the main cost of object transmission.

Privbox: Faster System Calls Through Sandboxed Privileged Execution

Based on NaCl (Google Native Client).

System calls today carry overhead introduced by mitigations for transient-execution attacks such as Meltdown and Spectre:

  • Flushing the CPU's indirect branch predictor state to block Spectre v2.
  • Linux's page table isolation (PTI), which switches page tables on system call entry.

Existing mechanisms for reducing system call overhead (an io_uring sketch follows this list):

  • Asynchronous: FlexSC makes system calls asynchronous, but requires re-developing applications.
  • Batching: preadv and Cassyopia perform the work of multiple system calls with a single kernel entry/exit.
  • Asynchronous + batching: io_uring
  • Kernel bypassing: DPDK, SPDK
  • Transient-execution mitigations: Ward
  • eBPF for storage
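
To make the "asynchronous + batching" point concrete, here is a minimal liburing sketch: eight reads are queued in user space and submitted with a single kernel entry, so the per-syscall mitigation cost is paid once instead of eight times. The file path is arbitrary.

```c
/* Batched, asynchronous reads via io_uring (liburing). */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    static char bufs[8][4096];

    io_uring_queue_init(8, &ring, 0);
    int fd = open("/etc/hostname", O_RDONLY);

    /* queue 8 read requests without entering the kernel */
    for (int i = 0; i < 8; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], sizeof(bufs[i]), 0);
    }
    io_uring_submit(&ring); /* one system call submits all 8 */

    /* reap completions */
    for (int i = 0; i < 8; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```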

BBQ: A Block-based Bounded Queue for Exchanging Data and Profiling

TODO

Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory

TODO

Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA

Two major problems in cloud-native monitoring:

  1. traditional monitoring systems occupy host resources to collect, process, and upload metrics, causing resource contention
  2. the quality of service (QoS) of the monitoring itself degrades, so it fails to support massive metrics with rapid variations

CRISP: Critical Path Analysis of Large-Scale Microservice Architectures

TODO

Hardening Hypervisors with Ombro

TODO

HyperEnclave: An Open and Cross-platform Trusted Execution Environment

TODO

KSG: Augmenting Kernel Fuzzing with System Call Specification Generation

2019

LXDs: Towards Isolation of Kernel Subsystems

LXDs (Lightweight Execution Domains) provide a lightweight isolation scheme that can isolate existing kernel subsystems with minimal modification.

Design:

  • LXD domains
  • Transparent decomposition
  • Asynchronous runtime
  • Cross-core IPC

Extension Framework for File Systems in User space

TODO

Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries

TODO

2018

The Design and Implementation of Hyperupcalls

Paravirtualization makes the guest aware of the hypervisor so the two can cooperate, but the cooperation itself costs performance.

Paravirtualization drawbacks:

  • they introduce context switches between the hypervisor and guests
  • the requestor of a paravirtual mechanism must wait for it to be serviced in another context, which may be busy or may require waking an idle guest
  • paravirtual mechanisms couple the designs of the hypervisor and the guest: each mechanism must be implemented for every guest/hypervisor pair

Hyperupcalls are built with the guest OS codebase and share the same code, thereby simplifying maintenance while providing the OS with an expressive mechanism to describe its state to underlying hypervisors.

Hyperupcall registration consists of compiling C code, which may reference guest data structures, into verifiable bytecode. The guest registers the generated bytecode with the hypervisor, which verifies its safety, compiles it into native code, and installs it in the VM's hyperupcall table. When the hypervisor encounters an event, such as memory pressure, it executes the corresponding hyperupcall, which can access and update the guest's data structures.
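
A hypothetical sketch of what such guest-supplied code might look like; the struct, flag, and function names below are illustrative, not the paper's actual API (the paper compiles guest C code to eBPF bytecode):

```c
/* Illustrative hyperupcall body: built from the guest OS codebase,
 * so it can interpret guest data structures; invoked by the
 * hypervisor (e.g., under memory pressure) without a context
 * switch into the guest. All names here are hypothetical. */
#include <stdbool.h>
#include <stdint.h>

struct guest_page_info {
    uint64_t pfn;   /* guest page frame number */
    uint32_t flags; /* guest page flags, accessed via verified loads */
};

#define GUEST_PAGE_FREE 0x1u /* hypothetical flag bit */

/* "Is this guest page free?" -- lets the hypervisor reclaim free
 * guest memory without paravirtual round-trips. */
bool hyperupcall_is_page_free(const struct guest_page_info *pi)
{
    return (pi->flags & GUEST_PAGE_FREE) != 0;
}
```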

KylinX: A Dynamic Library Operating System for Simplified and Efficient Cloud Virtualization

TODO

2017

Optimizing the Design and Implementation of the Linux ARM Hypervisor

Hypervisors using the ARM Virtualization Extensions: Xen (Type 1) and KVM (Type 2).

The cost of transitioning from a VM to the hypervisor can be many times higher on ARM than on x86. The main reason is that KVM/ARM runs split across exception levels, so every hypercall is redirected (guest kernel -> EL2 lowvisor -> EL1 KVM highvisor).

To remove this overhead, the Virtualization Host Extensions (VHE) allow an unmodified Linux/KVM to run entirely in EL2, eliminating the hypercall redirection.
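
For context, a guest hypercall on AArch64 traps to EL2 via the hvc instruction; under split-mode KVM the EL2 lowvisor must then world-switch to the EL1 highvisor, whereas under VHE KVM handles the trap directly at EL2. A minimal illustrative sketch (AArch64, GCC/Clang inline assembly; the function-ID convention is hypothetical):

```c
#include <stdint.h>

/* Issue a hypercall: hvc traps from the guest kernel to EL2.
 * Under split-mode KVM this is where the costly redirection to
 * the EL1 highvisor begins; under VHE it ends at EL2. */
static inline uint64_t hvc_call(uint64_t function_id)
{
    register uint64_t x0 asm("x0") = function_id;
    asm volatile("hvc #0" : "+r"(x0) : : "memory");
    return x0; /* result returned in x0, SMCCC-style */
}
```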

2013

Practical and Effective Sandboxing for Non-root Users

TODO