ATC

2023

MELF: Multivariant Executables for a Heterogeneous World

Analysis and Optimization of Network I/O Tax in Confidential Virtual Machines

zpoline: a system call hook mechanism based on binary rewriting

EPF: Evil Packet Filter

SAGE: Software-based Attestation for GPU Execution

The Hitchhiker's Guide to Operating Systems

2022

RunD: A Lightweight Secure Container Runtime for High-density Deployment and High-concurrency Startup in Serverless Computing

Existing secure containers use a microVM as an outer isolation layer (the microVM provides isolation, while the container provides abstraction). However, this design cannot meet the high-density and high-concurrency requirements of FaaS scenarios. This paper therefore implements RunD, a lightweight secure container runtime.

Contributions:

  • identify the bottlenecks of high-density and high-concurrency deployment
  • Guest-to-host design
  • RunD

Problems with existing solutions:

  • The time overhead of creating rootfs and cgroups prevents high parallelism.
  • The high memory footprint of rootfs and guest kernels prevents dense deployment.

Investigations:

  • virtio-blk performs best at random/sequential writes, but suffers from the duplicated page cache problem
  • virtio-fs resolves the duplicated page cache problem, yet is slow at writes
  • self-modifying code reduces the shareable memory when using a microVM template
  • mutex locks introduce critical sections, which cause host-side overhead in cgroup operations
  • user-provided code/data is read-only for the operating system, and the system-provided runtime files are also read-only for user functions

Therefore, RunD uses virtio-fs for the read-only part of the rootfs, so that the page cache is shared between host and guests, and virtio-blk for the writable part of the rootfs, for high I/O performance. A solution is also required to further reduce the duplicated writable part of the rootfs.
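The split described above can be sketched as a simple placement rule. This is only an illustration of the idea in these notes, not RunD's code; the `RootfsLayer` type and the layer names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootfsLayer:
    name: str       # hypothetical layer name, for illustration only
    writable: bool  # True if the guest needs to write to this layer


def pick_backend(layer: RootfsLayer) -> str:
    """Read-only layers go to virtio-fs (shared page cache),
    writable layers go to virtio-blk (faster writes)."""
    return "virtio-blk" if layer.writable else "virtio-fs"


layers = [
    RootfsLayer("runtime-files", writable=False),
    RootfsLayer("user-code", writable=False),
    RootfsLayer("scratch", writable=True),
]
plan = {layer.name: pick_backend(layer) for layer in layers}
```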

Design:

  • reduce guest kernel size and pre-patch microVM template to eliminate self-modification
  • split the rootfs into a read-only layer and a writable layer
  • re-implement a lightweight cgroup and cgroup pool
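
One plausible reading of the cgroup pool item is that cgroups are created ahead of time, so container startup only takes one from a free list instead of paying the creation cost under lock contention. The sketch below is a toy model under that assumption: `CgroupPool` and the `pool-cgroup-N` names are hypothetical, and no real cgroup filesystem is touched.

```python
from collections import deque


class CgroupPool:
    """Toy pool: entries are prepared up front and handed out in O(1)."""

    def __init__(self, size: int):
        # In a real runtime each entry would be a pre-created cgroup;
        # here it is only a placeholder name string.
        self._free = deque(f"pool-cgroup-{i}" for i in range(size))
        self._in_use = {}

    def acquire(self, container_id: str) -> str:
        """Bind a pre-created cgroup to a starting container."""
        if not self._free:
            raise RuntimeError("cgroup pool exhausted")
        cgroup = self._free.popleft()
        self._in_use[container_id] = cgroup
        return cgroup

    def release(self, container_id: str) -> None:
        """Return the container's cgroup to the pool for reuse."""
        self._free.append(self._in_use.pop(container_id))

    def free_count(self) -> int:
        return len(self._free)
```

The point of the design is that the expensive step moves off the startup path; how RunD actually recycles cgroups is not covered in these notes.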

Faster Software Packet Processing on FPGA NICs with eBPF Program Warping

TODO

Real-world eBPF applications: Katran, Suricata.

hXDP is the state-of-the-art eBPF processor for FPGAs.

Help Rather Than Recycle: Alleviating Cold Startup in Serverless Computing Through Inter-Function Container Sharing

In serverless computing, a cold container startup takes more than 10x as long as a warm startup. It consists of three time-consuming steps:

  • create container from the image
  • initialize software environment
  • initialize application-specific code

The key idea is to "fork or lend" a runtime checkpoint from an idle container to eliminate another container's cold startup.

By using privilege control in the operating system, a function that uses a helper container cannot obtain any code, data, or package information of other functions.

RRC: Responsive Replicated Containers

TODO

KRCORE: A Microsecond-scale RDMA Control Plane for Elastic Computing

elastic application, RDMA

Zero-Change Object Transmission for Distributed Big Data Analytics

OSD (object serialization and deserialization) adds overhead to object transmission.

Privbox: Faster System Calls Through Sandboxed Privileged Execution

Based on NaCl (Google Native Client)

Current system calls incur overhead from mitigations against transient execution attacks (Meltdown, Spectre):

  • Flush CPU indirect branch predictor's state to block Spectre v2.
  • Linux's page table isolation (PTI) switches page tables during system call entry.

Current mechanisms reducing system calls:

  • Asynchronous: FlexSC makes system calls asynchronous, yet requires applications to be re-developed.
  • Batching: preadv and Cassyopia make multiple system calls with one kernel entry/exit.
  • Asynchronous + batching: io_uring
  • Kernel bypassing: DPDK, SPDK
  • Transient execution mitigations: Ward
  • eBPF for storage:
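
The batching item in the list above can be illustrated with `preadv`, which Python exposes as `os.preadv` on platforms that support it (Linux among them). This is a minimal sketch: one kernel entry fills two buffers, where two separate `pread` calls would otherwise be needed. The helper name and the demo file contents are made up for illustration.

```python
import os
import tempfile


def read_header_and_body(path: str, header_len: int, body_len: int):
    """Fill two buffers with a single preadv call starting at offset 0."""
    header = bytearray(header_len)
    body = bytearray(body_len)
    fd = os.open(path, os.O_RDONLY)
    try:
        # One system call scatters the data across both buffers in order.
        nread = os.preadv(fd, [header, body], 0)
    finally:
        os.close(fd)
    return nread, bytes(header), bytes(body)


# Demo with a throwaway file (contents are arbitrary).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"HEADpayload")
    demo_path = f.name
nread, header, body = read_header_and_body(demo_path, 4, 7)
os.unlink(demo_path)
```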

BBQ: A Block-based Bounded Queue for Exchanging Data and Profiling

TODO

Memory Harvesting in Multi-GPU Systems with Hierarchical Unified Virtual Memory

TODO

Zero Overhead Monitoring for Cloud-native Infrastructure using RDMA

Two major problems in cloud-native monitoring:

  1. traditional monitoring systems occupy host resources to collect, process, and upload metrics, causing resource contention
  2. the quality of service (QoS) of monitoring itself degrades, so it fails to support massive metrics with rapid variations

CRISP: Critical Path Analysis of Large-Scale Microservice Architectures

TODO

Hardening Hypervisors with Ombro

TODO

HyperEnclave: An Open and Cross-platform Trusted Execution Environment

TODO

KSG: Augmenting Kernel Fuzzing with System Call Specification Generation

2019

LXDs: Towards Isolation of Kernel Subsystems

LXDs (Lightweight Execution Domains) provide a lightweight isolation scheme that can isolate existing kernel subsystems with minimal modifications.

Design:

  • LXD domains
  • Transparent decomposition
  • Asynchronous runtime
  • Cross-core IPC

Extension Framework for File Systems in User space

TODO

Hodor: Intra-Process Isolation for High-Throughput Data Plane Libraries

TODO

2018

The Design and Implementation of Hyperupcalls

Paravirtualization makes the guest aware of the hypervisor, yet harms performance.

Paravirtualization drawbacks:

  • introduces context switches between hypervisors and guests
  • the requestor of a paravirtual mechanism must wait for it to be serviced in another context, which may be busy, or wake the guest if it is idle
  • paravirtual mechanisms couple the design of the hypervisor and guest: paravirtual mechanisms need to be implemented for each guest and hypervisor

Hyperupcalls are built with the guest OS codebase and share the same code, thereby simplifying maintenance while providing the OS with an expressive mechanism to describe its state to underlying hypervisors.

Hyperupcall registration consists of compiling C code, which may reference guest data structures, into verifiable bytecode. The guest registers the generated bytecode with the hypervisor, which verifies its safety, compiles it into native code, and installs it in the VM's hyperupcall table. When the hypervisor encounters an event, such as memory pressure, it executes the respective hyperupcall, which can access and update data structures of the guest.
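
The register, verify, and dispatch flow above can be modeled with a toy table. This is a sketch under heavy simplification, not the paper's interface: `ToyHypervisor`, the event name, and the guest-state dictionary are all hypothetical, and a Python callable stands in for verified bytecode.

```python
from typing import Callable, Dict

GuestState = Dict[str, int]
Hyperupcall = Callable[[GuestState], None]


class ToyHypervisor:
    def __init__(self) -> None:
        self._table: Dict[str, Hyperupcall] = {}  # per-VM hyperupcall table

    def _verify(self, handler) -> bool:
        # Stand-in for bytecode verification; a real verifier would check
        # the safety of the code, not merely that it is callable.
        return callable(handler)

    def register(self, event: str, handler: Hyperupcall) -> None:
        """Guest side: hand a handler to the hypervisor for an event."""
        if not self._verify(handler):
            raise ValueError("handler rejected by verifier")
        self._table[event] = handler

    def raise_event(self, event: str, guest_state: GuestState) -> bool:
        """Hypervisor side: run the handler in place, without waiting
        for the guest to be scheduled. Returns False if none is set."""
        handler = self._table.get(event)
        if handler is None:
            return False
        handler(guest_state)
        return True


def on_memory_pressure(guest_state: GuestState) -> None:
    # Hypothetical handler: mark the guest's free pages as reclaimed.
    guest_state["reclaimed_pages"] += guest_state["free_pages"]
    guest_state["free_pages"] = 0
```

For example, after `hv.register("memory_pressure", on_memory_pressure)`, a call to `hv.raise_event("memory_pressure", state)` updates the guest's state directly from the hypervisor's context.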

KylinX: A Dynamic Library Operating System for Simplified and Efficient Cloud Virtualization

TODO

2017

Optimizing the Design and Implementation of the Linux ARM Hypervisor

Hypervisors using the ARM Virtualization Extensions (VE): Xen (Type 1), KVM (Type 2).

The cost of transitioning from a VM to the hypervisor can be many times worse on ARM than x86. The main reason is that ARM VE requires hypercall redirection (guest kernel -> EL2 lowvisor -> EL1 KVM).

To reduce this overhead, the Virtualization Host Extensions (VHE) allow an unmodified Linux/KVM to run in EL2, which avoids the hypercall redirection.

2013

Practical and Effective Sandboxing for Non-root Users

TODO