
EuroSys

2024

Effective Bug Detection with Unused Definitions

Apply live variable analysis to collect unused definitions, and filter for harmful peer definitions.

Use code familiarity models to prioritize the detected bugs.
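
A minimal sketch of the liveness step (my own toy, not the paper's implementation): a textbook backward live-variable analysis over a hand-built CFG, flagging definitions whose variable is dead immediately after the defining statement.

```python
# Toy backward live-variable analysis that flags unused definitions.
# Illustration only: statements are (defined_var_or_None, set_of_used_vars)
# pairs in hand-built basic blocks, not the paper's real IR.

def find_unused_defs(blocks, succ):
    """blocks: {name: [(def_var, use_set), ...]}; succ: {name: [successors]}."""
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # iterate the backward dataflow to a fixed point
        changed = False
        for b, stmts in blocks.items():
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            live = set(out)
            for d, uses in reversed(stmts):
                if d is not None:
                    live.discard(d)  # a definition kills liveness of its var
                live |= uses         # uses generate liveness
            if out != live_out[b] or live != live_in[b]:
                live_out[b], live_in[b] = out, live
                changed = True

    unused = []
    for b, stmts in blocks.items():
        live = set(live_out[b])
        for i, (d, uses) in reversed(list(enumerate(stmts))):
            if d is not None and d not in live:
                unused.append((b, i, d))  # defined here, never used afterwards
            if d is not None:
                live.discard(d)
            live |= uses
    return unused

# 'err' is assigned in B1 but never read afterwards, so it is flagged.
blocks = {"B1": [("x", set()), ("err", {"x"})], "B2": [(None, {"x"})]}
print(find_unused_defs(blocks, {"B1": ["B2"], "B2": []}))  # [('B1', 1, 'err')]
```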

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

A typical incident life-cycle:

  1. detection
  2. triaging
  3. diagnosis
  4. mitigation

Main challenge: efficiently collecting and interpreting comprehensive, incident-specific data.

LLM can:

  1. parse through high-volume data, discern relevant information, and produce succinct, insightful output.
  2. adapt to new and evolving types of incidents, learning from previous data to improve future predictions.

But LLMs lack intrinsic domain-specific knowledge, especially in specialized areas such as cloud incident management.

RCACopilot's summarization prompt:
Please summarize the above input. 
Please note that the above input is incident diagnostic information. 
The summary results should be about 120 words, no more than 140 words, 
and should cover important information as much as possible. 
Just return the summary without any additional output
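
A hedged sketch of how this prompt might be wired into a pipeline; `call_llm` is a hypothetical stand-in for whatever chat-completion client is used (the paper does not prescribe one here), and the word-cap retry is my own addition.

```python
# Sketch: feed incident diagnostic text to an LLM with the summarization
# prompt above. call_llm() is a hypothetical wrapper around any
# chat-completion API; the retry-on-length check is my own addition.

SUMMARIZE_PROMPT = (
    "Please summarize the above input. "
    "Please note that the above input is incident diagnostic information. "
    "The summary results should be about 120 words, no more than 140 words, "
    "and should cover important information as much as possible. "
    "Just return the summary without any additional output"
)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def summarize_incident(diagnostic_text: str, max_retries: int = 2) -> str:
    prompt = f"{diagnostic_text}\n\n{SUMMARIZE_PROMPT}"
    for _ in range(max_retries + 1):
        summary = call_llm(prompt)
        if len(summary.split()) <= 140:  # enforce the 140-word cap
            return summary
    return summary  # fall back to the last attempt
```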

2023

Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems

Modern cloud systems interact with many (sub-)systems. Yet such interactions may introduce failures, termed cross-system interaction failures (CSI failures).

CSI failures are distinct from dependency failures and library-interaction failures. The study organizes them by control, data, and management planes.

  • The control plane embodies the system's core control logic, such as scheduling, resource allocation, coordination, fault tolerance, and recovery.
  • The data plane embodies the components for data operations, in the form of tables, files, tuples, and streams.
  • The management plane embodies the components for system configuration and monitoring.

Collection methodology: mining the JIRA issue-tracking system. A quantitative study extracts the root causes as follows:

For data plane:

  1. The discrepancies behind data-plane CSI failures lie in many different data properties.
    • The majority (50/61) of data-plane CSI failures are caused by metadata, namely typical metadata (42/61) such as addresses/names and data schemas, and custom metadata (8/61).
    • The others (11/61) are caused by custom properties and API semantics. (:question: inconsistent with Table 4)
  2. Complicated data abstractions (e.g., tables) are more prone to CSI failures than simple data abstractions.
    • 57% (35/61) of data-plane CSI failures are induced by table-related operations.
    • None are induced by key-value tuple operations.
  3. 25% (15/61) of data-plane CSI failures are root-caused by data serialization (see the sketch after this list).
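
To make the serialization finding concrete, here is a made-up two-system example (hypothetical, not from the paper's dataset): the upstream writes a timestamp in epoch seconds while the downstream assumes epoch milliseconds, a discrepancy neither system can detect locally.

```python
import json, datetime

# Hypothetical upstream system: serializes event time as epoch SECONDS.
def upstream_serialize(event_time: datetime.datetime) -> str:
    return json.dumps({"ts": int(event_time.timestamp())})

# Hypothetical downstream system: assumes the "ts" field is epoch MILLISECONDS.
def downstream_deserialize(payload: str) -> datetime.datetime:
    return datetime.datetime.fromtimestamp(json.loads(payload)["ts"] / 1000)

evt = datetime.datetime(2023, 1, 1, 12, 0)
# Each side is self-consistent, yet the round trip lands in January 1970:
print(evt, "->", downstream_deserialize(upstream_serialize(evt)))
```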

For management plane:

  1. CSI-failure-inducing configuration issues mostly stem from failing to configure the multiple involved systems coherently.
  2. Parameter-related configuration issues are the majority (21/30) of configuration-induced CSI failures. The rest (9/30) are in configuration components of the involved systems.
  3. Monitoring-related CSIs are critical to reliability, especially when monitoring data is used for critical actions.

For control plane:

  1. Most control-plane CSI failures are rooted in discrepancies of implicit properties, including implicit API semantics and state/resource inconsistencies.
  2. API misuses, despite being a classic problem, are still common defects and contribute to the majority (13/20) of control-plane CSI failures. The main patterns are implicit semantic violation (8/13) and incorrect invocation context (5/13).

Each root cause comes with corresponding implications. The study of the fixes shows:

  1. In 40% (46/115) of CSI failures, the merged fixes improve condition checking and error handling instead of repairing the failed interactions (see the sketch after this list).
  2. In 69% (79/115) of CSI failures, fixes were applied to code in the upstream system specific to its interaction with a downstream system. Among these 79 cases, fixes for 68 (86%) resided in dedicated “connector” modules.
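
A schematic of that dominant fix pattern (all names hypothetical): instead of repairing the interaction itself, the upstream's connector module gains a precondition check and explicit error handling around the downstream call.

```python
# Schematic connector-module fix (hypothetical names throughout).
# Before: the upstream called the downstream directly and assumed success.
# After: the connector validates preconditions and handles errors explicitly,
# mirroring the dominant fix pattern in the study.

class DownstreamError(Exception):
    """Raised when the downstream system rejects or fails an interaction."""

def fetch_table_schema(downstream_client, table_name: str) -> dict:
    # Fix part 1: condition checking before the cross-system call.
    if not table_name or "." not in table_name:
        raise DownstreamError(f"unqualified table name: {table_name!r}")
    try:
        schema = downstream_client.get_schema(table_name)  # hypothetical client API
    except ConnectionError as e:
        # Fix part 2: explicit error handling instead of letting the failure
        # propagate as an obscure crash deep inside the upstream.
        raise DownstreamError(f"downstream unreachable: {e}") from e
    if schema is None:
        raise DownstreamError(f"table {table_name!r} not found downstream")
    return schema
```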

2022

Verified Programs Can Party: Optimizing Kernel Extensions via Post-Verification Merging

Current BPF extensions:

  • system call security: seccomp-BPF
  • performance tracing: tracepoints, bcc
  • network packet processing: XDP
  • system monitoring: sysdig
  • performance enhancement: ExtFUSE
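
To ground the first item, a minimal seccomp-BPF filter via libseccomp's Python bindings (assumes the `seccomp` bindings package is installed; this only installs a filter and is unrelated to the paper's merging technique):

```python
import sys
import seccomp  # libseccomp Python bindings (often packaged as python3-seccomp)

# Default-deny filter: kill the thread on any syscall not explicitly allowed.
f = seccomp.SyscallFilter(defaction=seccomp.KILL)
for name in ("read", "write", "exit_group", "rt_sigreturn"):
    f.add_rule(seccomp.ALLOW, name)
f.load()  # from here on, only the four syscalls above are permitted

# CPython may need further syscalls in practice; this is a sketch, not a
# hardened policy.
sys.stdout.write("still alive inside the sandbox\n")
```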

You shall not (by)pass!: practical, secure, and fast PKU-based sandboxing

TODO

PKRU-safe: automatically locking down the heap between safe and unsafe languages

PKRU-Safe: an automated method for enforcing the principle of least privilege on unsafe components in mixed-language environments.

Design: a program is split into a trusted compartment and an untrusted compartment, each occupying a memory region protected by MPK. To support compartmentalization, PKRU-Safe requires developer annotations on the source code and performs data-flow analysis to assign objects to the compartments. Finally, PKRU-Safe makes the original allocator aware of the annotations.

Security evaluation on Servo, whose JavaScript engine is written in unsafe C/C++. Performance evaluation on microbenchmarks, Servo, and the Dromaeo, Kraken, JetStream2.0, and Octane benchmark suites.

Related work: TODO

Kite: Lightweight Critical Service Domains

TODO

2021

Unikraft: Fast, Specialized Unikernels the Easy Way

TODO

2020

SEUSS: Skip Redundant Paths to Make Serverless Fast

TODO

A Linux in Unikernel Clothing

TODO