NSDI

2024

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

2023

Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker

Bug taxonomy

  1. no error handling
  2. throwing unrelated exceptions
  3. silent semantic violations
  4. state divergence

Rainmaker first installs a local HTTP proxy that can intercept and manipulate HTTP traffic between the application and cloud services. It then injects faults into the REST API calls made to those services according to automatic fault-injection policies. Finally, Rainmaker’s oracles analyze the test outcomes and raise alerts when potential bugs are detected.
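
The three-step workflow above (intercept, inject, check) can be illustrated with a minimal sketch. This is not Rainmaker's implementation: the names `Response`, `FaultInjectingProxy`, `run_campaign`, `fake_cloud`, and `upload_then_list` are all hypothetical, the "proxy" is an in-process wrapper rather than a real HTTP proxy, the policy (fail each recorded call once, without contacting the service) is an assumed stand-in for the automatic policies, and the oracle is reduced to "an exception escaped the test".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Response:
    status: int
    body: str = ""


@dataclass
class FaultInjectingProxy:
    """Hypothetical in-process stand-in for an intercepting HTTP proxy."""
    upstream: Callable[[str, str], Response]  # (method, url) -> Response
    fault_at: Optional[int] = None            # index of the call to fail, or None
    fault_status: int = 503                   # assumed fault: a 503 response
    calls: List[str] = field(default_factory=list)

    def request(self, method: str, url: str) -> Response:
        index = len(self.calls)
        self.calls.append(f"{method} {url}")
        if index == self.fault_at:
            # Assumption of this sketch: the faulted call never reaches upstream.
            return Response(self.fault_status, "injected fault")
        return self.upstream(method, url)


def run_campaign(test, upstream) -> List[Tuple[str, str]]:
    """Run the test once fault-free to record its calls, then once per call
    with a fault injected at that call. Returns (call, exception name) alerts."""
    reference = FaultInjectingProxy(upstream)
    test(reference)
    alerts = []
    for i, call in enumerate(reference.calls):
        proxy = FaultInjectingProxy(upstream, fault_at=i)
        try:
            test(proxy)
        except Exception as exc:
            # Simplified oracle: any exception escaping the test is flagged.
            alerts.append((call, type(exc).__name__))
    return alerts


def fake_cloud(method: str, url: str) -> Response:
    # Toy cloud service that always succeeds.
    return Response(200, "ok")


def upload_then_list(client: FaultInjectingProxy) -> None:
    # Toy application test: checks the upload result but ignores the list result.
    r = client.request("PUT", "/blobs/a")
    if r.status != 200:
        raise RuntimeError("upload failed")
    client.request("GET", "/blobs")


alerts = run_campaign(upload_then_list, fake_cloud)
print(alerts)
```

In this toy run only the faulted `PUT` produces an alert. The faulted `GET` is ignored by the test and passes unnoticed, which is why an exception-only oracle would not be enough to catch the silent categories in the taxonomy above.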