NSDI
2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
2023
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker
Bug taxonomy
- no error handling
- throwing unrelated exceptions
- silent semantic violations
- state divergence
Rainmaker first installs a local HTTP proxy that can intercept and manipulate HTTP traffic between the application and cloud services. It then injects faults into the application's REST API calls according to automatic fault-injection policies. Finally, Rainmaker’s oracles analyze the test outcomes and raise alerts when they detect potential bugs.
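
The proxy, policy, and oracle flow can be sketched in a few lines. This is a minimal in-memory illustration, not Rainmaker's implementation: `FaultPolicy`, `FaultInjectingProxy`, `run_with_oracle`, the "fail the n-th matching request" rule, the 503 status, and the `cloud.example` URLs are all hypothetical names and choices made for the example, and a plain function call stands in for real HTTP interception.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Request:
    method: str
    url: str


@dataclass(frozen=True)
class Response:
    status: int
    body: str = ""


@dataclass
class FaultPolicy:
    """Hypothetical policy: fail the n-th request whose URL contains a substring."""
    url_substring: str
    nth_match: int
    fault_status: int = 503
    seen: int = field(default=0, init=False)

    def maybe_fault(self, req: Request) -> Optional[Response]:
        if self.url_substring not in req.url:
            return None
        self.seen += 1
        if self.seen == self.nth_match:
            return Response(self.fault_status, "injected fault")
        return None


class FaultInjectingProxy:
    """Sits between the application and the (fake) cloud service."""

    def __init__(self, upstream: Callable[[Request], Response], policy: FaultPolicy):
        self.upstream = upstream
        self.policy = policy
        self.injected: List[Request] = []  # requests that received a fault

    def handle(self, req: Request) -> Response:
        fault = self.policy.maybe_fault(req)
        if fault is not None:
            self.injected.append(req)
            return fault  # the request never reaches the service
        return self.upstream(req)


def run_with_oracle(test: Callable[[FaultInjectingProxy], None],
                    proxy: FaultInjectingProxy) -> Optional[str]:
    """Simplified oracle: alert if the test raises after a fault was injected."""
    try:
        test(proxy)
    except Exception as exc:
        if proxy.injected:
            return f"potential bug: {type(exc).__name__} after injected fault"
        raise  # failure unrelated to fault injection
    return None


def fake_cloud(req: Request) -> Response:
    # Stand-in for a cloud service that always succeeds.
    return Response(200, "ok")


def upload_twice(proxy: FaultInjectingProxy) -> None:
    # Application code under test: no retry, so a transient fault surfaces as an error.
    for name in ("a", "b"):
        resp = proxy.handle(Request("PUT", f"https://cloud.example/blobs/{name}"))
        if resp.status != 200:
            raise RuntimeError(f"upload failed with {resp.status}")


proxy = FaultInjectingProxy(fake_cloud, FaultPolicy("/blobs", nth_match=2))
alert = run_with_oracle(upload_twice, proxy)
```

Here the second upload receives the injected 503, the application raises, and the oracle reports an alert. A real oracle would need to distinguish the bug classes in the taxonomy above (for example, silent semantic violations raise nothing at all), which this sketch does not attempt.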