DSN 2024
Mutiny! How does Kubernetes fail, and what can we do about it?
Fault-Error-Failure chain:
- Fault: a static defect in software
- Error: an incorrect internal state
- Failure: externally visible incorrect behavior
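
A toy illustration of the chain, not taken from the paper: the function name `mean_replicas` and the inputs are hypothetical. It shows that a fault can be present without causing a failure, and that a failure appears only once the resulting error reaches the output.

```python
def mean_replicas(counts):
    """Average of a list of replica counts, with a deliberate defect."""
    total = 0
    # Fault: the loop bound skips the last element (a static defect in the code).
    for i in range(len(counts) - 1):
        total += counts[i]
    # Error: `total` is an incorrect internal state whenever the skipped
    # element is non-zero.
    return total / len(counts)


# The fault is executed, but the skipped element is 0, so no error arises
# and the result is still correct (true mean is 2.0).
masked = mean_replicas([3, 3, 0])

# The skipped element is 3, so `total` is wrong and the caller sees an
# incorrect result (true mean is 3.0): a failure.
failed = mean_replicas([3, 3, 3])

print(masked, failed)
```

The two calls run the same faulty code; only the second one turns the fault into an externally visible failure.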
What kinds of failures?
- misconfigurations
- communication errors
- errors in underlying layers, such as the OS
Alterations to etcd can recreate a majority (54 of 81, about 67%) of the real-world failures analyzed.
The core idea is to understand why an injected fault propagates to the whole system.
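
A minimal sketch of the injection-and-observe idea, under loud assumptions: a plain Python dict stands in for etcd, `reconcile` is a hypothetical toy controller, and `inject` is a hypothetical helper. None of this is the paper's tooling or a real etcd or Kubernetes API; it only shows the shape of the experiment, which is to alter one stored value and compare the resulting system state against a fault-free run.

```python
import copy


def reconcile(store):
    """Toy controller: derive the 'running' instances from the stored spec."""
    return {
        name: ["%s-%d" % (name, i) for i in range(spec["replicas"])]
        for name, spec in store.items()
    }


def inject(store, key, field, value):
    """Return a copy of the store with one field altered (the injected fault)."""
    altered = copy.deepcopy(store)
    altered[key][field] = value
    return altered


# Fault-free run: the dict is a stand-in for the cluster state held in etcd.
baseline_store = {"web": {"replicas": 2}, "db": {"replicas": 1}}
baseline = reconcile(baseline_store)

# Faulty run: one stored value is altered before the controller reads it.
faulty = reconcile(inject(baseline_store, "web", "replicas", 0))

# Components whose observed state differs from the fault-free run.
affected = sorted(k for k in baseline if baseline[k] != faulty[k])
print(affected)
```

Here the alteration stays local to one component. In a real cluster the interesting cases are those where the set of affected components grows well beyond the altered entry.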