DSN 2024
Mutiny! How does Kubernetes fail, and what can we do about it?
Fault-Error-Failure chain:
- Fault: a static defect in software
- Error: an incorrect internal state
- Failure: external, incorrect behavior
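The three stages of the chain can be sketched with a toy component. This is only an illustration: the class `ReplicaCounter` and its methods are hypothetical names and are not taken from the paper or from Kubernetes.

```python
class ReplicaCounter:
    """Toy component used only to illustrate the chain (hypothetical)."""

    def __init__(self):
        self.desired = 0

    def scale(self, delta):
        # Fault: a static defect in the code. The lower bound is never
        # checked, so `desired` can become negative.
        self.desired += delta

    def has_error(self):
        # Error: an incorrect internal state (a negative replica count).
        return self.desired < 0

    def report(self):
        # Failure: the incorrect state becomes visible to a caller.
        return f"desired replicas: {self.desired}"


counter = ReplicaCounter()
counter.scale(2)   # the fault is present but dormant, the state is still valid
dormant = counter.has_error()
counter.scale(-5)  # the fault is activated, the internal state is now wrong
activated = counter.has_error()
print(counter.report())  # the wrong state reaches the outside: a failure
```

The fault exists from the moment the code is written, but it only turns into an error on the second call, and into a failure once the bad value is reported.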
What kind of failures?
- misconfigurations
- communication errors
- errors in underlying layers, such as the OS
Etcd alterations can recreate a majority (54/81) of the real-world failures analyzed.
The core idea is to inject a fault and then study why and how it propagates to the whole system.
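To make the idea of an etcd alteration concrete, here is a minimal sketch that flips one bit in a stored value and shows how the change surfaces to a reader of that value. It is an assumption-laden toy: the JSON record, the `replicas` field, and the helper `flip_bit` are hypothetical, the record is a stand-in for whatever encoding the store really uses, and this is not the paper's injection tool.

```python
import json


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of `data` with one bit inverted (hypothetical helper)."""
    mutated = bytearray(data)
    mutated[byte_index] ^= 1 << bit
    return bytes(mutated)


# A stand-in for a value held in the key-value store (assumed format).
stored = json.dumps({"kind": "Deployment", "replicas": 3}).encode()

# Inject the fault: flip bit 2 of the byte holding the character "3".
index = stored.index(b"3")
corrupted = flip_bit(stored, index, 2)

# The corrupted value is still valid JSON, so a reader accepts it silently
# and would act on the wrong replica count.
before = json.loads(stored)["replicas"]
after = json.loads(corrupted)["replicas"]
print(before, after)
```

A single flipped bit leaves the record well-formed, which is one way an injected fault can pass unnoticed into the rest of the system instead of being rejected at the point of injection.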