Chaos Engineering
Kill Your Own Servers for the Greater Good
The guy on stage
That's Bence
Software Platform Lead @ Kiwi.com
Helping our developers
write better software, faster
This talk is mostly about systems design
(and psychology)
If you only care about Python
str(...)[...==...]+str(....__doc__)[
...==...]+str(...)[(...==...)<<(...==...)]
think about this until the end of the talk
Accurate Depiction of Netflix
before Chaos Monkey
Accurate Depiction of Netflix
after Chaos Monkey
Experiments to reveal system weaknesses
It's basically like a vaccine
Both involve injecting a little harm
Both help build immunity against catastrophe
Neither of them causes autism
Your mind is in a happy bubble
Yes, you handle errors in code most of the time
This happy bubble…
inside
1 happy path × 8 parameters
8 considerations
90% of your time
outside
12 dependencies × 10 failure modes
120 considerations
10% of your time
Wait, “12 dependencies”?
Highly confidential
Amazon & Netflix service diagrams
Do not distribute
Wait, “10 failure modes”?
CPU overload
Full disk
Disk I/O overload
Full memory
Server shutdown
Server time desync
Network disconnect
DNS failure
Network latency
Packet loss
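To make the "120 considerations" concrete, here is a minimal sketch that crosses the ten failure modes above with twelve dependencies. The dependency names are placeholders (the slides only give the count), and `experiment_backlog` is a made-up helper, not part of any tool.

```python
from itertools import product

# The ten failure modes listed on the slide.
FAILURE_MODES = [
    "CPU overload",
    "Full disk",
    "Disk I/O overload",
    "Full memory",
    "Server shutdown",
    "Server time desync",
    "Network disconnect",
    "DNS failure",
    "Network latency",
    "Packet loss",
]

# Assumption: placeholder names, since the talk only says there are 12.
DEPENDENCIES = [f"dependency-{i}" for i in range(1, 13)]


def experiment_backlog(dependencies, failure_modes):
    """Return every (dependency, failure mode) pair worth thinking about."""
    return list(product(dependencies, failure_modes))


backlog = experiment_backlog(DEPENDENCIES, FAILURE_MODES)
print(len(backlog))  # 12 * 10 pairs
```

Each pair is a candidate experiment: "what happens to my service when this dependency fails in this way?"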
First thing we suck at:
Akrasia
“lack of command”
π now, or…
π in 24 hours, or…
πππ in 1 hour
πππ in 25 hours
Second thing we suck at:
Prediction
- Scenario: Break network between app and user API
- Expectation: Data will come from cache for 15 minutes
- Reality: Cache was configured to health check the backend, took whole site down
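A toy model of the scenario above, to show how a single setting can flip the outcome. This is only a sketch: `TtlCache`, `BackendDown` and the `health_check` parameter are invented for illustration, and nothing here reflects how the real cache was actually configured.

```python
import time


class BackendDown(Exception):
    """Raised when the cache refuses to answer because the backend looks dead."""


class TtlCache:
    """Hypothetical read-through cache with an optional backend health check."""

    def __init__(self, fetch, ttl_s=15 * 60, health_check=None, clock=time.monotonic):
        self.fetch = fetch                # function that calls the backend
        self.ttl_s = ttl_s                # 15 minutes, as in the expectation
        self.health_check = health_check  # None means "never check the backend"
        self.clock = clock
        self._store = {}

    def get(self, key):
        # The surprise: with a health check configured, a dead backend
        # makes the cache fail even for entries it already holds.
        if self.health_check is not None and not self.health_check():
            raise BackendDown("backend health check failed")
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]               # fresh enough, serve from cache
        value = self.fetch(key)
        self._store[key] = (value, now)
        return value


backend = {"up": True}


def fetch_user(user_id):
    if not backend["up"]:
        raise ConnectionError("network between app and user API is broken")
    return f"user:{user_id}"


expected = TtlCache(fetch_user)
reality = TtlCache(fetch_user, health_check=lambda: backend["up"])
expected.get(1)
reality.get(1)            # both caches are warm now

backend["up"] = False     # run the experiment: break the network
print(expected.get(1))    # still served from cache
try:
    reality.get(1)
except BackendDown as exc:
    print("outage:", exc)
```

Both caches hold the same data. Only the one without the health check keeps serving it, which is the kind of difference that is easy to miss on paper and obvious in an experiment.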
expectation
reality
Okay, no, but seriously
Software is full of unexpected crap
Third thing we suck at:
Execution
Try roleplaying outages
- Dungeon Master tells the tale of an outage
- Player says what they would do
- Presenter acts it out on a shared screen
>50% of steps taken won't be optimal
…and this just gets worse with stress
Practice
to get yourself from 50% optimal to 80%
Automate
to get yourself from 80% optimal to 100%
How to get started with chaos
- Inject a small problem in staging
Gremlin is a cool tool for this
- Slowly increase the problem
Start with like 10ms latency on 1% traffic
- Diligently document everything
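If you want a feel for the steps above before reaching for a real tool, here is a minimal in-process sketch. It is not Gremlin's API: `LatencyInjector` and its parameters are made up for illustration, and it only delays calls inside one Python process.

```python
import random
import time


class LatencyInjector:
    """Hypothetical fault injector: delays a fraction of calls by a fixed amount."""

    def __init__(self, delay_s=0.010, fraction=0.01, rng=None, sleep=time.sleep):
        self.delay_s = delay_s      # start small: 10 ms
        self.fraction = fraction    # on 1% of traffic
        self.rng = rng or random.Random()
        self.sleep = sleep          # injectable, so tests do not really wait
        self.total = 0
        self.injected = 0
        self.log = []               # document everything

    def call(self, func, *args, **kwargs):
        """Run func, sleeping first on a random share of calls."""
        self.total += 1
        if self.rng.random() < self.fraction:
            self.injected += 1
            self.sleep(self.delay_s)
        return func(*args, **kwargs)

    def escalate(self, factor=2.0):
        """Slowly increase the problem: more delay, on more traffic."""
        self.log.append((self.delay_s, self.fraction, self.injected, self.total))
        self.delay_s *= factor
        self.fraction = min(self.fraction * factor, 1.0)


def handle_request(user_id):
    # Stand-in for a real staging request handler.
    return f"hello {user_id}"


injector = LatencyInjector()
for step in range(3):
    for user_id in range(1000):
        injector.call(handle_request, user_id)
    injector.escalate()

for delay_s, fraction, injected, total in injector.log:
    print(f"delay={delay_s * 1000:.0f}ms fraction={fraction:.0%} "
          f"injected={injected}/{total}")
```

The injectable `sleep` and `rng` are a design choice for testability. In staging you would watch your dashboards between `escalate()` calls and stop as soon as something unexpected shows up.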