Chaos Engineering
Kill Your Own Servers for the Greater Good

The guy on stage

That's Bence
Software Platform Lead @ Kiwi.com

Helping our developers
write better software, faster

This talk is mostly about systems design
(and psychology)



If you only care about Python

str(...)[...==...]+str(....__doc__)[
            ...==...]+str(...)[(...==...)<<(...==...)]

think about this until the end of the talk

Accurate Depiction of Netflix
before Chaos Monkey

Accurate Depiction of Netflix
after Chaos Monkey

Chaos Engineering is…

Experiments to reveal system weaknesses

It's basically like a vaccine

Both involve injecting a little harm

Both help build immunity against catastrophe

Neither of them causes autism

But why?

humans suck

Your mind is in a happy bubble

Yes, you handle errors in code most of the time

This happy bubble…

inside

1 happy path × 8 parameters

8 considerations

90% of your time

outside

12 dependencies × 10 failure modes

120 considerations

10% of your time

Wait, “12 dependencies”?

Highly confidential

Amazon & Netflix service diagrams

Do not distribute

Wait, “10 failure modes”?

CPU overload

Full disk

Disk I/O overload

Full memory

Server shutdown

Server time desync

Network disconnect

DNS failure

Network latency

Packet loss
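
To make the "120 considerations" concrete, here is a minimal sketch that pairs every dependency with every failure mode from the slides. The dependency names are placeholders (the talk only gives the count of 12), so treat them as an assumption.

```python
from itertools import product

# Hypothetical dependency names; the talk only says there are 12 of them.
DEPENDENCIES = [f"dependency-{i}" for i in range(1, 13)]

# The 10 failure modes listed on the slides.
FAILURE_MODES = [
    "CPU overload",
    "Full disk",
    "Disk I/O overload",
    "Full memory",
    "Server shutdown",
    "Server time desync",
    "Network disconnect",
    "DNS failure",
    "Network latency",
    "Packet loss",
]

# Every (dependency, failure mode) pair is one thing that can go wrong.
SCENARIOS = list(product(DEPENDENCIES, FAILURE_MODES))

print(len(SCENARIOS))  # 120
```

A list like this can double as a backlog of experiments: each pair is one candidate question of the form "what happens to us when this dependency fails in this way?"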

First thing we suck at:
Akrasia

“lack of command”

Image from
youtube.com/kurzgesagt’s
“The Origin of Consciousness”

πŸ• now, or…


πŸ• in 24 hours, or…

πŸ•πŸ•πŸ• in 1 hour


πŸ•πŸ•πŸ• in 25 hours

Second thing we suck at:
Prediction

  • Scenario: Break network between app and user API
  • Expectation: Data will come from cache for 15 minutes
  • Reality: Cache was configured to health check the backend, took whole site down
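
The expectation in this scenario can be written down as code. The sketch below is not the cache from the talk; it is a hypothetical in-process cache (`UserCache`, `BackendDown` and the `fetch` callable are all invented for illustration) that serves a stored copy for up to 15 minutes when the backend is unreachable. A chaos experiment checks whether the real system behaves like this.

```python
import time

CACHE_TTL_SECONDS = 15 * 60  # the 15 minutes from the expectation above


class BackendDown(Exception):
    """Raised by the hypothetical backend client when the user API is unreachable."""


class UserCache:
    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable: user_id -> data, may raise BackendDown
        self._clock = clock    # injectable so tests can control time
        self._entries = {}     # user_id -> (stored_at, data)

    def get(self, user_id):
        try:
            data = self._fetch(user_id)
        except BackendDown:
            # Backend is unreachable: serve the cached copy if it is fresh enough.
            entry = self._entries.get(user_id)
            if entry is not None and self._clock() - entry[0] <= CACHE_TTL_SECONDS:
                return entry[1]
            # Nothing usable in the cache, so the failure reaches the caller.
            raise
        self._entries[user_id] = (self._clock(), data)
        return data
```

The outage on the slide came from a different design: a cache that health checks the backend and stops serving when the check fails. Reading the code would not necessarily reveal which design is deployed, but breaking the network in an experiment does.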

expectation

reality

expectation

reality

Okay, no, but seriously

Software is full of unexpected crap

Third thing we suck at:
Execution

Try roleplaying outages

  • Dungeon Master tells the tale of an outage
  • Player says what they would do
  • Presenter acts it out on a shared screen

>50% of steps taken won't be optimal

…and this just gets worse with stress

Practice
to get yourself from 50% optimal to 80%

Automate
to get yourself from 80% optimal to 100%

Demo

How to get started with chaos

  • Inject a small problem in staging
    Gremlin is a cool tool for this
  • Slowly increase the problem
    Start with something like 10 ms of latency on 1% of traffic
  • Diligently document everything
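
As a rough illustration of "10 ms latency on 1% traffic", here is a minimal in-process sketch. It is not Gremlin's API and not how the talk's demo works; `with_latency` and its parameters are hypothetical names, and a real tool would typically inject the fault at the network or host level instead.

```python
import random
import time


def with_latency(func, delay_seconds=0.010, fraction=0.01,
                 rng=random.random, sleep=time.sleep):
    """Wrap func so that roughly `fraction` of calls are delayed by `delay_seconds`."""
    def wrapper(*args, **kwargs):
        # rng() returns a float in [0, 1); calls below the threshold get the delay.
        if rng() < fraction:
            sleep(delay_seconds)
        return func(*args, **kwargs)
    return wrapper


def handle_request(path):
    # Stand-in for a real request handler.
    return f"200 OK {path}"


# Start small: 10 ms on about 1% of calls, in staging.
slow_handler = with_latency(handle_request)
```

"Slowly increase the problem" then means raising `delay_seconds` and `fraction` step by step, and writing down what you observed at each step.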