Chaos Engineering — Testing Failure Scenarios

When Netflix introduced Chaos Monkey in 2011, the uncomfortable truth it exposed was this: by the time a production incident hits, it is too late to discover that your retry logic doesn't work. Chaos engineering deliberately injects failures into controlled environments so you learn about system weaknesses before your users do. Instead of asking "will this break in production?", you manufacture the break in staging, watch how the system responds, and fix whatever doesn't hold up. The goal is never to cause harm — it is to surface the hidden assumptions that sit quietly in your architecture until 3am on a Friday.

What chaos engineering is

Chaos engineering is the practice of deliberately injecting failures into a running system to validate that it handles them gracefully. It is not random destruction — it is a structured experiment with a measurable outcome.

The process follows four steps:

1. Define steady state: what does normal look like? For an Order Service this might be: error rate below 0.1%, p99 latency below 200ms, order success rate above 99.9%. Write these numbers down before you start.
2. Form a hypothesis: predict what should happen during the failure. For example: "if the Inventory Service goes down, the Order Service degrades gracefully with a fallback response rather than returning a 500."
3. Inject real-world failures: kill a container, inject network latency, fill a connection pool, drop packets. Use tooling designed for this so you can precisely control blast radius.
4. Observe whether steady state held: if your metrics stayed within the defined thresholds, your hypothesis is confirmed. If they didn't, you found a real weakness to fix before it finds you.
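
The steady-state check in the first and last steps can be written down as code so the experiment has an unambiguous pass or fail. The following is a minimal sketch, assuming Java 17 or later; the `Metrics` and `SteadyState` types are hypothetical names invented for illustration, and the thresholds are the Order Service numbers from the example above expressed as fractions.

```java
import java.util.ArrayList;
import java.util.List;

/** Metrics observed during one experiment window (hypothetical shape). */
record Metrics(double errorRate, double p99LatencyMs, double successRate) {}

/** Steady-state thresholds, written down before any failure is injected. */
record SteadyState(double maxErrorRate, double maxP99LatencyMs, double minSuccessRate) {

    /** Lists every threshold the observed metrics broke; an empty list means steady state held. */
    List<String> violations(Metrics m) {
        List<String> broken = new ArrayList<>();
        if (m.errorRate() >= maxErrorRate) {
            broken.add("error rate " + m.errorRate() + " is not below " + maxErrorRate);
        }
        if (m.p99LatencyMs() >= maxP99LatencyMs) {
            broken.add("p99 latency " + m.p99LatencyMs() + "ms is not below " + maxP99LatencyMs + "ms");
        }
        if (m.successRate() <= minSuccessRate) {
            broken.add("success rate " + m.successRate() + " is not above " + minSuccessRate);
        }
        return broken;
    }

    /** True when the hypothesis "steady state survives the failure" is confirmed. */
    boolean holds(Metrics m) {
        return violations(m).isEmpty();
    }
}
```

With `new SteadyState(0.001, 200.0, 0.999)` as the Order Service definition, an experiment window is compared against it after the failure is injected, and any entry in `violations` is a weakness to investigate.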

Failure types worth testing in microservices

Not all failures are equally likely or equally damaging. These five categories cover the most common production failure modes:

Service crash: kill the Payment Service container mid-test. Does the Order Service return a 503 with a user-friendly message, or does it hang indefinitely waiting for a TCP connection that will never come?
Network latency: inject 5-second delays on all calls to the database. Does the circuit breaker open and shed load, or do threads pile up until the thread pool is exhausted and the entire service stops responding?
Packet loss: drop 30% of packets between services. Does the retry logic handle partial failures cleanly, or does it retry in a way that causes duplicate writes?
Resource exhaustion: fill the connection pool to the Payment Service completely. Does the bulkhead isolate the failure and keep the rest of the Order Service healthy, or does the cascade take down unrelated features?
DNS failure: make DNS lookups for a service return NXDOMAIN. Does the service fall back to a cached address and continue, or does it fail immediately with an unhandled exception?
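
To make the packet-loss question concrete, here is a small sketch of why retries cause duplicate writes and one common way to prevent them: the client reuses the same idempotency key on every retry, and the receiver records a given key only once. Everything here is a hypothetical in-memory stand-in (`IdempotentPaymentStore`, `RetryingClient`, and the lost-acknowledgement simulation are invented for illustration), not a real payment API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a payment endpoint that deduplicates by idempotency key. */
class IdempotentPaymentStore {
    private final Map<String, Long> chargesByKey = new ConcurrentHashMap<>();
    final AtomicInteger attemptsSeen = new AtomicInteger();

    /** Records the charge once per key; a repeated key is acknowledged without a second write. */
    void charge(String idempotencyKey, long amountCents) {
        attemptsSeen.incrementAndGet();
        chargesByKey.putIfAbsent(idempotencyKey, amountCents);
    }

    /** Number of distinct charges actually stored. */
    int chargeCount() {
        return chargesByKey.size();
    }
}

class RetryingClient {
    /**
     * Simulates lost acknowledgements: each request reaches the store and is applied,
     * but the first {@code lostAcks} replies never make it back to the client.
     * Returns the attempt number on which the client finally saw a reply.
     */
    static int chargeWithRetry(IdempotentPaymentStore store, String key, long amountCents,
                               int lostAcks, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            store.charge(key, amountCents);      // the request arrived and was applied
            if (attempt > lostAcks) {
                return attempt;                  // this reply made it back
            }
            // Reply lost: the client cannot tell, so it retries with the SAME key.
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

In a chaos test, the assertion to make after injecting packet loss is the one this sketch models: the number of attempts may exceed one, but the number of stored charges must not.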

Toxiproxy — chaos in your tests

Toxiproxy is a programmable TCP proxy that sits between your service and its dependencies. You inject a "toxic" (a latency delay, packet loss rule, or connection limit), run your test, then remove it. Here is a Testcontainers example that tests database latency handling. It assumes the service under test connects to Postgres through the proxy rather than directly:

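// Assumed libraries, inferred from the calls below: JUnit 5, Testcontainers,
// the toxiproxy-java client, REST Assured, and AssertJ.
import eu.rekawek.toxiproxy.model.ToxicDirection;
import io.restassured.response.Response;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.Network;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.containers.ToxiproxyContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

import static io.restassured.RestAssured.given;
import static org.assertj.core.api.Assertions.assertThat;

@Testcontainers
class DatabaseLatencyResilienceTest {

    // Shared Docker network so the proxy container can reach Postgres
    static Network network = Network.newNetwork();

    @Container
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:15")
        .withNetwork(network)
        .withNetworkAliases("postgres-real");

    @Container
    static ToxiproxyContainer toxiproxy = new ToxiproxyContainer(
        DockerImageName.parse("ghcr.io/shopify/toxiproxy:2.5.0"))
        .withNetwork(network);

    static ToxiproxyContainer.ContainerProxy dbProxy;

    @BeforeAll
    static void setup() {
        // Route Postgres traffic (port 5432) through Toxiproxy.
        // Assumption: the service under test is configured with the proxy's
        // host and port as its database address, not the Postgres container's.
        dbProxy = toxiproxy.getProxy(postgres, 5432);
    }

    @Test
    void shouldTimeoutAndReturnFallbackWhenDatabaseIsSlow() throws Exception {
        // Inject 6-second latency on data flowing back from the database;
        // the service has a 3s timeout configured, so the query cannot finish in time
        dbProxy.toxics()
            .latency("db-latency", ToxicDirection.DOWNSTREAM, 6000);

        try {
            Response response = given().get("/orders/recent");
            // Service should return cached/fallback data, not hang
            assertThat(response.statusCode()).isEqualTo(200);
            assertThat(response.jsonPath().getBoolean("cached")).isTrue();
        } finally {
            // Always remove the toxic so later tests see a healthy database
            dbProxy.toxics().get("db-latency").remove();
        }
    }
}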

The finally block is critical: always clean up toxics so that later tests aren't affected by a latency rule you forgot to tear down. A missed cleanup is an easy way to end up with a mysteriously slow test suite.
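
On the service side, the behavior this test expects (a 200 with "cached": true when the database is slow) needs a deadline on the query and a fallback value. The following is a minimal sketch of that idea using only the JDK, assuming Java 17 or later; `OrdersResponse` and `RecentOrdersReader` are hypothetical names, and a real service would more likely rely on its database driver's or resilience library's timeout support.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical response shape: the body plus a flag saying whether it came from cache. */
record OrdersResponse(String body, boolean cached) {}

class RecentOrdersReader {

    /**
     * Runs the query on another thread. If it has not finished by the deadline,
     * the cached copy is returned instead and flagged as cached.
     * Note: the slow query is not cancelled here; it keeps running in the background.
     */
    static OrdersResponse read(Supplier<String> dbQuery, String cachedBody, Duration timeout) {
        return CompletableFuture
            .supplyAsync(() -> new OrdersResponse(dbQuery.get(), false))
            .completeOnTimeout(new OrdersResponse(cachedBody, true),
                               timeout.toMillis(), TimeUnit.MILLISECONDS)
            .join();
    }
}
```

The design choice to notice is that the caller always gets an answer within the timeout: fresh data when the database is healthy, stale but valid data when it is not. The latency toxic is what lets a test exercise the second path on demand.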

Chaos maturity levels

Many teams don't run chaos experiments at all. Here is how maturity progresses:

Level 0: failures are never tested. The team discovers resilience gaps during production incidents.
Level 1: chaos runs in staging, manually and ad hoc, usually triggered by a curious engineer with too much coffee.
Level 2: "game days" — scheduled team exercises where engineers intentionally break staging and observe the outcome together.
Level 3: automated chaos in staging as part of the CI/CD pipeline. Failure scenarios run on every release candidate.
Level 4: controlled chaos in production with automated monitoring and kill switches — the Netflix and Google model.

For most QA teams, Level 2 is the right starting point. Run a quarterly "failure injection game day" where the team picks two or three failure scenarios, injects them into staging, and documents what breaks. It surfaces real problems, builds incident-response muscle memory, and doesn't require any chaos platform investment beyond Toxiproxy.

Other chaos tools

Once you outgrow Toxiproxy for local and CI use, there are purpose-built platforms for broader chaos work:

Chaos Mesh: Kubernetes-native; inject failures at the pod, network, and I/O levels using YAML manifests. No code changes required — failures are declared as Kubernetes custom resources.
Gremlin: a SaaS platform with an easy UI for running chaos experiments across your entire infrastructure. Good for teams that want to run experiments without writing code.
LitmusChaos: a CNCF project built on the Kubernetes operator model. Well-suited for teams already using Argo Workflows, since chaos experiments slot into existing pipeline definitions.

Baseline measurement

Before injecting any failure, measure steady state: error rate, latency p99, throughput. This is your control reading.
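
As a rough illustration of taking that control reading, the sketch below computes the three baseline numbers from raw samples. It assumes you can export per-request latencies and request counts from your load test or monitoring tool; the `BaselineStats` class is a hypothetical helper, and the percentile uses the nearest-rank definition, which is one of several in common use.

```java
import java.util.Arrays;

class BaselineStats {

    /** Nearest-rank percentile: the smallest sample with at least pct percent of samples at or below it. */
    static long percentile(long[] latenciesMs, double pct) {
        if (latenciesMs.length == 0 || pct <= 0 || pct > 100) {
            throw new IllegalArgumentException("need at least one sample and 0 < pct <= 100");
        }
        long[] sorted = latenciesMs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** Fraction of requests that failed, e.g. 0.001 for 0.1%. */
    static double errorRate(long failedRequests, long totalRequests) {
        if (totalRequests <= 0) {
            throw new IllegalArgumentException("totalRequests must be positive");
        }
        return (double) failedRequests / totalRequests;
    }

    /** Average requests per second over the measurement window. */
    static double throughputPerSecond(long totalRequests, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        return totalRequests / windowSeconds;
    }
}
```

Record these values before the experiment and compute them again over the failure window; the comparison between the two readings is the experiment's result.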

⚠️ Common mistakes

Starting with production chaos before staging is stable. Run chaos experiments in staging first. If your staging environment can't survive a Payment Service crash, production definitely can't. Validate resilience from the bottom up.
Not cleaning up chaos state between tests. A Toxiproxy latency toxic left running after a test will silently slow every subsequent test in the suite. Always remove toxics in a finally block or @AfterEach.
Testing only the happy path of resilience patterns. If you have a circuit breaker configured, test both that it opens (after threshold failures) and that it closes again (after the half-open probe succeeds). Leaving one half untested means you could have a circuit that opens but never recovers.
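
The circuit breaker point is easier to test when the state machine is explicit. Below is a deliberately small sketch of one, with an injectable clock so a test can move time forward without sleeping. `SimpleCircuitBreaker` is a hypothetical class written for this lesson, not the API of any resilience library; production breakers add sliding windows, probe limits, and metrics.

```java
import java.util.function.LongSupplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long openMillis;        // how long to stay open before allowing a probe
    private final LongSupplier clock;     // injectable time source, in milliseconds
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** True if a call may proceed; moves OPEN to HALF_OPEN once the wait has elapsed. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    /** A successful call (including a half-open probe) closes the circuit. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed probe, or reaching the threshold, opens the circuit and restarts the wait. */
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A test that covers both halves drives the breaker to OPEN with failures, advances the clock past the wait, confirms a probe is allowed, and then checks that a successful probe returns it to CLOSED and a failed probe sends it back to OPEN.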

🎯 Practice task

Set up Toxiproxy via Testcontainers in a test project. Create a proxy that sits between your service and its database. Verify the proxy works by running a normal query through it.
Write a test that injects 6-second latency through the Toxiproxy proxy. Run it against a service with no timeout configured and watch the request stall for at least the full delay. This is what production looks like when dependencies are slow and timeouts are missing.
Add a 3-second timeout to the service's database calls (a query or socket timeout; a pool acquisition timeout alone does not bound a slow query). Re-run the test. Confirm the service returns an error within the timeout window instead of stalling.
Extend the test to verify graceful degradation: configure the service to return cached data when the database is slow. Assert the response contains "cached": true and a stale but valid response body.
Design a game day exercise for a system you know. List three failure scenarios to inject, the steady-state metrics to monitor during each, and the expected outcome. Run the exercise in staging and document what you find.

The next lesson moves from deliberately breaking systems to observing them — you will learn how distributed tracing gives you a single screen showing every service, every span, and every millisecond of a failed request, making test debugging dramatically faster.