Don’t ignore Chaos testing!

In March this year, I left JPMorgan Chase to join Amazon Web Services. I was at JPMC for close to eight years and my last role there was to lead the product development & delivery of observability services. This included passive observability services for aggregating & alerting over operational metrics, events, logs, and traces; as well as active observability services like synthetic transaction testing & chaos engineering.

That last bit around chaos engineering was one of the more memorable, forward-looking, leading-edge, I-can’t-believe-a-bank-does-this pieces of work, and one that I’m especially proud of incubating & delivering. From that vantage point, I thought I’d share a bit of my perspective on this field.

I’m not going to delve into explaining chaos engineering in this post. The site does a great job of explaining the fundamentals in 10 minutes and is well worth a read. If you have more time, then consider watching this talk on chaos engineering that a colleague and I gave. It’s a great talk, I promise. 😊

One of the ideas we touch on in that talk is the distinction between chaos testing and chaos experimentation.

In essence:

Experimentation leads to learning new things about the system while testing validates our existing understanding of the system.

In a complex system, it’s very hard, if not impossible, to predict the behaviour of the system based purely on an inside-out understanding of the system’s components and internal design[1]. Thus, we run chaos experiments with failure injection to understand how the system behaves under turbulent conditions. The results of these chaos experiments provide us with new knowledge about the system[2].

On the other hand, chaos testing also involves failure injection but with the goal of validating that the currently known properties of the system (related to failure modes) haven’t changed. The properties are either known because they were a design goal (e.g. fallback to cache when source is unavailable) or were added to the system in response to a failure mode discovered through chaos experiments (e.g. circuit breakers).
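To make the "fallback to cache when source is unavailable" example concrete, here is a minimal sketch of what such a chaos test could look like. Everything in it is hypothetical and made up for illustration (`PriceService`, `SourceUnavailable`, the in-memory sources); a real chaos test would inject the failure at the network or infrastructure level rather than by swapping a function.

```python
class SourceUnavailable(Exception):
    """Raised when the (hypothetical) upstream source cannot be reached."""


class PriceService:
    """Hypothetical service that reads from a source and falls back to a cache."""

    def __init__(self, source, cache):
        self.source = source  # callable: key -> value, may raise SourceUnavailable
        self.cache = cache    # dict holding the last known good values

    def get(self, key):
        try:
            value = self.source(key)
            self.cache[key] = value  # refresh the cache on every success
            return value
        except SourceUnavailable:
            # The known property under test: serve the cached value if present.
            if key in self.cache:
                return self.cache[key]
            raise


def healthy_source(key):
    # Stand-in for the real source during steady state.
    return {"widget": 101}[key]


def failing_source(key):
    # Failure injection: simulate the source being down.
    raise SourceUnavailable(key)


def chaos_test_fallback_to_cache():
    service = PriceService(healthy_source, cache={})
    assert service.get("widget") == 101  # steady state, which warms the cache

    service.source = failing_source      # inject the failure
    assert service.get("widget") == 101  # property still holds: cache fallback
    return True
```

The test asserts something we already believe about the system rather than exploring unknown behaviour, which is exactly what separates it from an experiment.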

I feel that in the chaos engineering community, chaos testing is sometimes given short shrift.

Chaos testing is an important new tool to add to your arsenal of unit, integration, and performance tests. Chaos tests can be automated and added to your CI/CD pipeline as part of your regression test suite. In an enterprise and/or regulated context, automated chaos tests provide always-available evidence of conformance to business continuity planning (BCP) standards.
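As a rough sketch of how such a test might sit in a regression suite, the example below uses Python's standard `unittest` module to check that a circuit breaker opens under an injected failure. The `CircuitBreaker` and `FlakyDependency` classes are hypothetical and deliberately simplified (there is no half-open state or reset timer); they stand in for whatever resilience mechanism your system actually uses.

```python
import unittest


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected unattempted."""


class CircuitBreaker:
    """Deliberately minimal breaker: opens after max_failures consecutive failures."""

    def __init__(self, call, max_failures=3):
        self.call = call
        self.max_failures = max_failures
        self.failures = 0

    def __call__(self, *args):
        if self.failures >= self.max_failures:
            raise CircuitOpenError("circuit open")
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the consecutive-failure count
        return result


class FlakyDependency:
    """Hypothetical dependency with a switch for failure injection."""

    def __init__(self):
        self.down = False
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        if self.down:
            raise ConnectionError("injected failure")
        return x * 2


class CircuitBreakerChaosTest(unittest.TestCase):
    def test_breaker_opens_and_sheds_load(self):
        dep = FlakyDependency()
        guarded = CircuitBreaker(dep, max_failures=3)
        self.assertEqual(guarded(2), 4)  # steady state

        dep.down = True                  # inject the failure
        for _ in range(3):
            with self.assertRaises(ConnectionError):
                guarded(2)

        calls_before = dep.calls
        with self.assertRaises(CircuitOpenError):
            guarded(2)
        # Once open, the breaker must not touch the failing dependency.
        self.assertEqual(dep.calls, calls_before)
```

Because this is an ordinary `unittest` test case, a pipeline step that already runs `python -m unittest` would pick it up alongside the rest of the regression suite.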

While there’s immense value to be unlocked through chaos experimentation, you shouldn’t ignore chaos testing! After all, what good is the knowledge accrued from a chaos experiment if the next big feature release or refactoring leaves your system vulnerable to the same failure mode due to a regression?

Hence, your chaos engineering adoption journey must include both chaos experimentation and chaos testing.

The challenge adopters face today is that the tooling for automated chaos testing is still in its infancy. While significant progress is being made by the likes of Chaos Toolkit, Verica, Gremlin, AWS, and others, I believe we are still at the beginning of this promising journey[3]. As a point of comparison, consider how pervasive unit testing is today. However, this wasn’t always the case. JUnit was created in 1997 and Kent Beck’s XP book came out in 1999. Evidently, it took decades for unit testing to become pervasive, requiring a combination of greater collective experience and better tooling.

We are in that same primordial state with chaos engineering and I’m convinced that we’ll see immense progress in the tooling landscape as well as our understanding of the field in the coming decade.

  1. Systems theory makes this explicit in that the internals of a system are incidental and what matters are the system’s inputs & outputs. 

  2. In fact, Chaos Engineering emerged because complex systems’ failure modes couldn’t be understood by traditional inside-out methods. 

  3. At JPMC, my team built our own internal chaos service because none of the tools met all of our needs.