Chaos Testing is Not Failure Testing but Learning from Failures

Over the last few years, I have been reading a lot about resilience testing and have already published a few blog articles on it. Organisations are now very concerned about resilience because it underpins a good end-user experience, and many conduct chaos testing to confirm it. The overall objective is to deliberately introduce failures into a system, test its response, and discover flaws well before they turn into downtime. To achieve this, failure (chaotic) scenarios are executed proactively while the end-to-end system is thoroughly observed.

In addition, chaos testing helps us learn from failures. In this blog, I will explain why chaos testing is not straightforward failure testing but a way of learning from failures.

Chaos Testing:

Chaos testing helps us understand a system’s ability to cope under different chaos or failure conditions: how quickly the system bounces back to its normal state and how gracefully it recovers from failures under those conditions. All of this is observed comprehensively through an E2E (end-to-end) monitoring system.
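To make "how quickly the system bounces back" concrete, here is a minimal sketch of computing recovery time from health samples. The sample format, the function name time_to_recover and the probe values are assumptions made for illustration; a real E2E monitoring system will expose this data in its own way.

```python
from typing import Optional, Sequence, Tuple

# Each sample is (seconds_since_start, healthy), as a hypothetical
# E2E monitoring probe might report it.
Sample = Tuple[float, bool]


def time_to_recover(samples: Sequence[Sample], fault_at: float) -> Optional[float]:
    """Seconds from fault injection until the system is healthy again and
    stays healthy for the rest of the observation window.

    Returns 0.0 if the system was never observed unhealthy after the fault,
    and None if there is no data after the fault or the system is still
    unhealthy at the final sample.
    """
    after = [s for s in samples if s[0] >= fault_at]
    if not after:
        return None
    last_unhealthy = None
    for i, (_, healthy) in enumerate(after):
        if not healthy:
            last_unhealthy = i
    if last_unhealthy is None:
        return 0.0
    if last_unhealthy == len(after) - 1:
        return None
    # First healthy sample after the last unhealthy one marks recovery.
    return after[last_unhealthy + 1][0] - fault_at


# Illustrative data: fault injected at t=15, unhealthy at t=20 and t=30.
probe = [(0, True), (10, True), (20, False), (30, False), (40, True), (50, True)]
print(time_to_recover(probe, fault_at=15))
```

Returning None when the system never recovers keeps "did not recover" distinct from "recovered instantly", which matters when results from repeated runs are compared.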

Chaos Testing is not Straightforward Failure Testing:

A typical failure test verifies a single criterion and reports whether it is true or false, so the test either passes or fails. The result is always binary, and it carries little detailed information about how the application failed.

Chaos testing is also a form of failure testing, but it is not straightforward and does not produce binary results. It is conducted by deliberately introducing chaotic conditions into the system. Although chaos can ideally be thought of as introducing failure scenarios at random, that is not recommended, especially for production systems. For the same reason, chaos testing should not start in production: it is recommended to run it first in a non-production system and only then in production.

Typical failure testing breaks things to test them in a defined way, whereas chaos testing breaks things to explore or examine the system in an unspecified way, so many unpredictable things can occur. Chaos testing is more like failure experimentation than straightforward testing, and it yields new findings about the entire system.
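The contrast between a binary failure test and an exploratory experiment can be sketched as follows. Everything here is hypothetical: the failure_test criterion, the ExperimentReport class, the metric names and the numbers are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def failure_test(status_code: int) -> bool:
    """Classic failure test: one criterion, one pass/fail answer."""
    return status_code == 503


@dataclass
class ExperimentReport:
    """Chaos experiment outcome: observations rather than a single verdict."""
    fault: str
    metrics: Dict[str, float] = field(default_factory=dict)
    surprises: List[str] = field(default_factory=list)

    def note(self, metric: str, value: float, expected_max: float) -> None:
        # Record every observation; flag the ones outside expectations.
        self.metrics[metric] = value
        if value > expected_max:
            self.surprises.append(f"{metric}={value} exceeded {expected_max}")


# Made-up observations from a hypothetical experiment.
report = ExperimentReport(fault="kill one cache node")
report.note("p99_latency_ms", 840.0, expected_max=500.0)
report.note("error_rate_pct", 0.4, expected_max=1.0)
print(failure_test(503), report.surprises)
```

The failure test can only say pass or fail, while the report keeps every metric it saw and the ones that were unexpected, which is where the new findings come from.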

Learning from Failures:

These chaotic scenarios are carefully planned failure scenarios. Executing them under continuous E2E monitoring is essential for experimenting on the system. The main objective is to prepare proactively for failure: observe the system, fix any issues, and re-examine it before real end users are affected. This type of testing provides many insights into application performance and new knowledge of the system, which in turn enable a deeper understanding of the system as a whole. In a nutshell, these are lessons learned from failures. They eventually lead to overall improvement of the system, which in turn creates a fault-tolerant, resilient system that can handle unexpected real-life events with less downtime.

We need to keep in mind that chaos testing is meant to generate new information about the system and to expose its weaknesses. As mentioned above, continuous E2E monitoring is crucial. When we have E2E observability of the whole system, we get a wealth of information from each failure experiment.

Failure Scenarios:

Identifying strategic, thoughtful failure scenarios is critical to this process, and the scenarios should be finalised together as a single team. The number of possible chaos scenarios is effectively unlimited, and they differ according to the system architecture and the organisation’s overall purpose.

In a complex system, failure can occur in any component, and executing every permutation and combination is practically impossible. When organisations run chaos tests continuously and repeatedly, they learn a great deal about their system. Learning from those scenarios eventually helps identify the components most likely to fail.

Steps to Follow:

First and foremost, we need to think of small, deliberate failure scenarios. This minimises the blast radius of each experiment, so these small intentional failures have less impact on services and components.
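One simple way to keep the blast radius small is to target only a small fraction of instances. The sketch below assumes a fleet identified by plain string names; pick_targets, the fraction parameter and the fleet naming are all hypothetical and not taken from any particular chaos tool.

```python
import math
import random
from typing import List, Sequence


def pick_targets(instances: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Choose a small subset of instances to receive the fault.

    The count is the given fraction of the fleet rounded down,
    with a minimum of one instance.
    """
    if not instances:
        return []
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, math.floor(len(instances) * fraction))
    rng = random.Random(seed)  # seeded so the same experiment can be repeated
    return rng.sample(list(instances), count)


fleet = [f"web-{i}" for i in range(20)]
targets = pick_targets(fleet, fraction=0.1)
print(len(targets), "of", len(fleet), "instances targeted")
```

Seeding the selection is a design choice for repeatability: if a run exposes a weakness, the same targets can be hit again after the fix to confirm it.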

The first requirement is to identify the steady state, that is, the normal or baseline behaviour of the system. Then form the hypothesis that this steady state will continue in both normal and experimental conditions. Next, introduce real-life failure scenarios into the system. Once that is done, try to invalidate the hypothesis by finding a difference between the normal and experimental runs. Either the chaos test confirms the resilience of the system or it finds a problem that needs to be resolved.
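The steady-state comparison described above could look something like the sketch below. It assumes the steady state is captured as a few named metrics and that a relative tolerance of 10% is acceptable; the function name violations, the metric names and the values are illustrative assumptions.

```python
from typing import Dict, List


def violations(baseline: Dict[str, float],
               experiment: Dict[str, float],
               tolerance: float = 0.10) -> List[str]:
    """List metrics whose experimental value drifts from the baseline by
    more than `tolerance` (relative). An empty list means this run did
    not invalidate the steady-state hypothesis."""
    drifted = []
    for name, base in baseline.items():
        observed = experiment.get(name)
        if observed is None:
            drifted.append(f"{name}: no data during experiment")
            continue
        limit = abs(base) * tolerance
        if abs(observed - base) > limit:
            drifted.append(f"{name}: baseline {base}, observed {observed}")
    return drifted


# Made-up numbers: throughput holds, latency does not.
steady = {"requests_per_sec": 1200.0, "p95_latency_ms": 180.0}
under_fault = {"requests_per_sec": 1150.0, "p95_latency_ms": 410.0}
print(violations(steady, under_fault))
```

A non-empty result is the "difference between normal and experimental" that invalidates the hypothesis and points to something that needs fixing.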

Failure can occur at any time, so we need to run chaos tests continuously to locate new issues and resolve them as early as possible. Below are the steps to follow:

  • Create a hypothesis
  • Inject failures
  • Measure impact
  • Verify hypothesis
  • Learn from failures

And do this continuously.
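The five steps above can be sketched as a small experiment runner. This is a toy under stated assumptions: the Experiment class, the run function and the in-memory "system" are hypothetical stand-ins, and a real setup would inject faults through its own tooling and read metrics from its monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    name: str
    hypothesis: str                # 1. create a hypothesis
    inject: Callable[[], None]     # 2. inject the failure
    rollback: Callable[[], None]   # undo the failure afterwards
    measure: Callable[[], float]   # 3. measure impact (one metric here)
    max_allowed: float             # threshold that encodes the hypothesis


def run(experiment: Experiment, learnings: List[str]) -> bool:
    experiment.inject()
    try:
        observed = experiment.measure()
    finally:
        experiment.rollback()      # restore the system even if measuring fails
    held = observed <= experiment.max_allowed   # 4. verify hypothesis
    outcome = "held" if held else "was invalidated"
    learnings.append(                            # 5. learn from failures
        f"{experiment.name}: '{experiment.hypothesis}' {outcome} "
        f"(observed {observed}, allowed {experiment.max_allowed})"
    )
    return held


# Toy in-memory "system": the error rate rises while a replica is down.
state = {"replica_down": False}
exp = Experiment(
    name="replica-loss",
    hypothesis="error rate stays under 1% with one replica down",
    inject=lambda: state.update(replica_down=True),
    rollback=lambda: state.update(replica_down=False),
    measure=lambda: 2.5 if state["replica_down"] else 0.1,
    max_allowed=1.0,
)
log: List[str] = []
print(run(exp, log), log)
```

Running the same experiments on a schedule and keeping the accumulated log is one way to "do this continuously" and retain what each run taught.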

Conclusion:

This type of testing is not straightforward failure testing; rather, it is failure investigation that, with continuous E2E monitoring, yields many findings about the complete system. Chaos testing enables a great deal of learning and supports overall improvement of the system. In a nutshell, chaos testing helps build a fault-tolerant, resilient system.

 

Check out all the software testing webinars and eBooks here on EuroSTARHuddle.com

About the Author

Arun Kumar

Arun earned a degree in Computer Science from Govt. Engg. College, India. He has 14+ years of experience working on and managing E2E testing delivery across different types of applications. He has a keen interest in reading and writing technical papers. He has been selected for multiple international conferences and global webinars, and his papers have been published in multiple forums and have won various awards. He is now working as Senior Test Manager at Atos and Global Subdomain Leader for Atos Expert: Applications-Testing.