Before discussing testing the unknown, we should mention two of the known forms of testing:
- Unit Testing – We take a single component (the smallest testable part of an application), give it input, and make sure the output is correct.
- Integration Testing – We test the interaction between components: given input X, we expect component A to output Y, and then we expect component B to output Z.
These are two of the known forms of testing. So, what is Testing the Unknown?
Testing the Unknown
Let me explain using integration testing as an example. Testing the unknown looks a lot like integration testing, except that between Service A and Service B we have the option to inject failure or latency so that we can put the system under stress.
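The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ServiceA`, `ServiceB`, and `FaultInjector` are hypothetical names invented for this example, not part of any chaos engineering tool. The wrapper sits between the two services and can add latency or raise a failure, so we can observe whether Service A degrades gracefully.

```python
import time


class ServiceB:
    """Hypothetical downstream dependency."""

    def price(self, item: str) -> float:
        return {"book": 12.0, "pen": 1.5}[item]


class FaultInjector:
    """Hypothetical wrapper that adds latency or raises a failure before
    forwarding the call to the wrapped service."""

    def __init__(self, target, latency_s: float = 0.0, fail: bool = False):
        self.target = target
        self.latency_s = latency_s
        self.fail = fail

    def price(self, item: str) -> float:
        if self.latency_s:
            time.sleep(self.latency_s)  # injected latency
        if self.fail:
            raise ConnectionError("injected failure")  # injected fault
        return self.target.price(item)


class ServiceA:
    """Hypothetical caller that should degrade gracefully when B misbehaves."""

    def __init__(self, service_b, default_price: float = 0.0):
        self.service_b = service_b
        self.default_price = default_price

    def quote(self, item: str) -> dict:
        try:
            price = self.service_b.price(item)
            return {"item": item, "price": price, "degraded": False}
        except ConnectionError:
            # Fallback path: this is the behaviour the experiment probes.
            return {"item": item, "price": self.default_price, "degraded": True}
```

A plain integration test would call `ServiceA(ServiceB()).quote("book")`. The chaos variant calls `ServiceA(FaultInjector(ServiceB(), fail=True)).quote("book")` and checks that the result is marked as degraded instead of raising an error.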
We call it Chaos Engineering rather than chaos testing because we are testing the unknowns. Chaos Engineering consists of thoughtful, planned experiments designed to reveal the weaknesses in our systems: we inject failure into systems to make them more reliable.
Chaos Engineering is not meant to replace unit or integration testing. They are meant to work together in harmony to give you the highest availability possible, to ensure your customers have a great experience, and to keep your business up and running.
Benefits of Testing the Unknown
Prevent Expensive Outages: Avoid costly downtime. Minimize the risk of system failure by proactively testing for weaknesses before they become outages.
- Identify: Uncover critical failures BEFORE they impact customers
- Accelerate: Reduce detection and resolution time for incidents
- Validate: Test your disaster recovery mechanisms to prevent a false sense of security
Shorten Development, Deployment, and Migration Cycles: Prevent rollbacks and service disruption by identifying weak points in your system before launch.
- Deliver zero-regression, on-time, on budget migrations
- Ship more reliable code, more often
- Train the next generation of SREs with real-world scenarios
Win Customer Trust: Customer expectations have changed. Make sure your application delivers a seamless experience, every time.
- Prepare for launches and high-scale events
- Deliver a seamless experience and win customer trust
- Prevent failures from impacting your reputation
What is the Difference Between Chaos Engineering and Performance Testing?
This can also be seen as a comparison of load testing vs. stress testing vs. performance testing vs. chaos testing. Even though all of these tests share one common goal, to prevent system failures, they are designed to test different parameters. So what are performance, load, stress, and chaos tests?
- Performance testing checks the system’s performance in terms of speed and reliability under different loads. It covers the behavior of the system when the load is normal, high, or low.
- While a performance test covers speed, scalability, stability, and reliability, load tests are designed to check the maximum number of users that can access the system simultaneously and what load will cause a break. When a team runs a load test, it aims to check the performance of the system under the expected peak load. When a company needs to know whether its system can withstand an extreme load, it runs a stress test. This kind of test shows how robust the system is.
- And last but not least, chaos testing. This is an approach pioneered by Netflix. It holds that failures and breakdowns are inevitable, so why not deliberately inject failure to see where the weaknesses of your system are? The advantage of this approach is that drawbacks and issues are detected before they occur unexpectedly.
Running Chaos Engineering experiments
- Start with the Hypothesis: State the question that you are trying to answer and what you think the results will be. For example, if your experiment tests whether your web server can handle increased load, your hypothesis might state that “as CPU usage increases, request throughput remains consistent.”
- Define your Blast Radius: The blast radius includes any components affected by the test. A smaller blast radius will limit the potential damage done by the test. It is recommended to start with the smallest blast radius possible. Once you are more comfortable running chaos experiments, you can increase the blast radius to include more components.
- Run the Experiment: Before you begin, make sure you have a way to stop the experiment and revert any changes it introduces.
- Analyze the Data: Does it confirm or reject your hypothesis? Use the results to address the failure points in your system and refine your experiment.
- Share the Results: Once you have completed your experiment and analyzed the data, share your results and the failures you found widely in the organization. Sharing the results allows the organization to understand how the practice of chaos engineering leads to a more reliable system, either by validating a hypothesis or by discovering a potential failure.
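The steps above can be sketched as a small Python structure. This is a minimal illustration, not how any particular tool implements experiments: `Experiment`, `Result`, and `run` are hypothetical names, and the fault, rollback, and steady-state check are supplied by the caller as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str                    # what we expect to stay true
    blast_radius: List[str]            # components the fault may touch
    inject: Callable[[], None]         # starts the fault
    rollback: Callable[[], None]       # reverts whatever inject() changed
    steady_state: Callable[[], bool]   # True while the hypothesis holds


@dataclass
class Result:
    hypothesis: str
    blast_radius: List[str]
    confirmed: bool

    def summary(self) -> str:
        """One-line report that can be shared with the organization."""
        verdict = "confirmed" if self.confirmed else "rejected"
        targets = ", ".join(self.blast_radius)
        return f"Hypothesis {verdict}: {self.hypothesis} (targets: {targets})"


def run(experiment: Experiment) -> Result:
    # Do not start from an unhealthy system: the result would be meaningless.
    if not experiment.steady_state():
        raise RuntimeError("system is not healthy; do not start the experiment")
    experiment.inject()
    try:
        confirmed = experiment.steady_state()
    finally:
        experiment.rollback()  # always revert, even if the check raises
    return Result(experiment.hypothesis, experiment.blast_radius, confirmed)
```

Keeping `rollback` in a `finally` block reflects the advice to have a way to revert changes before the experiment begins: the revert path is part of the experiment definition rather than an afterthought.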
Understanding Abort Conditions
Safety is one of the most important factors when running chaos engineering experiments. This is why we limit the blast radius.
Beyond the blast radius, you should always have an abort condition defined for your experiment.
Abort Condition – An abort condition is a predetermined situation or trigger for ending the experiment. Setting an abort condition is more important in chaos engineering than in other forms of testing because of the increased likelihood of unintended or unexpected consequences from an experiment.
Chaos engineers sometimes employ destructive tactics to disprove their hypotheses. An unexpectedly large blast radius (where more systems are affected than was planned) could cause significant production downtime. Abort conditions are the system conditions that tell us when we should stop the chaos experiment to avoid accidental damage. Status checks are a way to automate abort conditions during a scenario.
NOTE: It is important to set your abort conditions before you start your experiment.
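As a rough sketch of an automated abort condition, the Python function below walks through a series of error-rate readings and halts the experiment as soon as one exceeds a threshold. The names `run_with_abort`, `samples`, and `halt` are assumptions made for this example; in a real setup the readings would come from a monitoring system and `halt` would stop the running attack.

```python
from typing import Callable, Iterable


def run_with_abort(
    samples: Iterable[float],
    abort_threshold: float,
    halt: Callable[[], None],
) -> bool:
    """Return True if the experiment ran to completion, False if aborted.

    `samples` stands in for periodic status checks (for example, the
    error rate observed at each check interval).
    """
    for error_rate in samples:
        if error_rate > abort_threshold:
            halt()  # stop the experiment before more damage is done
            return False
    return True
```

The threshold is passed in before the loop starts, which mirrors the note above: the abort condition is decided before the experiment, not while it is running.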
What is an attack in Chaos Engineering?
An attack is a term for creating or injecting failure in some part of a system, such as causing networking problems or exhausting resources. Attacks are a way to test your system to better understand how it operates in dynamic environments and to uncover vulnerabilities.
Types of Attack
- Resource Attack
- State Attack
- Network Attack
Resource Attack
Resource attacks involve artificially increasing the load on a server’s CPU, memory, disk, or I/O to see how it responds. Resource attacks let you prepare for changes in load. They can validate autoscaling rules and monitoring and alerting configuration, and make sure the system is stable under heavy load.
Resource Attack | Impact |
CPU | Generates high load for one or more CPU cores. |
Memory | Allocates a specific amount of RAM. |
IO | Puts read/write pressure on I/O devices such as hard disks. |
Disk | Writes files to disk to fill it to a specific percentage. |
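To make the Disk row more concrete, here is a small Python sketch of the arithmetic behind "fill the disk to a specific percentage". It only plans the attack and writes nothing. The function names are hypothetical, and this is an illustration of the calculation, not a description of how any tool implements its disk attack.

```python
import shutil


def bytes_to_reach(total: int, used: int, target_percent: float) -> int:
    """Bytes that must be written so that used/total reaches target_percent."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    target_used = int(total * target_percent / 100)
    # If the volume is already fuller than the target, nothing needs writing.
    return max(0, target_used - used)


def plan_disk_fill(path: str, target_percent: float) -> int:
    """Read current usage for the volume holding `path` and plan the fill."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return bytes_to_reach(usage.total, usage.used, target_percent)
```

For a volume of 1000 bytes with 400 used, `bytes_to_reach(1000, 400, 90)` returns 500, the amount that has to be written to reach 90% usage.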
State Attack
State attacks involve changing the state of the application environment, for example through power outages, node failure, clock drift, or application crashes, to test against unexpected changes in your environment.
State Attack | Impact |
Shutdown | Performs a shutdown (and an optional reboot) on the host operating system to test how your system behaves when losing one or more cluster machines |
Time Travel | Changes the host’s system time, which can be used to simulate adjusting to daylight saving time and other time-related events. |
Process Killer | Kills a specified process, which can be used to simulate application or dependency crashes. |
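A Time Travel style experiment can be approximated in application code by making the clock injectable. The Python sketch below is an assumption-laden illustration: `SkewedClock` and `is_token_valid` are hypothetical names, and the skew is applied inside the process rather than to the host’s system time as a real state attack would do.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


class SkewedClock:
    """Callable clock that returns the base time shifted by a fixed offset."""

    def __init__(self, offset: timedelta, base: Callable[[], datetime] = utc_now):
        self.offset = offset
        self.base = base

    def __call__(self) -> datetime:
        return self.base() + self.offset


def is_token_valid(expires_at: datetime, now: Callable[[], datetime] = utc_now) -> bool:
    """Time-dependent logic under test: a token is valid until it expires."""
    return now() < expires_at
```

Passing `SkewedClock(timedelta(hours=2))` as `now` shows how time-dependent logic behaves when the clock jumps forward, for example whether sessions expire early.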
Network Attack
Network attacks show you the impact of lost or delayed traffic to your application. Test how your service behaves when you are unable to reach one of your dependencies, internal or external. Limit the impact to only the traffic you want to test by specifying ports, hostnames, and IP addresses.
Network Attack | Impact |
Blackhole | Drops all matching network traffic. |
Latency | Injects latency into all matching egress network traffic. |
Packet Loss | Induces packet loss into all matching egress network traffic. |
DNS | Blocks access to DNS servers |
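Limiting an attack to specific ports and hostnames can be sketched as a matching rule. The Python below simulates a blackhole over an in-memory list of packets. `Packet`, `BlackholeRule`, and `deliver` are hypothetical names for illustration, and no real network traffic is touched.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Packet:
    host: str
    port: int


@dataclass(frozen=True)
class BlackholeRule:
    hosts: Optional[FrozenSet[str]] = None  # None means "any host"
    ports: Optional[FrozenSet[int]] = None  # None means "any port"

    def matches(self, packet: Packet) -> bool:
        host_ok = self.hosts is None or packet.host in self.hosts
        port_ok = self.ports is None or packet.port in self.ports
        return host_ok and port_ok


def deliver(packets: Iterable[Packet], rule: BlackholeRule) -> List[Packet]:
    """Drop every packet that matches the rule; deliver the rest unchanged."""
    return [p for p in packets if not rule.matches(p)]
```

A rule such as `BlackholeRule(hosts=frozenset({"db.internal"}), ports=frozenset({5432}))` drops only traffic to that one dependency, which keeps the blast radius small while still showing how the service behaves without it.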
Is There a Downside to Testing the Unknown?
Critics feel that chaos engineering is just another industry buzzword or cover-up for apps that were poorly designed in the first place. Some chaos engineering proponents opine that this is the result of an ego-driven mentality. If you’re confident in your capabilities and work product, there should be nothing to fear in testing their limits.
Chaos engineering is meant to counter the eight fallacies of distributed computing that plague many developers and software engineers who are new to distributed systems, while providing a system for more refined testing.
These incorrect assumptions are that:
- Networks are reliable
- Latency is zero
- Bandwidth is infinite
- Networks are secure
- Topology never changes
- Each system has only one admin, who also doesn’t change
- Transport cost is zero
- Networks are homogeneous
Why Is It Important to Test the Unknown?
Because systems are changing. Traditionally, QA runs a variety of tests and test types to proactively seek out problems long before the code ends up in production. These tests are run at the end of a build and before the code is deployed publicly; typically, testing is done in a staging or test environment.
So far, so good, if we are operating in a traditional software development and deployment model. Monolithic designs and deployments to corporate-owned machines give a great amount of control. There is stability inherent in this control. This makes the staging and test environments resemble the production environment and permits testing in them to be successful.
Distributed systems are different. The cloud is different. We don’t control the infrastructure. It is constantly changing. The infrastructure changes according to our design with individual services and microservices and load balancing spinning up additional compute nodes or removing them as needed. Failover systems adjust themselves to ensure risks are managed. The constant change causes unexpected, emergent behaviors. These are behaviors that we can’t always predict, but which we can reproduce and cause using a form of testing called Chaos Engineering.
Chaos engineering is gaining popularity with some of the industry’s largest IT and testing teams, for several reasons:
- Testing teams can more quickly identify and resolve issues that might not be captured by other types of testing
- Unplanned downtime and outages are far less likely to occur due to proactive and constant testing
- Strengthens system integrity
- Great for large, complex systems (e.g., cloud-based applications and services) as well as for scaling up
Best Chaos Engineering Tools
- Chaos Mesh
- Chaos Monkey
- Gremlin
- ChaosBlade
Chaos Mesh
Chaos Mesh is an open-source cloud-native tool specifically designed for Chaos Engineering. Using various fault simulations, Chaos Mesh helps organizations determine system abnormalities that may occur during various portions of the development, testing, and production stages.
Chaos Monkey
Chaos Monkey is an open-source chaos tool created by Netflix developers. It was developed to help test their system reliability and resiliency after moving to the AWS cloud. The software functions by implementing continuous unpredictable attacks. Chaos Monkey uses the fundamental approach of terminating one or more virtual machine instances.
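The core idea of terminating a random instance can be illustrated with a short Python sketch. This is not Chaos Monkey’s actual code or API: `terminate_random_instance` is a hypothetical function that works on an in-memory dictionary of instance states instead of real virtual machines.

```python
import random
from typing import Dict, Optional


def terminate_random_instance(
    group: Dict[str, str], rng: Optional[random.Random] = None
) -> Optional[str]:
    """Mark one running instance in the group as terminated.

    Returns the id of the terminated instance, or None if nothing is running.
    """
    rng = rng or random.Random()
    running = sorted(i for i, state in group.items() if state == "running")
    if not running:
        return None
    victim = rng.choice(running)
    group[victim] = "terminated"
    return victim
```

Passing a seeded `random.Random` makes a run repeatable, which helps when an experiment needs to be replayed after a fix.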
Gremlin
Gremlin is the first hosted Chaos Engineering service designed to improve web-based reliability. Offered as a SaaS (Software-as-a-Service) technology, Gremlin can test system resiliency using one of three attack modes. Users provide system inputs to determine which type of attack will provide the best results. Tests can be performed in conjunction with one another to facilitate comprehensive infrastructure assessments.
ChaosBlade
ChaosBlade is an open-source Chaos Engineering tool originally developed by Alibaba. It was created to ensure systems are fault tolerant as a means of improving business operations. ChaosBlade operates by creating chaos attacks in different types of environments ranging from the cloud to containers.
How to Run an Attack on a Host with Gremlin
Let’s walk through a simple Disk attack on a host using Gremlin.
- Step 1: Log into your Gremlin account
- Step 2: Click on the Attacks option on the homepage
- Step 3: Click on New Attack
- Step 4: Choose the “Target all host” option
- Step 5: Click on the “Choose a gremlin” option
- Step 6: Choose Category: Resource and Attacks: Disk
- Step 7: Add the Length of the attack and the Volume Percentage
- Step 8: Click on Run the attack
- Step 9: The attack will start
- Step 10: We will be able to see the spike in the monitoring tool
Conclusion
The term “chaos engineering” may sound like an oxymoron or even the name of an evil force from a sci-fi movie, but it’s actually a prevailing approach that’s making modern technology architectures more resilient.
Chaos Engineering is the latest method of software testing to eliminate unpredictability by checking the system’s ability to tolerate unavoidable failures. The cost of such failures is high. For instance, in 2017, 98% of organizations said that the cost of an hour of downtime exceeded $100,000. A few years ago, Gartner estimated that outages can cost organizations anywhere between $140,000 and $540,000 per hour.
The CEO of British Airways recently mentioned that due to downtime in 2017, thousands of passengers were left stranded, which cost the company $102.19 million.
Chaos engineering is not about breaking things, and it never has been; it is about learning. Chaos Engineering is all about the curiosity to learn about the idiosyncrasies and peculiarities of your system and to make it resilient, fault-tolerant, and durable.
EuroSTAR Huddle shares articles from our community. Check out our collection of eBooks from test experts and come together with the community in-person at the annual EuroSTAR Software Testing Conference. The EuroSTAR Conference has been running since 1993 and is the largest testing event in Europe, welcoming 1000+ software testers and QA professionals every year.