Chaos Engineering with Litmus on AKS

Litmus Chaos Engineering

Litmus is an open source chaos engineering platform for Kubernetes. Used alongside sound chaos engineering practices, it gives you a repeatable resilience testing strategy that can help improve the stability of your system. In this blog post, we will set up Litmus and run a first chaos scenario, then discuss chaos engineering concepts, the benefits of applying them with Litmus, and how to implement chaos engineering with Litmus.

Installation of Litmus on Kubernetes

Let’s start with the prerequisites:

  • Kubernetes 1.17 or later (ideally a recent release such as 1.26)
  • A Persistent Volume of 20GB
    • A persistent volume of 20GB is recommended, but you can start with 1GB for test purposes. This PV is used as persistent storage for the chaos config and chaos metrics in the Portal. By default, the Litmus install uses the default storage class to allocate the PV.
  • Helm 3 or kubectl
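
The version prerequisite can be checked from the command line before installing anything. The sketch below is only an illustration: version_ge is a hypothetical helper name, it compares major.minor only, and the usage comment assumes the kubelet version of the first node is a fair stand-in for the cluster version.

```shell
# version_ge HAVE WANT: succeeds when HAVE >= WANT, comparing major.minor only.
# A leading "v" (as in v1.26.3) is ignored. Hypothetical helper, not part of Litmus.
version_ge() {
  have=${1#v}; want=${2#v}
  have_major=${have%%.*}; have_rest=${have#*.}; have_minor=${have_rest%%.*}
  want_major=${want%%.*}; want_rest=${want#*.}; want_minor=${want_rest%%.*}
  if [ "$have_major" -ne "$want_major" ]; then
    [ "$have_major" -gt "$want_major" ]
  else
    [ "$have_minor" -ge "$want_minor" ]
  fi
}

# Example usage against the current kubectl context:
#   v=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')
#   version_ge "$v" 1.17 && echo "Kubernetes $v is new enough for Litmus"
#   command -v helm >/dev/null || echo "helm not found on PATH"
```

The comparison stays in plain POSIX shell so it does not need jq or any other extra tool on the workstation.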

Like all cloud native projects, you have options for installation: you can use the Harness Chaos Engineering SaaS, or you can install Litmus on your own cluster.

Self-hosted – Installation using Litmus with Helm

Step-1 Add the Litmus Helm repo

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo list

Step-2 Create the namespace in which you want to install Litmus ChaosCenter

kubectl create ns chaos-center

Step-3 Install Litmus ChaosCenter

helm install chaos litmuschaos/litmus --namespace=chaos-center --set portal.frontend.service.type=NodePort

Okay, we are up and running, so let’s check the services in this namespace as well:

kubectl get svc -n chaos-center
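
Before opening the UI, it can help to wait until the ChaosCenter pods report Ready. This is a minimal sketch: wait_for_chaoscenter is a hypothetical helper, the 300s default timeout is an arbitrary assumption, and the KUBECTL variable exists only so the command can be previewed with KUBECTL=echo instead of being run.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to print the command instead of running it

# wait_for_chaoscenter [NAMESPACE] [TIMEOUT]
# Blocks until every pod in the namespace is Ready, or the timeout expires.
wait_for_chaoscenter() {
  ns="${1:-chaos-center}"
  timeout="${2:-300s}"
  $KUBECTL wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
}

# Example usage:
#   wait_for_chaoscenter chaos-center 300s
```

If the namespace ever contains completed Job pods, waiting on all pods can time out, so narrow the selection in that case.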

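Run the following command to forward a local port to the frontend service so we can reach the UI. If the frontend service name in the output above differs from the one below, use the name shown there.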

kubectl -n chaos-center port-forward service/chaos-litmus-frontend 9091

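With the port-forward running, open http://localhost:9091 in a browser. You should see the ChaosCenter login page, where we will sign in with the default username and password: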

username: admin
password: litmus

Notice that by default we have the Owner role.

If you are following along, run through the wizard to start a chaos scenario.

For the chaos scenario, I’m choosing one of the pre-defined Chaos Scenario templates: podtato-head.

Let’s select Edit Sequence.

The experiments are represented graphically by test tube icons, which you can drag into a different sequence.

You can also view the YAML file. Click Next, and we can adjust the weight of each experiment. I’m leaving this at the default for now and selecting Next.
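
To see why the weights matter: as I understand it, Litmus rolls the experiment results up into a resilience score that is a weighted average of each experiment's probe success percentage. The sketch below only illustrates that arithmetic under this assumption; resilience_score is a hypothetical helper and the input numbers are made up.

```shell
# resilience_score: reads "WEIGHT SUCCESS_PERCENTAGE" pairs from stdin, one per
# experiment, and prints sum(weight * success) / sum(weight) to one decimal place.
resilience_score() {
  awk '{ total += $1 * $2; weights += $1 }
       END {
         if (weights > 0) printf "%.1f\n", total / weights
         else print "n/a"
       }'
}

# Example: one experiment with weight 10 that fully passed and one with
# weight 5 that fully failed.
printf '10 100\n5 0\n' | resilience_score   # prints 66.7
```

A failed experiment with a high weight pulls the score down more than one with a low weight, which is why the weights are worth revisiting once you know which faults matter most to you.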

This is essentially our status screen that shows the scenarios we are running.

We can see our self-hosted agent on the ChaosDelegate tab.

This agent was established as the target of our scenario (experiment).

Let’s navigate to ChaosHubs

We can see a community-maintained list of chaos experiments.

I’ve selected Azure/azure-instance-stop.

I’ll then move to Analytics

We can see a visualization of our stats, and we can also see that the chaos scenario we ran failed.

Navigating back to Chaos Scenarios, we can view our test results.

Selecting the failed pod-delete test shows further details.

We have the Logs tab and the Chaos Results tab.
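
The same verdicts can be read from the cluster, because Litmus stores them in ChaosResult custom resources. A minimal sketch follows; the function names are hypothetical, and the chaos-center default assumes the self-hosted agent runs in the namespace we installed into, so adjust it if your agent lives elsewhere.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to preview the commands

# list_chaos_results [NAMESPACE]: lists the ChaosResult resources in a namespace.
list_chaos_results() {
  $KUBECTL get chaosresults -n "${1:-chaos-center}"
}

# show_chaos_result NAME [NAMESPACE]: prints the details of one result,
# including its verdict.
show_chaos_result() {
  $KUBECTL describe chaosresult "$1" -n "${2:-chaos-center}"
}

# Example usage (the result name comes from the output of the first command):
#   list_chaos_results chaos-center
#   show_chaos_result <result-name> chaos-center
```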

I’m going to run an Azure test by going through the UI again, to see our results in a different scenario.

After creation, we go back to our dashboard with the new experiment running. We will return once the test is over to review our findings.

We have another failure. The error shows that the experiment was unable to patch the Chaos Resources required for the Chaos Experiment.

The next line is a warning that the field experiments[0].spec.rank could not be found. This leads me to believe I need to check the YAML manifest that is deployed via this chart, and perhaps dig into some documentation on it.
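
One way to start digging is to check whether the ChaosEngine CRD installed on the cluster defines the field named in the warning at all. This is only a rough diagnostic sketch: crd_field_count is a hypothetical helper, and a plain text match on the CRD YAML can over- or under-count.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# crd_field_count FIELD: prints how many lines of the installed ChaosEngine CRD
# contain "FIELD:". A count of 0 suggests the schema does not define the field.
crd_field_count() {
  $KUBECTL get crd chaosengines.litmuschaos.io -o yaml | grep -c "$1:"
}

# Example usage:
#   crd_field_count rank
```

If the count is 0, a likely explanation is a version mismatch between the scenario manifest and the CRDs on the cluster, though that is a guess to verify rather than a confirmed cause.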

Make sure you delete the resources we’ve used so you don’t incur further costs.
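
For the Litmus side of the cleanup, the sketch below removes the Helm release and the namespace created earlier. cleanup_litmus is a hypothetical helper; the defaults match the release name (chaos) and namespace (chaos-center) used in this post.

```shell
HELM="${HELM:-helm}"            # set HELM=echo and KUBECTL=echo to preview
KUBECTL="${KUBECTL:-kubectl}"

# cleanup_litmus [NAMESPACE] [RELEASE]
# Uninstalls the Helm release, then deletes the namespace and anything left in it.
cleanup_litmus() {
  ns="${1:-chaos-center}"
  release="${2:-chaos}"
  $HELM uninstall "$release" -n "$ns"
  $KUBECTL delete namespace "$ns"
}

# Example usage:
#   cleanup_litmus chaos-center chaos
```

This does not touch the AKS cluster itself or any Azure resources targeted by the experiments. Those need to be deleted separately, for example by removing the cluster or its resource group.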

What Are the Chaos Engineering Concepts?

The practice of chaos engineering is based on the principle that organizations must embrace uncertainty and failure in order to build more resilient systems. By purposely injecting failures into a system and observing how it responds, engineers can gain valuable insights into the system’s behavior and identify potential weaknesses.

Chaos engineering is not about breaking things for the sake of breaking them. Rather, it is a disciplined approach to testing system resilience that can help prevent outages and failures in production.

There are four key principles of chaos engineering:

1) Systems are most likely to fail when they are under stress.

2) The best way to find weaknesses in a system is to expose it to failure.

3) Injecting failure should be done in a controlled manner.

4) Testing should be continuous so that issues can be identified and fixed quickly.
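
The controlled-injection idea can be sketched as a small wrapper: check the steady state, inject the fault, then check again. This is a generic illustration rather than how Litmus works internally; run_experiment and both command strings are hypothetical.

```shell
# run_experiment STEADY_STATE_CMD INJECT_CMD
# Returns 0 if the steady state holds after the fault, 1 if it is lost,
# and 2 if the system was already unhealthy (no fault is injected then).
run_experiment() {
  steady_state="$1"
  inject="$2"
  if ! sh -c "$steady_state"; then
    echo "aborted: system not healthy before injection"
    return 2
  fi
  sh -c "$inject"
  if sh -c "$steady_state"; then
    echo "passed: steady state held after fault"
    return 0
  fi
  echo "failed: steady state lost after fault"
  return 1
}

# Example usage (the URL and the label are placeholders):
#   run_experiment "curl -fsS http://localhost:8080/healthz >/dev/null" \
#                  "kubectl delete pod -l app=demo -n default"
```

Refusing to inject when the system is already unhealthy is what keeps the experiment controlled: a failure that was present beforehand tells you nothing about the fault.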

What is Chaos Monkey?

Chaos Monkey is an open source tool from Netflix that helps engineers test the resilience of their systems by randomly terminating instances (virtual machines and containers) that run in production. Because any instance can disappear at any time, engineers are pushed to build services that tolerate outages and failures.
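
The core idea, picking a victim at random and terminating it, can be imitated at pod level in a few lines. This is a toy illustration and not Chaos Monkey itself; pick_random_line and delete_random_pod are hypothetical helpers, and the namespace and label in the usage comment are placeholders.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# pick_random_line: prints one line chosen at random from stdin (nothing if empty).
pick_random_line() {
  awk 'BEGIN { srand() }
       { lines[NR] = $0 }
       END { if (NR > 0) print lines[int(rand() * NR) + 1] }'
}

# delete_random_pod NAMESPACE LABEL_SELECTOR
# Deletes one randomly chosen pod that matches the selector.
delete_random_pod() {
  victim=$($KUBECTL get pods -n "$1" -l "$2" -o name | pick_random_line)
  if [ -n "$victim" ]; then
    $KUBECTL delete -n "$1" "$victim"
  else
    echo "no pods matched $2 in $1" >&2
    return 1
  fi
}

# Example usage:
#   delete_random_pod default app=nginx
```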

What is Resilience Testing?

Resilience testing is a type of chaos engineering that specifically focuses on testing system resilience. This can be done by purposely injecting failures into a system and observing how it responds. By doing so, engineers can gain valuable insights into the system’s behavior and identify potential weaknesses.

There are three key principles of resilience testing:

1) Systems are most likely to fail when they are under stress.

2) The best way to find weaknesses in a system is to expose it to failure.

3) Testing should be continuous so that issues can be identified and fixed quickly.

How to Leverage Litmus Chaos Engineering with Chaos Engineering Concepts

There are many benefits to applying chaos engineering concepts with Litmus. By doing so, you can create tests that identify potential issues in your system before they cause problems in production. These tests can also improve the overall resilience of your system by exposing weaknesses that can then be addressed.

Some of the specific benefits of applying chaos engineering with Litmus include:

1. Improved test coverage – Chaos experiments exercise failure paths that ordinary functional tests rarely reach, making it more likely that potential issues will be identified before they cause problems in production.

2. More comprehensive testing – Chaos engineering supplies the method (form a hypothesis, inject a failure, observe the result), while Litmus supplies pre-defined experiments and the tooling to run them, so checks that would otherwise be manual can be run repeatably.

3. Increased test accuracy – Litmus records a verdict and logs for each experiment, which gives you concrete evidence for verifying whether the system behaved as expected.

4. Enhanced test automation – Because Litmus scenarios are defined in YAML and can be run from the UI, you can automate a larger portion of your resilience testing, which can save time and resources in the long run.

5. Improved system resilience – As mentioned above, the main goal of practicing chaos engineering with Litmus is to improve the resilience of your system. By identifying potential weaknesses early on, you can address them before they cause problems in production. These tests can also validate fixes or enhancements made to your system to ensure that they address the identified issues.

How to Implement Chaos Engineering with Litmus

There are many ways to implement chaos engineering with Litmus. The specific approach that you take will depend on the needs of your system and the goals of your tests. However, there are some general tips that can help you get started:

1. Define your goals – Before you begin, it is important to define the goals of your tests. What are you hoping to achieve? What issues are you trying to identify? Having a clear understanding of your goals will help you create more effective tests.

2. Select the right tool – Once you have defined your goals, the next step is to select the right tool for the job. There are many different chaos engineering tools available, so it is important to select one that will best meet your needs. For example, if you are looking for a tool that can automate a large portion of your testing process, Litmus may be a good option.

3. Set up your environment – Once you have selected a tool, the next step is to set up your testing environment. This includes creating any necessary accounts or configurations and installing any required software. It is important to make sure that everything is set up correctly before proceeding with your tests.

4. Run your tests – Once everything is set up and ready to go, it’s time to run your tests! Depending on the tool that you’re using, this process will vary somewhat. However, in general, you’ll need to inject faults or perturbations into your system and observe how it responds. Be sure to document all of your findings so that you can analyze them later on.

5. Analyze your results – After running your tests, it’s time to analyze the results and determine what they mean for your system. Are there any potential issues that need to be addressed? Are there any areas where improvements can be made? Use these findings to improve the design of your system and make it more resilient against potential problems.
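
To make the "run your tests" step concrete outside the UI, the sketch below writes a ChaosEngine manifest for a pod-delete experiment to a file. Everything specific in it is an assumption for illustration: the engine name, the default namespace, the app=nginx target, the pod-delete-sa service account, and the duration values. It also assumes the pod-delete ChaosExperiment is already installed in that namespace.

```shell
# write_pod_delete_engine FILE: writes an example ChaosEngine manifest to FILE.
write_pod_delete_engine() {
  cat > "$1" <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos            # hypothetical engine name
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default             # namespace of the target application (assumed)
    applabel: app=nginx        # label of the target application (assumed)
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # hypothetical service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds of chaos in total
              value: "30"
            - name: CHAOS_INTERVAL         # seconds between pod deletions
              value: "10"
EOF
}

# Example usage:
#   write_pod_delete_engine pod-delete-engine.yaml
#   kubectl apply -f pod-delete-engine.yaml
```

Keeping the manifest in a file means it can be reviewed and versioned like any other test, which fits the goal of documenting your findings.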

Examples of Litmus and Chaos Engineering Strategies

There are many different ways to apply chaos engineering concepts with Litmus. Here are a few examples of strategies that you may want to consider:

1. Automated testing – One of the key strengths of Litmus is its ability to automate tests. This can be leveraged to create automated chaos engineering tests that help identify potential issues in your system.

2. Resilience testing – As mentioned above, one of the main goals of chaos engineering is to improve the resilience of your system. With Litmus, you can create tests that identify weaknesses in your system and validate the fixes or enhancements made to address them.

3. End-to-end testing – Another common goal of chaos engineering is to test systems end-to-end in order to identify potential issues with interactions between different components. This can be accomplished by sequencing several Litmus experiments in one scenario so that the test covers the entire system from start to finish.

4. Performance testing – In addition to identifying potential issues, another goal of chaos engineering is to test how systems behave under resource pressure. This can be done by using Litmus resource stress experiments such as pod-cpu-hog or pod-memory-hog and then observing how the system responds.

Conclusion

The benefits of chaos engineering are well-documented, and combining the power of Litmus with chaos engineering concepts can help you take your resilience testing to the next level. If you’re looking to implement chaos engineering in your organization, Litmus is a great tool to help you get started. Thanks for reading!