Chaos Mesh on EKS

Introduction

Ensuring the reliability and resilience of modern cloud-native applications is crucial, especially as services scale to support more users and traffic. One effective approach is chaos engineering: intentionally introducing failures, delays, and other adverse conditions to evaluate a system’s response and ability to recover. By proactively testing how an application behaves under chaotic conditions, weaknesses can be identified and addressed before real-world outages occur. Unit testing has been around for a long time and application security isn’t a new concept, but chaos engineering takes a different perspective: when something goes wrong, can your application handle it? The most famous company to introduce this practice into production is Netflix, with its open-source tool Chaos Monkey.

Configuration

For this demonstration I’m running the following in EKS:

  • Kubernetes v1.28
  • Helm
  • Terraform
  • kubectl (installed)
  • API access scoped to a private IP (no public access). If you follow along, the endpoint should stay private and not be publicly accessible.

Clone our repository

git clone https://github.com/sn0rlaxlife/eks-chaos-mesh.git
cd aws-eks
terraform init
terraform apply --auto-approve  # edit the config first to use your own private IP

This should be the plan output if you’re following along. We’ve added tags, as shown in the image, to quickly identify resources, and to save some money I’ve set the capacity type to SPOT.

In the console after creation we can see that our cluster is up and running on 1.28. Select the cluster to gather the information needed to grab credentials locally.

Notably, on the Add-ons tab of our cluster we can see the exact build versions of the components listed.

Let’s run the following command to grab a kubeconfig to our local machine.

aws eks update-kubeconfig --region us-east-1 --name eks-cluster-prod

Now let’s navigate to our chaosmesh folder, which uses the Helm chart to install Chaos Mesh in our cluster.

We are pulling from the repository charts.chaos-mesh.org and specifying that we want a new namespace.

If this gives you an issue, we can run the following.

helm repo add chaos-mesh https://charts.chaos-mesh.org
kubectl create ns chaos-mesh
# Chaos Mesh defaults to /var/run/docker.sock; EKS 1.24+ runs containerd,
# so point the chaos daemon at the containerd socket instead
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --version 2.6.2 \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock
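Once the chart is installed, a quick way to confirm the components are healthy is to list the pods in the namespace (a sketch; this assumes a live cluster and a working kubeconfig):

```shell
# Confirm the Chaos Mesh components are running;
# typically this shows chaos-controller-manager, chaos-daemon
# (one per node), and chaos-dashboard pods in a Running state
kubectl get pods -n chaos-mesh
```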

The Amazon Web Services console also provides some detailed metrics from the events of the node; you can see the following context on our node.

After installation you’ll see a screen like the following.

If we need to access our dashboard, we can port-forward, since the Chaos Mesh dashboard listens on port 2333.

kubectl port-forward -n chaos-mesh svc/chaos-dashboard 2333:2333

We need to generate a token, so click on Generate Token.

I’ve selected the Cluster Scoped role (this is excessive and in practice should be scoped down to specific namespaces).

We can see that we have full cluster access, along with the verbs we can operate with.
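For reference, the manifest the dashboard generates for a cluster-scoped role follows the standard Chaos Mesh RBAC pattern; a rough sketch is below (the account, role, and binding names are assumptions):

```yaml
kind: ServiceAccount
apiVersion: v1
metadata:
  namespace: default
  name: account-cluster-manager   # assumed name
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: role-cluster-manager      # assumed name
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "watch", "list"]
- apiGroups: ["chaos-mesh.org"]
  resources: ["*"]
  verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: bind-cluster-manager      # assumed name
subjects:
- kind: ServiceAccount
  name: account-cluster-manager
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: role-cluster-manager
```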

kubectl create token <account-cluster-manager>

Copy the token <redact> and don’t post it anywhere; then, on the localhost address we are port-forwarding, enter it along with the account name listed and authenticate.

Let’s create an experiment with the New Experiment button; this will direct us to the configuration of the experiment.

Let’s go with the Restart EC2 injection.
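The experiment the dashboard builds can equivalently be expressed declaratively as an AWSChaos manifest. A minimal sketch follows (the instance ID and Secret name are placeholders; the Secret is assumed to hold AWS credentials):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: AWSChaos
metadata:
  name: ec2-restart-demo
  namespace: chaos-mesh
spec:
  action: ec2-restart          # reboot the targeted EC2 instance
  awsRegion: us-east-1
  ec2Instance: i-0123456789abcdef0   # placeholder instance ID
  secretName: cloud-key-secret       # placeholder Secret with AWS credentials
```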

Once we submit, the experiment starts its process. We can also navigate to CloudWatch and see what is going on in our cluster by analyzing the CPU utilization metrics.

Let’s run another experiment going back to our dashboard.
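A pod-level fault such as pod-kill can also be defined as a manifest instead of through the dashboard; a minimal sketch (the target namespace here is an assumption):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
  namespace: chaos-mesh
spec:
  action: pod-kill     # kill the selected pod and let Kubernetes reschedule it
  mode: one            # target a single randomly selected pod
  selector:
    namespaces:
      - default        # assumed target namespace
```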

Interestingly, if we dive into one of the pods in our cluster, we can find the image it pulls from.

We can see a restart is initiated, but Kubernetes restarts our pod automatically and we are still up and running.

We also have a pod restarted again, as we can see from the age of 4 seconds, causing the pod to initialize its process again.

Now we can tear our cluster down to ensure we don’t incur more cost; as you’ve seen in the configuration, this runs t3.large instances.

terraform destroy --auto-approve

Summary

Chaos engineering encompasses testing resiliency and challenging your infrastructure logic to its bounds, utilizing the cloud’s capacity such as load balancing and scaling out/in. Depending on your use case, think about how this would work in the busy season, such as Black Friday (if you run an e-commerce application/website) or as a media organization running a CDN to distribute your media.

Ensure you tear down your resources to avoid costs; this cost me an extra dollar for leaving it on overnight with some testing. I will post more on using this tool and others on Elastic Kubernetes Service in the next couple of months. Feel free to use the GitHub repository, as it’s intended to help you learn the nuances of the Terraform configuration IaC.