Introduction
Azure Chaos Studio is a Microsoft Azure service for measuring and understanding the resilience of your applications and services. I’ve written about LitmusChaos in a previous blog post, but felt there was more to cover on this topic, as application resiliency is pivotal to an organization’s operations. Chaos Engineering is the practice of deliberately introducing failures or fault scenarios into distributed software to test the application’s ability to withstand them and, ideally, to prevent outages before they occur. Netflix was notably a major contributor in this space, practicing chaos engineering and open-sourcing Chaos Monkey; I’ll include the link in the references.
Exploration for Demo
To explore this in Azure Kubernetes Service (AKS), I’ve adapted an existing Terraform configuration and modified it to use the Helm repository, so we can deploy everything as IaC in your Azure subscription. The diagram serves as a reference; however, to keep our costs low we are deploying only one cluster, not the two shown, along with a few other modifications.
In the image above, I’ve created a diagram demonstrating the use of Azure Container Registry and its security capabilities: an image is pushed to the (private) registry and scanned before an approval gate, then deployed to our cluster, where we conduct our experiment.
For this tutorial we are going to deploy the cluster, so have the following items on hand:
- Azure Subscription
- Familiarity with Azure CLI
- kubectl (installed)
Getting Started
First, we need to clone the repository:
git clone https://github.com/sn0rlaxlife/aks-chaos-engineer.git
After running ls, our files should match those shown in the image. Whether you are running this locally or in Azure Cloud Shell, ensure you have exported the credentials needed for applying Terraform (e.g. ARM_SUBSCRIPTION_ID, ARM_CLIENT_ID, ARM_TENANT_ID, etc.).
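A quick way to confirm the credentials are exported before running Terraform is a small check like the one below. This is a minimal sketch, assuming service principal authentication with a client secret, where the azurerm provider reads the four ARM_* variables listed; the function name is ours, and if you authenticate through the Azure CLI instead (as is common in Cloud Shell) some of these may not be required.

```python
import os

# Environment variables the azurerm provider reads when authenticating
# as a service principal with a client secret (assumed auth method).
REQUIRED_VARS = (
    "ARM_CLIENT_ID",
    "ARM_CLIENT_SECRET",
    "ARM_SUBSCRIPTION_ID",
    "ARM_TENANT_ID",
)


def missing_terraform_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report anything that still needs to be exported in the current shell.
missing = missing_terraform_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All Terraform credentials are exported.")
```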
module "aks" {
  source = "Azure/aks/azurerm"
  version = "7.3.1"
  resource_group_name = azurerm_resource_group.aks.name
  kubernetes_version = var.kubernetes_version
  orchestrator_version = var.kubernetes_version
  prefix = "aks-chaos-mesh"
  network_plugin = "kubenet"
  vnet_subnet_id = lookup(module.aks-vnet.vnet_subnets_name_id, "subnet0")
  os_disk_size_gb = 50
  sku_tier = "Standard" # defaults to Free
  private_cluster_enabled = false
  rbac_aad = var.rbac_aad
  role_based_access_control_enabled = var.role_based_access_control_enabled
  http_application_routing_enabled = false
  enable_auto_scaling = true
  enable_host_encryption = false
  log_analytics_workspace_enabled = false
  agents_min_count = 1
  agents_max_count = 3
  agents_count = null # Please set `agents_count` `null` while `enable_auto_scaling` is `true` to avoid possible `agents_count` changes.
  agents_max_pods = 100
  agents_pool_name = "system"
  agents_availability_zones = ["1", "2"]
  agents_type = "VirtualMachineScaleSets"
  agents_size = var.agents_size
  agents_labels = {
    "nodepool" : "defaultnodepool"
  }
  agents_tags = {
    "Agent" : "defaultnodepoolagent"
  }
  ingress_application_gateway_enabled = false
  network_policy = "calico"
  net_profile_dns_service_ip = "10.0.0.10"
  net_profile_service_cidr = "10.0.0.0/16"
  key_vault_secrets_provider_enabled = true
  secret_rotation_enabled = true
  secret_rotation_interval = "3m"
  depends_on = [module.aks-vnet]
}
After inspecting our files, we can now run terraform init.
The output looks okay. It appears my main module has an update available; however, this shouldn’t interfere with our deployment.
Run terraform apply and confirm the deployment; we will then see the cluster provisioned from our code.
Next, we need to grab our credentials so our kubectl client can access the cluster:
az aks get-credentials --resource-group aks-chaos-mesh-rg --name aks-chaos-mesh-aks
Then we navigate to our cluster in the portal and select Properties. We are looking for the Resource ID; put this somewhere safe, as we will use it later for other commands. If you’d like to export it into a variable, you can run export RESOURCE_ID=<value>
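Later commands also rely on the subscription ID and resource group, both of which are embedded in the resource ID. The sketch below pulls them out, assuming the ID follows the usual alternating key/value layout (/subscriptions/<id>/resourceGroups/<name>/...); the function name and returned keys are ours, chosen for illustration.

```python
def parse_cluster_resource_id(resource_id):
    """Split an AKS resource ID into subscription, resource group, and cluster name."""
    segments = resource_id.strip("/").split("/")
    # Resource IDs alternate between a key segment and its value segment.
    pairs = {
        key.lower(): value
        for key, value in zip(segments[0::2], segments[1::2])
    }
    return {
        "subscription_id": pairs["subscriptions"],
        "resource_group": pairs["resourcegroups"],
        "cluster_name": pairs["managedclusters"],
    }


# Example with a placeholder subscription ID and the names used in this post.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/aks-chaos-mesh-rg"
    "/providers/Microsoft.ContainerService"
    "/managedClusters/aks-chaos-mesh-aks"
)
print(parse_cluster_resource_id(example_id))
```

The values can then be exported as ARM_SUBSCRIPTION_ID and RESOURCE_GROUP for the experiment commands further down.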
Now, back in our shell, we navigate to the ChaosMesh folder and see the two files shown in the image.
Run another terraform init in this directory to initialize our configuration.
When we run terraform plan, we can see that we are using the Helm provider with the chart provided for Chaos Mesh. With our cluster access already configured, applying should be seamless: run terraform apply, approve the deployment, and once it completes run the following command.
kubectl get pods -n chaos-testing
Enable Chaos Studio on AKS
For Chaos Studio to work with AKS, we need to pass the resource ID obtained earlier to the REST API, which registers our cluster as a target in Chaos Studio.
az rest --method put --url "https://management.azure.com/$RESOURCE_ID/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh?api-version=2021-09-15-preview" --body "{\"properties\":{}}"
The output should look like this in our shell session. Next, we have to add capabilities to our target (the cluster).
If we want to add different capabilities, reference this document: https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library
For ours, we are going to keep it to PodChaos, so we can run the following:
export CAPABILITY=PodChaos-2.1
az rest --method put --url "https://management.azure.com/$RESOURCE_ID/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/$CAPABILITY?api-version=2021-09-15-preview" --body "{\"properties\":{}}"
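The two az rest calls above share one URL pattern: the cluster resource ID, then the Chaos target type, then optionally a capability. Here is a minimal sketch of that pattern, using the API version and target type from the commands above; the function and constant names are ours. It strips the leading slash from the resource ID, so the result has a single slash after the host.

```python
MANAGEMENT_ENDPOINT = "https://management.azure.com"
API_VERSION = "2021-09-15-preview"
TARGET_TYPE = "Microsoft-AzureKubernetesServiceChaosMesh"


def chaos_target_url(resource_id, capability=None):
    """Build the URL for the Chaos target, or for one capability on it."""
    path = f"{resource_id.strip('/')}/providers/Microsoft.Chaos/targets/{TARGET_TYPE}"
    if capability:
        path += f"/capabilities/{capability}"
    return f"{MANAGEMENT_ENDPOINT}/{path}?api-version={API_VERSION}"


# Placeholder subscription ID; the other names match the ones used in this post.
cluster_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/aks-chaos-mesh-rg"
    "/providers/Microsoft.ContainerService"
    "/managedClusters/aks-chaos-mesh-aks"
)
print(chaos_target_url(cluster_id))
print(chaos_target_url(cluster_id, "PodChaos-2.1"))
```

Adding another capability from the fault library then only means passing a different capability name.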
Crafting our Experiment
We will use the following YAML as a starting point:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: one
  duration: '30s'
However, we are going to narrow this down before converting it to JSON, keeping only the spec fields, switching the mode to all, and selecting the default namespace:
action: pod-failure
mode: all
selector:
  namespaces:
    - default
We can use a YAML-to-JSON converter, as shown in the image, to translate our YAML into minified JSON.
Then take this output and run it through a JSON escape tool so that the JSON is escaped as shown.
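If you would rather not paste the spec into online tools, the same minify-and-escape steps can be sketched with the Python standard library. This assumes Python 3 is available and writes the spec as a dictionary by hand instead of parsing the YAML; the variable names are ours.

```python
import json

# The trimmed PodChaos spec from above, written as a Python dictionary.
spec = {
    "action": "pod-failure",
    "mode": "all",
    "selector": {"namespaces": ["default"]},
}

# Step 1: minify by dropping the whitespace json.dumps adds by default.
minified = json.dumps(spec, separators=(",", ":"))

# Step 2: escape by encoding the minified string as a JSON string literal.
# The result includes the surrounding double quotes.
escaped = json.dumps(minified)

print(minified)
print(escaped)
```

The second line printed is the quoted, escaped string that goes into the jsonSpec value of the experiment file.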
Okay, now we are ready to craft our experiment with the help of the schema. Run nano experiment.json and add the following:
{
  "location": "eastus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "steps": [
      {
        "name": "pod-kill",
        "branches": [
          {
            "name": "pod-kill",
            "actions": [
              {
                "type": "continuous",
                "selectorId": "Selector1",
                "duration": "PT10M",
                "parameters": [
                  {
                    "key": "jsonSpec",
                    "value": "{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"default\"]}}"
                  }
                ],
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/<your-input>/resourceGroups/myRG/providers/Microsoft.ContainerService/managedClusters/myCluster/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
          }
        ]
      }
    ]
  }
}
Ensure the id field under targets starts with the resource ID we pulled earlier, followed by /providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh, then save this file locally.
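To avoid hand-editing the target ID, a small helper can fill it in from the resource ID. This is a sketch under the assumption that the experiment document has the selectors and targets layout shown above; the function name is hypothetical.

```python
import copy

TARGET_SUFFIX = (
    "/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
)


def with_cluster_target(experiment, resource_id):
    """Return a copy of the experiment whose targets all point at resource_id."""
    updated = copy.deepcopy(experiment)  # leave the caller's dictionary untouched
    target_id = resource_id.rstrip("/") + TARGET_SUFFIX
    for selector in updated["properties"]["selectors"]:
        for target in selector["targets"]:
            target["id"] = target_id
    return updated
```

One way to use it is to load experiment.json with json.load, pass the result through with_cluster_target together with the value of RESOURCE_ID, and write the result back with json.dump.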
Run the following command, which I’ve modified to align with our environment variables; here EXPERIMENT_NAME is set to PodChaos-2.1.
az rest --method put --uri "https://management.azure.com/subscriptions/$ARM_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME?api-version=2021-09-15-preview" --body @experiment.json
Assigning permissions to experiment for AKS cluster
When an experiment is created, Chaos Studio gives it a system-assigned managed identity, and that identity executes the chosen faults against the targeted resources. The identity therefore needs permissions on the cluster.
At the top of the output from the previous command, note the principal ID and use it in the following command:
az role assignment create --role "Azure Kubernetes Service Cluster Admin Role" --assignee-object-id $EXPERIMENT_PRINCIPAL_ID --scope $RESOURCE_ID
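If you saved the output of the experiment creation call, the principal ID can also be read from it programmatically. This is a minimal sketch that assumes the response is JSON with the principal ID under identity.principalId; check it against your own output, since that field layout is our assumption and the function name is ours.

```python
import json


def principal_id_from_response(response_text):
    """Extract the managed identity's principal ID from the creation response."""
    response = json.loads(response_text)
    # Assumed layout: {"identity": {"type": "SystemAssigned", "principalId": "..."}}
    return response["identity"]["principalId"]


# Illustrative response fragment with a placeholder principal ID.
sample = '{"identity": {"type": "SystemAssigned", "principalId": "11111111-2222-3333-4444-555555555555"}}'
print(principal_id_from_response(sample))
```

The returned value is what goes into EXPERIMENT_PRINCIPAL_ID for the role assignment above.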
#run our experiment
az rest --method post --uri "https://management.azure.com/subscriptions/$ARM_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME/start?api-version=2021-09-15-preview"
Once the experiment is complete, let’s navigate to monitoring and observe any changes in our cluster’s overall health.
I’ve created an Azure Monitor workspace and onboarded AKS using the Insights resource, which allows me to drill down.
Summary
Azure Chaos Studio can be used through the UI or programmatically. This post is intended to show how Terraform can quickly stand up a cluster and start running an experiment against it. Should you package your application in a Helm chart, you can additionally use the Helm provider to install it on creation. Chaos Mesh is also an open source project, if you’d like to go the OSS route on any cloud platform. As always, ensure the Terraform configuration is torn down by running terraform destroy and approving it. I hope this walkthrough of the commands makes it clearer how chaos experiments can stress test resources in the cloud.