Chaos Studio Experiments in AKS

Introduction

Azure Chaos Studio is a Microsoft Azure service for measuring and understanding the resilience of your applications and services. I’ve written about LitmusChaos in a previous blog post, but felt I could cover more on this topic because application resiliency is pivotal to an organization’s operations. Chaos Engineering is the practice of testing distributed software by deliberately introducing failures or fault scenarios, to verify the application’s ability to withstand them and ideally prevent outages before they occur. Netflix was a notable contributor in this space, using the practice internally and open-sourcing Chaos Monkey; I’ll include the link in the references.

Exploration for Demo

To explore this in Azure Kubernetes Service, I’ve adapted a Terraform script from an existing configuration and added the Helm repository so we can deploy everything as IaC in your Azure subscription. The diagram serves as the reference; however, to keep our costs low we are deploying only one cluster rather than the two shown, with some modifications.

In the image above, I’ve created a diagram showing the use of a private Azure Container Registry and its security capabilities: an image is pushed to the registry and scanned before an approval gate, then deployed to our cluster, where we conduct our experiment.

For this tutorial we are going to deploy the cluster, so have the following items on hand:

  • Azure Subscription
  • Familiarity with Azure CLI
  • Kubectl (Installed)

Getting Started

First we need to clone the repository

git clone https://github.com/sn0rlaxlife/aks-chaos-engineer.git

Once we run ls, our files should be as shown in the image. If you are running this locally or in Azure Cloud Shell, ensure you have exported the credentials Terraform needs (e.g. ARM_SUBSCRIPTION_ID, ARM_CLIENT_ID, ARM_TENANT_ID, etc.).
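As a minimal sketch of that credential step, assuming you authenticate Terraform with a service principal: the variable names below are the ones the azurerm provider reads, while the values are placeholders and the check_arm_env helper is my own illustration, not part of the repository.

```shell
# Placeholder values (assumptions): replace with your own service principal details.
export ARM_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export ARM_TENANT_ID="11111111-1111-1111-1111-111111111111"
export ARM_CLIENT_ID="22222222-2222-2222-2222-222222222222"
export ARM_CLIENT_SECRET="replace-with-your-client-secret"

# Hypothetical helper: reports which of the expected variables are unset or empty.
# It only checks presence, not whether the values are valid credentials.
check_arm_env() {
  missing=0
  for name in ARM_SUBSCRIPTION_ID ARM_TENANT_ID ARM_CLIENT_ID ARM_CLIENT_SECRET; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_arm_env && echo "Terraform credentials are set"
```

Running the check before terraform init gives a clearer message than a failed provider authentication later on.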

module "aks" {
  source                            = "Azure/aks/azurerm"
  version                           = "7.3.1"
  resource_group_name               = azurerm_resource_group.aks.name
  kubernetes_version                = var.kubernetes_version
  orchestrator_version              = var.kubernetes_version
  prefix                            = "aks-chaos-mesh"
  network_plugin                    = "kubenet"
  vnet_subnet_id                    = lookup(module.aks-vnet.vnet_subnets_name_id, "subnet0")
  os_disk_size_gb                   = 50
  sku_tier                          = "Standard" # defaults to Free
  private_cluster_enabled           = false
  rbac_aad                          = var.rbac_aad
  role_based_access_control_enabled = var.role_based_access_control_enabled
  http_application_routing_enabled  = false
  enable_auto_scaling               = true
  enable_host_encryption            = false
  log_analytics_workspace_enabled   = false
  agents_min_count                  = 1
  agents_max_count                  = 3
  agents_count                      = null # Please set `agents_count` `null` while `enable_auto_scaling` is `true` to avoid possible `agents_count` changes.
  agents_max_pods                   = 100
  agents_pool_name                  = "system"
  agents_availability_zones         = ["1", "2"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = var.agents_size


  agents_labels = {
    "nodepool" : "defaultnodepool"
  }

  agents_tags = {
    "Agent" : "defaultnodepoolagent"
  }

  ingress_application_gateway_enabled = false

  network_policy             = "calico"
  net_profile_dns_service_ip = "10.0.0.10"
  net_profile_service_cidr   = "10.0.0.0/16"

  key_vault_secrets_provider_enabled = true
  secret_rotation_enabled            = true
  secret_rotation_interval           = "3m"

  depends_on = [module.aks-vnet]
}

We can now run terraform init (after inspecting our files).

The output looks okay. It appears my main module has an update available, but this shouldn’t interfere with our deployment.

Run terraform apply and confirm the deployment. Once it completes, we will see the cluster provisioned from our code.

Next we need to grab our credentials so our kubectl client can access the cluster:

az aks get-credentials --resource-group aks-chaos-mesh-rg --name aks-chaos-mesh-aks

Then we navigate to our cluster and select Properties. We are looking for the Resource ID; put this somewhere safe, as we will use it later in other commands. If you’d like to export it into a variable, you can run export RESOURCE_ID=<value>.
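If you prefer to stay in the shell, the same ID can be read with az aks show. The sketch below assumes the resource group and cluster names from the get-credentials command above; the two function names are my own, and the shape check is only a sanity test, not an official validator.

```shell
# Hypothetical helper: reads the cluster's resource ID with the Azure CLI.
# $1 = resource group, $2 = cluster name
get_aks_resource_id() {
  az aks show --resource-group "$1" --name "$2" --query id --output tsv
}

# Hypothetical helper: checks that a string has the shape of an AKS resource ID.
# The CLI may return "resourcegroups" in lower case, so both spellings are accepted.
is_aks_resource_id() {
  case "$1" in
    /subscriptions/*/[rR]esource[gG]roups/*/providers/Microsoft.ContainerService/managedClusters/*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Usage (requires an authenticated az session):
#   RESOURCE_ID="$(get_aks_resource_id aks-chaos-mesh-rg aks-chaos-mesh-aks)"
#   is_aks_resource_id "$RESOURCE_ID" && export RESOURCE_ID
```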

Now back in our shell, we navigate to the ChaosMesh folder and see the two files, as shown in the image.

Run another terraform init for this directory to initialize our configuration.

When we run terraform plan, we can see we are using the Helm provider with the chart provided for Chaos Mesh. With our cluster access already in place, this should be seamless. Run terraform apply and approve the deployment; once it completes, run the following command.

kubectl get pods -n chaos-testing
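The pods from the Helm release can take a little while to start. Here is a minimal sketch for blocking until they are ready, assuming the chart was installed into the chaos-testing namespace used above; the function name and the 180s default timeout are my own choices.

```shell
# Hypothetical helper: waits until every pod in the namespace reports Ready.
# $1 = namespace (default: chaos-testing), $2 = timeout (default: 180s)
wait_for_chaos_mesh() {
  kubectl wait --for=condition=Ready pods --all \
    --namespace "${1:-chaos-testing}" \
    --timeout="${2:-180s}"
}

# Usage (requires kubectl access to the cluster):
#   wait_for_chaos_mesh
#   wait_for_chaos_mesh chaos-testing 300s
```

kubectl wait exits with a non-zero status if the timeout expires, so the function can gate the next step in a script.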

Enable Chaos Studio on AKS

For Chaos Studio to work with AKS, we need to pass the resource ID obtained earlier to the REST API to register our cluster as a target in Chaos Studio.

az rest --method put --url "https://management.azure.com/$RESOURCE_ID/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh?api-version=2021-09-15-preview" --body "{\"properties\":{}}"

The output should look like this in our shell session. Next we have to enable the capabilities on our target (the cluster).

If we want to add different capabilities, reference this document: https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-fault-library

For ours we are going to keep it to PodChaos, so we can run the following:

export CAPABILITY=PodChaos-2.1
az rest --method put --url "https://management.azure.com/$RESOURCE_ID/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/$CAPABILITY?api-version=2021-09-15-preview"  --body "{\"properties\":{}}"
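If you do want more than one capability, the same request can be repeated in a loop. This is a sketch under assumptions: the helper names are mine, NetworkChaos-2.1 is only an example capability name that you should confirm against the fault library, and the URL is built without the doubled slash that results from placing a slash before $RESOURCE_ID.

```shell
API_VERSION="2021-09-15-preview"
TARGET_TYPE="Microsoft-AzureKubernetesServiceChaosMesh"

# Hypothetical helper: builds the capability URL.
# $1 = cluster resource ID (starts with /subscriptions/...), $2 = capability name
capability_url() {
  printf 'https://management.azure.com%s/providers/Microsoft.Chaos/targets/%s/capabilities/%s?api-version=%s\n' \
    "$1" "$TARGET_TYPE" "$2" "$API_VERSION"
}

# Hypothetical helper: enables each capability passed after the resource ID.
enable_capabilities() {
  resource_id="$1"
  shift
  for capability in "$@"; do
    az rest --method put \
      --url "$(capability_url "$resource_id" "$capability")" \
      --body '{"properties":{}}'
  done
}

# Usage (requires an authenticated az session and RESOURCE_ID set earlier):
#   enable_capabilities "$RESOURCE_ID" PodChaos-2.1 NetworkChaos-2.1
```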

Crafting our Experiment

We will use the following YAML

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-mesh
spec:
  action: pod-failure
  mode: one
  duration: '30s'

However, we are going to narrow this down to the spec fields shown below, which we then convert to JSON.

action: pod-failure
mode: all
selector:
  namespaces:
    - default

We can use a YAML-to-JSON converter, as shown in the image, to translate our YAML to minified JSON.

Then run this output through a JSON escape tool so the JSON can be embedded as a string value, as shown.
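The escaping step can also be done locally instead of with an online tool. A minimal sketch, assuming the minified JSON contains only double quotes and backslashes that need escaping (true for this spec, but not for arbitrary JSON with control characters); the function name is my own.

```shell
# Minified JSON form of the narrowed-down spec (action, mode, selector).
json_spec='{"action":"pod-failure","mode":"all","selector":{"namespaces":["default"]}}'

# Hypothetical helper: escapes backslashes first, then double quotes, so the
# result can be pasted between the quotes of a JSON string value.
escape_json_string() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

escaped_spec="$(escape_json_string "$json_spec")"
printf '%s\n' "$escaped_spec"
```

The printed string is what goes into the jsonSpec value of the experiment definition.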

Okay, now we are ready to craft our experiment with the help of the schema. Run nano experiment.json and add the following:

{
  "location": "eastus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "steps": [
      {
        "name": "pod-kill",
        "branches": [
          {
            "name": "pod-kill",
            "actions": [
              {
                "type": "continuous",
                "selectorId": "Selector1",
                "duration": "PT10M",
                "parameters": [
                  {
                      "key": "jsonSpec",
                      "value": "{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"default\"]}}"
                  }
                ],
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/<your-input>/resourceGroups/myRG/providers/Microsoft.ContainerService/managedClusters/myCluster/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
          }
        ]
      }
    ]
  }
}

Ensure the id under targets starts with the resource ID we pulled earlier (followed by /providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh), and save this file locally.

Run the following. I’ve modified it to align with our environment variables; EXPERIMENT_NAME is set to PodChaos-2.1.

az rest --method put --uri "https://management.azure.com/subscriptions/$ARM_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME?api-version=2021-09-15-preview" --body @experiment.json
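The command above relies on three environment variables. Here is a small sketch of setting them and building the request URI in one place. It assumes the experiment lives in the same resource group as the cluster, which is my assumption rather than a requirement; the subscription ID default is a placeholder and the experiment_uri helper is hypothetical.

```shell
# Placeholder subscription ID (assumption): keep an already exported value if present.
export ARM_SUBSCRIPTION_ID="${ARM_SUBSCRIPTION_ID:-00000000-0000-0000-0000-000000000000}"
export RESOURCE_GROUP="aks-chaos-mesh-rg"   # assumed: same group as the cluster
export EXPERIMENT_NAME="PodChaos-2.1"

# Hypothetical helper: builds the experiment URI; an optional suffix such as
# /start is appended before the query string.
experiment_uri() {
  printf 'https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Chaos/experiments/%s%s?api-version=2021-09-15-preview\n' \
    "$ARM_SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$EXPERIMENT_NAME" "${1:-}"
}

# Usage (requires an authenticated az session):
#   az rest --method put --uri "$(experiment_uri)" --body @experiment.json
#   az rest --method post --uri "$(experiment_uri /start)"
```

Quoting the URI matters because the ? in the query string is a glob character in some shells.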

Assigning permissions to experiment for AKS cluster

Chaos Studio works as follows: when the experiment is created, it receives a system-assigned managed identity, and that identity executes the chosen faults against the targeted resources.
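The principal ID of that identity can also be read back from the experiment resource rather than copied by hand. A sketch, assuming the GET response exposes it under identity.principalId as the creation output does; the function name is my own.

```shell
# Hypothetical helper: reads the experiment's managed identity principal ID.
# $1 = subscription ID, $2 = resource group, $3 = experiment name
get_experiment_principal_id() {
  az rest --method get \
    --uri "https://management.azure.com/subscriptions/$1/resourceGroups/$2/providers/Microsoft.Chaos/experiments/$3?api-version=2021-09-15-preview" \
    --query identity.principalId --output tsv
}

# Usage (requires an authenticated az session):
#   EXPERIMENT_PRINCIPAL_ID="$(get_experiment_principal_id "$ARM_SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$EXPERIMENT_NAME")"
#   export EXPERIMENT_PRINCIPAL_ID
```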

Near the top of the output, note down the principal ID (exported here as EXPERIMENT_PRINCIPAL_ID) and use the following command:

az role assignment create --role "Azure Kubernetes Service Cluster Admin Role" --assignee-object-id $EXPERIMENT_PRINCIPAL_ID --scope $RESOURCE_ID

# Run our experiment
az rest --method post --uri "https://management.azure.com/subscriptions/$ARM_SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME/start?api-version=2021-09-15-preview"

Once the experiment is complete, let’s navigate to monitoring and observe any changes in our cluster’s overall health.

I’ve created an Azure Monitor workspace and onboarded AKS using the Insights resource, which allows me to drill down.

Summary

Azure Chaos Studio can be driven via the UI or programmatically. This post is intended to show how Terraform can quickly stand up a cluster so you can start running an experiment against it. Should you package your application in a Helm chart, you can additionally use the Helm provider to install it on creation. Chaos Mesh is also an open source project if you’d like to go the OSS route and use it on any cloud platform. As always, ensure our Terraform configuration is torn down by running terraform destroy and approving. I hope seeing the commands makes it clearer how this can stress test resources in the cloud.