Jul 25, 2021

GCP VM Disk Loss Experiment for LitmusChaos

 GCP VM Disk Loss Experiment for LitmusChaos

In this beginner-friendly blog, we’ll be going through the GCP VM Disk Loss experiment for LitmusChaos. GCP VM Disk Loss experiment causes detachment of a non-boot persistent storage disk from a GCP VM instance for a specified duration of time and later re-attaches the disk to its respective VM instance. The broad objective of this experiment is to extend the principles of cloud-native chaos engineering to non-Kubernetes targets while ensuring resiliency for all kinds of targets, be it Kubernetes or non-Kubernetes ones, as a part of a single chaos workflow for the entirety of a business.

At the time of writing this blog, the experiment is available only as a technical preview in the chaos hub, but in the upcoming releases, the experiment will surely become an integral part of the chaos hub. That being said, we can still access and execute the experiment without any problem, as I am about to show you in this blog.

Pre-Requisites

Before we begin with the steps of the experiment, let’s check the pre-requisites for performing this experiment:

  1. A GCP project containing the target Persistent Storage Disks attached to their respective VM instances
  2. A GCP Service Account having project level Editor or Owner permissions
  3. A Kubernetes cluster with Litmus 2.0 installed

STEP 1: Updating The Chaos Hub

Browse and log in to your Litmus portal. You should be on the dashboard.

Select ChaosHubs. Here you’d be able to see the default Chaos Hub.

Choose to Edit the default Chaos Hub and instead of the v1.13.x branch, choose the master branch.

Click Submit Now. Now you’d be able to access all the experiments, even those under the technical preview. To confirm that the experiments have been added successfully, click on Chaos Hub.

You should see the GCP Experiments listed here. Now we are all set to begin the steps of the experiment.

STEP 2: Setting Up the Chaos Experiment

We’d be using the experiment docs to help us with a few steps.

In this demo, we will inject chaos into two non-boot persistent storage disks named test-disk and test-disk-1, attached to the VM instances named test-instance and test-instance-1 respectively, the disks are located in the zones us-central1-a and us-central1-b respectively, and belong to the GCP project “Litmus GCP Instance Delete” with the ID of litmus-gcp-instance-delete.

Please notice that both the disks are initially attached to their respective VM instances, before the injection of the chaos. Now that we have our disks ready, we can set up our experiment. Before scheduling the chaos experiment, we need to make the GCP Service Account credentials available to Litmus, so that the instances can be shut down and later started as part of the experiment. To do that, we’ll make a Kubernetes secret named secret.yaml as follows:

apiVersion: v1 kind: Secret metadata: name: cloud-secret type: Opaque stringData: type: "service_account" project_id: "litmus-gcp-instance-delete" private_key_id: "9e0jacc5e0abb74f3426df51c0ca5065904c6beb" private_key: -----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEJAASCBKgwggSkAgEAAoIBAQD1JSTjKKN5CCGF\nUsWnaCHfFOReX6wDT+toYz065z5t4cYq3wb/RUGJz4q6n0Z> client_email: "experiment-demo@litmus-gcp-instance-delete.iam.gserviceaccount.com" client_id: "123476663820197864518297" auth_uri: "https://accounts.google.com/o/oauth2/auth" token_uri: "https://oauth2.googleapis.com/token" auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs" client_x509_cert_url: "https://www.googleapis.com/robot/v1/metadata/x509/experiment-demo%40litmus-gcp-instance-delete.iam.gserviceaccount.com"

The format of this secret is also available in the experiment docs. Make sure the name of the secret is cloud-secret and replace the respective fields of the secret with your own service account credentials. Once done, apply the secret in the litmus namespace using the command:

kubectl apply -f secret.yaml -n litmus

Once the secret is applied, we’re all set to schedule our experiment from the Litmus portal. In Dashboard, click on the Schedule a Workflow button. In the workflow creation page, choose the self-agent and click Next.

In the Choose a Workflow page, select “Create a new workflow using the experiments from MyHub” and select Chaos Hub in the dropdown. Then click Next.

In the Workflow Settings page, fill in the workflow name and description of your choice. Click Next.

In the Tune Workflow page, click on “Add a new experiment” and choose gcp/gcp-vm-disk-loss.

Click Done. Notice that the experiment has been added to the experiment graph diagram. Now click on “Edit YAML”. Here we will edit the workflow manifest to specify the experiment resource details.

Scroll down to the manifest of the ChaosExperiment:

Notice that the name of the secret that we had previously created is being passed to the ChaosExperiment to be mounted at the path /tmp/.

Scroll further down and similarly fill in the relevant experiment details in the manifest of the ChaosEngine as follows:

Please take note that the zone for each target disk is to be mentioned in DISK_ZONES in the same order of the DISK_VOLUME_NAMES. Similarly, the device name for each target disk is to be mentioned in DEVICE_NAMES in the same order of the DISK_VOLUME_NAMES. If you like, feel free to modify the parameters of the experiment such as the RAMP_TIME, TOTAL_CHAOS_DURATION, etc. As you would have noticed, some of the experiment tunables are common for both the ChaosEngine and ChaosExperiment, and the values of ChaosExperiment get overridden by that of the values of the ChaosEngine if they differ in both the manifests. Once you have made the changes, click Save Changes. We’ve now specified all the experiment details and are ready to go to the next step. Click Next.

In the Reliability Score, we will use the default score of 10. Click Next.

In Schedule, click Schedule Now. Click Next. On the Verify and Commit page verify all the details and once satisfied click on Finish. We’ve successfully scheduled our chaos experiment.

STEP 3: Observing the Chaos

Click on Go to Workflow and choose the workflow that we just created. Here we can observe the different steps of the workflow execution including chaos experiment installation, chaos injection, and chaos revert.

You can also determine if the chaos injection has taken place and as a result, the disks have detached from their respective VM instances or not from the GCP Console.

We can also view the Table View for the experiment logs as the experiment proceeds through the various steps.

Once completed, the workflow graph should have executed all the steps successfully.

We have got a 100% Resiliency Score. We can also check the ChaosResult verdict which should say the experiment has passed. The Probe Success Percentage should be 100% as all our disks re-attached successfully post their detachment.

Again you can check in the GCP console if the disks have been re-attached or not.

We can also perform post chaos analysis for the experiment results in the Analytics section.

In conclusion of this blog, we saw how we can perform the GCP VM Disk Loss chaos experiment using Litmus Chaos 2.0. This experiment is only one of the many experiments for the Non-Kubernetes experiments in LitmusChaos, including experiments for AWS, Azure, VMWare, and many more, which are targeted towards making Litmus an absolute Chaos Engineering toolset for every enterprise regardless of the technology stack used by them.

Come join me at the Litmus community to contribute your bit in developing chaos engineering for everyone. To join the Litmus community:
Step 1: Join the Kubernetes slack using the following link: https://slack.k8s.io/
Step 2: Join the #litmus channel on the Kubernetes slack or use this link after joining the Kubernetes slack: [https://slack.litmuschaos.io/](https://slack.litmuschaos.io/)

Show your ❤️ with a ⭐ on our Github. To learn more about Litmus, check out the Litmus documentation. Thank you! 🙏

CONTACT ME

© Neelanjan Manna, 2020 - 2024