Jul 24, 2021
GCP VM Instance Stop Experiment for LitmusChaos
This blog is a beginner-friendly guide to the GCP VM Instance Stop chaos experiment in LitmusChaos. The experiment shuts down one or more GCP VM instances for a specified duration and then restarts them. Its broad objective is to extend the principles of cloud-native chaos engineering to non-Kubernetes targets while ensuring resiliency for all kinds of targets, Kubernetes or otherwise, as part of a single chaos workflow for the entire business.
At the time of writing, the experiment is available only as a technical preview in the chaos hub; it is expected to become an integral part of the chaos hub in upcoming releases. That said, we can still access and execute the experiment without any problem, as I am about to show you in this blog.
Pre-Requisites
Before we begin with the steps of the experiment, let’s check the pre-requisites for performing this experiment:
- A GCP project containing the target VM instances
- A GCP Service Account having sufficient permissions to stop or start the VM Instances
- A Kubernetes cluster with Litmus 2.0 installed
STEP 1: Updating The Chaos Hub
Browse and log in to your Litmus portal. You should be on the dashboard.
Select ChaosHubs. Here you’d be able to see the default ChaosHub.
Choose to Edit the default Chaos Hub and, instead of the v1.13.x branch, choose the master branch.
Click Submit Now. Now you’d be able to access all the experiments, even those under the technical preview. To confirm that the experiments have been added successfully, click on Chaos Hub and view the Chaos Hub.
You should see the GCP Experiments listed here. Now we are all set to begin the steps of the experiment.
STEP 2: Setting Up the Chaos Experiment
We’d be using the experiment docs to help us with a few steps.
In this demo, we will inject chaos into two VM instances named test-instance and test-instance-1, belonging to the zones us-central1-a and us-central1-b respectively, in the GCP project “Litmus GCP Instance Delete” with the ID litmus-gcp-instance-delete.
Please notice that the instances are in a running state initially, before the injection of the chaos. Now that we have our instances ready, we can set up our experiment. Before scheduling the chaos experiment, we need to make the GCP Service Account credentials available to Litmus, so that the instances can be shut down and later started as part of the experiment. To do that, we’d define a Kubernetes secret in a file named secret.yaml as follows:
    apiVersion: v1
    kind: Secret
    metadata:
      name: cloud-secret
    type: Opaque
    stringData:
      type: "service_account"
      project_id: "litmus-gcp-instance-delete"
      private_key_id: "9e0jacc5e0abb74f3426df51c0ca5065904c6beb"
      private_key: -----BEGIN PRIVATE KEY-----\nMIIEvgIBADANBgkqhkiG9w0BAQEJAASCBKgwggSkAgEAAoIBAQD1JSTjKKN5CCGF\nUsWnaCHfFOReX6wDT+toYz065z5t4cYq3wb/RUGJz4q6n0Z>
      client_email: "experiment-demo@litmus-gcp-instance-delete.iam.gserviceaccount.com"
      client_id: "123476663820197864518297"
      auth_uri: "https://accounts.google.com/o/oauth2/auth"
      token_uri: "https://oauth2.googleapis.com/token"
      auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
      client_x509_cert_url: "https://www.googleapis.com/robot/v1/metadata/x509/experiment-demo%40litmus-gcp-instance-delete.iam.gserviceaccount.com"
The format of this secret is also available in the experiment docs. Make sure the name of the secret is cloud-secret and replace the respective fields of the secret with your own service account credentials. Once done, apply the secret in the litmus namespace using the command:
kubectl apply -f secret.yaml -n litmus
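If you would rather not copy each field by hand, the secret manifest can also be generated from the JSON key file downloaded for the service account. The sketch below is only one possible way to do this and is not taken from the experiment docs: the helper names build_cloud_secret and write_cloud_secret and the file paths are hypothetical, and it assumes the key file carries the same field names that the secret above uses. It writes the manifest as JSON, which kubectl apply -f accepts as well.

```python
import json

# Fields copied from the service account key into the secret's stringData,
# mirroring the keys used in the secret.yaml shown above.
SECRET_FIELDS = [
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
    "auth_provider_x509_cert_url", "client_x509_cert_url",
]


def build_cloud_secret(key: dict, name: str = "cloud-secret") -> dict:
    """Return a Kubernetes Secret manifest built from a service account key."""
    missing = [field for field in SECRET_FIELDS if field not in key]
    if missing:
        raise ValueError(f"service account key is missing fields: {missing}")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "stringData": {field: key[field] for field in SECRET_FIELDS},
    }


def write_cloud_secret(key_path: str, out_path: str) -> None:
    """Read the key file (hypothetical path) and write the manifest as JSON."""
    with open(key_path) as key_file:
        key = json.load(key_file)
    with open(out_path, "w") as out_file:
        json.dump(build_cloud_secret(key), out_file, indent=2)
```

For example, calling write_cloud_secret("sa-key.json", "secret.json") would produce a file that can be applied with kubectl apply -f secret.json -n litmus. Keep both files out of version control, since they contain the private key.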
Once the secret is applied, we’re all set to schedule our experiment from the Litmus portal. In Dashboard, click on the Schedule a Workflow button. In the workflow creation page, choose the self-agent and click Next.
In the Choose a Workflow page, select “Create a new workflow using the experiments from MyHub” and select Chaos Hub in the dropdown. Then click Next.
In the Workflow Settings page, fill in the workflow name and description of your choice. Click Next.
In the Tune Workflow page, click on “Add a new experiment” and choose gcp/gcp-vm-instance-stop.
Click Done. Notice that the experiment has been added to the experiment graph diagram. Now click on “Edit YAML”. Here we will edit the workflow manifest to specify the experiment resource details.
Scroll down to the manifest of the ChaosExperiment.
Notice that the name of the secret that we had previously created is being passed to the ChaosExperiment to be mounted at the path /tmp/.
Scroll further down and similarly fill in the relevant experiment details in the manifest of the ChaosEngine.
Please take note that the zone for each target instance must be listed in INSTANCE_ZONES in the same order as the instances appear in VM_INSTANCE_NAMES. If you like, feel free to modify the other parameters of the experiment such as RAMP_TIME, TOTAL_CHAOS_DURATION, etc. As you may have noticed, some of the experiment tunables are common to both the ChaosEngine and the ChaosExperiment; where the two manifests differ, the values in the ChaosEngine override those in the ChaosExperiment. Once done, click Save Changes. We’ve now specified all the experiment details and are ready to go to the next step. Click Next.
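The two rules above, positional matching of names to zones and ChaosEngine values winning over ChaosExperiment values, can be illustrated with a small sketch. This is not Litmus code: pair_instances and effective_env are hypothetical helpers, the sketch assumes both lists are given as comma-separated strings, and the duration values are placeholders.

```python
def pair_instances(vm_instance_names: str, instance_zones: str) -> list:
    """Pair each VM name with the zone at the same position.

    Assumes both values are comma-separated strings (an assumption made
    for this illustration).
    """
    names = [n.strip() for n in vm_instance_names.split(",") if n.strip()]
    zones = [z.strip() for z in instance_zones.split(",") if z.strip()]
    if len(names) != len(zones):
        raise ValueError(
            f"{len(names)} instance name(s) but {len(zones)} zone(s) given"
        )
    return list(zip(names, zones))


def effective_env(experiment_env: dict, engine_env: dict) -> dict:
    """Merge tunables so that ChaosEngine values win over ChaosExperiment ones."""
    merged = dict(experiment_env)
    merged.update(engine_env)
    return merged


# Instance names and zones from this demo; the durations are placeholders.
engine_env = {
    "VM_INSTANCE_NAMES": "test-instance,test-instance-1",
    "INSTANCE_ZONES": "us-central1-a,us-central1-b",
    "TOTAL_CHAOS_DURATION": "60",
}
experiment_env = {"TOTAL_CHAOS_DURATION": "30", "RAMP_TIME": "0"}

env = effective_env(experiment_env, engine_env)
pairs = pair_instances(env["VM_INSTANCE_NAMES"], env["INSTANCE_ZONES"])
print(pairs)  # [('test-instance', 'us-central1-a'), ('test-instance-1', 'us-central1-b')]
print(env["TOTAL_CHAOS_DURATION"])  # 60
```

Swapping the order of the zones without swapping the names would pair each instance with the wrong zone, which is why the ordering matters.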
In the Reliability Score, we will use the default score of 10. Click Next.
In Schedule, click Schedule Now. Click Next. On the Verify and Commit page, verify all the details and, once satisfied, click Finish. We’ve successfully scheduled our chaos experiment.
STEP 3: Observing the Chaos
Click on Go to Workflow and choose the workflow that we just created. Here we can observe the different steps of the workflow execution including chaos experiment installation, chaos injection, and chaos revert.
You can also check from the GCP Console whether the chaos injection has taken place and, as a result, whether the instances have shut down.
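If you prefer the command line to the console, the same check can be scripted around the gcloud CLI. The sketch below is an optional convenience, not part of the experiment: it assumes gcloud is installed and authenticated for the project, and the helper names are hypothetical. A running instance reports the status RUNNING, and a stopped one reports TERMINATED.

```python
import subprocess


def describe_status_command(name: str, zone: str, project: str) -> list:
    """Build the gcloud command that prints only the instance status."""
    return [
        "gcloud", "compute", "instances", "describe", name,
        "--zone", zone,
        "--project", project,
        "--format", "value(status)",
    ]


def instance_status(name: str, zone: str, project: str) -> str:
    """Run gcloud and return the status, e.g. RUNNING or TERMINATED."""
    result = subprocess.run(
        describe_status_command(name, zone, project),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def check_targets(targets: list, project: str) -> dict:
    """Return a mapping of instance name to its current status."""
    return {name: instance_status(name, zone, project) for name, zone in targets}


# Instances, zones, and project ID from this demo.
TARGETS = [("test-instance", "us-central1-a"), ("test-instance-1", "us-central1-b")]
PROJECT = "litmus-gcp-instance-delete"
```

Calling check_targets(TARGETS, PROJECT) during the chaos window and again after the workflow completes shows the instances going down and coming back up.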
We can also view the Table View for the experiment logs as the experiment proceeds through the various steps.
Once completed, the workflow graph should have executed all the steps successfully.
We can also check the ChaosResult verdict, which should say the experiment has passed. The Probe Success Percentage should be 100% as all our instances restarted successfully after their shutdown.
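The verdict can also be read outside the portal, for instance from the JSON printed by kubectl get chaosresult with the result name, -n litmus, and -o json. The sketch below shows how that output might be parsed. It assumes the verdict and probe success percentage live under status.experimentStatus in the ChaosResult, the read_verdict helper is hypothetical, and the sample object is a trimmed illustration rather than real output from this run.

```python
import json


def read_verdict(chaos_result: dict) -> tuple:
    """Return (verdict, probe success percentage) from a ChaosResult object."""
    status = chaos_result.get("status", {}).get("experimentStatus", {})
    return status.get("verdict"), status.get("probeSuccessPercentage")


# Trimmed, illustrative ChaosResult; real objects carry more fields.
sample = json.loads("""
{
  "kind": "ChaosResult",
  "status": {
    "experimentStatus": {
      "phase": "Completed",
      "verdict": "Pass",
      "probeSuccessPercentage": "100"
    }
  }
}
""")

verdict, probe_success = read_verdict(sample)
print(verdict, probe_success)  # Pass 100
```

A result whose fields are missing yields None for both values instead of raising an error, which makes the helper safe to call while the experiment is still in progress.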
Again, you can check in the GCP console whether the instances have restarted.
We can also perform post chaos analysis for the experiment results in the Analytics section.
To conclude, we saw how to perform the GCP VM Instance Stop chaos experiment using Litmus Chaos 2.0. This is only one of the many non-Kubernetes experiments in LitmusChaos, which include experiments for AWS, Azure, VMware, and more, all aimed at making Litmus a complete chaos engineering toolset for every enterprise, regardless of its technology stack.
Come join me at the Litmus community to contribute your bit in developing chaos engineering for everyone. To join the Litmus community:
Step 1: Join the Kubernetes Slack using the following link: https://slack.k8s.io/
Step 2: Join the #litmus channel on the Kubernetes Slack, or use this link after joining the Kubernetes Slack: [https://slack.litmuschaos.io/](https://slack.litmuschaos.io/)
Show your ❤️ with a ⭐ on our GitHub. To learn more about Litmus, check out the Litmus documentation. Thank you! 🙏