Chaos Experiment on Cosmos DB
In today's cloud-driven landscape, ensuring the reliability and resilience of your database is paramount. Azure Cosmos DB, a globally distributed NoSQL database, offers robust features such as multi-region failover to enhance fault tolerance. But how do you ensure these mechanisms function as expected during unforeseen failures? Enter Azure Chaos Studio, a powerful tool to simulate and test failure scenarios in a controlled environment.
In this blog, we will explore how to perform a Chaos Experiment on Cosmos DB using Azure Chaos Studio. Specifically, we’ll create an experiment to test failover functionality in a Cosmos DB account. This hands-on approach provides critical insights into system behavior, helping you identify weaknesses and ensure the database can handle disruptions seamlessly.
If you haven’t read my previous blogs on Cosmos DB yet, I recommend reading them first and creating a similar setup before following this one.
Creating a NoSQL account in CosmosDB
Using Python script to fetch and write data to the cosmosDB container
Enable Geo redundancy:
Before enabling geo-redundancy, you also need to increase the throughput of the DB to 4000 RU/s.
Geo redundancy is enabled.
Define the Cosmos DB account as a target
Open the Azure portal and search for Azure Chaos Studio.
Click on Targets.
Select your Cosmos DB account.
Click on Enable service-direct targets (All resources).
Click on Review + Enable.
You've now successfully added your Azure Cosmos DB account to Chaos Studio.
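The same onboarding can be scripted. The sketch below only builds the two `az rest` calls that would register the account as a service-direct target and enable its failover capability; it does not run them. The target type `Microsoft-CosmosDB`, the capability name `Failover-1.0`, and the API version are assumptions taken from the Chaos Studio fault library and should be checked against the current documentation. The resource ID is a made-up placeholder.

```python
import json
import shlex

ARM = "https://management.azure.com"
# Assumption: a Microsoft.Chaos API version available in your subscription.
API_VERSION = "2023-11-01"


def target_url(cosmos_resource_id):
    """URL of the Chaos Studio target nested under the Cosmos DB account."""
    return (f"{ARM}{cosmos_resource_id}/providers/Microsoft.Chaos"
            f"/targets/Microsoft-CosmosDB?api-version={API_VERSION}")


def capability_url(cosmos_resource_id):
    """URL of the failover capability on that target."""
    return (f"{ARM}{cosmos_resource_id}/providers/Microsoft.Chaos"
            f"/targets/Microsoft-CosmosDB/capabilities/Failover-1.0"
            f"?api-version={API_VERSION}")


def onboarding_commands(cosmos_resource_id):
    """Return the two az rest commands (as argument lists) for onboarding."""
    return [
        ["az", "rest", "--method", "put",
         "--url", target_url(cosmos_resource_id),
         "--body", json.dumps({"properties": {}})],
        ["az", "rest", "--method", "put",
         "--url", capability_url(cosmos_resource_id),
         "--body", json.dumps({})],
    ]


if __name__ == "__main__":
    # Hypothetical resource ID; replace with your own account's ID.
    account_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
                  "/resourceGroups/my-rg/providers/Microsoft.DocumentDB"
                  "/databaseAccounts/my-cosmos-account")
    for cmd in onboarding_commands(account_id):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them lets you review them before touching a real account.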
Create an experiment
In Chaos Studio, click on Experiments.
Select + Create > New experiment.
Select your subscription and resource group. Give a name to your experiment, for example, cosmosDBFailover.
Select Next: Permissions
Select System assigned identity.
Click on Next: Experiment designer.
Add a name for your Step and Branch and select Add action > Add fault.
Select the CosmosDB Failover fault.
Set the Duration to 15 minutes.
Click on Next: Target resources.
Select your target Cosmos DB account and click on Add.
Click on Review + Create.
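For reference, the experiment built in the designer corresponds roughly to a JSON definition like the one produced below. This is a minimal sketch, not an export of the actual experiment: the fault URN, the `readRegion` parameter, and the selector layout are assumptions based on the Chaos Studio fault library, and `build_failover_experiment` is a hypothetical helper name. The duration of 15 minutes is written in ISO 8601 form as `PT15M`.

```python
import json


def build_failover_experiment(location, cosmos_resource_id, read_region,
                              duration="PT15M"):
    """Build an experiment body with one step, one branch, and one fault."""
    # The experiment points at the Chaos Studio target, not the account itself.
    target_id = (f"{cosmos_resource_id}/providers/Microsoft.Chaos"
                 f"/targets/Microsoft-CosmosDB")
    action = {
        "type": "continuous",
        # Assumption: URN of the Cosmos DB failover fault.
        "name": "urn:csci:microsoft:cosmosDB:failover/1.0",
        "selectorId": "Selector1",
        "duration": duration,
        # Assumption: the read region to promote to write region.
        "parameters": [{"key": "readRegion", "value": read_region}],
    }
    return {
        "location": location,
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "selectors": [{
                "type": "List",
                "id": "Selector1",
                "targets": [{"type": "ChaosTarget", "id": target_id}],
            }],
            "steps": [{
                "name": "Step 1",
                "branches": [{"name": "Branch 1", "actions": [action]}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical resource ID and regions matching this walkthrough.
    account_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
                  "/resourceGroups/my-rg/providers/Microsoft.DocumentDB"
                  "/databaseAccounts/my-cosmos-account")
    body = build_failover_experiment("eastus", account_id, "West US 3")
    print(json.dumps(body, indent=2))
```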
Grant the experiment permission on the Cosmos DB account
Go to your Azure Cosmos DB account and select Access control (IAM).
Click on Add > Add role assignment.
Select the role Cosmos DB Operator. Click Next.
Choose your experiment as the member, then click on Select > Review + assign.
The role has been assigned to the experiment.
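The role assignment can also be expressed as a single Azure CLI call. The sketch below assembles that command without running it, assuming you have already looked up the principal ID of the experiment's system-assigned identity. Both IDs shown are placeholders, and `role_assignment_command` is a hypothetical helper.

```python
import shlex


def role_assignment_command(principal_id, cosmos_resource_id):
    """Build the az CLI call that grants Cosmos DB Operator to the experiment."""
    return [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        # Managed identities are service principals in Microsoft Entra ID.
        "--assignee-principal-type", "ServicePrincipal",
        "--role", "Cosmos DB Operator",
        # Scope the role to the single Cosmos DB account under test.
        "--scope", cosmos_resource_id,
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the experiment's principal ID
    # and your account's resource ID.
    principal = "11111111-1111-1111-1111-111111111111"
    account_id = ("/subscriptions/00000000-0000-0000-0000-000000000000"
                  "/resourceGroups/my-rg/providers/Microsoft.DocumentDB"
                  "/databaseAccounts/my-cosmos-account")
    print(shlex.join(role_assignment_command(principal, account_id)))
```

Scoping the assignment to the account, rather than the resource group or subscription, keeps the experiment's permissions as narrow as the test requires.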
Run the Experiment
In the Experiments view, select your experiment. Select Start > OK.
The experiment has started.
Click on Action.
It is running on the target that we have defined.
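If you prefer to trigger the run from a script, starting an experiment is a POST to its `start` endpoint. The following sketch only builds the corresponding `az rest` command. The API version is an assumption, the subscription and resource group names are placeholders, and `start_command` is a hypothetical helper; the experiment name reuses the cosmosDBFailover example from earlier.

```python
import shlex

# Assumption: a Microsoft.Chaos API version available in your subscription.
API_VERSION = "2023-11-01"


def start_command(subscription_id, resource_group, experiment_name):
    """Build the az rest call that starts a Chaos Studio experiment."""
    url = (f"https://management.azure.com/subscriptions/{subscription_id}"
           f"/resourceGroups/{resource_group}/providers/Microsoft.Chaos"
           f"/experiments/{experiment_name}/start"
           f"?api-version={API_VERSION}")
    return ["az", "rest", "--method", "post", "--url", url]


if __name__ == "__main__":
    # Placeholder subscription and resource group.
    cmd = start_command("00000000-0000-0000-0000-000000000000",
                        "my-rg", "cosmosDBFailover")
    print(shlex.join(cmd))
```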
Observations
Go to your Cosmos DB account and click on Settings > Replicate data globally.
You will notice that by default, the Write region is East US and the Read region is West US 3.
Wait until your experiment is finished.
Once the experiment is completed, you will notice that the Write region has changed to West US 3 and the Read region has become East US.
This proves that the DB Failover happened successfully.
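The same check can be made without the portal by comparing the account description before and after the run, for example the JSON printed by `az cosmosdb show`. The sketch below assumes that the JSON exposes a `writeLocations` list whose entries carry a `locationName`, either at the top level or under `properties`. The two sample documents are illustrative stand-ins that mirror the regions in this walkthrough, not captured output.

```python
def write_region(account):
    """Return the name of the current write region from an account description."""
    # az CLI output is flattened; raw ARM responses nest under "properties".
    props = account.get("properties", account)
    return props["writeLocations"][0]["locationName"]


def failover_happened(before, after):
    """True when the write region differs between the two snapshots."""
    return write_region(before) != write_region(after)


if __name__ == "__main__":
    # Illustrative snapshots only, shaped like the assumed JSON layout.
    before = {"writeLocations": [{"locationName": "East US",
                                  "failoverPriority": 0}],
              "readLocations": [{"locationName": "West US 3",
                                 "failoverPriority": 1}]}
    after = {"writeLocations": [{"locationName": "West US 3",
                                 "failoverPriority": 0}],
             "readLocations": [{"locationName": "East US",
                                "failoverPriority": 1}]}
    print(write_region(before), "->", write_region(after))
    print("failover happened:", failover_happened(before, after))
```

Taking one snapshot before starting the experiment and one after it completes gives a record of the failover that can be kept alongside the experiment results.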
Conclusion
Resilience testing is a cornerstone of building reliable systems, and Azure Chaos Studio makes it easy to simulate real-world failure scenarios like Cosmos DB failover. By conducting this Chaos Experiment, we verified that Cosmos DB's failover mechanism works as intended, ensuring minimal disruption during regional outages.