Chaos Experiment on Aurora (PostgreSQL) RDS using AWS Fault Injection Simulator (FIS)

Chaos engineering is a discipline that helps us prepare for unexpected failures by deliberately injecting faults into a system. In this blog, we will perform a chaos experiment on an Amazon Aurora (PostgreSQL) RDS cluster. The setup includes an Aurora cluster with one writer and one reader instance. Using AWS Fault Injection Simulator (FIS), we will introduce faults into the database and observe its failover mechanism. The experiment aims to validate Aurora’s high availability by ensuring that the writer instance is automatically replaced during failover while minimizing downtime.

Experiment

  1. Set up the Aurora Cluster:

    • Create an Aurora PostgreSQL cluster with one writer instance and one reader instance.

    • Confirm that the reader instance is available and in sync with the writer so that it can serve as the failover target.

  2. Prepare the Environment:

    • Confirm that FIS is available in your AWS Region and grant FIS the permissions it needs to manage the Aurora resources.

    • Define an IAM role with the rds:FailoverDBCluster permission for the FIS experiment.

  3. Define the Chaos Experiment:

    • Create an FIS template targeting the Aurora cluster.

    • Use the action aws:rds:failover-db-cluster in the FIS template to trigger a failover in the cluster.

    • Set up monitoring using Amazon CloudWatch to observe changes in instance roles and cluster health.

  4. Execute the Experiment:

    • Start the FIS experiment to initiate the failover.

    • Observe how Aurora handles the failover process.
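The IAM role from step 2 can be sketched as two JSON documents: a trust policy that lets FIS assume the role, and a permissions policy that allows the failover call. The snippet below is a minimal sketch, not a definitive policy. The cluster ARN and account ID are hypothetical placeholders, and depending on how targets are resolved the role may also need read permissions such as rds:DescribeDBClusters.

```python
import json

# Service principal that FIS uses when it assumes the experiment role.
FIS_SERVICE_PRINCIPAL = "fis.amazonaws.com"

# Hypothetical ARN, used only for illustration.
EXAMPLE_CLUSTER_ARN = "arn:aws:rds:us-east-1:111122223333:cluster:pg-db"


def build_trust_policy():
    """Trust policy that lets the FIS service assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": FIS_SERVICE_PRINCIPAL},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_failover_policy(cluster_arn):
    """Permissions policy that allows a failover on a single cluster."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["rds:FailoverDBCluster"],
                "Resource": [cluster_arn],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(build_failover_policy(EXAMPLE_CLUSTER_ARN), indent=2))
```

Scoping the Resource element to one cluster ARN keeps the experiment role from failing over any other cluster in the account.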

Expectation

  • Upon initiating the failover, the current writer instance should switch to a reader role, and one of the reader instances should be promoted to the writer role.

  • The failover should happen automatically with minimal impact on the availability of the database.

  • Application connectivity should be restored to the new writer instance seamlessly, provided it uses the cluster endpoint instead of instance endpoints.
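To put a number on "minimal impact", one option is to probe the cluster endpoint once per second while the experiment runs and measure the longest gap. The following is a rough sketch using only the standard library. The function names are illustrative, and a successful TCP connection only shows that the port is reachable, not that writes succeed, so treat the result as an approximation of downtime.

```python
import socket
import time


def can_connect(host, port=5432, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """Return the longest run of failed probes, in seconds.

    samples is a chronological list of (timestamp, ok) tuples. An outage
    runs from the first failed probe to the next successful one. A run
    that is still failing at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


def probe(host, duration=120.0, interval=1.0):
    """Probe the endpoint every `interval` seconds for `duration` seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), can_connect(host)))
        time.sleep(interval)
    return samples
```

A typical use would be to start probe() against the cluster (writer) endpoint just before starting the experiment, then pass its result to longest_outage().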

Experiment

Creating Aurora PostgreSQL DB cluster

Log in to your AWS account and go to the RDS service.

Click the Create Database option in the RDS console.

Under Configuration:

Choose Aurora (PostgreSQL compatible) and give an identifier to your DB cluster.

Choose the master username and provide a password for the cluster.

Optional: If you want to set up an EC2 connection, choose Connect to an EC2 compute resource and select the EC2 instance ID.

Click on Create Database.

Your database cluster is now being created.
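The same cluster can also be created from a script. The sketch below builds the request parameters for the RDS create_db_cluster and create_db_instance calls and sends them through a client that is passed in (for example, boto3.client("rds")). The instance class db.r6g.large and the helper names are assumptions for illustration. Networking, encryption, and other settings that the console fills in for you are left out.

```python
ENGINE = "aurora-postgresql"


def cluster_params(cluster_id, username, password):
    """Parameters for creating the Aurora PostgreSQL cluster itself."""
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": ENGINE,
        "MasterUsername": username,
        "MasterUserPassword": password,
    }


def instance_params(cluster_id, instance_id, instance_class="db.r6g.large"):
    """Parameters for adding one DB instance to the cluster.

    The instance class default is an assumption; pick one that is
    available for Aurora PostgreSQL in your Region.
    """
    return {
        "DBInstanceIdentifier": instance_id,
        "DBClusterIdentifier": cluster_id,
        "Engine": ENGINE,
        "DBInstanceClass": instance_class,
    }


def create_cluster(rds, cluster_id, username, password, instance_ids):
    """Create the cluster, then add its instances in order.

    `rds` is an RDS client object. The first instance added to a new
    Aurora cluster becomes the writer; later ones join as readers.
    """
    rds.create_db_cluster(**cluster_params(cluster_id, username, password))
    for instance_id in instance_ids:
        rds.create_db_instance(**instance_params(cluster_id, instance_id))
```

Passing two instance IDs gives the one-writer, one-reader layout used in this experiment.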

Create FIS Template to run the experiment

Go to AWS FIS service.

Click on Create Experiment Template.

Under template details:

Add a description, for example: DB Failover. You can also give your template a name.

Click Next.

Add Action tab:

Provide a name for the action and select aws:rds:failover-db-cluster under Action Type.

Click on Save.

Next, edit the target information. Provide a name for your target and add the resource ID of your database cluster.

Click on Save.

Click on Create Experiment template.
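For reference, the template built in the console corresponds roughly to the request body below, which could be passed to the FIS create_experiment_template API (for example, fis.create_experiment_template(**template) with a boto3 FIS client). This is a sketch under assumptions: the target and action names, "aurora-cluster" and "failover", are arbitrary labels chosen here, and the stop condition is set to "none", which means nothing halts the experiment automatically.

```python
import uuid


def build_failover_template(cluster_arn, role_arn, description="DB Failover"):
    """Build an FIS experiment template request for an Aurora failover."""
    return {
        # Idempotency token so a retried request does not create a duplicate.
        "clientToken": str(uuid.uuid4()),
        "description": description,
        "roleArn": role_arn,
        # "none" means no CloudWatch alarm stops the experiment early.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "aurora-cluster": {
                "resourceType": "aws:rds:cluster",
                "resourceArns": [cluster_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "failover": {
                "actionId": "aws:rds:failover-db-cluster",
                # Maps the action's "Clusters" slot to the target named above.
                "targets": {"Clusters": "aurora-cluster"},
            }
        },
    }
```

For anything beyond a test cluster, replacing the "none" stop condition with a CloudWatch alarm is a safer choice, since the alarm can stop the experiment if the system degrades further than expected.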

The experiment template is ready. Before we begin the experiment, let's look at the current setup of Aurora.

pg-db is the DB cluster identifier. pg-db-instance-1 is the writer instance and pg-db-instance-1-us-east-1c is the reader instance.
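The writer and reader roles shown in the console can also be read from the RDS describe_db_clusters response, where each cluster member carries an IsClusterWriter flag. The helper below is a small sketch that takes such a response (for example, from rds.describe_db_clusters(DBClusterIdentifier="pg-db") with a boto3 RDS client) and splits the members by role. The function name is illustrative.

```python
def cluster_roles(response):
    """Split the members of the first cluster in the response by role.

    `response` is the dictionary returned by describe_db_clusters.
    Returns a dict with the writer and reader instance identifiers.
    """
    members = response["DBClusters"][0]["DBClusterMembers"]
    writers = [m["DBInstanceIdentifier"] for m in members if m["IsClusterWriter"]]
    readers = [m["DBInstanceIdentifier"] for m in members if not m["IsClusterWriter"]]
    return {"writers": writers, "readers": readers}
```

Recording the result before the experiment and again afterwards gives a simple check that the two instances swapped roles.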

Now let us begin the experiment.

In the FIS experiment template tab, click on Start Experiment.

Confirm by clicking Start Experiment again.

Your experiment has started successfully.
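The console steps above can also be scripted. The sketch below starts an experiment from a template and polls its status until it reaches a terminal state. It expects an FIS client to be passed in (for example, boto3.client("fis")). The function name, the polling interval, and the set of terminal states checked are choices made for this example, and the loop is bounded by max_polls in case a status outside that set appears.

```python
import time
import uuid

# States after which the experiment status no longer changes.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def run_experiment(fis, template_id, poll_seconds=5.0, max_polls=60):
    """Start an FIS experiment and wait for it to finish.

    Returns a tuple of (experiment_id, last_observed_status).
    """
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    status = started["experiment"]["state"]["status"]

    for _ in range(max_polls):
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
        current = fis.get_experiment(id=experiment_id)
        status = current["experiment"]["state"]["status"]

    return experiment_id, status
```

A "completed" status only means that FIS issued the failover request successfully. Whether the cluster actually recovered still has to be checked on the RDS side.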

Observation

Let us go back to the RDS console and look for our database state.

The status of pg-db is now Failing over. We will wait a few seconds for the cluster to complete the failover.

As you can see, pg-db-instance-1-us-east-1c has been promoted to the Writer instance. We can also confirm the failover action by looking at the Logs and events section of the Aurora Cluster.
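The entries in the Logs and events section can also be fetched through the RDS describe_events API. The following is a minimal sketch that lists recent cluster events whose message mentions a failover. It expects an RDS client to be passed in (for example, boto3.client("rds")). The function name and the substring filter are assumptions made for this example, and pagination of the response is ignored for brevity.

```python
def failover_events(rds, cluster_id, minutes=60):
    """Return recent failover-related events for a cluster, oldest first.

    `minutes` is how far back to look, which maps to the Duration
    parameter of describe_events.
    """
    response = rds.describe_events(
        SourceIdentifier=cluster_id,
        SourceType="db-cluster",
        Duration=minutes,
    )
    matching = [
        event for event in response["Events"]
        if "failover" in event["Message"].lower()
    ]
    return sorted(matching, key=lambda event: event["Date"])
```

The difference between the timestamps of the first and last returned events gives a rough estimate of how long the failover took.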

During the experiment:

  • The failover process was initiated by the FIS experiment.

  • The Aurora cluster automatically promoted a reader instance to the writer role.

  • The previous writer instance was demoted to a reader role after the failover.

  • CloudWatch metrics displayed a temporary increase in latency, but connectivity to the database was restored promptly.

Result

The experiment validated Aurora’s failover mechanism:

  • The writer instance was successfully replaced by a reader instance during the failover.

  • Downtime was minimal, demonstrating Aurora’s high availability and fault tolerance.

  • The application continued functioning without requiring manual intervention, confirming the resilience of the architecture.

Conclusion

Conducting chaos experiments like this one on Aurora RDS clusters helps uncover hidden weaknesses and ensures that the system’s failover mechanisms operate as expected. By using AWS FIS, organizations can simulate real-world failure scenarios, build resilience, and maintain the reliability of their applications.