As organizations strive for high availability and resilience in their cloud environments, chaos engineering has emerged as a vital practice. AWS Fault Injection Simulator (FIS) is a managed service that lets DevOps teams test the resilience of their applications by injecting faults in a controlled manner. In this post, we’ll focus on testing instance stop and start scenarios using AWS FIS.
By simulating instance stops and starts, you can validate whether your application’s architecture handles such disruptions gracefully, with minimal downtime and prompt service recovery.
Prerequisites
Before you begin, ensure the following:
AWS Account: You need access to an AWS account with appropriate permissions.
Amazon EC2 Instances: Ensure you have two or more running EC2 instances hosting an nginx-based web application behind a load balancer. My previous FIS posts walk through creating this setup from scratch.
IAM Role: Create or update an IAM role that AWS FIS can assume, with permissions to stop and start the target EC2 instances.
FIS Template: Familiarity with AWS FIS templates is recommended, as we’ll create one for this experiment.
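To make the IAM prerequisite more concrete, here is a minimal sketch of the two policy documents such a role typically needs: a trust policy that lets the FIS service assume the role, and a permissions policy scoped to the test instances. The account ID, region, and instance IDs are placeholders, and your setup may need additional permissions, so treat this as a starting point rather than a complete policy.

```python
import json

# Trust policy: lets the AWS FIS service assume the experiment role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "fis.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def build_permissions_policy(instance_arns):
    """Return a policy that allows stopping and starting only the given instances."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:StopInstances", "ec2:StartInstances"],
                "Resource": list(instance_arns),
            },
            {
                # Describe calls do not support resource-level permissions.
                "Effect": "Allow",
                "Action": "ec2:DescribeInstances",
                "Resource": "*",
            },
        ],
    }


# Placeholder ARNs (hypothetical account and instance IDs).
EXAMPLE_ARNS = [
    "arn:aws:ec2:us-east-1:111122223333:instance/i-0aaaaaaaaaaaaaaaa",
    "arn:aws:ec2:us-east-1:111122223333:instance/i-0bbbbbbbbbbbbbbbb",
]

if __name__ == "__main__":
    print(json.dumps(TRUST_POLICY, indent=2))
    print(json.dumps(build_permissions_policy(EXAMPLE_ARNS), indent=2))
```

You can paste the printed JSON into the IAM console when creating the role, replacing the placeholder ARNs with those of your own instances.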
Step 1: Create an FIS template
Go to the AWS FIS console and click Create experiment template.
Add a description and name to your template.
Click Next.
For Actions, do the following:
Choose Add action.
Enter a name for the action. For example, enter stopInstance.
For Action type, choose aws:ec2:stop-instances.
For Target, keep the target that AWS FIS creates for you.
Under Action parameters, for Start instances after duration, specify 3 minutes (PT3M).
Choose Save.
For Targets, do the following:
Choose Edit for the target that AWS FIS automatically created for you in the previous step.
Verify that Resource type is aws:ec2:instance.
For Target method, choose Resource IDs, and then choose the IDs of the two test instances.
For Selection mode, choose Count. For Number of resources, enter 1.
Choose Save.
Choose Add target and do the following:
Enter a name for the target. For example, enter bothInstances.
For Resource type, choose aws:ec2:instance.
For Target method, choose Resource IDs, and then choose the IDs of the two test instances.
For Selection mode, choose All.
Choose Save.
From the Actions section, choose Add action. Do the following:
For Name, enter a name for the action. For example, enter stopBothInstances.
For Action type, choose aws:ec2:stop-instances.
For Start after, choose the first action that you added (stopInstance).
For Target, choose the second target that you added (bothInstances).
Under Action parameters, for Start instances after duration, specify 3 minutes (PT3M).
Choose Save.
Click on Save.
Your template is ready for testing.
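If you prefer to keep the template in code, the console steps above map roughly onto the request body below. This is a sketch, not an export of the console's output: the target name oneInstance, the description, the Name tag, and the "none" stop condition are my own placeholders (the console generates its own name for the first target), and the ARNs you pass in must be those of your test instances and IAM role.

```python
def build_template(instance_arns, role_arn):
    """Build an FIS experiment template body with the two stop-instances actions."""
    arns = list(instance_arns)
    return {
        "description": "Stop one instance, then stop both instances",
        "roleArn": role_arn,
        # Placeholder: no stop condition. In real use, consider a CloudWatch alarm.
        "stopConditions": [{"source": "none"}],
        "targets": {
            # First target: one instance picked from the two (Count = 1).
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": arns,
                "selectionMode": "COUNT(1)",
            },
            # Second target: both instances (All).
            "bothInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": arns,
                "selectionMode": "ALL",
            },
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT3M"},
                "targets": {"Instances": "oneInstance"},
            },
            "stopBothInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT3M"},
                "targets": {"Instances": "bothInstances"},
                # Runs only after the first action has finished.
                "startAfter": ["stopInstance"],
            },
        },
        "tags": {"Name": "stop-start-instances"},
    }


def create_template(template):
    """Create the template with boto3 and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so build_template works without boto3 installed

    response = boto3.client("fis").create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```

Calling create_template(build_template(arns, role_arn)) should give you a template equivalent to the one built in the console, assuming the role has the permissions described in the prerequisites.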
Start the experiment
Click Start experiment.
Add a Name tag to the experiment and click Start experiment again to confirm.
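The same step can be scripted. The sketch below starts an experiment from a template and polls its status until it finishes. It assumes you pass in a boto3 FIS client; the function name run_experiment, the polling interval, and the tag value are illustrative choices of mine.

```python
import time

# Statuses after which the experiment will not change state again.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis, template_id, name, poll_seconds=10, sleep=time.sleep):
    """Start an experiment from a template and poll until it reaches a terminal state."""
    started = fis.start_experiment(
        experimentTemplateId=template_id,
        tags={"Name": name},
    )
    experiment_id = started["experiment"]["id"]
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        print(f"{experiment_id}: {state['status']}")
        if state["status"] in TERMINAL_STATES:
            return state["status"]
        sleep(poll_seconds)


# Example usage (requires boto3, AWS credentials, and a real template ID):
#   import boto3
#   run_experiment(boto3.client("fis"), "<your-template-id>", "stop-start-test")
```

Expect the loop to run for several minutes, since each of the two actions keeps its instances stopped for 3 minutes before restarting them.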
Track the experiment progress
While the experiment runs, you may notice delays when accessing the load balancer's endpoint, and you could encounter a 504 Gateway Timeout error.
Running a script that measures latency at 3-second intervals may reveal occasional delays during the experiment. These delays occur when both instances are in a "not ready" state, that is, after the stopBothInstances action has stopped them and before they are back up and serving traffic again.
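A latency script of this kind can be quite small. Here is one possible version using only the Python standard library; it is a sketch, not the exact script behind the observations above. The function names, the 10-second timeout, and the output format are my own choices, and the URL you pass in should be your load balancer's endpoint.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Send one GET and return (status, seconds); status is an HTTP code or 'error'."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses, such as 504 Gateway Timeout, land here.
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, and client-side timeouts.
        status = "error"
    return status, time.monotonic() - start


def watch(url, interval=3.0, count=None, probe_fn=probe, sleep=time.sleep):
    """Probe url every `interval` seconds, print each result, and return them all."""
    results = []
    done = 0
    while count is None or done < count:
        status, elapsed = probe_fn(url)
        print(f"{time.strftime('%H:%M:%S')} status={status} latency={elapsed:.2f}s")
        results.append((status, elapsed))
        done += 1
        sleep(interval)
    return results


# Example usage (replace with your load balancer's DNS name):
#   watch("http://<your-load-balancer-endpoint>/")
```

With count left as None the loop runs until you interrupt it, which is convenient for leaving it open in a terminal next to the FIS console while the experiment runs.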
You can also monitor the status of the instances in the EC2 dashboard.
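If you would rather check instance states from a script than from the dashboard, something like the following should work. It assumes a boto3 EC2 client is passed in; the helper name instance_states is hypothetical.

```python
def instance_states(ec2, instance_ids):
    """Return {instance_id: state_name} for the given instances."""
    response = ec2.describe_instance_status(
        InstanceIds=list(instance_ids),
        # Without this flag, only running instances are returned.
        IncludeAllInstances=True,
    )
    return {
        item["InstanceId"]: item["InstanceState"]["Name"]
        for item in response["InstanceStatuses"]
    }


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   print(instance_states(boto3.client("ec2"), ["<instance-id-1>", "<instance-id-2>"]))
```

During the second action you would expect both instances to report stopping or stopped, then pending and running once FIS starts them again.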
Conclusion
AWS Fault Injection Simulator empowers teams to proactively identify weaknesses in their systems by simulating real-world failures. Testing instance stop and start scenarios helps you verify that your applications can withstand unplanned disruptions, building confidence in your architecture's resilience.