EC2 Spot Interruptions using Fault Injection Service

Spot Instances use spare EC2 capacity and can cost up to 90% less than On-Demand Instances. However, Amazon EC2 can interrupt your Spot Instances when it needs the capacity back, so any workload running on them must be prepared for interruptions.

You can use the AWS Fault Injection Service (AWS FIS) to test how your applications respond to a Spot Instance interruption. In this blog, we will create and run an experiment template that uses the AWS FIS aws:ec2:send-spot-instance-interruptions action to interrupt one of your Spot Instances. Watching the application during the experiment shows how it behaves under these conditions and where it needs adjustments to stay reliable when a real interruption occurs.

Setup

The setup consists of an Application Load Balancer (ALB) with a target group of two EC2 Spot Instances, each running a web server that handles user traffic. To assess the resilience of this configuration, we will simulate a Spot Instance interruption and observe how the load balancer redirects traffic. Specifically, we want to evaluate the application's behavior, response times, and any errors seen by end users when one or both instances are terminated unexpectedly.

Implementation

Create Launch Template
Create a launch template by selecting any name and description you prefer. I have used an existing AMI with an nginx server installed and a customized HTML page. You can choose a Linux-based OS and add user data scripts for a similar setup.

Next, choose a security group and a key pair for login. I specified an instance profile so that I can use Session Manager to log in to my private instances. Under purchasing options, select Spot Instances. Add any other necessary configuration and click on Create.
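If you prefer to script this step, here is a minimal sketch of the same launch template using boto3. The AMI ID, instance type, security group, instance profile name, and the user data script are all placeholders of mine, not values from this setup, and the user data assumes a dnf-based AMI such as Amazon Linux 2023.

```python
import base64

# Hypothetical user data: installs nginx and writes a simple page.
# Assumes a dnf-based AMI (for example Amazon Linux 2023).
USER_DATA = """#!/bin/bash
dnf install -y nginx
echo "<h1>Served from $(hostname -f)</h1>" > /usr/share/nginx/html/index.html
systemctl enable --now nginx
"""


def build_launch_template_data(image_id, instance_type,
                               security_group_id, instance_profile_name):
    """Build the LaunchTemplateData payload for a Spot launch template."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "SecurityGroupIds": [security_group_id],
        # Instance profile that allows Session Manager access.
        "IamInstanceProfile": {"Name": instance_profile_name},
        # Equivalent of selecting Spot Instances under purchasing options.
        "InstanceMarketOptions": {"MarketType": "spot"},
        # The EC2 API expects user data to be base64-encoded.
        "UserData": base64.b64encode(USER_DATA.encode()).decode(),
    }


def create_launch_template(name, data):
    """Create the launch template (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    ec2 = boto3.client("ec2")
    return ec2.create_launch_template(
        LaunchTemplateName=name, LaunchTemplateData=data
    )
```

Keeping the payload builder separate from the API call lets you review or print the configuration before anything is created in your account.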

Create Auto Scaling Group
Go to the EC2 console. Click on Auto Scaling Groups, then click on Create. Provide a name for the Auto Scaling group and select the launch template.

Select your VPC and other network configurations. You can skip attaching to any load balancer for now. For group sizing and scaling, configure the desired settings according to your needs.
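The same group can be sketched with boto3 as below. The group name, launch template name, and subnet IDs are placeholders, and the maximum size (twice the desired capacity) is my own assumption. Size the group to match your needs.

```python
def build_asg_params(name, launch_template_name, subnet_ids, desired=2):
    """Build the arguments for create_auto_scaling_group."""
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": desired,
        "MaxSize": desired * 2,  # assumption: headroom for replacements
        "DesiredCapacity": desired,
        # The API takes the subnets as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "Tags": [
            {"Key": "Name", "Value": name, "PropagateAtLaunch": True},
        ],
    }


def create_group(params):
    """Create the Auto Scaling group (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("autoscaling").create_auto_scaling_group(**params)
```

No load balancer is attached here, which matches the console flow where the target group is attached in a later step.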

Set the desired capacity to 2 so that the group launches two instances. Add relevant tags and click on Create Auto Scaling group.

Create an Application Load Balancer

Go to the load balancer section in the EC2 console's left menu and click on "Create Load Balancer."

For network mapping, choose your VPC and attach public subnets.

Create a new security group for the load balancer, then allow it as a source in the inbound rules of the security group attached to the private EC2 instances.

Create a new target group with the type set to instances. You don’t need to attach any instances yet.
Click on "Create Load Balancer."

Attach the load balancer to your Auto Scaling group
Go to the Auto Scaling group you just created. Click on Integrations and edit the load balancing section. Choose Application, Network, or Gateway Load Balancer target groups, and select the target group you created.

Click on Update.

Verify targets of your target group

The target group now runs health checks on the registered instances.

Once the targets are healthy, open the load balancer DNS name in a new browser tab to confirm that the web page loads.
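You can also check target health from a script instead of the console. The sketch below filters the response of the boto3 elbv2 describe_target_health call. The target group ARN you pass in is your own, and the helper names are mine.

```python
def unhealthy_targets(response):
    """Return the IDs of targets whose health state is not 'healthy'."""
    return [
        desc["Target"]["Id"]
        for desc in response["TargetHealthDescriptions"]
        if desc["TargetHealth"]["State"] != "healthy"
    ]


def fetch_target_health(target_group_arn):
    """Call describe_target_health (needs boto3 and AWS credentials)."""
    import boto3

    elbv2 = boto3.client("elbv2")
    return elbv2.describe_target_health(TargetGroupArn=target_group_arn)
```

An empty list from unhealthy_targets(fetch_target_health(arn)) means every registered instance is passing its health checks and the setup is ready for the experiment.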

Create an FIS template

Go to AWS FIS console.
Click on Create experiment template. Add a Description and Name and click Next.

For Actions, do the following:

Choose Add action.
Enter a name for the action. For example, enter interruptSpotInstance.
For Action type, choose aws:ec2:send-spot-instance-interruptions.
For Target keep the target that AWS FIS creates for you.
For Action parameters, Duration before interruption, specify 2 Minutes (PT2M).
Choose Save.

For Targets, do the following:

Choose Edit for the target that AWS FIS automatically created for you in the previous step.
Replace the default name with a more descriptive name. For example, enter oneSpotInstance.
Verify that Resource type is aws:ec2:spot-instance.
For Target method, choose Resource IDs.
For Resource IDs, choose the instance ID of one of your Spot Instances.
For Selection mode, choose Count. For Number of resources, enter 1.
Choose Save.

Choose an IAM role that gives AWS FIS permission to run the experiment on your instances. Click Next, review the template, and create it.
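The console steps above map to a single experiment template definition. Here is a sketch of that template as a boto3 request. The action and target names, the PT2M duration, and the Count selection mode come from the steps above, while the description, the Name tag value, the stop condition of "none", and the ARNs you pass in are placeholders I added.

```python
def build_experiment_template(instance_arn, role_arn):
    """Build the arguments for the FIS create_experiment_template call."""
    return {
        "description": "Interrupt one Spot Instance",
        "roleArn": role_arn,
        # Assumption: no CloudWatch alarm is used as a stop condition.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "oneSpotInstance": {
                "resourceType": "aws:ec2:spot-instance",
                "resourceArns": [instance_arn],
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "interruptSpotInstance": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                # ISO 8601 duration: interrupt after two minutes.
                "parameters": {"durationBeforeInterruption": "PT2M"},
                "targets": {"SpotInstances": "oneSpotInstance"},
            }
        },
        "tags": {"Name": "spot-interruption-test"},
    }


def create_template(template):
    """Create the template (needs boto3 and AWS credentials)."""
    import uuid

    import boto3

    fis = boto3.client("fis")
    return fis.create_experiment_template(
        clientToken=str(uuid.uuid4()), **template
    )
```

Note that the API identifies the target by ARN, for example arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0 (a made-up value), where the console lets you pick an instance ID.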

Start Experiment

Now that your experiment template is ready, click on Start experiment.

Add a Name tag for your experiment, then click on Start experiment.
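To start the experiment from a script and wait for it to finish, something like the following sketch should work. It uses the boto3 FIS start_experiment and get_experiment calls. The tag value, polling interval, and the set of statuses I treat as final are my own assumptions, so check them against the current FIS documentation.

```python
import time

# Assumption: statuses after which the experiment no longer changes.
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def wait_for_experiment(get_status, poll_seconds=10, max_polls=60,
                        sleep=time.sleep):
    """Poll get_status() until it returns a terminal status."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("experiment did not finish in time")


def run_experiment(template_id):
    """Start the experiment and wait (needs boto3 and AWS credentials)."""
    import boto3

    fis = boto3.client("fis")
    started = fis.start_experiment(
        experimentTemplateId=template_id,
        tags={"Name": "spot-interruption-run"},  # hypothetical tag value
    )
    experiment_id = started["experiment"]["id"]
    return wait_for_experiment(
        lambda: fis.get_experiment(id=experiment_id)
        ["experiment"]["state"]["status"]
    )
```

With the two-minute delay configured in the template, expect the experiment to stay in the running state for at least that long.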

Observations

The target Spot Instance gets an instance rebalance recommendation.
A Spot Instance interruption notice is sent two minutes before Amazon EC2 terminates or stops your instance.
After two minutes, the Spot Instance is terminated or stopped.
A Spot Instance stopped by AWS FIS stays stopped until you restart it.
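An application can use those two minutes to drain connections or save state. On the instance itself, the interruption notice is exposed through the instance metadata service at spot/instance-action. The sketch below reads it with IMDSv2 and works out how much time is left. The function names are mine, and I am assuming the notice has the documented shape of an action plus a UTC timestamp.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def seconds_until_interruption(body, now):
    """Parse an instance-action notice; return (action, seconds left)."""
    notice = json.loads(body)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


def fetch_instance_action():
    """Return the notice body, or None if no interruption is pending.

    Only works when run on an EC2 instance.
    """
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_request, timeout=2).read().decode()
    request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(request, timeout=2).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice yet
            return None
        raise
```

A small loop that calls fetch_instance_action() every few seconds and starts a graceful shutdown once it returns a notice is usually enough for a web server behind a load balancer.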

The experiment is complete.
Go to the EC2 console to check the status of the running instances.

The interrupted instance is stopped or terminated, depending on its Spot interruption behavior, and the Auto Scaling group launches a new instance to restore capacity.

From the navigation pane, open Spot Requests.

The Spot request shows the status instance-terminated-by-experiment.

How does your application respond while the experiment is running?

Go to the load balancer and check the response time under monitoring.

You will see a spike in response time after the experiment is done.
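The load balancer metrics show the server side. To see what an end user experiences, you can also probe the load balancer DNS name from your own machine while the experiment runs. This is a rough sketch using only the standard library. The URL, timeout, and summary fields are my own choices.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=5):
    """Send one GET request; return (status or None, latency in seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # for example 502 or 503 from the load balancer
    except OSError:
        status = None  # connection error or timeout
    return status, time.monotonic() - start


def summarize(samples):
    """Count failed requests and find the slowest successful one."""
    ok = [latency for status, latency in samples if status == 200]
    return {
        "requests": len(samples),
        "errors": len(samples) - len(ok),
        "max_latency": max(ok, default=None),
    }
```

For example, collect samples with [probe("http://<your-alb-dns-name>/") for _ in range(300)], adding a short sleep between calls, and pass the list to summarize() once the experiment is complete.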

Conclusion

This chaos experiment demonstrated the resilience of an ALB-backed architecture using Spot instances. While the system managed partial outages effectively, full outages highlighted areas for improvement. By incorporating the recommendations above, the architecture can achieve greater reliability and user satisfaction even during Spot interruptions.