Chaos Experiment on EC2 using AWS Fault Injection Simulator (FIS)

Introduction

Chaos engineering is a discipline that focuses on improving system resilience by simulating failures in a controlled environment. AWS Fault Injection Simulator (FIS) enables engineers to run chaos experiments on AWS infrastructure, providing insights into system behavior under stress.

In this post, we will perform a chaos experiment on an Auto Scaling Group of two EC2 instances hosting a web server. The experiment induces CPU stress on one of the instances to evaluate the impact on user experience and latency.

Experiment Setup

Architecture

Web Server: Hosted on two EC2 instances as part of an Auto Scaling Group.
Load Balancer: Distributes traffic evenly between instances to ensure high availability.
Monitoring: Bash script to measure and log website latency during the experiment.

Chaos Experiment Steps

Environment Preparation:
- Ensure the web server is running and accessible via the load balancer.
- Verify the Auto Scaling Group is set up with two instances.
- Deploy a monitoring script on a client machine to track latency.
Inject Fault:
- Use AWS FIS to induce CPU stress on one EC2 instance.
- Configure an FIS template to execute the CPU stress action for a specific duration.
Monitor System:
- Measure website accessibility and latency during the experiment.
- Observe Auto Scaling Group behavior (e.g., scaling or instance replacement).
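
The preparation check on the Auto Scaling Group can also be scripted instead of verified by eye in the console. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials; the function name count_in_service and the group name my-web-asg are hypothetical and not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: count the InService instances in an Auto Scaling Group.
# Assumptions: the AWS CLI is installed and configured; the group name
# passed in is a placeholder for your own group.

count_in_service() {
    local asg_name="$1"
    # Ask only for the IDs of InService instances, one tab-separated line,
    # then count the entries that look like instance IDs.
    aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "$asg_name" \
        --query 'AutoScalingGroups[0].Instances[?LifecycleState==`InService`].InstanceId' \
        --output text | tr '\t' '\n' | grep -c '^i-' || true
}

# Usage (hypothetical group name):
#   [ "$(count_in_service my-web-asg)" -eq 2 ] && echo "ready for the experiment"
```

Running the same check again after the experiment is one way to see whether the group replaced an instance.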

Expectation

User Experience: The website should remain accessible with minimal disruption.
Latency: A slight increase in latency might be observed during the experiment due to load redistribution.
Resilience: The load balancer should keep routing traffic to healthy instances, and the Auto Scaling Group should launch a replacement if the stressed instance becomes unresponsive and fails its health checks.

Performing the Experiment

First, create an experiment template. Go to the AWS FIS service in the AWS console.

Click on Create experiment template.

Under Specify template details, add a description and name. Choose the account you want to target, and click Next.

Now add an Action. Give it a name that reflects what the action does, for example cpu-stress. In Action type, search for aws:ssm:send-command/AWSFIS-Run-CPU-Stress.

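Under Action Parameters, enter the Document ARN as arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress (adjust the Region if yours differs) and the Document Parameters as {"DurationSeconds":"120"}. Set the action duration to 15 minutes and click Save. DurationSeconds controls how long the CPU stress itself runs (120 seconds here), while the action duration is the maximum time FIS allows the action to run.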

Now add a target. Provide a name for the target and choose the instance ID of one of the instances running your application. FIS will apply the CPU stress to this instance.

Click on Save. Review everything and click on Create template.
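
For repeatable runs, the console steps above can be expressed as a JSON template and created from the AWS CLI. The following is a rough sketch rather than the exact template the console produces: the function name render_fis_template, the target name web-instance, and both ARNs are placeholders, and it assumes an IAM role that FIS can assume already exists and that the instance is managed by SSM.

```shell
#!/usr/bin/env bash
# Sketch: print an FIS experiment template as JSON, mirroring the console steps.
# Assumptions: the instance ARN and role ARN are placeholders supplied by the
# caller; the Region in the document ARN matches the one used in this post.

render_fis_template() {
    local instance_arn="$1" role_arn="$2" stress_seconds="${3:-120}"
    cat <<EOF
{
  "description": "CPU stress on one web server instance",
  "targets": {
    "web-instance": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": ["${instance_arn}"],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "cpu-stress": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
        "documentParameters": "{\"DurationSeconds\":\"${stress_seconds}\"}",
        "duration": "PT15M"
      },
      "targets": { "Instances": "web-instance" }
    }
  },
  "stopConditions": [ { "source": "none" } ],
  "roleArn": "${role_arn}"
}
EOF
}

# Usage (placeholder ARNs):
#   render_fis_template "$INSTANCE_ARN" "$ROLE_ARN" > template.json
#   aws fis create-experiment-template --cli-input-json file://template.json
```

Keeping the template in a file makes it easy to review and to rerun the same experiment later. The stop condition is set to none here only to match the walkthrough; a CloudWatch alarm is the usual choice outside a demo.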

To track latency during the experiment, we will run a script that requests the website once per second through the load balancer and logs the HTTP status code and response time, flagging any request that exceeds a threshold:

start_time=$(date +%s)
duration=$((15 * 60))  # 15 minutes in seconds

while true; do
    current_time=$(date +%s)
    elapsed_time=$((current_time - start_time))

    # Break the loop if the elapsed time exceeds the duration
    if [ "$elapsed_time" -ge "$duration" ]; then
        break
    fi

    # Make the curl request and capture response details
    timestamp=$(date +"%Y-%m-%d %H:%M:%S")
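    # Replace <load-balancer-dns> with your load balancer's DNS name; the URL is
    # quoted so the shell does not treat < and > as redirections
    result=$(curl -o /dev/null -s -w "%{http_code} %{time_total}" "http://<load-balancer-dns>")
    http_code=$(echo "$result" | awk '{print $1}')
    latency=$(echo "$result" | awk '{print $2}')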

    # Check if latency exceeds 0.01 seconds
    if (( $(echo "$latency > 0.01" | bc -l) )); then
        echo "$timestamp $result (Exceeded)" >> latency.log
    else
        echo "$timestamp $result" >> latency.log
    fi

    sleep 1
done

echo "Monitoring completed after 15 minutes."

Save this script as script.sh and run it from your local machine or a remote server:

bash script.sh

Now that everything is ready, we can start the experiment. Click on Start experiment.

Provide a name under tags to the experiment and click on Start experiment.

FIS now begins injecting the fault into your EC2 instance.
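
If you prefer the command line, the experiment state can be followed from a terminal instead of refreshing the console. This is a minimal sketch, assuming the AWS CLI is configured; the function name wait_for_experiment and the variables in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: poll an FIS experiment until it reaches a terminal state.
# Assumptions: the AWS CLI is configured; the experiment ID is a placeholder.

wait_for_experiment() {
    local experiment_id="$1" interval="${2:-15}" status
    while true; do
        # Read only the status field of the experiment
        status=$(aws fis get-experiment --id "$experiment_id" \
            --query 'experiment.state.status' --output text) || return 1
        echo "$(date +%H:%M:%S) $status"
        case "$status" in
            completed|stopped|failed|cancelled) break ;;
        esac
        sleep "$interval"
    done
}

# Usage (placeholder template ID):
#   id=$(aws fis start-experiment --experiment-template-id "$TEMPLATE_ID" \
#        --query 'experiment.id' --output text)
#   wait_for_experiment "$id"
```

Each printed line carries a timestamp, which helps when lining up the experiment state with entries in latency.log.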

Go to the EC2 service console and check the CPU Utilization of the instance you added as a target.

You should see a spike in the CPU utilization graph. Let us compare it with the other instance.

The CPU utilization of the other instance is stable.
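
The same comparison can be made numerically by pulling the CPUUtilization metric from CloudWatch for each instance. The following sketch assumes the AWS CLI is configured; the function name peak_cpu, the instance IDs, and the timestamps in the usage comment are placeholders. It uses a 300-second period because instances with basic monitoring report at five-minute granularity.

```shell
#!/usr/bin/env bash
# Sketch: print the peak CPUUtilization of an instance over a time window.
# Assumptions: the AWS CLI is configured; instance ID and times are placeholders.
# Basic monitoring reports every 5 minutes, hence the 300-second period.

peak_cpu() {
    local instance_id="$1" start_time="$2" end_time="$3"
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 --metric-name CPUUtilization \
        --dimensions "Name=InstanceId,Value=${instance_id}" \
        --start-time "$start_time" --end-time "$end_time" \
        --period 300 --statistics Maximum \
        --query 'Datapoints[].Maximum' --output text |
        tr '\t' '\n' |
        # Keep the largest numeric value; prints 0 when there are no datapoints
        awk '$1 ~ /^[0-9.]+$/ && $1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}

# Usage (placeholder IDs and times, in UTC):
#   peak_cpu i-0123456789abcdef0 2024-01-01T17:25:00Z 2024-01-01T17:40:00Z
```

Calling it once for the targeted instance and once for the other gives two numbers to compare directly.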

The experiment concluded within 15 minutes. The peak stress was observed at around 17:33.

Now let us check the output of the script to examine the lag in accessing the website during the experiment.
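
Reading fifteen minutes of one-second samples by eye is tedious, so a short summary helps. The sketch below assumes the log format written by the monitoring script above (date, time, HTTP code, latency, and an optional "(Exceeded)" marker); the function name summarize_latency_log is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: summarize latency.log as written by the monitoring script.
# Assumed line format: DATE TIME HTTP_CODE LATENCY [(Exceeded)]

summarize_latency_log() {
    local log_file="${1:-latency.log}"
    awk '
        { total++; sum += $4 }                  # every sample
        $3 != "200" { errors++ }                # non-200 responses
        $4 + 0 > max { max = $4 + 0 }           # slowest request
        $5 == "(Exceeded)" { exceeded++ }       # samples over the threshold
        END {
            if (total == 0) { print "no samples"; exit }
            printf "samples=%d errors=%d exceeded=%d avg=%.3f max=%.3f\n",
                total, errors, exceeded, sum / total, max
        }
    ' "$log_file"
}

# Usage:
#   summarize_latency_log latency.log
```

A non-zero errors count would mean users saw failed requests, while exceeded and max show how far latency drifted during the stress window.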

Observation

During the chaos experiment:

Accessibility: Users were able to access the website without encountering errors.
Latency: The monitoring script logged an increase in response time during the initial stress period. After load redistribution, latency normalized.
Auto Scaling Group Behavior: The group detected the stressed instance’s degraded performance and initiated recovery actions, such as replacing the instance or balancing the load effectively.

Conclusion

This experiment demonstrated the resilience of a well-architected AWS environment. The Auto Scaling Group ensured uninterrupted service despite the CPU stress on one instance.

Key Takeaways:

Chaos experiments like this validate system robustness and help identify weak points.
AWS FIS provides a reliable platform to simulate real-world failures.
Continuous monitoring and automation are critical for maintaining user experience during unexpected scenarios.

By regularly conducting chaos experiments, organizations can build confidence in their systems and deliver reliable services to their users.
