Chaos Testing an S3 Bucket with AWS Fault Injection Service

In this post, we will use AWS Fault Injection Service (FIS) to test S3 impairment in two Availability Zones, AZ1 and AZ2.

Architecture Diagram:

Architecture Overview:

  1. Web Application:

    • Runs on NGINX as a webserver on 2 EC2 instances.

    • Part of an Auto Scaling Group (ASG) across 2 Availability Zones (AZs) for high availability.

  2. S3 Storage:

    • Stores the data (webpage or content) to be fetched when the button is clicked.
  3. AWS Load Balancer:

    • An Application Load Balancer (ALB) routes traffic to the EC2 instances.
  4. IAM Role:

    • EC2 instances are assigned an IAM Role with permissions to access the S3 bucket.

Implementation Steps:

1. S3 Bucket:

  • Create an S3 bucket and upload the content (e.g., a webpage or file).

  • Make sure the bucket is private but accessible using the IAM role.

2. IAM Role and Instance Profile:

  • Create an IAM role with the policy allowing access to S3:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::webcontent-01/*"
          }
        ]
      }
    
  • Attach the IAM role to the EC2 instance profile.
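The same role setup can be scripted. The following is a minimal sketch, assuming boto3 is installed and the caller has IAM permissions. The role name, profile name, and the inline policy name `s3-read` are hypothetical; the trust policy is the standard one that lets EC2 assume a role.

```python
import json

BUCKET = "webcontent-01"  # bucket name used in the policy above


def s3_read_policy(bucket: str) -> dict:
    """Allow GetObject on every key in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


def ec2_trust_policy() -> dict:
    """Let the EC2 service assume the role on behalf of an instance."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role_and_profile(role_name: str, profile_name: str, bucket: str) -> None:
    """Create the role, attach the inline policy, and wrap it in an instance profile."""
    import boto3  # imported here so the policy helpers work without the SDK

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="s3-read",  # hypothetical policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket)),
    )
    iam.create_instance_profile(InstanceProfileName=profile_name)
    iam.add_role_to_instance_profile(
        InstanceProfileName=profile_name, RoleName=role_name
    )
```

Calling `create_role_and_profile("web-s3-role", "web-s3-profile", BUCKET)` (names are placeholders) would produce a profile you can select in the launch template.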

3. Web Server (NGINX):

  • Install and configure NGINX on both EC2 instances.

  • The NGINX configuration includes:

    • A homepage (index.html) with a button to fetch data.

    • A backend handler (using a CGI script or reverse proxy) to fetch the content from S3.

Example NGINX config: (/etc/nginx/nginx.conf)

server {
    listen 80;
    server_name localhost;

    location / {
        root /var/www/html;
        index index.html;
    }

    location /fetch-data {
        proxy_pass http://127.0.0.1:8080;  # Backend handler (e.g., Python/Node.js)
    }
}

4. Backend Script:

  • Implement a simple script to fetch data from S3 (e.g., Python using boto3).

  • Deploy the script on both EC2 instances and ensure it listens on port 8080.

Example Python script (fetch_data.py):

from flask import Flask, jsonify
import boto3

app = Flask(__name__)
# Credentials come from the IAM role attached to the instance profile.
s3_client = boto3.client('s3')

# NGINX forwards /fetch-data/* unchanged, so the route keeps the full path.
@app.route('/fetch-data/fetch', methods=['GET'])
def fetch_data():
    bucket_name = "webcontent-01"
    key = "file.txt"
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=key)
        content = response['Body'].read().decode('utf-8')
        return jsonify({"content": content})
    except Exception as e:
        # Any S3 or network failure is reported to the page as HTTP 500.
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

You can create a systemd service file for running this application as a daemon service.

Navigate to the /etc/systemd/system/ directory:

cd /etc/systemd/system/

Create a new service file (e.g., fetch-data.service):

sudo nano fetch-data.service

Add the following content to the service file:

[Unit]
Description=Fetch Data Python Application
After=network.target

[Service]
User=root
Group=root
ExecStart=/usr/bin/python3 /etc/systemd/system/fetch_data.py
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

Reload systemd to recognize the new service:

sudo systemctl daemon-reload

Start the service:

sudo systemctl start fetch-data.service

Enable the service to start at boot:

sudo systemctl enable fetch-data.service

Check the service status:

sudo systemctl status fetch-data.service

Your service is now running in the background.
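To confirm the service actually answers, a small smoke check can call the route directly on the instance. This is a sketch using only the standard library; the helper names `interpret` and `check` are mine, and the URL assumes the service listens on port 8080 as configured above.

```python
import json
import urllib.error
import urllib.request


def interpret(status: int, body: bytes) -> str:
    """Summarise a response from the /fetch-data/fetch route."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return f"HTTP {status}: response was not JSON"
    if not isinstance(payload, dict):
        return f"HTTP {status}: unexpected JSON shape"
    if status == 200 and "content" in payload:
        return f"OK: fetched {len(payload['content'])} characters from S3"
    return f"HTTP {status}: {payload.get('error', 'unknown error')}"


def check(url: str = "http://127.0.0.1:8080/fetch-data/fetch",
          timeout: float = 5.0) -> str:
    """Call the backend once and describe what came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # Non-2xx answers, e.g. the 500 the script returns on S3 errors.
        return interpret(err.code, err.read())
    except (urllib.error.URLError, OSError) as err:
        # Service not running, connection refused, or timed out.
        return f"unreachable: {err}"
```

Running `print(check())` on each instance before the experiment gives a baseline to compare against once the fault is injected.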

5. Load Balancer:

  • Set up an Application Load Balancer (ALB) with target groups pointing to the EC2 instances.

  • Configure health checks for the instances.
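The health check path matters for this experiment: a check against `/` is answered by NGINX from local disk, so it keeps passing even when the instance cannot reach S3, whereas a check against `/fetch-data/fetch` would fail along with S3. The sketch below builds the target group parameters for boto3's `create_target_group`; the group name, thresholds, and intervals are illustrative assumptions, not values from this setup.

```python
def target_group_params(vpc_id: str, health_path: str = "/") -> dict:
    """Keyword arguments for elbv2.create_target_group."""
    return {
        "Name": "web-tg",  # hypothetical name
        "Protocol": "HTTP",
        "Port": 80,
        "VpcId": vpc_id,
        "TargetType": "instance",
        "HealthCheckProtocol": "HTTP",
        "HealthCheckPath": health_path,
        "HealthCheckIntervalSeconds": 15,  # illustrative values
        "HealthCheckTimeoutSeconds": 5,
        "HealthyThresholdCount": 2,
        "UnhealthyThresholdCount": 2,
        "Matcher": {"HttpCode": "200"},
    }


def create_target_group(vpc_id: str, health_path: str = "/") -> str:
    """Create the target group and return its ARN."""
    import boto3  # imported here so the parameter helper works without the SDK

    elbv2 = boto3.client("elbv2")
    resp = elbv2.create_target_group(**target_group_params(vpc_id, health_path))
    return resp["TargetGroups"][0]["TargetGroupArn"]
```

Which path to use is a design choice: checking `/` keeps impaired instances in rotation so you can observe the failure, while checking the S3-backed route lets the ALB steer traffic away from them.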

6. Auto Scaling Group:

  • Configure an Auto Scaling Group (ASG) for the EC2 instances across 2 AZs.

  • Set the group's capacity limits (e.g., minimum 2, maximum 4 instances) and any scaling policies you need.

7. Deploy the Application:

  • Deploy the application using a tool like Terraform, AWS CDK, or CloudFormation.

8. Homepage (index.html):

  • Create a basic HTML page with a button to fetch data:

      <!DOCTYPE html>
      <html>
      <head>
          <title>Fetch S3 Data</title>
          <script>
              function fetchData() {
                  fetch('/fetch-data/fetch')
                      .then(response => response.json())
                      .then(data => {
                          document.getElementById("content").innerText = data.content ?? ('Error: ' + data.error);
                      })
                      .catch(error => console.error('Error fetching data:', error));
              }
          </script>
      </head>
      <body>
          <h1>Welcome!</h1>
          <button onclick="fetchData()">Fetch Data</button>
          <div id="content"></div>
      </body>
      </html>
    

Once one instance is configured, you can create an image (AMI) of it and reference that AMI in a launch template for the Auto Scaling Group.

I created a launch template from this AMI and used it to create the Auto Scaling Group. Set both the minimum and the desired capacity to 2 so that two instances start in different Availability Zones.
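For reference, the Auto Scaling Group described above could be created with boto3 roughly as follows. This is a sketch under assumptions: the group name is hypothetical, the launch template name, subnet IDs, and target group ARN are whatever you created earlier, and the capacity numbers mirror the minimum 2 and maximum 4 used in this post.

```python
def asg_params(launch_template_name: str, subnet_ids: list, target_group_arn: str) -> dict:
    """Keyword arguments for autoscaling.create_auto_scaling_group."""
    if len(subnet_ids) < 2:
        raise ValueError("pass one subnet per Availability Zone (at least two)")
    return {
        "AutoScalingGroupName": "web-asg",  # hypothetical name
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": 2,
        "MaxSize": 4,
        "DesiredCapacity": 2,
        # Subnets are passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
    }


def create_asg(launch_template_name: str, subnet_ids: list, target_group_arn: str) -> None:
    """Create the group; instances are spread across the given subnets."""
    import boto3  # imported here so the parameter helper works without the SDK

    autoscaling = boto3.client("autoscaling")
    autoscaling.create_auto_scaling_group(
        **asg_params(launch_template_name, subnet_ids, target_group_arn)
    )
```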


Workflow:

  1. User Interaction:

    • User accesses the application through the ALB.

    • The homepage is served by NGINX.

  2. Button Click:

    • A click on the button triggers a fetch request to the backend.
  3. Data Fetching:

    • The backend script fetches content from the S3 bucket using the IAM role.

    • The response is sent back to the frontend.

  4. Display:

    • The fetched data is displayed dynamically on the webpage.

When the Fetch Data button is clicked, the backend script retrieves the content from the S3 bucket and the page displays it.

Now let us create an FIS experiment template that injects a fault in AZ2.

Go to AWS FIS, and click on Create experiment template.

Specify template details

  1. For Description and name, enter a description and a name for the template, such as S3AZDisrupt.

  2. Choose Next, and move to Step 2, Specify actions and targets.

  3. Under Actions, choose Add action.

    1. For Name, enter S3AZDisrupt.

    2. For Action type, select aws:network:disrupt-connectivity.

    3. Under Action parameters, set the Duration to 2 minutes.

    4. Under Scope, select s3.

    5. At the top, choose Save.

  4. Under Targets, you should see the target that has been created automatically. Choose Edit.

    1. Verify that Resource type is aws:ec2:subnet.

    2. Under Target method, select Resource IDs, and then choose the subnet that you used when creating your Amazon EC2 instance.

    3. Verify that Selection mode is All.

    4. Choose Save.
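The console steps above correspond to a template document that can also be created through the API. The sketch below builds the arguments for boto3's `fis.create_experiment_template`. The target key `az-subnet`, the `Name` tag, and the absence of a stop condition are my assumptions; the API also requires an IAM role ARN for FIS to assume, which is not covered in this post.

```python
import uuid


def s3_disrupt_template(subnet_arn: str, role_arn: str, duration: str = "PT2M") -> dict:
    """Arguments for fis.create_experiment_template (S3 disruption in one subnet)."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "S3AZDisrupt",
        "roleArn": role_arn,  # role that FIS assumes; an assumption, see lead-in
        "stopConditions": [{"source": "none"}],  # no CloudWatch alarm guard
        "targets": {
            "az-subnet": {  # hypothetical target name
                "resourceType": "aws:ec2:subnet",
                "resourceArns": [subnet_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "S3AZDisrupt": {
                "actionId": "aws:network:disrupt-connectivity",
                # Duration is ISO 8601; PT2M is the 2 minutes set in the console.
                "parameters": {"duration": duration, "scope": "s3"},
                "targets": {"Subnets": "az-subnet"},
            }
        },
        "tags": {"Name": "S3AZDisrupt"},
    }


def create_template(subnet_arn: str, role_arn: str) -> str:
    """Create the template and return its ID."""
    import boto3  # imported here so the builder works without the SDK

    fis = boto3.client("fis")
    resp = fis.create_experiment_template(**s3_disrupt_template(subnet_arn, role_arn))
    return resp["experimentTemplate"]["id"]
```

Building the template from a function makes it easy to target the other Availability Zone later: call it again with the second subnet's ARN.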

To run the experiment, choose Start experiment and add a Name tag so you can identify the run later.

Your experiment has started.
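If you prefer to start the run from a script and wait for it to finish, something like the following could work. It is a sketch: `fis` is expected to be a `boto3.client("fis")`, the tag value is hypothetical, and the set of in-progress status names reflects my reading of the FIS API, so it is worth checking against the current documentation.

```python
import time
import uuid

# Statuses that mean the experiment has not reached a final state yet (assumed).
IN_PROGRESS = {"pending", "initiating", "running", "stopping"}


def run_experiment(fis, template_id: str, poll_seconds: float = 10.0,
                   sleep=time.sleep) -> str:
    """Start an experiment from a template and return its final status."""
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
        tags={"Name": "S3AZDisrupt-run"},  # hypothetical tag value
    )
    experiment_id = started["experiment"]["id"]
    while True:
        experiment = fis.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status not in IN_PROGRESS:
            return status
        sleep(poll_seconds)
```

With a 2 minute action, polling every 10 seconds is plenty; the `sleep` parameter exists only so the loop can be exercised without waiting.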

While the experiment runs, open the website and click the Fetch Data button to check whether the request still works.

You should still be able to get the result.

Now let’s change the subnet in the template to cause impairment in the other Availability Zone.

Click on Save and Start Experiment.

You can verify connectivity to the S3 bucket by pinging it from both instances.

Instance1 is starting to lose packets. You might observe results similar to the following when attempting to ping the S3 bucket:

On the other instance, the results of pinging the S3 bucket would show consistent responses without any packet loss, similar to this:
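As a complement to ping, a TCP probe on port 443 exercises the same kind of connection the backend script makes to S3. The sketch below counts successful and failed connection attempts; the helper names are mine, and the host you pass (for example the regional endpoint `s3.<region>.amazonaws.com` with your region filled in) is an assumption about your setup.

```python
import socket
import time


def tcp_probe(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timed out
        return False


def probe_loop(host: str, attempts: int = 10, interval: float = 1.0,
               probe=tcp_probe, sleep=time.sleep) -> dict:
    """Probe the host repeatedly and summarise how many attempts failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    ok = 0
    for i in range(attempts):
        if probe(host):
            ok += 1
        if i < attempts - 1:
            sleep(interval)  # pause between attempts, not after the last one
    failed = attempts - ok
    return {
        "sent": attempts,
        "ok": ok,
        "failed": failed,
        "loss_percent": 100.0 * failed / attempts,
    }
```

Running `probe_loop(...)` on both instances during the experiment should show failures only on the instance in the impaired subnet, mirroring the ping results described above.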

Conclusion

By simulating real-world failures using AWS Fault Injection Service (FIS) with an S3 bucket, we gain valuable insights into how our applications behave under stress and disruption. This proactive approach allows us to identify weaknesses, improve recovery strategies, and ensure a seamless experience for end users.