How do you traverse the directories and files of an S3 bucket with a Boto3 Python script?

阿华AIGC实验室 (Ahua AIGC Lab)

2026-5-21

How to Traverse "Folders" and Access Files in S3 with Boto3

I totally get your frustration: S3's "folder" structure can feel counterintuitive at first because it doesn’t work like a traditional filesystem. Let me break this down with a practical script tailored to your SNS delivery report use case.

First, a critical point to wrap your head around: S3 doesn’t have actual folders. What looks like a folder is just a prefix in an object’s key. For example, if you have a report at sns-delivery-reports/2024-05-20/daily-summary.csv, the sns-delivery-reports/2024-05-20/ part is the prefix that mimics a folder path.
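To make the prefix idea concrete, here is a minimal sketch in plain Python (no AWS call involved) that splits the example key above into its "folder" prefix and its file name. The helper name split_key is made up for illustration and is not part of Boto3.

```python
def split_key(key):
    """Split an S3 object key into (prefix, file_name) at the last "/"."""
    prefix, slash, file_name = key.rpartition("/")
    # rpartition returns ("", "", key) when the key contains no "/",
    # so an object at the bucket root gets an empty prefix.
    return prefix + slash, file_name


key = "sns-delivery-reports/2024-05-20/daily-summary.csv"
prefix, file_name = split_key(key)
print(prefix)     # sns-delivery-reports/2024-05-20/
print(file_name)  # daily-summary.csv
```

Nothing here is stored separately by S3: the prefix and the file name are just two views of the same key string.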

Here’s a complete script that will:

Automatically find all your date-based "daily folders"
Fetch every SNS report file inside those folders
Let you either download files locally or process their content directly

import os

import boto3


def process_sns_delivery_reports(bucket_name, parent_prefix="sns-delivery-reports/"):
    # Initialize S3 client (works just as well with boto3.resource if you prefer that syntax)
    s3_client = boto3.client('s3')
    # One paginator object can be reused for every list_objects_v2 listing below
    paginator = s3_client.get_paginator('list_objects_v2')

    # First, get all date-based "folders" (prefixes) directly under your parent path
    folder_pages = paginator.paginate(
        Bucket=bucket_name,
        Prefix=parent_prefix,
        Delimiter='/'  # This tells S3 to group objects by the "/" delimiter, mimicking folders
    )

    # Iterate through each date folder
    for page in folder_pages:
        for prefix in page.get('CommonPrefixes', []):
            folder_path = prefix['Prefix']
            print(f"Processing folder: {folder_path}")

            # Now list all files within this date folder.
            # Leaving Delimiter out returns every object under the prefix, nested ones included.
            file_pages = paginator.paginate(Bucket=bucket_name, Prefix=folder_path)

            # Process each individual report file
            for file_page in file_pages:
                for obj in file_page.get('Contents', []):
                    file_key = obj['Key']
                    if file_key.endswith('/'):
                        continue  # Skip zero-byte "folder" placeholder objects
                    file_size = obj['Size']
                    last_modified = obj['LastModified']

                    print(f"  Found report: {file_key} (Size: {file_size} bytes, Updated: {last_modified})")

                    # Option 1: Download the file to your local machine.
                    # The key's path is mirrored locally so that same-named files from
                    # different date folders do not overwrite each other.
                    local_save_path = os.path.join("sns-reports", *file_key.split('/'))
                    os.makedirs(os.path.dirname(local_save_path), exist_ok=True)
                    s3_client.download_file(bucket_name, file_key, local_save_path)
                    print(f"    Saved to: {local_save_path}")

                    # Option 2: Read file content directly (no download needed)
                    # response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
                    # report_content = response['Body'].read().decode('utf-8')
                    # print(f"    Preview: {report_content[:150]}...")

if __name__ == "__main__":
    # Replace these with your actual bucket details
    YOUR_BUCKET_NAME = "your-s3-bucket-name"
    YOUR_PARENT_FOLDER_PREFIX = "sns-delivery-reports/"
    
    process_sns_delivery_reports(YOUR_BUCKET_NAME, YOUR_PARENT_FOLDER_PREFIX)

Key Details to Know:

Delimiter='/': This is the trick to mimicking folder traversal. When set, S3 returns CommonPrefixes, the list of "folders" directly under your target path, alongside any objects that sit at that level in Contents.
Paginators: Used to handle large numbers of files/folders. S3 returns at most 1,000 keys per list_objects_v2 call, and paginators automatically request every following page so you don’t miss any data.
Flexible Processing: Pick between downloading files for later review or reading content directly to parse metrics right away (just comment/uncomment the relevant section).
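If the Delimiter behaviour still feels abstract, the following is a rough local imitation of it in plain Python. It is only a sketch of the grouping rule, not how S3 is implemented, and group_by_delimiter plus the sample keys are invented for illustration.

```python
def group_by_delimiter(keys, prefix="", delimiter="/"):
    """Mimic how a delimited listing splits keys into objects and "folders"."""
    contents = []
    common_prefixes = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue  # Outside the requested prefix, so not listed at all
        rest = key[len(prefix):]
        head, found, _ = rest.partition(delimiter)
        if found:
            # Everything up to and including the first delimiter after the
            # prefix is rolled up into a single "folder" entry.
            folder = prefix + head + delimiter
            if folder not in common_prefixes:
                common_prefixes.append(folder)
        else:
            # No delimiter after the prefix: the key is listed as an object.
            contents.append(key)
    return contents, common_prefixes


sample_keys = [
    "sns-delivery-reports/2024-05-20/daily-summary.csv",
    "sns-delivery-reports/2024-05-20/errors.csv",
    "sns-delivery-reports/2024-05-21/daily-summary.csv",
    "sns-delivery-reports/README.txt",
    "other/file.txt",
]
contents, folders = group_by_delimiter(sample_keys, prefix="sns-delivery-reports/")
print(contents)  # ['sns-delivery-reports/README.txt']
print(folders)   # ['sns-delivery-reports/2024-05-20/', 'sns-delivery-reports/2024-05-21/']
```

The two date "folders" each appear once no matter how many files they hold, which is why the script makes a second listing call per folder to reach the files themselves.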

To make this even more powerful, you could:

Add date filtering to only process folders from the last 7 days (use datetime to compare the prefix date with today)
Parse CSV/JSON report content to extract delivery success rates or error details
Schedule the script to run daily using cron (Linux) or Task Scheduler (Windows)
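As a starting point for the date-filtering idea, here is a small sketch that decides whether a folder prefix falls within the last N days. It assumes the date folders are named YYYY-MM-DD, as in the example path earlier; if your bucket uses another layout, the format string needs to change. The function name is_recent_folder is hypothetical.

```python
from datetime import date, datetime, timedelta


def is_recent_folder(folder_prefix, days=7, today=None):
    """Return True if the prefix's last path segment is a date within `days` days."""
    today = today or date.today()
    # "sns-delivery-reports/2024-05-20/" -> "2024-05-20"
    segment = folder_prefix.rstrip("/").rsplit("/", 1)[-1]
    try:
        folder_date = datetime.strptime(segment, "%Y-%m-%d").date()
    except ValueError:
        return False  # Not a date-shaped folder name (assumed format), so skip it
    cutoff = today - timedelta(days=days)
    # Today counts as day 1, so days=7 covers today and the six days before it.
    return cutoff < folder_date <= today


reference_day = date(2024, 5, 21)
print(is_recent_folder("sns-delivery-reports/2024-05-20/", today=reference_day))  # True
print(is_recent_folder("sns-delivery-reports/2024-05-01/", today=reference_day))  # False
```

In the script, a check like this could sit right after folder_path is read from CommonPrefixes, skipping the folder with continue when it returns False.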

Let me know if you need help tweaking this to match your exact bucket structure or report format!

The question comes from Stack Exchange; it was asked by Monty.