Help requested: bulk-downloading the images in a Wikimedia Commons category and adding metadata descriptions

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-20

Hey there! I’ve scraped Wikimedia Commons for personal projects before, so here’s a beginner-friendly workflow for you. We’ll use Python (free, easy to set up) and two small libraries to download every original image in your target category, fetch each file’s description from the Commons API, and write that description into the image’s EXIF metadata. Let’s dive in!

Step 1: Set Up Your Tools

First, make sure you have Python installed (grab it from the official site if you don’t; on Windows, remember to check the "Add Python to PATH" box during setup!). Then open your terminal/command prompt and install these required libraries:

pip install requests piexif

requests: Handles fetching data from Wikimedia’s API (way more reliable than scraping HTML)
piexif: Reads and writes EXIF metadata in JPEG files, no fancy image editing skills needed

Step 2: The Python Script (Customize This!)

Copy this code into a new file named commons_downloader.py. I’ve added comments to explain each part, so you can tweak it for your specific category.

import requests
import os
import piexif
import time  # Optional, for rate limiting

# --------------------------
# Customize these variables!
# --------------------------
TARGET_CATEGORY = "Air Ministry Second World War Official Collection"
DOWNLOAD_FOLDER = "wikimedia_air_ministry_images"  # Folder where images will save

# Create download folder if it doesn't exist
os.makedirs(DOWNLOAD_FOLDER, exist_ok=True)

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"
# Wikimedia asks API clients to send a descriptive User-Agent with contact details.
# This value is a placeholder: swap in your own script name and email address.
HEADERS = {"User-Agent": "commons_downloader/0.1 (your-email@example.com)"}

def get_all_files_in_category(category_name):
    """Fetch every file title in the target Wikimedia Commons category"""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category_name}",
        "cmtype": "file",
        "cmlimit": "max",
        "format": "json"
    }
    titles = []
    while True:
        response = requests.get(API_ENDPOINT, params=params, headers=HEADERS)
        response.raise_for_status()
        data = response.json()
        # Extract just the file titles (e.g., "File:XYZ.jpg")
        titles.extend(item["title"] for item in data["query"]["categorymembers"])
        # Each request returns one page of results (up to 500 titles);
        # keep following the "continue" token until the API stops sending one
        if "continue" not in data:
            break
        params.update(data["continue"])
    return titles

def get_file_details(file_title):
    """Grab the original image URL and official description for a single file"""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "url|extmetadata",
        "titles": file_title,
        "format": "json"
    }
    response = requests.get(API_ENDPOINT, params=params, headers=HEADERS)
    response.raise_for_status()
    data = response.json()
    page_data = next(iter(data["query"]["pages"].values()))
    image_info = page_data["imageinfo"][0]
    
    # Get the original high-res image URL
    original_image_url = image_info["url"]
    
    # Pull the description (fallback to file name if no description exists)
    # The extmetadata key is "ImageDescription"; its value may contain HTML markup
    description = image_info.get("extmetadata", {}).get("ImageDescription", {}).get("value", file_title.replace("File:", ""))
    
    return original_image_url, description

def download_image_and_add_metadata(image_url, description, save_path):
    """Download the image and inject the description into its EXIF metadata"""
    # Download the image in chunks (better for large files)
    response = requests.get(image_url, stream=True, headers=HEADERS)
    response.raise_for_status()
    with open(save_path, "wb") as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
    
    # Load existing EXIF data (piexif returns empty sections if the file has none)
    # Formats piexif cannot parse (PNG, SVG, ...) raise an error here; the file
    # stays on disk and the main loop reports the failure
    exif_data = piexif.load(save_path)
    
    # Add description to the Windows XPTitle and XPComment fields
    # These tags hold UTF-16LE text with a trailing null and no byte-order mark
    encoded_description = (description + "\x00").encode("utf-16-le")
    exif_data["0th"][piexif.ImageIFD.XPTitle] = encoded_description
    exif_data["0th"][piexif.ImageIFD.XPComment] = encoded_description
    
    # Save the updated metadata back to the image
    exif_bytes = piexif.dump(exif_data)
    piexif.insert(exif_bytes, save_path)

# Main workflow to run everything
if __name__ == "__main__":
    print(f"Fetching files from category: {TARGET_CATEGORY}")
    file_titles = get_all_files_in_category(TARGET_CATEGORY)
    print(f"Found {len(file_titles)} files to process")
    
    for index, file_title in enumerate(file_titles, 1):
        print(f"\nProcessing file {index}/{len(file_titles)}: {file_title}")
        try:
            image_url, description = get_file_details(file_title)
            # Extract the actual file name from the URL
            file_name = image_url.split("/")[-1]
            save_location = os.path.join(DOWNLOAD_FOLDER, file_name)
            
            download_image_and_add_metadata(image_url, description, save_location)
            print(f"Successfully saved: {save_location}")
            
            # Optional: Add a 1-second delay to avoid hitting Wikimedia's rate limits
            time.sleep(1)
        except Exception as error:
            print(f"Failed to process {file_title}: {str(error)}")
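
One thing to watch: the description text that comes back in extmetadata can contain HTML markup (paragraph tags, links, entities), which looks messy in a photo viewer. Here is a minimal sketch of a cleanup step, assuming a simple regex pass is good enough for your category. clean_description is a hypothetical helper name, not part of the script above, and this is not a full HTML parser.

```python
import html
import re

def clean_description(raw_html):
    """Reduce an HTML description string to plain text (rough, regex-based)."""
    # Replace anything that looks like a tag, e.g. <p>, <a href="...">, <br />
    without_tags = re.sub(r"<[^>]+>", " ", raw_html)
    # Turn entities such as &amp; or &quot; back into characters
    unescaped = html.unescape(without_tags)
    # Collapse the runs of whitespace left behind by the removed tags
    return re.sub(r"\s+", " ", unescaped).strip()

# Made-up sample string, only to show the shape of the input
sample = '<p>Royal Air Force: <a href="https://example.org">Bomber Command</a> &amp; crew.</p>'
print(clean_description(sample))
```

If you want to use it, you could call it on the description right before the description is written into the image.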

Step 3: Run the Script

Open your terminal/command prompt
Navigate to the folder where you saved commons_downloader.py
Run this command:

python commons_downloader.py

The script will:

Fetch all files in your target category
For each file, grab the original high-res image URL and its official description
Download the image to your specified folder
Inject the description into the image’s EXIF XPTitle and XPComment fields (Windows File Explorer shows these as "Title" and "Comments"; other photo apps may ignore them)
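
To check that the last step actually worked, you can read the field back from a downloaded JPEG. The sketch below is one way to do that, under a few assumptions: decode_xp_text and read_xp_title are hypothetical helper names, the value is stored as UTF-16LE, and piexif may return it either as bytes or as a tuple of integers. The decoder also tolerates a leading byte-order mark and a trailing null character.

```python
def decode_xp_text(raw):
    """Decode an XP* tag value (UTF-16LE bytes or tuple of ints) to str."""
    text = bytes(raw).decode("utf-16-le", errors="replace")
    # Drop a byte-order mark at the start and null padding at the end, if present
    return text.lstrip("\ufeff").rstrip("\x00")

def read_xp_title(path):
    """Return the XPTitle stored in a JPEG file, or None if the tag is absent."""
    import piexif  # imported here so decode_xp_text works without piexif installed
    exif_data = piexif.load(path)
    raw = exif_data["0th"].get(piexif.ImageIFD.XPTitle)
    return decode_xp_text(raw) if raw is not None else None

# Round trip of the encoding on its own, without touching a file
encoded = ("Lancaster bomber" + "\x00").encode("utf-16-le")
print(decode_xp_text(encoded))
print(decode_xp_text(tuple(encoded)))
```

Calling read_xp_title on one of the files in your download folder should give you back the description text for that image.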

Quick Tips for Beginners

If you get a "rate limit exceeded" error, increase the time.sleep(1) delay in the main loop to slow down requests.
Some images might not have a formal description—this script uses the file name as a fallback.
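The XPTitle and XPComment fields we’re using are Windows-specific tags, so they show up in Windows File Explorer but may be ignored by other viewers. piexif only handles EXIF; if you need other fields (like IPTC or XMP) or image formats it can’t edit, you’ll need a different tool such as ExifTool.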
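
If a fixed delay isn’t enough, a retry with growing delays is a common alternative. This is only a sketch, assuming the server signals rate limiting with HTTP status 429 and possibly a Retry-After header given in seconds. fetch_with_backoff is a hypothetical helper, and the sleep parameter exists only so the function can be tried out without actually waiting.

```python
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a response whose status is not 429.

    fetch is a zero-argument callable returning an object with .status_code
    and .headers (a requests.Response has this shape).
    """
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts, no point in waiting again
        # Prefer the server's Retry-After hint; otherwise double the delay each time
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_attempts} attempts")
```

With requests, a call could look like fetch_with_backoff(lambda: requests.get(url, params=params)), where url and params are whatever you were passing to requests.get before.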

The question this post answers comes from Stack Exchange; it was asked by PJF.