How can I bulk-download the PDFs linked from a website? Can downloading PDFs from the Federal Reserve archive be automated?

阿华AIGC实验室

2026-5-20

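Yes, this kind of bulk download can be automated. I once helped a friend scrape PDFs from the Federal Reserve archive in much the same way, so here are two practical approaches; pick whichever suits your technical background: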

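Approach 1: Extract the PDF download links directly with a Python scraper (the efficient first choice)

This method does not drive a browser. It parses each page's HTML to find the PDF's actual download URL, so it runs faster.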

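Core idea

Fetch the target list page (for example, the https://fraser.stlouisfed.org/title/5170 you provided) and collect the links to every speech detail page
Open each detail page and extract the direct download URL of its PDF
Download the PDFs in bulk and save them to a local folder

Sample code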

import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Configuration
list_page_url = "https://fraser.stlouisfed.org/title/5170"
save_folder = "fed_speeches_pdfs"
os.makedirs(save_folder, exist_ok=True)

# Browser-like request headers, so the server is less likely to reject the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
}

# Collect every speech link on the list page
response = requests.get(list_page_url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
# This CSS selector must be adjusted to the page's actual structure; inspect it with the browser's developer tools
speech_links = soup.select("div.document-title a")

for idx, link in enumerate(speech_links):
    # urljoin handles both relative and absolute href values
    speech_detail_url = urljoin(list_page_url, link["href"])
    speech_title = link.get_text(strip=True)
    # Clean the title so it contains no characters that are illegal in file names
    safe_title = speech_title.replace("/", "_").replace("\\", "_").replace(":", "_").replace("?", "_")

    print(f"[{idx+1}/{len(speech_links)}] Processing: {speech_title}")

    # Open the detail page and extract the PDF download link
    detail_response = requests.get(speech_detail_url, headers=headers, timeout=30)
    detail_soup = BeautifulSoup(detail_response.text, "html.parser")
    pdf_download_tag = detail_soup.select_one("a.download-pdf")
    if not pdf_download_tag:
        print(f"⚠️ No PDF download link found for {speech_title}, skipping")
        continue

    pdf_full_url = urljoin(speech_detail_url, pdf_download_tag["href"])

    # Download the PDF; skip it if the server returns an error status
    pdf_response = requests.get(pdf_full_url, headers=headers, timeout=60)
    if not pdf_response.ok:
        print(f"⚠️ HTTP {pdf_response.status_code} for {pdf_full_url}, skipping")
        continue
    with open(os.path.join(save_folder, f"{safe_title}.pdf"), "wb") as f:
        f.write(pdf_response.content)

    # Pause between requests so the site is not hit too often and the IP does not get blocked
    time.sleep(2)

print("✅ Finished processing all links!")
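The chained replace calls in the script above handle only four characters, and two speeches with the same title would overwrite each other. A minimal sketch of a more defensive naming step follows; safe_filename and unique_path are hypothetical helper names introduced here for illustration, and the character set is an assumption based on what Windows file systems reject.

```python
import re
from pathlib import Path


def safe_filename(title: str, max_length: int = 150) -> str:
    """Replace characters that are not allowed in file names on common systems."""
    # Assumed character set: Windows-reserved punctuation plus control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]+', "_", title)
    cleaned = cleaned[:max_length].strip(" .")
    return cleaned or "untitled"


def unique_path(folder: str, stem: str, suffix: str = ".pdf") -> Path:
    """Return a path in `folder` that does not exist yet, adding _2, _3, ... if needed."""
    candidate = Path(folder) / f"{stem}{suffix}"
    counter = 2
    while candidate.exists():
        candidate = Path(folder) / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

In the download loop, the open() call could then write to unique_path(save_folder, safe_filename(speech_title)) instead of building the path by hand.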

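Key reminders

Inspect the page elements with your browser's developer tools and adjust the CSS selectors in the code (for example, div.document-title a or a.download-pdf) so that they actually match the links
Follow the site's robots.txt rules (see https://fraser.stlouisfed.org/robots.txt) and do not send unthrottled, high-frequency requests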
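The robots.txt check can also be done in code with the standard library's urllib.robotparser. The sketch below parses an inline rule set so that it runs offline; SAMPLE_RULES is made up for illustration and is not the real FRASER robots.txt, and build_parser and is_allowed are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real FRASER robots.txt
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_RULES)
print(is_allowed(parser, "https://example.org/title/5170"))
print(is_allowed(parser, "https://example.org/private/report.pdf"))
print(parser.crawl_delay("*"))
```

Against the live site, you would instead call parser.set_url("https://fraser.stlouisfed.org/robots.txt") followed by parser.read(), then test each URL before requesting it. If crawl_delay returns a number, it is a reasonable value to pass to time.sleep between requests.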

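Approach 2: Selenium browser automation (simulating manual steps)

If the site has anti-scraping measures, or the PDF links are hard to extract directly, driving a real browser with Selenium is the safer option, because it reproduces the manual workflow step by step.

Core idea

Launch the browser and open the list page
Click each speech title automatically to open its detail page
Simulate a click on the PDF download button
Save the file automatically to the chosen folder

Sample code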

import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Configuration
save_folder = "fed_speeches_pdfs"
os.makedirs(save_folder, exist_ok=True)

# Chrome options: set the download folder and download PDFs automatically
chrome_options = webdriver.ChromeOptions()
prefs = {
    "download.default_directory": os.path.abspath(save_folder),
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True  # Skip the PDF preview and download directly
}
chrome_options.add_experimental_option("prefs", prefs)

# Launch the browser
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get("https://fraser.stlouisfed.org/title/5170")

    # Wait for the list to finish loading
    WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.document-title a")))
    speech_links = driver.find_elements(By.CSS_SELECTOR, "div.document-title a")

    for idx in range(len(speech_links)):
        # Re-locate the links, because element references go stale after the page reloads
        speech_links = driver.find_elements(By.CSS_SELECTOR, "div.document-title a")
        if idx >= len(speech_links):
            break
        current_link = speech_links[idx]
        speech_title = current_link.text.strip()

        print(f"[{idx+1}/{len(speech_links)}] Downloading: {speech_title}")

        # Click through to the detail page
        current_link.click()

        # Wait until the download button is clickable, then click it
        try:
            download_btn = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "a.download-pdf"))
            )
            download_btn.click()
            time.sleep(4)  # Fixed wait for the download; large files may need longer
        except Exception as e:
            print(f"❌ Download failed for {speech_title}: {e}")

        # Go back to the list page
        driver.back()
        time.sleep(2)
finally:
    # Close the browser even if an error interrupts the loop
    driver.quit()

print("✅ All download tasks finished!")

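Key reminders

You need to install the Selenium library and a driver for your browser (such as ChromeDriver), and the driver version must match the browser version. Selenium 4.6 and later can fetch a matching driver automatically through Selenium Manager
This approach is slower, but it copes better with complex pages, so it suits cases where the direct scraper fails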
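The fixed time.sleep(4) in the Selenium script is a guess: it wastes time on small files and can cut off large ones. A rough alternative is to poll the download folder. The sketch below assumes Chrome's behavior of writing in-progress downloads with a .crdownload extension; wait_for_new_pdfs is a hypothetical helper name, not part of Selenium.

```python
import time
from pathlib import Path


def wait_for_new_pdfs(folder, before, timeout=60.0, poll_interval=0.5):
    """Wait until `folder` contains finished PDFs that are not listed in `before`.

    Returns the sorted names of the new files, or an empty list on timeout.
    Assumes Chrome marks unfinished downloads with a .crdownload extension.
    """
    folder = Path(folder)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        in_progress = any(folder.glob("*.crdownload"))
        new_files = {p.name for p in folder.glob("*.pdf")} - set(before)
        if new_files and not in_progress:
            return sorted(new_files)
        time.sleep(poll_interval)
    return []
```

To use it, record before = {p.name for p in Path(save_folder).glob("*.pdf")} just before clicking the download button, then call wait_for_new_pdfs(save_folder, before) in place of the fixed sleep. An empty result means the download did not finish within the timeout.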

The question in this post comes from Stack Exchange, where it was asked by user9500674.
