How to Extract Web Articles, Generate Flawless PDFs, and Batch-Archive Them to Google Drive with Python?

阿华AIGC实验室

2026-3-26

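Hey, it sounds like you need an automated tool for batch-archiving research articles. I've built something similar, so I'll walk through the key points of each step and then give you the complete combined script, ready to adapt.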

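I. Breaking Down the Core Requirements

What you're after is really a three-step automated pipeline, plus a throughput requirement:

Extract clean article content from a given URL (filtering out navigation, ads, sidebars, and similar clutter)
Convert the extracted content into a neatly formatted PDF that is comfortable to read
Batch-upload those PDFs to Google Drive so they are available on any device
Handle 10-20 articles a day, so stability and processing efficiency both matter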

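II. Step-by-Step Implementation

1. Extracting the Article Content Precisely

The key here is to grab only the real content and leave the clutter behind. I usually fetch pages with requests and parse them with BeautifulSoup. I've hit plenty of pitfalls along the way, so here are a few practical tips:

First open the browser's developer tools (F12) and find the HTML tag that holds the article: many sites wrap the body in an <article> tag, or give the body div a recognizable class such as article-body or post-content

When parsing, go straight to the target tag instead of walking the whole page, for example: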

import requests
from bs4 import BeautifulSoup

def extract_article_content(url):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # raise on HTTP errors such as 404 or 500
        # Pass raw bytes so BeautifulSoup detects the page encoding itself
        soup = BeautifulSoup(resp.content, 'html.parser')

        # Adjust the selectors for the target site; try the <article> tag first
        article = soup.select_one('article')
        if not article:
            # Fallback: a div with class post-content (common on blog sites)
            article = soup.find('div', class_='post-content')

        if not article:
            print(f"No article body found at {url}, skipping")
            return None

        # Strip clutter: ads, sidebars, scripts, and style tags
        for unwanted in article.select('aside, script, style, .ad-box, .sidebar-wrap'):
            unwanted.decompose()

        # Return the cleaned HTML (image tags are kept; WeasyPrint fetches them at render time)
        return str(article)
    except Exception as e:
        print(f"Failed to extract {url}: {e}")
        return None

Always add a timeout and exception handling so that one stuck page cannot hang the whole script
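One thing the extracted fragment loses is its page context: relative image and link paths no longer have anything to resolve against. Below is a minimal sketch of one way to handle that, assuming the fragment is re-parsed with BeautifulSoup; the helper name absolutize_urls is hypothetical and not part of the original script. Passing base_url to WeasyPrint's HTML(...) constructor is another option.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolutize_urls(article_html, page_url):
    """Rewrite relative img src / a href values against the page URL.

    Hypothetical helper: the name and the choice of tags are assumptions.
    """
    soup = BeautifulSoup(article_html, "html.parser")
    for tag in soup.find_all(["img", "a"]):
        attr = "src" if tag.name == "img" else "href"
        value = tag.get(attr)
        if value:
            # urljoin leaves absolute URLs untouched and resolves relative ones
            tag[attr] = urljoin(page_url, value)
    return str(soup)
```

It could be called on the return value of extract_article_content, with the article URL as page_url, before the HTML is handed to the PDF step.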

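2. Generating a Flawless PDF

WeasyPrint is a good fit here: its HTML/CSS support is strong for a PDF library, so the output stays close to the original page layout. Keep these details in mind:

Wrap the extracted content in a basic HTML skeleton and add custom CSS to tune the layout (margins, line height, fonts)
Handle lazy-loaded images: if the site uses data-src in place of src, swap it back manually, otherwise WeasyPrint cannot load the image
Give body a sensible margin so content does not crowd the page edges

Code example: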

from weasyprint import HTML, CSS
# Current WeasyPrint releases expose FontConfiguration here;
# older releases used weasyprint.fonts instead
from weasyprint.text.fonts import FontConfiguration

def generate_clean_pdf(html_content, output_path):
    if not html_content:
        return False
    try:
        font_config = FontConfiguration()
        # Custom CSS so the PDF layout is comfortable to read
        custom_css = CSS(string='''
            body { 
                font-family: sans-serif; 
                margin: 2cm; 
                line-height: 1.6; 
                color: #333;
            }
            h1, h2, h3 { 
                color: #222; 
                margin-top: 1.5em; 
                border-bottom: 1px solid #eee;
                padding-bottom: 0.3em;
            }
            img { 
                max-width: 100%; 
                height: auto; 
                display: block;
                margin: 1em auto;
            }
            p { margin-bottom: 1em; }
            .article-meta { color: #666; font-size: 0.9em; margin-bottom: 2em; }
        ''', font_config=font_config)

        # Crude fix for lazy-loaded images: rename data-src to src.
        # Caveat: this also rewrites the literal text "data-src" in the article,
        # and a tag that already has a placeholder src ends up with two src attributes.
        html_content = html_content.replace('data-src', 'src')

        # Wrap in a full HTML document so WeasyPrint parses it correctly
        full_html = f'''
        <!DOCTYPE html>
        <html>
        <head>
            <meta charset="UTF-8">
            <title>Archived Article</title>
        </head>
        <body>
            {html_content}
        </body>
        </html>
        '''

        HTML(string=full_html).write_pdf(
            output_path,
            stylesheets=[custom_css],
            font_config=font_config
        )
        return True
    except Exception as e:
        print(f"Failed to generate PDF: {e}")
        return False
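A plain string replace of data-src can touch article text and can leave an image with two src attributes. Here is a sketch of a parser-based alternative, assuming BeautifulSoup is available as in the extraction step; the helper name promote_lazy_images and the attribute list LAZY_ATTRS are assumptions and should be adapted to what the target site actually uses.

```python
from bs4 import BeautifulSoup

# Attribute names commonly used by lazy-loading scripts (an assumed list)
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")


def promote_lazy_images(article_html):
    """Copy the real image URL from a lazy-load attribute into src."""
    soup = BeautifulSoup(article_html, "html.parser")
    for img in soup.find_all("img"):
        for attr in LAZY_ATTRS:
            real_url = img.get(attr)
            if real_url:
                img["src"] = real_url  # overwrite any placeholder src
                del img[attr]          # drop the lazy attribute
                break
    return str(soup)
```

Only img tags are modified, so the words "data-src" appearing in the article body are left alone.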

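3. Connecting to Google Drive for Batch Upload

Some one-time setup comes first (it takes about 5 minutes):

In Google Cloud Console, create a new project and enable the Google Drive API
Create an OAuth 2.0 client ID (application type: Desktop app), download credentials.json, and put it in the same directory as the script
The first run opens a browser for you to sign in and authorize; after that token.json is saved automatically, so you don't have to authorize again

Upload function example: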

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.http import MediaFileUpload
import os

# Scope: drive.file only grants access to files this app creates or opens (least privilege)
SCOPES = ['https://www.googleapis.com/auth/drive.file']

def get_drive_service():
    creds = None
    # Load the saved authorization token, if any
    if os.path.exists('token.json'):
        creds = Credentials.from_authorized_user_file('token.json', SCOPES)
    # Refresh the token or run the authorization flow again
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        # Save the token for next time
        with open('token.json', 'w') as token:
            token.write(creds.to_json())
    return build('drive', 'v3', credentials=creds)

def upload_to_drive(file_path, drive_folder_id=None):
    service = get_drive_service()
    try:
        file_name = os.path.basename(file_path)
        file_metadata = {'name': file_name}
        # If a Drive folder is given, set it as the parent to keep archives organized
        if drive_folder_id:
            file_metadata['parents'] = [drive_folder_id]

        media = MediaFileUpload(file_path, mimetype='application/pdf')
        # Run the upload
        file = service.files().create(
            body=file_metadata,
            media_body=media,
            fields='id'
        ).execute()
        print(f"✅ {file_name} uploaded, Drive file ID: {file.get('id')}")
        return True
    except HttpError as error:
        print(f"❌ {file_name} upload failed: {error}")
        return False

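Remember to install the Google API dependencies (google-api-python-client, google-auth-httplib2, google-auth-oauthlib)
If you want every archived file in the same Drive folder, create the folder in advance and copy its ID (the long string in the folder URL)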

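4. Batch-Processing Tips

At 10-20 articles a day, these small touches improve the experience:

Add a progress bar with tqdm so you can see how far the batch has got
Process in parallel with threads: concurrent.futures.ThreadPoolExecutor can request several pages at once and cut the waiting time
Logging: record the succeeded/failed URLs to a local file so you can check them later
Deduplication: before uploading, you can query the Drive API for an existing file with the same name to avoid archiving twice (this needs extra file-listing logic)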
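The deduplication check mentioned in the last tip could be sketched roughly as follows. The helper names build_name_query and file_already_archived are hypothetical; the service argument is assumed to be a Drive v3 client such as the one returned by get_drive_service. Note that under the drive.file scope the query should only see files this app created or opened, so files uploaded by other means would not be detected.

```python
def build_name_query(file_name, folder_id=None):
    """Build a Drive files.list query for non-trashed files with an exact name."""
    # Drive query strings require backslashes and single quotes to be escaped
    escaped = file_name.replace("\\", "\\\\").replace("'", "\\'")
    clauses = [f"name = '{escaped}'", "trashed = false"]
    if folder_id:
        clauses.append(f"'{folder_id}' in parents")
    return " and ".join(clauses)


def file_already_archived(service, file_name, folder_id=None):
    """Return True if Drive lists at least one matching file."""
    response = service.files().list(
        q=build_name_query(file_name, folder_id),
        fields="files(id, name)",
        pageSize=1,
    ).execute()
    return bool(response.get("files"))
```

A caller would check file_already_archived(service, file_name, drive_folder_id) before uploading and skip the upload when it returns True.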

III. The Complete Combined Script

Put the functions above together and add the batch logic:

import os
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

# ---------------------- Configuration ----------------------
# URLs to process in this batch; read them from a file or maintain them by hand
BATCH_URLS = [
    "https://example.com/research-article-1",
    "https://example.com/research-article-2",
    # more URLs...
]
# Temporary PDF directory; files are deleted after a successful upload
OUTPUT_DIR = "temp_archived_pdfs"
# Optional: Google Drive folder ID; set to None to upload to the root
DRIVE_FOLDER_ID = "your-drive-folder-id"
# Number of worker threads; keep it small to avoid tripping anti-scraping limits
MAX_WORKERS = 5
# ------------------------------------------------------------

def process_single_url(url):
    # 1. Extract the article content
    html_content = extract_article_content(url)
    if not html_content:
        return False
    # 2. Build the PDF file name from the last URL path segment.
    #    Note: two URLs that share a last segment will overwrite each other.
    slug = url.rstrip('/').split('/')[-1].split('?')[0].replace('.html', '')
    safe_filename = (slug or 'article') + '.pdf'
    pdf_path = os.path.join(OUTPUT_DIR, safe_filename)
    # 3. Generate the clean PDF
    if not generate_clean_pdf(html_content, pdf_path):
        return False
    # 4. Upload to Google Drive
    upload_success = upload_to_drive(pdf_path, DRIVE_FOLDER_ID)
    # 5. After a successful upload, delete the local temporary PDF to save space
    if upload_success and os.path.exists(pdf_path):
        os.remove(pdf_path)
    return upload_success

if __name__ == "__main__":
    # Create the temporary directory
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # Authorize once up front so worker threads do not each start the OAuth flow
    get_drive_service()

    print(f"Starting batch of {len(BATCH_URLS)} articles...")
    # Thread pool plus progress bar
    success_count = 0
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(tqdm(
            executor.map(process_single_url, BATCH_URLS),
            total=len(BATCH_URLS),
            desc="Progress"
        ))
        success_count = sum(results)

    print(f"\n🎉 Batch finished! Archived {success_count}/{len(BATCH_URLS)} articles")
    if success_count < len(BATCH_URLS):
        print("⚠️ Some articles failed; see the error messages above")
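Naming each PDF after the last URL segment means two articles with the same final segment overwrite each other. One possible way around this, sketched under the assumption that a short hash suffix is acceptable in file names, is shown below; pdf_name_for_url and max_slug_len are hypothetical names, not part of the original script.

```python
import hashlib
import re
from urllib.parse import urlparse


def pdf_name_for_url(url, max_slug_len=60):
    """Derive a filesystem-safe PDF name that differs for different URLs."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    # Drop a trailing .html/.htm, then replace unsafe characters
    slug = re.sub(r"\.html?$", "", last_segment, flags=re.IGNORECASE)
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", slug).strip("._")[:max_slug_len]
    # Short hash of the full URL (query string included) keeps names distinct
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug or 'article'}-{digest}.pdf"
```

In process_single_url, the result could stand in for safe_filename. The same URL always maps to the same name, which also keeps a name-based duplicate check meaningful.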

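Final Pitfall Reminders

Every site has a different HTML structure, so the selectors in extract_article_content must be adjusted for the target site; this is the one part that needs manual adaptation
Install all dependencies before running: pip install requests beautifulsoup4 weasyprint google-api-python-client google-auth-httplib2 google-auth-oauthlib tqdm
If WeasyPrint reports missing fonts, make sure common fonts are installed on your system (for example SimSun, Arial, Noto Sans)

If you run into site-specific adaptation problems or other small bugs, just give me a shout~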
