
How to Split a PDF by Topic or Heading in Python and Extract Each Chapter into a Separate File

Split PDF into Chapter-Specific Files Using Python

Hey there! Splitting a research paper (or any structured PDF) into individual chapter files based on headings like Introduction, Background, etc., is super doable with Python. Below is a step-by-step implementation that combines text extraction and page manipulation libraries to get the job done smoothly.

Prerequisites

First, install the required libraries—these are the workhorses for this task:

pip install pypdf2 pdfplumber
  • pdfplumber: Excels at accurate text extraction and identifying page metadata (critical for pinpointing where headings live)
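  • PyPDF2: Handles the heavy lifting of splitting pages and writing new PDF files (PyPDF2 is no longer maintained; its successor, pypdf, provides the same PdfReader and PdfWriter classes)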

Step-by-Step Implementation

1. Define Your Heading Patterns

First, figure out how your chapter headings are formatted. For example:

  • Exact matches: "Introduction", "Background"
  • Numbered headings: "1. Introduction", "2. Related Work"

Use regular expressions to match these patterns, and tweak them to fit your PDF's specific style.
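
As a rough illustration of what such patterns could look like, here is a minimal sketch that builds one regex per chapter title and tolerates optional numbering ("1.", "2.1") and different capitalization. The helper names `build_heading_patterns` and `find_headings` and the sample text are made up for this example; they are not part of any library.

```python
import re

def build_heading_patterns(titles):
    """Compile one regex per title, allowing optional section numbering."""
    patterns = []
    for title in titles:
        # Optional numbering such as "1.", "2" or "2.1", then the title alone on its line.
        regex = r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?" + re.escape(title) + r"[ \t]*$"
        # MULTILINE makes ^ and $ match at every line; IGNORECASE accepts "INTRODUCTION".
        patterns.append(re.compile(regex, re.MULTILINE | re.IGNORECASE))
    return patterns

def find_headings(page_text, patterns):
    """Return the heading lines found in one page's text, in pattern order."""
    found = []
    for pattern in patterns:
        match = pattern.search(page_text)
        if match:
            found.append(match.group().strip())
    return found

# Hypothetical page text, standing in for what a text extractor might return.
sample_page = (
    "Some Paper Title\n"
    "1. Introduction\n"
    "PDF files are everywhere.\n"
    "2.1 RELATED WORK\n"
    "Earlier work on this topic ..."
)
patterns = build_heading_patterns(["Introduction", "Related Work", "Conclusion"])
print(find_headings(sample_page, patterns))
# → ['1. Introduction', '2.1 RELATED WORK']
```

Requiring the title to sit alone on its line keeps ordinary sentences that merely mention "introduction" from being mistaken for headings.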

2. Extract Heading Positions & Page Numbers

This script scans the PDF page by page, finds the matching headings, and records the page each one starts on (note: PyPDF2 uses 0-indexed pages, so the 1-based page counter is converted before it is stored):

import pdfplumber
import re
from PyPDF2 import PdfReader, PdfWriter
import os

def extract_chapter_boundaries(pdf_path, heading_patterns):
    """Extract start pages for each chapter based on given heading patterns"""
    chapter_boundaries = []
    
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if not text:
                continue  # Skip pages with no extractable text (e.g., scanned images)
            
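            # Check each heading pattern against the page's text
            for pattern in heading_patterns:
                # re.MULTILINE lets ^ and $ match at every line of the page,
                # not only at the very start and end of the page text
                match = re.search(pattern, text, re.MULTILINE)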
                if match:
                    chapter_boundaries.append({
                        "chapter_title": match.group().strip(),
                        "start_page": page_num - 1  # Convert to 0-index for PyPDF2
                    })
    
        # Add the end of the document as the final boundary
        # (done inside the `with` block, while the PDF is still open)
        total_pages = len(pdf.pages)
        chapter_boundaries.append({"chapter_title": "END", "start_page": total_pages})
    
    return chapter_boundaries
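
Before writing any files, it can help to turn the heading hits into explicit page ranges and look them over. The following is a minimal sketch of that idea, not part of the script above: `boundaries_to_ranges` is a hypothetical helper that assumes a list of hits in the same dictionary shape, without the "END" entry, plus the total page count.

```python
def boundaries_to_ranges(boundaries, total_pages):
    """Turn heading hits into (title, start_page, end_page) tuples.

    Pages are 0-indexed and end_page is exclusive, like range().
    """
    # Sort by page so that each chapter ends where the next one starts.
    ordered = sorted(boundaries, key=lambda b: b["start_page"])
    ranges = []
    for i, current in enumerate(ordered):
        start = current["start_page"]
        if i + 1 < len(ordered):
            end = ordered[i + 1]["start_page"]
        else:
            end = total_pages  # the last chapter runs to the end of the document
        # Two headings on the same page would give an empty range;
        # keep at least the page the heading sits on.
        end = max(end, start + 1)
        ranges.append((current["chapter_title"], start, end))
    return ranges

# Hypothetical hits: two headings share page 0, the third starts on page 4.
hits = [
    {"chapter_title": "1. Introduction", "start_page": 0},
    {"chapter_title": "2. Background", "start_page": 0},
    {"chapter_title": "3. Methodology", "start_page": 4},
]
for title, start, end in boundaries_to_ranges(hits, total_pages=10):
    print(f"{title}: pages {start}..{end - 1}")
```

Printing the ranges first makes it easy to spot a heading that matched on the wrong page before any chapter file is written.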

3. Split the PDF into Chapter Files

With the boundaries mapped, we can now slice the original PDF into individual chapter files:

def split_pdf_by_chapters(input_pdf_path, output_dir, heading_patterns):
    """Split the input PDF into separate files for each chapter"""
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Get chapter start/end boundaries
    boundaries = extract_chapter_boundaries(input_pdf_path, heading_patterns)
    
    # Load the original PDF
    reader = PdfReader(input_pdf_path)
    
    # Loop through each chapter pair to split pages
    for i in range(len(boundaries) - 1):
        current_chapter = boundaries[i]
        next_chapter = boundaries[i + 1]
        
        chapter_title = current_chapter["chapter_title"]
        start_page = current_chapter["start_page"]
        end_page = next_chapter["start_page"]
        
        # Initialize a PDF writer for the current chapter
        writer = PdfWriter()
        
        # Add the relevant pages to the writer
        for page_num in range(start_page, end_page):
            writer.add_page(reader.pages[page_num])
        
        # Clean up the title to make a valid filename
        safe_title = re.sub(r'[^\w\s-]', '', chapter_title).replace(' ', '_')
        output_path = os.path.join(output_dir, f"{safe_title}.pdf")
        
        # Save the chapter PDF
        with open(output_path, "wb") as out_file:
            writer.write(out_file)
        
        print(f"Saved chapter: {chapter_title} -> {output_path}")

# Example usage (customize these values for your PDF)
if __name__ == "__main__":
    INPUT_PDF = "your_research_paper.pdf"
    OUTPUT_DIR = "split_chapters"
    # Adjust regex patterns to match your PDF's heading format
    HEADING_PATTERNS = [
        r"^\d+\.\sIntroduction$",
        r"^\d+\.\sBackground$",
        r"^\d+\.\sRelated Work$",
        r"^\d+\.\sMethodology$",
        r"^\d+\.\sExperiments$",
        r"^\d+\.\sConclusion$"
    ]
    
    split_pdf_by_chapters(INPUT_PDF, OUTPUT_DIR, HEADING_PATTERNS)
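
One thing to watch in the script above: if two headings clean up to the same filename, the later chapter overwrites the earlier one. A possible way around this, sketched here under the assumption that a running index in the filename is acceptable, is to prefix each name with the chapter's position. `make_output_names` is a hypothetical helper, not part of PyPDF2 or pdfplumber.

```python
import re

def make_output_names(chapter_titles):
    """Build one unique, filesystem-friendly PDF filename per chapter title."""
    names = []
    for index, title in enumerate(chapter_titles, start=1):
        # Same cleanup idea as in the script: drop punctuation, then join words with "_".
        safe = re.sub(r"[^\w\s-]", "", title).strip()
        safe = re.sub(r"\s+", "_", safe) or "untitled"
        # The zero-padded index keeps names unique and sorts them in reading order.
        names.append(f"{index:02d}_{safe}.pdf")
    return names

print(make_output_names(["1. Introduction", "2. Related Work", "2. Related Work", "???"]))
# → ['01_1_Introduction.pdf', '02_2_Related_Work.pdf', '03_2_Related_Work.pdf', '04_untitled.pdf']
```

The index prefix also keeps the files in chapter order when a folder is sorted by name.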

Key Notes & Troubleshooting

  • Scanned PDFs: If your PDF is a scanned image, pdfplumber can’t extract text. You’ll need to add OCR first using pytesseract and PIL to convert images to machine-readable text.
  • Heading Variations: Tweak the regex patterns to match your PDF’s exact formatting (e.g., uppercase headings, bold text, different numbering styles like 1.1 Introduction).
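  • Page Alignment: Double-check each chapter's start and end pages. The script works with physical page positions in the file (0-indexed), not the page numbers printed on the pages, so a cover or title page does not shift anything. It does split on whole pages, though: a page where one chapter ends and the next begins goes entirely to the next chapter, and pages before the first matched heading are not written to any file.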

This question originally comes from Stack Exchange; it was asked by Ahmed Zedan.
