How to split a PDF by topic or heading with Python and extract each chapter as a separate file
Hey there! Splitting a research paper (or any structured PDF) into individual chapter files based on headings like Introduction, Background, etc. is very doable with Python. Below is a step-by-step implementation that uses pdfplumber for text extraction and PyPDF2 for page manipulation.
Prerequisites
First, install the required libraries—these are the workhorses for this task:
    pip install pypdf2 pdfplumber
- pdfplumber: Excels at accurate text extraction and identifying page metadata (critical for pinpointing where headings live)
- PyPDF2: Handles the heavy lifting of splitting pages and writing new PDF files
Step-by-Step Implementation
1. Define Your Heading Patterns
First, figure out how your chapter headings are formatted. For example:
- Exact matches: "Introduction", "Background"
- Numbered headings: "1. Introduction", "2. Related Work"

Use regular expressions to match these patterns, and tweak them to fit your PDF's specific style.
2. Extract Heading Positions & Page Numbers
This script scans the PDF, finds all matching headings, and records their starting page numbers (note: PyPDF2 uses 0-indexed pages, so we adjust for that):
    import os
    import re

    import pdfplumber
    from PyPDF2 import PdfReader, PdfWriter


    def extract_chapter_boundaries(pdf_path, heading_patterns):
        """Extract start pages for each chapter based on given heading patterns"""
        chapter_boundaries = []

        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                text = page.extract_text()
                if not text:
                    continue  # Skip pages with no extractable text (e.g., scanned images)

                # Check each heading pattern against the page's text.
                # re.MULTILINE makes ^ and $ match at line breaks; without it,
                # anchored patterns only match when the heading is the whole page.
                for pattern in heading_patterns:
                    match = re.search(pattern, text, re.MULTILINE)
                    if match:
                        chapter_boundaries.append({
                            "chapter_title": match.group().strip(),
                            "start_page": page_num - 1  # Convert to 0-index for PyPDF2
                        })

            # Add the end of the document as the final boundary
            total_pages = len(pdf.pages)

        chapter_boundaries.append({"chapter_title": "END", "start_page": total_pages})
        return chapter_boundaries
3. Split the PDF into Chapter Files
With the boundaries mapped, we can now slice the original PDF into individual chapter files:
    def split_pdf_by_chapters(input_pdf_path, output_dir, heading_patterns):
        """Split the input PDF into separate files for each chapter"""
        # Create output directory if it doesn't exist
        os.makedirs(output_dir, exist_ok=True)

        # Get chapter start/end boundaries
        boundaries = extract_chapter_boundaries(input_pdf_path, heading_patterns)

        # Load the original PDF
        reader = PdfReader(input_pdf_path)

        # Loop through each chapter pair to split pages
        for i in range(len(boundaries) - 1):
            current_chapter = boundaries[i]
            next_chapter = boundaries[i + 1]

            chapter_title = current_chapter["chapter_title"]
            start_page = current_chapter["start_page"]
            # If the next heading sits on the same page, still keep that page
            # so the output file is never empty
            end_page = max(next_chapter["start_page"], start_page + 1)

            # Initialize a PDF writer for the current chapter
            writer = PdfWriter()

            # Add the relevant pages to the writer
            for page_num in range(start_page, end_page):
                writer.add_page(reader.pages[page_num])

            # Clean up the title to make a valid filename
            safe_title = re.sub(r'[^\w\s-]', '', chapter_title).replace(' ', '_')
            output_path = os.path.join(output_dir, f"{safe_title}.pdf")

            # Save the chapter PDF
            with open(output_path, "wb") as out_file:
                writer.write(out_file)

            print(f"Saved chapter: {chapter_title} -> {output_path}")


    # Example usage (customize these values for your PDF)
    if __name__ == "__main__":
        INPUT_PDF = "your_research_paper.pdf"
        OUTPUT_DIR = "split_chapters"

        # Adjust regex patterns to match your PDF's heading format
        HEADING_PATTERNS = [
            r"^\d+\.\sIntroduction$",
            r"^\d+\.\sBackground$",
            r"^\d+\.\sRelated Work$",
            r"^\d+\.\sMethodology$",
            r"^\d+\.\sExperiments$",
            r"^\d+\.\sConclusion$"
        ]

        split_pdf_by_chapters(INPUT_PDF, OUTPUT_DIR, HEADING_PATTERNS)
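One thing to watch in the splitting loop: if a heading is matched on more than one page (for example in a table of contents), or two headings sanitize to the same name, the later file overwrites the earlier one. Here is a minimal sketch of one way around that. The helper `unique_filename` is hypothetical and not part of the script above; it reuses the same cleanup regex and appends a counter on collisions.

```python
import re


def unique_filename(title, used_names):
    """Turn a heading into a safe, non-colliding PDF filename.

    used_names is a set the caller keeps across all chapters.
    """
    # Same cleanup as the script: drop punctuation, turn spaces into underscores.
    base = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_") or "chapter"
    name = base
    counter = 2
    while name in used_names:
        name = f"{base}_{counter}"
        counter += 1
    used_names.add(name)
    return f"{name}.pdf"


used = set()
print(unique_filename("1. Introduction", used))
print(unique_filename("1. Introduction", used))
```

To use it, create one `set()` before the loop and build `output_path` from the helper's return value instead of `safe_title`.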
Key Notes & Troubleshooting
- Scanned PDFs: If your PDF is a scanned image, pdfplumber can't extract text. You'll need to add OCR first, using pytesseract and PIL to convert images to machine-readable text.
- Heading Variations: Tweak the regex patterns to match your PDF's exact formatting (e.g., uppercase headings, bold text, different numbering styles like 1.1 Introduction).
- Page Alignment: Double-check start/end pages. Some PDFs have separate cover/title pages that might shift the page numbering, so adjust the 0-index offset if needed.
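For the page-alignment check, it can help to print the planned page ranges before writing any files. The sketch below assumes the list-of-dicts shape returned by `extract_chapter_boundaries` (0-indexed `start_page`, with a final `END` entry). The helper name `preview_ranges` and the example data are hypothetical.

```python
def preview_ranges(boundaries):
    """Return (title, first_page, last_page) tuples, 1-indexed and inclusive."""
    ranges = []
    for current, nxt in zip(boundaries, boundaries[1:]):
        start = current["start_page"]
        # Guard against two headings on the same page giving an empty range.
        end = max(nxt["start_page"], start + 1)
        # start is 0-indexed and end is exclusive, so the 1-indexed
        # inclusive range is (start + 1) through end.
        ranges.append((current["chapter_title"], start + 1, end))
    return ranges


example = [
    {"chapter_title": "1. Introduction", "start_page": 0},
    {"chapter_title": "2. Background", "start_page": 2},
    {"chapter_title": "END", "start_page": 5},
]
for title, first, last in preview_ranges(example):
    print(f"{title}: pages {first}-{last}")
```

Comparing these numbers against the page numbers shown in a PDF viewer is a quick way to spot an offset caused by a cover page or front matter.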
The question comes from Stack Exchange; asked by Ahmed Zedan.




