How can I extract only the paragraphs from a PDF using Python or Java?

阿华AIGC实验室

2026-5-14

Extracting Paragraphs from PDFs with PyPDF2 and PDFBox

If you can already pull all the text out with PyPDF2 and PDFBox, the remaining problem is isolating the paragraphs. PDFs don't natively store "paragraphs" as a distinct element, so we need to work with text layout and spacing clues to group lines into logical paragraphs. Here's how to do it with both tools, plus some practical tweaks:

Handling Paragraph Extraction with PyPDF2

PyPDF2 doesn't have built-in paragraph detection, but we can use line breaks and spacing to group text into paragraphs after extraction. Note that PyPDF2's own extract_text() has no layout-preserving mode; that mode lives in pypdf, the maintained successor to PyPDF2, so the example imports from pypdf:

First, pass extraction_mode="layout" to extract_text() to preserve the original line structure (this keeps more spacing than the default extraction).
Then, split the text into lines and group consecutive non-empty lines; blank lines are usually the marker for paragraph breaks.

Example code snippet:

# Requires a recent pypdf (pip install pypdf); PyPDF2 itself has no layout mode
from pypdf import PdfReader

reader = PdfReader("your_document.pdf")
paragraphs = []
current_paragraph = []

for page in reader.pages:
    # Preserve original line breaks and vertical spacing with layout mode
    page_text = page.extract_text(extraction_mode="layout")
    lines = page_text.split("\n")
    
    for line in lines:
        stripped_line = line.strip()
        if stripped_line:
            # Add non-empty line to current paragraph
            current_paragraph.append(stripped_line)
        else:
            # Blank line = end of paragraph
            if current_paragraph:
                paragraphs.append(" ".join(current_paragraph))
                current_paragraph = []
    # Catch the last paragraph on the page
    if current_paragraph:
        paragraphs.append(" ".join(current_paragraph))
        current_paragraph = []

# Print or process your paragraphs
for para in paragraphs:
    print(para)

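Pro tip: If your PDF uses first-line indentation instead of blank lines to separate paragraphs, check each line's leading whitespace before stripping it. A line indented further than its neighbors starts a new paragraph, so close the current paragraph at that point instead of merging the line into it.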
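One way to handle indentation-separated paragraphs is sketched below. It assumes the lines come from a layout-preserving extraction (so leading spaces survive) and that an indented first line opens a new paragraph. The function name group_by_indentation and the min_indent parameter are made up for this illustration, not part of any library:

```python
def _indent(line):
    """Number of leading spaces, with tabs expanded to 4 columns."""
    expanded = line.expandtabs(4)
    return len(expanded) - len(expanded.lstrip())


def group_by_indentation(lines, min_indent=2):
    """Group raw (unstripped) lines into paragraphs.

    Indentation is measured relative to the least-indented line, so a
    uniform left margin does not count. A line indented by at least
    `min_indent` extra columns starts a new paragraph; blank lines also
    end the current paragraph.
    """
    non_empty = [line for line in lines if line.strip()]
    if not non_empty:
        return []
    base = min(_indent(line) for line in non_empty)

    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            # Blank line: close whatever paragraph is open
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if _indent(line) - base >= min_indent and current:
            # Indented line: close the previous paragraph first
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    "      First paragraph starts here",
    "  and continues on this line.",
    "      Second paragraph starts here",
    "  and ends here.",
]
for para in group_by_indentation(sample):
    print(para)
```

Feed it page_text.split("\n") without stripping the lines first, otherwise the indentation is lost before it can be measured.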

Handling Paragraph Extraction with PDFBox

PDFBox gives you more control over text position metadata, which makes paragraph detection more precise:

Use a custom PDFTextStripper to track the vertical position of each text line. A larger gap between consecutive lines usually signals a new paragraph.

Example Java code snippet:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ParagraphExtractor {
    public static void main(String[] args) throws IOException {
        // Declared outside the stripper so main() can read the results afterwards
        final List<String> paragraphs = new ArrayList<>();

        // PDFBox 2.x API; in PDFBox 3.x use Loader.loadPDF(new File(...)) instead
        try (PDDocument document = PDDocument.load(new File("your_document.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                private final StringBuilder currentPara = new StringBuilder();
                private float lastLineY = -1;
                // Adjust this threshold based on your PDF's font size/spacing
                private final float PARAGRAPH_SPACING_THRESHOLD = 15;

                @Override
                protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                    String trimmedText = text.trim();
                    if (!trimmedText.isEmpty()) {
                        // getY() is measured from the top of the page, so it grows as we move down
                        float currentLineY = textPositions.get(0).getY();
                        // Check if the gap to the previous line is large enough for a new paragraph
                        if (lastLineY != -1 && (currentLineY - lastLineY) > PARAGRAPH_SPACING_THRESHOLD) {
                            if (currentPara.length() > 0) {
                                paragraphs.add(currentPara.toString().trim());
                                currentPara.setLength(0);
                            }
                        }
                        currentPara.append(trimmedText).append(" ");
                        lastLineY = currentLineY;
                    }
                }

                @Override
                protected void endPage(PDPage page) throws IOException {
                    // Add the last paragraph of the page
                    if (currentPara.length() > 0) {
                        paragraphs.add(currentPara.toString().trim());
                        currentPara.setLength(0);
                    }
                    lastLineY = -1;
                }
            };

            stripper.setSortByPosition(true);
            // The returned string is ignored; paragraphs are collected in the overrides above
            stripper.getText(document);
        }

        // Output the extracted paragraphs
        for (String para : paragraphs) {
            System.out.println(para);
        }
    }
}

Pro tip: Tweak the PARAGRAPH_SPACING_THRESHOLD value. It has to sit above the normal distance between consecutive lines inside a paragraph and below the distance across a paragraph break, so larger fonts need a bigger threshold.
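Instead of hand-tuning a fixed number, the threshold can be derived from the document itself. The sketch below is language-agnostic logic written in Python; it assumes you have already collected one (y, text) pair per line, with y increasing down the page, and it treats any gap larger than 1.5 times the median line gap as a paragraph break. The function names and the 1.5 factor are assumptions for illustration:

```python
from statistics import median


def paragraph_gap_threshold(line_ys, factor=1.5):
    """Return factor * median gap between consecutive lines, or None."""
    gaps = [b - a for a, b in zip(line_ys, line_ys[1:]) if b > a]
    if not gaps:
        return None
    return factor * median(gaps)


def split_on_gaps(lines, factor=1.5):
    """Split (y, text) pairs, ordered top to bottom, into paragraphs."""
    if not lines:
        return []
    threshold = paragraph_gap_threshold([y for y, _ in lines], factor)
    paragraphs = [[lines[0][1]]]
    for (prev_y, _), (y, text) in zip(lines, lines[1:]):
        if threshold is not None and (y - prev_y) > threshold:
            # Gap is clearly larger than normal line spacing: new paragraph
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


# Lines spaced 14 units apart, with one 28-unit gap before "d"
page_lines = [(100, "a"), (114, "b"), (128, "c"), (156, "d"), (170, "e")]
print(split_on_gaps(page_lines))
```

The median is used because most gaps on a page are ordinary line spacing, so a few large paragraph gaps do not skew it. A page made mostly of single-line paragraphs would break that assumption.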

General Tips for All PDFs

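For scanned (image-based) PDFs: Neither tool can extract text directly, because the pages contain only images. Run OCR first (e.g., render each page to an image and pass it to Tesseract, through a wrapper such as pytesseract in Python or Tess4J in Java), then apply the same paragraph grouping logic to the OCR output.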
Test with your specific document: Every PDF has unique formatting, so you might need to adjust spacing thresholds or add logic for indentation, bullet points, or other structural cues.
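As an example of one such structural cue, here is a minimal sketch that starts a new block whenever a line looks like a bullet or numbered list item. The regular expression and the function names are assumptions chosen for this illustration; real documents may use other bullet glyphs or numbering styles:

```python
import re

# Matches lines such as "- item", "* item", "• item", "1. item", "2) item"
BULLET_RE = re.compile(r"^\s*(?:[-*\u2022\u25E6\u25AA]|\d+[.)])\s+")


def is_list_item(line):
    """True if the line begins with a bullet glyph or a number marker."""
    return BULLET_RE.match(line) is not None


def group_with_list_items(lines):
    """Group lines into blocks, treating each list item as its own block."""
    blocks, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            # Blank line closes the current block
            if current:
                blocks.append(" ".join(current))
                current = []
        elif is_list_item(line):
            # A list marker closes the previous block and opens a new one
            if current:
                blocks.append(" ".join(current))
            current = [text]
        else:
            current.append(text)
    if current:
        blocks.append(" ".join(current))
    return blocks


lines = [
    "Intro text",
    "continues.",
    "\u2022 first item",
    "wraps here",
    "\u2022 second item",
    "",
    "Closing.",
]
for block in group_with_list_items(lines):
    print(block)
```

Wrapped continuation lines stay attached to their list item, while the item itself is kept separate from the paragraph before it.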

The question in this post comes from Stack Exchange; it was asked by Ashok Kuramdasu.