How can I extract only the paragraphs from a PDF using Python or Java?
Extracting Paragraphs from PDFs with PyPDF2 and PDFBox
You can already pull all the text with PyPDF2 and PDFBox; the hard part is getting just the paragraphs. PDFs don't store paragraphs as a distinct element, so you have to rely on layout and spacing cues to group lines into logical paragraphs. Here's how to do it with each tool, plus some practical tweaks:
Handling Paragraph Extraction with PyPDF2
PyPDF2 doesn’t have built-in paragraph detection, but we can use line breaks and spacing to group text into paragraphs after extraction:
- First, pass `extraction_mode="layout"` to `extract_text()` to preserve the original line structure (this keeps more spacing than the default extraction). This mode is available in recent versions of pypdf, the maintained successor to PyPDF2; PyPDF2 itself does not offer it.
- Then, split the text into lines and group consecutive non-empty lines. Blank lines are usually the marker for paragraph breaks.
- Example code snippet:

```python
from pypdf import PdfReader  # layout mode is a pypdf feature, not a PyPDF2 one

reader = PdfReader("your_document.pdf")
paragraphs = []
current_paragraph = []

for page in reader.pages:
    # Layout mode preserves line breaks and vertical spacing
    page_text = page.extract_text(extraction_mode="layout") or ""
    for line in page_text.split("\n"):
        stripped_line = line.strip()
        if stripped_line:
            # Add non-empty line to the current paragraph
            current_paragraph.append(stripped_line)
        elif current_paragraph:
            # Blank line = end of paragraph
            paragraphs.append(" ".join(current_paragraph))
            current_paragraph = []
    # Catch the last paragraph on the page
    if current_paragraph:
        paragraphs.append(" ".join(current_paragraph))
        current_paragraph = []

# Print or process your paragraphs
for para in paragraphs:
    print(para)
```

- Pro tip: If your PDF marks paragraphs with a first-line indent instead of blank lines, check each line's leading whitespace before stripping it. A line indented further than the page's left margin starts a new paragraph; lines at the margin continue the current one.
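For documents that mark paragraphs with a first-line indent, the grouping could be sketched roughly as follows. This is only an illustration: `split_by_indent` and its `min_indent` parameter are hypothetical names, and it assumes the extracted lines still carry their leading whitespace (as layout-style extraction tends to produce).

```python
def split_by_indent(lines, min_indent=2):
    """Group lines into paragraphs, starting a new one at each indented line.

    min_indent is an assumed tuning knob: how many extra leading
    characters beyond the common left margin count as an indent.
    """
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return []
    # Treat the smallest indentation on the page as the left margin
    margin = min(len(ln) - len(ln.lstrip()) for ln in non_empty)

    paragraphs, current = [], []
    for ln in non_empty:
        indent = len(ln) - len(ln.lstrip()) - margin
        # An indented line closes the previous paragraph, if there is one
        if indent >= min_indent and current:
            paragraphs.append(" ".join(current))
            current = []
        current.append(ln.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    "    First paragraph starts here",
    "  and continues on this line.",
    "    Second paragraph starts here",
    "  and ends here.",
]
for para in split_by_indent(sample):
    print(para)
```

The margin is estimated per call, so it is probably safest to run this page by page rather than on the whole document at once.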
Handling Paragraph Extraction with PDFBox
PDFBox gives you more control over text position metadata, which makes paragraph detection more precise:
- Use a custom `PDFTextStripper` to track the vertical position of each text line. A larger gap between consecutive lines usually signals a new paragraph.
- Example Java code snippet (PDFBox 2.x):

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ParagraphExtractor {
    // Adjust this threshold based on your PDF's font size/spacing
    private static final float PARAGRAPH_SPACING_THRESHOLD = 15;

    public static void main(String[] args) throws Exception {
        // Declared outside the stripper so it can be read after extraction
        List<String> paragraphs = new ArrayList<>();

        // PDFBox 2.x API; in 3.x use Loader.loadPDF(new File(...)) instead
        try (PDDocument document = PDDocument.load(new File("your_document.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                private final StringBuilder currentPara = new StringBuilder();
                private float lastLineY = -1;

                @Override
                protected void writeString(String text, List<TextPosition> textPositions) {
                    String trimmedText = text.trim();
                    if (trimmedText.isEmpty()) {
                        return;
                    }
                    float currentLineY = textPositions.get(0).getY();
                    // getY() is measured from the top of the page, so it grows as
                    // lines move down; abs() also catches jumps back up (new column)
                    if (lastLineY != -1
                            && Math.abs(currentLineY - lastLineY) > PARAGRAPH_SPACING_THRESHOLD) {
                        flush();
                    }
                    currentPara.append(trimmedText).append(" ");
                    lastLineY = currentLineY;
                }

                @Override
                protected void endPage(PDPage page) {
                    // Add the last paragraph of the page
                    flush();
                    lastLineY = -1;
                }

                private void flush() {
                    if (currentPara.length() > 0) {
                        paragraphs.add(currentPara.toString().trim());
                        currentPara.setLength(0);
                    }
                }
            };
            stripper.setSortByPosition(true);
            stripper.getText(document);
        }

        // Output the extracted paragraphs
        for (String para : paragraphs) {
            System.out.println(para);
        }
    }
}
```

- Pro tip: Tweak the `PARAGRAPH_SPACING_THRESHOLD` value. It is compared against the distance between consecutive lines, so it must sit above the normal line spacing; larger fonts need a bigger threshold to distinguish paragraph breaks from line breaks within a paragraph.
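Instead of hand-tuning a fixed threshold, one option is to derive it from the document itself. The sketch below is written in Python for brevity and works on plain `(y, text)` pairs, which you would collect yourself from whichever extractor you use. The function name `group_by_gap` and the `factor` of 1.5 are assumptions for illustration, not part of PyPDF2 or PDFBox.

```python
from statistics import median


def group_by_gap(lines, factor=1.5):
    """Group (y, text) pairs into paragraphs using a relative gap threshold.

    lines must be in reading order, with y measured from the top of the
    page (so y increases from one line to the next). A gap larger than
    factor * median gap is treated as a paragraph break.
    """
    if not lines:
        return []
    # Positive gaps between consecutive lines; the median approximates
    # the normal line spacing of the body text
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:]) if b[0] > a[0]]
    threshold = factor * median(gaps) if gaps else float("inf")

    paragraphs, current = [], [lines[0][1]]
    for (prev_y, _), (y, text) in zip(lines, lines[1:]):
        if y - prev_y > threshold:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
    paragraphs.append(" ".join(current))
    return paragraphs


page_lines = [
    (100.0, "Line one of A"),
    (112.0, "line two of A."),
    (136.0, "Line one of B"),
    (148.0, "line two of B."),
]
for para in group_by_gap(page_lines):
    print(para)
```

Because the threshold scales with the median spacing, the same code should cope with different font sizes, though pages with very few lines give the median little to work with.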
General Tips for All PDFs
- For scanned (image-based) PDFs: Neither tool can extract text directly, because the pages contain only images. Run OCR first (e.g., Tesseract, via pytesseract in Python or Tess4J in Java), then apply the same paragraph grouping logic to the OCR output.
- Test with your specific document: Every PDF has unique formatting, so you might need to adjust spacing thresholds or add logic for indentation, bullet points, or other structural cues.
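As one example of such a structural cue, bullet and numbered-list lines can be separated from ordinary prose so that only true paragraphs are kept. This is a rough sketch under assumptions: `split_blocks` and `BULLET_RE` are hypothetical names, and the set of bullet markers it recognizes is a guess that will likely need extending for your documents.

```python
import re

# Assumed markers: -, *, a few bullet glyphs, or "1." / "1)" style numbering
BULLET_RE = re.compile(r"^\s*(?:[-*•‣·]|\d+[.)])\s+")


def split_blocks(lines):
    """Return (kind, text) pairs where kind is "bullet" or "paragraph"."""
    blocks = []
    kind, parts = None, []

    def flush():
        nonlocal kind, parts
        if parts:
            blocks.append((kind, " ".join(parts)))
        kind, parts = None, []

    for ln in lines:
        text = ln.strip()
        if not text:
            flush()  # blank line closes the current block
        elif BULLET_RE.match(text):
            flush()  # each bullet marker starts a new block
            kind, parts = "bullet", [BULLET_RE.sub("", text, count=1)]
        else:
            if kind is None:
                kind = "paragraph"
            parts.append(text)  # continuation of the current block
    flush()
    return blocks


sample_lines = [
    "Intro paragraph that",
    "wraps onto two lines.",
    "- first item",
    "  continued",
    "2. second item",
    "",
    "Closing paragraph.",
]
blocks = split_blocks(sample_lines)
only_paragraphs = [text for kind, text in blocks if kind == "paragraph"]
print(only_paragraphs)
```

A line that follows a bullet without a blank line in between is treated as a wrapped continuation of that bullet, which is a simplification.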
This question originally comes from Stack Exchange; it was asked by Ashok Kuramdasu.




