
How do I extract only the paragraphs from a PDF using Python or Java?

Extracting Paragraphs from PDFs with PyPDF2 and PDFBox

You can already pull all the text with PyPDF2 and PDFBox; the harder part is isolating just the paragraphs. PDFs don’t natively store "paragraphs" as a distinct element, so you have to use text layout and spacing clues to group lines into logical paragraphs. Here’s how to do it with both tools, plus some practical tweaks:

Handling Paragraph Extraction with PyPDF2

PyPDF2 has no built-in paragraph detection, but you can group the extracted text into paragraphs using line breaks and spacing. PyPDF2 is now maintained under the name pypdf, and layout-preserving extraction is available only in recent pypdf releases, so the snippet in this section imports from pypdf:

  • First, pass extraction_mode="layout" to extract_text() to preserve the original line structure (this keeps more spacing than the default extraction).
  • Then, split the text into lines and group consecutive non-empty lines; blank lines are usually the marker for paragraph breaks.
  • Example code snippet:
    from pypdf import PdfReader  # successor to PyPDF2: pip install pypdf
    
    reader = PdfReader("your_document.pdf")
    paragraphs = []
    current_paragraph = []
    
    for page in reader.pages:
        # Preserve the visual line layout (pypdf only; PyPDF2 has no layout mode)
        page_text = page.extract_text(extraction_mode="layout")
        lines = page_text.split("\n")
        
        for line in lines:
            stripped_line = line.strip()
            if stripped_line:
                # Add non-empty line to current paragraph
                current_paragraph.append(stripped_line)
            else:
                # Blank line = end of paragraph
                if current_paragraph:
                    paragraphs.append(" ".join(current_paragraph))
                    current_paragraph = []
        # Catch the last paragraph on the page
        if current_paragraph:
            paragraphs.append(" ".join(current_paragraph))
            current_paragraph = []
    
    # Print or process your paragraphs
    for para in paragraphs:
        print(para)
    
  • Pro tip: If your PDF uses first-line indentation instead of blank lines to separate paragraphs, check each line's leading spaces/tabs before stripping it, and treat an indented line as the start of a new paragraph rather than a continuation of the previous one.
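The indentation case can be sketched as follows. This is a minimal illustration, not part of the original answer: the helper names indent_width and group_by_indentation, the min_indent default, and the tab size are all assumptions chosen for the example. It takes the unstripped lines from the extraction step, treats the smallest indentation as the left margin, and starts a new paragraph whenever a line is indented further than that margin.

```python
def indent_width(line, tab_size=4):
    """Number of leading whitespace columns, with tabs expanded."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip())


def group_by_indentation(lines, min_indent=2):
    """Group raw (unstripped) lines into paragraphs.

    A blank line, or a line indented at least `min_indent` columns past
    the left margin, ends the current paragraph.
    """
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return []
    # Left margin = smallest indentation among the non-empty lines.
    margin = min(indent_width(ln) for ln in non_empty)

    paragraphs, current = [], []
    for ln in lines:
        text = ln.strip()
        starts_new = bool(text) and indent_width(ln) - margin >= min_indent
        if (not text or starts_new) and current:
            paragraphs.append(" ".join(current))
            current = []
        if text:
            current.append(text)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    "    First paragraph starts here",
    "and continues on this line.",
    "    Second paragraph starts here",
    "and ends here.",
]
for para in group_by_indentation(sample):
    print(para)
```

To use it with the loop above, pass page_text.split("\n") directly instead of stripping each line first. Centered headings and block quotes are also indented, so expect some false splits on documents that contain them.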

Handling Paragraph Extraction with PDFBox

PDFBox gives you more control over text position metadata, which makes paragraph detection more precise:

  • Use a custom PDFTextStripper to track the vertical position of each text line. A larger gap between consecutive lines usually signals a new paragraph.
  • Example Java code snippet:
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;
    
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    
    public class ParagraphExtractor {
        public static void main(String[] args) throws Exception {
            // Declared outside the stripper so main() can read the results afterwards
            final List<String> paragraphs = new ArrayList<>();
    
            // PDFBox 2.x API; in PDFBox 3.x use Loader.loadPDF(new File(...)) instead
            try (PDDocument document = PDDocument.load(new File("your_document.pdf"))) {
                PDFTextStripper stripper = new PDFTextStripper() {
                    private final StringBuilder currentPara = new StringBuilder();
                    private float lastLineY = -1;
                    // Adjust this threshold based on your PDF's font size/spacing
                    private final float PARAGRAPH_SPACING_THRESHOLD = 15;
    
                    @Override
                    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                        String trimmedText = text.trim();
                        if (!trimmedText.isEmpty() && !textPositions.isEmpty()) {
                            // getY() is measured from the top of the page, so it grows as we move down
                            float currentLineY = textPositions.get(0).getY();
                            // Check if the gap between lines is large enough for a new paragraph
                            if (lastLineY != -1 && (currentLineY - lastLineY) > PARAGRAPH_SPACING_THRESHOLD) {
                                if (currentPara.length() > 0) {
                                    paragraphs.add(currentPara.toString().trim());
                                    currentPara.setLength(0);
                                }
                            }
                            currentPara.append(trimmedText).append(" ");
                            lastLineY = currentLineY;
                        }
                    }
    
                    @Override
                    protected void endPage(PDPage page) throws IOException {
                        // Add the last paragraph of the page
                        if (currentPara.length() > 0) {
                            paragraphs.add(currentPara.toString().trim());
                            currentPara.setLength(0);
                        }
                        lastLineY = -1;
                    }
                };
    
                // Process text top to bottom so consecutive Y values are comparable
                stripper.setSortByPosition(true);
                // The returned string is ignored; the overrides collect the paragraphs
                stripper.getText(document);
            }
    
            // Output the extracted paragraphs
            for (String para : paragraphs) {
                System.out.println(para);
            }
        }
    }
    
  • Pro tip: Tweak the PARAGRAPH_SPACING_THRESHOLD value. It has to sit between the normal line-to-line distance and the distance across a paragraph break, so larger fonts need a bigger threshold to distinguish paragraph breaks from line breaks within a paragraph.
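Instead of hand-tuning a fixed threshold, one option is to derive it from the document itself. The sketch below is written in Python to match the PyPDF2 example and is only an illustration of the idea: the function name split_by_gap, the factor default of 1.5, and the (y, text) input format are assumptions, with y taken to grow down the page the way PDFBox's TextPosition.getY() reports it. It treats the median line gap as the normal line spacing and breaks wherever a gap is clearly larger.

```python
from statistics import median


def split_by_gap(lines, factor=1.5):
    """Split lines into paragraphs using an adaptive vertical-gap threshold.

    `lines` is a list of (y, text) tuples ordered top to bottom, with y
    increasing down the page.
    """
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    positive = [g for g in gaps if g > 0]
    if not positive:
        # Single line, or no usable spacing information: one paragraph.
        return [" ".join(text for _, text in lines)]

    # Typical line spacing is the median gap; anything clearly larger
    # than that is treated as a paragraph break.
    threshold = factor * median(positive)

    paragraphs, current = [], [lines[0][1]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > threshold:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
    paragraphs.append(" ".join(current))
    return paragraphs


page_lines = [
    (72.0, "A1"), (86.0, "A2"), (100.0, "A3"),  # 14 pt line spacing
    (128.0, "B1"), (142.0, "B2"),               # 28 pt gap before B1
]
print(split_by_gap(page_lines))
```

The same calculation carries over to the Java stripper if you collect the line Y values for a page first and compute the threshold before grouping. The median works well when most gaps are ordinary line breaks; pages made up mostly of one-line paragraphs will skew it.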

General Tips for All PDFs

  • For scanned (image-based) PDFs: Neither tool can extract text directly, because the pages contain only images. Run OCR first (e.g., Tesseract, through a wrapper such as pytesseract in Python or Tess4J in Java), then apply the same paragraph grouping logic to the OCR output.
  • Test with your specific document: Every PDF has unique formatting, so you might need to adjust spacing thresholds or add logic for indentation, bullet points, or other structural cues.
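As an example of handling other structural cues, the sketch below deals with two common ones: bullet points and words hyphenated across line breaks. It is an illustration under stated assumptions, not a general solution: the BULLET pattern, and the helper names join_lines and split_bullets, are made up for this example, and the set of bullet markers will need adjusting for your documents.

```python
import re

# Assumed bullet markers: common symbols, or "1." / "1)" style numbering.
BULLET = re.compile(r"^(?:[•\-*▪◦]|\d+[.)])\s+")


def join_lines(lines):
    """Join the lines of one block, repairing end-of-line hyphenation."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-") and line[0].islower():
            # "extrac-" + "tion" -> "extraction"
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


def split_bullets(lines):
    """Split a group of lines so that each bullet item is its own block."""
    blocks, current = [], []
    for line in lines:
        if BULLET.match(line.strip()) and current:
            blocks.append(join_lines(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        blocks.append(join_lines(current))
    return blocks


sample_lines = [
    "Key points:",
    "• Text extrac-",
    "tion is lossy",
    "• Layout varies",
]
for block in split_bullets(sample_lines):
    print(block)
```

Note that the hyphen repair also merges genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so check the output against your source if exact wording matters.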

The question comes from Stack Exchange; it was asked by Ashok Kuramdasu.
