
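How can I use regular expressions to index author information in a specific format in a PDF?

Solution: Extract parenthesized author information from a PDF and index the page numbers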

Got it, let's tackle this problem of extracting and indexing author info from your PDF while keeping those critical parentheses intact. Here are a few solid solutions that should work better than your initial attempts:

Method 1: Improve the Adobe Acrobat DC JavaScript logic

Your first approach using getPageNumWords() and getPageNthWord() failed because getPageNthWord() strips punctuation and whitespace from each word by default, so the commas and parentheses never reach your regex. Pass false as the third argument (bStrip) to keep them, then join the words of each page into one string and run the regex on that string.

Here's a revised script you can run in Acrobat's JavaScript console (Ctrl+J / Cmd+J). It sticks to older syntax (var, console.println) because Acrobat's JavaScript engine may not support newer features such as const, template literals, or console.log:

// Map each matched author string to the pages where it appears
var authorIndex = {};
// Regex to match your author pattern: "Last, First, City, (CountryCode)"
var authorRegex = /([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)/g;

// Loop through every page in the PDF (page indexes are 0-based)
for (var page = 0; page < this.numPages; page++) {
  // Rebuild the page text word by word; bStrip = false keeps punctuation and whitespace
  var pageText = "";
  var wordCount = this.getPageNumWords(page);
  for (var i = 0; i < wordCount; i++) {
    pageText += this.getPageNthWord(page, i, false);
  }
  var match;
  // Find all author matches on the current page
  while ((match = authorRegex.exec(pageText)) !== null) {
    var fullAuthorStr = match[0]; // e.g., "Doe, John, New York City, (USA)"
    var pageNumber = page + 1; // Convert to 1-based page number
    if (!authorIndex[fullAuthorStr]) {
      authorIndex[fullAuthorStr] = [];
    }
    // Pages are visited in order, so a repeat on the same page is always the last entry
    var pages = authorIndex[fullAuthorStr];
    if (pages[pages.length - 1] !== pageNumber) {
      pages.push(pageNumber);
    }
  }
}

// Print the final index to the console
for (var author in authorIndex) {
  console.println(author + " - pages: " + authorIndex[author].join(", "));
}

The parentheses survive because each word is read with its punctuation intact. The spacing in the rebuilt text may differ slightly from the visual layout, which is why the regex allows flexible whitespace after each comma. You can copy the console output directly or adapt the script to write the index to a file.
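
To see why the punctuation matters to the match, here is a small Python sketch that runs the same author pattern on one line of text twice: once with punctuation intact, and once after simulating extraction that strips punctuation from every word. The sample line is invented for illustration, and the stripping step is only a stand-in for what a word-by-word extractor might do.

```python
import re

# Same author pattern as the script above: "Last, First, City, (CountryCode)"
AUTHOR_RE = re.compile(r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)')

# Hypothetical sample: one line of page text with punctuation intact
with_punctuation = "Doe, John, New York City, (USA)"

# Simulate word-by-word extraction that drops punctuation from each word
stripped_words = re.findall(r'[A-Za-z]+', with_punctuation)
without_punctuation = " ".join(stripped_words)

kept = AUTHOR_RE.search(with_punctuation)
lost = AUTHOR_RE.search(without_punctuation)

print(kept.group(0) if kept else None)  # the full entry, parentheses included
print(lost)                             # None: no commas or parentheses left to match
```

Once the commas and parentheses are gone, the pattern has nothing to anchor on, so the fix has to happen at extraction time rather than in the regex.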

Method 2: Use an open-source PDF text extraction tool (pdftotext)

If you don't want to rely on Acrobat, pdftotext (part of the Poppler utilities) is an open-source alternative that outputs each page's text with its punctuation intact and can optionally preserve the physical layout.

Steps:

  1. Install Poppler:
    • Windows: Download a pre-built Poppler binary package and add its bin folder to your PATH
    • Mac: Run brew install poppler in Terminal
    • Linux (Debian/Ubuntu): Run sudo apt-get install poppler-utils
  2. Extract text with layout preserved:
    Run this command in your terminal to export the PDF to a text file while keeping original formatting:
    pdftotext -layout your_document.pdf author_info.txt
    
  3. Index authors with a script
    Use a short Python script to read the text file, match authors, and record their pages. pdftotext ends each page with a form feed character ("\f"), so splitting on it recovers the page boundaries without needing the page count:
    import re
    from collections import defaultdict

    # Match your author pattern: "Last, First, City, (CountryCode)"
    author_pattern = re.compile(r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)')
    author_pages = defaultdict(list)

    # Read the file produced in step 2 and split it into one chunk per page
    with open('author_info.txt', 'r', encoding='utf-8') as f:
        pages = f.read().split('\f')

    for page_num, text in enumerate(pages, start=1):
        for last, first, city, country in author_pattern.findall(text):
            # Collapse extra spaces or line breaks that -layout may leave inside the city
            city = ' '.join(city.split())
            author_str = f"{last}, {first}, {city}, ({country})"
            # Record each page only once per author
            if page_num not in author_pages[author_str]:
                author_pages[author_str].append(page_num)

    # Print the final index
    for author, page_list in author_pages.items():
        print(f"{author} - pages: {', '.join(map(str, page_list))}")
    

This method is great for batch processing and doesn't require a paid Acrobat license.
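
For batch runs, one possible sketch is to call pdftotext from Python and read its output from stdout (passing "-" as the output file), so no intermediate text files are needed. The helper names extract_pages, index_authors, and index_folder are hypothetical, and the sample text stands in for real pdftotext output; extract_pages assumes Poppler is installed and on your PATH.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

AUTHOR_RE = re.compile(r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)')

def extract_pages(pdf_path):
    """Run pdftotext and return one string per page (requires Poppler on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True, encoding="utf-8", check=True,
    )
    # pdftotext terminates each page with a form feed character
    return result.stdout.split("\f")

def index_authors(pages):
    """Map each normalized author string to the sorted 1-based pages it appears on."""
    index = defaultdict(set)
    for page_num, text in enumerate(pages, start=1):
        for last, first, city, country in AUTHOR_RE.findall(text):
            city = " ".join(city.split())  # collapse spacing introduced by -layout
            index[f"{last}, {first}, {city}, ({country})"].add(page_num)
    return {author: sorted(nums) for author, nums in index.items()}

def index_folder(folder):
    """Index every PDF in a folder, keyed by file name."""
    return {pdf.name: index_authors(extract_pages(pdf))
            for pdf in sorted(Path(folder).glob("*.pdf"))}

# Invented sample standing in for pdftotext output: two pages, each ending in a form feed
sample_pages = (
    "Doe, John, New York City, (USA)\n\f"
    "intro\nDoe, John,  New York City, (USA)\nRoe, Jane, Paris, (FR)\n\f"
).split("\f")
sample_index = index_authors(sample_pages)
print(sample_index)
```

Keeping the indexing step separate from the extraction step means the regex can be tested on plain strings before pointing index_folder at a real directory of PDFs.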

Method 3: Avoid Word/RTF export (unless there is no other option)

Exporting to Word or RTF is risky: these formats often alter punctuation, add extra line breaks, or convert parentheses to full-width characters, any of which breaks your regex. If you have to go this route, export the PDF first, then check that every author entry still follows the Last, First, City, (CountryCode) format before running your regex.
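
The check described above could look something like the following Python sketch. It assumes one author entry per line, maps the full-width parentheses and commas back to ASCII, and sorts the entries into those that still match the format and those that need a manual look. The function names and the sample lines are invented for illustration.

```python
import re

AUTHOR_RE = re.compile(r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)')

# Full-width punctuation that a Word/RTF round trip may introduce, mapped back to ASCII
FULLWIDTH_TO_ASCII = str.maketrans({"（": "(", "）": ")", "，": ","})

def normalize(line):
    """Replace full-width punctuation and collapse stray whitespace."""
    return " ".join(line.translate(FULLWIDTH_TO_ASCII).split())

def check_entries(lines):
    """Return (valid, suspect) lists of normalized entries, skipping blank lines."""
    valid, suspect = [], []
    for line in lines:
        cleaned = normalize(line)
        if not cleaned:
            continue
        (valid if AUTHOR_RE.fullmatch(cleaned) else suspect).append(cleaned)
    return valid, suspect

# Hypothetical lines from an exported document
exported = [
    "Doe, John, New York City, （USA）",   # full-width parentheses
    "Roe，Jane，Paris，(FR)",               # full-width commas
    "Smith, Anna, Berlin DE",              # parentheses lost entirely
]
valid, suspect = check_entries(exported)
print(valid)
print(suspect)
```

Entries that land in the suspect list cannot be repaired automatically with any confidence, so they are worth checking against the original PDF.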


The question comes from Stack Exchange; asked by Code and coffee.
