How can I use regular expressions to index author information in a specific format from a PDF?
Got it, let's tackle this problem of extracting and indexing author info from your PDF while keeping those critical parentheses intact. Here are a few solid solutions that should work better than your initial attempts:
Method 1: Improve the JavaScript logic in Adobe Acrobat DC
Your first approach using getPageNumWords() and getPageNthWord() failed because getPageNthWord() strips punctuation and whitespace from each word by default (its optional third argument, bStrip, defaults to true), so the parentheses never reach your regex. Pass false as the third argument to keep the punctuation, and join the words back into one string per page before matching.
Here's a revised script you can run in Acrobat's JavaScript console (Ctrl+J / Cmd+J):

    // Map each matched author string to the pages it appears on
    var authorIndex = {};
    // Regex to match your author pattern: "Last, First, City, (CountryCode)"
    var authorRegex = /([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)/g;

    // Loop through every page in the PDF
    for (var page = 0; page < this.numPages; page++) {
        // Rebuild the page text; bStrip = false keeps punctuation and whitespace
        var pageText = "";
        var wordCount = this.getPageNumWords(page);
        for (var i = 0; i < wordCount; i++) {
            pageText += this.getPageNthWord(page, i, false);
        }
        // Find all author matches on the current page
        authorRegex.lastIndex = 0;
        var match;
        while ((match = authorRegex.exec(pageText)) !== null) {
            var fullAuthorStr = match[0]; // e.g., "Doe, John, New York City, (USA)"
            if (!authorIndex[fullAuthorStr]) {
                authorIndex[fullAuthorStr] = [];
            }
            authorIndex[fullAuthorStr].push(page + 1); // 1-based page number
        }
    }

    // Print the final index to the console (Acrobat uses console.println)
    for (var author in authorIndex) {
        console.println(author + " - pages: " + authorIndex[author].join(","));
    }

This script should preserve the parentheses because every word is read with its punctuation intact and the regex runs on the rejoined page text. You can copy the console output directly or modify the script to save it to a file if needed.
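Before running the regex against a whole PDF, it can help to check it against a sample string outside Acrobat. The following is a minimal Python sketch, assuming the same author pattern as above; the sample text is invented for illustration. It also shows why text with the punctuation stripped cannot match.

```python
import re

# Same pattern as the Acrobat script: "Last, First, City, (CountryCode)"
AUTHOR_PATTERN = re.compile(
    r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)'
)

# Hypothetical page text, invented for illustration
sample_text = "Speakers: Doe, John, New York City, (USA); Roe, Jane, Paris, (FR)"

# With punctuation intact, both author entries are found
matches = [m.group(0) for m in AUTHOR_PATTERN.finditer(sample_text)]
print(matches)

# With punctuation stripped (as in word-by-word extraction), nothing matches
stripped_text = re.sub(r'[^\w\s]', '', sample_text)
print(AUTHOR_PATTERN.search(stripped_text))
```

The escaped \( and \) in the pattern are what require the literal parentheses around the country code, so any extraction step that drops them makes a match impossible.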
Method 2: Use an open-source PDF text extraction tool (pdftotext)
If you don't want to rely on Acrobat, the pdftotext tool (part of the Poppler utilities) is a reliable open-source alternative. It keeps punctuation in place and, with the -layout option, preserves the original page layout as closely as plain text allows.
Steps:
- Install Poppler:
  - Windows: Download pre-built binaries from the Poppler project
  - Mac: Run brew install poppler in Terminal
  - Linux: Run sudo apt-get install poppler-utils
- Extract text with layout preserved:
  Run this command in your terminal to export the PDF to a text file while keeping the original formatting:

      pdftotext -layout your_document.pdf author_info.txt

- Index authors with a script:
  Use a simple Python script that calls pdftotext once per page, so every match can be tied to its page number, and collects the pages for each author. Set total_pages to your PDF's actual page count before running it:

      import re
      import subprocess
      from collections import defaultdict

      # Match your author pattern: "Last, First, City, (CountryCode)"
      author_pattern = re.compile(r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)')
      author_pages = defaultdict(list)

      total_pages = 10  # Replace with your PDF's actual page count

      # Extract one page at a time to track page numbers accurately
      for page_num in range(1, total_pages + 1):
          # "-" as the output file makes pdftotext write the page text to stdout
          result = subprocess.run(
              ['pdftotext', '-f', str(page_num), '-l', str(page_num), 'your_document.pdf', '-'],
              capture_output=True, encoding='utf-8', check=True,
          )
          for match in author_pattern.findall(result.stdout):
              author_str = f"{match[0]}, {match[1]}, {match[2]}, ({match[3]})"
              author_pages[author_str].append(page_num)

      # Print the final index
      for author, pages in author_pages.items():
          print(f"{author} - pages: {','.join(map(str, pages))}")

This method is great for batch processing and doesn't require a paid Acrobat license.
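If you would rather run pdftotext only once, note that it ends each page with a form feed character (\f) unless you pass -nopgbrk, so a full-document export such as author_info.txt can be split back into pages. The following is a rough sketch under that assumption; index_authors is a hypothetical helper name, and the sample export is invented for illustration.

```python
import re
from collections import defaultdict

AUTHOR_PATTERN = re.compile(
    r'([A-Za-z]+),\s*([A-Za-z]+),\s*([A-Za-z\s]+),\s*\(([A-Z]{2,3})\)'
)

def index_authors(full_text):
    """Map each author string to the 1-based pages it appears on."""
    author_pages = defaultdict(list)
    # Splitting on the form feed yields one chunk of text per page
    for page_num, page_text in enumerate(full_text.split('\f'), start=1):
        for m in AUTHOR_PATTERN.finditer(page_text):
            # Collapse layout whitespace so the same author always gets one key
            author = ' '.join(m.group(0).split())
            if page_num not in author_pages[author]:
                author_pages[author].append(page_num)
    return dict(author_pages)

# Hypothetical three-page export, invented for illustration
sample_export = (
    "Doe, John, New York City, (USA)\n\f"
    "No authors on this page\n\f"
    "Doe, John,  New York City, (USA)\nRoe, Jane, Paris, (FR)\n\f"
)
index = index_authors(sample_export)
for author, pages in index.items():
    print(f"{author} - pages: {','.join(map(str, pages))}")
```

For a real document, read the exported file with open('author_info.txt', encoding='utf-8') and pass its contents to index_authors. This variant avoids having to know the page count in advance.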
Method 3: Avoid Word/RTF export (unless you have no other choice)
Exporting to Word or RTF is risky: these formats often mangle punctuation, add extra line breaks, or convert parentheses to full-width characters, any of which breaks your regex. If you absolutely have to go this route, export the PDF first, then check that every author entry is still in the Last, First, City, (CountryCode) format before running your regex.
This question comes from Stack Exchange; asked by Code and coffee.




