How do you read text at a specific position in a PDF with Java? Solving information extraction for fixed-format PDFs
Hey there! Let’s work through the problem you’re having with your fixed-format PDF reader in Java. Since your PDFs follow a consistent layout, targeting specific positions to check for missing info is totally feasible. Here are some practical, code-backed approaches:
Fixed-format PDFs mean each data field lives in a predictable bounding box. You can define these boxes for every field you care about, then extract text only from those regions to check if they’re empty.
Two popular Java libraries make this straightforward: Apache PDFBox and iText 7.
Example with Apache PDFBox
PDFBox’s PDFTextStripperByArea takes its regions in a coordinate system where the top-left corner of the page is (0,0) and y grows downward, with values in PDF points (1/72 inch). First, you’ll need to figure out the exact x, y, width, and height of each field’s bounding box (you can use tools like Adobe Acrobat’s "Measure" tool to get these values).
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripperByArea;

    import java.awt.Rectangle;
    import java.io.File;

    public class PdfFieldChecker {
        public static void main(String[] args) throws Exception {
            // Load the PDF document (PDFBox 2.x API; in PDFBox 3.x use
            // org.apache.pdfbox.Loader.loadPDF(new File(...)) instead)
            try (PDDocument document = PDDocument.load(new File("your_target.pdf"))) {
                PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();

                // Define the bounding box for your target field: x, y (from top), width, height.
                // Adjust the coordinates to match your PDF.
                Rectangle customerNameArea = new Rectangle(120, 180, 250, 25);
                areaStripper.addRegion("customerName", customerNameArea);

                // Extract text from the defined region on the first page (pages are 0-based)
                areaStripper.extractRegions(document.getPage(0));
                String extractedText = areaStripper.getTextForRegion("customerName").trim();

                // Check if the field is empty/missing
                if (extractedText.isEmpty()) {
                    System.out.println("Customer Name field is missing info");
                    // Handle the missing data (e.g., set a default value, flag the record)
                } else {
                    System.out.println("Extracted Customer Name: " + extractedText);
                    // Proceed to upload this value to your database
                }
            }
        }
    }
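If your measuring tool reports millimetres rather than points, the values need converting before they go into a region. Here is a minimal sketch, assuming the measurements are taken from the top-left corner of the page and that the page uses the default PDF unit of 1/72 inch. The names RegionUnits, mmToPoints and regionFromMm are made up for illustration and are not part of PDFBox, and the numbers in main are placeholders.

```java
import java.awt.geom.Rectangle2D;

public class RegionUnits {
    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    /** Converts a length in millimetres to PDF points (1 point = 1/72 inch). */
    public static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /** Builds a top-left-origin region in points from measurements in millimetres. */
    public static Rectangle2D regionFromMm(double leftMm, double topMm,
                                           double widthMm, double heightMm) {
        return new Rectangle2D.Double(
                mmToPoints(leftMm), mmToPoints(topMm),
                mmToPoints(widthMm), mmToPoints(heightMm));
    }

    public static void main(String[] args) {
        // Placeholder measurements; replace with the values from your own PDF
        Rectangle2D box = regionFromMm(42.3, 63.5, 88.2, 8.8);
        System.out.println(box);
    }
}
```

The resulting Rectangle2D can be passed to addRegion, which accepts any Rectangle2D.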
Example with iText 7
iText uses a coordinate system where the bottom-left corner of the page is (0,0), and its Rectangle is defined by the lower-left corner of the box. If you measured a field from the top of the page, convert it with y = pageHeight - topOffset - height.
    import com.itextpdf.kernel.geom.Rectangle;
    import com.itextpdf.kernel.pdf.PdfDocument;
    import com.itextpdf.kernel.pdf.PdfPage;
    import com.itextpdf.kernel.pdf.PdfReader;
    import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
    import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
    import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredTextEventListener;
    import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;

    public class ITextFieldChecker {
        public static void main(String[] args) throws Exception {
            try (PdfDocument pdfDoc = new PdfDocument(new PdfReader("your_target.pdf"))) {
                PdfPage page = pdfDoc.getPage(1); // iText pages are 1-based
                float pageHeight = page.getPageSize().getHeight();

                // Define the target region with iText's own Rectangle (not java.awt.Rectangle):
                // x, y of the lower-left corner (from the page bottom), width, height.
                // Here 210 is the distance from the page top to the bottom edge of the box.
                Rectangle customerEmailArea = new Rectangle(120, pageHeight - 210, 250, 25);
                TextRegionEventFilter regionFilter = new TextRegionEventFilter(customerEmailArea);

                // Extract text from the filtered region only
                LocationTextExtractionStrategy extractionStrategy = new LocationTextExtractionStrategy();
                FilteredTextEventListener filteredListener =
                        new FilteredTextEventListener(extractionStrategy, regionFilter);
                String extractedText = PdfTextExtractor.getTextFromPage(page, filteredListener).trim();

                // Validate the extracted content
                if (extractedText.isEmpty()) {
                    System.out.println("Customer Email field has missing info");
                    // Add your missing-data handling logic here
                } else {
                    System.out.println("Extracted Customer Email: " + extractedText);
                    // Continue with database upload
                }
            }
        }
    }
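The y adjustment is easy to get wrong by one box height, so it can help to isolate it in a tiny helper. This is only a sketch: PdfCoordinates and toBottomLeftY are hypothetical names, and the 792-point page height assumes a US Letter page, so read the real height from your own document.

```java
public class PdfCoordinates {

    /**
     * Converts a box measured from the top of the page into the y value of its
     * lower edge in a bottom-left-origin coordinate system.
     *
     * @param pageHeight total page height in points
     * @param topOffset  distance from the page top to the top edge of the box
     * @param boxHeight  height of the box
     */
    public static float toBottomLeftY(float pageHeight, float topOffset, float boxHeight) {
        return pageHeight - topOffset - boxHeight;
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumed US Letter height in points
        // A box whose top edge is 185 pt below the page top and which is 25 pt tall
        float y = toBottomLeftY(pageHeight, 185f, 25f);
        System.out.println(y); // prints 582.0, the same as pageHeight - 210
    }
}
```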
- Whitespace vs. Empty: Always use trim() so that stray spaces or newlines in an otherwise empty region are not mistaken for real content.
- Scanned PDFs: If some of your "fixed-format" PDFs are scanned images (not editable text), text extraction won’t work. You’ll need to add OCR (like Tesseract) to convert the image to text first before checking positions.
- Inconsistent Coordinates: Double-check coordinates across a few sample PDFs to ensure the layout is truly fixed (minor variations might require adjusting bounding boxes slightly).
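One caveat on the whitespace point: String.trim() only strips characters up to U+0020, so a region holding nothing but non-breaking spaces (U+00A0) would still count as non-empty. If that turns out to matter for your files, a stricter check could look like the sketch below. BlankCheck and isEffectivelyEmpty are hypothetical names, not library calls.

```java
public class BlankCheck {

    /**
     * Returns true when the text is null, empty, or made up entirely of
     * whitespace or Unicode space characters (including non-breaking spaces).
     */
    public static boolean isEffectivelyEmpty(String text) {
        if (text == null) {
            return true;
        }
        return text.codePoints()
                .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        String nbspOnly = "\u00A0\u00A0";
        System.out.println(nbspOnly.trim().isEmpty());     // false: trim() leaves U+00A0 alone
        System.out.println(isEffectivelyEmpty(nbspOnly));  // true
        System.out.println(isEffectivelyEmpty(" Jane "));  // false
    }
}
```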
Once you’ve got the per-region extraction working, you can loop through all your target fields, validate each one, and handle missing data consistently before uploading to your database.
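As a rough sketch of that loop, the example below keeps the field regions in a map and collects the names of the fields that come back empty. To stay library-independent it takes the extraction step as a function; in real code that function would wrap the PDFBox or iText call from the examples above. FieldValidationSketch, findMissingFields and the stand-in extractor are hypothetical, and the coordinates are placeholders.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FieldValidationSketch {

    /** Returns the names of all fields whose region yields null or whitespace-only text. */
    public static List<String> findMissingFields(
            Map<String, Rectangle2D> regions,
            Function<Rectangle2D, String> extractor) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Rectangle2D> entry : regions.entrySet()) {
            String text = extractor.apply(entry.getValue());
            if (text == null || text.trim().isEmpty()) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in the order they were registered
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("customerName", new Rectangle2D.Double(120, 180, 250, 25));
        regions.put("customerEmail", new Rectangle2D.Double(120, 210, 250, 25));

        // Stand-in for the real PDF extraction: pretends only the first box has text
        Function<Rectangle2D, String> fakeExtractor =
                box -> box.getY() == 180 ? "Jane Doe\n" : "   ";

        System.out.println(findMissingFields(regions, fakeExtractor)); // prints [customerEmail]
    }
}
```

Returning the list of missing field names, rather than printing inside the loop, lets the caller decide whether to fill defaults, flag the record, or skip the database upload.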
The question originally comes from Stack Exchange; it was asked by Mario Jaramillo.




