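How can I preserve the original format and structure when replacing text in a PDF with Java?
Automating PDF text replacement: trouble preserving the original formatting, and a request for help
Requirements overview
I am developing an automated text-replacement feature for PDF documents in Java. The core goal is to fully preserve the PDF's original formatting and structure. I have tried several approaches and none produced the expected result, so I am looking for workable implementation suggestions.
Approaches tried so far
Approach 1: Extract, edit, and re-insert text with PDFBox
Text extraction code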
package PDFbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java ExtractText <pdfFilePath> <outputTextFilePath>");
            return;
        }

        String pdfFilePath = args[0];
        String outputTextFilePath = args[1];

        try (PDDocument document = PDDocument.load(new File(pdfFilePath));
             FileWriter writer = new FileWriter(outputTextFilePath)) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            writer.write(text);
            System.out.println("Text content extracted and saved to " + outputTextFilePath);
        }
    }
}
Editing step
Manually edit the extracted text in a text file.
Text re-insertion code
package PDFbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class InsertText {
    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("Usage: java InsertText <pdfFilePath> <textFilePath> <outputPdfFilePath>");
            return;
        }

        String pdfFilePath = args[0];
        String textFilePath = args[1];
        String outputPdfFilePath = args[2];

        // Load the text content
        String editedText = new String(Files.readAllBytes(Paths.get(textFilePath)));

        try (PDDocument document = PDDocument.load(new File(pdfFilePath))) {
            PDPage page = document.getPage(0);

            // Modify the page content: this appends new text on top of the first page,
            // it does not remove or change the text that is already there
            try (PDPageContentStream contentStream = new PDPageContentStream(
                    document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                contentStream.beginText();
                contentStream.setFont(PDType1Font.HELVETICA, 12);
                contentStream.newLineAtOffset(25, 750);

                // Split text into lines to handle line breaks; "\\R" matches any line
                // terminator, so a trailing '\r' is not passed on to showText
                String[] lines = editedText.split("\\R");
                for (String line : lines) {
                    contentStream.showText(line);
                    contentStream.newLineAtOffset(0, -15); // Move to the next line
                }
                contentStream.endText();
            }

            // Save the updated PDF
            document.save(outputPdfFilePath);
            System.out.println("Edited text inserted and PDF saved to " + outputPdfFilePath);
        }
    }
}
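Problems encountered
- The formatting and layout of the PDF change significantly after editing
- Text alignment is off, fonts do not match, and pagination is broken
Approach 2: Extract, edit, and re-insert text with iText
Text extraction code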
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ExtractTextUsingIText {
    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader("path/to/pdf");
            String text = PdfTextExtractor.getTextFromPage(reader, 1);
            System.out.println(text);
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Editing step
Manually edit the extracted text in a text file.
Text re-insertion code
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.FileOutputStream;
import java.io.IOException;

public class InsertTextUsingIText {
    public static void main(String[] args) {
        Document document = new Document();
        try {
            PdfWriter.getInstance(document, new FileOutputStream("path/to/edited_pdf"));
            document.open();
            document.add(new Paragraph("Edited text goes here"));
            document.close();
            System.out.println("Edited text inserted and PDF saved");
        } catch (DocumentException | IOException e) {
            e.printStackTrace();
        }
    }
}
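Problems encountered
- The original structure of the PDF is completely destroyed after editing
- All original formatting and layout are lost
Approach 3: Edit the PDF's binary data directly
Code to export the PDF as a binary file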
package PDFbox;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ExportPDFAsBinary {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java ExportPDFAsBinary <sourcePdfPath> <outputBinaryFilePath>");
            return;
        }

        String sourcePdfPath = args[0];
        String outputBinaryFilePath = args[1];

        try (FileInputStream fis = new FileInputStream(new File(sourcePdfPath));
             FileOutputStream fos = new FileOutputStream(new File(outputBinaryFilePath))) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                fos.write(buffer, 0, bytesRead);
            }
            System.out.println("PDF content exported as binary data to " + outputBinaryFilePath);
        }
    }
}
Note: the output file must use the .binary extension. This step runs without any problems.
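Since the export step only copies bytes from one file to another, the same result can presumably be reached with `java.nio.file.Files.copy`. The sketch below assumes exactly that and nothing more. The class `CopyPdfBytes` and the helper `copyAndCompare` are hypothetical names introduced for illustration, and the temporary files merely stand in for a real PDF.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper class, not part of the original code.
final class CopyPdfBytes {

    // Writes data to a temp file, copies it with Files.copy, and reports
    // whether the copy is byte-for-byte identical to the source data.
    static boolean copyAndCompare(byte[] data) {
        try {
            Path source = Files.createTempFile("source", ".pdf");
            Path target = Files.createTempFile("export", ".binary");
            try {
                Files.write(source, data);
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                return Arrays.equals(data, Files.readAllBytes(target));
            } finally {
                // Clean up the stand-in files
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample bytes standing in for a PDF file
        byte[] sample = "%PDF-1.4 sample bytes".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Copy identical: " + copyAndCompare(sample));
    }
}
```

Because the copy is byte-for-byte, the exported file is still the same PDF under a different name.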
Code to edit the binary data
package PDFbox;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EditBinaryPDF {
    public static void main(String[] args) throws IOException {
        if (args.length < 4) {
            System.err.println("Usage: java EditBinaryPDF <binaryFilePath> <outputBinaryFilePath> <searchString> <replaceString>");
            return;
        }

        String binaryFilePath = args[0];
        String outputBinaryFilePath = args[1];
        String searchString = args[2];
        String replaceString = args[3];

        // Ensure search and replace strings are of the same length
        if (searchString.length() != replaceString.length()) {
            System.err.println("Search and replace strings must be of the same length");
            return;
        }

        // Read the binary file into a byte array
        byte[] binaryData = readBinaryFile(binaryFilePath);

        // Convert search and replace strings to byte arrays
        byte[] searchBytes = searchString.getBytes(StandardCharsets.ISO_8859_1);
        byte[] replaceBytes = replaceString.getBytes(StandardCharsets.ISO_8859_1);

        // Edit the binary data
        binaryData = replaceTextInBinaryData(binaryData, searchBytes, replaceBytes);

        // Save the edited binary data to the output file
        try (FileOutputStream fos = new FileOutputStream(new File(outputBinaryFilePath))) {
            fos.write(binaryData);
        }
        System.out.println("Edited binary data saved to " + outputBinaryFilePath);
    }

    private static byte[] readBinaryFile(String filePath) throws IOException {
        // Files.readAllBytes reads the whole file; a single InputStream.read(byte[])
        // call is not guaranteed to fill the array
        return Files.readAllBytes(Paths.get(filePath));
    }

    private static byte[] replaceTextInBinaryData(byte[] binaryData, byte[] searchBytes, byte[] replaceBytes) {
        for (int i = 0; i <= binaryData.length - searchBytes.length; i++) {
            boolean match = true;
            for (int j = 0; j < searchBytes.length; j++) {
                if (binaryData[i + j] != searchBytes[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                System.arraycopy(replaceBytes, 0, binaryData, i, replaceBytes.length);
                i += searchBytes.length - 1; // Move past the replaced text
            }
        }
        return binaryData;
    }
}
Note: the code requires the replacement text to be exactly as long as the original text, which is a severe limitation.
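To make the same-length constraint concrete, here is a minimal sketch that works on an in-memory byte array instead of a file. It compares lengths in bytes rather than characters and pads a shorter replacement with trailing spaces. The padding is an assumption added for illustration, not something the code above does, and `SameLengthReplace`, `padToLength`, and `replaceSameLength` are hypothetical names. The sample bytes are made up as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class illustrating the same-length constraint.
final class SameLengthReplace {

    // Pads the replacement with trailing spaces up to the given length.
    // A replacement longer than that cannot be fitted and is rejected.
    static byte[] padToLength(byte[] replacement, int length) {
        if (replacement.length > length) {
            throw new IllegalArgumentException("Replacement is longer than the search text");
        }
        byte[] padded = Arrays.copyOf(replacement, length);
        Arrays.fill(padded, replacement.length, length, (byte) ' ');
        return padded;
    }

    // True if pattern occurs in data starting at offset.
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    // Returns a copy of data in which every occurrence of search is
    // overwritten in place, so the total length never changes.
    static byte[] replaceSameLength(byte[] data, byte[] search, byte[] replacement) {
        if (search.length == 0) {
            throw new IllegalArgumentException("Search text must not be empty");
        }
        byte[] padded = padToLength(replacement, search.length);
        byte[] result = data.clone();
        for (int i = 0; i <= result.length - search.length; i++) {
            if (matchesAt(result, i, search)) {
                System.arraycopy(padded, 0, result, i, padded.length);
                i += search.length - 1; // skip past the replaced bytes
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "(Hello World) Tj".getBytes(StandardCharsets.ISO_8859_1);
        byte[] search = "World".getBytes(StandardCharsets.ISO_8859_1);
        byte[] replacement = "PDF".getBytes(StandardCharsets.ISO_8859_1);
        byte[] edited = replaceSameLength(data, search, replacement);
        System.out.println(new String(edited, StandardCharsets.ISO_8859_1));
        System.out.println("Length unchanged: " + (edited.length == data.length));
    }
}
```

Padding keeps the byte count stable, but the extra spaces are part of the text and will be rendered, so this does not remove the limitation.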
Code to rebuild the PDF from the binary file
package PDFbox;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RecreatePDFFromBinary {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java RecreatePDFFromBinary <inputBinaryFilePath> <outputPdfPath>");
            return;
        }

        String inputBinaryFilePath = args[0];
        String outputPdfPath = args[1];

        try (FileInputStream fis = new FileInputStream(new File(inputBinaryFilePath));
             FileOutputStream fos = new FileOutputStream(new File(outputPdfPath))) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                fos.write(buffer, 0, bytesRead);
            }
            System.out.println("PDF recreated from binary data at " + outputPdfPath);
        }
    }
}
Note: rebuilding the PDF from an unedited binary file works fine, but the PDF rebuilt from an edited binary file is completely garbled.
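One factor that may be relevant to a raw byte search is that PDF content streams are commonly stored compressed (the `FlateDecode` filter), so the visible text often does not appear literally in the file's bytes. The sketch below illustrates this using only `java.util.zip`: it deflates a small made-up content stream and searches for a phrase before and after inflating. The sample stream, the class `FlateSearchDemo`, and its helpers are assumptions for illustration. Whether compression explains the garbled output in this particular case is not confirmed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of the original code.
final class FlateSearchDemo {

    // Made-up content stream: the same text-showing line repeated 20 times.
    static byte[] sampleStream() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("BT /F1 12 Tf 72 720 Td (Hello World) Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    // Compresses input in zlib format, the format used by FlateDecode.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses zlib data; wraps format errors in an unchecked exception.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("Truncated or unsupported compressed data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Returns the index of the first occurrence of pattern in data, or -1.
    static int indexOf(byte[] data, byte[] pattern) {
        outer:
        for (int i = 0; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] raw = sampleStream();
        byte[] packed = deflate(raw);
        byte[] phrase = "Hello World".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Index in compressed bytes: " + indexOf(packed, phrase));
        System.out.println("Index after inflating:     " + indexOf(inflate(packed), phrase));
    }
}
```

If the text in a given PDF lives inside such a compressed stream, a literal search over the file bytes would not see it, and any match it does find may belong to unrelated data.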
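Other attempts
I also tried the JavaScript feature in Adobe Acrobat Pro, which could not meet the requirement either. From the documentation, it appears that Adobe discourages this kind of automation.
Request for help
Any workable implementation idea or technical approach would be greatly appreciated.
This question comes from Stack Exchange and was asked by Ezee.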




