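How can I replace text in a PDF with Java while preserving the original formatting and structure?

Automating PDF text replacement: trouble preserving the original formatting, and a request for help

Requirements overview

I am using Java to build an automated text-replacement feature for PDF documents. The core goal is to fully preserve the PDF's original formatting and structure. I have tried several approaches, none of which produced the expected result, and I am looking for workable implementation advice.

Approaches tried so far

Approach 1: extract, edit, and re-insert text with PDFBox

Text extraction code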

package PDFbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java ExtractText <pdfFilePath> <outputTextFilePath>");
            return;
        }

        String pdfFilePath = args[0];
        String outputTextFilePath = args[1];

        try (PDDocument document = PDDocument.load(new File(pdfFilePath));
             FileWriter writer = new FileWriter(outputTextFilePath)) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            writer.write(text);
            System.out.println("Text content extracted and saved to " + outputTextFilePath);
        }
    }
}

Editing step

The extracted text is edited by hand in the text file.

Text re-insertion code

package PDFbox;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class InsertText {
    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("Usage: java InsertText <pdfFilePath> <textFilePath> <outputPdfFilePath>");
            return;
        }

        String pdfFilePath = args[0];
        String textFilePath = args[1];
        String outputPdfFilePath = args[2];

        // Load the text content
        String editedText = new String(Files.readAllBytes(Paths.get(textFilePath)));

        try (PDDocument document = PDDocument.load(new File(pdfFilePath))) {
            PDPage page = document.getPage(0);

            // Modify the page content
            try (PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                contentStream.beginText();
                contentStream.setFont(PDType1Font.HELVETICA, 12);
                contentStream.newLineAtOffset(25, 750);

                // Split on any line terminator (\n, \r\n, \r); a stray \r would make showText fail
                String[] lines = editedText.split("\\R");
                for (String line : lines) {
                    contentStream.showText(line);
                    contentStream.newLineAtOffset(0, -15); // Move to the next line
                }

                contentStream.endText();
            }

            // Save the updated PDF
            document.save(outputPdfFilePath);
            System.out.println("Edited text inserted and PDF saved to " + outputPdfFilePath);
        }
    }
}

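Problems encountered

  • The PDF's formatting and layout change noticeably after editing
  • Text alignment is off, fonts do not match, and pagination is broken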
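
The pagination problem follows from how the insertion code places text: it starts at y = 750 on the first page and moves down 15 points per line, without ever starting a new page. The sketch below, in plain Java with no PDF library, works out how many lines fit before the text runs off the page and splits the edited lines into page-sized chunks. The 50-point bottom margin and the names PageChunker, linesThatFit, and chunk are assumptions made for illustration; they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits lines into page-sized chunks.
final class PageChunker {

    // Number of baselines that fit between startY and bottomMargin (inclusive)
    // when each line moves down by `leading` points.
    static int linesThatFit(float startY, float bottomMargin, float leading) {
        if (leading <= 0 || startY < bottomMargin) {
            throw new IllegalArgumentException(
                    "leading must be positive and startY must not be below the margin");
        }
        return (int) Math.floor((startY - bottomMargin) / leading) + 1;
    }

    // Groups the lines so that each chunk holds at most linesPerPage entries.
    static List<List<String>> chunk(List<String> lines, int linesPerPage) {
        if (linesPerPage <= 0) {
            throw new IllegalArgumentException("linesPerPage must be positive");
        }
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerPage) {
            int end = Math.min(i + linesPerPage, lines.size());
            pages.add(new ArrayList<>(lines.subList(i, end)));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Values from the insertion code above, plus an assumed 50 pt bottom margin.
        int perPage = linesThatFit(750f, 50f, 15f);
        System.out.println("Lines per page: " + perPage);
    }
}
```

Each chunk would then go onto its own page. This only addresses overflow; it does nothing for the font and alignment mismatches, which come from discarding the original positioning data.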

Approach 2: extract, edit, and re-insert text with iText

Text extraction code

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ExtractTextUsingIText {
    public static void main(String[] args) {
        try {
            PdfReader reader = new PdfReader("path/to/pdf");
            String text = PdfTextExtractor.getTextFromPage(reader, 1);
            System.out.println(text);
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Editing step

The extracted text is edited by hand in the text file.

Text re-insertion code

import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

import java.io.FileOutputStream;
import java.io.IOException;

public class InsertTextUsingIText {
    public static void main(String[] args) {
        Document document = new Document();
        try {
            PdfWriter.getInstance(document, new FileOutputStream("path/to/edited_pdf"));
            document.open();
            document.add(new Paragraph("Edited text goes here"));
            document.close();
            System.out.println("Edited text inserted and PDF saved");
        } catch (DocumentException | IOException e) {
            e.printStackTrace();
        }
    }
}

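Problems encountered

  • The original structure of the PDF is completely destroyed after editing
  • All original formatting and layout is lost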

Approach 3: editing the PDF's binary data directly

Code to export the PDF as a binary file

package PDFbox;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ExportPDFAsBinary {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java ExportPDFAsBinary <sourcePdfPath> <outputBinaryFilePath>");
            return;
        }

        String sourcePdfPath = args[0];
        String outputBinaryFilePath = args[1];

        try (FileInputStream fis = new FileInputStream(new File(sourcePdfPath));
             FileOutputStream fos = new FileOutputStream(new File(outputBinaryFilePath))) {
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                fos.write(buffer, 0, bytesRead);
            }

            System.out.println("PDF content exported as binary data to " + outputBinaryFilePath);
        }
    }
}

Note: the output file must use the .binary extension. This step runs without any problems.
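
Because the export step only copies bytes, the .binary file is the same PDF under a different name. A minimal sketch of the same step using java.nio.file.Files.copy, followed by a check that source and copy are byte-identical. The class name BinaryCopyCheck and the method copyAndVerify are made up for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper: copies a file and confirms the copy is byte-identical.
final class BinaryCopyCheck {

    static boolean copyAndVerify(Path source, Path target) throws IOException {
        // Overwrite the target if it already exists, as FileOutputStream would.
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(target));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java BinaryCopyCheck <sourcePdfPath> <outputBinaryFilePath>");
            return;
        }
        boolean same = copyAndVerify(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Copy is byte-identical: " + same);
    }
}
```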

Code to edit the binary data

package PDFbox;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EditBinaryPDF {
    public static void main(String[] args) throws IOException {
        if (args.length < 4) {
            System.err.println("Usage: java EditBinaryPDF <binaryFilePath> <outputBinaryFilePath> <searchString> <replaceString>");
            return;
        }

        String binaryFilePath = args[0];
        String outputBinaryFilePath = args[1];
        String searchString = args[2];
        String replaceString = args[3];

        // Ensure search and replace strings are of the same length
        if (searchString.length() != replaceString.length()) {
            System.err.println("Search and replace strings must be of the same length");
            return;
        }

        // Read the binary file into a byte array
        byte[] binaryData = readBinaryFile(binaryFilePath);

        // Convert search and replace strings to byte arrays
        byte[] searchBytes = searchString.getBytes(StandardCharsets.ISO_8859_1);
        byte[] replaceBytes = replaceString.getBytes(StandardCharsets.ISO_8859_1);

        // Edit the binary data
        binaryData = replaceTextInBinaryData(binaryData, searchBytes, replaceBytes);

        // Save the edited binary data to the output file
        try (FileOutputStream fos = new FileOutputStream(new File(outputBinaryFilePath))) {
            fos.write(binaryData);
        }

        System.out.println("Edited binary data saved to " + outputBinaryFilePath);
    }

    private static byte[] readBinaryFile(String filePath) throws IOException {
        // Read the whole file; a single InputStream.read call is not guaranteed to fill the array
        return java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(filePath));
    }

    private static byte[] replaceTextInBinaryData(byte[] binaryData, byte[] searchBytes, byte[] replaceBytes) {
        for (int i = 0; i <= binaryData.length - searchBytes.length; i++) {
            boolean match = true;
            for (int j = 0; j < searchBytes.length; j++) {
                if (binaryData[i + j] != searchBytes[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                System.arraycopy(replaceBytes, 0, binaryData, i, replaceBytes.length);
                i += searchBytes.length - 1;  // Move past the replaced text
            }
        }
        return binaryData;
    }
}

Note: the code requires the replacement text to be exactly as long as the original text, which is a severe limitation.
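
The equal-length constraint exists because a PDF records the byte offsets of its objects in a cross-reference table and the byte lengths of its streams, so adding or removing bytes invalidates them. When the new text is shorter than the old, one workaround is to pad it with trailing spaces before calling the replacement routine. The sketch below does that. ReplacementPadder and padToLength are hypothetical names, and the approach assumes that trailing spaces are acceptable in the rendered output and that both strings are ISO-8859-1 encodable, so that one character maps to one byte. A longer replacement cannot be handled this way.

```java
// Hypothetical helper: pads a shorter replacement so it matches the search string's length.
final class ReplacementPadder {

    static String padToLength(String replacement, int targetLength) {
        if (replacement.length() > targetLength) {
            throw new IllegalArgumentException(
                    "Replacement is longer than the text it replaces; padding cannot help");
        }
        StringBuilder padded = new StringBuilder(replacement);
        while (padded.length() < targetLength) {
            padded.append(' '); // assumption: trailing spaces are harmless in the output
        }
        return padded.toString();
    }

    public static void main(String[] args) {
        String search = "Invoice";
        String replacement = padToLength("Bill", search.length());
        System.out.println("[" + replacement + "] has length " + replacement.length());
    }
}
```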

Code to rebuild the PDF from the binary file

package PDFbox;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class RecreatePDFFromBinary {
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java RecreatePDFFromBinary <inputBinaryFilePath> <outputPdfPath>");
            return;
        }

        String inputBinaryFilePath = args[0];
        String outputPdfPath = args[1];

        try (FileInputStream fis = new FileInputStream(new File(inputBinaryFilePath));
             FileOutputStream fos = new FileOutputStream(new File(outputPdfPath))) {

            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                fos.write(buffer, 0, bytesRead);
            }

            System.out.println("PDF recreated from binary data at " + outputPdfPath);
        }
    }
}

Note: rebuilding a PDF from the unedited binary file works fine, but a PDF rebuilt from the edited binary file is completely garbled.
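
One plausible cause, which nothing in this post confirms for the PDF in question, is that page content in a PDF is usually stored in compressed streams (commonly with the FlateDecode filter). Visible text then rarely appears as plain bytes in the file, and any bytes that do match may belong to something other than page text. The sketch below uses only java.util.zip to show the effect: a string that is present in the content cannot be found in its compressed form. The sample content stream and the names CompressedSearchDemo, deflate, and indexOf are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: searches for plain text before and after zlib/deflate compression.
final class CompressedSearchDemo {

    // Compresses the input with zlib/deflate, the algorithm behind FlateDecode.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int written = deflater.deflate(buffer);
            out.write(buffer, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Returns the index of the first occurrence of needle in haystack, or -1 if absent.
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A made-up content stream fragment, not taken from a real document.
        String stream = "BT /F1 12 Tf 72 720 Td (Invoice total: 100 USD) Tj ET\n"
                + "BT /F1 12 Tf 72 700 Td (Invoice total: 250 USD) Tj ET\n";
        byte[] raw = stream.getBytes(StandardCharsets.ISO_8859_1);
        byte[] needle = "Invoice total".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Found in raw content: " + (indexOf(raw, needle) >= 0));
        System.out.println("Found in compressed content: " + (indexOf(deflate(raw), needle) >= 0));
    }
}
```

If this is what is happening, a byte-level search would have to decompress each stream first, and the edited stream would then need to be recompressed with its length entry updated, which is the kind of bookkeeping a PDF library normally takes care of.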


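Other attempts

I also tried the JavaScript feature in Adobe Acrobat Pro, which could not do what I need either. From the documentation, it appears that Adobe discourages this kind of automation.

Request for help

Any workable implementation idea or technical approach would be much appreciated.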


This question originates from Stack Exchange; asked by Ezee.
