Garbled text when extracting from a PDF with damaged encoding using iText7: looking for a solution
First, let’s break down why you’re seeing garbled output like ǪȃǷǻȁǭǵǶǬdzȇǹǺǸǶǰǺǭdzȄǹǺǪǨ ,668(')25&216758&7,21 instead of the correct Cyrillic/English text. This usually happens when the PDF has a font encoding mismatch or a damaged font dictionary: the text is stored with an incorrect or missing mapping between character codes and Unicode values, and the default LocationTextExtractionStrategy has no way to recover the intended characters on its own.
To answer your question directly: Yes, customizing or using a modified LocationTextExtractionStrategy can help resolve this, but you’ll need to address the encoding mismatch directly. Here’s how to approach it step by step:
1. First, Inspect the PDF’s Font Encoding
Before modifying the strategy, figure out what’s wrong with the font setup. Sometimes Cyrillic text uses a custom encoding that iText doesn’t auto-detect. Add this debug code to your method to check:
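    // requires: using System; using iText.Kernel.Pdf;
    // The page's /Font resource dictionary maps resource names (F1, F2, ...) to font dictionaries
    PdfDictionary fonts = page.GetResources().GetResource(PdfName.Font);
    if (fonts != null)
    {
        foreach (PdfName key in fonts.KeySet())
        {
            PdfDictionary fontObj = fonts.GetAsDictionary(key);
            if (fontObj == null) continue;
            // BaseFont is a PDF name; Encoding may be a name or a dictionary, so print it as a generic object
            Console.WriteLine($"Font Name: {fontObj.GetAsName(PdfName.BaseFont)}");
            Console.WriteLine($"Encoding: {fontObj.Get(PdfName.Encoding)}");
            Console.WriteLine($"Has ToUnicode CMap: {fontObj.ContainsKey(PdfName.ToUnicode)}");
        }
    }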
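This tells you whether the font uses a non-standard encoding and whether it has a ToUnicode map at all (the embedded CMap stream that links character codes to Unicode values). A map that is present can still be wrong, so look at its contents if the output stays garbled.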
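To see whether a ToUnicode map that is present actually contains sensible entries, you can print its decoded contents. This is a minimal sketch, assuming the fonts sit in the page's own resource dictionary; the class name ToUnicodeDump is illustrative and not part of iText:

```csharp
using System;
using System.Text;
using iText.Kernel.Pdf;

public static class ToUnicodeDump
{
    // Prints the ToUnicode CMap of every font in the page's /Font resources
    public static void Print(PdfPage page)
    {
        PdfDictionary fonts = page.GetResources().GetResource(PdfName.Font);
        if (fonts == null) return;

        foreach (PdfName key in fonts.KeySet())
        {
            PdfDictionary fontObj = fonts.GetAsDictionary(key);
            PdfStream toUnicode = fontObj?.GetAsStream(PdfName.ToUnicode);
            if (toUnicode == null)
            {
                Console.WriteLine($"{key}: no ToUnicode CMap");
                continue;
            }
            // GetBytes() returns the decoded (decompressed) stream content;
            // a CMap is plain text, so ASCII is enough to read it
            Console.WriteLine($"{key}:");
            Console.WriteLine(Encoding.ASCII.GetString(toUnicode.GetBytes()));
        }
    }
}
```

In the output, the beginbfchar and beginbfrange sections list which character codes map to which Unicode values. If the codes used by your text are absent or point at the wrong values, the font's mapping is the source of the garbling.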
2. Implement a Custom Text Extraction Strategy
If the default strategy fails, create a subclass that corrects the extracted characters. TextRenderInfo in iText 7 does not expose a setter for its text, so the simplest hook is to post-process the assembled result in GetResultantText(). For example, if you can identify which garbled characters correspond to the correct Cyrillic letters, you can build a manual mapping:
    // requires: using System.Collections.Generic; using System.Text;
    //           using iText.Kernel.Pdf.Canvas.Parser.Listener;
    public class CyrillicFixExtractionStrategy : LocationTextExtractionStrategy
    {
        // Map garbled characters to correct Unicode characters (populate this based on your test case)
        private readonly Dictionary<char, char> _glyphCorrectionMap = new Dictionary<char, char>
        {
            { 'Ǫ', 'В' },
            { 'ȃ', 'Ы' },
            { 'Ƿ', 'П' },
            // Add more mappings as needed from your garbled text sample
        };

        // Let the base class assemble the text by location, then fix it character by character
        public override string GetResultantText()
        {
            string raw = base.GetResultantText();
            var corrected = new StringBuilder(raw.Length);
            foreach (char c in raw)
            {
                // Replace the garbled character if a mapping exists, otherwise keep it
                corrected.Append(_glyphCorrectionMap.TryGetValue(c, out char correctChar) ? correctChar : c);
            }
            return corrected.ToString();
        }
    }
Then update your original method to use this custom strategy:
    var filteredTextEventListener = new FilteredTextEventListener(new CyrillicFixExtractionStrategy(), filter);
    var result = PdfTextExtractor.GetTextFromPage(page, filteredTextEventListener);
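Filling in the map one character at a time gets tedious. The three pairs in the map (Ǫ→В, ȃ→Ы, Ƿ→П) all differ by the same amount, 0x228, which suggests the extracted characters are shifted code points. Below is a minimal sketch, assuming that constant shift holds for every Cyrillic character in your document; OffsetDecoder and its methods are illustrative names, not part of iText:

```csharp
using System;
using System.Text;

public static class OffsetDecoder
{
    // Shift between a garbled character and the intended one,
    // e.g. 'В' (U+0412) - 'Ǫ' (U+01EA) = 0x228
    public static int DeriveOffset(char garbled, char expected) => expected - garbled;

    // Applies the shift to characters inside [first, last];
    // everything outside that range is copied unchanged
    public static string Decode(string text, int offset, char first, char last)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(c >= first && c <= last ? (char)(c + offset) : c);
        }
        return sb.ToString();
    }
}
```

For uppercase А to Я the garbled range would be U+01E8 to U+0207, so a call could look like OffsetDecoder.Decode(text, OffsetDecoder.DeriveOffset('\u01EA', '\u0412'), '\u01E8', '\u0207'). The Latin part of the sample appears to follow the same pattern with a different shift (',' to 'I', '6' to 'S'), so each script range may need its own offset. Check a few more pairs from your own document before relying on this.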
3. Fallback: Force Explicit Encoding for the Font
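If the PDF uses a known single-byte encoding (like Windows-1251 for Cyrillic) that iText isn’t detecting, you can bypass the font’s own mapping and decode the raw string bytes yourself. This only makes sense for simple fonts that store one byte per character; composite (Type0) fonts use multi-byte codes and need a mapping like the one in step 2. A strategy that does this:
    // requires: using System.Collections.Generic; using System.Text;
    //           using iText.Kernel.Pdf.Canvas.Parser; using iText.Kernel.Pdf.Canvas.Parser.Data;
    //           using iText.Kernel.Pdf.Canvas.Parser.Listener;
    public class Cp1251ExtractionStrategy : ITextExtractionStrategy
    {
        // On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        // once at startup, otherwise code page 1251 is not available
        private static readonly Encoding Cp1251 = Encoding.GetEncoding(1251);
        private readonly StringBuilder _text = new StringBuilder();

        public void EventOccurred(IEventData data, EventType type)
        {
            if (type != EventType.RENDER_TEXT) return;
            var renderInfo = (TextRenderInfo)data;
            // Raw bytes of the string operand, exactly as stored in the content stream
            byte[] raw = renderInfo.GetPdfString().GetValueBytes();
            // Force Windows-1251 decoding instead of the font's own mapping
            _text.Append(Cp1251.GetString(raw));
        }

        public ICollection<EventType> GetSupportedEvents() => new List<EventType> { EventType.RENDER_TEXT };

        // Text in content-stream order (no location-based sorting)
        public string GetResultantText() => _text.ToString();
    }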
4. Extreme Case: Use Low-Level Canvas Processing
If all else fails, use PdfCanvasProcessor to handle glyphs directly. This gives you full control over every rendered character:
    // requires: using System.Collections.Generic; using System.Text; using iText.Kernel.Font;
    //           using iText.Kernel.Pdf.Canvas.Parser; using iText.Kernel.Pdf.Canvas.Parser.Data;
    //           using iText.Kernel.Pdf.Canvas.Parser.Listener;
    var handler = new CustomGlyphHandler();
    var processor = new PdfCanvasProcessor(handler);
    processor.ProcessPageContent(page);
    string text = handler.GetText();

    public class CustomGlyphHandler : IEventListener
    {
        private readonly StringBuilder _extractedText = new StringBuilder();

        public void EventOccurred(IEventData data, EventType type)
        {
            if (type == EventType.RENDER_TEXT)
            {
                var info = (TextRenderInfo)data;
                // Manually map characters to correct text using your encoding knowledge
                string corrected = CorrectGlyphs(info.GetText(), info.GetFont());
                _extractedText.Append(corrected);
            }
        }

        private string CorrectGlyphs(string glyphs, PdfFont font)
        {
            // Implement your custom correction logic here
            return glyphs.Replace('Ǫ', 'В').Replace('ȃ', 'Ы'); // Example replacement
        }

        public ICollection<EventType> GetSupportedEvents() => new List<EventType> { EventType.RENDER_TEXT };

        // Text collected so far, in content-stream order
        public string GetText() => _extractedText.ToString();
    }
Key Takeaway
The default LocationTextExtractionStrategy works for well-formed PDFs, but corrupted or non-standard encoding requires targeted customizations. Start by inspecting the font’s encoding to understand the mismatch, then build a correction map or force the correct encoding in a modified strategy.
The question comes from Stack Exchange; asked by DelyaHF.




