C# PDF-to-Excel Development: iTextSharp Code Walkthrough and Questions
Hey Alan, let's break down this iTextSharp code step by step to verify your understanding and clear up the parts you're unsure about. First, here's the code for reference:
StringBuilder text = new StringBuilder(); // Used to store the PDF content?
PdfReader pdfReader = new PdfReader(myPath); // Is this how iTextSharp reads a PDF?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // Loop over all pages of the PDF?
{
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // Is this an encoding conversion?
    text.Append(currentText); // Does this concatenate the full PDF content?
}
pdfReader.Close(); // Does this close the PdfReader?
Your Existing Understandings (All Correct!)
Let's confirm each of your notes:
StringBuilder text = new StringBuilder();: Exactly right. StringBuilder is used here because it's far more efficient than a regular string for frequent text concatenation (which happens every time we add a page's content).

PdfReader pdfReader = new PdfReader(myPath);: Correct. PdfReader is iTextSharp's core class for loading and accessing the contents of a PDF file from a path, stream, or byte array.

for (int page = 1; ...): Yep, this loop iterates through every page in the PDF. Note that iTextSharp uses 1-based indexing for pages, so we start at 1 instead of 0.

text.Append(currentText);: Right again. This adds each page's extracted text to the StringBuilder to build the full content of the PDF.

pdfReader.Close();: Correct. This releases the file resources held by the PdfReader. As a side note, a using block is a safer practice here, because it automatically calls Close() even if an error occurs:

using (PdfReader pdfReader = new PdfReader(myPath))
{
    // Your text extraction logic here
}
Unclear Code Explained
Now let's unpack the parts you marked with "?":
1. ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
This is a text extraction strategy: it defines how iTextSharp organizes the text it pulls from the PDF.
LocationTextExtractionStrategy is one of the most commonly used strategies because it tries to return text in the visual order you'd read it (top to bottom, left to right). This matters because PDFs store text in the order it was drawn, which is not necessarily the order you see on the page, so other strategies like SimpleTextExtractionStrategy might return text in a jumbled order if the PDF was created with a complex layout.
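If you want to see the difference for yourself, here's a minimal sketch that extracts the same page with both strategies and prints the results side by side. It assumes iTextSharp 5.x is referenced; the class name StrategyComparison, the method PrintBoth, and its parameters are made up for illustration. On simple PDFs the two outputs may well be identical.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper (not part of iTextSharp) for comparing two strategies.
public static class StrategyComparison
{
    // Prints the text of one page as produced by each extraction strategy.
    public static void PrintBoth(string pdfPath, int pageNumber)
    {
        using (PdfReader reader = new PdfReader(pdfPath))
        {
            // A fresh strategy instance is created for every call,
            // just like in the loop from the question.
            string byLocation = PdfTextExtractor.GetTextFromPage(
                reader, pageNumber, new LocationTextExtractionStrategy());
            string byDrawOrder = PdfTextExtractor.GetTextFromPage(
                reader, pageNumber, new SimpleTextExtractionStrategy());

            Console.WriteLine("--- LocationTextExtractionStrategy ---");
            Console.WriteLine(byLocation);
            Console.WriteLine("--- SimpleTextExtractionStrategy ---");
            Console.WriteLine(byDrawOrder);
        }
    }
}
```

Calling StrategyComparison.PrintBoth(myPath, 1) on a multi-column or table-heavy page is usually the quickest way to judge which strategy suits your document.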
2. string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
PdfTextExtractor is iTextSharp's dedicated utility class for pulling text from PDF pages.
The GetTextFromPage method does the heavy lifting: it takes your loaded PdfReader, the page number you want to extract, and the strategy you defined, then returns the text from that page formatted according to the strategy's rules. In short, this line is where the actual text extraction happens.
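Putting the pieces so far together, the whole extraction could be wrapped in one small helper. This is only a sketch under a few assumptions: iTextSharp 5.x is referenced, the names PdfTextHelper and ExtractAllText are invented for this example, and the encoding line from your code is left out on purpose.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper (not part of iTextSharp) that wraps the loop from the question.
public static class PdfTextHelper
{
    // Returns the text of every page in the PDF, one page after another.
    public static string ExtractAllText(string pdfPath)
    {
        StringBuilder text = new StringBuilder();

        // The using block closes the reader even if extraction throws.
        using (PdfReader reader = new PdfReader(pdfPath))
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a new strategy per page so text from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string pageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                // AppendLine (instead of Append) keeps a line break between pages.
                text.AppendLine(pageText);
            }
        }

        return text.ToString();
    }
}
```

You would call it as string content = PdfTextHelper.ExtractAllText(myPath);. Using AppendLine rather than Append is a small design choice so the last line of one page doesn't run into the first line of the next.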
3. The Encoding Conversion Line
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
You're right that this is an encoding conversion, but it's a bit convoluted. Let's break down what it does, from the inside out:

Encoding.UTF8.GetBytes(currentText): Converts the extracted string to a UTF-8 byte array.

Encoding.Convert(...): Treats that byte array as if it were written in the system's default encoding (e.g., GBK on Chinese Windows) and converts it to UTF-8. Note that the bytes are actually UTF-8 already, so the "from" encoding here doesn't match the data.

Encoding.UTF8.GetString(...): Decodes the converted byte array back into a string.
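The intent here is likely to fix garbled text, but this approach can't really do that. GetTextFromPage returns an ordinary .NET string, which is already Unicode (UTF-16 in memory), so there are no mis-encoded bytes left to repair at this point. Because the Encoding.Convert call reinterprets UTF-8 bytes as the default encoding, the line typically leaves pure ASCII text unchanged but can corrupt non-ASCII characters whenever Encoding.Default is not UTF-8 (for example, GBK on Chinese Windows under .NET Framework). It looks like a workaround copied from older samples. In most scenarios you can omit it; if the extracted text is still garbled without it, the cause is more likely the PDF's fonts or character mappings than a string encoding problem.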
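To make the effect of that round trip concrete, here's a small standalone sketch that needs no PDF and no iTextSharp. It repeats the same three calls, but takes the "default" encoding as a parameter so the result doesn't depend on your machine. The names EncodingRoundTripDemo and RoundTrip are invented for this example, and ISO-8859-1 merely stands in for a non-UTF-8 system default.

```csharp
using System;
using System.Text;

// Hypothetical demo class; it reproduces the conversion line from the question.
public static class EncodingRoundTripDemo
{
    // Same three calls as the original line, with Encoding.Default
    // replaced by an explicit parameter.
    public static string RoundTrip(string s, Encoding assumedDefault)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(s);
        // The UTF-8 bytes are (wrongly) interpreted as assumedDefault here.
        byte[] converted = Encoding.Convert(assumedDefault, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        // Stand-in for a non-UTF-8 default encoding.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // ASCII survives the round trip.
        Console.WriteLine(RoundTrip("abc", latin1) == "abc");       // True

        // A non-ASCII character (U+00E9) comes back as two wrong characters.
        Console.WriteLine(RoundTrip("\u00E9", latin1) == "\u00E9"); // False

        // When the "default" is UTF-8, the whole line is a no-op.
        Console.WriteLine(RoundTrip("\u00E9", Encoding.UTF8) == "\u00E9"); // True
    }
}
```

So the line either does nothing or damages the text, which is why removing it is usually the safer choice.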
The question comes from Stack Exchange; it was asked by alan13.




