Converting PDF to Excel in C#: iTextSharp Code Walkthrough and Questions

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-15

Checking my understanding of iTextSharp PDF-reading code and explaining its logic

Hey Alan, let's break down this iTextSharp code step by step to verify your understanding and clear up the parts you're unsure about. First, here's the code for reference:

```
StringBuilder text = new StringBuilder(); // Stores the PDF content?
PdfReader pdfReader = new PdfReader(myPath); // Is this how iTextSharp reads a PDF?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // Loops over every page of the PDF?
{
    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // Encoding conversion?
    text.Append(currentText); // Appends each page to build the full PDF content?
}
pdfReader.Close(); // Closes the PdfReader?
```

Your Existing Understandings (All Correct!)

Let's confirm each of your notes:

- `StringBuilder text = new StringBuilder();`: Exactly right. StringBuilder is used here because .NET strings are immutable, so every `+` concatenation allocates a new string. StringBuilder appends into a growable buffer instead, which is much cheaper when you add one page's content after another.
- `PdfReader pdfReader = new PdfReader(myPath);`: Correct. PdfReader is iTextSharp's core class for loading a PDF and accessing its contents. It can be constructed from a path, a stream, or a byte array.
- `for (int page = 1; ...)`: Yep, this loop iterates through every page in the PDF. Note that iTextSharp uses 1-based indexing for pages, so we start at 1 instead of 0.
- `text.Append(currentText);`: Right again. This adds each page's extracted text to the StringBuilder to build the full content of the PDF.
- `pdfReader.Close();`: Correct. This releases the file resources held by the PdfReader. As a side note, a using block is a safer practice here, because it disposes the reader (which closes it) even if an error occurs:
```
using (PdfReader pdfReader = new PdfReader(myPath))
{
    // Your text extraction logic here
}
```

Unclear Code Explained

Now let's unpack the parts you marked with "?":

1. `ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();`

This is a text extraction strategy—it defines how iTextSharp organizes the text it pulls from the PDF.

LocationTextExtractionStrategy is one of the most commonly used strategies because it extracts text in the visual order you'd read it (top to bottom, left to right). This is important because PDFs store text in the order it was drawn (not necessarily the order you see), so other strategies like SimpleTextExtractionStrategy might return text in a jumbled order if the PDF was created with complex layout.
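
To make the difference between draw order and reading order concrete, here is a small conceptual sketch in plain C#. It does not use iTextSharp and is not how the library is implemented internally. `TextChunk` and `ReadingOrderDemo` are hypothetical names made up for this illustration, and the real strategy also has to tolerate small baseline differences and decide where to insert spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; this is not an iTextSharp type.
public sealed class TextChunk
{
    public string Text { get; }
    public float X { get; }   // distance from the left edge of the page
    public float Y { get; }   // distance from the bottom edge (PDF origin is bottom-left)

    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ReadingOrderDemo
{
    // Joins chunks in the order they were drawn, which is how a PDF stores them.
    public static string InDrawOrder(IEnumerable<TextChunk> chunks)
    {
        return string.Join(" ", chunks.Select(c => c.Text));
    }

    // Sorts top to bottom (larger Y first), then left to right (smaller X first).
    public static string InReadingOrder(IEnumerable<TextChunk> chunks)
    {
        return string.Join(" ", chunks
            .OrderByDescending(c => c.Y)
            .ThenBy(c => c.X)
            .Select(c => c.Text));
    }

    public static void Main()
    {
        // A page whose producer drew the bottom row first and the header row last.
        var drawn = new List<TextChunk>
        {
            new TextChunk("Total", 50f, 100f),
            new TextChunk("42", 200f, 100f),
            new TextChunk("Invoice", 50f, 700f),
            new TextChunk("2026", 200f, 700f),
        };

        Console.WriteLine(InDrawOrder(drawn));    // Total 42 Invoice 2026
        Console.WriteLine(InReadingOrder(drawn)); // Invoice 2026 Total 42
    }
}
```

For a PDF-to-Excel conversion this ordering matters a lot: rows and columns only line up if the text comes back in the order it appears on the page.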

2. `string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);`

PdfTextExtractor is iTextSharp's dedicated utility class for pulling text from PDF pages.

The GetTextFromPage method does the heavy lifting: it takes your loaded PdfReader, the page number you want to extract, and the strategy you defined, then returns the text from that page formatted according to the strategy's rules. In short, this line is where the actual text extraction happens.

3. The Encoding Conversion Line

`currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));`

You're right—this is an encoding conversion, but it's a bit convoluted. Let's break down what it's trying to do:

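- `Encoding.UTF8.GetBytes(currentText)`: Encodes the extracted string as a UTF-8 byte array.
- `Encoding.Convert(Encoding.Default, Encoding.UTF8, ...)`: Treats that byte array as if it were in the system's default encoding (e.g., GBK on Chinese Windows under .NET Framework) and re-encodes it as UTF-8. The bytes are actually UTF-8 already, so any non-ASCII character is misread at this step.
- `Encoding.UTF8.GetString(...)`: Decodes the converted byte array back into a string.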
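
If you want to see what this round trip does to real characters, the sketch below repeats the same three calls in plain C#, with one change: the source encoding is passed in explicitly instead of coming from `Encoding.Default`. Using ISO-8859-1 as the stand-in for a non-UTF-8 system default is an assumption made so the result is the same on every machine, and `RoundTripDemo` and `RoundTrip` are hypothetical names for this illustration.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Same three steps as the line in question, with the "source" encoding
    // supplied by the caller rather than taken from Encoding.Default.
    public static string RoundTrip(string input, Encoding assumedSource)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        byte[] converted = Encoding.Convert(assumedSource, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        // ISO-8859-1 stands in for a legacy, non-UTF-8 default encoding (an assumption).
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // ASCII text survives, because ASCII bytes mean the same thing in both encodings.
        Console.WriteLine(RoundTrip("PDF", latin1) == "PDF");                     // True

        // U+00E9 (e with acute accent) is two bytes in UTF-8; read as ISO-8859-1
        // they become two separate characters, U+00C3 and U+00A9.
        Console.WriteLine(RoundTrip("\u00E9", latin1) == "\u00C3\u00A9");         // True

        // When the assumed source is UTF-8 itself, the whole thing is a no-op.
        Console.WriteLine(RoundTrip("\u00E9", Encoding.UTF8) == "\u00E9");        // True
    }
}
```

So depending on the default encoding, the line either changes nothing or damages the non-ASCII text it was meant to repair.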

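The intent here is likely to fix garbled text, but this approach cannot do that. GetTextFromPage returns an ordinary .NET string, which is already Unicode, so there are no mis-encoded bytes left to repair. When Encoding.Default is UTF-8 (always the case on .NET Core and .NET 5+), the line is a no-op. When it is a legacy code page such as GBK, the line leaves ASCII alone and corrupts every other character. It was probably copied from an old workaround. You can drop it. If the extracted text still looks garbled afterwards, the cause is more likely the PDF's fonts (for example, a missing ToUnicode mapping) than anything this line could address.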
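
Putting the pieces together, a cleaned-up version of the snippet might look like the sketch below. It assumes iTextSharp 5.x is referenced in the project. The `PdfTextReader` class and `ExtractAllText` method are hypothetical names, and using `AppendLine` instead of `Append` is a small deviation from the original so that pages do not run into each other.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Returns the text of every page in the PDF at 'path',
    // with a line break appended after each page.
    public static string ExtractAllText(string path)
    {
        var text = new StringBuilder();

        // The using block disposes the reader even if extraction throws.
        using (var pdfReader = new PdfReader(path))
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // A fresh strategy per page: the strategy collects text chunks as it
                // goes, so reusing one instance would carry earlier pages along.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }

        return text.ToString();
    }
}
```

Calling `PdfTextReader.ExtractAllText(myPath)` then gives you the whole document as a single string, with no encoding step in between.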

The question in this post comes from Stack Exchange, where it was asked by alan13.