iTextSharp fails to extract PDF text: "Object reference not set to an instance of an object" error

Hey there, let's figure out why you're getting that annoying NullReferenceException when trying to extract text from a PDF page with iTextSharp. I've run into this exact issue before, so here are the most common fixes:

1. Check if your PdfReader instance is null

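The most obvious culprit is that pdfReader was never initialized. Note that new PdfReader(path) does not return null on failure: it throws if the file path is wrong, the PDF is corrupted, or your app doesn't have permission to read the file. A null pdfReader therefore usually means the variable was declared elsewhere (as a field, for example) and the constructor either never ran or failed inside a try/catch that swallowed the exception. Before calling GetTextFromPage, add a quick check: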

if (pdfReader == null)
{
    // Handle the error: log it, show a user message, whatever fits your app
    Console.WriteLine("Failed to initialize PdfReader. Double-check the file path and permissions.");
    continue; // assumes this check sits inside a loop (e.g. over files); use return otherwise
}

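Also, wrap PdfReader in a using statement so the underlying file is released as soon as you're done with it. That won't cure a null reference by itself, but it keeps the reader's lifetime obvious and avoids locked files and resource leaks.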
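
If it helps, the opening step can be isolated in a small helper so the real cause gets logged where it happens instead of showing up later as a null reference. This is only a sketch: PdfOpen and TryOpen are hypothetical names, and it assumes that iTextSharp 5 reports invalid-PDF and bad-password failures as exceptions derived from IOException.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;

static class PdfOpen
{
    // Hypothetical helper: returns an open reader, or null after logging why opening failed.
    public static PdfReader TryOpen(string path)
    {
        if (!File.Exists(path))
        {
            Console.WriteLine("File not found: " + path);
            return null;
        }

        try
        {
            return new PdfReader(path);
        }
        catch (IOException ex) // assumption: also covers corrupted or password-protected files
        {
            Console.WriteLine("Could not open PDF: " + ex.Message);
            return null;
        }
        catch (UnauthorizedAccessException ex)
        {
            Console.WriteLine("No permission to read PDF: " + ex.Message);
            return null;
        }
    }
}
```

The caller still owns the reader, so a non-null result belongs in a using statement.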

2. Verify your page number i is valid

iTextSharp uses 1-based indexing for pages. If your loop starts at 0, or runs past the last page of the PDF, the page lookup finds nothing and the extractor ends up dereferencing null. Always validate the page range first:

int totalPages = pdfReader.NumberOfPages;
for (int i = 1; i <= totalPages; i++) // Start at 1, end at totalPages
{
    // Your extraction code goes here
}
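
When the page number comes from outside the loop (user input, a config value), the same range rule can be checked explicitly before extracting. A minimal sketch, assuming iTextSharp 5; SafeExtract, IsValidPage and GetTextOrNull are hypothetical names introduced only for illustration:

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class SafeExtract
{
    // Hypothetical helper: true only for a valid 1-based page number.
    public static bool IsValidPage(int page, int totalPages)
    {
        return page >= 1 && page <= totalPages;
    }

    // Hypothetical helper: returns null instead of failing on a missing reader
    // or an out-of-range page.
    public static string GetTextOrNull(PdfReader reader, int page)
    {
        if (reader == null || !IsValidPage(page, reader.NumberOfPages))
        {
            return null;
        }
        return PdfTextExtractor.GetTextFromPage(reader, page);
    }
}
```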

3. Check if the PDF is encrypted or non-textual

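  • Encrypted PDFs: If the PDF is password-protected, PdfReader needs the password when the file is opened, so pass it to the constructor. If the file requires a password and none (or the wrong one) is supplied, the constructor throws instead of handing back a usable reader:
    // "your-password-here" is a placeholder; this needs using System.Text;
    byte[] password = Encoding.ASCII.GetBytes("your-password-here");
    using (PdfReader pdfReader = new PdfReader("your-document-path.pdf", password))
    {
        // IsEncrypted() only reports whether the file was encrypted; the content is already readable here
        Console.WriteLine("Encrypted: {0}", pdfReader.IsEncrypted());
    }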
    
  • Scanned/Image-based PDFs: If the PDF is just a scan of a document (no actual text layers), PdfTextExtractor can't pull text from it. iTextSharp doesn't handle OCR—you'll need a library like Tesseract for that scenario.
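
One rough way to spot an image-only document is to count the pages that yield no text at all. This is a heuristic sketch rather than a reliable detector, and ScanCheck, LooksImageOnly and CountEmptyPages are hypothetical names:

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class ScanCheck
{
    // Hypothetical helper: a page with no extractable text is treated as image-only.
    public static bool LooksImageOnly(string pageText)
    {
        return string.IsNullOrWhiteSpace(pageText);
    }

    // Hypothetical helper: counts the pages from which no text could be extracted.
    public static int CountEmptyPages(PdfReader reader)
    {
        int empty = 0;
        for (int i = 1; i <= reader.NumberOfPages; i++) // pages are 1-based
        {
            if (LooksImageOnly(PdfTextExtractor.GetTextFromPage(reader, i)))
            {
                empty++;
            }
        }
        return empty;
    }
}
```

If every page comes back empty, the file is probably a scan and belongs in an OCR pipeline.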

4. Update your iTextSharp version

Old versions of iTextSharp have bugs with certain PDF structures. If you're using an outdated package, upgrading to the newest iTextSharp 5.x release on NuGet might resolve the null reference issue. Note that iTextSharp 5 is no longer actively developed: its successor is iText 7 for .NET, which has a different API, so moving to it means porting your code rather than just bumping a version number.
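
To see which build your app actually loads at run time (as opposed to what the project file says), the assembly version can be read through standard .NET reflection. A small sketch; VersionCheck is a hypothetical name:

```csharp
using System;
using iTextSharp.text.pdf;

static class VersionCheck
{
    // Hypothetical helper: version of whichever assembly defines PdfReader at run time.
    public static Version LoadedITextSharpVersion()
    {
        return typeof(PdfReader).Assembly.GetName().Version;
    }
}
```

Calling Console.WriteLine(VersionCheck.LoadedITextSharpVersion()) at startup makes it easy to compare against the latest release listed on NuGet.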

Here's a revised code snippet incorporating all these checks:

using (PdfReader pdfReader = new PdfReader("your-document-path.pdf"))
{
    // Encrypted PDFs: there is no separate decrypt step. Supply the password when constructing
    // the reader instead: new PdfReader(path, Encoding.ASCII.GetBytes("your-password-here"))

    int totalPages = pdfReader.NumberOfPages;
    for (int i = 1; i <= totalPages; i++)
    {
        // No null check needed here: inside the using block pdfReader is always a constructed instance

        try
        {
            string extractedText = PdfTextExtractor.GetTextFromPage(pdfReader, i);
            // Process the extracted text here
            Console.WriteLine("Page {0} text:\n{1}", i, extractedText);
        }
        catch (NullReferenceException ex)
        {
            Console.WriteLine("Failed to extract text from page {0}: {1}", i, ex.Message);
            // If this persists, the page might be non-textual or corrupted
        }
    }
}

This question originally comes from Stack Exchange; asked by Felipe Gregorio Ercolin.
