How do you extract the target page numbers of links inside a PDF with Python?

阿华AIGC实验室

2026-5-8

Surface-level link extraction may show only a file path, but under the hood every internal PDF link is tied to a specific destination page. Here is how to pull those page numbers in Python with two popular libraries.

Using PyPDF2

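PyPDF2 is a widely used library for PDF manipulation, and it can read the underlying annotation data to get link targets. Note that PyPDF2 is no longer developed; the project continues under the name pypdf, where the PdfReader calls used below are also available.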

First, install the library:

pip install PyPDF2

Then use this script to extract links and their target pages:

from PyPDF2 import PdfReader

def extract_internal_link_pages(pdf_path, password=None):
    # Initialize reader (handle encrypted PDFs if needed)
    reader = PdfReader(pdf_path, password=password)
    link_details = []

    # Loop through each page (0-indexed internally)
    for source_page_idx, page in enumerate(reader.pages):
        # Check if the page has annotations
        if "/Annots" in page:
            for annot_ref in page["/Annots"]:
                annot = annot_ref.get_object()
                # Filter for link annotations
                if annot.get("/Subtype") == "/Link":
                    # The target is stored in /Dest, or in /D of a /GoTo
                    # action (/GoToR points to another file, /URI to the web)
                    action = annot["/A"] if "/A" in annot else {}
                    if "/Dest" in annot:
                        dest = annot["/Dest"]
                    elif action.get("/S") == "/GoTo" and "/D" in action:
                        dest = action["/D"]
                    else:
                        continue

                    # Case 1: Explicit destination array
                    if isinstance(dest, list) and len(dest) > 0:
                        # Dest format: [page_object, /XYZ, left, top, zoom]
                        target_page_idx = reader.get_page_number(dest[0])
                        link_type = "Direct internal link"
                    # Case 2: Named destination (common in longer PDFs)
                    elif isinstance(dest, str) and dest in reader.named_destinations:
                        named_dest = reader.named_destinations[dest]
                        target_page_idx = reader.get_destination_page_number(named_dest)
                        link_type = "Named destination link"
                    else:
                        continue

                    # Skip targets that could not be matched to a page
                    if target_page_idx is None or target_page_idx < 0:
                        continue
                    link_details.append({
                        "source_page": source_page_idx + 1,  # Convert to 1-indexed
                        "target_page": target_page_idx + 1,
                        "link_type": link_type
                    })

    return link_details

# Example usage
pdf_file = "your_document.pdf"
links = extract_internal_link_pages(pdf_file)
for link in links:
    print(f"Link on page {link['source_page']} → Target page {link['target_page']} ({link['link_type']})")
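
The "Dest format" comment in the script shows only the /XYZ layout. As a rough sketch of how the other layouts defined by the PDF specification could be told apart, the helper below splits an explicit destination array into its page reference, fit mode, and view parameters. The names describe_destination and FIT_PARAMS are illustrative and not part of PyPDF2; the function only assumes a list whose second element prints as a name such as /Fit.

```python
# Parameter names for each fit mode listed in the PDF specification
FIT_PARAMS = {
    "/XYZ": ("left", "top", "zoom"),
    "/Fit": (),
    "/FitH": ("top",),
    "/FitV": ("left",),
    "/FitR": ("left", "bottom", "right", "top"),
    "/FitB": (),
    "/FitBH": ("top",),
    "/FitBV": ("left",),
}


def describe_destination(dest):
    """Split an explicit destination array into page, fit mode, and view.

    Hypothetical helper: returns None for anything that is not an
    explicit destination array (for example a named destination).
    """
    if not isinstance(dest, (list, tuple)) or len(dest) < 2:
        return None
    page_ref, fit = dest[0], str(dest[1])
    names = FIT_PARAMS.get(fit)
    if names is None:
        return None
    values = list(dest[2:2 + len(names)])
    # Pad parameters that the PDF writer left out with None
    values += [None] * (len(names) - len(values))
    return {"page_ref": page_ref, "fit": fit, "view": dict(zip(names, values))}
```

In every fit mode the first element is the page, so the page lookup in the script does not change; only the view parameters differ.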

Using pdfplumber

pdfplumber is another option. It is built on pdfminer.six, so the raw annotation dictionaries behind each link are available from Python.

Install it first:

pip install pdfplumber

Here’s a script for extracting link targets with pdfplumber:

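import pdfplumber
from pdfminer.pdfdocument import PDFDestinationNotFound
from pdfminer.pdftypes import PDFObjRef, resolve1
from pdfminer.psparser import PSLiteral, literal_name

def _name(obj):
    # PDF names such as /Link arrive as PSLiteral objects
    return literal_name(obj) if isinstance(obj, PSLiteral) else None

def extract_links_with_target_pages(pdf_path, password=None):
    link_info = []
    with pdfplumber.open(pdf_path, password=password) as pdf:
        # Map each page's PDF object id to its 1-indexed page number
        page_numbers = {p.page_obj.pageid: p.page_number for p in pdf.pages}

        for page in pdf.pages:
            for annot in page.annots:
                # "data" is the raw annotation dictionary from pdfminer.six.
                # This assumes page references inside it are left unresolved;
                # check that against your installed pdfplumber version.
                data = annot.get("data") or {}
                if _name(data.get("Subtype")) != "Link":
                    continue

                # The target is stored in /Dest, or in /D of a /GoTo action
                dest = data.get("Dest")
                action = data.get("A")
                if dest is None and isinstance(action, dict):
                    if _name(action.get("S")) == "GoTo":
                        dest = action.get("D")

                link_type = "Direct page reference"
                # Case 2: Named destination, looked up in the document catalog
                if isinstance(dest, (bytes, str, PSLiteral)):
                    link_type = "Named destination"
                    try:
                        dest = resolve1(pdf.doc.get_dest(_name(dest) or dest))
                    except PDFDestinationNotFound:
                        continue
                    if isinstance(dest, dict):
                        dest = resolve1(dest.get("D"))

                # Case 1: Direct page reference, [page_ref, /XYZ, left, top, zoom]
                if isinstance(dest, list) and dest and isinstance(dest[0], PDFObjRef):
                    target_page = page_numbers.get(dest[0].objid)
                    if target_page is not None:
                        link_info.append({
                            "source_page": page.page_number,
                            "target_page": target_page,
                            "link_type": link_type,
                            "link_area": (annot["x0"], annot["top"], annot["x1"], annot["bottom"])
                        })
    return link_info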

# Example usage
links = extract_links_with_target_pages("your_document.pdf")
for link in links:
    print(f"Page {link['source_page']} links to page {link['target_page']}")
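
Both scripts return a list of dictionaries with source_page and target_page keys. If a flat list is hard to read for a long document, one possible way to summarize it is to group the source pages under each target page. The helper name group_by_target is illustrative and not part of either library.

```python
from collections import defaultdict


def group_by_target(links):
    """Map each target page to the sorted list of pages that link to it."""
    sources = defaultdict(set)
    for link in links:
        sources[link["target_page"]].add(link["source_page"])
    # Sort by target page so the summary reads front to back
    return {target: sorted(pages) for target, pages in sorted(sources.items())}
```

Calling group_by_target(links) on the result of either function gives one entry per target page, with duplicate links from the same source page collapsed.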

Key Notes to Keep in Mind

Encrypted PDFs: Both libraries accept a password argument for protected files. Pass it to PdfReader or to pdfplumber.open.
External Links: A link to a different PDF file uses a /GoToR action. When its destination is an explicit array, the first element is already a 0-based page number in that other file; when it is a named destination, you have to open the external PDF and resolve the name there using the same methods.
Edge Cases: Destinations come in several fit modes (/Fit, /FitH, /FitR and others, not only /XYZ). The first array element is the page in every mode, so the scripts above handle the common scenarios. If you hit odd structures, such as a null or missing page reference, add extra checks on the destination object.
Page Numbering: Page indexes from PyPDF2 are 0-based, so the script adds 1 to get the page position counted from the front of the file. Viewers that display custom page labels (such as i, ii, iii for front matter) may show a different number.
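
To make the note on external links concrete, here is a minimal sketch that sorts a link annotation into internal, remote, or web targets and reads the page number from a remote explicit destination. It works on any dict-like annotation whose values are already resolved (for PyPDF2, call get_object() on indirect references first). The names classify_link and remote_target_page are illustrative and not part of any library.

```python
def classify_link(annot):
    """Return (kind, detail) for a dict-like link annotation."""
    if annot.get("/Subtype") != "/Link":
        return ("not_a_link", None)
    if "/Dest" in annot:
        return ("internal", annot["/Dest"])
    action = annot.get("/A")
    if not action:
        return ("unknown", None)
    kind = action.get("/S")
    if kind == "/GoTo":
        return ("internal", action.get("/D"))
    if kind == "/GoToR":
        # /F identifies the other file; /D is a destination inside it
        return ("remote", {"file": action.get("/F"), "dest": action.get("/D")})
    if kind == "/URI":
        return ("uri", action.get("/URI"))
    return ("other", kind)


def remote_target_page(dest):
    """1-indexed page in the remote file, or None for a named destination."""
    if isinstance(dest, (list, tuple)) and dest and isinstance(dest[0], int):
        return dest[0] + 1  # Remote explicit destinations count pages from 0
    return None
```

A remote link whose /D is [4, /Fit] therefore points at the fifth page of the other file, while a remote link whose /D is a name still requires opening that file.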

The question comes from Stack Exchange; asked by CrioWulf.