
How to Use python-docx to Get the InlineShapes Inside a Paragraph and Process Them with the Surrounding Text

How to Get Inline Shapes Within Specific Paragraphs in a DOCX File

When working with python-docx, the doc.inline_shapes property gives you every inline shape (typically an embedded picture) in the document, but it doesn’t tell you which paragraph each one belongs to. If you need to process images alongside their surrounding paragraph text, here’s a reliable way to map inline shapes to their parent paragraphs:

Step 1: Understand the XML Structure Under the Hood

Every paragraph in a DOCX is represented by a <w:p> XML element. Each inline shape lives in a <wp:inline> element, which is wrapped in a <w:drawing> element inside a run (<w:r>) of that paragraph. We can leverage this structure to match shapes to their paragraphs.
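To make that nesting concrete, here is a minimal sketch that uses only the standard library rather than python-docx. The sample paragraph is hand-written and simplified (real documents carry more attributes and child elements), and inline_embed_ids is a hypothetical helper name introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used by WordprocessingML and DrawingML
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Hand-written, simplified paragraph: one text run, then one run holding a picture
SAMPLE_PARAGRAPH = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
     xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
     xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
     xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"
     xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <w:r><w:t>Figure 1 shows the result:</w:t></w:r>
  <w:r>
    <w:drawing>
      <wp:inline>
        <a:graphic>
          <a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
            <pic:pic>
              <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
            </pic:pic>
          </a:graphicData>
        </a:graphic>
      </wp:inline>
    </w:drawing>
  </w:r>
</w:p>
"""


def inline_embed_ids(paragraph):
    """Return the r:embed ids of pictures found in <wp:inline> elements of a <w:p>."""
    embed_attr = f"{{{NS['r']}}}embed"
    ids = []
    for inline in paragraph.findall("w:r/w:drawing/wp:inline", NS):
        blip = inline.find("a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip", NS)
        if blip is not None:  # shapes that are not pictures have no blip
            ids.append(blip.get(embed_attr))
    return ids


paragraph = ET.fromstring(SAMPLE_PARAGRAPH.strip())
print(inline_embed_ids(paragraph))  # ['rId5']
```

The path w:r/w:drawing/wp:inline is the same parent-child chain described above, which is why a paragraph element is enough to find the shapes it owns.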

Step 2: Code Implementation (Basic Version)

This approach loops through each paragraph, then checks which inline shapes are nested inside its XML structure:

from docx import Document
from docx.oxml.ns import qn

doc = Document("test.docx")

# Iterate through each paragraph in the document
for para in doc.paragraphs:
    para_text = para.text.strip()
    print(f"Paragraph Content: {para_text if para_text else '[Empty Paragraph]'}")
    
    # Get the underlying XML element of the paragraph
    para_xml = para._p
    # Filter inline shapes that belong to this paragraph
    paragraph_shapes = []
    for inline_shape in doc.inline_shapes:
        shape_xml = inline_shape._inline
        # Check if the shape's XML is a descendant of the paragraph's XML
        if shape_xml in para_xml.iterdescendants():
            paragraph_shapes.append(inline_shape)
    
    # Process the shapes in the current paragraph
    if paragraph_shapes:
        print(f"Found {len(paragraph_shapes)} embedded image(s) in this paragraph:")
        for i, shape in enumerate(paragraph_shapes, 1):
            # Extract the image's rID (used to access the actual image data)
            blip = shape._inline.graphic.graphicData.pic.blipFill.blip
            rID = blip.get(qn('r:embed'))
            # Get the image part from the document
            doc_part = doc.part
            image_part = doc_part.related_parts[rID]
            
            # Example: Print image details (you could save it here too)
            print(f"Image {i}: rID={rID}, File Type={image_part.content_type}")
    print("---")

Step 3: Optimized Version (Faster for Large Documents)

If you’re working with a document with lots of images, pre-mapping inline shapes to their XML elements will save you from looping through all shapes for every paragraph:

from docx import Document
from docx.oxml.ns import qn

doc = Document("test.docx")

# Pre-create a map: XML element -> corresponding inline shape
shape_map = {shape._inline: shape for shape in doc.inline_shapes}

for para in doc.paragraphs:
    para_text = para.text.strip()
    print(f"Paragraph Content: {para_text if para_text else '[Empty Paragraph]'}")
    
    para_xml = para._p
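    # Directly find all inline shape elements in the paragraph.
    # qn() accepts a single prefixed tag, not a path, so use the element's
    # namespace-aware xpath() instead of findall(qn(...)).
    paragraph_shapes = []
    for inline_element in para_xml.xpath('./w:r/w:drawing/wp:inline'):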
        if inline_element in shape_map:
            paragraph_shapes.append(shape_map[inline_element])
    
    # Same processing as before...
    if paragraph_shapes:
        print(f"Found {len(paragraph_shapes)} embedded image(s) in this paragraph:")
        for i, shape in enumerate(paragraph_shapes, 1):
            blip = shape._inline.graphic.graphicData.pic.blipFill.blip
            rID = blip.get(qn('r:embed'))
            doc_part = doc.part
            image_part = doc_part.related_parts[rID]
            print(f"Image {i}: rID={rID}, File Type={image_part.content_type}")
    print("---")
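Both versions tell you which images a paragraph contains, but not where they sit relative to the text. If that order matters, one possible approach is to walk the runs in sequence and emit text and image references as they appear. The sketch below is an assumption-laden illustration, not part of the original answer: iter_paragraph_items is a hypothetical helper, and the sample XML is hand-written and simplified. It relies only on the ElementTree-style interface (findall, iter, tag, text, get), so it should also accept para._p from python-docx directly, though that is worth verifying against your installed version.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def iter_paragraph_items(p):
    """Yield ("text", str) and ("image", rId) tuples in document order."""
    for run in p.findall(f"{{{W}}}r"):
        for child in run:
            if child.tag == f"{{{W}}}t" and child.text:
                yield ("text", child.text)
            elif child.tag == f"{{{W}}}drawing":
                for inline in child.findall(f"{{{WP}}}inline"):
                    for blip in inline.iter(f"{{{A}}}blip"):
                        yield ("image", blip.get(f"{{{R}}}embed"))


# Hand-written, simplified paragraph: text, picture, text
SAMPLE = (
    f'<w:p xmlns:w="{W}" xmlns:wp="{WP}" xmlns:a="{A}" '
    f'xmlns:pic="{PIC}" xmlns:r="{R}">'
    '<w:r><w:t xml:space="preserve">See the chart </w:t></w:r>'
    '<w:r><w:drawing><wp:inline><a:graphic><a:graphicData>'
    '<pic:pic><pic:blipFill><a:blip r:embed="rId7"/></pic:blipFill></pic:pic>'
    '</a:graphicData></a:graphic></wp:inline></w:drawing></w:r>'
    '<w:r><w:t xml:space="preserve"> for details.</w:t></w:r>'
    '</w:p>'
)

items = list(iter_paragraph_items(ET.fromstring(SAMPLE)))
for kind, value in items:
    print(kind, repr(value))
```

Each ("image", rId) entry can then be resolved through doc.part.related_parts exactly as in the loops above, while the neighbouring ("text", ...) entries give you the text immediately before and after the picture.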

Key Notes

  • The _inline attribute gives you direct access to the underlying XML element of the inline shape, which is what makes it possible to match the shape to its parent paragraph. Like para._p, it is a private attribute, so it may change between python-docx releases.
  • Once you have the rID, you can retrieve the actual image data from the document’s related parts. This lets you save the image to disk or process it further.
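As a rough sketch of the "save the image to disk" step, the helper below writes the part behind an rID into a folder. save_related_image and the output folder name are hypothetical, introduced here for illustration. It assumes the related part exposes blob (the raw bytes) and partname (a path such as /word/media/image1.png), as python-docx parts do.

```python
import posixpath
from pathlib import Path


def save_related_image(doc, r_id, out_dir):
    """Write the image part referenced by r_id into out_dir and return its path."""
    image_part = doc.part.related_parts[r_id]
    # partname looks like "/word/media/image1.png"; keep only the file name
    file_name = posixpath.basename(str(image_part.partname))
    target = Path(out_dir) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(image_part.blob)
    return target


# Hypothetical usage inside the loops shown earlier:
#   from docx import Document
#   doc = Document("test.docx")
#   saved_path = save_related_image(doc, rID, "extracted_images")
```

Reusing the file name from partname keeps the original extension, so the saved file opens in an ordinary image viewer without guessing the format from content_type.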

This content is based on a question from Stack Exchange, asked by Ruben Kostandyan.
