How to get the InlineShapes inside a paragraph with python-docx and process them together with the surrounding text
When working with python-docx, the doc.inline_shapes property gives you the inline pictures in the document, but it doesn’t tell you which paragraph each image belongs to. If you need to process images alongside their surrounding paragraph text, here’s a reliable way to map inline shapes to their parent paragraphs:
Step 1: Understand the XML Structure Under the Hood
Every paragraph in a DOCX is represented by a <w:p> XML element, and each inline shape lives inside a <wp:inline> element, which sits in a <w:drawing> element nested within a run (<w:r>) of that paragraph. We can leverage this structure to match shapes to their paragraphs.
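To make the nesting concrete, here is a minimal sketch that uses only the standard library. The XML fragment is hand-written and heavily simplified for illustration (a real document.xml carries many more attributes and child elements), and count_inline_per_paragraph is a hypothetical helper name, not part of python-docx.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and its drawing extension.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
}

# Hand-written, simplified body (illustrative assumption, not real Word output).
SAMPLE = f"""
<w:body xmlns:w="{NS['w']}" xmlns:wp="{NS['wp']}">
  <w:p><w:r><w:t>Text only</w:t></w:r></w:p>
  <w:p>
    <w:r><w:t>Figure: </w:t></w:r>
    <w:r><w:drawing><wp:inline/></w:drawing></w:r>
  </w:p>
</w:body>
"""


def count_inline_per_paragraph(xml_text):
    """Return the number of <wp:inline> elements found in each <w:p>."""
    body = ET.fromstring(xml_text)
    return [
        len(p.findall("w:r/w:drawing/wp:inline", NS))
        for p in body.findall("w:p", NS)
    ]


print(count_inline_per_paragraph(SAMPLE))  # prints [0, 1]
```

The path w:r/w:drawing/wp:inline is the same nesting the python-docx code relies on: the first paragraph has no drawing, the second has exactly one.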
Step 2: Code Implementation (Basic Version)
This approach loops through each paragraph, then checks which inline shapes are nested inside its XML structure:
    from docx import Document
    from docx.oxml.ns import qn

    doc = Document("test.docx")

    # Iterate through each paragraph in the document
    for para in doc.paragraphs:
        para_text = para.text.strip()
        print(f"Paragraph Content: {para_text if para_text else '[Empty Paragraph]'}")

        # Get the underlying XML element of the paragraph
        para_xml = para._p

        # Filter inline shapes that belong to this paragraph
        paragraph_shapes = []
        for inline_shape in doc.inline_shapes:
            shape_xml = inline_shape._inline
            # Check if the shape's XML is a descendant of the paragraph's XML
            if shape_xml in para_xml.iterdescendants():
                paragraph_shapes.append(inline_shape)

        # Process the shapes in the current paragraph
        if paragraph_shapes:
            print(f"Found {len(paragraph_shapes)} embedded image(s) in this paragraph:")
            for i, shape in enumerate(paragraph_shapes, 1):
                # Extract the image's rID (used to access the actual image data)
                blip = shape._inline.graphic.graphicData.pic.blipFill.blip
                rID = blip.get(qn('r:embed'))
                # Get the image part from the document
                doc_part = doc.part
                image_part = doc_part.related_parts[rID]
                # Example: Print image details (you could save it here too)
                print(f"Image {i}: rID={rID}, File Type={image_part.content_type}")
        print("---")
Step 3: Optimized Version (Faster for Large Documents)
If you’re working with a document with lots of images, pre-mapping inline shapes to their XML elements will save you from looping through all shapes for every paragraph:
    from docx import Document
    from docx.oxml.ns import qn

    doc = Document("test.docx")

    # Pre-create a map: XML element -> corresponding inline shape
    shape_map = {shape._inline: shape for shape in doc.inline_shapes}

    for para in doc.paragraphs:
        para_text = para.text.strip()
        print(f"Paragraph Content: {para_text if para_text else '[Empty Paragraph]'}")

        para_xml = para._p

        # Directly find all inline shape elements in the paragraph.
        # Note: qn() accepts a single "prefix:tag" name, not a path, so the
        # element's xpath() method (which knows the w: and wp: prefixes) is used.
        paragraph_shapes = []
        for inline_element in para_xml.xpath('./w:r/w:drawing/wp:inline'):
            if inline_element in shape_map:
                paragraph_shapes.append(shape_map[inline_element])

        # Same processing as before...
        if paragraph_shapes:
            print(f"Found {len(paragraph_shapes)} embedded image(s) in this paragraph:")
            for i, shape in enumerate(paragraph_shapes, 1):
                blip = shape._inline.graphic.graphicData.pic.blipFill.blip
                rID = blip.get(qn('r:embed'))
                doc_part = doc.part
                image_part = doc_part.related_parts[rID]
                print(f"Image {i}: rID={rID}, File Type={image_part.content_type}")
        print("---")
Key Notes
- The _inline attribute gives you direct access to the underlying XML element of the inline shape, which is crucial for matching it to the parent paragraph.
- Once you have the rID, you can retrieve the actual image data from the document’s related parts. This lets you save the image to disk or process it further.
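As a rough illustration of the last note, the sketch below writes each embedded inline image to disk and pairs it with its paragraph text. The names extension_for and save_paragraph_images, the file-naming scheme, and the output directory are hypothetical choices for this example. It assumes that doc is a python-docx Document, that the pictures are embedded rather than linked, and that the image part exposes its bytes through blob.

```python
import mimetypes
from pathlib import Path


def extension_for(content_type):
    """Map an image MIME type to a file extension, falling back to '.bin'."""
    # Pin the common formats so the result does not vary between Python versions.
    preferred = {"image/jpeg": ".jpg", "image/png": ".png", "image/gif": ".gif"}
    return preferred.get(content_type) or mimetypes.guess_extension(content_type) or ".bin"


def save_paragraph_images(doc, out_dir):
    """Save each embedded inline image; return (paragraph_text, path) pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for p_idx, para in enumerate(doc.paragraphs):
        # The r:embed attribute of <a:blip> holds the relationship id (rID).
        r_ids = para._p.xpath("./w:r/w:drawing/wp:inline//a:blip/@r:embed")
        for i_idx, r_id in enumerate(r_ids, 1):
            image_part = doc.part.related_parts[r_id]
            suffix = extension_for(image_part.content_type)
            target = out_dir / f"para{p_idx}_img{i_idx}{suffix}"
            target.write_bytes(image_part.blob)
            saved.append((para.text.strip(), target))
    return saved


# Usage (requires python-docx and an existing test.docx):
#   from docx import Document
#   for text, path in save_paragraph_images(Document("test.docx"), "extracted_images"):
#       print(path, "<-", repr(text))
```

Reading the rID through an XPath attribute query avoids walking graphic.graphicData.pic by hand, so inline shapes that are not pictures simply yield no match instead of raising an error.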
The question comes from Stack Exchange and was asked by Ruben Kostandyan.




