Technical help: extracting .docx structure, text types, and image positions with python-docx
Great question! I've run into this exact issue when trying to reconstruct docx files while preserving content order, so let's break down how to solve this with python-docx.
Your core need is to capture paragraphs (with type distinctions) and image positions in the order they appear in the document flow—this way you can accurately rebuild a new document. Your current code only grabs paragraph text, completely ignoring images and paragraph styles, so we'll extend it to handle both.
Step 1: Get a Sequence of Paragraphs (with Type) + Images
The key thing to understand is: most images inserted into docx files are inline within paragraphs—either alongside text in the same paragraph, or occupying a paragraph with no text. So we can iterate through each paragraph, first handle the paragraph text (distinguish headings from regular text), then process any images in that paragraph.
Here's the full implementation:
    from docx import Document


    def get_doc_content_sequence(doc_path):
        doc = Document(doc_path)
        content_sequence = []
        image_count = 1
        for para in doc.paragraphs:
            cleaned_text = para.text.strip()
            # 1. Handle paragraph text (if there's content)
            if cleaned_text:
                # Check for built-in heading styles (Word uses "Heading 1", "Heading 2", etc.)
                style_name = para.style.name
                if style_name.startswith("Heading"):
                    heading_level = style_name.split()[-1]
                    content_sequence.append(f"Heading {heading_level}: {cleaned_text}")
                else:
                    content_sequence.append(f"Paragraph: {cleaned_text}")
            # 2. Handle inline images in the paragraph.
            # Paragraph objects have no inline_shapes attribute (only the Document does),
            # so query the paragraph's XML for embedded pictures inside <wp:inline>.
            # Note: _element is a private attribute of python-docx objects.
            for _ in para._element.xpath(".//wp:inline//a:blip/@r:embed"):
                content_sequence.append(f"Image {image_count}")
                image_count += 1
        return content_sequence


    # Test it out
    sequence = get_doc_content_sequence("demo.docx")
    for item in sequence:
        print(item)
Running this will give you output matching the structure you wanted, like:
    Paragraph: This is the first paragraph text
    Image 1
    Heading 1: This is a level 1 heading
    Paragraph: Second paragraph content
    Image 2
    Image 3
    Paragraph: Third paragraph text
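To see why walking the paragraphs in order also yields the images in order, it can help to look at the underlying XML: an inline picture sits inside a paragraph's run as a wp:inline element. The following is a minimal sketch using only the standard library. SAMPLE_XML is a hypothetical, heavily trimmed stand-in for word/document.xml (a real picture nests further elements, such as the graphic data and the blip reference, inside wp:inline), and paragraph_outline is an illustrative name, not part of python-docx.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and its drawing wrapper
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
}

# Hypothetical, trimmed document body: text, then a picture-only paragraph, then text
SAMPLE_XML = f"""
<w:document xmlns:w="{NS['w']}" xmlns:wp="{NS['wp']}">
  <w:body>
    <w:p><w:r><w:t>Intro text</w:t></w:r></w:p>
    <w:p><w:r><w:drawing><wp:inline/></w:drawing></w:r></w:p>
    <w:p><w:r><w:t>Closing text</w:t></w:r></w:p>
  </w:body>
</w:document>
""".strip()


def paragraph_outline(xml_text):
    """Return (text, inline_image_count) for each body paragraph, in document order."""
    root = ET.fromstring(xml_text)
    outline = []
    for para in root.iterfind("w:body/w:p", NS):
        # Concatenate the text of every <w:t> element in the paragraph
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        # Each <wp:inline> wraps one inline drawing
        images = len(para.findall(".//wp:inline", NS))
        outline.append((text, images))
    return outline


print(paragraph_outline(SAMPLE_XML))
# [('Intro text', 0), ('', 1), ('Closing text', 0)]
```

The picture-only paragraph shows up as an entry with empty text and one image, which is the case the Step 1 function handles by skipping the text entry and still emitting the image entry.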
Step 2: Extract Image Files (for Document Reconstruction)
To rebuild a new document, knowing image positions isn't enough—you need to extract the actual image files from the original docx. This code will save all inline images to a specified folder:
    import os

    from docx import Document


    def extract_doc_images(doc_path, save_folder):
        doc = Document(doc_path)
        # Create the save folder if it doesn't exist
        os.makedirs(save_folder, exist_ok=True)
        image_count = 1
        for para in doc.paragraphs:
            # Paragraph objects have no inline_shapes attribute, so read the r:embed
            # relationship ID of each inline picture straight from the paragraph's XML
            for image_embed_id in para._element.xpath(".//wp:inline//a:blip/@r:embed"):
                # Get the image's binary data from the related image part
                image_data = doc.part.related_parts[image_embed_id].blob
                # Save as PNG (adjust the file extension if your images are a different format)
                img_path = os.path.join(save_folder, f"image_{image_count}.png")
                with open(img_path, "wb") as img_file:
                    img_file.write(image_data)
                image_count += 1


    # Example usage: save images to an "extracted_images" folder in your current directory
    extract_doc_images("demo.docx", "extracted_images")
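If the document mixes image formats, hard-coding .png gives misleading file names. One possible approach is to derive the extension from the image part's content type (python-docx parts expose a content_type string such as "image/jpeg"). The sketch below is only that: the mapping table, the ".bin" fallback, and the names extension_for and image_filename are assumptions for illustration, not part of python-docx.

```python
# Assumed mapping from common image content types to file extensions
CONTENT_TYPE_EXTENSIONS = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "image/bmp": ".bmp",
    "image/tiff": ".tiff",
    "image/x-emf": ".emf",
    "image/x-wmf": ".wmf",
}


def extension_for(content_type, default=".bin"):
    """Return a file extension for a content type, or the default if unknown."""
    return CONTENT_TYPE_EXTENSIONS.get(content_type.lower(), default)


def image_filename(index, content_type):
    """Build a name such as image_2.jpg from a running index and a content type."""
    return f"image_{index}{extension_for(content_type)}"


print(image_filename(1, "image/png"))
print(image_filename(2, "image/jpeg"))
```

In the extraction loop, the content type would come from doc.part.related_parts[image_embed_id].content_type. If you adopt this, the rebuild step needs the actual file names rather than assuming every file ends in .png.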
Step 3: Rebuild the Document Using the Sequence
With the content sequence and extracted images, you can easily rebuild the document. Iterate through the sequence, add paragraphs with the correct style, and insert images in the right positions:
    from docx import Document


    def rebuild_document(sequence, image_folder, output_path):
        new_doc = Document()
        image_index = 1
        for item in sequence:
            if item.startswith("Heading"):
                # Split heading level and text content
                level_part, text_part = item.split(": ", 1)
                level = int(level_part.split()[-1])
                new_doc.add_heading(text_part, level=level)
            elif item.startswith("Paragraph"):
                text = item.split(": ", 1)[1]
                new_doc.add_paragraph(text)
            elif item.startswith("Image"):
                # Insert the corresponding extracted image
                img_path = f"{image_folder}/image_{image_index}.png"
                new_doc.add_picture(img_path)
                image_index += 1
        new_doc.save(output_path)


    # Example usage
    rebuild_document(sequence, "extracted_images", "rebuilt_demo.docx")
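The rebuild step depends on the exact string format of each sequence entry ("Heading N: text", "Paragraph: text", "Image N"). The parsing can be isolated and checked without touching a .docx file. This is a small sketch: parse_item is a hypothetical helper, and the fallback to level 1 for a heading label without a numeric suffix is an assumption, not something the original code does.

```python
def parse_item(item):
    """Split a sequence entry into a (kind, number, text) tuple."""
    if item.startswith("Paragraph: "):
        return ("paragraph", None, item[len("Paragraph: "):])
    if item.startswith("Heading "):
        # Split only on the first ": " so colons inside the heading text survive
        label, text = item.split(": ", 1)
        level = label.split()[-1]
        # Assumed fallback: treat a non-numeric suffix as a level 1 heading
        return ("heading", int(level) if level.isdigit() else 1, text)
    if item.startswith("Image "):
        return ("image", int(item.split()[-1]), "")
    raise ValueError(f"Unrecognized sequence entry: {item!r}")


print(parse_item("Heading 2: Setup: part one"))
# ('heading', 2, 'Setup: part one')
print(parse_item("Paragraph: Heading to the store"))
# ('paragraph', None, 'Heading to the store')
print(parse_item("Image 3"))
# ('image', 3, '')
```

Checking the "Paragraph: " prefix first keeps body text that happens to begin with "Heading" or "Image" from being misread, since every paragraph entry carries that prefix.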
Notes
- This code targets inline images (the most common type, inserted directly into the text flow). Floating images (e.g., positioned with text wrapping around them) are stored as anchored drawings (wp:anchor elements rather than wp:inline), and python-docx has limited support for them, so they require extra XML-level handling. Most documents won't need this.
- The paragraph style check relies on Word's built-in heading styles ("Heading 1" to "Heading 9"). If you use custom styles in your document, adjust the para.style.name logic to match your custom style names.
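For documents with custom styles, one way to adjust the style check is a lookup table consulted after the built-in names. This is a sketch under stated assumptions: the style names "Chapter Title" and "Section Title" are hypothetical placeholders for whatever your document uses, and heading_level is an illustrative helper name.

```python
import re

# Hypothetical custom style names mapped to heading levels; replace with your own
CUSTOM_HEADING_STYLES = {"Chapter Title": 1, "Section Title": 2}


def heading_level(style_name, custom_styles=CUSTOM_HEADING_STYLES):
    """Return the heading level for a style name, or None for non-heading styles."""
    # Built-in styles are named "Heading 1" through "Heading 9"
    match = re.fullmatch(r"Heading ([1-9])", style_name)
    if match:
        return int(match.group(1))
    # Fall back to the custom mapping; unknown styles yield None
    return custom_styles.get(style_name)


print(heading_level("Heading 3"))      # 3
print(heading_level("Chapter Title"))  # 1
print(heading_level("Normal"))         # None
```

In the Step 1 function, the result of heading_level(para.style.name) could replace the startswith("Heading") test, with None meaning a regular paragraph.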
With these three steps you can see where each inline image sits in the original document and rebuild a new document with the same content order.
The question comes from Stack Exchange; it was asked by Benji Tan.




