Technical help: extracting .docx structure, text types, and image positions with python-docx

Great question! I've run into this exact issue when trying to reconstruct docx files while preserving content order, so let's break down how to solve this with python-docx.

Your core need is to capture paragraphs (with type distinctions) and image positions in the order they appear in the document flow—this way you can accurately rebuild a new document. Your current code only grabs paragraph text, completely ignoring images and paragraph styles, so we'll extend it to handle both.

Step 1: Get a Sequence of Paragraphs (with Type) + Images

The key thing to understand is: most images inserted into docx files are inline within paragraphs—either alongside text in the same paragraph, or occupying a paragraph with no text. So we can iterate through each paragraph, first handle the paragraph text (distinguish headings from regular text), then process any images in that paragraph.

Here's the full implementation:

from docx import Document

def get_doc_content_sequence(doc_path):
    doc = Document(doc_path)
    content_sequence = []
    image_count = 1

    for para in doc.paragraphs:
        cleaned_text = para.text.strip()
        # 1. Handle paragraph text (if there's content)
        if cleaned_text:
            # Check for built-in heading styles (Word uses "Heading 1", "Heading 2", etc.)
            if para.style.name.startswith("Heading"):
                heading_level = para.style.name.split()[-1]
                content_sequence.append(f"Heading {heading_level}: {cleaned_text}")
            else:
                content_sequence.append(f"Paragraph: {cleaned_text}")
        
        # 2. Handle inline images in the paragraph
        # Paragraph objects have no inline_shapes attribute (only Document does),
        # so look up inline picture references in the paragraph's XML instead.
        for _ in para._element.xpath(".//wp:inline//a:blip/@r:embed"):
            content_sequence.append(f"Image {image_count}")
            image_count += 1

    return content_sequence

# Test it out
sequence = get_doc_content_sequence("demo.docx")
for item in sequence:
    print(item)

Running this will give you output matching the structure you wanted, like:

Paragraph: This is the first paragraph text
Image 1
Heading 1: This is a level 1 heading
Paragraph: Second paragraph content
Image 2
Image 3
Paragraph: Third paragraph text
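
One caveat with the loop above: it emits a paragraph's text first and its images second, so an image that sits before or between text in the same paragraph ends up after that text in the sequence. Recovering the true order means walking the paragraph's runs one by one. Here is a minimal sketch of that idea, using only the standard library on a raw `w:p` XML fragment. The helper name `paragraph_events` and the simplified sample XML are made up for illustration, and real documents contain more element types than this handles.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and DrawingML
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
W_T = "{%s}t" % NS["w"]
W_DRAWING = "{%s}drawing" % NS["w"]
A_BLIP = "{%s}blip" % NS["a"]
R_EMBED = "{%s}embed" % NS["r"]


def paragraph_events(p_xml):
    """Return ("text", str) and ("image", rId) tuples in run order (hypothetical helper)."""
    events = []
    paragraph = ET.fromstring(p_xml)
    for run in paragraph.findall("w:r", NS):
        for child in run:
            if child.tag == W_T and child.text:
                events.append(("text", child.text))
            elif child.tag == W_DRAWING:
                # Only inline pictures; floating ones sit under wp:anchor instead
                for inline in child.findall("wp:inline", NS):
                    for blip in inline.iter(A_BLIP):
                        events.append(("image", blip.get(R_EMBED)))
    return events


# Simplified sample: real files nest a:blip inside pic:pic/pic:blipFill
SAMPLE = (
    '<w:p xmlns:w="%(w)s" xmlns:wp="%(wp)s" xmlns:a="%(a)s" xmlns:r="%(r)s">'
    "<w:r><w:t>Before </w:t></w:r>"
    "<w:r><w:drawing><wp:inline><a:graphic><a:graphicData>"
    '<a:blip r:embed="rId5"/>'
    "</a:graphicData></a:graphic></wp:inline></w:drawing></w:r>"
    "<w:r><w:t>after</w:t></w:r>"
    "</w:p>" % NS
)

print(paragraph_events(SAMPLE))
# [('text', 'Before '), ('image', 'rId5'), ('text', 'after')]
```

The same run-by-run walk can be done on python-docx paragraphs; the trade-off is that text split across several runs has to be stitched back together.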

Step 2: Extract Image Files (for Document Reconstruction)

To rebuild a new document, knowing image positions isn't enough—you need to extract the actual image files from the original docx. This code will save all inline images to a specified folder:

import os

def extract_doc_images(doc_path, save_folder):
    doc = Document(doc_path)
    # Create the save folder if it doesn't exist
    os.makedirs(save_folder, exist_ok=True)

    image_count = 1
    for para in doc.paragraphs:
        # Paragraph objects have no inline_shapes attribute, so read the
        # relationship IDs of inline pictures from the paragraph's XML.
        for image_embed_id in para._element.xpath(".//wp:inline//a:blip/@r:embed"):
            # Get the image's binary data
            image_data = doc.part.related_parts[image_embed_id].blob
            # The bytes are written unchanged, so the .png name is only accurate
            # for PNG sources (adjust the extension for other formats)
            with open(f"{save_folder}/image_{image_count}.png", "wb") as img_file:
                img_file.write(image_data)
            image_count += 1

# Example usage: save images to an "extracted_images" folder in your current directory
extract_doc_images("demo.docx", "extracted_images")
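
The Step 2 code names every file `.png` regardless of what the bytes actually contain. If your documents mix formats, one option is to pick the extension from the file signature at the start of the data. The sketch below is a minimal, standard-library-only illustration; `guess_image_extension` is a hypothetical helper, and it covers only a handful of common formats.

```python
def guess_image_extension(image_data):
    """Guess a file extension from the leading bytes (hypothetical helper)."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, extension in signatures:
        if image_data.startswith(magic):
            return extension
    # Unknown format: keep the bytes but make the uncertainty visible in the name
    return ".bin"


print(guess_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
# .jpg
```

If you rename files this way, the rebuild step has to look up each image's actual filename instead of assuming `.png`.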

Step 3: Rebuild the Document Using the Sequence

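With the content sequence and extracted images, you can rebuild the document's text, heading levels, and image order. Iterate through the sequence, add each heading or paragraph, and insert each image at its position. Run-level formatting (bold, fonts) and the original image sizes are not carried over, because the sequence stores only plain text and image order: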

def rebuild_document(sequence, image_folder, output_path):
    new_doc = Document()
    image_index = 1

    for item in sequence:
        if item.startswith("Heading"):
            # Split heading level and text content
            level_part, text_part = item.split(": ", 1)
            level = int(level_part.split()[-1])
            new_doc.add_heading(text_part, level=level)
        elif item.startswith("Paragraph"):
            text = item.split(": ", 1)[1]
            new_doc.add_paragraph(text)
        elif item.startswith("Image"):
            # Insert the corresponding extracted image
            img_path = f"{image_folder}/image_{image_index}.png"
            new_doc.add_picture(img_path)
            image_index += 1
    
    new_doc.save(output_path)

# Example usage
rebuild_document(sequence, "extracted_images", "rebuilt_demo.docx")
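
Because the sequence is a list of plain strings, the prefixes act as a small serialization format, and a malformed entry fails late. For example, a custom style named "Heading Appendix" would produce "Heading Appendix: ..." in Step 1 and then crash `int()` inside the rebuild loop. A minimal sketch for validating entries up front, assuming the exact prefix format used above (`parse_item` is a hypothetical helper, not part of python-docx):

```python
def parse_item(item):
    """Split a sequence entry into (kind, number, text) (hypothetical helper).

    number is the heading level or the image index; it is None for paragraphs.
    """
    label, _, text = item.partition(": ")
    kind, _, number = label.partition(" ")
    if label == "Paragraph":
        return ("paragraph", None, text)
    if kind == "Heading" and number.isdigit():
        return ("heading", int(number), text)
    if kind == "Image" and number.isdigit() and not text:
        return ("image", int(number), None)
    raise ValueError(f"Unrecognized sequence entry: {item!r}")


print(parse_item("Heading 2: Setup: part one"))
# ('heading', 2, 'Setup: part one')
```

Running every entry through a check like this before calling `rebuild_document` surfaces bad entries before any output file is written.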

Notes

  • This code targets inline images (the kind inserted directly into the text flow). Floating images (e.g., positioned with text wrapping) are stored as anchored drawings (wp:anchor) rather than inline ones (wp:inline), and python-docx has no high-level API for them, so they need separate XML handling. Also note that doc.paragraphs only covers top-level body paragraphs; content inside tables, headers, and footers is not visited.
  • The paragraph style check relies on Word's built-in heading styles ("Heading 1" to "Heading 9"). If you use custom styles in your document, adjust the para.style.name logic to match your custom style names.
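
To make the custom-style note concrete, here is a minimal sketch of a style-to-level lookup that could replace the `startswith("Heading")` check. The function name `heading_level` and the `custom_styles` mapping are assumptions for illustration; the style names in the mapping would be whatever your document actually uses.

```python
import re


def heading_level(style_name, custom_styles=None):
    """Map a paragraph style name to a heading level, or None for body text."""
    if custom_styles and style_name in custom_styles:
        return custom_styles[style_name]
    # Built-in heading styles are named "Heading 1" through "Heading 9"
    match = re.fullmatch(r"Heading ([1-9])", style_name)
    return int(match.group(1)) if match else None


# Hypothetical custom style names mapped to heading levels
my_styles = {"Chapter Title": 1, "Section Title": 2}

print(heading_level("Heading 3"))                 # 3
print(heading_level("Chapter Title", my_styles))  # 1
print(heading_level("Normal", my_styles))         # None
```

In Step 1 you would call it as `heading_level(para.style.name, my_styles)` and treat a `None` result as a regular paragraph.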

With this setup you can see where each inline image sits in the original document and rebuild a new document with the same content order.

This question originated on Stack Exchange; asked by Benji Tan.
