How can I extract part of a docx document's content (for example, 10% of the whole) with Python 3?

Extract 10% of Content from a Docx File with Python 3

Here's how to pull 10% of the total content from a docx file using Python 3. The examples use the python-docx library, which is the usual tool for reading .docx files and is straightforward to work with.

First, install the library if you haven’t already:

pip install python-docx

Basic Implementation (Extract First 10% by Character Count)

This approach reads all text from the document, calculates 10% of the total character length, and grabs that portion from the start of the content.

from docx import Document

def get_full_docx_text(doc_path):
    """Read all paragraph text from a docx file and return as a single string."""
    doc = Document(doc_path)
    content_snippets = []
    # Pull text from every paragraph in the document
    for para in doc.paragraphs:
        content_snippets.append(para.text)
    # Join all snippets into one continuous block of text
    return '\n'.join(content_snippets)

# Replace with your actual docx file path
target_document = "your_file.docx"
full_content = get_full_docx_text(target_document)

# Calculate 10% of the total content length
total_chars = len(full_content)
ten_percent_length = int(total_chars * 0.1)

# Extract the first 10% of the content
extracted_content = full_content[:ten_percent_length]

print("Extracted 10% of the document:\n")
print(extracted_content)
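
The slicing step above can also be wrapped in a small helper so that the percentage becomes a parameter. This is only a sketch: take_fraction is a hypothetical name, not part of python-docx, and it works on any string, such as the full_content value built above.

```python
def take_fraction(text, fraction=0.1):
    """Return the leading `fraction` of `text`, measured in characters.

    Hypothetical helper for illustration; not part of python-docx.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    # int() truncates, so very short texts can yield an empty result
    length = int(len(text) * fraction)
    return text[:length]


sample = "abcdefghij" * 10  # 100 characters
print(take_fraction(sample))        # the first 10 characters
print(take_fraction(sample, 0.25))  # the first 25 characters
```

Because the length is truncated toward zero, a text shorter than 10 characters gives an empty string at the default 10%.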

Want a Random 10% Slice Instead?

If you don’t want just the opening section, use the random module to pick a random starting point for your 10% slice:

import random

if total_chars > 0:
    # Make sure we don't go out of bounds when choosing the start index
    start_position = random.randint(0, total_chars - ten_percent_length)
    extracted_content = full_content[start_position:start_position + ten_percent_length]
    
    print("Random 10% of the document:\n")
    print(extracted_content)
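
If you need the random slice to be repeatable (for testing, say), one option is to pass in a seeded generator. The sketch below assumes a hypothetical helper name, random_fraction, and works on any string rather than on a docx file directly.

```python
import random


def random_fraction(text, fraction=0.1, rng=None):
    """Return a random contiguous slice covering `fraction` of `text`.

    Hypothetical helper; `rng` may be a random.Random instance for
    reproducible results, otherwise the module-level generator is used.
    """
    if rng is None:
        rng = random
    length = int(len(text) * fraction)
    if length == 0:
        return ""
    # randint is inclusive on both ends, so the slice never runs past the end
    start = rng.randint(0, len(text) - length)
    return text[start:start + length]


sample = "".join(str(i % 10) for i in range(200))  # 200 characters
piece = random_fraction(sample, rng=random.Random(42))
print(len(piece))  # 20
```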

Include Table Content (If Your Doc Has Tables)

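The basic function above skips table content, because doc.paragraphs only returns paragraphs in the document body, not those inside table cells. If your document includes tables you want to count toward the total content, use this updated function. Note that it appends all table text after the paragraph text, so the result does not follow the original document order: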

def get_full_content_with_tables(doc_path):
    """Read both paragraph text and table cell content from a docx file."""
    doc = Document(doc_path)
    content_snippets = []
    
    # Add paragraph text
    for para in doc.paragraphs:
        content_snippets.append(para.text)
    
    # Add text from every table cell
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                content_snippets.append(cell.text)
    
    return '\n'.join(content_snippets)

# Swap in this function if you need to include table content
full_content = get_full_content_with_tables(target_document)
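
Putting every cell on its own line loses the row structure of a table. As a rough alternative, the sketch below keeps each row on one tab-separated line. The name table_rows_as_lines is hypothetical, and the stand-in objects exist only so the example runs without a .docx file. With a real file you would pass Document("your_file.docx") instead.

```python
from types import SimpleNamespace


def table_rows_as_lines(doc):
    """Return one tab-separated line per table row.

    `doc` is expected to look like a python-docx Document: it needs
    `.tables`, each with `.rows`, each with `.cells` that have `.text`.
    """
    lines = []
    for table in doc.tables:
        for row in table.rows:
            lines.append("\t".join(cell.text for cell in row.cells))
    return lines


# Stand-in objects (an assumption for demonstration, not python-docx types)
def _cell(text):
    return SimpleNamespace(text=text)


fake_row = SimpleNamespace(cells=[_cell("Name"), _cell("Qty")])
fake_doc = SimpleNamespace(tables=[SimpleNamespace(rows=[fake_row])])
print(table_rows_as_lines(fake_doc))  # ['Name\tQty']
```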

Quick Notes

  • This calculates 10% based on character count. If you need to calculate by paragraph count instead, just count the number of paragraphs, take 10% of that number, and slice the paragraph list directly.
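  • For complex docs with images, headers/footers, or styled text: para.text returns plain text without styling, and doc.paragraphs does not include header or footer text. python-docx exposes headers and footers through each section (for example, doc.sections[0].header.paragraphs), so you'll need extra logic to extract that content if it's part of your "total content" definition.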
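
The paragraph-count variant mentioned in the first note could look roughly like this. It is a sketch under two assumptions: the helper name first_fraction_of_paragraphs is hypothetical, and empty paragraphs are skipped before counting, which you may or may not want. With python-docx, the input list would be [p.text for p in doc.paragraphs].

```python
def first_fraction_of_paragraphs(paragraphs, fraction=0.1):
    """Return the leading `fraction` of the non-empty paragraphs.

    `paragraphs` is a list of strings. Skipping blank paragraphs is an
    assumption made for this sketch.
    """
    non_empty = [p for p in paragraphs if p.strip()]
    # int() truncates, so fewer than 10 paragraphs gives an empty list at 10%
    count = int(len(non_empty) * fraction)
    return non_empty[:count]


sample = [f"Paragraph {i}" for i in range(1, 21)] + ["", "  "]
print(first_fraction_of_paragraphs(sample))  # ['Paragraph 1', 'Paragraph 2']
```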

The question comes from Stack Exchange; asked by bohdan.k.
