Parsing docx Files in Python: Troubleshooting Duplicate Entries When Extracting Bold Headings

阿华AIGC实验室

2026-5-28

Problem Analysis and Solution

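The root cause of the duplicate headings in your code is straightforward: a single heading paragraph can contain several bold run objects. When you iterate over paragraph.runs, head1s.append(head1) executes once for every bold run you encounter, so the same heading ends up in the list multiple times.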

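Take a concrete case: suppose the heading paragraph D. Fox is stored as two runs, the first being "D. " (bold) and the second "Fox" (bold). Your loop visits both runs, and each time both the run.bold check and the regex match succeed, so Fox is appended twice, which produces the [Cat, Dog, Pig, Fox, Fox] result you saw.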
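To make the failure mode concrete, here is a minimal sketch that reproduces it without python-docx. FakeRun and FakeParagraph are hypothetical stand-ins that only mimic the text, bold, and runs attributes, and collect_per_run is an assumption about what the original loop roughly looked like, not your exact code.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

HEADING_RE = re.compile(r'^[A-Z]+[.]\s')


@dataclass
class FakeRun:
    """Hypothetical stand-in for a python-docx run."""
    text: str
    bold: Optional[bool] = None


@dataclass
class FakeParagraph:
    """Hypothetical stand-in for a python-docx paragraph."""
    runs: List[FakeRun] = field(default_factory=list)

    @property
    def text(self) -> str:
        # A paragraph's text is the concatenation of its runs' text
        return "".join(run.text for run in self.runs)


def collect_per_run(paragraphs):
    """Assumed shape of the buggy loop: one append per bold run."""
    headings = []
    for paragraph in paragraphs:
        for run in paragraph.runs:
            if run.bold and HEADING_RE.match(paragraph.text):
                headings.append(paragraph.text.split('.', 1)[1].strip())
    return headings


paragraphs = [
    FakeParagraph([FakeRun("A. Cat", bold=True)]),
    # One heading stored as two bold runs
    FakeParagraph([FakeRun("D. ", bold=True), FakeRun("Fox", bold=True)]),
]
duplicated = collect_per_run(paragraphs)
print(duplicated)
```

The second paragraph contributes Fox once per bold run, which is the duplication described above.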

How to Fix It

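All we need to confirm is that **the paragraph contains at least one bold run**; there is no need to append once per bold run. Reorder the logic as follows:

First check whether the paragraph text matches the heading regex
Then check whether the paragraph contains at least one bold run
Only when both conditions hold, append the heading to the list exactly once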

Corrected Code

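import os
import re
from docx import Document

# Keep the path's original case: lowercasing it breaks lookups on case-sensitive file systems
directory = input("Copy and paste the location of the files.\n").strip()
for file in os.listdir(directory):
    # Skip anything that is not a .docx file, as well as Word's "~$" lock files
    if not file.lower().endswith(".docx") or file.startswith("~$"):
        continue
    # Build the path with os.path.join to avoid path-separator differences across systems
    document = Document(os.path.join(directory, file))
    head1s = []
    for paragraph in document.paragraphs:
        # Match the heading prefix format first
        heading_match = re.match(r'^[A-Z]+[.]\s', paragraph.text)
        if not heading_match:
            continue  # Not in heading format, skip it

        # Check whether the paragraph contains at least one bold run
        has_bold = any(run.bold for run in paragraph.runs)
        if has_bold:
            # Split only once so a period inside the heading text does not truncate it,
            # then strip the surrounding whitespace
            head1 = paragraph.text.split('.', 1)[1].strip()
            head1s.append(head1)

    print(head1s)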
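If you want to check the per-paragraph logic without opening real files, one option is to pull it into a function and feed it stand-in objects. The sketch below assumes exactly that: extract_headings and make_paragraph are hypothetical names, and the SimpleNamespace objects only imitate the text, runs, and bold attributes that the loop reads.

```python
import re
from types import SimpleNamespace

HEADING_RE = re.compile(r'^[A-Z]+[.]\s')


def extract_headings(paragraphs):
    """Return each matching heading once, however many bold runs it has."""
    headings = []
    for paragraph in paragraphs:
        if not HEADING_RE.match(paragraph.text):
            continue
        if any(run.bold for run in paragraph.runs):
            headings.append(paragraph.text.split('.', 1)[1].strip())
    return headings


def make_paragraph(*runs):
    """Hypothetical test helper: each run is a (text, bold) tuple."""
    run_objs = [SimpleNamespace(text=text, bold=bold) for text, bold in runs]
    return SimpleNamespace(text="".join(text for text, _ in runs), runs=run_objs)


sample = [
    make_paragraph(("A. Cat", True)),
    make_paragraph(("D. ", True), ("Fox", True)),  # two bold runs, one heading
    make_paragraph(("E. plain text", None)),       # bold not set on the run itself
    make_paragraph(("not a heading", True)),       # bold, but no heading prefix
]
result = extract_headings(sample)
print(result)
```

One caveat worth keeping in mind: in python-docx, run.bold is None when bold is not set directly on the run and is inherited from a style instead. A heading that only looks bold because of its paragraph style would therefore be skipped by this check, as the third sample paragraph illustrates.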