How do I use BeautifulSoup to add a specified keys attribute to topicref tags in Ditamap files?

阿华AIGC实验室

2026-4-30

Extract section_-prefixed substring from topicref href to set keys attribute in Ditamap files

I needed to process Ditamap files: for each <topicref> tag, extract the substring starting with section_ from its href attribute and set that substring as the value of the keys attribute on the same tag.

Input Example

<topicref href="xyz/debug_logging_in_xyz-section_i_y_mn.dita"/>
<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita"/>
<topicref href="xyz/images_id-section_ekl_bbz_lo.dita"/>

Desired Output

<topicref href="xyz/debug_logging_in_xyz-section_i_y_mn.dita" keys="section_i_y_mn"/>
<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita" keys="section_exf_zaz_lo"/>
<topicref href="xyz/images_id-section_ekl_bbz_lo.dita" keys="section_ekl_bbz_lo"/>

Initial Attempt (Issue with String Replacement)

My first attempt used BeautifulSoup, but the substring extraction was wrong. I called str.replace() with wildcard-like patterns, but replace() only matches literal text. The strings "*-" and ".dita*" never occur in the href, so nothing was removed and the full href path was assigned to keys instead of the target substring:

import os
from bs4 import BeautifulSoup as bs

globpath = "C:/DATA" #add your directory path here
def main(path):
    with open(path, encoding="utf-8") as f:
        s = f.read()
    s = bs(s, "xml")
    imgs = s.find_all("topicref")
    for i in imgs:
        if "section" in i["href"]:
            # This doesn't work - string replace doesn't support wildcards
            i["keys"] = i["href"].replace("*-","").replace(".dita*","")
    s = str(s)
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)

for dirpath, directories, files in os.walk(globpath):
    for fname in files:
        if fname.endswith(".ditamap"):
            path = os.path.join(dirpath, fname)
            main(path)
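
To see why this attempt copies the whole path, it helps to run the two replace() calls on one of the sample hrefs in isolation. This is just a quick check; the variable names are introduced here for illustration:

```python
href = "xyz/debug_logging_in_xyz-section_i_y_mn.dita"

# str.replace() searches for the literal text "*-" and the literal text
# ".dita*"; neither occurs in the href, so nothing is removed.
result = href.replace("*-", "").replace(".dita*", "")

print(result == href)  # True: the href comes back unchanged
```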

Fixed Solution with Regular Expressions

To correctly extract the section_-prefixed substring up to the .dita extension, I switched to using regular expressions with re.findall(). This lets me precisely match the part of the href I need:

from bs4 import BeautifulSoup as bs
import re
import os

globpath = "C:/DATA" #add your directory path here

def main(path):
    with open(path, encoding="utf-8") as f:
        s = f.read()
    # Parse the XML content with BeautifulSoup
    soup = bs(s, "xml")
    # Find all topicref tags
    topic_refs = soup.find_all("topicref")
    
    for ref in topic_refs:
        href = ref.get("href")
        if href and "section" in href:
            try:
                # Extract substring starting with 'section' up to the first '.'
                section_substring = re.findall(r"section[^\.]*", href)[0]
                ref["keys"] = section_substring
            except IndexError:
                # Handle cases where the pattern isn't found
                print(f"Could not extract section substring for href: {href} in file {path}")
    
    # Write the modified content back to the file
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))

# Walk through all directories to process .ditamap files
for dirpath, directories, files in os.walk(globpath):
    for fname in files:
        if fname.endswith(".ditamap"):
            file_path = os.path.join(dirpath, fname)
            main(file_path)
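
As a sanity check, the same pattern can be run against the three hrefs from the Input Example without touching any files. This is only a sketch; samples and keys are names introduced here for illustration:

```python
import re

# The three hrefs from the Input Example
samples = [
    "xyz/debug_logging_in_xyz-section_i_y_mn.dita",
    "xyz/workflows_id-section_exf_zaz_lo.dita",
    "xyz/images_id-section_ekl_bbz_lo.dita",
]

# Same pattern as in main(): "section" followed by everything up to the first "."
keys = [re.findall(r"section[^\.]*", href)[0] for href in samples]

for href, key in zip(samples, keys):
    print(f"{href} -> {key}")
```

Each printed key should match the keys value shown in the Desired Output.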

Key Improvements:

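Regular Expression Matching: re.findall(r"section[^\.]*", href) returns every run of characters that starts with section and continues up to (but not including) the next "."; taking element [0] gives the first such run, which for these hrefs is exactly the substring before .dita.
Error Handling: The try-except block guards against an IndexError from an empty match list. Because the pattern only requires the literal section, which the preceding if already checks for, this branch cannot trigger as written; it becomes useful only if the pattern is made stricter than the guard (for example, by requiring section_).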
Readability: Renamed variables for clarity (e.g., imgs → topic_refs, i → ref) and added comments to explain each step.
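
One edge case is worth a sketch. Since the pattern matches any occurrence of section, a path such as xyz/sections/intro.dita (a made-up example, not from my data) would still match and yield the unwanted key sections/intro. A stricter variant, assuming the wanted part always starts with section_ and ends at the first ".", could look like this; extract_section_key is a hypothetical helper name, not part of the script above:

```python
import re
from typing import Optional

# Assumption: the key always starts with "section_" and runs to the first "."
SECTION_KEY = re.compile(r"section_[^.]*")

def extract_section_key(href: str) -> Optional[str]:
    """Return the section_ substring of href, or None if there is none."""
    match = SECTION_KEY.search(href)
    return match.group(0) if match else None

print(extract_section_key("xyz/images_id-section_ekl_bbz_lo.dita"))  # section_ekl_bbz_lo
print(extract_section_key("xyz/sections/intro.dita"))                # None
```

Inside main(), the result could be checked with "if key is not None" before assigning ref["keys"], which would take the place of the try-except.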