How do I use BeautifulSoup to add a specific keys attribute to topicref tags in Ditamap files?
Setting the keys attribute from the section_-prefixed substring of the topicref href in Ditamap files
I need to process Ditamap files: for each <topicref> tag, I have to extract the substring starting with section_ from its href attribute and set that substring as the value of the keys attribute on the same tag.
Input Example
<topicref href="xyz/debug_logging_in_xyz-section_i_y_mn.dita"/>
<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita"/>
<topicref href="xyz/images_id-section_ekl_bbz_lo.dita"/>
Desired Output
<topicref href="xyz/debug_logging_in_xyz-section_i_y_mn.dita" keys="section_i_y_mn"/>
<topicref href="xyz/workflows_id-section_exf_zaz_lo.dita" keys="section_exf_zaz_lo"/>
<topicref href="xyz/images_id-section_ekl_bbz_lo.dita" keys="section_ekl_bbz_lo"/>
Initial Attempt (Issue with String Replacement)
I first tried using BeautifulSoup but messed up the substring extraction logic. The problem was that I used string replace() with wildcard-like patterns (which don't work for standard string operations), resulting in the full href path being assigned to keys instead of the target substring:
import os
from bs4 import BeautifulSoup as bs

globpath = "C:/DATA"  # add your directory path here

def main(path):
    with open(path, encoding="utf-8") as f:
        s = f.read()
    s = bs(s, "xml")
    imgs = s.find_all("topicref")
    for i in imgs:
        if "section" in i["href"]:
            # This doesn't work - string replace doesn't support wildcards
            i["keys"] = i["href"].replace("*-","").replace(".dita*","")
    s = str(s)
    with open(path, "w", encoding="utf-8") as f:
        f.write(s)

for dirpath, directories, files in os.walk(globpath):
    for fname in files:
        if fname.endswith(".ditamap"):
            path = os.path.join(dirpath, fname)
            main(path)
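The failure is easy to reproduce in isolation. str.replace() treats its first argument as literal text, so an asterisk only matches an actual asterisk. A minimal sketch using one of the sample hrefs from the input example:

```python
href = "xyz/debug_logging_in_xyz-section_i_y_mn.dita"

# "*-" and ".dita*" are searched for literally. Neither occurs in href,
# so both replace() calls return the string unchanged.
result = href.replace("*-", "").replace(".dita*", "")

print(result == href)  # the full path is what would end up in keys
```

Because nothing is replaced, the whole href value is what gets assigned to keys.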
Fixed Solution with Regular Expressions
To correctly extract the section_-prefixed substring up to the .dita extension, I switched to using regular expressions with re.findall(). This lets me precisely match the part of the href I need:
from bs4 import BeautifulSoup as bs
import re
import os

globpath = "C:/DATA"  # add your directory path here

def main(path):
    with open(path, encoding="utf-8") as f:
        s = f.read()

    # Parse the XML content with BeautifulSoup
    soup = bs(s, "xml")

    # Find all topicref tags
    topic_refs = soup.find_all("topicref")
    for ref in topic_refs:
        href = ref.get("href")
        if href and "section" in href:
            try:
                # Extract substring starting with 'section' up to the first '.'
                section_substring = re.findall(r"section[^\.]*", href)[0]
                ref["keys"] = section_substring
            except IndexError:
                # Handle cases where the pattern isn't found
                print(f"Could not extract section substring for href: {href} in file {path}")

    # Write the modified content back to the file
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(soup))

# Walk through all directories to process .ditamap files
for dirpath, directories, files in os.walk(globpath):
    for fname in files:
        if fname.endswith(".ditamap"):
            file_path = os.path.join(dirpath, fname)
            main(file_path)
Key Improvements:
- Regular Expression Matching: re.findall(r"section[^\.]*", href) finds all sequences starting with "section" and continuing until the first "." (the one that precedes the dita extension), giving us exactly the substring we need.
- Error Handling: The try-except block catches cases where the pattern might not match (though unlikely if your input follows the expected format), so the script doesn't crash unexpectedly.
- Readability: Renamed variables for clarity (e.g., imgs → topic_refs, i → ref) and added comments to explain each step.
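To check the pattern against the sample hrefs without touching any files, the extraction step can be tried on its own. This is a sketch, not part of the script above: the helper name extract_section_key is hypothetical, and it uses re.search() with the slightly stricter prefix section_ instead of re.findall().

```python
import re

# Hypothetical helper: returns the "section_..." part of an href,
# or None when the href contains no such part.
SECTION_RE = re.compile(r"section_[^.]*")

def extract_section_key(href):
    match = SECTION_RE.search(href)
    return match.group(0) if match else None

hrefs = [
    "xyz/debug_logging_in_xyz-section_i_y_mn.dita",
    "xyz/workflows_id-section_exf_zaz_lo.dita",
    "xyz/images_id-section_ekl_bbz_lo.dita",
]
keys = [extract_section_key(h) for h in hrefs]
print(keys)
```

Returning None for a non-matching href makes the "no match" case explicit, so the caller can test for it directly instead of catching IndexError.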
This question comes from Stack Exchange; asked by Shilpa.




