How to extract headings and their corresponding content blocks from a text file using Python regular expressions
Got it, let's sort this out for you! Your current line-by-line approach works for spotting headings, but it can't link each heading to the content that follows it. Instead, we'll read the entire file at once and use a regex pattern that captures each numbered heading, its text, and all the content until the next heading (or the end of the file).
Step-by-Step Solution
First, let's adjust how we read the file—grabbing the whole content at once lets us handle multi-line content blocks properly:
```python
import re

# Read the entire file content in one go
with open('data/single.txt', encoding='UTF-8') as file:
    full_content = file.read()

# Optional: Strip out the initial non-heading header text (like the IEC lines)
# Find the position of the first numbered heading
first_heading_match = re.search(r'^\d+(?:\.\d+)*\.?', full_content, re.MULTILINE)
if first_heading_match:
    full_content = full_content[first_heading_match.start():]
```
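To see what the optional trimming step does without touching a file, here is a minimal sketch that runs the same search on an in-memory string. The `sample` text is made up for illustration and only stands in for the real file contents.

```python
import re

# Hypothetical stand-in for the file contents: two preamble lines,
# then the first numbered heading.
sample = (
    "IEC front matter line\n"
    "Another preamble line\n"
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "Body text.\n"
)

# Same search as in the answer: first line that starts with a heading number.
first_heading = re.search(r'^\d+(?:\.\d+)*\.?', sample, re.MULTILINE)

# Keep everything from that heading onwards; leave the text alone if none is found.
trimmed = sample[first_heading.start():] if first_heading else sample

print(trimmed.splitlines()[0])
```

Note that this search picks up any line starting with a digit, so a preamble line that begins with a number (a year, for example) would be treated as the first heading.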
Next, we'll use a regex pattern designed to match each complete heading block. This pattern uses multi-line mode (so ^ matches the start of each line) and DOTALL mode (so . matches newlines, which is key for multi-line content):
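The effect of the two flags can be checked in isolation. The snippet below is a small illustration on a made-up string, not part of the solution itself.

```python
import re

text = "intro\n1 Title\nline a\nline b\n"

# Without re.MULTILINE, ^ anchors only at the very start of the string.
without_multiline = re.findall(r'^\d+', text)
# With re.MULTILINE, ^ also anchors right after each newline.
with_multiline = re.findall(r'^\d+', text, re.MULTILINE)

# Without re.DOTALL, . refuses to cross a newline, so only one line is captured.
without_dotall = re.search(r'Title\n(.*)', text).group(1)
# With re.DOTALL, . matches newlines too, so the capture runs to the end.
with_dotall = re.search(r'Title\n(.*)', text, re.DOTALL).group(1)

print(without_multiline)     # []
print(with_multiline)        # ['1']
print(repr(without_dotall))  # 'line a'
print(repr(with_dotall))     # 'line a\nline b\n'
```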
```python
# Regex pattern to capture heading number, heading text, and content block
heading_pattern = re.compile(
    r'^(\d+(?:\.\d+)*\.?)\s*(.*?)\n(.*?)(?=\n^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL
)

# Iterate over all matches and extract data
for match in heading_pattern.finditer(full_content):
    heading_number = match.group(1).strip()
    heading_title = match.group(2).strip()
    content_block = match.group(3).strip()

    # Print or process the extracted data as needed
    print(f"### {heading_number} {heading_title}")
    print(content_block)
    print("---")  # Separator between blocks for clarity
```
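If you would rather collect the results than print them, the same pattern can be wrapped in a small helper. This is only a sketch: the function name `extract_sections` and the shortened `sample` text are illustrative assumptions, not part of the original code.

```python
import re

HEADING_PATTERN = re.compile(
    r'^(\d+(?:\.\d+)*\.?)\s*(.*?)\n(.*?)(?=\n^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL,
)


def extract_sections(text):
    """Return a list of (number, title, content) tuples, one per heading block."""
    return [
        (m.group(1).strip(), m.group(2).strip(), m.group(3).strip())
        for m in HEADING_PATTERN.finditer(text)
    ]


# Shortened, made-up sample in the style of the question's text.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS.\n"
    "Compliance is checked by inspection.\n"
    "13 * HAZARDOUS SITUATIONS and fault conditions\n"
    "General\n"
)

sections = extract_sections(sample)
for number, title, content in sections:
    print(number, "|", title, "|", repr(content))
```

Returning tuples keeps the parsing separate from the output, so the same list can feed a CSV writer, a dictionary keyed by heading number, or whatever else you need.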
Breakdown of the Regex Pattern
Let's unpack what each part does:
- `^(\d+(?:\.\d+)*\.?)`: Matches the numbered heading at the start of a line (e.g., `12.4.5.4`, `13`) and captures it as group 1.
- `\s*(.*?)\n`: Captures the heading text (e.g., `Other ME EQUIPMENT producing diagnostic or therapeutic radiation`) as group 2. The `.*?` is non-greedy, so it stops at the first newline.
- `(.*?)`: Captures all content after the heading until...
- `(?=\n^\d+(?:\.\d+)*\.?|\Z)`: A positive lookahead that stops the match when it hits the next numbered heading or the end of the file (`\Z`).
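The same breakdown can live inside the pattern itself by adding `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The sketch below should behave like the compact pattern; the name `HEADING_PATTERN_VERBOSE` and the sample string are made up for illustration.

```python
import re

HEADING_PATTERN_VERBOSE = re.compile(
    r'''
    ^(\d+(?:\.\d+)*\.?)      # group 1: heading number such as 12.4.5.4 or 13
    \s*(.*?)\n               # group 2: heading text up to the first newline
    (.*?)                    # group 3: content block, as short as possible
    (?=\n^\d+(?:\.\d+)*\.?   # stop before a newline followed by the next heading
      |\Z)                   # ... or at the very end of the text
    ''',
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

sample = "1 Scope\nFirst line.\nSecond line.\n2. Terms\nDefinitions.\n"

verbose_groups = [m.groups() for m in HEADING_PATTERN_VERBOSE.finditer(sample)]
for groups in verbose_groups:
    print(groups)
```

Unlike the loop in the answer, this prints the raw groups without `.strip()`, so the last content block still carries its trailing newline.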
Handling Edge Cases
- Headings with special characters (like `13 * HAZARDOUS SITUATIONS and fault conditions`): The pattern automatically includes the `*` in the heading text, since it is part of the text after the numbered prefix.
- Multi-line content blocks (like the list under `13.1.2`): The `re.DOTALL` flag ensures the regex captures all lines until the next heading.
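Both cases can be checked on a small string. The sample text below is a shortened, made-up imitation of the question's data. The second half of the sketch shows a limitation worth knowing about: because the pattern treats any line that starts with a number as a heading, a wrapped content line such as `15 W or less.` ends up as a block of its own.

```python
import re

pattern = re.compile(
    r'^(\d+(?:\.\d+)*\.?)\s*(.*?)\n(.*?)(?=\n^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL,
)


def blocks(text):
    # Collect (number, title, content) with surrounding whitespace removed.
    return [tuple(g.strip() for g in m.groups()) for m in pattern.finditer(text)]


# Case 1: a heading containing "*" and a multi-line list as content.
sample = (
    "13 * HAZARDOUS SITUATIONS and fault conditions\n"
    "General\n"
    "13.1.2 Emissions\n"
    "The following HAZARDOUS SITUATIONS shall not occur:\n"
    "- emission of flames;\n"
    "- deformation of ENCLOSURES.\n"
)
good = blocks(sample)

# Case 2 (limitation): a content line that happens to start with a number
# is picked up as if it were a heading.
wrapped = "8 Limits\nThe power is limited to\n15 W or less.\n"
surprising = blocks(wrapped)

print(good)
print(surprising)
```

If your file has wrapped lines like this, one option is to filter the matches afterwards, for example by keeping only heading numbers that appear in the document's table of contents.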
Example Output
For your sample text, this code will output something like:
```
### 12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). Compliance is checked by inspection of the RISK MANAGEMENT FILE.
---
### 12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure. Compliance is checked by inspection of the RISK MANAGEMENT FILE.
---
### 13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS General 13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT. The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7. Emissions, deformation of ENCLOSURE or exceeding maximum temperature 13.1.2 The following HAZARDOUS SITUATIONS shall not occur: – emission of flames, molten metal, poisonous or ignitable substance in hazardous quantities; – deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; – temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3; – – exceeding the allowable values for “other components and materials” identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3. The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where: – The construction or the supply circuit limits the power dissipation in SINGLE FAULT CONDITION to less than 15 W or the energy dissipation to less than 900 J.
---
```
The question comes from Stack Exchange; it was asked by wuddadid.




