
How to extract headings and their corresponding content blocks from a text file with Python regular expressions

Extract Headings and Content Blocks with Python Regex

Let's sort this out. Your current line-by-line approach works for spotting headings, but it can't link each heading to the content that follows it. Instead, read the entire file at once and use a regex pattern that captures each numbered heading, its text, and all the content up to the next heading (or the end of the file).

Step-by-Step Solution

First, adjust how the file is read. Loading the whole content at once lets the regex span multi-line content blocks:

import re

# Read the entire file content in one go
with open('data/single.txt', encoding='UTF-8') as file:
    full_content = file.read()

# Optional: Strip out the initial non-heading header text (like the IEC lines)
# Find the position of the first numbered heading
first_heading_match = re.search(r'^\d+(?:\.\d+)*\.?', full_content, re.MULTILINE)
if first_heading_match:
    full_content = full_content[first_heading_match.start():]
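
To see why re.MULTILINE matters for the preamble-stripping step, here is a small sketch on an in-memory string. The sample text is made up for illustration and only stands in for the contents of your file.

```python
import re

# Made-up sample: two preamble lines followed by the first numbered heading.
sample = "IEC header line\nAnother preamble line\n12.4.6 Acoustic pressure\nBody text.\n"
number_at_line_start = r'^\d+(?:\.\d+)*\.?'

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so the search finds nothing in this sample.
print(re.search(number_at_line_start, sample))

# With re.MULTILINE, ^ also matches right after each newline.
match = re.search(number_at_line_start, sample, re.MULTILINE)
trimmed = sample[match.start():] if match else sample
print(trimmed.splitlines()[0])
```

If no numbered heading is found at all, the text is left unchanged, which mirrors the `if first_heading_match:` guard above.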

Next, we'll use a regex pattern designed to match each complete heading block. This pattern uses multi-line mode (so ^ matches the start of each line) and DOTALL mode (so . matches newlines, which is key for multi-line content):

# Regex pattern to capture heading number, heading text, and content block
heading_pattern = re.compile(
    r'^(\d+(?:\.\d+)*\.?)\s*(.*?)\n(.*?)(?=\n^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL
)

# Iterate over all matches and extract data
for match in heading_pattern.finditer(full_content):
    heading_number = match.group(1).strip()
    heading_title = match.group(2).strip()
    content_block = match.group(3).strip()

    # Print or process the extracted data as needed
    print(f"### {heading_number} {heading_title}")
    print(content_block)
    print("---")  # Separator between blocks for clarity
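
If you would rather keep the results than print them, the same loop can be wrapped in a small helper. This is a minimal sketch: the function name `extract_sections`, the dictionary keys, and the sample text are all hypothetical choices, not part of your original script.

```python
import re

# Same pattern as in the answer above.
HEADING_PATTERN = re.compile(
    r'^(\d+(?:\.\d+)*\.?)\s*(.*?)\n(.*?)(?=\n^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL,
)

def extract_sections(text):
    """Return one dict per heading found in text (hypothetical helper)."""
    return [
        {
            "number": m.group(1).strip(),
            "title": m.group(2).strip(),
            "content": m.group(3).strip(),
        }
        for m in HEADING_PATTERN.finditer(text)
    ]

# Made-up sample text for illustration.
sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "Compliance is checked by inspection.\n"
    "12.5 Next heading\n"
    "More text.\n"
)
sections = extract_sections(sample)

# Index the sections by heading number for quick lookup.
by_number = {s["number"]: s for s in sections}
print(by_number["12.4.6"]["title"])
```

Returning a list keeps the document order, while the `by_number` lookup is convenient when you need one specific clause.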

Breakdown of the Regex Pattern

Let's unpack what each part does:

  • ^(\d+(?:\.\d+)*\.?): Matches the numbered heading at the start of a line (e.g., 12.4.5.4, 13) and captures it as group 1.
  • \s*(.*?)\n: Captures the heading text (e.g., Other ME EQUIPMENT producing diagnostic or therapeutic radiation) as group 2. The .*? is non-greedy to stop at the first newline.
  • (.*?): Captures all content after the heading until...
  • (?=\n^\d+(?:\.\d+)*\.?|\Z): A positive lookahead that stops the match when it hits the next numbered heading or the end of the file (\Z).
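
The same breakdown can live inside the pattern itself by adding re.VERBOSE, which ignores whitespace in the pattern and allows `#` comments. The sketch below assumes nothing beyond the pattern already shown; the name `verbose_pattern` and the sample text are made up for illustration.

```python
import re

verbose_pattern = re.compile(
    r'''
    ^(\d+(?:\.\d+)*\.?)            # group 1: heading number such as 13 or 12.4.5.4
    \s*(.*?)\n                     # group 2: heading text up to the first newline
    (.*?)                          # group 3: content block, matched lazily
    (?=\n^\d+(?:\.\d+)*\.?|\Z)     # stop before the next heading or at end of text
    ''',
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

# Made-up sample text for illustration.
sample = (
    "1 Scope\n"
    "This clause sets the scope.\n"
    "It spans two lines.\n"
    "1.1 Purpose\n"
    "Explains the purpose.\n"
)

results = [
    (m.group(1), m.group(2).strip(), m.group(3).strip())
    for m in verbose_pattern.finditer(sample)
]
for number, title, body in results:
    print(repr(number), repr(title), repr(body))
```

The behavior is the same as the compact one-line pattern; only the layout changes, which makes later tweaks easier to review.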

Handling Edge Cases

  • Headings with special characters (like 13 * HAZARDOUS SITUATIONS and fault conditions): The pattern will automatically include the * in the heading text since it's part of the content after the numbered prefix.
  • Multi-line content blocks (like the list under 13.1.2): The re.DOTALL flag ensures the regex captures all lines until the next heading.
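
One more case is worth checking: a heading that is immediately followed by another heading, with no body text in between. Because the lookahead expects a newline before the next heading, the pattern above captures that next heading line as content instead of reporting it as its own block. The sketch below compares it with a hypothetical variant, `variant_pattern`, that drops the newline from the lookahead. The variant is only one possible adjustment, and the sample text is made up for illustration.

```python
import re

# Pattern from the answer above.
original_pattern = re.compile(
    r'^(\d+(?:\.\d+)*\.?)\s*(.*?)\n(.*?)(?=\n^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL,
)

# Hypothetical variant: the lookahead no longer requires a newline before the
# next heading, so a heading with an empty body yields an empty content group.
variant_pattern = re.compile(
    r'^(\d+(?:\.\d+)*\.?)[ \t]*([^\n]*)\n?(.*?)(?=^\d+(?:\.\d+)*\.?|\Z)',
    re.MULTILINE | re.DOTALL,
)

# Made-up sample: heading 13 has no body of its own.
sample = (
    "13 * HAZARDOUS SITUATIONS\n"
    "13.1 General\n"
    "Body text for 13.1.\n"
    "13.2 Faults\n"
    "More text.\n"
)

original_numbers = [m.group(1) for m in original_pattern.finditer(sample)]
variant_numbers = [m.group(1) for m in variant_pattern.finditer(sample)]
print(original_numbers)  # 13.1 is absent: it was captured as content of 13
print(variant_numbers)   # every heading is reported separately
```

Both patterns still treat any line that begins with a number as a heading, so body lines that happen to start with a digit will split a block early. If your file contains such lines, the heading part of the pattern needs a stricter rule that fits your numbering scheme.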

Example Output

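For your sample text, the output should look roughly like this. The exact block boundaries depend on where the line breaks fall in your file, because every line that begins with a number is treated as the start of a new heading: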

### 12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). Compliance is checked by inspection of the RISK MANAGEMENT FILE.
---
### 12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure. Compliance is checked by inspection of the RISK MANAGEMENT FILE.
---
### 13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS General
13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT. The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7.
Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2 The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
– temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3;
– – exceeding the allowable values for “other components and materials” identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply. Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT CONDITION to less than 15 W or the energy dissipation to less than 900 J.
---

The question comes from Stack Exchange; asked by wuddadid.
