How do you extract five specific fields from an SEC EDGAR Schedule 13 filing?

阿华AIGC实验室

2026-5-14

A complete Python implementation for extracting five fields from an SEC EDGAR Schedule 13 filing

Got it, let's extract all 5 required fields from the SEC EDGAR Schedule 13 filing. Since you already have a start with CUSIP and percentage extraction, we can build on that by using BeautifulSoup tag lookups where the filing carries XML tags and targeted regex where it does not. Note that the example URL in the code ends in .htm, so for that filing the tag lookups will most likely find nothing and the regex fallbacks do the work. Here's a complete solution:

Extraction logic, step by step

Let's break down how to grab each of the 5 fields:

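1. Reporting Person Name

The reporting person's name is usually nested under the <reportingPerson> -> <name> tags in the XML structure, so locating that tag directly with BeautifulSoup is more reliable. If the tag is missing, fall back to a regex match.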

2. Issuer Name

Likewise, the issuer name generally sits under the <issuer> -> <name> tags, so prefer XML parsing to avoid the edge cases of regex.
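The tag-first, regex-second lookup used for both names can be sketched roughly as follows. This sketch uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra dependencies. The sample XML, the helper name nested_name, and the fallback labels are assumptions for illustration; real filings may use different tag names or XML namespaces.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; not taken from a real filing.
SAMPLE_XML = """<filing>
  <reportingPerson><name> Example Capital LP </name></reportingPerson>
  <issuer><name>Example Issuer Inc.</name></issuer>
</filing>"""


def nested_name(root, parent_tag, fallback_pattern, raw_text):
    """Return the text of <parent_tag>/<name>, else the first regex group, else None."""
    value = root.findtext(f".//{parent_tag}/name")
    if value and value.strip():
        return value.strip()
    # Tag missing or empty: fall back to a labelled line in the raw text.
    match = re.search(fallback_pattern, raw_text, re.IGNORECASE)
    return match.group(1).strip() if match else None


root = ET.fromstring(SAMPLE_XML)
reporter = nested_name(root, "reportingPerson", r"Reporting Person:\s*(.+)", SAMPLE_XML)
issuer = nested_name(root, "issuer", r"Issuer Name:\s*(.+)", SAMPLE_XML)
print(reporter)
print(issuer)
```

Returning None instead of a "Not found" string lets the caller tell a missing field apart from a real value.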

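3. CUSIP Number

Your existing regex already matches the CUSIP format. We refine it to take only the first match. Some filings may contain more than one CUSIP, and we need the one for the target issuer, so treat "first match" as a heuristic: it does not by itself guarantee that the match belongs to that issuer.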
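One way to make "first match" less fragile is to keep only candidates whose ninth character agrees with the standard CUSIP check digit. The sketch below is a suggestion rather than part of the original approach: the helper names are hypothetical, and the alphanumeric boundary lookarounds added to the pattern are an assumption to avoid matching inside longer strings.

```python
import re

# Same character classes as the pattern in the full script, plus boundary guards.
CUSIP_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])[0-9]{3}[A-Za-z0-9]{2}[A-Za-z0-9*@#]{3}[0-9](?![A-Za-z0-9])"
)
SPECIAL_VALUES = {"*": 36, "@": 37, "#": 38}


def cusip_check_digit(first8):
    """Compute the CUSIP check digit for the first eight characters."""
    total = 0
    for index, ch in enumerate(first8.upper()):
        if ch.isdigit():
            value = int(ch)
        elif ch.isalpha():
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        else:
            value = SPECIAL_VALUES[ch]
        if index % 2 == 1:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def first_valid_cusip(text):
    """Return the first candidate whose check digit is consistent, else None."""
    for match in CUSIP_PATTERN.finditer(text):
        candidate = match.group(0)
        if cusip_check_digit(candidate[:8]) == int(candidate[8]):
            return candidate
    return None


# 037833100 is the CUSIP of Apple Inc. common stock; 123456789 fails the check.
print(first_valid_cusip("Ref 123456789, CUSIP 037833100"))
```

A passing check digit only filters out random nine-character strings; it still does not prove the CUSIP belongs to the target issuer. CUSIPs printed with internal spaces would need normalizing before this pattern can match them.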

4. Percent of Class

Prefer the XML <ownership> -> <percentOwnership> tag. If the tag is missing, fall back to a regex that captures the numeric percentage.
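As a rough illustration of the regex fallback, the sketch below captures the first percentage that follows a "PERCENT OF CLASS" label, even when other text such as "REPRESENTED BY AMOUNT IN ROW (11)" sits in between. The sample text, the 200-character search window, and the function name extract_percent are assumptions for illustration.

```python
import re

# Label, then up to 200 non-% characters, then a number followed by "%".
PERCENT_PATTERN = re.compile(
    r"PERCENT OF CLASS[^%]{0,200}?(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_percent(text):
    """Return the percentage as a float, or None if no labelled value is found."""
    match = PERCENT_PATTERN.search(text)
    return float(match.group(1)) if match else None


# Hypothetical cover-page excerpt; the "(11)" row reference must not be captured.
sample = "13  PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)\n\n    9.9%"
print(extract_percent(sample))
```

The window keeps the match from running on to an unrelated percentage much later in the document. Filings that print the value without a percent sign would need a different pattern.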

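5. Event Date (date of the event that triggers the filing)

This is usually in the <eventDate> tag. The raw value is assumed here to be in YYYYMMDD format, which we can convert to the "Month Day, Year" format you need. If the tag is missing, fall back to a regex match.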
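The date conversion can be sketched on its own like this. The function name format_event_date and the two extra input formats are assumptions; the text above only mentions YYYYMMDD. Month names come out in English under the default C locale.

```python
from datetime import datetime


def format_event_date(raw, input_formats=("%Y%m%d", "%Y-%m-%d", "%m/%d/%Y")):
    """Convert a raw date string to 'Month D, YYYY'; return it unchanged if unparseable."""
    cleaned = raw.strip()
    for fmt in input_formats:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next candidate format
        # Use parsed.day rather than %d so single-digit days are not zero-padded.
        return f"{parsed:%B} {parsed.day}, {parsed.year}"
    return cleaned


print(format_event_date("20191224"))
print(format_event_date("20190305"))
```

Returning the cleaned input when nothing parses mirrors the "keep raw format if parsing fails" behaviour of the full script.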

Full code

import requests
import re
from bs4 import BeautifulSoup
from datetime import datetime

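# Fetch the filing page. SEC EDGAR expects a descriptive User-Agent header;
# the value below is a placeholder, so replace it with your own name and email.
url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
headers = {'User-Agent': 'Your Name your.email@example.com'}
page = requests.get(url, headers=headers, timeout=30)
page.raise_for_status()
# This URL is an HTML (.htm) document, so use the XML parser only for real XML payloads
is_xml = page.text.lstrip().startswith('<?xml')
soup = BeautifulSoup(page.text, 'xml' if is_xml else 'html.parser')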

# 1. Extract the reporting person's name
reporting_person = soup.find('reportingPerson')
reporter_tag = reporting_person.find('name') if reporting_person is not None else None
if reporter_tag is not None:
    reporter_name = reporter_tag.get_text(strip=True)
else:
    # Fall back to regex if the XML tags are missing
    reporter_name_match = re.search(r'Reporting Person:\s*(.*?)\n', soup.text, re.IGNORECASE)
    reporter_name = reporter_name_match.group(1).strip() if reporter_name_match else "Not found"

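# 2. Extract the issuer name
issuer = soup.find('issuer')
issuer_tag = issuer.find('name') if issuer is not None else None
if issuer_tag is not None:
    issuer_name = issuer_tag.get_text(strip=True)
else:
    # Fall back to regex if the XML tags are missing
    issuer_name_match = re.search(r'Issuer Name:\s*(.*?)\n', soup.text, re.IGNORECASE)
    issuer_name = issuer_name_match.group(1).strip() if issuer_name_match else "Not found"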

# 3. Extract the CUSIP number (takes the first match; a heuristic if several appear)
cusip_pattern = r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]'
cusip_matches = re.findall(cusip_pattern, soup.text)
cusip = cusip_matches[0] if cusip_matches else "Not found"

# 4. Extract the percent of class (prefer the XML tag, then fall back to regex)
percent_ownership = soup.find('percentOwnership')
if percent_ownership is not None:
    class_percent = percent_ownership.text.strip()
else:
    # Allow text between the label and the value, and accept whole-number percentages
    percent_match = re.search(r'PERCENT OF CLASS[^%]*?(\d+(?:\.\d+)?)\s*%', soup.text, re.IGNORECASE)
    class_percent = percent_match.group(1) + "%" if percent_match else "Not found"

# 5. Extract the date of the event that triggers the filing
event_date = soup.find('eventDate')
if event_date is not None:
    event_date_str = event_date.text.strip()
    # Convert raw YYYYMMDD to "December 24, 2019" format
    try:
        date_obj = datetime.strptime(event_date_str, '%Y%m%d')
        # Use .day so single-digit days are not zero-padded
        event_date_str = f"{date_obj:%B} {date_obj.day}, {date_obj.year}"
    except ValueError:
        pass  # Keep raw format if parsing fails
else:
    event_date_match = re.search(r'Date of Event Which Requires This Statement:\s*(.*?)\n', soup.text, re.IGNORECASE)
    event_date_str = event_date_match.group(1).strip() if event_date_match else "Not found"

# Print all extracted info
print("Reporting person name:", reporter_name)
print("Issuer name:", issuer_name)
print("CUSIP number:", cusip)
print("Percent of class:", class_percent)
print("Event date:", event_date_str)