How do I extract five specific fields from an SEC EDGAR Schedule 13 form?

A complete Python implementation for extracting five fields from an SEC EDGAR Schedule 13 form

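Let's extract all five required fields from the SEC EDGAR Schedule 13 filing. Since you already have CUSIP and percentage extraction working, we can build on that: look for structured tags with BeautifulSoup first, and use targeted regex where a tag is missing. One caveat: the URL below points to an .htm document, and the tag names used here are assumptions rather than a confirmed EDGAR schema, so expect the regex fallbacks to do much of the work on this particular filing. Here's a complete solution: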

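Extraction logic

Here is how to get each of the five fields:

1. Reporting Person Name

When the filing is structured XML, the reporting person's name is typically nested under <reportingPerson> -> <name>, and locating that tag with BeautifulSoup is more reliable than pattern matching. If the tag is missing, fall back to a regex match.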
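The tag-first, regex-fallback pattern can be sketched as one reusable helper. This is a minimal sketch, not the answer's own code: it swaps in the standard library's xml.etree.ElementTree for BeautifulSoup so it runs without extra packages, the helper name tag_text_or_regex is hypothetical, and the sample fragment and its tag names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET


def tag_text_or_regex(text, tag_path, fallback_pattern, default="Not found"):
    """Return the text at tag_path if present, else the first regex group."""
    try:
        node = ET.fromstring(text).find(tag_path)
    except ET.ParseError:
        node = None  # not well-formed XML: go straight to the regex
    if node is not None and node.text and node.text.strip():
        return node.text.strip()
    match = re.search(fallback_pattern, text, re.IGNORECASE)
    return match.group(1).strip() if match else default


# Made-up fragment; the tag names are assumptions, not a confirmed schema.
sample = (
    "<filing><reportingPerson><name> Example Capital LP </name>"
    "</reportingPerson></filing>"
)
print(tag_text_or_regex(sample, ".//reportingPerson/name",
                        r"Reporting Person:\s*(.+)"))
```

The same helper would serve the issuer name by passing a different tag path and fallback pattern.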

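2. Issuer Name

Likewise, the issuer name is generally under <issuer> -> <name>. Prefer tag-based parsing here too, since it avoids the edge cases of regex.

3. CUSIP Number

Your existing regex already matches the CUSIP format. The refinement is to take only the first match. Some filings contain several CUSIPs, and the first match is not guaranteed to belong to the target issuer, so check it against the issuer when more than one appears.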
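One way to weed out false positives among several candidate matches is to verify the CUSIP check digit (the ninth character). The sketch below implements the standard modulus-10 "double-add-double" check; the function names are hypothetical, and the candidate list is made up for illustration. It does not tell you which valid CUSIP belongs to the target issuer, only which strings are well-formed CUSIPs.

```python
def cusip_check_digit(base8):
    """Compute the check digit for the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(base8.upper()):
        if ch in "0123456789":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        else:
            value = {"*": 36, "@": 37, "#": 38}[ch]  # KeyError if invalid
        if i % 2 == 1:
            value *= 2  # double every second character
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def is_valid_cusip(candidate):
    """True if candidate is 9 characters long and its check digit matches."""
    if len(candidate) != 9 or candidate[8] not in "0123456789":
        return False
    try:
        return cusip_check_digit(candidate[:8]) == int(candidate[8])
    except KeyError:
        return False


# Made-up candidate list, as re.findall might return it.
candidates = ["123456789", "037833100"]
valid = [c for c in candidates if is_valid_cusip(c)]
print(valid)
```

Filtering with is_valid_cusip before taking the first match helps avoid picking up an arbitrary nine-character string, such as a phone or file number, that happens to fit the pattern.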

4. Percent of Class

Prefer the <ownership> -> <percentOwnership> tag. When the tag is missing, use a regex that captures the percentage value.
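The regex fallback for the percentage can be sketched on its own. This is a minimal sketch under the assumption that the cover page prints the label "Percent of class" somewhere before the figure, possibly with a row reference such as "(11)" in between; the function name extract_percent and the sample strings are hypothetical.

```python
import re

# Skip lazily over anything that is not a percent sign, then capture the
# first number (integer or decimal) that is directly followed by "%".
PERCENT_RE = re.compile(
    r"percent\s+of\s+class[^%]*?(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_percent(text):
    """Return the percentage as a string like '5.2%', or None if absent."""
    match = PERCENT_RE.search(text)
    return match.group(1) + "%" if match else None


# Made-up cover-page snippet; the "(11)" row reference must not be captured.
cover = "PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)\n\n5.2%"
print(extract_percent(cover))
```

Requiring the number to sit directly before the percent sign is what keeps row references from being mistaken for the value, and making the decimal part optional lets whole-number percentages match as well.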

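5. Event Date (the date of the event that triggers the filing)

This is usually in the <eventDate> tag. The raw format is generally YYYYMMDD, which can be converted to the Month Day, Year format you need. When the tag is missing, fall back to a regex match.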
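The date conversion can be sketched as a small function that tries several input formats in turn. Only YYYYMMDD comes from the description above; the other formats in the list are assumptions about what a filing might contain, and the name format_event_date is hypothetical.

```python
from datetime import datetime

# YYYYMMDD is the format described above; the rest are assumed variants.
_FORMATS = ("%Y%m%d", "%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def format_event_date(raw):
    """Return the date as 'Month D, YYYY', or the stripped input if unparsable."""
    text = raw.strip()
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
        # Use .day rather than %d to avoid a zero-padded day ("December 05").
        return f"{parsed:%B} {parsed.day}, {parsed.year}"
    return text


print(format_event_date("20191224"))
```

Returning the input unchanged when nothing matches mirrors the "keep the raw format if parsing fails" behavior of the full script. Month names depend on the active locale; the default locale yields English names.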

Complete code

import requests
import re
from bs4 import BeautifulSoup
from datetime import datetime

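# Fetch the filing page. SEC.gov rejects requests that lack a declared
# User-Agent; the name and address below are placeholders, use your own.
url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
headers = {'User-Agent': 'Your Name your.email@example.com'}
page = requests.get(url, headers=headers, timeout=30)
page.raise_for_status()
# The 'xml' parser requires lxml. This URL is an .htm document, so the tag
# lookups below may find nothing, in which case the regex fallbacks apply.
soup = BeautifulSoup(page.text, 'xml')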

# 1. Extract the reporting person's name
reporting_person = soup.find('reportingPerson')
name_tag = reporting_person.find('name') if reporting_person is not None else None
if name_tag is not None:
    reporter_name = name_tag.text.strip()
else:
    # Fall back to regex if the XML tags are missing
    reporter_name_match = re.search(r'Reporting Person:\s*(.*?)\n', soup.text, re.IGNORECASE)
    reporter_name = reporter_name_match.group(1).strip() if reporter_name_match else "Not found"

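# 2. Extract the issuer name
issuer = soup.find('issuer')
issuer_name_tag = issuer.find('name') if issuer is not None else None
if issuer_name_tag is not None:
    issuer_name = issuer_name_tag.text.strip()
else:
    # Fall back to regex if the XML tags are missing
    issuer_name_match = re.search(r'Issuer Name:\s*(.*?)\n', soup.text, re.IGNORECASE)
    issuer_name = issuer_name_match.group(1).strip() if issuer_name_match else "Not found"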

# 3. Extract the CUSIP number (take the first match only)
cusip_pattern = r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]'
cusip_matches = re.findall(cusip_pattern, soup.text)
cusip = cusip_matches[0] if cusip_matches else "Not found"

# 4. Extract the percent of class (XML tag first, then regex)
percent_ownership = soup.find('percentOwnership')
if percent_ownership is not None:
    class_percent = percent_ownership.text.strip()
else:
    # Skip anything between the label and the figure (e.g. "(11)"), and
    # accept whole-number percentages as well as decimals.
    percent_match = re.search(r'PERCENT OF CLASS[^%]*?(\d+(?:\.\d+)?)\s*%', soup.text, re.IGNORECASE)
    class_percent = percent_match.group(1) + "%" if percent_match else "Not found"

# 5. Extract the date of the event that triggers the filing
event_date = soup.find('eventDate')
if event_date is not None:
    event_date_str = event_date.text.strip()
    # Convert raw YYYYMMDD to "December 24, 2019" format
    try:
        date_obj = datetime.strptime(event_date_str, '%Y%m%d')
        # Use .day so single-digit days are not zero-padded
        event_date_str = f"{date_obj:%B} {date_obj.day}, {date_obj.year}"
    except ValueError:
        pass  # Keep raw format if parsing fails
else:
    event_date_match = re.search(r'Date of Event Which Requires This Statement:\s*(.*?)\n', soup.text, re.IGNORECASE)
    event_date_str = event_date_match.group(1).strip() if event_date_match else "Not found"

# Print all extracted info
print("Reporting person name:", reporter_name)
print("Issuer name:", issuer_name)
print("CUSIP number:", cusip)
print("Percent of class:", class_percent)
print("Event date:", event_date_str)

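Key notes

  • XML first, regex as fallback: where a filing exposes structured tags, extracting by tag is more stable than pattern matching; regex is used only when a tag is missing or the format is irregular.
  • Fault tolerance: each field's extraction checks whether the value was found, so a missing field yields "Not found" instead of crashing the program.
  • Date formatting: a YYYYMMDD value is converted to the natural-language date format you need; if the conversion fails, the raw value is kept.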

This question comes from Stack Exchange; asked by Lko.