How do I extract five specific fields from an SEC EDGAR Schedule 13 filing?
Got it, let's extract all 5 required fields from the SEC EDGAR Schedule 13 filing. Since you already have a start with CUSIP and percentage extraction, we can build on that: use BeautifulSoup's XML parsing wherever the filing exposes XML tags, and targeted regex where it does not. Note that the URL used below ends in .htm, so on that document the tag lookups may find nothing and the regex fallbacks would then do the work. Here's a complete solution:
Extraction logic
Let's break down how to grab each of the 5 fields:
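1. Reporting Person Name
The reporting person's name is usually nested under the <reportingPerson> -> <name> tags in the XML structure, so locating that tag directly with BeautifulSoup is more reliable than pattern matching. If the tag is missing, fall back to a regex match.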
2. Issuer Name
Likewise, the issuer name generally sits under the <issuer> -> <name> tags. Prefer XML parsing here too, since it avoids the edge cases that regex runs into.
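Both name lookups above follow the same pattern, which can be sketched as one small helper. This is a minimal illustration that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so it runs without extra packages. The sample document, the helper names, and the "Label:" text layout are assumptions for illustration, not the real filing schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; a real filing may use other tag names or no XML at all.
SAMPLE_XML = """
<filing>
  <reportingPerson><name> Example Capital LLC </name></reportingPerson>
  <issuer><name>Example Issuer Inc.</name></issuer>
</filing>
"""


def name_from_xml(root, parent_tag):
    """Return the text of <parent_tag>/<name>, or None if it is missing or empty."""
    node = root.find(f".//{parent_tag}/name")
    if node is None or node.text is None:
        return None
    return node.text.strip() or None


def name_from_text(text, label):
    """Regex fallback: capture the rest of the line after 'label:'."""
    match = re.search(rf"{re.escape(label)}:[ \t]*(.+)", text, re.IGNORECASE)
    return match.group(1).strip() if match else None


def extract_name(xml_text, plain_text, parent_tag, label):
    """Try the XML tag first, then the regex, then give up with 'Not found'."""
    try:
        found = name_from_xml(ET.fromstring(xml_text), parent_tag)
    except ET.ParseError:
        found = None  # the input was not well-formed XML
    return found or name_from_text(plain_text, label) or "Not found"


print(extract_name(SAMPLE_XML, "", "reportingPerson", "Reporting Person"))
print(extract_name("not xml", "Issuer Name: Example Issuer Inc.\n", "issuer", "Issuer Name"))
```

Checking both the parent and the child tag for None matters: calling .text on a missing child is the most common way this kind of extraction crashes.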
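3. CUSIP Number
Your existing regex already matches the CUSIP format. We refine it to take only the first match, since some filings contain several CUSIP-like strings and we need the one that belongs to the target issuer. Taking the first match is a heuristic, so check the result against the issuer when it matters.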
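Because any nine-digit number can match the CUSIP pattern, one way to make "first valid match" stricter is to verify the check digit. The sketch below applies the standard CUSIP check-digit rule (a Luhn-style sum over the first eight characters); the function names and the sample text are illustrative assumptions, not part of the original code.

```python
import re

# Same character classes as the pattern in the answer, with word boundaries added.
CUSIP_PATTERN = re.compile(r"\b[0-9]{3}[A-Za-z0-9]{2}[A-Za-z0-9*@#]{3}[0-9]\b")

SPECIAL_VALUES = {"*": 36, "@": 37, "#": 38}


def cusip_check_digit(first_eight):
    """Compute the check digit for the first eight CUSIP characters."""
    total = 0
    for index, char in enumerate(first_eight.upper()):
        if char.isdigit():
            value = int(char)
        elif char.isalpha():
            value = ord(char) - ord("A") + 10  # A=10 ... Z=35
        else:
            value = SPECIAL_VALUES[char]
        if index % 2 == 1:  # every second character is doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def first_valid_cusip(text):
    """Return the first pattern match whose ninth character is a valid check digit."""
    for candidate in CUSIP_PATTERN.findall(text):
        if cusip_check_digit(candidate[:8]) == int(candidate[8]):
            return candidate
    return None


sample = "Ref 123456789, CUSIP No. 037833100"
print(first_valid_cusip(sample))
```

A passing check digit does not prove the CUSIP belongs to the issuer in question; it only filters out phone numbers, file numbers, and similar nine-character noise.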
4. Percent of Class
Prefer the XML <ownership> -> <percentOwnership> tag. If the tag is missing, use a regex that captures just the percentage value.
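For the regex fallback, the label and the number are often separated by other text (for example a row reference) or a line break. A hedged sketch that skips ahead to the first number followed by a percent sign is shown here; the sample strings are made up, and real cover pages may be laid out differently.

```python
import re

# Skip any non-% characters after the label, then capture the first number
# that is immediately followed by a percent sign.
PERCENT_PATTERN = re.compile(
    r"percent\s+of\s+class[^%]*?(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_percent(text):
    """Return the percentage as a string such as '5.2%', or None if absent."""
    match = PERCENT_PATTERN.search(text)
    return f"{match.group(1)}%" if match else None


cover_page = "13  PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)\n    5.2%"
print(extract_percent(cover_page))
print(extract_percent("Percent of class: 10%"))
```

The trade-off is that if the value right after the label is missing, the pattern will run on to the next percentage in the text, so it is worth sanity-checking the result (for instance, that it is at most 100).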
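5. Event Date (date of the event that triggers the filing)
This is usually in the <eventDate> tag. The raw format is generally YYYYMMDD, which we can convert to the "Month Day, Year" format you need. If the tag is missing, fall back to a regex match.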
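The date conversion can be sketched on its own as follows. Only YYYYMMDD comes from the description above; the other formats in KNOWN_FORMATS are assumptions about what a filing might contain, and the function name is hypothetical.

```python
from datetime import datetime

# YYYYMMDD is the format described above; the rest are assumed alternatives.
KNOWN_FORMATS = ("%Y%m%d", "%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def normalize_event_date(raw):
    """Convert a raw date string to 'Month D, YYYY'; return it unchanged if unparseable."""
    cleaned = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next candidate format
        # Build the day without zero padding ("March 5", not "March 05").
        return f"{parsed.strftime('%B')} {parsed.day}, {parsed.year}"
    return cleaned


print(normalize_event_date("20191224"))
print(normalize_event_date("sometime in 2019"))
```

Note that strftime('%B %d, %Y') zero-pads the day, which is why this sketch formats the day from parsed.day instead. Month names from %B depend on the process locale, so this assumes the default English locale.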
Full code
    import re
    from datetime import datetime

    import requests
    from bs4 import BeautifulSoup

    # Fetch the filing page. The User-Agent value is a placeholder: SEC asks
    # automated clients to identify themselves, so replace it with your own details.
    url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
    headers = {'User-Agent': 'Your Name your.email@example.com'}
    page = requests.get(url, headers=headers, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, 'xml')  # the 'xml' parser requires lxml
    text = soup.get_text()

    # 1. Reporting person name
    reporting_person = soup.find('reportingPerson')
    name_tag = reporting_person.find('name') if reporting_person is not None else None
    if name_tag is not None:
        reporter_name = name_tag.text.strip()
    else:
        # Fall back to regex if the XML tags are missing
        reporter_name_match = re.search(r'Reporting Person:\s*(.*?)\n', text, re.IGNORECASE)
        reporter_name = reporter_name_match.group(1).strip() if reporter_name_match else "Not found"

    # 2. Issuer name
    issuer = soup.find('issuer')
    issuer_tag = issuer.find('name') if issuer is not None else None
    if issuer_tag is not None:
        issuer_name = issuer_tag.text.strip()
    else:
        issuer_name_match = re.search(r'Issuer Name:\s*(.*?)\n', text, re.IGNORECASE)
        issuer_name = issuer_name_match.group(1).strip() if issuer_name_match else "Not found"

    # 3. CUSIP number (take the first match only)
    cusip_pattern = r'\b[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*@#]{3}[0-9]\b'
    cusip_match = re.search(cusip_pattern, text)
    cusip = cusip_match.group(0) if cusip_match else "Not found"

    # 4. Percent of class (XML tag first, then regex)
    percent_ownership = soup.find('percentOwnership')
    if percent_ownership is not None:
        class_percent = percent_ownership.text.strip()
    else:
        # Skip any text between the label and the number (e.g. a row reference)
        percent_match = re.search(r'percent of class[^%]*?(\d+(?:\.\d+)?)\s*%', text, re.IGNORECASE)
        class_percent = percent_match.group(1) + "%" if percent_match else "Not found"

    # 5. Date of the event that triggers the filing
    event_date = soup.find('eventDate')
    if event_date is not None:
        event_date_str = event_date.text.strip()
        # Convert raw YYYYMMDD to "December 24, 2019" format
        try:
            date_obj = datetime.strptime(event_date_str, '%Y%m%d')
            event_date_str = date_obj.strftime('%B %d, %Y')
        except ValueError:
            pass  # Keep the raw format if parsing fails
    else:
        event_date_match = re.search(r'Date of Event Which Requires This Statement:\s*(.*?)\n', text, re.IGNORECASE)
        event_date_str = event_date_match.group(1).strip() if event_date_match else "Not found"

    # Print all extracted info
    print("Reporting person name:", reporter_name)
    print("Issuer name:", issuer_name)
    print("CUSIP number:", cusip)
    print("Percent of class:", class_percent)
    print("Event date:", event_date_str)
Key notes
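- XML first, regex as fallback: where a filing follows a standardized XML schema, extracting by tag is more stable; regex is only the backup for missing tags or unusual formatting.
- Error tolerance: every field's extraction includes a check, so a missing field does not crash the program and instead yields "Not found".
- Date formatting: the YYYYMMDD format is converted automatically to the natural-language date format you need; if the conversion fails, the raw value is kept.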
The question comes from Stack Exchange; asked by Lko.




