How do I extract all email addresses from already-fetched HTML with Python?

阿华AIGC实验室

2026-5-19

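Approaches for extracting email addresses from HTML

Hey, pulling email addresses out of HTML is a pretty common need. Here are two practical Python approaches; pick whichever fits your situation: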

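Method 1: Match directly with a regular expression (quick start)

If you already have the full HTML as a string, the most direct route is a regular expression that matches strings shaped like email addresses. The commonly used pattern below covers the usual address forms, though not every address the email standards technically allow: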

import re

# Replace this with the HTML you fetched
html_content = """<div>Contact us: support@example.com or sales@test.org</div>"""

# Regular expression for email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Find every match in the HTML string
emails = re.findall(email_pattern, html_content)
# Deduplicate while keeping first-seen order (a plain set would not guarantee order)
unique_emails = list(dict.fromkeys(emails))

print(unique_emails)
# Output: ['support@example.com', 'sales@test.org']

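One caveat: run against raw HTML, this pattern also matches email-like strings inside tag attributes. For example, <img src="logo@2x.png"> produces the match logo@2x.png, because png satisfies the [a-zA-Z]{2,} part that is meant for the top-level domain. If your HTML contains this kind of noise, the next method is a better fit.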
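If you want to stay with the regex-only approach, one option is to filter out matches that look like asset filenames. Here's a minimal sketch of that idea; find_emails and the ASSET_SUFFIXES list are hypothetical names made up for this example, and the suffix list is an assumption you'd adjust for your own pages. It also treats addresses that differ only in letter case as duplicates, which is a practical assumption rather than something the email standards guarantee.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

# Assumption: matches ending in these suffixes are filenames, not addresses.
ASSET_SUFFIXES = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp', '.css', '.js')


def find_emails(html):
    """Return email-like matches in first-seen order, skipping asset filenames."""
    seen = {}
    for match in EMAIL_RE.findall(html):
        if match.lower().endswith(ASSET_SUFFIXES):
            continue  # e.g. "logo@2x.png" from an <img src="..."> attribute
        # Key on the lowercased form so "A@x.com" and "a@x.com" count once.
        seen.setdefault(match.lower(), match)
    return list(seen.values())


html = '<img src="logo@2x.png"><p>support@example.com, Support@Example.com</p>'
print(find_emails(html))
# Output: ['support@example.com']
```

A suffix blocklist only catches the noise you anticipated, so parsing the HTML first is still the sturdier choice when pages are messy.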

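Method 2: Parse the HTML with BeautifulSoup first (more precise filtering)

Parse the HTML with BeautifulSoup, extract the page's text content, and then run the regex on that text. This sidesteps the noise inside tags. The flip side is that an address appearing only in an attribute, such as a mailto: href, is not part of the text and will not be found this way.

Step 1: Install BeautifulSoup (if you haven't already)

pip install beautifulsoup4

Step 2: Write the code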

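from bs4 import BeautifulSoup
import re

html_content = """<div>Contact us: support@example.com</div><a href="mailto:sales@test.org">Send email</a>"""

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Extract the page text; the separator keeps text from adjacent tags from running together
text_content = soup.get_text(separator=' ')

# Match email addresses in the text
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails_from_text = re.findall(email_pattern, text_content)
# Deduplicate while keeping first-seen order
unique_emails = list(dict.fromkeys(emails_from_text))

print(unique_emails)
# Output: ['support@example.com']
# sales@test.org appears only in the href attribute, so it is not in the text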
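To make the "extract the text, then match" idea concrete, or if adding a dependency isn't an option, roughly the same thing can be sketched with the standard library's html.parser. This is only an illustration, not how BeautifulSoup works internally; TextCollector and emails_from_text_nodes are hypothetical names invented for this example. It collects text nodes, skips the contents of script and style tags, and runs the same regex on what's left.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')


class TextCollector(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#64; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def emails_from_text_nodes(html):
    """Return addresses found in text nodes only, in first-seen order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    # Join with a space so text from neighbouring tags does not run together.
    text = ' '.join(parser.chunks)
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


sample = ('<p>Contact: support@example.com</p><img src="logo@2x.png">'
          '<script>var x = "bot@trap.io";</script>')
print(emails_from_text_nodes(sample))
# Output: ['support@example.com']
```

Joining with a space is a trade-off: it keeps neighbouring text apart, but an address split across inline tags would no longer match.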

Extra: extracting addresses from mailto links

If you also need the addresses in links like <a href="mailto:xxx@xxx.com">, handle them separately:

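# Find every <a> tag whose href starts with mailto: (case-insensitive)
mailto_links = soup.find_all('a', href=re.compile(r'^\s*mailto:', re.I))
# Take the address part of each href, dropping any ?subject=... query string
mailto_emails = [
    re.sub(r'^\s*mailto:', '', link['href'], flags=re.I).split('?', 1)[0].strip()
    for link in mailto_links
]

# Merge the addresses from the text and from the mailto links, skipping empty ones
all_emails = unique_emails + [email for email in mailto_emails if email]
# Deduplicate while keeping first-seen order
all_unique_emails = list(dict.fromkeys(all_emails))
print(all_unique_emails)
# Output: ['support@example.com', 'sales@test.org']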
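Real-world mailto links can carry more than a bare address: a query string (?subject=...), several comma-separated recipients, or percent-encoded characters. Here's a small sketch for handling those cases with the standard library's urllib.parse; emails_from_mailto is a hypothetical helper name made up for this example. It only reads the address part before the ?, so recipients given in a to= query parameter are deliberately ignored.

```python
from urllib.parse import unquote, urlsplit


def emails_from_mailto(href):
    """Return the addresses in a mailto: href, or [] for any other kind of link."""
    parts = urlsplit(href.strip())
    if parts.scheme != 'mailto':  # urlsplit lowercases the scheme
        return []
    # parts.path is everything between "mailto:" and the optional "?query".
    # Split on commas first, then percent-decode each recipient.
    candidates = (unquote(item).strip() for item in parts.path.split(','))
    return [address for address in candidates if '@' in address]


print(emails_from_mailto('mailto:sales@test.org?subject=Hi'))
# Output: ['sales@test.org']
print(emails_from_mailto('MAILTO:a@x.com,%20b@y.org'))
# Output: ['a@x.com', 'b@y.org']
```

With BeautifulSoup you could call it on each link['href'] returned by soup.find_all('a', href=True) and merge the results into your list.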

That covers addresses in both the page text and mailto links!

The question in this post comes from Stack Exchange; it was asked by zorange.
