How do I extract just the body text from an EML file? Stuck with a Python 2.7 + BeautifulSoup implementation

阿华AIGC实验室

2026-5-8

Solving the EML body-text extraction problem

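It looks like your earlier attempt took a wrong turn: parsing the whole EML file directly with BeautifulSoup does not work, because an EML file is a mail-format file that contains message headers, multipart boundary delimiters, attachment data, and so on. It is not plain HTML. In addition, your pullout function processes attachments, which mixes unwanted file information or code text into the result. A targeted solution follows: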

Core idea

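First parse the EML file properly with Python's email module and separate out the plain-text or HTML body part (ignoring attachments and other non-body content)
If the body is HTML, use BeautifulSoup to strip the tags and extract plain text
The result is clean body text that can be fed into the spam classifier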
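
The first step can be tried without any files on disk. The sketch below uses only the standard library and builds a small sample message in memory; the helper names build_sample_eml and list_body_parts and the sample contents are made up for illustration. It assumes the goal is simply to see which parts survive the "text body, not an attachment" filter.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_sample_eml():
    """Build a small message as a string (hypothetical sample data)."""
    msg = MIMEMultipart("mixed")
    msg["Subject"] = "Demo"
    body = MIMEMultipart("alternative")
    body.attach(MIMEText("Hello from the plain part", "plain", "utf-8"))
    body.attach(MIMEText("<html><body><p>Hello from the HTML part</p></body></html>",
                         "html", "utf-8"))
    msg.attach(body)
    # A text attachment: it has a filename, so it must not count as body text
    attachment = MIMEText("col1,col2\n1,2\n", "csv", "utf-8")
    attachment.add_header("Content-Disposition", "attachment", filename="data.csv")
    msg.attach(attachment)
    return msg.as_string()


def list_body_parts(raw_eml):
    """Return (content_type, text) for every body part, skipping attachments."""
    msg = email.message_from_string(raw_eml)
    found = []
    for part in msg.walk():
        ctype = part.get_content_type()
        if part.get_filename() or ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "utf-8"
        found.append((ctype, payload.decode(charset, "replace").strip()))
    return found


parts = list_body_parts(build_sample_eml())
for ctype, text in parts:
    print(ctype, "->", text)
```

Only the text/plain and text/html parts are listed; the multipart containers and the data.csv attachment are skipped. The HTML part still contains tags at this point, which is where BeautifulSoup comes in.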

Full implementation (compatible with Python 2.7)

# -*- coding: utf-8 -*-
# (the coding line above is required in Python 2.7 if the file contains non-ASCII text)
import email
import os
from bs4 import BeautifulSoup
import csv

BODY_TYPES = ['text/plain', 'text/html']


def _part_to_text(part):
    """Decode one text/plain or text/html part into clean unicode text."""
    # decode=True undoes the transfer encoding (base64 / quoted-printable)
    payload = part.get_payload(decode=True)
    if not payload:
        return u""
    charset = part.get_content_charset()  # declared charset, or None
    if part.get_content_type() == 'text/html':
        # HTML to plain text; from_encoding=None lets BeautifulSoup detect the encoding
        soup = BeautifulSoup(payload, 'lxml', from_encoding=charset)
        # Drop non-body tags such as script and style
        for tag in soup(["script", "style"]):
            tag.extract()
        return soup.get_text(strip=True, separator='\n').strip()
    # Plain text: use the declared charset and fall back to UTF-8
    try:
        return payload.decode(charset or 'utf-8', 'replace').strip()
    except LookupError:  # unknown charset label in the header
        return payload.decode('utf-8', 'replace').strip()


def extract_clean_body(eml_path):
    """Extract clean body text from an EML file."""
    # Python 2.7: message_from_file accepts a file opened in 'rb' mode.
    # On Python 3, use email.message_from_binary_file instead.
    with open(eml_path, 'rb') as f:
        msg = email.message_from_file(f)

    clean_body = u""

    if msg.is_multipart():
        # Walk all parts: prefer text/plain, otherwise fall back to text/html
        for part in msg.walk():
            content_type = part.get_content_type()
            # Skip attachments and embedded resources (images, stylesheets, ...)
            if part.get_filename() or content_type not in BODY_TYPES:
                continue
            text = _part_to_text(part)
            if not text:
                continue
            clean_body = text
            if content_type == 'text/plain':
                break  # plain text found: stop here, it takes priority
    elif msg.get_content_type() in BODY_TYPES:
        # Single-part message
        clean_body = _part_to_text(msg)

    return clean_body

def write_to_csv(eml_dir, output_csv):
    """Scan a folder of EML files, extract each body, and write a CSV (path, body, spam flag)."""
    # Python 2.7: the csv module expects the file to be opened in binary mode
    with open(output_csv, 'wb') as csvfile:
        fieldnames = ['email_path', 'clean_body', 'is_spam']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writeheader()
        for filename in os.listdir(eml_dir):
            if filename.endswith('.eml'):
                eml_path = os.path.join(eml_dir, filename)
                body = extract_clean_body(eml_path)
                # Label by path: a path containing "spam" is marked 1, everything else 0.
                # os.listdir() only lists eml_dir itself, not subfolders, so point
                # eml_dir directly at the spam or ham folder.
                is_spam = 1 if 'spam' in eml_path.lower() else 0
                writer.writerow({
                    'email_path': eml_path,
                    'clean_body': body.encode('utf-8', errors='replace'),
                    'is_spam': is_spam
                })

# Usage example
if __name__ == '__main__':
    # Extract the body of a single EML file
    single_eml_body = extract_clean_body("e.eml")
    print("Extracted body text:")
    print(single_eml_body)
    
    # Batch-process a folder and write the CSV
    # write_to_csv("path/to/your/eml/folder", "email_data.csv")
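
write_to_csv only looks at the top level of eml_dir, because os.listdir does not descend into subfolders. If the corpus is laid out as spam and ham subfolders under one root, a recursive variant could look like the sketch below. The function name iter_labeled_eml, the spam_dirname parameter, and the rule "label comes from the immediate parent folder" are assumptions for illustration, not part of the original code.

```python
import os
import tempfile


def iter_labeled_eml(root_dir, spam_dirname="spam"):
    """Yield (path, is_spam) for every .eml file under root_dir.

    Assumption: a file is spam if its immediate parent folder is named
    spam_dirname (case-insensitive); everything else is labeled 0.
    """
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        label = 1 if os.path.basename(dirpath).lower() == spam_dirname else 0
        for filename in sorted(filenames):
            if filename.lower().endswith(".eml"):
                yield os.path.join(dirpath, filename), label


# Small demo on a throwaway directory tree (hypothetical file names)
root = tempfile.mkdtemp()
for sub, name in [("spam", "a.eml"), ("ham", "b.eml"), ("ham", "notes.txt")]:
    folder = os.path.join(root, sub)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    open(os.path.join(folder, name), "w").close()

found = dict((os.path.basename(path), label) for path, label in iter_labeled_eml(root))
print(found)
```

Each yielded path can then be passed to extract_clean_body, with the label written to the is_spam column.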

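Key improvements

Skip attachments and embedded resources: part.get_filename() identifies attachments, which are skipped outright; only text/plain and text/html parts are processed
Prefer plain text: if a message carries both a plain-text and an HTML version, the plain-text one is used (it is cleaner); HTML is processed only when there is no plain text
Clean up HTML clutter: BeautifulSoup removes script, style, and similar non-body tags and extracts clean text
Encoding: in Python 2.7, keep byte strings (str) and unicode apart; decode payloads when reading and encode when writing, to avoid garbled characters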
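
On the encoding point, many real messages are not UTF-8, so decoding with the charset declared in the part's Content-Type header is usually safer than hardcoding one. A minimal sketch, assuming a UTF-8 fallback is acceptable when the header is missing or names an unknown charset; decode_part and the sample message are made up for illustration:

```python
import base64
import email


def decode_part(part, fallback="utf-8"):
    """Decode a text part using its declared charset, else the fallback."""
    payload = part.get_payload(decode=True)
    if payload is None:
        return u""
    charset = part.get_content_charset() or fallback
    try:
        return payload.decode(charset, "replace")
    except LookupError:
        # The header named a charset Python does not know
        return payload.decode(fallback, "replace")


# Hypothetical sample: two Chinese characters encoded as GBK, sent as base64
original = u"\u4f60\u597d"
raw_gbk = (
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="gbk"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(original.encode("gbk")).decode("ascii") + "\n"
)
msg = email.message_from_string(raw_gbk)

decoded = decode_part(msg).strip()
naive = msg.get_payload(decode=True).decode("utf-8", "replace").strip()
print(decoded == original, naive == original)
```

The charset-aware path recovers the original text, while decoding the same bytes as UTF-8 yields replacement characters.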

Why did your earlier attempt fail?

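Parsing the whole EML file directly with BeautifulSoup: an EML file contains message headers, boundary delimiters (for example ------=_NextPart_000_001C_01D9A...), and other non-HTML content, and all of that junk ends up in the parsed output
The pullout function processed attachments: your original code extracts attachments and records their filenames, so attachment-related information leaks into the body text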
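
The first point is easy to see with a small in-memory message. This sketch uses only the standard library; the DEMO-BOUNDARY string and the sample contents are made up. It compares the raw EML text with what the email module returns for each leaf part.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical sample message with a fixed, recognizable boundary
msg = MIMEMultipart("alternative", boundary="DEMO-BOUNDARY")
msg["Subject"] = "Demo"
msg.attach(MIMEText("Plain body", "plain", "us-ascii"))
msg.attach(MIMEText("<p>HTML body</p>", "html", "us-ascii"))
raw_eml = msg.as_string()

# The raw file mixes headers and boundary lines with the actual content
has_boundary = "--DEMO-BOUNDARY" in raw_eml
has_header = "Subject: Demo" in raw_eml

# After parsing, each leaf part carries only its own content
parsed = email.message_from_string(raw_eml)
leaf_texts = [
    part.get_payload(decode=True).decode("ascii").strip()
    for part in parsed.walk()
    if not part.is_multipart()
]
print(has_boundary, has_header, leaf_texts)
```

Anything that treats raw_eml as one HTML document sees the boundary and header lines as ordinary text, whereas the parsed parts contain neither.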

The question discussed here comes from Stack Exchange; it was asked by K.Malamatas