Extracting message content from EML files: Python 3.6 email module fails on some files

阿华AIGC实验室

2026-5-20

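Let me help you track this down. When the built-in Python email module processes .eml files exported from Lotus Notes and only some of them fail, the most likely cause is that those messages have an unusual MIME structure, or that they lack the plain-text body you asked for. Let's work through it step by step:

Likely causes and how to handle each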

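1. First, handle `get_body()` returning `None`

Often the error occurs because the message has no plain-text (plain) body at all. In that case msg.get_body(preferencelist=('plain')) returns None, and the .get_content() call that follows raises AttributeError. Note also that ('plain') is an ordinary string, not a tuple; a one-element tuple needs a trailing comma, as in ('plain',).

Check the return value before using it, and fall back to the HTML body when there is no plain-text one: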

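from email import policy
from email.parser import BytesParser

def parseEmail(emailFile):
    with open(emailFile, 'rb') as fp:
        policy500 = policy.default.clone(max_line_length=500)
        msg = BytesParser(policy=policy500).parse(fp)
        # Prefer the plain-text body; ('plain',) with the comma is a real tuple
        plain_body = msg.get_body(preferencelist=('plain',))
        # Compare with None: a part with no headers has len() == 0 and is falsy
        if plain_body is not None:
            text = plain_body.get_content()
        else:
            # No plain-text part: try the HTML body instead
            html_body = msg.get_body(preferencelist=('html',))
            if html_body is not None:
                text = html_body.get_content()
            else:
                # No usable body: set a placeholder or flag the message
                text = "No usable body content"
        # Further processing
        ...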
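
To see the None case in isolation, here is a minimal sketch. It assumes a made-up HTML-only message built in memory (the subject, addresses and body text are invented for illustration) and round-trips it through bytes as if it had been read from an .eml file.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Hypothetical HTML-only message; all header values and text are invented.
original = EmailMessage()
original["Subject"] = "HTML only"
original["From"] = "sender@example.com"
original["To"] = "receiver@example.com"
original.set_content("<p>Hello from Notes</p>", subtype="html")

# Round-trip through bytes, as if the message came from an .eml file.
msg = BytesParser(policy=policy.default).parsebytes(original.as_bytes())

plain_part = msg.get_body(preferencelist=("plain",))
html_part = msg.get_body(preferencelist=("html",))

print(plain_part)  # None, because the message has no text/plain part
html_text = html_part.get_content()
print(html_text.strip())
```

Calling plain_part.get_content() here would raise the AttributeError described above, which is why the None check comes first.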

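2. Cope with the unusual MIME structure of Lotus Notes exports

Mail exported from Lotus Notes often has a non-standard nested MIME structure, for example with the body buried several multipart levels deep, and get_body() may not locate it directly. In that case you can walk every part of the message tree yourself and pick the first text part you find: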

from email import policy
from email.parser import BytesParser

def find_valid_body(msg):
    # walk() visits every nested MIME part, depth-first
    if msg.is_multipart():
        for part in msg.walk():
            # Skip container parts and anything with a filename (attachments)
            if part.is_multipart() or part.get_filename():
                continue
            content_type = part.get_content_type()
            if content_type in ['text/plain', 'text/html']:
                return part.get_content()
        return "No usable body content"
    else:
        # Not a multipart message: return its content directly
        return msg.get_content()

def parseEmail(emailFile):
    with open(emailFile, 'rb') as fp:
        policy500 = policy.default.clone(max_line_length=500)
        msg = BytesParser(policy=policy500).parse(fp)
        text = find_valid_body(msg)
        # Further processing
        ...
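
A walk that returns the first text part it meets will return HTML whenever the HTML part happens to come first. The sketch below is one possible variation that collects the text parts and then prefers plain over HTML. The helper name pick_body and the test message are hypothetical, not part of the original question.

```python
from email.message import EmailMessage

def pick_body(msg):
    """Return (subtype, text) for the best text part, preferring plain over html."""
    found = {}
    for part in msg.walk():
        # Containers hold no text themselves; attachments are not the body.
        if part.is_multipart() or part.is_attachment():
            continue
        if part.get_content_maintype() == "text":
            # Remember only the first part seen for each subtype.
            found.setdefault(part.get_content_subtype(), part)
    for subtype in ("plain", "html"):
        if subtype in found:
            return subtype, found[subtype].get_content()
    return None, None

# Hypothetical nested message: mixed -> alternative -> (html, plain), plus an attachment.
msg = EmailMessage()
msg["Subject"] = "Nested"
msg.set_content("<p>html body</p>", subtype="html")
msg.add_alternative("plain body")
msg.add_attachment(b"binary data", maintype="application",
                   subtype="octet-stream", filename="data.bin")

subtype, text = pick_body(msg)
print(subtype, text.strip())
```

Even though the HTML part is listed first in this message, the helper returns the plain-text part.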

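3. Handle Lotus Notes encoding problems

Lotus Notes mail may use GBK or another non-UTF-8 encoding, and the charset declared in the headers is not always one Python recognizes. Calling get_content() directly can then fail (for example with LookupError for an unknown charset name) or return garbled text. You can catch the failure and decode the raw payload yourself: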

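def get_safe_content(part):
    try:
        return part.get_content()
    except (LookupError, UnicodeDecodeError):
        # LookupError covers charset names that Python does not recognize.
        # For Lotus Notes in a Chinese-language environment, try GBK first.
        decoded_payload = part.get_payload(decode=True)
        return decoded_payload.decode('gbk', errors='replace')

# In find_valid_body, use this helper in place of part.get_content():
#     return get_safe_content(part)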
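
The fallback idea can be tried on a small in-memory message. This sketch assumes a made-up charset label (x-made-up-charset) wrapped around GBK-encoded text. The helper name decode_text_part and the fallback order are illustrative choices. Strict UTF-8 is tried before GBK because UTF-8 decoding usually fails on bytes that are not UTF-8, whereas GBK accepts many byte sequences.

```python
import base64
from email import policy
from email.parser import BytesParser

def decode_text_part(part, fallbacks=("utf-8", "gbk")):
    """Return the text of a part, trying fallback charsets if the declared one fails."""
    try:
        return part.get_content()
    except (LookupError, UnicodeDecodeError):
        raw = part.get_payload(decode=True)  # bytes after transfer decoding
        for charset in fallbacks:
            try:
                return raw.decode(charset)
            except UnicodeDecodeError:
                continue
        # Last resort: keep what can be decoded and mark the rest.
        return raw.decode(fallbacks[-1], errors="replace")

# Hypothetical message: GBK bytes labelled with a charset Python does not know.
payload = base64.b64encode("你好".encode("gbk"))
raw_message = (
    b'Content-Type: text/plain; charset="x-made-up-charset"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + payload + b"\r\n"
)
msg = BytesParser(policy=policy.default).parsebytes(raw_message)

text = decode_text_part(msg)
print(text)
```

Without the fallback, get_content() on this message raises LookupError because the declared charset is unknown.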

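4. Add exception handling to help with diagnosis

To find out exactly which messages are failing, wrap the parsing in a catch-all exception handler and record the error details: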

from email import policy
from email.parser import BytesParser

def parseEmail(emailFile):
    try:
        with open(emailFile, 'rb') as fp:
            policy500 = policy.default.clone(max_line_length=500)
            msg = BytesParser(policy=policy500).parse(fp)
            plain_body = msg.get_body(preferencelist=('plain',))
            if plain_body is not None:
                text = plain_body.get_content()
            else:
                html_body = msg.get_body(preferencelist=('html',))
                text = html_body.get_content() if html_body is not None else "No usable body content"
        # Further processing
    except Exception as e:
        print(f"Error while processing {emailFile}: {e}")
        # Append the error to a log file for later investigation
        with open('email_parse_errors.log', 'a', encoding='utf-8') as f:
            f.write(f"{emailFile}: {e}\n")
        text = None  # or return a custom error marker
    return text
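
When there are many files, it can help to separate successes from failures in one pass. The sketch below is one way to do that. The helper names extract_text and parse_many and the two throwaway file names are hypothetical. It records the exception type along with the message, which makes it easier to see which of the causes above applies to each file.

```python
import os
import tempfile
from email import policy
from email.parser import BytesParser

def extract_text(path):
    """Return the body text of one .eml file, or None if it has no text body."""
    with open(path, "rb") as fp:
        msg = BytesParser(policy=policy.default).parse(fp)
    body = msg.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else None

def parse_many(paths):
    """Parse every path, collecting results and failures separately."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extract_text(path)
        except Exception as exc:  # deliberately broad: this is a triage pass
            failures[path] = "{}: {}".format(type(exc).__name__, exc)
    return results, failures

# Demo with one valid file and one path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.eml")
    with open(good, "wb") as fp:
        fp.write(b"Subject: ok\r\n"
                 b"Content-Type: text/plain; charset=utf-8\r\n"
                 b"\r\nhello\r\n")
    missing = os.path.join(tmp, "missing.eml")  # never created
    results, failures = parse_many([good, missing])

print(results)
print(failures)
```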

The question comes from Stack Exchange; it was asked by shole.