How do you fix Unicode encoding errors in Python? With code snippets

阿华AIGC实验室 (Ahua AIGC Lab)

2026-5-21

Fixing Unicode encoding errors when reading files in Python

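The Unicode problem your code runs into while reading the file comes down to the encoding you specified: iso-8859-1 (Latin-1) covers only 256 code points, enough for Western European text. Decoding with it never raises, because every byte maps to some character, but a file that actually contains other content (Chinese, special symbols, non-Western scripts) comes out garbled, and encoding such characters to Latin-1 raises UnicodeEncodeError. Here are a few practical ways to deal with it: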
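Before going through them, a minimal sketch of how Latin-1 behaves on non-Western text may help. The sample string and variable names are illustrative only and are not taken from the original question.

```python
# "中文" encoded as UTF-8 is six bytes.
raw = "中文".encode("utf-8")

# Latin-1 maps every byte 0x00-0xFF to a character, so decoding never raises;
# it simply yields six unrelated characters.
garbled = raw.decode("iso-8859-1")

# Encoding is where Latin-1 fails: these characters have no Latin-1 byte.
try:
    "中文".encode("iso-8859-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The Latin-1 round trip is lossless, so the original text can be recovered.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
```

Because the round trip is lossless, text that was decoded with the wrong codec can often be repaired after the fact, as long as nothing else modified it in between.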

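1. Prefer reading as UTF-8

UTF-8 is the de facto standard character encoding today and can represent every Unicode character. You can change the encoding argument passed to open() and add an error handler so that undecodable bytes do not crash the program: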

import hashlib

def fileLoc(self, filename):
    md5_data_with_commented_lines = hashlib.md5()
    md5_data_without_commented_lines = hashlib.md5()
    line_of_code = 0
    line_of_comments = 0
    no_of_blank_lines = 0
    flag = 0
    
    # Open as UTF-8 instead, with an error handler
    with open(filename, 'r', encoding='utf-8', errors='backslashreplace') as source_file:
        for line in source_file:
            if flag == 1:
                md5_data_with_commented_lines.update(line.encode("utf-8"))
                if line.find('-->') == -1:
                    line_of_comments += 1
                else:
                    # The rest of your logic goes here
                    pass

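Here errors='backslashreplace' turns any byte that cannot be decoded into a Python escape sequence (such as \xXX), so the program does not crash and the original byte values stay visible for debugging. Depending on your needs, you can swap in:

errors='replace': replaces undecodable bytes with the � character (U+FFFD)
errors='ignore': silently drops undecodable bytes (not recommended, since data is lost)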
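The three handlers can be compared side by side. This is a small sketch on a made-up byte string (Latin-1 encoded "café!", which is not valid UTF-8); the variable names are illustrative.

```python
# 0xE9 is "é" in Latin-1 but an incomplete multi-byte sequence in UTF-8.
bad = b"caf\xe9!"

# The default handler ("strict") raises on the invalid byte.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Keeps the byte value visible as a literal backslash escape.
escaped = bad.decode("utf-8", errors="backslashreplace")

# Substitutes U+FFFD, the Unicode replacement character.
replaced = bad.decode("utf-8", errors="replace")

# Drops the byte entirely.
ignored = bad.decode("utf-8", errors="ignore")
```

Only backslashreplace keeps enough information to work out what the original byte was; the other two discard it.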

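2. Detect the file's actual encoding at runtime

If you are not sure what encoding the target file uses, the chardet library can guess it for you:

First install the dependency: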

pip install chardet

Then change the code:

import hashlib
import chardet

def fileLoc(self, filename):
    md5_data_with_commented_lines = hashlib.md5()
    md5_data_without_commented_lines = hashlib.md5()
    line_of_code = 0
    line_of_comments = 0
    no_of_blank_lines = 0
    flag = 0
    
    # Detect the file's encoding first
    with open(filename, 'rb') as f:
        detect_result = chardet.detect(f.read())
    # chardet reports None when it cannot guess; fall back to UTF-8 in that case
    encoding = detect_result['encoding'] or 'utf-8'
    
    # Open the file with the detected encoding, with an error handler
    with open(filename, 'r', encoding=encoding, errors='backslashreplace') as source_file:
        for line in source_file:
            if flag == 1:
                md5_data_with_commented_lines.update(line.encode(encoding))
                if line.find('-->') == -1:
                    line_of_comments += 1
                else:
                    # The rest of your logic goes here
                    pass

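This approach adapts to files in different encodings, which makes it more tolerant of mixed inputs. Keep in mind that detection is a statistical guess, so it can be wrong on short or ambiguous files.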
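Since chardet.detect returns a dict that includes an "encoding" entry (possibly None) and a "confidence" score, one option is to accept the guess only above a threshold. The helper below is a hypothetical sketch: the name choose_encoding, the 0.5 threshold, and the UTF-8 default are assumptions, not part of chardet.

```python
def choose_encoding(result, min_confidence=0.5, default="utf-8"):
    """Pick an encoding name from a chardet-style result dict.

    `result` is assumed to look like
    {"encoding": str or None, "confidence": float}.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Fall back when there is no guess or the guess is too weak.
    if encoding is None or confidence < min_confidence:
        return default
    return encoding
```

In the function above, this could be used as encoding = choose_encoding(chardet.detect(data)), with the threshold tuned to the kind of files you process.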

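3. Read in binary mode, then decode

If the approaches above still cause trouble, you can read the file in binary mode and decode it yourself. That gives finer control over how decoding is handled and keeps the MD5 calculation free of any encoding round trip: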

import hashlib

def fileLoc(self, filename):
    md5_data_with_commented_lines = hashlib.md5()
    md5_data_without_commented_lines = hashlib.md5()
    line_of_code = 0
    line_of_comments = 0
    no_of_blank_lines = 0
    flag = 0
    
    with open(filename, 'rb') as source_file:
        for byte_line in source_file:
            # Try UTF-8 first; fall back to Latin-1, which accepts any byte sequence
            try:
                line = byte_line.decode('utf-8')
            except UnicodeDecodeError:
                line = byte_line.decode('iso-8859-1')
            
            if flag == 1:
                md5_data_with_commented_lines.update(byte_line)  # Update the MD5 with the raw bytes so re-encoding cannot alter the hash
                if line.find('-->') == -1:
                    line_of_comments += 1
                else:
                    # The rest of your logic goes here
                    pass
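To illustrate why hashing the raw bytes matters, here is a small sketch with a made-up Latin-1 line. It shows that decoding with a lossy handler and re-encoding produces different bytes, and therefore a different digest, than the file contents themselves.

```python
import hashlib

# A Latin-1 encoded line; 0xE9 makes it invalid as UTF-8.
raw = b"caf\xe9\n"

# Hash of the bytes exactly as they appear in the file.
digest_raw = hashlib.md5(raw).hexdigest()

# Decoding with backslashreplace rewrites the bad byte as the text "\xe9",
# so re-encoding yields different bytes and a different hash.
text = raw.decode("utf-8", errors="backslashreplace")
reencoded = text.encode("utf-8")
digest_reencoded = hashlib.md5(reencoded).hexdigest()
```

If the digests are meant to identify file contents, hashing byte_line directly keeps them independent of whichever decoding path was taken.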

The question answered here comes from Stack Exchange and was asked by megna.