Help with a UnicodeDecodeError when processing a CSV file in Python 2.7

阿华AIGC实验室

2026-5-26

Fixing UnicodeDecodeError in Python 2.7 CSV Processing

Hey there! Let's break down why you're hitting this UnicodeDecodeError and how to fix it.

The Root Cause

In Python 2.7, csv.DictReader returns byte strings (the str type) when reading your CSV file. Calling .encode('utf-8') on a byte string makes Python first decode it to Unicode implicitly, using the default ASCII codec. If your CSV contains any non-ASCII bytes (accented letters, emoji, or non-English text), that implicit decode fails and raises the UnicodeDecodeError you're seeing.
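To make the hidden step visible, here is a minimal sketch that spells out the two stages explicitly. The helper name implicit_encode is made up for illustration; it only mimics what Python 2 does behind the scenes for str.encode(), and it is written with b'' literals so it behaves the same on Python 2.7 and Python 3.

```python
# UTF-8 bytes for "cafe" with an accented e: contains two non-ASCII bytes.
raw = b'caf\xc3\xa9'


def implicit_encode(byte_string, target='utf-8'):
    """Hypothetical helper: mimic Python 2's str.encode().

    Python 2 first decodes the byte string with the ASCII codec,
    then encodes the resulting Unicode text with the target codec.
    """
    return byte_string.decode('ascii').encode(target)


try:
    implicit_encode(raw)
    failed = False
    message = ''
except UnicodeDecodeError as exc:
    # The failure comes from the decode step, not from the encode step.
    failed = True
    message = str(exc)

print(failed)
print(message)
```

Pure-ASCII input passes through unchanged, which is why the bug often stays hidden until the first row with a non-ASCII character shows up.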

The Fixes

The core rule here is: Decode first to get Unicode, then encode if you need byte strings later. Here are a few actionable solutions:

1. Decode with the CSV's Actual Encoding

First, figure out what encoding your Book2.csv uses (common ones are utf-8, cp1252 for Windows, or gbk for Chinese text). Then explicitly decode the byte string to Unicode:

import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Replace 'utf-8' with your CSV's actual encoding
        unicode_tweet = row["Tweet"].decode('utf-8')
        TEST_SENTENCES.append(unicode_tweet)
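For the "figure out the encoding" step, one rough approach is to try a short list of candidates in order and keep the first one that decodes without error. The sketch below assumes that approach; guess_encoding and its candidate list are hypothetical names, not part of the csv module. A clean decode does not prove the guess is right, since some byte sequences are valid in several encodings, so check the decoded text by eye.

```python
def guess_encoding(raw_bytes, candidates=('utf-8', 'gbk', 'cp1252')):
    """Hypothetical helper: return the first candidate that decodes cleanly.

    Returns None if every candidate raises UnicodeDecodeError.
    Order matters: stricter encodings such as utf-8 should come first.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Sample inputs built from escaped Unicode literals, so no source-file
# encoding declaration is needed.
utf8_sample = u'caf\xe9'.encode('utf-8')
gbk_sample = u'\u4e2d\u6587'.encode('gbk')

print(guess_encoding(utf8_sample))
print(guess_encoding(gbk_sample))
```

In practice you would pass in a chunk of the file read in 'rb' mode, then use the returned name in the .decode() call.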

2. If You Need UTF-8 Byte Strings

If your downstream program expects UTF-8 encoded byte strings, decode first then encode properly:

import csv

TEST_SENTENCES = []
with open('Book2.csv', 'rb') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Step 1: Decode to Unicode (use your CSV's actual encoding here)
        unicode_tweet = row["Tweet"].decode('utf-8')
        # Step 2: Encode to UTF-8 byte string
        encoded_tweet = unicode_tweet.encode('utf-8')
        TEST_SENTENCES.append(encoded_tweet)
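The decode-then-encode pair matters most when the file is not already UTF-8. As a small sketch, assuming a GBK-encoded source, the hypothetical transcode helper below converts bytes from one encoding to another by going through Unicode in the middle.

```python
def transcode(raw_bytes, source_encoding, target_encoding='utf-8'):
    """Hypothetical helper: re-encode bytes by passing through Unicode."""
    return raw_bytes.decode(source_encoding).encode(target_encoding)


# Two Chinese characters encoded as GBK (an assumed source encoding).
gbk_bytes = u'\u4e2d\u6587'.encode('gbk')
utf8_bytes = transcode(gbk_bytes, 'gbk')

# The byte sequences differ, but both represent the same text.
print(repr(gbk_bytes))
print(repr(utf8_bytes))
```

If the source really is UTF-8, the round trip returns the same bytes, and its only effect is to validate that the input is well-formed.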

3. Fallback with Latin-1 (If You Don't Know the Encoding)

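If you're unsure of the CSV's encoding, latin-1 is a crash-proof fallback: it maps each of the 256 possible byte values to a code point, so decoding never raises. The catch is that if the file is actually in another encoding, non-ASCII characters will come out as garbled text (mojibake), so use this to keep the script running, not to get correct text.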

unicode_tweet = row["Tweet"].decode('latin-1')
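A quick sketch of what that trade-off looks like, using UTF-8 bytes as an assumed example of "the wrong guess". It also shows that the latin-1 decode is lossless, so the original bytes can be recovered later once the real encoding is known.

```python
# Every possible byte value decodes under latin-1 without an error.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode('latin-1')

# UTF-8 bytes for "cafe" with an accented e, decoded with the wrong codec.
raw = b'caf\xc3\xa9'
mojibake = raw.decode('latin-1')   # no crash, but two wrong characters

# Because latin-1 is a one-to-one byte mapping, the damage is reversible:
# encode back to the original bytes, then decode with the right codec.
recovered = mojibake.encode('latin-1').decode('utf-8')

print(len(decoded))
print(repr(mojibake))
print(repr(recovered))
```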

Quick Note

Python 2's CSV handling is notoriously finicky with Unicode compared to Python 3. Sticking to the "decode early, encode late" workflow will save you a lot of headaches with text encoding issues.
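As a rough sketch of that workflow, the hypothetical decode_row helper below converts every key and value of one row to Unicode right after reading, so all later processing works on text, and bytes are produced again only at the output step. The row here is a hand-built dict of byte strings standing in for what csv.DictReader yields in Python 2.7.

```python
def decode_row(row, encoding='utf-8'):
    """Hypothetical helper: decode all keys and values of a byte-string row.

    None values (DictReader's default for missing fields) are kept as None.
    """
    decoded = {}
    for key, value in row.items():
        text = None if value is None else value.decode(encoding)
        decoded[key.decode(encoding)] = text
    return decoded


# Stand-in for one row from csv.DictReader on a file opened in 'rb' mode.
raw_row = {b'Tweet': b'caf\xc3\xa9 time'}

# Decode early: everything after this line is Unicode text.
row = decode_row(raw_row)
shouted = row[u'Tweet'].upper()

# Encode late: convert back to bytes only when writing or sending out.
output_bytes = shouted.encode('utf-8')

print(repr(shouted))
print(repr(output_bytes))
```

Doing text operations such as upper() on the decoded value matters here: on the raw UTF-8 bytes in Python 2.7, the accented character would not be uppercased correctly.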

This question originally comes from Stack Exchange; it was asked by Patrick Reid.