How to fix Python's UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0'

Fixing UnicodeEncodeError & Unwanted JavaScript Scraping for Your Timetable Scraper

Hey there, let's break down and fix the two issues you're facing: the frustrating UnicodeEncodeError and the problem of scraping irrelevant JavaScript content instead of just the timetable data.

1. Resolving the UnicodeEncodeError

The root cause of your error is in how you're handling string encoding in Python 2. Let's look at the problematic line:

file.write(bytes(timetabledatasaved).encode("utf-8", errors="ignore"))

In Python 2, bytes() is just an alias for str(), so when you pass a Unicode string to it, Python tries to convert it using the default ASCII encoding. That's why it chokes on the \xa0 (non-breaking space) character.
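
To see the failure in isolation, here is a minimal sketch that reproduces what the implicit conversion does. The sample string is made up for illustration, and the explicit .encode("ascii") call stands in for the conversion Python 2 performs behind the scenes; the snippet itself runs on both Python 2.7 and Python 3.

```python
text = u"Room\xa0101"  # hypothetical cell text containing a non-breaking space (U+00A0)

try:
    # Stand-in for the implicit ASCII conversion done by str()/bytes() in Python 2
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Explicit UTF-8 encoding succeeds: U+00A0 becomes the two bytes C2 A0
utf8_bytes = text.encode("utf-8")
```

The ASCII codec only covers code points 0 to 127, so any character above that range triggers the error, not just \xa0.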

The Fix

Instead of converting the Unicode string to str first, directly encode it to UTF-8 bytes before writing to the file (which you already opened in binary mode with wb—good call!). Update those file-writing lines to:

file.write(header.encode("utf-8"))
file.write(timetabledatasaved.encode("utf-8", errors="ignore"))
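
As an alternative to encoding by hand, you could let the file object do it. The sketch below assumes io.open (available in Python 2.7 and Python 3) and uses a hypothetical helper name, write_text_utf8, plus made-up sample data written to a temporary directory; it is one possible approach, not a required change.

```python
import io
import os
import tempfile

def write_text_utf8(path, text):
    # Hypothetical helper: the file object encodes Unicode text to UTF-8 on write.
    # newline="" disables newline translation so "\n" is written unchanged.
    with io.open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)

# Made-up sample data containing a non-breaking space
sample = u"Activity,Room\nLecture,Room\xa0101\n"
out_path = os.path.join(tempfile.mkdtemp(), "timetable_sample.csv")
write_text_utf8(out_path, sample)

# Read the raw bytes back to confirm what landed on disk
with io.open(out_path, "rb") as handle:
    raw = handle.read()
```

With this approach the file is opened in text mode ("w"), so you pass Unicode strings to write() and never call .encode() yourself.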

2. Filtering Out Unwanted Content (Avoiding JavaScript)

It sounds like your scraper is picking up <tr> elements that belong to JavaScript or other non-timetable parts of the page. To fix this, we can add a check to only process rows with the correct number of columns (matching your header's 10 fields). We can also clean up the \xa0 characters while we're at it.

The Fix

Modify your loop over <tr> elements to filter invalid rows and clean text:

EXPECTED_COLUMNS = 10  # Matches your header's 10 columns

for record in soup.find_all('tr'):
    tds = record.find_all('td')
    # Skip rows that don't have the right number of columns
    if len(tds) != EXPECTED_COLUMNS:
        continue
    
    timetabledata = ""
    for data in tds:
        # Replace non-breaking spaces with regular spaces, and trim extra whitespace
        clean_text = data.text.strip().replace(u'\xa0', u' ')
        # Wrap text containing commas in quotes to avoid CSV formatting issues
        # (plain concatenation, because f-strings do not exist in Python 2)
        if u',' in clean_text:
            clean_text = u'"' + clean_text + u'"'
        timetabledata += "," + clean_text
    
    timetabledatasaved += "\n" + timetabledata[1:]
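
If you want to check the filtering logic without hitting the network, the same idea can be sketched on plain lists of strings. Everything here is illustrative: keep_timetable_rows is a hypothetical helper, the sample rows are invented, and the column count is set to 3 only to keep the example short (the real timetable uses 10).

```python
EXPECTED_COLUMNS = 3  # small value for this example only; the timetable has 10

def keep_timetable_rows(rows, expected=EXPECTED_COLUMNS):
    """Return cleaned rows that have exactly `expected` cells."""
    cleaned = []
    for cells in rows:
        if len(cells) != expected:
            continue  # rows with a different cell count are skipped
        # Replace non-breaking spaces, then trim surrounding whitespace
        cleaned.append([c.replace(u"\xa0", u" ").strip() for c in cells])
    return cleaned

# Invented sample: one script-like row, one valid row, one short row
sample_rows = [
    [u"var x = 1;"],
    [u"Lecture", u"\xa0Room 101\xa0", u"Smith"],
    [u"a", u"b"],
]
kept = keep_timetable_rows(sample_rows)
```

Only the middle row survives, with its non-breaking spaces removed. In the scraper, each inner list would come from the text of the <td> elements of one <tr>.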

Full Corrected Code

Here's the complete, optimized script that addresses both issues:

import urllib2
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

timetabledatasaved = ""
soup = make_soup("http://timetable.ait.ie/reporting/textspreadsheet;student+set;id;AL%5FKSWFT%5FR%5F5%0D%0A?t"
                "=student+set+textspreadsheet&days=1-5&weeks=21-32&periods="
                "3-20&template=student+set+textspreadsheet")

EXPECTED_COLUMNS = 10
header = "Activity, Module, Type, Start, End, Duration, Weeks, Room, Staff, Student Groups"

for record in soup.find_all('tr'):
    tds = record.find_all('td')
    if len(tds) != EXPECTED_COLUMNS:
        continue
    
    timetabledata = ""
    for data in tds:
        clean_text = data.text.strip().replace(u'\xa0', u' ')
        # Plain concatenation, because f-strings do not exist in Python 2
        if u',' in clean_text:
            clean_text = u'"' + clean_text + u'"'
        timetabledata += "," + clean_text
    
    timetabledatasaved += "\n" + timetabledata[1:]

# Use `with` to automatically handle file closing
with open(os.path.expanduser("timetable.csv"), "wb") as file:
    file.write(header.encode("utf-8"))
    file.write(timetabledatasaved.encode("utf-8", errors="ignore"))

Extra Improvements

  • Used with statement for file handling: This ensures the file is properly closed even if an error occurs.
  • Cleaned up text: strip() removes extra whitespace, and replacing \xa0 prevents encoding headaches.
  • Fixed CSV formatting: Wrapping text with commas in quotes stops your CSV from splitting cells incorrectly.
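
Manual quoting handles commas but not cells that themselves contain double quotes. One way to cover both is the standard csv module, sketched here under the assumption that you are on Python 3 (in Python 2 the csv module expects byte strings, so this would need adjusting). The sample rows are invented.

```python
import csv
import io

# Invented sample rows: one cell with a comma, one with embedded quotes,
# one with a non-breaking space
rows = [
    [u"Lecture", u"Smith, J", u'Lab "A"'],
    [u"Tutorial", u"Room\xa0101", u"Group 1"],
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # Same cleanup as in the scraper loop, applied per cell
    writer.writerow([cell.replace(u"\xa0", u" ").strip() for cell in row])

csv_text = buffer.getvalue()
```

The writer quotes only the cells that need it and doubles any embedded quote characters, which is the convention most spreadsheet programs expect.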

The question comes from Stack Exchange; asked by tobzville.