How to fix UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in Python
Hey there, let's break down and fix the two issues you're facing: the frustrating UnicodeEncodeError and the problem of scraping irrelevant JavaScript content instead of just the timetable data.
1. Resolving the UnicodeEncodeError
The root cause of your error is in how you're handling string encoding in Python 2. Let's look at the problematic line:
file.write(bytes(timetabledatasaved).encode("utf-8", errors="ignore"))
In Python 2, bytes() is just an alias for str(), so when you pass a Unicode string to it, Python tries to convert it using the default ASCII encoding. That's why it chokes on the \xa0 (non-breaking space) character.
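To see the failure in isolation, here is a minimal sketch. The sample string is made up for illustration; the explicit encode("ascii") call stands in for the implicit conversion that Python 2 attempts inside bytes()/str().

```python
# Hypothetical cell text containing a non-breaking space (\xa0)
text = u"Room\xa0101"

try:
    # Python 2 performs this ASCII encode implicitly inside bytes()/str()
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 succeeds: \xa0 becomes the two bytes C2 A0
utf8_bytes = text.encode("utf-8")
```

The ASCII attempt raises because \xa0 is outside the 0-127 range, while UTF-8 can represent it.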
The Fix
Instead of converting the Unicode string to str first, directly encode it to UTF-8 bytes before writing to the file (which you already opened in binary mode with wb—good call!). Update those file-writing lines to:
file.write(header.encode("utf-8"))
file.write(timetabledatasaved.encode("utf-8", errors="ignore"))
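If you would rather not call .encode() by hand, one alternative is io.open, which takes an encoding argument and does the conversion on write. This is only a sketch: the shortened header, the row text, and the temporary path are all placeholders, not values from your script.

```python
import io
import os
import tempfile

header = u"Activity, Module"      # hypothetical, shortened header
rows = u"\nLecture\xa0A, Maths"   # hypothetical row with a non-breaking space

# Hypothetical output location, just for this demonstration
path = os.path.join(tempfile.mkdtemp(), "timetable_demo.csv")

# Text mode with an explicit encoding: write() accepts unicode text and
# encodes it to UTF-8; newline="" turns off newline translation
with io.open(path, "w", encoding="utf-8", newline="") as out:
    out.write(header)
    out.write(rows)

# Read the raw bytes back to confirm what ended up on disk
with io.open(path, "rb") as raw:
    written = raw.read()
```

Note that in this mode the file object expects unicode strings, so in Python 2 the literals need the u prefix.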
2. Filtering Out Unwanted Content (Avoiding JavaScript)
It sounds like your scraper is picking up <tr> elements that belong to JavaScript or other non-timetable parts of the page. To fix this, we can add a check to only process rows with the correct number of columns (matching your header's 10 fields). We can also clean up the \xa0 characters while we're at it.
The Fix
Modify your loop over <tr> elements to filter invalid rows and clean text:
EXPECTED_COLUMNS = 10  # Matches your header's 10 columns

for record in soup.find_all('tr'):
    tds = record.find_all('td')
    # Skip rows that don't have the right number of columns
    if len(tds) != EXPECTED_COLUMNS:
        continue
    timetabledata = ""
    for data in tds:
        # Replace non-breaking spaces with regular spaces, and trim extra whitespace
        clean_text = data.text.strip().replace(u'\xa0', u' ')
        # Wrap text with commas in quotes to avoid CSV formatting issues
        # (plain concatenation, because f-strings need Python 3.6+ and this is Python 2)
        if ',' in clean_text:
            clean_text = u'"' + clean_text + u'"'
        timetabledata += "," + clean_text
    timetabledatasaved += "\n" + timetabledata[1:]
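The cleaning and quoting step can also be pulled out into a small helper. The sketch below uses hypothetical names (clean_cell, row_to_csv) and goes one step beyond the loop above: it doubles any embedded double quotes, which is the usual CSV convention, on the assumption that a cell might contain them.

```python
def clean_cell(text):
    """Normalise one cell of text and quote it if CSV requires it."""
    # Swap non-breaking spaces for regular ones, then trim the ends
    text = text.replace(u"\xa0", u" ").strip()
    # Quote cells containing a comma or a double quote,
    # doubling any embedded quotes
    if u"," in text or u'"' in text:
        text = u'"' + text.replace(u'"', u'""') + u'"'
    return text


def row_to_csv(cells):
    """Join already-extracted cell strings into one CSV line."""
    return u",".join(clean_cell(cell) for cell in cells)
```

In the loop, row_to_csv([td.text for td in tds]) would then replace the manual string building.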
Full Corrected Code
Here's the complete, optimized script that addresses both issues:
import urllib2
from bs4 import BeautifulSoup
import os


def make_soup(url):
    thepage = urllib2.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


timetabledatasaved = ""
soup = make_soup("http://timetable.ait.ie/reporting/textspreadsheet;student+set;id;AL%5FKSWFT%5FR%5F5%0D%0A?t"
                 "=student+set+textspreadsheet&days=1-5&weeks=21-32&periods="
                 "3-20&template=student+set+textspreadsheet")

EXPECTED_COLUMNS = 10
header = "Activity, Module, Type, Start, End, Duration, Weeks, Room, Staff, Student Groups"

for record in soup.find_all('tr'):
    tds = record.find_all('td')
    # Skip rows that are not timetable rows
    if len(tds) != EXPECTED_COLUMNS:
        continue
    timetabledata = ""
    for data in tds:
        clean_text = data.text.strip().replace(u'\xa0', u' ')
        # Plain concatenation: f-strings are not available in Python 2
        if ',' in clean_text:
            clean_text = u'"' + clean_text + u'"'
        timetabledata += "," + clean_text
    timetabledatasaved += "\n" + timetabledata[1:]

# Use `with` to automatically handle file closing
with open(os.path.expanduser("timetable.csv"), "wb") as file:
    file.write(header.encode("utf-8"))
    file.write(timetabledatasaved.encode("utf-8", errors="ignore"))
Extra Improvements
- Used a with statement for file handling: This ensures the file is properly closed even if an error occurs.
- Cleaned up text: strip() removes extra whitespace, and replacing \xa0 prevents encoding headaches.
- Fixed CSV formatting: Wrapping text with commas in quotes stops your CSV from splitting cells incorrectly.
The question comes from Stack Exchange; asked by tobzville.




