How to decode a JavaScript Unicode string in Python? Fixing the encoding error for the two hearts emoji
Got it, let's break down why you're hitting that error and how to get the proper 💕 emoji working smoothly in Python.
What's Causing the Surrogate Error?
The root issue here is that your string a doesn't contain the single code point U+1F495 (TWO HEARTS). It most likely contains the two UTF-16 surrogate code points \ud83d and \udc95 instead.
Surrogate pairs are the UTF-16 mechanism for characters outside the Basic Multilingual Plane, which includes most emojis. JavaScript strings are sequences of 16-bit UTF-16 code units, so such a character shows up as two 2-byte "surrogate" units (for example, \ud83d + \udc95 for the two hearts emoji). If those escapes end up in a Python str as two separate code points, Python refuses to encode them as UTF-8 and raises UnicodeEncodeError ("surrogates not allowed"), because surrogate code points are not valid in UTF-8.
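To make the pairing concrete, here is a small sketch of the arithmetic UTF-16 uses to turn a high and a low surrogate into one code point. The function name combine_surrogates is made up for this illustration; the formula itself is the standard UTF-16 one.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDC95)
print(hex(code_point))         # 0x1f495
print(ascii(chr(code_point)))  # '\U0001f495', i.e. the two hearts emoji
```

So \ud83d and \udc95 are not two characters in their own right; together they are just UTF-16's way of spelling U+1F495.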
How to Fix It
Here are two straightforward ways to convert those surrogate pairs into the correct emoji:
1. Round-Trip Through UTF-16 to Combine the Surrogates
Python's str.encode() accepts the surrogatepass error handler, which lets you write the surrogates out as UTF-16 and then decode them back as a single character. Here's the code:
    # If your input string is the surrogate pair version (e.g., from JavaScript)
    surrogate_str = '\ud83d\udc95'
    # Write the surrogates out as raw UTF-16 code units, then decode them as one character
    fixed_emoji = surrogate_str.encode('utf-16', 'surrogatepass').decode('utf-16')
    print(fixed_emoji)  # Output: 💕
    # Now encoding/decoding works without errors
    print(fixed_emoji.encode('utf-8').decode('utf-8'))  # Output: 💕
The surrogatepass error handler tells the UTF-16 encoder to write the two surrogates as-is instead of raising an error. The UTF-16 decoder then reads that pair back as the single emoji it's meant to represent.
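If the surrogates can show up anywhere inside longer text, the same round trip can be wrapped in a helper. This is only a sketch: the name fix_surrogates is hypothetical, and falling back to the unchanged input when a surrogate has no partner is an assumption about what you'd want, not the only reasonable choice.

```python
def fix_surrogates(text: str) -> str:
    """Merge UTF-16 surrogate pairs in a str into real code points."""
    try:
        # utf-16-le avoids writing a byte order mark; surrogatepass lets
        # the encoder emit the surrogate code points as raw code units.
        return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    except UnicodeDecodeError:
        # A lone (unpaired) surrogate cannot be merged; keep the input as-is.
        return text


fixed = fix_surrogates("I \ud83d\udc95 Python")
print(ascii(fixed))                                    # 'I \U0001f495 Python'
print(fix_surrogates("plain ascii") == "plain ascii")  # True
```

Ordinary text passes through unchanged, so it should be safe to call on strings that may or may not contain surrogates.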
2. Fix the Source (Preventative Measure)
If you’re pulling this string from a JavaScript source, make sure the data is sent as valid UTF-8 instead of surrogate pairs. In JavaScript:
- Use encodeURIComponent() when sending the string over HTTP. It percent-encodes the string's UTF-8 bytes, which Python can decode with urllib.parse.unquote().
- Or convert the string to a UTF-8 byte array before sending (for example with TextEncoder), so Python doesn't have to deal with surrogates at all.
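On the Python side, receiving either form could look roughly like this. The literal values are written out by hand to stand in for what the JavaScript side would send; how they actually arrive (query string, request body, and so on) depends on your setup and is assumed here.

```python
from urllib.parse import unquote

# What encodeURIComponent("\ud83d\udc95") yields in JavaScript:
# the UTF-8 bytes of U+1F495, percent-encoded.
percent_encoded = "%F0%9F%92%95"

# unquote() decodes percent-escapes as UTF-8 by default.
from_url = unquote(percent_encoded)

# Raw UTF-8 bytes (for example a request body) decode directly.
from_bytes = bytes([0xF0, 0x9F, 0x92, 0x95]).decode("utf-8")

print(from_url == from_bytes == "\U0001F495")  # True
```

In both cases Python ends up with one real code point, so no surrogate handling is needed.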
Why Your Original Code Failed
When you ran a.encode('utf-8').decode('utf-8'), a held surrogate code points, and UTF-8 cannot encode those: surrogates only have meaning as pairs of UTF-16 code units. Once you convert the surrogate pair to the actual emoji character, the encode/decode cycle works because the emoji is an ordinary Unicode code point that UTF-8 can represent.
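The failure and the fix can be reproduced side by side. This sketch assumes a holds the two surrogate code points, which is a guess at the original input rather than something the question confirms.

```python
a = "\ud83d\udc95"  # assumed content of a: two surrogate code points

try:
    a.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)  # encode failed: surrogates not allowed

# Merge the pair into one code point, then UTF-8 encoding succeeds.
fixed = a.encode("utf-16", "surrogatepass").decode("utf-16")
print(len(a), len(fixed))     # 2 1
print(fixed.encode("utf-8"))  # b'\xf0\x9f\x92\x95'
```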
This question comes from Stack Exchange; it was asked by Mike.




