How many Unicode encodings are there? Are there more besides known ones such as UTF-7? Are they all in use?
Great questions! Let's break them down clearly, since there's a common mix-up between the Unicode character set (which defines unique values for characters) and the encoding schemes that turn those values into byte sequences for storage or transmission.
First, when you ask about "Unicode encodings" here, I think you may be referring to Unicode code points: the unique numerical identifiers assigned to every character in the standard. Unicode defines a code space spanning from U+0000 to U+10FFFF, which adds up to 1,114,112 possible code points (17 planes of 65,536 each).
That said, nowhere near all of these are actively used:
- A huge chunk is unassigned: Reserved for future character additions (the Unicode Consortium regularly rolls out updates with new scripts, emojis, and symbols).
- There are Private Use Areas (PUAs): Three blocks of code points meant for custom use by apps or organizations—these aren't standardized across systems, so their meaning varies depending on who's using them.
- Some are noncharacters: Specific code points (like U+FFFE or U+FFFF) that are never assigned to actual characters; they're used for internal system checks or markers.
As of Unicode 15.1, 149,813 code points have been assigned to characters (the figure was 149,186 in Unicode 15.0), covering all major modern scripts, historical scripts, emojis, and symbols.
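To make the arithmetic behind these figures concrete, here is a minimal Python sketch. The constant and function names (`PLANES`, `is_noncharacter`, and so on) are illustrative choices of mine, not anything defined by the standard; the ranges themselves are the ones Unicode reserves for the Private Use Areas and noncharacters.

```python
# Sketch: sizes of the Unicode code space and two of its reserved categories.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points in each plane

code_space = PLANES * CODE_POINTS_PER_PLANE  # U+0000 through U+10FFFF

# Three Private Use Areas: U+E000..U+F8FF in the BMP, plus planes 15 and 16.
# In planes 15 and 16 the last two code points are noncharacters, so each
# contributes 0xFFFE private-use code points (offsets 0x0000..0xFFFD).
private_use = (0xF8FF - 0xE000 + 1) + 2 * 0xFFFE

# Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the permanently reserved noncharacters."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


print(f"code space:     {code_space:,}")
print(f"private use:    {private_use:,}")
print(f"noncharacters:  {noncharacters}")
```

Whatever remains after subtracting assigned characters, private-use code points, noncharacters, and a few other reserved ranges is the unassigned space available to future versions.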
As for the encoding schemes themselves: yes, beyond the well-known ones there are a few more designed for Unicode, and several have been largely phased out over time:
Additional encoding schemes you might not know about
- UTF-1: An early variable-width, byte-oriented encoding defined in the original ISO/IEC 10646 standard, predating UTF-8. It was awkward to process and was quickly superseded by UTF-8. (UTF-16, for its part, succeeded UCS-2: UCS-2 only covered the Basic Multilingual Plane, BMP, characters from U+0000 to U+FFFF, while UTF-16 added support for supplementary planes, like many emojis or rare scripts.)
- UTF-EBCDIC: A variant built for systems using the EBCDIC character set (common in older IBM mainframes). It's rarely used today as most systems have shifted to ASCII-based environments.
- UTF-32BE/UTF-32LE: Strictly speaking, these are byte-order-specific variants of UTF-32. UTF-32 can use a BOM (Byte Order Mark) to indicate endianness, but these variants enforce big-endian or little-endian without needing a BOM.
- GB18030: While it's primarily a Chinese national encoding standard, it's fully compatible with Unicode (it can represent every Unicode code point) and is sometimes grouped with UTF schemes for its Unicode support, though it also maintains compatibility with older Chinese encodings.
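The byte-order variants and GB18030 are easy to poke at with Python's built-in codecs. This is only a sketch: the codec names `utf-32`, `utf-32-be`, `utf-32-le` and `gb18030` are the ones the Python standard library ships, and the sample characters are arbitrary choices for illustration.

```python
import codecs

text = "A"

# Plain "utf-32" prepends a BOM, written in the platform's native byte order.
with_bom = text.encode("utf-32")
has_bom = with_bom[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# The BE/LE variants fix the byte order up front, so no BOM is written.
big = text.encode("utf-32-be")
little = text.encode("utf-32-le")

print(len(with_bom), has_bom)   # BOM (4 bytes) + one code unit (4 bytes)
print(big.hex(), little.hex())

# GB18030 can round-trip both a BMP character and a supplementary-plane one.
for ch in ["中", "\U0001F600"]:
    encoded = ch.encode("gb18030")
    assert encoded.decode("gb18030") == ch
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")
```

The common Chinese character keeps the two-byte form it has in the older GB2312/GBK encodings, which is the backward compatibility mentioned above; characters outside those older sets fall back to four-byte sequences.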
Which schemes are still in use, and which are obsolete?
- Widely used today:
  - UTF-8: The de facto standard for the web, file storage, and most modern apps. It's space-efficient for ASCII characters and supports all Unicode code points, making it the go-to choice for most scenarios.
  - UTF-16: Used in some operating systems (like Windows' internal API) and programming languages (Java, C#). It's efficient for BMP characters but uses 4 bytes for supplementary plane characters.
  - UTF-32: Rarely used for transmission due to its fixed 4-byte size (wasteful for most common characters), but it's sometimes used in internal system processing where fixed-size code points simplify logic.
- Largely deprecated or obsolete:
  - UCS-2: A subset of UTF-16 that only supports BMP characters. It can't handle supplementary plane characters, so it's been replaced by UTF-16 in almost all modern systems.
  - UTF-7: Designed for systems limited to 7-bit ASCII (like old email protocols), but it has serious security vulnerabilities (e.g., it can be exploited to bypass input filters) and is rarely used today.
  - UTF-1: Completely obsolete, replaced by UTF-8 decades ago.
  - UTF-EBCDIC: Only found in legacy mainframe environments, not in general consumer or enterprise use.
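The size trade-offs described in the two lists above can be checked directly. The following is a rough Python sketch using the standard library codecs; the three sample characters are arbitrary picks of mine (one ASCII, one BMP, one supplementary-plane).

```python
samples = {
    "A": "ASCII",
    "中": "BMP",
    "\U0001F600": "supplementary plane",
}

# Byte length of each sample in the three mainstream encodings
# (the -le variants are used so that no BOM inflates the counts).
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    for ch in samples
}
for ch, kind in samples.items():
    print(f"U+{ord(ch):04X} ({kind}): {sizes[ch]}")

# In UTF-16 a supplementary-plane character becomes a surrogate pair:
# two 16-bit code units, which is what UCS-2 has no way to express.
pair = "\U0001F600".encode("utf-16-be")
print(pair.hex())  # d83dde00

# UTF-7 stays within 7-bit bytes by wrapping non-ASCII text in "+...-" runs.
utf7 = "中".encode("utf-7")
print(utf7, all(b < 0x80 for b in utf7))
```

UTF-8 grows from 1 to 4 bytes across the three samples, UTF-16 from 2 to 4, and UTF-32 stays at 4 throughout, which is why UTF-8 wins for ASCII-heavy text while UTF-32 is only attractive where a fixed width matters more than size.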
The question comes from Stack Exchange; it was asked by user4582812.




