Differences Between Simple and Full Case Folding in Python's regex Module, with Examples
Simple vs. Full Case Folding in the regex Module
Great question! Let's break down the key differences between these two modes, then look at concrete examples where V1 (full folding) matches something V0 (simple folding) won't.
What's the Difference?
First, let's clarify what each folding type actually does:
- Simple Case Folding (V0 mode): This follows Unicode's simple case-folding rules, which are strictly one-to-one character mappings. Think of it as the "basic" case-insensitive comparison you're used to, where `é` and `É`, or `и` and `И`, count as the same letter. The number of characters never changes, and V0 is the mode the regex module provides for compatibility with Python's built-in `re` module.
- Full Case Folding (V1 mode): This uses Unicode's full case-folding rules, which add the mappings that simple folding leaves out: cases where one character folds to a sequence of several characters, so a single character in the pattern can match two or more characters in the text (and vice versa). With V1, case-insensitive matching uses full folding by default, which follows Unicode's case-folding data more closely.
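To make the one-to-one versus one-to-many distinction concrete without involving a regex at all, here is a minimal sketch using Python's built-in `str.casefold()`, which applies Unicode's full case folding. The sample characters are chosen for illustration and are not taken from the question.

```python
# str.casefold() applies Unicode *full* case folding, so it reveals which
# characters grow when folded (the mappings simple folding cannot express).
samples = ["É", "И", "ß", "\ufb01"]  # illustrative picks; "\ufb01" is the "fi" ligature

folded = {ch: ch.casefold() for ch in samples}

for ch, result in folded.items():
    print(f"{ch!r} -> {result!r} (length {len(ch)} -> {len(result)})")
```

The first two stay one character long, while `ß` and the `ﬁ` ligature each fold to two characters, which is the kind of mapping simple folding leaves out.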
Example 1: German Sharp S (ß)
The German ß is a perfect example of where the modes diverge:
- In full folding (V1), `ß` expands to `ss` when folded. So a case-insensitive regex for `ß` will match `ß`, `ss`, and `SS`.
- In simple folding (V0), `ß` only maps to itself (its folded form `ss` is two characters long, which a one-to-one mapping cannot express), so it won't match `ss` or `SS`.
Test this in code:
```python
import regex

# V0 mode: No match for SS against ß
print(regex.search("(?V0i)ß", "SS"))  # Output: None

# V1 mode: Matches SS against ß
print(regex.search("(?V1i)ß", "SS"))  # Output: <regex.Match object; span=(0, 2), match='SS'>
```
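As far as I can tell from the regex module's documentation, the V0/V1 split shown above comes down to its FULLCASE flag (`regex.FULLCASE`, or `(?f)` inline), which is described as off by default in version 0 and on by default in version 1. The sketch below assumes that description holds for your installed version and simply reports what each flag combination does with `ß` against `SS`. The `try_match` helper is a hypothetical name for illustration, and the import is guarded because regex is a third-party package.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # lets the script report the problem instead of crashing


def try_match(flags):
    """Hypothetical helper: does the pattern 'ß' match 'SS' under these flags?"""
    return regex.search("ß", "SS", flags=flags) is not None


if regex is None:
    print("The regex package is not installed.")
else:
    # Assumption: FULLCASE only changes how IGNORECASE behaves, per the docs.
    combos = {
        "V0 + IGNORECASE": regex.V0 | regex.IGNORECASE,
        "V0 + IGNORECASE + FULLCASE": regex.V0 | regex.IGNORECASE | regex.FULLCASE,
        "V1 + IGNORECASE": regex.V1 | regex.IGNORECASE,
    }
    for label, flags in combos.items():
        print(f"{label}: {try_match(flags)}")
```

If the middle line reports the same result as the V1 line, that suggests you can stay in V0 and opt into full folding only where you need it.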
Example 2: Greek Final Sigma (ς)
Greek has two lowercase variants of sigma: σ (used in the middle of words) and ς (used at the end).
- In full folding (V1), both variants fold to `σ`, the same form that uppercase `Σ` folds to. So a case-insensitive regex for `Σ` will match both `σ` and `ς`.
- In simple folding (V0), `Σ` folds to `σ` as well. One caveat: Unicode's CaseFolding.txt lists `ς` → `σ` as a common (C) mapping, which is part of simple folding too, so unlike `ß`, this pair may match in both modes rather than only in V1.
Test code:
```python
import regex

# V0 mode: Σ against final sigma ς (compare with the V1 result on your installed version)
print(regex.search("(?V0i)Σ", "ς"))

# V1 mode: Matches final sigma ς against Σ
print(regex.search("(?V1i)Σ", "ς"))  # Output: <regex.Match object; span=(0, 1), match='ς'>
```
Why Your Test Examples Didn't Show a Difference
In your tests with `é` and `E`, neither mode matches because neither simple nor full case folding maps `é` to `E`. Case folding only equates `é` with `É`; it never removes the accent. To see the split between V0 and V1, you need a character whose full folding differs from its simple folding, such as `ß` above.
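A quick way to confirm this is to compare folded forms directly. The following is a minimal sketch using only the standard library (`str.casefold()` and the built-in `re` module); the `same_ignoring_case` helper is a hypothetical name introduced for illustration, and the accented letters are written as escapes so the precomposed code points are unambiguous.

```python
import re

E_ACUTE_LOWER = "\u00e9"  # é (precomposed)
E_ACUTE_UPPER = "\u00c9"  # É (precomposed)


def same_ignoring_case(a, b):
    """Hypothetical helper: compare two strings by their full case-folded forms."""
    return a.casefold() == b.casefold()


print(same_ignoring_case(E_ACUTE_LOWER, E_ACUTE_UPPER))  # only the case differs
print(same_ignoring_case(E_ACUTE_LOWER, "E"))            # the accent differs too

# The built-in re module agrees: IGNORECASE bridges case, not accents.
accent_match = re.fullmatch(E_ACUTE_LOWER, E_ACUTE_UPPER, flags=re.IGNORECASE)
plain_match = re.fullmatch(E_ACUTE_LOWER, "E", flags=re.IGNORECASE)
print(accent_match is not None, plain_match is not None)
```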
The question in this content comes from Stack Exchange; question author: iBug.




