How to group by anagrams with Pandas? Implementing set-equivalence grouping logic
Efficient Anagram Grouping with Pandas (No Slow Loops!)
Great question! Let's fix this up to efficiently group anagrams with Pandas, ditching those slow row-by-row loops. First, let's break down the issues with your current code:
- Using `iterrows()` is inefficient (especially for larger datasets): Pandas is built for vectorized operations, not manual row iteration.
- Grouping by the original `word` column won't group anagrams together, since "acb" and "bca" are treated as distinct strings.
- The `split` column approach doesn't create a consistent identifier for matching anagrams.
The Core Idea: Create a Unique Anagram Key
For any set of anagrams, sorting the letters of the word will produce the same string (e.g., "acb" → sorted to "abc", "bca" → also "abc"). This sorted string acts as a reliable grouping key that all anagrams share.
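Before bringing Pandas in, the idea can be checked in plain Python. The following is only a minimal sketch: the helper names `anagram_key` and `group_anagrams` and the sample list are illustrative assumptions, not part of the original code.

```python
from collections import defaultdict


def anagram_key(word: str) -> str:
    """Return the word's letters in sorted order, e.g. 'bca' -> 'abc'."""
    return "".join(sorted(word))


def group_anagrams(words):
    """Map each sorted-letter key to the words that share it, in input order."""
    groups = defaultdict(list)
    for word in words:
        groups[anagram_key(word)].append(word)
    return dict(groups)


# Illustrative sample, assumed to resemble the asker's data.
sample = ["acb", "bca", "foo", "oof", "spaniel"]
print(group_anagrams(sample))
# {'abc': ['acb', 'bca'], 'foo': ['foo', 'oof'], 'aeilnps': ['spaniel']}
```

The Pandas version does the same thing, with the key stored as a column and `groupby` taking over the role of the dictionary.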
Here's the optimized code, with no explicit row loop:

    import pandas as pd

    # Step 1: Read your word list correctly
    # Adjust sep if your file uses a different delimiter (r'\s+' splits on whitespace;
    # omit sep if the file has one word per line)
    wordlist = pd.read_csv(
        'data/example.txt', sep=r'\s+', header=None, names=['word'],
        dtype=str, keep_default_na=False,  # keep words such as "null" or "NA" as text, not NaN
    )

    # Step 2: Remove duplicates (you had this part right!)
    wordlist = wordlist.drop_duplicates(keep='first')

    # Step 3: Generate a consistent anagram key (sorted letters joined into a string)
    wordlist['anagram_key'] = wordlist['word'].apply(lambda x: ''.join(sorted(x)))

    # Step 4: Group by the key and collect all anagrams in each group
    anagram_groups = wordlist.groupby('anagram_key')['word'].apply(list).reset_index(name='anagrams')

    print(anagram_groups)
What This Does:
- Reading the file: the `sep` pattern `\s+` treats any run of whitespace as the delimiter, which suits the quoted words in your sample. If your file has one word per line, just remove the `sep` parameter entirely. Either way, `names=['word']` assumes each row yields a single word.
- Anagram key: The lambda function sorts each word's characters and joins them back into a string, so every word in an anagram set gets the same key.
- Grouping: `groupby('anagram_key')` clusters all anagrams together, and `apply(list)` collects each group's words into a list.
Example Output for Your Sample Data:
      anagram_key    anagrams
    0         abc  [acb, bca]
    1     aeilnps   [spaniel]
    2         foo  [foo, oof]

(`groupby` sorts the keys by default, so "aeilnps" comes before "foo".)
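To try the grouping without a file on disk, the sketch below builds the word list in memory. The sample words are an assumption about what the file contains, and two optional variations are shown: `sort=False`, which keeps groups in order of first appearance, and a filter that keeps only keys shared by at least two words.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the word list file (note the duplicate "foo").
wordlist = pd.DataFrame({"word": ["acb", "bca", "foo", "oof", "spaniel", "foo"]})
wordlist = wordlist.drop_duplicates(keep="first")

# Same sorted-letter key as in the answer.
wordlist["anagram_key"] = wordlist["word"].map(lambda w: "".join(sorted(w)))

# sort=False keeps groups in order of first appearance instead of sorting the keys.
anagram_groups = (
    wordlist.groupby("anagram_key", sort=False)["word"]
    .agg(list)
    .reset_index(name="anagrams")
)

# Keep only keys shared by at least two words, i.e. actual anagram sets.
multi = anagram_groups[anagram_groups["anagrams"].str.len() > 1]
print(multi)
```

Whether single-word groups such as "spaniel" should be kept depends on what the result is used for, so the filter is a design choice rather than a requirement.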
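Alternative: Using Frozensets (Groups by Letter Set, Not by Anagram)
If what you actually want is set equivalence (words built from the same distinct letters), you can use a frozenset as the key (regular sets aren't hashable and can't be used as group keys):

    wordlist['anagram_key'] = wordlist['word'].apply(lambda x: frozenset(x))

Be aware that a frozenset discards repeated letters, so "foo" and "of" would land in the same group even though they are not anagrams. For strict anagram grouping, sorted strings are the correct key, and they are also more readable and easier to debug, so they're the better default choice.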
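The difference between the two keys is easy to check in plain Python. The sketch below compares the sorted-string key, the frozenset key, and a third option that is not in the original answer: a frozenset of `Counter` items, which keeps letter counts and so behaves like a multiset. All three helper names are illustrative.

```python
from collections import Counter


def sorted_key(word):
    # Sorted letters: equal only for true anagrams.
    return "".join(sorted(word))


def set_key(word):
    # Distinct letters only: repeated letters are lost.
    return frozenset(word)


def multiset_key(word):
    # (letter, count) pairs: hashable and sensitive to repeats.
    return frozenset(Counter(word).items())


pairs = [("acb", "bca"), ("foo", "oof"), ("foo", "of"), ("aab", "abb")]
for a, b in pairs:
    print(
        a, b,
        "sorted:", sorted_key(a) == sorted_key(b),
        "set:", set_key(a) == set_key(b),
        "multiset:", multiset_key(a) == multiset_key(b),
    )
```

For the last two pairs only the plain frozenset key reports a match, which is the behaviour to watch for when the goal is anagram grouping rather than set equivalence.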
Why This Beats Loops:
- `apply` is not truly vectorized (it still calls the Python lambda once per word), but it avoids building a Series object for every row the way `iterrows()` does, which typically makes it much faster on large datasets.
- The logic is clean, maintainable, and avoids manual row-by-row processing.
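A rough way to see the gap on your own machine is to time the approaches against each other. This is only a sketch: the data is the sample words repeated (an assumption), the function names are illustrative, and the timings will vary by machine and Pandas version, so none are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical data: five sample words repeated to get a few thousand rows.
words = pd.DataFrame({"word": ["acb", "bca", "foo", "oof", "spaniel"] * 1000})


def keys_iterrows(df):
    # iterrows() builds a full Series object for every row.
    return ["".join(sorted(row["word"])) for _, row in df.iterrows()]


def keys_apply(df):
    # Still one Python call per value, but no per-row Series construction.
    return df["word"].apply(lambda w: "".join(sorted(w))).tolist()


def keys_listcomp(df):
    # Plain list comprehension over the column values.
    return ["".join(sorted(w)) for w in df["word"]]


# All three approaches must produce identical keys.
assert keys_iterrows(words) == keys_apply(words) == keys_listcomp(words)

for fn in (keys_iterrows, keys_apply, keys_listcomp):
    seconds = timeit.timeit(lambda: fn(words), number=3)
    print(f"{fn.__name__}: {seconds:.4f} s for 3 runs")
```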
This question originally comes from Stack Exchange; it was asked by Davtho1983.




