How do you group anagrams with Pandas? Implementing set-equivalence grouping logic

阿华AIGC实验室

2026-5-15

Efficient Anagram Grouping with Pandas (No Slow Loops!)

Great question! Let's fix this up to efficiently group anagrams with Pandas, ditching those slow row-by-row loops. First, let's break down the issues with your current code:

Using iterrows() is inefficient (especially for larger datasets) — Pandas is built for vectorized operations, not manual row iteration.
Grouping by the original word column won't group anagrams together, since "acb" and "bca" are treated as distinct strings.
The split column approach doesn't create a consistent identifier for matching anagrams.

The Core Idea: Create a Unique Anagram Key

For any set of anagrams, sorting the letters of the word will produce the same string (e.g., "acb" → sorted to "abc", "bca" → also "abc"). This sorted string acts as a reliable grouping key that all anagrams share.
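As a quick check of the idea before bringing Pandas in, here is a minimal sketch in plain Python. The helper name anagram_key and the word list are illustrative assumptions, not taken from the original question.

```python
def anagram_key(word: str) -> str:
    """Return the word's letters in sorted order, joined back into a string."""
    return "".join(sorted(word))


# Illustrative words only; any list of strings works the same way.
words = ["acb", "bca", "foo", "oof", "spaniel"]

# Map each word to its key so that shared keys are easy to spot.
keys = {word: anagram_key(word) for word in words}
print(keys)
```

Words that print the same key are anagrams of each other, which is exactly what the grouping step relies on.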

Here's the optimized version, with no explicit row-by-row loop:

import pandas as pd

# Step 1: Read your word list correctly
# Adjust sep if your file uses a different delimiter (r'\s+' splits on whitespace; omit sep if there is one word per line)
# The raw string avoids Python's invalid-escape warning for '\s'
wordlist = pd.read_csv('data/example.txt', sep=r'\s+', header=None, names=['word'])

# Step 2: Remove duplicates (you had this part right!)
wordlist = wordlist.drop_duplicates(keep='first')

# Step 3: Generate a consistent anagram key (sorted letters joined into a string)
wordlist['anagram_key'] = wordlist['word'].apply(lambda x: ''.join(sorted(x)))

# Step 4: Group by the key and collect all anagrams in each group
anagram_groups = wordlist.groupby('anagram_key')['word'].apply(list).reset_index(name='anagrams')

print(anagram_groups)

What This Does:

Reading the file: sep='\s+' handles spaces between quoted words in your sample. If your file has one word per line, just remove the sep parameter entirely.
Anagram key: The lambda function sorts each word's characters and joins them back into a string. This creates a unique, shared identifier for every anagram set.
Grouping: groupby('anagram_key') clusters all anagrams together, and apply(list) collects them into a clean list for each group.
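To try the same steps without a file on disk, the sketch below builds the DataFrame from an in-memory list. The words are assumed sample data, and the final filter (keeping only keys shared by two or more words) is an optional extra rather than part of the original answer.

```python
import pandas as pd

# Assumed sample data standing in for the contents of the word file.
wordlist = pd.DataFrame({"word": ["acb", "bca", "foo", "oof", "spaniel", "foo"]})

# Drop the repeated "foo" so each word appears once.
wordlist = wordlist.drop_duplicates(keep="first")

# Sorted letters act as the shared key for every anagram set.
wordlist["anagram_key"] = wordlist["word"].apply(lambda w: "".join(sorted(w)))

# Collect the words under each key; groupby returns keys in sorted order.
groups = wordlist.groupby("anagram_key")["word"].apply(list)
result = groups.to_dict()
print(result)

# Optional: keep only keys that are shared by at least two words.
multi = groups[groups.map(len) > 1].to_dict()
print(multi)
```

Converting to a dict is only for easy inspection; the grouped Series can be used directly.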

Example Output for Your Sample Data:

  anagram_key    anagrams
0         abc  [acb, bca]
1     aeilnps   [spaniel]
2         foo  [foo, oof]

Alternative: Using Frozensets (Set Equivalence, Not True Anagrams)

If you want to group words that use the same set of letters, you can key on a frozenset (regular sets aren't hashable and can't be used as group keys):

wordlist['anagram_key'] = wordlist['word'].apply(frozenset)

Be aware that a frozenset discards repeated letters, so this is looser than anagram matching: "foo" and "of" both map to the set {'f', 'o'} and land in the same group. For true anagrams, stick with sorted strings, which are also more readable and easier to debug.
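The difference between the two kinds of key is easy to see on a pair of words with repeated letters. The sketch below is illustrative only: set_key, sorted_key, and count_key are hypothetical helper names, and count_key (built on collections.Counter) is an extra option not mentioned in the original answer.

```python
from collections import Counter


def set_key(word: str) -> frozenset:
    """Key on the set of distinct letters; repeated letters are ignored."""
    return frozenset(word)


def sorted_key(word: str) -> str:
    """Key on the sorted letters; repeated letters are preserved."""
    return "".join(sorted(word))


def count_key(word: str) -> frozenset:
    """Hashable letter-count key: a frozenset of (letter, count) pairs."""
    return frozenset(Counter(word).items())


# "foo" and "of" share the same distinct letters but are not anagrams.
print(set_key("foo") == set_key("of"))        # True: the set key merges them
print(sorted_key("foo") == sorted_key("of"))  # False: the sorted key keeps them apart
print(count_key("foo") == count_key("oof"))   # True: real anagrams still match
```

Either sorted_key or count_key gives true anagram grouping; set_key is appropriate only when set equivalence is what you actually want.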

Why This Beats Loops:

Series.apply still calls the Python lambda once per word, so it is not truly vectorized. It does avoid building a full Series object for every row the way iterrows() does, which makes it much faster on large datasets.
The logic is clean, maintainable, and avoids manual row-by-row processing.
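If you want to measure the difference yourself, here is a rough sketch using the standard timeit module. The function names and the repeated word list are assumptions made for illustration, and the timings will vary by machine and Pandas version, so no numbers are claimed here.

```python
import timeit

import pandas as pd

# Assumed test data: a small word list repeated to get a few thousand rows.
words = pd.DataFrame({"word": ["acb", "bca", "foo", "oof", "spaniel"] * 400})


def keys_with_iterrows(df: pd.DataFrame) -> list:
    """Build keys row by row; iterrows creates a Series for every row."""
    return ["".join(sorted(row["word"])) for _, row in df.iterrows()]


def keys_with_apply(df: pd.DataFrame) -> list:
    """Build keys with Series.apply, as in the answer's code."""
    return df["word"].apply(lambda w: "".join(sorted(w))).tolist()


def keys_with_comprehension(df: pd.DataFrame) -> list:
    """Build keys with a plain list comprehension over the column."""
    return ["".join(sorted(w)) for w in df["word"]]


# All three approaches must produce identical keys.
expected = keys_with_apply(words)
same_result = (
    keys_with_iterrows(words) == expected
    and keys_with_comprehension(words) == expected
)
print("identical keys:", same_result)

for fn in (keys_with_iterrows, keys_with_apply, keys_with_comprehension):
    seconds = timeit.timeit(lambda: fn(words), number=3)
    print(f"{fn.__name__}: {seconds:.4f}s for 3 runs")
```

The comparison only covers key generation; the groupby step is the same in every case.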

The question behind this content comes from Stack Exchange; it was asked by Davtho1983.