
A technical inquiry about efficient algorithms and data structures for computing the most common sub-anagrams in English

An Efficient Approach to Computing the Most Common Sub-Anagrams in English

Great question: nested loops over every pair of dictionary words are a non-starter for large datasets, so we need a preprocessing-focused approach that trades a one-time setup cost for fast aggregation afterwards. Here's a practical, efficient solution with clear data structures and steps:

Core Insight

Instead of checking every pair of words directly, we'll encode each word's letter composition into a hashable "fingerprint", then use these fingerprints to quickly tally how many words can contain a given candidate as a sub-anagram. This cuts down redundant comparisons drastically by grouping anagrams together (since they share the same fingerprint).
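As a quick illustration of the insight (using collections.Counter just for this check; the pipeline below uses fixed-order tuples instead):

from collections import Counter

# Anagrams share one letter composition, so they share one fingerprint
print(Counter("bar") == Counter("bra"))  # True

# "bar" is a sub-anagram of "brave": "brave" has at least as many of each letter
print(all(Counter("brave")[c] >= n for c, n in Counter("bar").items()))  # True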

Key Data Structures

We'll use three complementary structures to optimize speed and storage:

  • Letter Frequency Tuples: A fixed-order tuple (length 26, one entry per letter a-z) where each value is the count of that letter in the word. For example, "bar" becomes a tuple with 1s at indices 0 (a), 1 (b), and 17 (r) and 0s everywhere else. Tuples are hashable, so they work perfectly as dictionary keys.
  • Feature-to-Total-Frequency Map: A dictionary (dict[Tuple[int, ...], int]) that maps each letter frequency tuple to the total number of times words with that composition appear in your corpus (e.g., if "bar", "bra", and "rab" each appear 2 times, their shared tuple maps to 6).
  • Length-Grouped Feature List: A dictionary mapping each total letter count (the sum of a tuple) to the list of frequency tuples with that length. This lets us immediately skip every word shorter than the candidate sub-anagram.

Step-by-Step Algorithm

1. Preprocess the Dictionary

First, we'll convert all words into their fingerprints and aggregate their frequencies:

from collections import defaultdict

def get_letter_fingerprint(word):
    fingerprint = [0] * 26
    for char in word.lower():
        idx = ord(char) - ord('a')
        if 0 <= idx < 26:  # Ignore non-alphabetic chars if needed
            fingerprint[idx] += 1
    return tuple(fingerprint)

# Example corpus: replace with your actual dictionary/corpus
corpus = ["bar", "brave", "rab", "bra", "abracadabra", "bar", "brave"]
word_frequencies = defaultdict(int)
for word in corpus:
    word_frequencies[word.lower()] += 1

# Build feature-to-total map and length groups
feature_total = defaultdict(int)
length_groups = defaultdict(list)
for word, freq in word_frequencies.items():
    fp = get_letter_fingerprint(word)
    feature_total[fp] += freq
    total_length = sum(fp)
    length_groups[total_length].append(fp)

# Sort length groups for faster iteration later
sorted_lengths = sorted(length_groups.keys(), reverse=True)
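
A quick sanity check on the sample corpus (the numbers follow directly from the setup above):

# "bar" (x2), "rab" (x1), and "bra" (x1) share one fingerprint, so their
# frequencies aggregate to 4 under a single key
print(feature_total[get_letter_fingerprint("bar")])  # 4
print(sorted_lengths)  # [11, 5, 3]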

2. Calculate Sub-Anagram Counts for Each Candidate

For each word you want to evaluate (all words in the corpus, or a subset), compute the total number of word occurrences that contain it as a sub-anagram:

def count_sub_anagram_occurrences(candidate_word):
    candidate_fp = get_letter_fingerprint(candidate_word)
    candidate_length = sum(candidate_fp)
    total = 0
    
    # Only check features with length >= candidate's length
    for length in sorted_lengths:
        if length < candidate_length:
            break  # Since lengths are sorted descending, no need to check further
        for fp in length_groups[length]:
            # Check if fp has at least as many of each letter as candidate_fp
            is_superset = all(fp[i] >= candidate_fp[i] for i in range(26))
            if is_superset:
                total += feature_total[fp]
    return total

# Example: calculate counts for all words
sub_anagram_counts = {}
for word in word_frequencies.keys():
    sub_anagram_counts[word] = count_sub_anagram_occurrences(word)
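
On the sample corpus this yields, for instance:

# "bar" is contained in bar (freq 2), rab (1), bra (1), brave (2),
# and abracadabra (1), for a total of 7
print(sub_anagram_counts["bar"])  # 7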

3. Sort and Output Results

Finally, sort the results by occurrence count (descending) to get the most common sub-anagrams:

# Sort by count, then alphabetically for ties
sorted_results = sorted(sub_anagram_counts.items(), key=lambda x: (-x[1], x[0]))

# Print top results
print("Most common sub-anagrams (count: word):")
for word, count in sorted_results[:10]:
    print(f"{count}: {word}")
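
With the sample corpus, "bar", "bra", and "rab" tie at 7 (each fits inside all seven word occurrences), followed by "brave" at 2 and "abracadabra" at 1; the sort key breaks ties alphabetically.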

Optimizations for Large Datasets

  • Prune Rare Features: If your corpus has millions of words, you can ignore features with very low total frequency to reduce the number of checks.
  • Bitmask Encoding: You can pack each letter's count into a fixed-width segment of a single integer. With a spare guard bit per segment (e.g., 6 bits per letter for counts ≤ 31, so 26×6 = 156 bits, which fits comfortably in Python's arbitrary-length integers), the entire superset check reduces to one subtraction plus one mask via bitwise operations, though readability suffers; a sketch follows this list.
  • Superset Indexing: Precompute for each feature all features that are its superset (only feasible for small corpora or short words, as the number of supersets grows exponentially with word length).
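
Here's a minimal sketch of that bitmask encoding, reusing get_letter_fingerprint from Step 1 and assuming per-letter counts stay ≤ 31; the guard bit in each segment absorbs any borrow, so one field can never bleed into its neighbor:

FIELD = 6  # 5 value bits for counts up to 31, plus 1 guard bit per letter
GUARD = sum(1 << (FIELD * i + 5) for i in range(26))  # the guard bit of each field

def pack_fingerprint(fp):
    # Pack a 26-tuple of letter counts (each assumed <= 31) into one integer
    code = 0
    for i, count in enumerate(fp):
        code |= count << (FIELD * i)
    return code

def packed_is_superset(packed_fp, packed_candidate):
    # Setting every guard bit before subtracting means a field loses its guard
    # bit only if it had to borrow, i.e. that letter's count was too small
    return ((packed_fp | GUARD) - packed_candidate) & GUARD == GUARD

# "brave" contains "bar" as a sub-anagram, but not the other way around
big = pack_fingerprint(get_letter_fingerprint("brave"))
small = pack_fingerprint(get_letter_fingerprint("bar"))
print(packed_is_superset(big, small))   # True
print(packed_is_superset(small, big))   # False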

Edge Cases to Consider

  • Case Insensitivity: The example converts all words to lowercase—adjust if you need case-sensitive matching.
  • Non-Alphabetic Characters: The fingerprint function ignores non-a-z chars; modify it if you need to include numbers or symbols.
  • Excluding the Word Itself: If you don't want to count the candidate word's own occurrences, subtract word_frequencies[candidate_word] from the total count (a minimal helper is sketched below).
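
For the last point, one tiny helper (the name count_excluding_self is ours, just for illustration):

def count_excluding_self(candidate_word):
    # Remove the candidate's own occurrences (its anagrams still count)
    own = word_frequencies.get(candidate_word.lower(), 0)
    return count_sub_anagram_occurrences(candidate_word) - own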

This question comes from Stack Exchange; the original asker is Jeremy Gottfried.
