如何高效存储层级类目数据？现有Python实现的优化问询

阿华AIGC实验室

2026-5-14

Optimizing Category Hierarchy Counting

Great question! Your current O(n) solution is already solid—since we have to process every line and every level of each category path, you can’t get better than linear time relative to the total number of category nodes. That said, there are tweaks to improve code clarity, reduce string operation overhead, or optimize memory usage depending on your needs.

Let’s Start with Your Existing Code’s Strengths

Your implementation correctly counts every level of each category path, and its time complexity is effectively O(total_nodes) (where total_nodes is the sum of levels across all lines). This is optimal because every category level needs to be counted at least once.

Alternative Approaches & Optimizations

1. Incremental Path Building (Reduce String Overhead)

Your current code uses '>'.join(llist[:i]) which creates a new string by slicing the list each time. For deeply nested categories, building the path incrementally cuts down on unnecessary string operations:

import os

def get_data_optimized(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError("Please provide a valid file path")
    
    inventory = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split and clean up category parts
            parts = [p.strip() for p in line.split('&gt;')]
            current_path = None
            for part in parts:
                # Build path step-by-step
                if current_path is None:
                    current_path = part
                else:
                    current_path = f"{current_path}&gt;{part}"
                # Update count
                inventory[current_path] = inventory.get(current_path, 0) + 1
    return inventory

This approach avoids slicing the list repeatedly and creates fewer intermediate strings, which can add up with large datasets.

2. Use `collections.defaultdict` for Cleaner Code

If you want to reduce boilerplate (without changing time complexity), defaultdict(int) lets you skip checking if a key exists before incrementing:

from collections import defaultdict
import os

def get_data_defaultdict(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError("Please provide a valid file path")
    
    inventory = defaultdict(int)
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = [p.strip() for p in line.split('&gt;')]
            current_path = None
            for part in parts:
                current_path = part if current_path is None else f"{current_path}&gt;{part}"
                inventory[current_path] += 1
    return inventory

This makes the code more readable while maintaining the same performance.

3. Trie (Prefix Tree) for Memory Optimization (Large Datasets)

If you’re dealing with an extremely large number of categories and memory is a bottleneck, a trie (prefix tree) can save space by sharing common prefixes (like "Electronics" for all its subcategories). Each node in the trie stores the count of paths passing through it:

import os

class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0

def get_data_trie(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError("Please provide a valid file path")
    
    root = TrieNode()
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = [p.strip() for p in line.split('&gt;')]
            current_node = root
            for part in parts:
                if part not in current_node.children:
                    current_node.children[part] = TrieNode()
                current_node = current_node.children[part]
                current_node.count += 1
    
    # Convert trie to the flat dictionary format you need
    inventory = {}
    def traverse(node, current_path):
        if current_path:
            inventory[current_path] = node.count
        for child_name, child_node in node.children.items():
            new_path = child_name if not current_path else f"{current_path}&gt;{child_name}"
            traverse(child_node, new_path)
    
    traverse(root, "")
    return inventory

The trie uses less memory when there are many shared prefixes, but adds code complexity. You only need this if memory is a critical constraint.

Final Takeaway

Your original solution is already excellent for most use cases. The incremental path building and defaultdict tweaks are easy wins for code clarity and minor performance gains. The trie is a specialized option for very large datasets where memory is a concern.

内容的提问来源于stack exchange，提问作者Akash Pagar