如何在Python中并行处理大文件，同时保证顺序并优化内存占用？

阿华AIGC实验室

2026-4-14

嘿，我太懂你的痛点了——既要用并行加速CPU密集型的文件处理，又不能把内存撑爆，还得严格保证输出和输入的行顺序完全一致，确实有点棘手！咱们先拆解一下你现有代码的问题，再一步步给出靠谱的解决方案。

先说说你当前代码的核心问题

内存爆炸：infile.readlines()会把整个文件的所有行一次性读进内存，几GB的文件直接就把内存占满了，这肯定不行。
ThreadPool不适合CPU密集任务：Python的GIL（全局解释器锁）会让线程在CPU密集型任务里无法真正利用多核，线程切换反而会增加开销，效率上不去。
executor.map的效率瓶颈：虽然它能保证结果顺序，但本质是“串行等待”——哪怕后面的任务先跑完，也得等前面的任务都完成才能输出，相当于把并行的结果又强行串行化了，浪费了多核的优势。

先搞懂：ThreadPool还是ProcessPool？

这得看你的任务类型：

如果是IO密集型（比如每行要发网络请求、读其他小文件）：用ThreadPoolExecutor，因为线程的创建/销毁开销比进程小，而且IO等待时线程可以释放GIL，其他线程能干活。
如果是CPU密集型（比如你模拟的计算任务，实际可能是数据解析、加密、复杂运算）：必须用ProcessPoolExecutor！它会创建独立的进程，绕开GIL，真正利用多个CPU核心，这才是并行加速的关键。

靠谱的解决方案：流式读取+带索引并行+结果缓冲区

核心思路是：

流式读文件：用生成器逐行/分块读取，绝不把整个文件塞进内存。
给每个单元（行/块）加索引：保证处理后能按原顺序拼接。
用as_completed异步获取结果：完成一个就存一个到缓冲区，然后按索引顺序写入文件——既不用等所有任务完成，又能严格保证输出顺序，还能高效利用多核。

方案1：逐行处理（适合对内存要求极高的场景）

这种方式每次只处理一行，内存占用几乎可以忽略：

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_line_with_index(args):
    """带索引的行处理函数，替换成你的实际CPU密集逻辑"""
    idx, line = args
    # 这里是你的实际处理：比如数据清洗、计算、格式转换
    processed_line = line.upper()  # 示例：转大写，替换成你的逻辑
    return (idx, processed_line)

def line_generator(file_path):
    """流式生成带索引的行，不加载全文件"""
    with open(file_path, 'r', encoding='utf-8') as f:
        for idx, line in enumerate(f):
            yield (idx, line)

def process_large_file(input_path, output_path, max_workers=None):
    result_buffer = {}  # 存已完成但还没到写入时机的结果
    current_idx = 0  # 当前需要写入的下一个行的索引

    with open(output_path, 'w', encoding='utf-8') as outfile:
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            # 提交所有任务：用生成器按需提交，不一次性加载所有行
            futures = {
                executor.submit(process_line_with_index, item): item 
                for item in line_generator(input_path)
            }

            # 逐个获取完成的任务结果
            for future in as_completed(futures):
                idx, processed_line = future.result()
                result_buffer[idx] = processed_line

                # 把当前能连续写入的结果都写进文件
                while current_idx in result_buffer:
                    outfile.write(result_buffer.pop(current_idx))
                    current_idx += 1

if __name__ == '__main__':
    # Windows下ProcessPool需要把执行代码放在__main__里，避免重复导入
    process_large_file("large_file.txt", "output.txt", max_workers=4)

方案2：分块处理（更高效，推荐）

逐行提交任务的开销有点大，我们可以把多行打包成一个“块”，减少任务提交的次数，同时内存占用依然可控：

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_chunk(args):
    """带块索引的分块处理函数，块内保持行顺序"""
    chunk_idx, lines = args
    # 块内的每行处理，保持原顺序
    processed_lines = [line.upper() for line in lines]  # 替换成你的实际逻辑
    return (chunk_idx, processed_lines)

def chunk_generator(file_path, chunk_size=1000):
    """流式生成带索引的块，每个块含chunk_size行"""
    with open(file_path, 'r', encoding='utf-8') as f:
        chunk = []
        chunk_idx = 0
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield (chunk_idx, chunk)
                chunk = []
                chunk_idx += 1
        # 处理最后一个不足chunk_size的块
        if chunk:
            yield (chunk_idx, chunk)

def process_large_file_chunked(input_path, output_path, chunk_size=1000, max_workers=None):
    result_buffer = {}
    current_chunk_idx = 0

    with open(output_path, 'w', encoding='utf-8') as outfile:
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            futures = {
                executor.submit(process_chunk, item): item 
                for item in chunk_generator(input_path, chunk_size)
            }

            for future in as_completed(futures):
                chunk_idx, processed_chunk = future.result()
                result_buffer[chunk_idx] = processed_chunk

                # 按块顺序连续写入
                while current_chunk_idx in result_buffer:
                    outfile.writelines(result_buffer.pop(current_chunk_idx))
                    current_chunk_idx += 1

if __name__ == '__main__':
    # 可以根据CPU核心数调整chunk_size和max_workers
    process_large_file_chunked("large_file.txt", "output.txt", chunk_size=2000, max_workers=4)

关键优化点和最佳实践

流式读取的核心：用生成器（yield）读取文件，每次只加载当前的行或块，内存占用始终保持在很低的水平，几GB的文件也能轻松处理。
结果缓冲区的妙用：不用等所有任务完成再写入，而是完成一个存一个，只要当前需要的索引/块到了就写入，既保证顺序，又不会让内存里堆太多结果。
chunk_size的调整：太小会增加任务提交的开销，太大会增加单块的内存占用。一般推荐1000-10000行一个块，根据你的内存和CPU核心数调整。
文件写入效率：用writelines替代逐行write，批量写入比逐行快得多。
ProcessPool的注意事项：Windows下必须把执行代码放在if __name__ == '__main__':下面，因为Windows的进程创建方式是重新导入主模块，避免重复执行代码。