Low-memory diffing of large files (10 GB+): optimizing a PowerShell script, and a question about alternative C#/C++ APIs

阿华AIGC实验室

2026-4-29

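Let's tackle your two core problems: excessive memory use when PowerShell processes large files, and duplicated paths in the output. I'll also cover low-memory alternatives in C# and C++.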

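1. PowerShell script optimization (low memory + corrected output)

Two things in your original script drive the memory blow-up:

Accumulating with $newContent += $_. PowerShell arrays are fixed-size, so every += allocates a new array and copies all existing elements into it. Over millions of lines that means quadratic copying and heavy memory churn.
Capturing the output of Get-Content. Get-Content streams line by line when it feeds a pipeline, but once its output is assigned to a variable or wrapped in parentheses, every line is held in memory at once, which a large file will exhaust.

The duplicated paths appear because nothing removes duplicates, and the /tee switch sends robocopy's output to the console as well as the log file, which can leave repeated lines in what you capture.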

Optimized script

# Configuration
$source1 = "C:\Folder\"
$source2 = "C:\Folder2\"
$tempDir = "c:\temp\"

# Make sure the temp directory exists
if (-not (Test-Path $tempDir)) { New-Item -ItemType Directory -Path $tempDir | Out-Null }

# --------------------------
# First folder: build a clean list of relative paths
# --------------------------
$log1 = Join-Path $tempDir "FolderList.txt"
$cleanLog1 = Join-Path $tempDir "FolderList_Clean.txt"
$sortedLog1 = Join-Path $tempDir "FolderList_Sorted.txt"

# Generate the robocopy listing (no /tee, so output goes only to the log file)
robocopy.exe $source1 $source1 /l /nocopy /is /e /fp /ns /nc /njh /njs /log:$log1 | Out-Null

# Streaming pass: drop blank lines, turn absolute paths into relative ones, de-duplicate
$pattern = [regex]::Escape($source1)
Get-Content $log1 | 
    Where-Object { $_ -match $pattern -and $_.Trim() -ne '' } | 
    ForEach-Object { $_ -replace $pattern, '' } | 
    Select-Object -Unique | 
    Set-Content $cleanLog1

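# Sort so the two lists can later be compared in a single streaming pass.
# Sort-Content is not a PowerShell cmdlet; pipe Get-Content into Sort-Object instead.
# Note: Sort-Object buffers its entire input in memory before it emits anything.
Get-Content $cleanLog1 | Sort-Object | Set-Content $sortedLog1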

# --------------------------
# Second folder: same steps as above
# --------------------------
$log2 = Join-Path $tempDir "FolderList2.txt"
$cleanLog2 = Join-Path $tempDir "FolderList2_Clean.txt"
$sortedLog2 = Join-Path $tempDir "FolderList2_Sorted.txt"

robocopy.exe $source2 $source2 /l /nocopy /is /e /fp /ns /nc /njh /njs /log:$log2 | Out-Null

$pattern = [regex]::Escape($source2)
Get-Content $log2 | 
    Where-Object { $_ -match $pattern -and $_.Trim() -ne '' } | 
    ForEach-Object { $_ -replace $pattern, '' } | 
    Select-Object -Unique | 
    Set-Content $cleanLog2

Get-Content $cleanLog2 | Sort-Object | Set-Content $sortedLog2

# --------------------------
# Compare the two sorted lists with fc.exe
# --------------------------
Write-Host "=== File Comparison Results ==="
fc.exe $sortedLog1 $sortedLog2 /n

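Notes on the changes:

Streaming: the pipeline reads, filters, and writes one line at a time, so the raw robocopy log is never held in memory as a whole. The exceptions are de-duplication and sorting, which have to remember what they have already seen (Select-Object -Unique) or buffer all of their input (the sort step), so their memory use still grows with the number of distinct paths.
De-duplication: Select-Object -Unique drops repeated path lines, which fixes the duplicated output.
Compare-Object replaced with fc.exe: fc.exe is the file comparison tool that ships with Windows and reads the two files sequentially instead of loading them as object collections. Be aware that it compares by position and resynchronizes through a limited line buffer (100 lines by default, adjustable with /LBn), so when the two lists differ in long runs it can abort the comparison. The sorted merge shown in sections 2 and 3 does not have that limit.
No /tee: robocopy writes only to the log file, so console output is not duplicated into it.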
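
The streaming comparison assumes both path lists are sorted, and a list that does not fit in memory cannot be sorted in one go. One common way around that is an external merge sort: sort fixed-size batches, write each batch to a temporary chunk file, then merge the chunks. The following is a minimal C++ sketch of that idea, not part of the original script. The names externalSort, writeSortedChunk, and maxLinesPerChunk, as well as the ".chunkN" temp-file naming, are assumptions made for illustration. It sorts byte-wise (the order std::string comparison uses) and drops duplicates along the way.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sort one in-memory batch, drop duplicates, write it to its own chunk file,
// then clear the batch so the caller can reuse it.
static void writeSortedChunk(std::vector<std::string>& lines, const std::string& path)
{
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());
    std::ofstream out(path);
    for (const auto& l : lines) out << l << '\n';
    lines.clear();
}

// Hypothetical helper: sorts inputPath into outputPath (byte-wise order,
// duplicates removed) while buffering at most maxLinesPerChunk lines.
void externalSort(const std::string& inputPath, const std::string& outputPath,
                  std::size_t maxLinesPerChunk)
{
    // Phase 1: split the input into sorted chunk files.
    std::vector<std::string> chunkPaths;
    {
        std::ifstream in(inputPath);
        std::vector<std::string> batch;
        std::string line;
        while (std::getline(in, line))
        {
            batch.push_back(line);
            if (batch.size() >= maxLinesPerChunk)
            {
                chunkPaths.push_back(outputPath + ".chunk" + std::to_string(chunkPaths.size()));
                writeSortedChunk(batch, chunkPaths.back());
            }
        }
        if (!batch.empty())
        {
            chunkPaths.push_back(outputPath + ".chunk" + std::to_string(chunkPaths.size()));
            writeSortedChunk(batch, chunkPaths.back());
        }
    }

    // Phase 2: k-way merge using a min-heap of (current line, chunk index).
    std::vector<std::ifstream> readers;
    readers.reserve(chunkPaths.size());
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < chunkPaths.size(); ++i)
    {
        readers.emplace_back(chunkPaths[i]);
        std::string first;
        if (std::getline(readers[i], first)) heap.emplace(first, i);
    }

    std::ofstream out(outputPath);
    std::string last;
    bool haveLast = false;
    while (!heap.empty())
    {
        Entry top = heap.top();
        heap.pop();
        if (!haveLast || top.first != last) // skip duplicates across chunks
        {
            out << top.first << '\n';
            last = top.first;
            haveLast = true;
        }
        std::string next;
        if (std::getline(readers[top.second], next)) heap.emplace(next, top.second);
    }

    // Close the chunk files before deleting them.
    readers.clear();
    for (const auto& p : chunkPaths) std::remove(p.c_str());
}
```

A call such as externalSort("c:\\temp\\FolderList_Clean.txt", "c:\\temp\\FolderList_Sorted.txt", 1000000) would keep roughly one million lines in memory at a time. The sketch opens every chunk file at once during the merge, so a very small chunk size on a very large input could run into the operating system's limit on open files.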

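2. Low-memory file comparison in C#

File.ReadLines in .NET returns an IEnumerable<string> that reads the file lazily, one line at a time, rather than loading it all at once. There are two ways to build a low-memory comparison on top of streaming reads:

Method 1: line-by-line comparison of sorted files (very low memory)

Suited to very large files, since only the current line from each file is held in memory. Both files must already be sorted in the same order the code compares with (ordinal here); a culture-aware, case-insensitive sort can order lines differently: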

using System;
using System.IO;

class LowMemoryFileComparer
{
    static void Main(string[] args)
    {
        string sortedFile1 = @"c:\temp\FolderList_Sorted.txt";
        string sortedFile2 = @"c:\temp\FolderList2_Sorted.txt";

        Console.WriteLine("=== Symmetric Difference ===");
        using (var reader1 = File.OpenText(sortedFile1))
        using (var reader2 = File.OpenText(sortedFile2))
        {
            string line1 = reader1.ReadLine();
            string line2 = reader2.ReadLine();

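            while (line1 != null || line2 != null)
            {
                // An exhausted file counts as "greater" so that the other file's
                // remaining lines are drained. (Substituting "" for null would
                // re-read the exhausted file forever.)
                int compare;
                if (line1 == null) compare = 1;
                else if (line2 == null) compare = -1;
                else compare = string.CompareOrdinal(line1, line2);

                if (compare < 0)
                {
                    Console.WriteLine($"< {line1}"); // present only in file1
                    line1 = reader1.ReadLine();
                }
                else if (compare > 0)
                {
                    Console.WriteLine($"> {line2}"); // present only in file2
                    line2 = reader2.ReadLine();
                }
                else
                {
                    // Same line in both files: skip it
                    line1 = reader1.ReadLine();
                    line2 = reader2.ReadLine();
                }
            }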
        }
    }
}

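Method 2: HashSet lookups (suited to medium-sized files)

Use this only if all the lines of one file fit in memory. A 10 GB text file can contain hundreds of millions of lines, which may well not fit; in that case prefer Method 1: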

using System;
using System.Collections.Generic;
using System.IO;

class HashSetFileComparer
{
    static void Main(string[] args)
    {
        string file1 = @"c:\temp\FolderList_Clean.txt";
        string file2 = @"c:\temp\FolderList2_Clean.txt";

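        // Load file1 into a HashSet (ReadLines streams, but the set holds every distinct line)
        var file1Lines = new HashSet<string>(File.ReadLines(file1));
        // Lines found in both files; used below to derive "only in File1"
        var common = new HashSet<string>();

        Console.WriteLine("=== Lines only in File2 ===");
        foreach (var line in File.ReadLines(file2))
        {
            if (file1Lines.Contains(line))
            {
                common.Add(line);
            }
            else
            {
                Console.WriteLine(line);
            }
        }

        Console.WriteLine("\n=== Lines only in File1 ===");
        // Second pass over file1 with an O(1) set lookup per line,
        // instead of rescanning file2 for every line of file1.
        foreach (var line in File.ReadLines(file1))
        {
            if (!common.Contains(line))
            {
                Console.WriteLine(line);
            }
        }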
    }
}

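3. Low-memory file comparison in C++

In C++, read the files line by line with std::ifstream and apply the same merge-style comparison of sorted files, which keeps memory use very low:

Streaming comparison of sorted files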

#include <iostream>
#include <fstream>
#include <string>

void CompareSortedFiles(const std::string& file1Path, const std::string& file2Path)
{
    std::ifstream file1(file1Path);
    std::ifstream file2(file2Path);
    std::string line1, line2;

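    // std::getline returns the stream, whose conversion to bool is explicit,
    // so wrap it in a helper that yields a plain bool.
    auto readLine = [](std::ifstream& in, std::string& out) {
        return static_cast<bool>(std::getline(in, out));
    };

    bool hasLine1 = readLine(file1, line1);
    bool hasLine2 = readLine(file2, line2);

    std::cout << "=== Symmetric Difference ===" << std::endl;
    while (hasLine1 || hasLine2)
    {
        if (!hasLine1)
        {
            // file1 is exhausted: everything left exists only in file2
            std::cout << "> " << line2 << std::endl;
            hasLine2 = readLine(file2, line2);
            continue;
        }
        if (!hasLine2)
        {
            // file2 is exhausted: everything left exists only in file1
            std::cout << "< " << line1 << std::endl;
            hasLine1 = readLine(file1, line1);
            continue;
        }

        // Byte-wise comparison; both files must be sorted in this same order
        int cmpResult = line1.compare(line2);
        if (cmpResult < 0)
        {
            std::cout << "< " << line1 << std::endl;
            hasLine1 = readLine(file1, line1);
        }
        else if (cmpResult > 0)
        {
            std::cout << "> " << line2 << std::endl;
            hasLine2 = readLine(file2, line2);
        }
        else
        {
            // Same line in both files: advance both
            hasLine1 = readLine(file1, line1);
            hasLine2 = readLine(file2, line2);
        }
    }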
}

int main()
{
    std::string sortedFile1 = "c:\\temp\\FolderList_Sorted.txt";
    std::string sortedFile2 = "c:\\temp\\FolderList2_Sorted.txt";
    CompareSortedFiles(sortedFile1, sortedFile2);
    return 0;
}

Fast lookups with unordered_set (suited to medium-sized files)

#include <iostream>
#include <fstream>
#include <string>
#include <unordered_set>

void CompareWithUnorderedSet(const std::string& file1Path, const std::string& file2Path)
{
    std::unordered_set<std::string> file1Lines;
    std::ifstream file1(file1Path);
    std::string line;

    // Stream file1 into the unordered_set (the set holds every distinct line)
    while (std::getline(file1, line))
    {
        if (!line.empty())
        {
            file1Lines.insert(line);
        }
    }

    std::cout << "=== Lines only in File2 ===" << std::endl;
    std::ifstream file2(file2Path);
    while (std::getline(file2, line))
    {
        if (!line.empty() && file1Lines.find(line) == file1Lines.end())
        {
            std::cout << line << std::endl;
        }
    }
}

int main()
{
    std::string file1 = "c:\\temp\\FolderList_Clean.txt";
    std::string file2 = "c:\\temp\\FolderList2_Clean.txt";
    CompareWithUnorderedSet(file1, file2);
    return 0;
}

The question comes from Stack Exchange; it was asked by Arbelac.
