Removing Consecutive Duplicate Text Blocks with Filtering Rules: A Python Implementation Question

阿华AIGC实验室

2026-4-13

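I want to write a Python script that removes consecutive repeated blocks of text from a text file. The intended behavior is:

Find blocks of text with identical content (a block is at most X lines; the check starts at X lines and steps down to 1 line).
If a block of X lines is repeated consecutively, remove the repeats and replace them with the notice ...has an additional N similar entries... (N being the number of removed repeats), placed right after the first occurrence of the block.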
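For the simplest case (X = 1), the intended output can be illustrated with the standard library's `itertools.groupby`, which groups consecutive equal items. This is only a sketch of the target behavior for single-line blocks; the function name `collapse_single_line_runs` and the sample data are hypothetical, and it does not handle multi-line blocks or any cleaning rules.

```python
from itertools import groupby

def collapse_single_line_runs(lines):
    """Collapse runs of identical consecutive lines (the X = 1 case only)."""
    out = []
    for line, run in groupby(lines):
        extra = sum(1 for _ in run) - 1  # repeats after the first occurrence
        out.append(line)
        if extra:
            out.append(f"...has an additional {extra} similar entries...")
    return out

sample = ["alpha", "alpha", "alpha", "beta", "alpha"]
result = collapse_single_line_runs(sample)
print(result)
```

Note that the final "alpha" is kept, because it is not adjacent to the earlier run; only consecutive repeats are collapsed.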

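Preprocessing rules for block comparison

Before two blocks are compared, each line is cleaned as follows so that the comparison is reliable:

Replace runs of two or more spaces with a single space
Remove all numbers and certain number-bearing formats: dates (e.g., XX/XX/XX), numbers with commas or decimals (e.g., 3,444.22), ordinals (e.g., 1.), and "words" that start and end with a digit
Skip lines that are empty after strip() (blank lines are ignored during comparison)

When processing finishes, a new TXT file is produced with all repeated blocks removed.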

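Code written so far

I have already written the line-cleaning function used for comparison, plus helper functions that fetch a block and compute its hash. However, I got tangled up implementing the overall detection and de-duplication logic, and I am not sure how to do it efficiently and correctly.

Line-cleaning function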

import re

def clean_for_comparison(line):
    # Remove dollar amounts (e.g., $1,234.56)
    line = re.sub(r'\$\d[\d,]*(?:\.\d+)?', '', line)
    # Remove date-like patterns (e.g., 17/01/2020 or 17-01-2020) before plain
    # numbers; otherwise the digits are gone by the time this pattern runs
    line = re.sub(r'\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b', '', line)
    # Remove "words" starting and ending with digits (e.g., 123abc456)
    line = re.sub(r'\b\d+[a-zA-Z]+\d+\b', '', line)
    # Remove numbers with commas/decimals and ordinals (e.g., 1,234.56 or "1.");
    # requiring a leading digit keeps ordinary commas in the text
    line = re.sub(r'\d[\d,]*(?:\.\d+)?\.?', '', line)
    # Normalize whitespace and strip
    line = re.sub(r'\s+', ' ', line)
    return line.strip()
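One detail worth checking in a cleaning function like this is the order of the substitutions: if plain numbers are removed before dates, the date pattern has no digits left to match and the separators are left behind. The sketch below illustrates this with two simplified patterns; `NUMBER`, `DATE`, and `apply_in_order` are hypothetical names used only for this illustration.

```python
import re

# Simplified patterns, assumed for illustration only
NUMBER = r"\d[\d,]*(?:\.\d+)?"
DATE = r"\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b"

def apply_in_order(text, patterns):
    """Apply each removal pattern in turn, then collapse whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", text).strip()

line = "Paid 17/01/2020 total 3,444.22"
numbers_first = apply_in_order(line, [NUMBER, DATE])
dates_first = apply_in_order(line, [DATE, NUMBER])

print(repr(numbers_first))  # 'Paid // total'
print(repr(dates_first))    # 'Paid total'
```

With numbers removed first, two lines that differ only in their date would still compare equal here, but the leftover `//` would make them differ from a line that had no date at all.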

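Helper functions (fetching a block, computing the hash)

import hashlib

def get_block_hash(block):
    # Clean every line, drop lines that end up empty, then hash the result
    cleaned_lines = [clean_for_comparison(line) for line in block]
    non_blank_cleaned_lines = [line for line in cleaned_lines if line]
    # Join with a newline so line boundaries stay part of the hashed content
    cleaned_block_content = '\n'.join(non_blank_cleaned_lines)
    return hashlib.md5(cleaned_block_content.encode('utf-8')).hexdigest()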

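def get_block(lines, start_index, max_size):
    # Collect up to max_size non-blank lines starting at start_index.
    # Returns the block and the index of the first line after it.
    block = []
    i = start_index
    row_count = 0

    while i < len(lines) and row_count < max_size:
        line = lines[i].strip()
        if line:
            block.append(lines[i])
            row_count += 1
        i += 1

        if i >= len(lines):
            break
        # Two blank lines in a row end the block early
        if lines[i - 1].strip() == "" and lines[i].strip() == "":
            break

    return block, i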
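When cleaned lines are concatenated before hashing, the choice of separator matters: with an empty separator, two blocks whose lines split differently can produce the same string and therefore the same hash. The sketch below demonstrates this with a hypothetical helper `block_digest`; it assumes the lines have already been cleaned.

```python
import hashlib

def block_digest(cleaned_lines, sep):
    """Hash already-cleaned lines joined with `sep` (hypothetical helper)."""
    joined = sep.join(cleaned_lines)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

block_a = ["ab", "c"]
block_b = ["a", "bc"]

# Without a separator both blocks collapse to "abc" and look identical
same_without_sep = block_digest(block_a, "") == block_digest(block_b, "")
# A newline separator keeps the line boundaries in the hashed content
same_with_newline = block_digest(block_a, "\n") == block_digest(block_b, "\n")

print(same_without_sep, same_with_newline)  # True False
```

Comparing tuples of cleaned lines directly would avoid hashing altogether; a hash is mainly useful if blocks need to be stored or looked up compactly.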

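Main logic still to be completed

I tried writing the main processing function, but its logic is flawed and it does not run correctly yet: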

def process_file(file_path, max_consecutive_rows_to_check=4):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    filtered_lines = []
    i = 0
    removed_count = 0

    while i < len(lines):
        block, j = get_block(lines, i, max_consecutive_rows_to_check)
        if not block:  # Skip processing.
            i = j  # Increment
            continue  # Skip empty block

        # Sliding window comparison
        for size in range(min(len(block), max_consecutive_rows_to_check), 0, -1):
            block_to_compare = block[:size]
            position_row = initial_block_start

            while True:
                compare_start = position_row + 1  # use the next row outside the initial block to compare?
                compare_block, next_k = get_block(lines, compare_start, size)
                compare_line = compare_block

                if len(compare_block) != size:
                    break  # Done with compare - no more
                if get_block_hash(block_to_compare) == get_block_hash(compare_block):
                    print("DUPE FOUND!")
                    found_dupes = True
                    total_dupes += 1 # dupe counter
                    position_row = compare_start #set next dupe position to compare consecutively with
                    i += compare_start
                    #continue #continue if it finds?
                else: #Did not find anything
                    print("Not a dupe.")
                    break

        # If duplicates were found:
        if found_dupes:
            print(f"Total duplicates found: {total_dupes}")
            filtered_lines.extend(block)
            filtered_lines.append(f"....<with an additional {total_dupes} entries found>...\n")
            removed_count += total_dupes

            i = position_row+max_consecutive_rows_to_check #+1 + size?

        # If no duplicates were found:
        else:
            print("No duplicates found, adding to filtered lines.")
            filtered_lines.extend(block)

        # Advance to next block
        i =+max_consecutive_rows_to_check

    # Output to new file
    output_filename = file_path
    if output_filename.endswith(".txt"):
        output_filename = output_filename[:-4]
    if not output_filename.endswith("-truncated"):
        output_filename = f"{output_filename}-truncated.txt"

    with open(output_filename, "w") as output_file:
        output_file.writelines(filtered_lines)

    print(f"Processed {len(lines)} lines from {file_path}")
    print(f"Output saved to {output_filename}")
    if removed_count > 0:
        print(f"Removed {removed_count} duplicate blocks.")

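My question

I am now stuck on the overall control flow and do not know how to detect and remove consecutive repeated blocks efficiently and correctly. Also, is there an existing Python library that does something similar, i.e. removes consecutive duplicate content from a file on a per-block basis?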
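One possible way to structure the control flow is sketched below. It is a minimal illustration under stated assumptions, not a definitive implementation: `normalize` is a simplified stand-in for the full cleaning rules, and `dedupe_consecutive_blocks`, `NOTICE`, and the sample data are hypothetical names. The idea is to compare only the lines that are non-empty after cleaning, try block sizes from the maximum down to 1 at each position, and advance past the whole run when a repeat is found.

```python
import re

NOTICE = "...has an additional {} similar entries...\n"

def normalize(line):
    """Simplified stand-in for the cleaning step (assumption, not the full rule set)."""
    line = re.sub(r"\d[\d,./-]*", "", line)   # drop numbers, dates, ordinals
    return re.sub(r"\s+", " ", line).strip()  # collapse whitespace

def dedupe_consecutive_blocks(lines, max_block=4, key=normalize):
    keys = [key(ln) for ln in lines]
    idx = [p for p, k in enumerate(keys) if k]  # positions of comparable lines
    ckeys = [keys[p] for p in idx]
    n = len(idx)

    def start_of(c):
        # Original position of comparable line c, or end of file
        return idx[c] if c < n else len(lines)

    out = list(lines[:start_of(0)])  # leading lines that are ignored in comparison
    i = 0
    while i < n:
        # Largest block first; a block needs room for at least one repeat
        for size in range(min(max_block, (n - i) // 2), 0, -1):
            first = ckeys[i:i + size]
            j = i + size
            while ckeys[j:j + size] == first:
                j += size
            repeats = (j - i) // size - 1
            if repeats:
                out.extend(lines[idx[i]:idx[i + size - 1] + 1])  # first occurrence
                out.append(NOTICE.format(repeats))
                out.extend(lines[idx[j - 1] + 1:start_of(j)])    # blanks after the run
                i = j
                break
        else:
            # No repeat of any size starts here: keep the line and what follows it
            out.extend(lines[idx[i]:start_of(i + 1)])
            i += 1
    return out

sample = [
    "2024-01-01 job started\n", "step 1 ok\n",
    "2024-01-02 job started\n", "step 2 ok\n",
    "\n",
    "2024-01-03 job started\n", "step 3 ok\n",
    "done\n", "done\n",
]
result = dedupe_consecutive_blocks(sample)
print("".join(result), end="")
```

Because the largest block size is tried first, a run such as four identical lines with `max_block=4` is reported as one two-line block plus one repeat rather than one line plus three repeats; trying sizes from 1 upward would give the other result. In this sketch, blank lines between a block and its removed repeats are dropped along with the repeats.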

Note: content sourced from Stack Exchange; question asked by jkeys
