Question: Python implementation for removing repeated text blocks with filtering rules


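I want to write a Python script that removes consecutive repeated blocks of text from a text file. The intended behavior is:

  • Locate blocks of text with identical content (a block is at most X lines; block sizes are checked from X down to 1 line).
  • If a block is repeated consecutively, remove the repeats and put the notice ...has an additional X similar entries... after the first occurrence of the block (in the notice, X is the number of repeats removed).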
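
The simplest case of the behavior described above, a block size of one line with no cleaning, can be sketched with itertools.groupby from the standard library. This is only an illustration of the target output; the function name collapse_single_lines is hypothetical and not part of the original script.

```python
from itertools import groupby

def collapse_single_lines(lines):
    """Collapse runs of identical consecutive lines (block size 1 only)."""
    out = []
    for _, group in groupby(lines):
        group = list(group)
        out.append(group[0])  # keep the first occurrence
        if len(group) > 1:
            out.append(f"...has an additional {len(group) - 1} similar entries...")
    return out

print(collapse_single_lines(["a", "a", "a", "b", "a"]))
# ['a', '...has an additional 2 similar entries...', 'b', 'a']
```

Only consecutive repeats are collapsed, which is why the final "a" is kept.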

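Preprocessing rules for block comparison

Before blocks are compared for equality, each line is cleaned as follows so that the comparison is accurate:

  • Replace runs of two or more spaces with a single space
  • Remove all numbers and certain number-bearing formats: dates (e.g. XX/XX/XX), numbers with commas or decimals (e.g. 3,444.22), ordinals (e.g. 1.), and "words" that start and end with digits
  • Skip lines that are empty after strip() (blank lines are ignored in the comparison)

When processing finishes, a new TXT file is written with all repeated blocks removed.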

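Code written so far

I have written the line-cleaning function used for comparison, plus helper functions that fetch a block and compute its hash, but I got confused implementing the overall detection and removal logic and am not sure how to do it efficiently and correctly.

Line-cleaning function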

import re

def clean_for_comparison(line):
    # Remove dollar amounts (e.g., $1,234.56)
    line = re.sub(r'\$[\d,]+(?:\.\d+)?', '', line)
    # Remove words starting and ending with digits (e.g., 123abc456).
    # This and the date pattern run before the generic number pattern,
    # which would otherwise strip the digits they need to match.
    line = re.sub(r'\b\d+[a-zA-Z]+\d+\b', '', line)
    # Remove date-like patterns (e.g., 17/01/2020 or 17-01-2020)
    line = re.sub(r'\d{1,2}[-/.\s]?\d{1,2}[-/.\s]?\d{2,4}', '', line)
    # Remove remaining numbers with commas and decimals (e.g., 1,234.56)
    line = re.sub(r'[\d,]+(?:\.\d+)?', '', line)
    # Normalize whitespace and strip
    line = re.sub(r'\s+', ' ', line)
    line = line.strip()
    return line
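
A small check of how the order of the substitutions affects the cleaned result. The two patterns below are simplified stand-ins, not the exact ones from the function above: DATE here requires separators, which is an assumption made to keep the example short, and the function names are hypothetical.

```python
import re

NUMBER = re.compile(r'[\d,]+(?:\.\d+)?')               # numbers such as 3,444.22
DATE = re.compile(r'\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}')  # dates such as 17/01/2020

def numbers_first(line):
    # The number pattern consumes the digits, so the date pattern finds nothing
    line = NUMBER.sub('', line)
    line = DATE.sub('', line)
    return re.sub(r'\s+', ' ', line).strip()

def dates_first(line):
    # The date is removed as a whole before the remaining numbers are stripped
    line = DATE.sub('', line)
    line = NUMBER.sub('', line)
    return re.sub(r'\s+', ' ', line).strip()

sample = "Paid on 17/01/2020 total 3,444.22"
print(numbers_first(sample))  # Paid on // total
print(dates_first(sample))    # Paid on total
```

With the number pattern first, the date separators are left behind, so two lines that differ only in whether they carry a date would no longer compare equal.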

Helper functions (fetching a block, computing its hash)

import hashlib

def get_block_hash(block):
    cleaned_lines = [clean_for_comparison(line) for line in block]
    non_blank_cleaned_lines = [line for line in cleaned_lines if line]
    # Join with a newline so that line boundaries are part of the hashed content
    cleaned_block_content = '\n'.join(non_blank_cleaned_lines)
    return hashlib.md5(cleaned_block_content.encode()).hexdigest()

def get_block(lines, start_index, max_size):
    block = []
    i = start_index
    row_count = 0

    while i < len(lines) and row_count < max_size:
        line = lines[i].strip()
        if line:
            block.append(lines[i])
            row_count += 1
        i += 1

        if i >= len(lines):
            break
        if lines[i - 1].strip() == "" and lines[i].strip() == "":
            break

    return block, i
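
A minimal sketch of why the separator used when concatenating cleaned lines before hashing matters. The helper name hash_lines is hypothetical and exists only for this illustration.

```python
import hashlib

def hash_lines(lines, sep):
    """MD5 of the lines joined with the given separator."""
    return hashlib.md5(sep.join(lines).encode()).hexdigest()

a = ["ab", "c"]
b = ["a", "bc"]

# With an empty separator both blocks become "abc" and hash identically
print(hash_lines(a, "") == hash_lines(b, ""))      # True
# With a newline separator the line boundaries are preserved
print(hash_lines(a, "\n") == hash_lines(b, "\n"))  # False
```

Comparing the lists of cleaned lines directly would also avoid the issue, since no hashing is needed for an equality check on small blocks.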

Main logic, still incomplete

I tried to write the main processing function, but the logic is wrong and it does not run correctly yet:

def process_file(file_path, max_consecutive_rows_to_check=4):
    with open(file_path, 'r') as file:
        lines = file.readlines()

    filtered_lines = []
    i = 0
    removed_count = 0

    while i < len(lines):
        block, j = get_block(lines, i, max_consecutive_rows_to_check)
        if not block:  # Skip processing.
            i = j  # Increment
            continue  # Skip empty block

        # Sliding window comparison
        for size in range(min(len(block), max_consecutive_rows_to_check), 0, -1):
            block_to_compare = block[:size]
            position_row = initial_block_start

            while True:
                compare_start = position_row + 1  # use the next row outside the initial block to compare?
                compare_block, next_k = get_block(lines, compare_start, size)
                compare_line = compare_block

                if len(compare_block) != size:
                    break  # Done with compare - no more
                if get_block_hash(block_to_compare) == get_block_hash(compare_block):
                    print("DUPE FOUND!")
                    found_dupes = True
                    total_dupes += 1 # dupe counter
                    position_row = compare_start #set next dupe position to compare consecutively with
                    i += compare_start
                    #continue #continue if it finds?
                else: #Did not find anything
                    print("Not a dupe.")
                    break

        # If duplicates were found:
        if found_dupes:
            print(f"Total duplicates found: {total_dupes}")
            filtered_lines.extend(block)
            filtered_lines.append(f"....<with an additional {total_dupes} entries found>...\n")
            removed_count += total_dupes

            i = position_row+max_consecutive_rows_to_check #+1 + size?

        # If no duplicates were found:
        else:
            print("No duplicates found, adding to filtered lines.")
            filtered_lines.extend(block)

        # Advance to next block
        i =+max_consecutive_rows_to_check

    # Output to new file
    output_filename = file_path
    if output_filename.endswith(".txt"):
        output_filename = output_filename[:-4]
    if not output_filename.endswith("-truncated"):
        output_filename = f"{output_filename}-truncated.txt"

    with open(output_filename, "w") as output_file:
        output_file.writelines(filtered_lines)

    print(f"Processed {len(lines)} lines from {file_path}")
    print(f"Output saved to {output_filename}")
    if removed_count > 0:
        print(f"Removed {removed_count} duplicate blocks.")

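My question

I am now stuck on the overall flow of the logic and do not know how to detect and remove consecutive repeated blocks efficiently and correctly. Also, is there an existing Python library that does something similar, that is, removes consecutive repeated content from a file on a per-block basis?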
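
One possible shape for the overall flow, offered as a sketch under stated assumptions rather than a definitive implementation. It works on a list of lines instead of a file, compares lists of cleaned keys directly instead of hashing, and drops blank lines that fall inside removed repeats. The names collapse_repeats and simple_key are hypothetical, and simple_key is a simplified stand-in for the cleaning function.

```python
import re

def simple_key(line):
    """Stand-in for the cleaning step: drop digits, collapse whitespace."""
    return re.sub(r'\s+', ' ', re.sub(r'\d+', '', line)).strip()

def collapse_repeats(lines, max_size=4, key=simple_key):
    # Index non-blank lines, remembering the blank lines that precede each one
    items = []           # (leading_blank_lines, original_line, comparison_key)
    pending_blanks = []
    for line in lines:
        if line.strip():
            items.append((pending_blanks, line, key(line)))
            pending_blanks = []
        else:
            pending_blanks.append(line)
    keys = [k for _, _, k in items]

    out = []
    pos = 0
    while pos < len(items):
        repeats, size = 0, 1
        # Try the largest block size first, as the question describes
        for candidate in range(min(max_size, (len(items) - pos) // 2), 0, -1):
            block = keys[pos:pos + candidate]
            count = 0
            nxt = pos + candidate
            while keys[nxt:nxt + candidate] == block:
                count += 1
                nxt += candidate
            if count:
                repeats, size = count, candidate
                break
        # Keep the first occurrence, together with its leading blank lines
        for blanks, line, _ in items[pos:pos + size]:
            out.extend(blanks)
            out.append(line)
        if repeats:
            out.append(f"...has an additional {repeats} similar entries...\n")
        pos += size * (repeats + 1)  # skip the kept block and all its repeats
    out.extend(pending_blanks)       # blank lines at the end of the input
    return out

sample = ["Name: A\n", "Total 10\n", "Name: A\n", "Total 25\n", "End\n"]
print("".join(collapse_repeats(sample)), end="")
# Name: A
# Total 10
# ...has an additional 1 similar entries...
# End
```

Because sizes are tried from largest to smallest, four identical lines with max_size=4 collapse as one two-line block plus one repeat rather than one line plus three repeats. If the single-line grouping is preferred in that situation, the candidate order would need to change.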

Note: content sourced from Stack Exchange; question author: jkeys
