Removing consecutive duplicate text blocks with filter rules: how to implement this in Python?
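I want to write a Python script that removes consecutive repeated blocks of text from a text file. The end result should work like this:
- Find blocks of text with identical content (a block is at most X lines; the check starts at X lines and steps down to 1 line)
- If a block of X consecutive lines is repeated, remove the repeats and put the notice ...has an additional X similar entries... after the first occurrence of the block.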
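Preprocessing rules for comparing text blocks
Before two blocks are compared, each line is cleaned as follows so that the comparison is reliable:
- Replace runs of 2 or more spaces with a single space
- Remove all numbers and specific number-bearing formats: dates (e.g. XX/XX/XX), numbers with commas or decimals (e.g. 3,444.22), ordinals (e.g. 1.), and "words" that start and end with a digit
- Skip lines that are empty after strip() (blank lines are ignored during comparison)
When processing finishes, a new TXT file is written with all repeated text blocks removed.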
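Code I have so far
I have already written the line-cleaning function used for comparison, plus helper functions that fetch a block and compute a block hash. Where I get lost is the overall detection and de-duplication logic: I don't know how to do it both efficiently and correctly.
Line-cleaning function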
    import re

    def clean_for_comparison(line):
        # Remove dollar amounts (e.g., $1,234.56) - additional example of cleaning row
        line = re.sub(r'\$[\d,]+(?:\.\d+)?', '', line)
        # Remove numbers with commas and decimals (e.g., 1,234.56)
        line = re.sub(r'[\d,]+(?:\.\d+)?', '', line)
        # Remove date-like patterns (e.g., 17/01/2020 or 17-01-2020)
        line = re.sub(r'\d{1,2}[-/.\s]?\d{1,2}[-/.\s]?\d{2,4}', '', line)
        # Remove words starting and ending with digits (e.g., 123abc456)
        line = re.sub(r'\b\d+[a-zA-Z]+\d+\b', '', line)
        # Normalize spaces and strip
        line = re.sub(r'\s+', ' ', line)
        line = line.strip()
        return line
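A quick self-check can confirm that lines differing only in dates, amounts, or digit-bounded words clean to the same text. The sketch below uses a condensed, hypothetical stand-in named `clean` (not the function above) and assumes dates are written with `-`, `/` or `.` separators. The order of substitutions matters: if a generic number pattern runs first, the date and digit-bounded-word patterns have nothing left to match.

```python
import re

def clean(line):
    """Hypothetical stand-in cleaner: dates, then digit-bounded words, then numbers."""
    # Dates such as 17/01/2020 or 03-02-2021 (separator is assumed to be required)
    line = re.sub(r'\b\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}\b', '', line)
    # "Words" that start and end with digits, such as 123abc456
    line = re.sub(r'\b\d+[a-zA-Z]+\d+\b', '', line)
    # Remaining numbers, with optional leading $, thousands commas and decimals
    line = re.sub(r'\$?\d[\d,]*(?:\.\d+)?', '', line)
    # Collapse whitespace runs and trim
    return re.sub(r'\s+', ' ', line).strip()

samples = [
    ("Invoice 17/01/2020 total $1,234.56", "Invoice 03-02-2021 total $99.00"),
    ("item 123abc456 shipped", "item 7x9 shipped"),
]
for left, right in samples:
    print(repr(clean(left)), repr(clean(right)), clean(left) == clean(right))
```

Each pair should print the same cleaned text twice followed by `True`.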
Helper functions (fetching a block, computing its hash)
    import hashlib

    def get_block_hash(block):
        cleaned_lines = [clean_for_comparison(line) for line in block]
        non_blank_cleaned_lines = [line for line in cleaned_lines if line]
        cleaned_block_content = ''.join(non_blank_cleaned_lines)
        return hashlib.md5(cleaned_block_content.encode()).hexdigest()

    def get_block(lines, start_index, max_size):
        block = []
        i = start_index
        row_count = 0
        while i < len(lines) and row_count < max_size:
            line = lines[i].strip()
            if line:
                block.append(lines[i])
                row_count += 1
            i += 1
            if i >= len(lines):
                break
            if lines[i - 1].strip() == "" and lines[i].strip() == "":
                break
        return block, i
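One detail worth checking in any block hash is how the cleaned lines are joined. The small sketch below (the `block_key` helper and its `sep` parameter are hypothetical, introduced only for illustration) shows that joining with an empty string lets two different blocks produce the same key, while a newline separator keeps line boundaries in the hashed text.

```python
import hashlib

def block_key(cleaned_lines, sep):
    """Hash a list of already-cleaned lines, joined with the given separator."""
    return hashlib.md5(sep.join(cleaned_lines).encode()).hexdigest()

block_a = ["ab", "c"]
block_b = ["a", "bc"]

# With no separator both blocks flatten to "abc", so the keys collide.
collides_without_sep = block_key(block_a, "") == block_key(block_b, "")
# With a newline separator the line boundaries differ, so the keys differ.
collides_with_sep = block_key(block_a, "\n") == block_key(block_b, "\n")

print(collides_without_sep, collides_with_sep)
```

Comparing the lists of cleaned lines directly avoids the question altogether; a hash mainly helps when keys need to be stored or looked up.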
Main logic (unfinished)
I tried to write the main processing function, but the logic is flawed and it does not run correctly yet:
    def process_file(file_path, max_consecutive_rows_to_check=4):
        with open(file_path, 'r') as file:
            lines = file.readlines()

        filtered_lines = []
        i = 0
        removed_count = 0

        while i < len(lines):
            block, j = get_block(lines, i, max_consecutive_rows_to_check)
            if not block:  # Skip processing.
                i = j  # Increment
                continue  # Skip empty block

            # Sliding window comparison
            for size in range(min(len(block), max_consecutive_rows_to_check), 0, -1):
                block_to_compare = block[:size]
                position_row = initial_block_start
                while True:
                    compare_start = position_row + 1  # use the next row outside the initial block to compare?
                    compare_block, next_k = get_block(lines, compare_start, size)
                    compare_line = compare_block
                    if len(compare_block) != size:
                        break  # Done with compare - no more
                    if get_block_hash(block_to_compare) == get_block_hash(compare_block):
                        print("DUPE FOUND!")
                        found_dupes = True
                        total_dupes += 1  # dupe counter
                        position_row = compare_start  # set next dupe position to compare consecutively with
                        i += compare_start
                        # continue  # continue if it finds?
                    else:  # Did not find anything
                        print("Not a dupe.")
                        break

            # If duplicates were found:
            if found_dupes:
                print(f"Total duplicates found: {total_dupes}")
                filtered_lines.extend(block)
                filtered_lines.append(f"....<with an additional {total_dupes} entries found>...\n")
                removed_count += total_dupes
                i = position_row+max_consecutive_rows_to_check  # +1 + size?
            # If no duplicates were found:
            else:
                print("No duplicates found, adding to filtered lines.")
                filtered_lines.extend(block)
                # Advance to next block
                i =+max_consecutive_rows_to_check

        # Output to new file
        output_filename = file_path
        if output_filename.endswith(".txt"):
            output_filename = output_filename[:-4]
        if not output_filename.endswith("-truncated"):
            output_filename = f"{output_filename}-truncated.txt"
        with open(output_filename, "w") as output_file:
            output_file.writelines(filtered_lines)

        print(f"Processed {len(lines)} lines from {file_path}")
        print(f"Output saved to {output_filename}")
        if removed_count > 0:
            print(f"Removed {removed_count} duplicate blocks.")
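One way to untangle the loop above is to separate the comparison from the file handling: compute one cleaned key per non-blank line, then scan those keys with block sizes from the maximum down to 1. The following is a minimal sketch under stated assumptions, not a definitive implementation. The names `collapse_repeats`, `clean` and `max_block` are hypothetical, the cleaner is a simplified stand-in, blank lines between a block and its repeats are dropped along with the repeats, and lines that clean to an empty string still take part in comparisons.

```python
import re

def clean(line):
    """Hypothetical stand-in cleaner: drop number-like tokens, squeeze spaces."""
    line = re.sub(r'\$?\d[\d,./-]*', '', line)
    return re.sub(r'\s+', ' ', line).strip()

def collapse_repeats(lines, max_block=4, clean=clean):
    """Collapse consecutive repeated blocks of up to max_block non-blank lines."""
    # Indices of the lines that take part in comparisons (non-blank ones).
    idx = [i for i, ln in enumerate(lines) if ln.strip()]
    keys = [clean(lines[i]) for i in idx]

    out = []
    pos = 0        # current position in idx/keys
    copied_to = 0  # first index in `lines` not yet written to `out`
    while pos < len(idx):
        hit = None
        # Largest block size first, then shrink down to a single line.
        for size in range(min(max_block, len(idx) - pos), 0, -1):
            first = keys[pos:pos + size]
            repeats = 0
            nxt = pos + size
            # A short or empty slice never equals `first`, so this terminates.
            while keys[nxt:nxt + size] == first:
                repeats += 1
                nxt += size
            if repeats:
                hit = (size, repeats, nxt)
                break
        if hit is None:
            pos += 1  # no repeat starts here; try the next non-blank line
            continue
        size, repeats, nxt = hit
        last_kept = idx[pos + size - 1]
        # Copy everything up to the end of the first occurrence, then the notice.
        out.extend(lines[copied_to:last_kept + 1])
        out.append(f"...has an additional {repeats} similar entries...\n")
        # Resume copying right after the last repeated block.
        copied_to = idx[nxt - 1] + 1
        pos = nxt
    out.extend(lines[copied_to:])
    return out

sample = ["Header\n", "A 1\n", "B 2\n", "A 3\n", "B 4\n", "A 5\n", "B 6\n", "Footer\n"]
print("".join(collapse_repeats(sample)), end="")
```

Advancing one line at a time when nothing matches lets a repeat start anywhere, not only on multiples of the block size. Checking the largest size first has a side effect: four identical lines in a row match as a 2-line block repeated once rather than as one line repeated three times, so trying the smallest size first may be preferable if the reported counts matter.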
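My question
I am stuck on the overall control flow: how do I detect and remove consecutive repeated blocks efficiently and correctly? Also, is there an existing Python library that does something like this, that is, removes consecutive repeated content from a file on a per-block basis?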
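For the single-line case, the standard library already covers the grouping step: `itertools.groupby` groups consecutive items that share a key. The sketch below assumes a hypothetical helper name `collapse_single_lines` and uses `str.strip` as a placeholder key where a cleaning function would go; it does not handle multi-line blocks.

```python
from itertools import groupby

def collapse_single_lines(lines, key=str.strip):
    """Keep the first line of each run of consecutive equal-key lines."""
    out = []
    for _, group in groupby(lines, key=key):
        group = list(group)
        out.append(group[0])
        if len(group) > 1:
            out.append(f"...has an additional {len(group) - 1} similar entries...\n")
    return out

print(collapse_single_lines(["a\n", "a\n", "b\n", "a\n"]))
```

Because `groupby` only merges adjacent items, the trailing "a" in the example stays separate from the first run, which matches the "consecutive only" requirement.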
Note: content sourced from Stack Exchange; question author: jkeys




