How do I fix the 'cannot identify image file' error in Python image-deduplication code?

阿华AIGC实验室

2026-5-14

Fixing OSError: cannot identify image file

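The cause is straightforward: Pillow cannot recognize 12.jpg as a valid image. The file may be corrupted or may carry the wrong extension (for example, a text file renamed to .jpg). A permission problem normally raises PermissionError rather than this exact message, but it is also a subclass of OSError, so it is worth ruling out as well. Let's fix this step by step: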

1. Add exception handling around image loading and skip invalid files

The most direct fix is to catch the exception where the image is opened, so the program skips a bad file instead of crashing. Modify your getAllImageHashes function:

from PIL import Image
import os
from os import listdir  # listdir lives in os, not os.path
from os.path import join, isfile
import datetime

def getAllImageHashes(folder):
    onlyfiles = [join(folder, f) for f in listdir(folder) if isfile(join(folder, f)) and not f.endswith(".ini") and not f.endswith(".db")]
    hashedFiles = []
    fileLength = len(onlyfiles)
    for f in onlyfiles:
        try:
            # First pass: check that the file is a complete, recognizable image
            with Image.open(f) as img:
                img.verify()
            # After verify() the image object can no longer be used, so reopen the file before hashing
            with Image.open(f) as img:
                # dhash is assumed to be defined elsewhere in your script
                hashedFiles.append((f, dhash(img)))
        except (OSError, SyntaxError) as e:
            # Report the problem and skip this file
            print(f"Skipping invalid image file: {f} - reason: {e}")
            continue
    print("Finished hashing all valid files in folder: " + folder)
    return hashedFiles
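
If a printed message per bad file is too easy to miss in a large folder, one variation is to collect the skipped paths and report them once at the end. The sketch below is only an illustration of that idea: hash_files, its hash_func parameter, and the returned pair are hypothetical names that do not appear in your code, and hash_func stands in for "open the image and call dhash".

```python
def hash_files(paths, hash_func):
    """Hash every path with hash_func, collecting failures instead of raising.

    hash_func is a hypothetical callable that takes a path and returns a hash;
    it is expected to raise OSError (or SyntaxError) for unreadable images.
    """
    hashed = []   # (path, hash) pairs for files that loaded correctly
    skipped = []  # (path, error message) pairs for files that failed
    for path in paths:
        try:
            hashed.append((path, hash_func(path)))
        except (OSError, SyntaxError) as exc:
            skipped.append((path, str(exc)))
    return hashed, skipped
```

The caller can then print or log the skipped list after the run, which makes it easier to see how many files were rejected and why.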

2. Filter out non-image files up front (more reliable than checking the extension)

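The extension alone is not reliable: a file may end in .jpg yet actually be a PDF or plain text. Python's built-in imghdr module can detect the actual type from the file contents. Note that imghdr was deprecated in Python 3.11 and removed in Python 3.13, so this approach only works on older versions: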

import imghdr  # deprecated in Python 3.11, removed in 3.13

def is_real_image(file_path):
    # Detect whether the file content is a recognized image type
    return imghdr.what(file_path) is not None

Then add this check to the expression that builds onlyfiles:

onlyfiles = [join(folder, f) for f in listdir(folder) 
             if isfile(join(folder, f)) 
             and not f.endswith(".ini") 
             and not f.endswith(".db")
             and is_real_image(join(folder, f))]

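To support newer image formats, you can use the third-party library python-magic instead. Install it first with pip install python-magic-bin on Windows, or pip install python-magic on Linux/macOS (where it also needs the libmagic system library). Then replace the function above: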

import magic

def is_real_image(file_path):
    mime_type = magic.from_file(file_path, mime=True)
    return mime_type.startswith('image/')
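
If imghdr is unavailable (Python 3.13 and later) and installing python-magic is not an option, a rough substitute is to compare the first few bytes of the file against known signatures. The following is a minimal sketch using only the standard library; sniff_image_type and the _SIGNATURES table are names made up for this example, and the table covers only a handful of common formats.

```python
# Leading bytes ("magic numbers") of a few common image formats.
_SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
]

def sniff_image_type(file_path):
    """Return a format name guessed from the file header, or None."""
    try:
        with open(file_path, "rb") as fh:
            header = fh.read(12)
    except OSError:
        return None  # unreadable files are treated as "not an image"
    for magic_bytes, name in _SIGNATURES:
        if header.startswith(magic_bytes):
            return name
    # WebP: a RIFF container whose form type (bytes 8-11) is "WEBP"
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None

def is_real_image(file_path):
    return sniff_image_type(file_path) is not None
```

A header check like this only says what the file claims to be. A truncated JPEG still starts with the right bytes, so the try/except around Image.open in step 1 remains necessary.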

3. Extra check: confirm file permissions and integrity

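If some files really are images but still cannot be read, the cause may be permissions (or, on Windows, the file being locked by another program). You can add a permission check before processing: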

def is_file_accessible(file_path):
    return os.access(file_path, os.R_OK)  # True if the current user has read permission

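Then add this condition to onlyfiles, or handle the permission error inside the try block. PermissionError is a subclass of OSError, so an except OSError clause already catches it.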
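
As a sketch of how the filters could be combined, the long list comprehension can be unrolled into a small helper. list_readable_files and its skip_suffixes parameter are hypothetical names introduced here, not part of your original code. Keep in mind that os.access reports permission bits only; it does not detect a file that another program currently holds locked, so the try/except from step 1 is still the final safety net.

```python
import os

def list_readable_files(folder, skip_suffixes=(".ini", ".db")):
    """Return full paths of regular, readable files in folder, sorted by name."""
    result = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        if name.endswith(skip_suffixes):
            continue  # skip desktop.ini, Thumbs.db and similar
        if not os.access(path, os.R_OK):
            continue  # no read permission for the current user
        result.append(path)
    return result
```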

Addendum: improving your deletion logic (optional)

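Incidentally, the deletion condition in your code, matchFound == True and matchCount%2==0, looks questionable: matchFound starts out as False, so the first match skips the deletion. If the goal is to delete one file from each duplicate pair, you could restructure the logic, for example: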

# When a duplicate pair is found, pick the file to delete (here: keep the path that sorts first)
if f1[0] < f2[0]:
    to_delete = f2[0]
else:
    to_delete = f1[0]
try:
    os.remove(to_delete)
    print(f"Deleted duplicate file: {to_delete}")
except OSError as e:
    print(f"Failed to delete file: {to_delete} - error: {e}")

This handles the deletion of duplicate files more reliably.
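
One further option, offered only as a sketch: decide which files to delete before deleting anything, so that no path is removed twice when three or more files match. The helper below groups the (path, hash) pairs returned by getAllImageHashes by hash value. pick_files_to_delete is a hypothetical name, and the sketch assumes duplicates are defined by exactly equal hashes; if your comparison uses a similarity threshold instead, the grouping step would need to change.

```python
from collections import defaultdict

def pick_files_to_delete(hashed_files):
    """Given (path, hash) pairs, return the paths to delete.

    Files with equal hashes form one group; the path that sorts first in
    each group is kept and the rest are returned.
    """
    groups = defaultdict(list)
    for path, image_hash in hashed_files:
        groups[image_hash].append(path)
    to_delete = []
    for paths in groups.values():
        paths.sort()
        to_delete.extend(paths[1:])  # keep paths[0], drop the others
    return sorted(to_delete)
```

The returned list can then be passed through the same os.remove try/except shown above, one path at a time.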

The question comes from Stack Exchange; asked by mina.