How to Convert Special-Font Text in an Image to a String with Python (Fixing Recognition Accuracy)

阿华AIGC实验室

2026-5-12

Improving Tesseract OCR accuracy and removing stray special characters

Hey Arjun, let's fix that messy OCR output together! I see you're using Tesseract with Python but getting a lot of garbage characters—let's break down the issues and fix them step by step.

First, let's recap your current setup and the problematic output:

Your original code

from PIL import Image
import pytesseract
image=Image.open('C://Users/Arjun/Desktop/1512350.jpg')
pytesseract.pytesseract.tesseract_cmd='C://Program Files (x86)/Tesseract- OCR/tesseract'
result=pytesseract.image_to_string(image,config='-psm7 -c tessedit_char_whitlist=ABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890')
print(result)

Recognized output (full of special characters)

fl/'S‘TIW ° MILLER‘ 1003055666 ° gum/71; C6521:pmuzznmmimfmmpmy *5mg[e * 2900456023 ° Uj7s564550 ° 130013 ° mm 5\1£®IC/‘(L 0£0wEmm'2zowLI5vg gazmyw 250 0’/lrkksrmgf" ﬂowzzyvg (jﬁff-W" M * 42101 ° wowiany " qw— I’Va:/11/£172 ' J6 ’ 19955.65 * 5685.26 " 4586.65 ’ Safaxizf

1. Fix the basic configuration errors

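Your config has three small but critical mistakes:

tessedit_char_whitlist → tessedit_char_whitelist (the 'e' in "whitelist" is missing, so the whitelist is never applied)
-psm7 → --psm 7 (the flag and its value need a space between them; -psm was the Tesseract 3.x spelling, and Tesseract 4 and later expect the double-dash form)
Your Tesseract path has an extra space: Tesseract- OCR should be Tesseract-OCR (if the installation folder is actually named Tesseract-OCR, the path with the space will not resolve on Windows)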

Fix those first, and that alone will help filter out some unwanted characters.
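One way to avoid this kind of typo is to build the config string in one place instead of typing it by hand. The helper below is a minimal sketch: build_tesseract_config is a hypothetical name, not part of pytesseract, and it assumes Tesseract 4 or later (the --psm spelling, page segmentation modes 0 to 13).

```python
import string


def build_tesseract_config(psm: int, whitelist: str) -> str:
    """Return a config string suitable for pytesseract.image_to_string.

    Hypothetical helper; assumes Tesseract 4+ (double-dash --psm).
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # Drop duplicate characters (e.g. a repeated '0') while keeping their order.
    unique_chars = "".join(dict.fromkeys(whitelist))
    return f"--psm {psm} -c tessedit_char_whitelist={unique_chars}"


WHITELIST = string.ascii_uppercase + string.digits
config = build_tesseract_config(7, WHITELIST)
print(config)
```

The resulting string can be passed as the config argument of pytesseract.image_to_string. Note that the sketch does not handle whitelists containing spaces.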

2. Image preprocessing (the most important optimization step)

Tesseract performs best on high-contrast, clean images. Let's add preprocessing steps to clean up your image:

Convert to grayscale
Apply binarization (convert to black/white to eliminate gray noise)
Optional: Resize the image to increase resolution, or apply slight blur to reduce noise

Here's an updated code snippet with PIL-based preprocessing:

from PIL import Image, ImageOps
import pytesseract
import re

# 1. Load and preprocess the image
image = Image.open('C://Users/Arjun/Desktop/1512350.jpg')
# Convert to grayscale
gray_image = ImageOps.grayscale(image)
# Binarize (127 is a common starting threshold, not a universal default; tune it for your image)
threshold = 127
binary_image = gray_image.point(lambda x: 0 if x < threshold else 255, '1')

# 2. Fix Tesseract config and path
pytesseract.pytesseract.tesseract_cmd = 'C://Program Files (x86)/Tesseract-OCR/tesseract'
# Use PSM 6 (assume a single uniform block of text) instead of 7 if your image has multiple lines
# Tesseract 4+ expects --psm (Tesseract 3.x used -psm)
custom_config = r'--psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.'
result = pytesseract.image_to_string(binary_image, config=custom_config)

# 3. Post-process to remove any remaining unwanted characters
cleaned_result = re.sub(r'[^A-Z0-9.]', '', result)
print(cleaned_result)
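The fixed threshold of 127 in the snippet above is only a starting point. If you would rather derive the threshold from the image itself, one option is Otsu's method, which is not part of the original answer and is shown here only as a sketch. It works on a 256-bin grayscale histogram, which Pillow can provide via gray_image.histogram(); the example uses a synthetic histogram so it runs without any image file.

```python
def otsu_threshold(histogram):
    """Pick a binarization threshold from a 256-bin grayscale histogram.

    Returns t such that pixels with value < t form one class and pixels
    with value >= t form the other (matching `0 if x < t else 255`).
    """
    total = sum(histogram)
    weighted_total = sum(value * count for value, count in enumerate(histogram))
    best_split, best_variance = None, 0.0
    weight_low, sum_low = 0, 0
    for value, count in enumerate(histogram):
        weight_low += count
        sum_low += value * count
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the dark class yet
        if weight_high == 0:
            break  # every pixel is already in the dark class
        mean_low = sum_low / weight_low
        mean_high = (weighted_total - sum_low) / weight_high
        # Between-class variance (up to a constant factor).
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_split, best_variance = value, variance
    # Fall back to the midpoint when the image has a single gray level.
    return 128 if best_split is None else best_split + 1


# Synthetic histogram: dark text pixels at 50, light paper at 200.
histogram = [0] * 256
histogram[50] = 100
histogram[200] = 900
threshold = otsu_threshold(histogram)
print(threshold)
```

The returned value can replace the hard-coded threshold in the point() call. If OpenCV is available, cv2.threshold with the cv2.THRESH_OTSU flag computes the same kind of threshold directly.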

If you have OpenCV installed, you can use more advanced preprocessing (like Gaussian blur to reduce noise):

import cv2
from PIL import Image

image = cv2.imread('C://Users/Arjun/Desktop/1512350.jpg')
# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Apply Gaussian blur to reduce noise
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
# Binarize with adaptive threshold (better for uneven lighting).
# THRESH_BINARY keeps dark text on a light background, which Tesseract 4+ expects;
# THRESH_BINARY_INV would invert it to light text on a dark background.
binary = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)

# Convert back to PIL image for Tesseract
pil_image = Image.fromarray(binary)
# Rest of the Tesseract code same as above...

3. Adjust the PSM (page segmentation mode)

Your current PSM 7 setting (written --psm 7 in Tesseract 4 and later) tells Tesseract to treat the image as a single line of text. If your image has multiple lines or a block of text, switch to:

--psm 6: Assume a single uniform block of text (great for forms, receipts, etc.)
--psm 3: Default mode (fully automatic page segmentation, without orientation and script detection)

Tweak the PSM mode based on your image's layout. This can make a huge difference in accuracy.
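If you are unsure which mode fits, you can try several and compare the results. The sketch below uses a rough heuristic of my own, not anything built into Tesseract: it scores each result by the fraction of characters that belong to the expected set. pick_best_psm and allowed_ratio are hypothetical names, and the OCR call is passed in as a function so the example runs with canned strings. For the score to mean anything, run OCR without the whitelist, since a whitelist makes every result look clean.

```python
import re

ALLOWED = re.compile(r"[A-Z0-9.]")


def allowed_ratio(text):
    """Fraction of non-whitespace characters that are in the expected set."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ALLOWED.fullmatch(c)) / len(chars)


def pick_best_psm(run_ocr, modes=(3, 6, 7)):
    """Run OCR once per PSM mode; return (mode, text) with the best score."""
    results = {mode: run_ocr(f"--psm {mode}") for mode in modes}
    best = max(results, key=lambda m: allowed_ratio(results[m]))
    return best, results[best]


# Stand-in for a real call such as:
#   lambda cfg: pytesseract.image_to_string(binary_image, config=cfg)
def fake_ocr(config):
    canned = {
        "--psm 3": "fl/'S TIW",
        "--psm 6": "MILLER 1003055666",
        "--psm 7": "gum/71; C652",
    }
    return canned[config]


mode, text = pick_best_psm(fake_ocr)
print(mode, text)
```

A high score only says the output looks like the expected alphabet, not that it is correct, so check the winning result against the image by eye.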

4. Post-processing as a safety net

Even with good preprocessing, Tesseract sometimes lets unwanted characters slip through. Use Python's re module to strictly filter the output down to the characters you want:

import re
# Keep only uppercase letters, numbers, and periods (adjust regex based on your needs)
cleaned_result = re.sub(r'[^A-Z0-9.]', '', result)
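The pattern above also removes spaces and line breaks, so separate fields such as names and amounts run together into one string. If you want to keep the fields apart, a variant like the following may be more useful. It is a sketch under the assumption that whitespace is the only separator you care about; clean_ocr_text is a hypothetical helper name.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Keep uppercase letters, digits, and periods; keep fields separated."""
    # Replace each unwanted character with a space rather than deleting it,
    # so neighboring fields are not glued together.
    kept = re.sub(r"[^A-Z0-9.\s]", " ", raw)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(kept.split())


sample = "MILLER‘ 1003055666 ° 19955.65 * 5685.26"
print(clean_ocr_text(sample))
```

For the sample string this prints MILLER 1003055666 19955.65 5685.26, which can then be split on spaces to get individual fields.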

Additional tips

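Make sure you have a recent version of Tesseract installed. Version 4 and later use an LSTM-based recognition engine that is generally more accurate than the 3.x engine. Note that the LSTM engine in 4.0 ignores tessedit_char_whitelist; support for it returned in 4.1.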
If your text uses a rare font, you can train a custom Tesseract model, but that's more advanced—start with the steps above first.

The question in this post comes from Stack Exchange; it was asked by Arjun Vc.