How do I restrict which characters are recognized in Python's tesserocr? A way to recognize digits only
You don't need to touch C++ code or config files for this. tesserocr lets you set the same Tesseract parameters right in your Python script using the SetVariable method.
1. Restricting to a Custom Set of Characters
Just like the tessedit_char_whitelist config in C++, you can define exactly which characters Tesseract should look for. Here's how to do it in Python:
    from tesserocr import PyTessBaseAPI

    # Use a context manager to handle the API instance cleanly
    with PyTessBaseAPI() as api:
        # Define your allowed characters here (adjust as needed)
        allowed_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!@#$%"
        api.SetVariable("tessedit_char_whitelist", allowed_chars)

        # Load your image and run recognition
        api.SetImageFile("your_target_image.png")
        recognized_text = api.GetUTF8Text()
        print(recognized_text.strip())

The SetVariable method maps directly onto Tesseract's internal configuration: you're passing the same parameter you'd set in a config file, but straight from your Python code.
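One thing worth guarding against is a misspelled variable name, which is easy to miss. A minimal sketch, assuming (as the tesserocr docs describe) that SetVariable returns False when the name lookup fails; `set_variables` is a hypothetical helper, not part of tesserocr:

```python
def set_variables(api, settings):
    """Apply Tesseract variables and fail loudly on rejected names.

    `api` is any object with a SetVariable(name, value) method, such as a
    tesserocr.PyTessBaseAPI instance. This helper is hypothetical and only
    illustrates checking the boolean that SetVariable returns.
    """
    # Collect every name for which SetVariable reported failure.
    rejected = [
        name for name, value in settings.items()
        if not api.SetVariable(name, value)
    ]
    if rejected:
        raise ValueError(
            "Tesseract rejected variable(s): " + ", ".join(rejected)
        )
```

Inside the with block you would call it as set_variables(api, {"tessedit_char_whitelist": "0123456789"}) before loading the image.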
2. Restricting to Only Digits
For your specific use case of limiting recognition to numbers, just set the whitelist to the digits 0-9:
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI() as api:
        # Restrict recognition to digits only
        api.SetVariable("tessedit_char_whitelist", "0123456789")
        api.SetImageFile("image_with_numbers.png")
        print(api.GetUTF8Text().strip())
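One caveat: the whitelist was reportedly ignored by the LSTM engine in Tesseract 4.0 and only honored again from 4.1, so on some installs stray letters can still slip through. A small safety net, sketched here with a hypothetical `keep_digits` helper, is to filter the recognized text afterwards. Note that this only drops non-digit characters; it cannot correct a digit that was misread as a letter.

```python
import re


def keep_digits(text):
    """Return only the ASCII digits 0-9 from OCR output, line by line.

    Lines that contain no digits at all are dropped. This is a hypothetical
    post-processing helper, not a tesserocr function.
    """
    # Strip every non-digit character from each line of the OCR result.
    lines = (re.sub(r"[^0-9]", "", line) for line in text.splitlines())
    # Keep only the lines that still have content.
    return "\n".join(line for line in lines if line)
```

You would apply it to the recognition result, for example keep_digits(api.GetUTF8Text()).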
Quick Bonus: Blacklisting Characters
If you ever need to exclude specific characters instead of whitelisting, use tessedit_char_blacklist. Only the characters you list are excluded, so to block a handful of common symbols you would add this inside the same with block:
api.SetVariable("tessedit_char_blacklist", "!@#$%^&*()")
Hope this works for you. No C++ docs required! 😊
The question comes from Stack Exchange; it was asked by WesR.




