How to extract table text from an image into Excel with Tesseract-OCR (Python beginner asking for help)
Hey there! Since you're new to Python and Tesseract, let's break this down step by step to get that table data into Excel with the same row/column structure as your original image.
1. First, Install Required Tools & Libraries
You already have Tesseract working (since you can extract text to the console), but we'll need a few more Python libraries to handle image processing and Excel writing:
    pip install pytesseract opencv-python pandas openpyxl

- opencv-python: For cleaning up the image to make table lines/text clearer
- pandas: To structure the extracted data into a table format
- openpyxl: To write the structured data to an Excel file
2. Preprocess Your Image (Critical for Accuracy)
Table recognition works way better with clean, high-contrast images. Let's prep your image to highlight text and table lines:
    import cv2
    import pytesseract
    import pandas as pd
    import re

    # Optional: Set Tesseract path if your system doesn't auto-detect it
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Windows
    # pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'  # Linux (the path is often different on macOS)

    # Load your image (cv2.imread returns None instead of raising if the file can't be read)
    img = cv2.imread('your_table_image.jpg')
    if img is None:
        raise FileNotFoundError('Could not read your_table_image.jpg')

    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply inverted Otsu thresholding: text and table lines become white on black.
    # The threshold argument (0) is ignored because Otsu picks the value automatically.
    _, thresh_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Close small gaps in strokes and lines (adjust kernel size if needed;
    # a (1, 1) kernel would leave the image unchanged)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    cleaned_img = cv2.morphologyEx(thresh_img, cv2.MORPH_CLOSE, kernel)
3. Extract Structured Table Data
We'll cover two approaches—pick the one that fits your table's complexity:
Option 1: Simple Tables (Clear Rows/Columns, No Merged Cells)
If your table has distinct rows and columns with consistent spacing, run Tesseract in page segmentation mode 6 (a single uniform block of text) and split the plain-text output into rows and columns yourself:
    # Treat the image as a single uniform block of text, and keep runs of spaces
    # between words so the column split below has something to work with
    custom_config = r'--oem 3 --psm 6 -c preserve_interword_spaces=1'

    # Tesseract 4+ works best with dark text on a light background, so undo the inversion for OCR
    ocr_img = cv2.bitwise_not(cleaned_img)
    raw_text = pytesseract.image_to_string(ocr_img, config=custom_config)

    # Split text into individual rows (skip empty lines)
    table_rows = [row.strip() for row in raw_text.split('\n') if row.strip()]

    # Split each row into columns (uses regex to handle multiple spaces as column separators)
    structured_data = []
    for row in table_rows:
        columns = re.split(r'\s{2,}', row)  # Split on 2+ spaces
        structured_data.append(columns)

    # Convert to a pandas DataFrame (assumes the first row is your header
    # and that every row was split into the same number of columns)
    df = pd.DataFrame(structured_data[1:], columns=structured_data[0])
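OCR output is rarely that tidy: a row that splits into more or fewer pieces than the header can make the `pd.DataFrame(...)` call above raise an error or leave gaps. Here is a minimal sketch of one way to even the rows out first. It assumes that padding short rows with empty strings and folding any overflow into the last column is acceptable for your table; `split_row` and `normalize_rows` are hypothetical helper names, not part of pytesseract or pandas.

```python
import re

def split_row(line):
    """Split one OCR'd text line into cells on runs of 2+ whitespace characters."""
    return [cell.strip() for cell in re.split(r"\s{2,}", line.strip()) if cell.strip()]

def normalize_rows(rows, width):
    """Force every row to exactly `width` cells."""
    fixed = []
    for row in rows:
        if len(row) > width:
            # Too many cells: fold the overflow into the last column
            row = row[:width - 1] + [" ".join(row[width - 1:])]
        # Too few cells: pad with empty strings
        fixed.append(row + [""] * (width - len(row)))
    return fixed

# Made-up lines standing in for Tesseract output
lines = ["Name  Age  City", "Al  30", "Bo  25  New York  NY"]
rows = [split_row(line) for line in lines]
header = rows[0]
body = normalize_rows(rows[1:], len(header))
print(header)
print(body)
```

You could then build the DataFrame from `body` with `columns=header`. Folding overflow into the last column is only a guess at what went wrong, so it's worth eyeballing those rows afterwards.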
Option 2: Complex Tables (Merged Cells, Fuzzy Lines)
For trickier tables, we'll detect table lines first, split into individual cells, then extract text from each cell:
    import numpy as np

    custom_config = r'--oem 3 --psm 6'

    # Find the largest external contour, assumed to be the table's outer border
    contours, _ = cv2.findContours(cleaned_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    table_contour = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(table_contour)
    table_only_img = cleaned_img[y:y+h, x:x+w]

    def detect_lines(img, min_len):
        # HoughLinesP returns None when nothing is found, otherwise an array of (x1, y1, x2, y2)
        segs = cv2.HoughLinesP(img, 1, np.pi / 180, threshold=100,
                               minLineLength=min_len, maxLineGap=5)
        return [] if segs is None else segs.reshape(-1, 4)

    # Keep near-horizontal and near-vertical segments separately to split cells
    horizontal_lines = [s for s in detect_lines(table_only_img, w // 2) if abs(s[3] - s[1]) <= 2]
    vertical_lines = [s for s in detect_lines(table_only_img, h // 2) if abs(s[2] - s[0]) <= 2]

    # Note: You'll need to sort lines and split the table into individual cells here.
    # For simplicity, `cell_coords` is a placeholder for the result: fill it with
    # (x1, y1, x2, y2) boxes in row-by-row order.
    cell_coords = []

    structured_data = []
    current_row = []
    cols_per_row = 3  # Replace with your table's column count

    for (x1, y1, x2, y2) in cell_coords:
        # Extract the cell image (flipped back to dark text on light background) and its text
        cell_img = cv2.bitwise_not(table_only_img[y1:y2, x1:x2])
        cell_text = pytesseract.image_to_string(cell_img, config=custom_config).strip()
        current_row.append(cell_text)

        # When we hit the end of a row, add to data and reset
        if len(current_row) == cols_per_row:
            structured_data.append(current_row)
            current_row = []

    df = pd.DataFrame(structured_data)
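The step left open above, turning detected lines into `cell_coords`, could look roughly like this. It's a sketch that assumes a regular grid with no merged cells, and that you've already reduced the Hough output to a list of y positions for horizontal lines and x positions for vertical lines. `merge_close` and `grid_cells` are hypothetical helper names, and the 10-pixel tolerance is a guess you'd need to tune.

```python
def merge_close(positions, tol=10):
    """Collapse runs of positions that sit within `tol` pixels of the previous one."""
    merged = []
    group = []
    for p in sorted(positions):
        if group and p - group[-1] > tol:
            merged.append(sum(group) // len(group))  # integer average of the run
            group = []
        group.append(p)
    if group:
        merged.append(sum(group) // len(group))
    return merged

def grid_cells(h_lines_y, v_lines_x, tol=10):
    """Return (x1, y1, x2, y2) boxes in row-by-row order for a regular grid."""
    ys = merge_close(h_lines_y, tol)
    xs = merge_close(v_lines_x, tol)
    cells = []
    for y1, y2 in zip(ys, ys[1:]):        # each pair of neighbouring horizontal lines is a row
        for x1, x2 in zip(xs, xs[1:]):    # each pair of neighbouring vertical lines is a column
            cells.append((x1, y1, x2, y2))
    return cells

# Made-up line positions: a thick line often shows up as two nearby detections
cell_coords = grid_cells([0, 50, 52, 100], [0, 80, 160, 162])
print(cell_coords)
```

Because the boxes come out row by row, they fit the loop above as long as `cols_per_row` equals the number of vertical lines minus one. Merged cells break the regular-grid assumption, so this won't handle them.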
4. Save to Excel
Now that we have structured data, writing to Excel is straightforward:
    # Write DataFrame to Excel (index=False skips row numbers)
    df.to_excel('extracted_table.xlsx', index=False, engine='openpyxl')
    print("Success! Your table is saved to extracted_table.xlsx")
Quick Tips for Better Results
- Adjust Tesseract's PSM Mode: If text is misaligned, try `--psm 4` (a single column of text of variable sizes) or `--psm 11` (sparse text) in the `custom_config`
- Sharpen Blurry Images: Use `cv2.GaussianBlur()` followed by `cv2.addWeighted()` to sharpen images before processing
- Handle Merged Cells: For tables with merged cells, you might need to use specialized libraries like `table-ocr` or manually map cell positions
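If you end up mapping cell positions by hand, per-word positions are a useful starting point. `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` returns parallel lists that include `text`, `left`, `block_num`, `par_num`, and `line_num`. The sketch below groups a dictionary of that shape into rows of words sorted left to right. The sample dictionary is made up for illustration rather than real OCR output, and `group_words_into_rows` is a hypothetical helper name.

```python
from collections import defaultdict

def group_words_into_rows(data):
    """Group image_to_data-style parallel lists into rows of (left, text) pairs."""
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip the empty entries emitted for blocks, paragraphs, and lines
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], text))
    # Rows in reading order; words inside each row ordered by x position
    return [sorted(words) for _, words in sorted(lines.items())]

# Made-up sample in the shape of image_to_data(..., output_type=Output.DICT)
sample = {
    "text":      ["", "Name", "Age", "Al", "30"],
    "left":      [0, 10, 200, 10, 200],
    "block_num": [1, 1, 1, 1, 1],
    "par_num":   [1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 2, 2],
}
rows = group_words_into_rows(sample)
print(rows)
```

From there you can compare each word's `left` value against the column boundaries you expect and decide which column it belongs to.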
This question comes from Stack Exchange; it was asked by Cash Dogg.




