
How to Use Tesseract-OCR to Extract Table Text from an Image into Excel (Python Beginner Asking for Help)

How to Extract Table Image Data into a Structured Excel File

Hey there! Since you're new to Python and Tesseract, let's break this down step by step to get that table data into Excel with the same row/column structure as your original image.

1. First, Install Required Tools & Libraries

You already have Tesseract working (since you can extract text to the console), but we'll need a few more Python libraries to handle image processing and Excel writing:

pip install pytesseract opencv-python pandas openpyxl
  • opencv-python: For cleaning up the image to make table lines/text clearer
  • pandas: To structure the extracted data into a table format
  • openpyxl: To write the structured data to an Excel file

2. Preprocess Your Image (Critical for Accuracy)

Table recognition works way better with clean, high-contrast images. Let's prep your image to highlight text and table lines:

import cv2
import pytesseract
import pandas as pd
import re

# Optional: Set Tesseract path if your system doesn't auto-detect it
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # Windows
# pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract' # Linux (on macOS the path depends on how Tesseract was installed; check with `which tesseract`)

# Load your image
img = cv2.imread('your_table_image.jpg')

# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

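# Apply Otsu thresholding to make text stand out against the background
# (Otsu picks the threshold automatically, so the value passed in is ignored; 0 is the convention)
# THRESH_BINARY_INV yields white text and lines on a black background, which suits the morphology and contour steps
_, thresh_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Remove small specks of noise. A (1,1) kernel is a no-op, so start at (2,2);
# go larger for noisier scans, but note that opening can also erode very thin strokes
cleaned_img = cv2.morphologyEx(thresh_img, cv2.MORPH_OPEN, cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2)))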

3. Extract Structured Table Data

We'll cover two approaches. Pick the one that fits your table's complexity:

Option 1: Simple Tables (Clear Rows/Columns, No Merged Cells)

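If your table has distinct rows and columns with consistent spacing, run Tesseract with a page segmentation mode that treats the image as one uniform block of text, and tell it to keep the spacing between words so the column gaps survive in the output: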

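# Configure Tesseract: --psm 6 treats the image as a single uniform block of text,
# and preserve_interword_spaces=1 keeps the wide gaps between columns (otherwise they collapse to one space)
custom_config = r'--oem 3 --psm 6 -c preserve_interword_spaces=1'
# Tesseract 4+ works best with dark text on a light background, so invert the thresholded image back
raw_text = pytesseract.image_to_string(cv2.bitwise_not(cleaned_img), config=custom_config)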

# Split text into individual rows (skip empty lines)
table_rows = [row.strip() for row in raw_text.split('\n') if row.strip()]

# Split each row into columns (uses regex to handle multiple spaces as column separators)
structured_data = []
for row in table_rows:
    columns = re.split(r'\s{2,}', row)  # Split on 2+ spaces
    structured_data.append(columns)

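# Convert to a pandas DataFrame (assumes the first row is your header and that every row
# split into the same number of columns; pandas can raise a ValueError when they don't line up)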
df = pd.DataFrame(structured_data[1:], columns=structured_data[0])
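
OCR output is rarely perfectly regular, so some rows may split into fewer columns than the header. Here is a minimal sketch of one way to guard against that before building the DataFrame. The helper names `split_row` and `rows_to_table` and the sample lines are made up for illustration, and padding with empty strings is just one possible policy:

```python
import re

def split_row(line):
    """Split one OCR line into cells on runs of 2+ whitespace characters."""
    return [cell.strip() for cell in re.split(r'\s{2,}', line.strip())]

def rows_to_table(lines):
    """Split every non-empty line and pad short rows so all rows have equal length."""
    rows = [split_row(line) for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    return [row + [''] * (width - len(row)) for row in rows]

# Made-up sample standing in for raw_text.split('\n')
sample_lines = [
    "Name    Qty    Price",
    "Apple    3    1.20",
    "",
    "Pear    7",
]
table = rows_to_table(sample_lines)
for row in table:
    print(row)
```

With the rows padded, `pd.DataFrame(table[1:], columns=table[0])` should no longer trip over short rows. A padded cell only tells you something was missed, not which column it belonged to, so it is still worth checking those rows against the image.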

Option 2: Complex Tables (Merged Cells, Fuzzy Lines)

For trickier tables, we'll detect table lines first, split into individual cells, then extract text from each cell:

# Find table contours to isolate the table area
contours, _ = cv2.findContours(cleaned_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
table_contour = max(contours, key=cv2.contourArea)
x, y, w, h = cv2.boundingRect(table_contour)
table_only_img = cleaned_img[y:y+h, x:x+w]

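# Detect line segments, then separate them into horizontal and vertical by orientation
import numpy as np  # OpenCV has no cv2.PI constant, so use np.pi for the angle resolution
lines = cv2.HoughLinesP(table_only_img, 1, np.pi / 180, threshold=100, minLineLength=min(w, h) // 2, maxLineGap=5)
lines = lines.reshape(-1, 4) if lines is not None else []  # each entry is (x1, y1, x2, y2)
horizontal_lines = [l for l in lines if abs(l[3] - l[1]) <= 2]  # y1 and y2 nearly equal
vertical_lines = [l for l in lines if abs(l[2] - l[0]) <= 2]    # x1 and x2 nearly equal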

# Note: You'll need to sort lines and split the table into individual cells here
# For simplicity, let's assume you've mapped out cell coordinates in a list called `cell_coords`
structured_data = []
current_row = []
cols_per_row = 3  # Replace with your table's column count

for (x1, y1, x2, y2) in cell_coords:
    # Extract the cell image (inverted back to dark text on light background) and OCR it
    cell_img = cv2.bitwise_not(table_only_img[y1:y2, x1:x2])
    cell_text = pytesseract.image_to_string(cell_img, config='--oem 3 --psm 6').strip()
    current_row.append(cell_text)
    
    # When we hit the end of a row, add to data and reset
    if len(current_row) == cols_per_row:
        structured_data.append(current_row)
        current_row = []

df = pd.DataFrame(structured_data)
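
The snippet above leaves the mapping from detected lines to `cell_coords` open. One possible way to fill that gap, as a rough sketch: collect the y position of each horizontal line and the x position of each vertical line, merge near-duplicates (a thick line often shows up as several detections), and pair neighbouring positions into cell boxes. The helper names, the 5-pixel tolerance, and the sample positions below are all assumptions for illustration, and this only works for a regular grid with no merged cells:

```python
def merge_close(positions, tolerance=5):
    """Collapse positions that lie within `tolerance` px of each other into one value."""
    merged, group = [], []
    for p in sorted(positions):
        if group and p - group[-1] > tolerance:
            merged.append(sum(group) // len(group))  # integer mean of the cluster
            group = []
        group.append(p)
    if group:
        merged.append(sum(group) // len(group))
    return merged

def build_cell_coords(row_ys, col_xs, tolerance=5):
    """Return (x1, y1, x2, y2) boxes in row-major order for a regular grid."""
    ys = merge_close(row_ys, tolerance)
    xs = merge_close(col_xs, tolerance)
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            cells.append((left, top, right, bottom))
    return cells

# Hypothetical positions: y of each horizontal line, x of each vertical line
row_ys = [0, 2, 40, 41, 80]
col_xs = [0, 100, 101, 200]
cell_coords = build_cell_coords(row_ys, col_xs)
cols_per_row = len(merge_close(col_xs)) - 1
print(cell_coords, cols_per_row)
```

Because the boxes come out row by row, they can be fed straight into the loop above, with `cols_per_row` taken from the number of column boundaries instead of being typed in by hand. You may want to shrink each box by a few pixels so the grid lines themselves don't end up in the cell image.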

4. Save to Excel

Now that we have structured data, writing to Excel is straightforward:

# Write DataFrame to Excel (index=False skips row numbers)
df.to_excel('extracted_table.xlsx', index=False, engine='openpyxl')
print("Success! Your table is saved to extracted_table.xlsx")

Quick Tips for Better Results

  • Adjust Tesseract's PSM Mode: If text is misaligned, try --psm 4 (assumes a single column of text of variable sizes) or --psm 11 (sparse text, found in no particular order) in the custom_config
  • Sharpen Blurry Images: Use cv2.GaussianBlur() followed by cv2.addWeighted() to sharpen images before processing
  • Handle Merged Cells: For tables with merged cells, you might need to use specialized libraries like table-ocr or manually map cell positions
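
If you're not sure which PSM mode suits your table, you can try a few and compare. The sketch below scores each result by how many lines split into the same number of columns, which is a made-up heuristic rather than anything Tesseract provides. The `fake_ocr` function and its canned text are stand-ins so the example runs without Tesseract installed:

```python
import re
from collections import Counter

def column_consistency(text):
    """Fraction of non-empty lines that share the most common column count."""
    counts = [len(re.split(r'\s{2,}', line.strip()))
              for line in text.splitlines() if line.strip()]
    if not counts:
        return 0.0
    return Counter(counts).most_common(1)[0][1] / len(counts)

def pick_best_psm(run_ocr, psm_values=(6, 4, 11)):
    """Call run_ocr(config) once per PSM value and keep the most consistent output."""
    results = {psm: run_ocr(f'--oem 3 --psm {psm} -c preserve_interword_spaces=1')
               for psm in psm_values}
    best = max(results, key=lambda psm: column_consistency(results[psm]))
    return best, results[best]

# Stand-in for a real OCR call; returns canned text depending on the config
def fake_ocr(config):
    if '--psm 6' in config:
        return "Name  Qty  Price\nApple  3  1.20\nPear  7  0.80\n"
    return "Name  Qty\nPrice\nApple  3  1.20\n"

best_psm, best_text = pick_best_psm(fake_ocr)
print(best_psm)
```

With real images you would pass something like `lambda cfg: pytesseract.image_to_string(img, config=cfg)` as `run_ocr`. Keep in mind the score is crude: output where every line is a single column also counts as perfectly consistent, so glance at the winning text before trusting it.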

This question originates from Stack Exchange; asked by Cash Dogg
