Seeking an implementation for getting the X/Y coordinates of text in a Base64-encoded image

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-21

Solution: Get Text Coordinates from Clicked Base64 Image

Alright, let's solve this: you click on text in a Base64 image and want that text's X/Y coordinates back. The core challenge is combining image rendering, click-coordinate calculation, and OCR (Optical Character Recognition) to identify text regions and match clicks to them.

Core Approach

Render the Base64 image on the page (we'll use an <img> tag for simplicity).
Use Tesseract.js (an OCR library that runs in the browser) to scan the image and extract all text blocks along with their bounding box coordinates.
Listen for click events on the image and convert the click position to coordinates relative to the original image size (accounting for any CSS scaling).
Check which text block's bounding box contains the click position, then return that block's coordinates.
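The last step does not depend on the DOM or on Tesseract.js, so it can be sketched as a plain function. This is only a minimal illustration, assuming the blocks have already been reduced to `{ text, x, y, width, height }` objects in original-image pixels; the name `findBlockAt` and the sample data are hypothetical.

```javascript
// Hypothetical helper: return the first block whose bounding box
// contains the point (px, py), or null if no block matches.
function findBlockAt(blocks, px, py) {
    return blocks.find(b =>
        px >= b.x && px <= b.x + b.width &&
        py >= b.y && py <= b.y + b.height
    ) || null;
}

// Made-up sample data standing in for OCR output
const sampleBlocks = [
    { text: 'Hello', x: 10, y: 10, width: 100, height: 20 },
    { text: 'World', x: 10, y: 50, width: 100, height: 20 }
];

console.log(findBlockAt(sampleBlocks, 50, 60));   // the 'World' block
console.log(findBlockAt(sampleBlocks, 300, 300)); // null
```

Points on the box edge count as hits here; if blocks can overlap, the first match in array order wins.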

Complete Implementation Code

<!DOCTYPE html>
<html>
<head>
    <title>Base64 Image Text Click Coordinates</title>
    <style>
        #target-image {
            max-width: 800px;
            border: 1px solid #ddd;
            cursor: crosshair;
        }
        #result {
            margin-top: 20px;
            padding: 10px;
            background: #f5f5f5;
            border-radius: 4px;
        }
    </style>
</head>
<body>
    <img id="target-image" alt="Base64 Image" />
    <div id="result">Click on text in the image to get coordinates...</div>

    <!-- Load Tesseract.js from CDN -->
    <script src="https://cdn.jsdelivr.net/npm/tesseract.js@5.0.2/dist/tesseract.min.js"></script>
    <script>
        // Replace this with your actual Base64 image string
        const base64Image = "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAATAAAAFCAYAAAC8bQeYAAAABmJLR0QA/wD/AP+gvaeTAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAB3RJTUUH5gMVESkqQzkXSwAAABl0RVh0Q29tbWVudABDcmVhdGVkIHdpdGggR0lNUFeBDhcAAAAUSURBVBgZBcHPcsAABAAAQ+zWF8wAAAAASUVORK5CYII=";
        
        const imageElement = document.getElementById('target-image');
        const resultElement = document.getElementById('result');
        let textBlocks = [];

        // Initialize image and OCR
        async function init() {
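            // Attach the handlers before setting src so the load event cannot be missed
            const loaded = new Promise((resolve, reject) => {
                imageElement.onload = resolve;
                imageElement.onerror = () => reject(new Error('Image failed to load'));
            });
            imageElement.src = base64Image;

            // Wait for image to load
            await loaded;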
            
            // Run OCR to extract text blocks
            const { data: { blocks } } = await Tesseract.recognize(
                base64Image,
                'eng', // Change to your language code if needed
                { logger: m => console.log(m) } // Optional: log OCR progress
            );
            
            // Keep only blocks that contain text and store their bounding boxes
            // (guard against a missing `blocks` array)
            textBlocks = (blocks || []).filter(block => block.text && block.text.trim()).map(block => ({
                text: block.text,
                x: block.bbox.x0,
                y: block.bbox.y0,
                width: block.bbox.x1 - block.bbox.x0,
                height: block.bbox.y1 - block.bbox.y0
            }));
            
            console.log('Text blocks detected:', textBlocks);
        }

        // Handle image click
        imageElement.addEventListener('click', (e) => {
            if (textBlocks.length === 0) {
                resultElement.textContent = "No text blocks available: OCR is still running or found no text.";
                return;
            }

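            // Calculate click position relative to original image size.
            // clientWidth/clientHeight and clientLeft/clientTop exclude the CSS border,
            // so the 1px border does not skew the result.
            const rect = imageElement.getBoundingClientRect();
            const scaleX = imageElement.naturalWidth / imageElement.clientWidth;
            const scaleY = imageElement.naturalHeight / imageElement.clientHeight;
            
            const clickX = (e.clientX - rect.left - imageElement.clientLeft) * scaleX;
            const clickY = (e.clientY - rect.top - imageElement.clientTop) * scaleY;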

            // Find which text block contains the click
            const clickedBlock = textBlocks.find(block => 
                clickX >= block.x && 
                clickX <= block.x + block.width && 
                clickY >= block.y && 
                clickY <= block.y + block.height
            );

            if (clickedBlock) {
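                // Build the result from DOM nodes so OCR text containing "<" or "&" is shown literally
                const strong = document.createElement('strong');
                strong.textContent = clickedBlock.text.trim();
                resultElement.replaceChildren(
                    'Clicked Text: ', strong, document.createElement('br'),
                    `Text Block Coordinates (Top-Left): X = ${Math.round(clickedBlock.x)}, Y = ${Math.round(clickedBlock.y)}`,
                    document.createElement('br'),
                    `Click Position: X = ${Math.round(clickX)}, Y = ${Math.round(clickY)}`
                );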
            } else {
                resultElement.textContent = "No text found at this click position.";
            }
        });

        // Start initialization
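        init().catch(err => {
            console.error(err);
            resultElement.textContent = "Initialization failed: " + err.message;
        });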
    </script>
</body>
</html>

Key Details Explained

Base64 Image Handling: We directly set the src of the <img> tag to your Base64 string. No server-side processing needed here.
OCR with Tesseract.js: The library scans the image and returns text blocks with bounding boxes in original-image pixels (x0, y0 = top-left corner; x1, y1 = bottom-right corner). We keep only the blocks that actually contain text.
Coordinate Calculation: Since images might be scaled with CSS, we convert the click's clientX/clientY (relative to the viewport) to coordinates relative to the original image size using the naturalWidth/naturalHeight and the element's bounding rect.
Click Matching: We check if the click position falls within any text block's bounding box, then display the relevant coordinates and text.
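To make the coordinate calculation concrete, here is a small worked sketch outside the browser. It assumes `rect` describes where the image is displayed in the viewport (as `getBoundingClientRect()` would report) and `natural` holds the image's intrinsic size; the helper name `toImageCoords` and the numbers are made up for illustration.

```javascript
// Hypothetical helper: map a viewport click to original-image pixels.
// rect:    { left, top, width, height } of the displayed image
// natural: { width, height }, i.e. naturalWidth / naturalHeight
function toImageCoords(clientX, clientY, rect, natural) {
    const scaleX = natural.width / rect.width;
    const scaleY = natural.height / rect.height;
    return {
        x: (clientX - rect.left) * scaleX,
        y: (clientY - rect.top) * scaleY
    };
}

// A 1600x1200 image displayed at 800x600, with its top-left corner
// at (20, 100) in the viewport: the scale factor is 2 on both axes.
const point = toImageCoords(
    420, 250,
    { left: 20, top: 100, width: 800, height: 600 },
    { width: 1600, height: 1200 }
);
console.log(point); // { x: 800, y: 300 }
```

The click is 400px right and 150px below the image's corner on screen, which doubles to (800, 300) in the original image, the same space the OCR bounding boxes use.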

Notes

Make sure your Base64 image is clear and has high-contrast text for better OCR accuracy.
You can change the language code in Tesseract.recognize() (e.g., 'spa' for Spanish) if needed.
For large images, OCR might take a few seconds, so you can add a loading indicator to improve UX.
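One possible way to drive such a loading indicator is to turn the messages passed to the `logger` option into a status line. The sketch below assumes each message has a `status` string and a `progress` value between 0 and 1, which is what the logger in the code above prints to the console; the helper name `formatProgress` is hypothetical.

```javascript
// Hypothetical helper: turn a logger message into a short status line.
// Assumes m.status is a string and m.progress is a number in [0, 1].
function formatProgress(m) {
    const percent = Math.round((m.progress || 0) * 100);
    return `${m.status}: ${percent}%`;
}

console.log(formatProgress({ status: 'recognizing text', progress: 0.42 }));
// recognizing text: 42%
```

In the page above, this could be wired in by passing `{ logger: m => { resultElement.textContent = formatProgress(m); } }` as the third argument to Tesseract.recognize().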

The question in this post comes from Stack Exchange; the question's author is Nitesh Lad.