How to extract fields, values, and confidence scores from AWS Textract asynchronous results with the Java SDK

Hey there! I’ve worked with AWS Textract’s async APIs in Java before, so let’s walk through how to extract the structured key-value pairs you need, along with their confidence scores. Here’s a step-by-step approach with code examples:

Step 1: Retrieve and Merge Async Textract Results

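First, Textract returns async results in pages, so you’ll need to call the GetDocumentAnalysis API repeatedly, passing the NextToken pagination marker each time, until you’ve collected all blocks. Blocks are only available once the job has finished, so check the JobStatus in the response before reading them.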

import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.GetDocumentAnalysisRequest;
import com.amazonaws.services.textract.model.GetDocumentAnalysisResult;
import com.amazonaws.services.textract.model.Block;
import java.util.ArrayList;
import java.util.List;

// Initialize Textract client (use proper region/credentials in production)
AmazonTextract textractClient = AmazonTextractClientBuilder.defaultClient();
String asyncJobId = "your-async-job-id-here"; // From StartDocumentAnalysis response

List<Block> allDocumentBlocks = new ArrayList<>();
String nextToken = null;

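// Fetch all paginated results
do {
    GetDocumentAnalysisRequest request = new GetDocumentAnalysisRequest()
            .withJobId(asyncJobId)
            .withNextToken(nextToken);

    GetDocumentAnalysisResult result = textractClient.getDocumentAnalysis(request);

    // Blocks are only present once the job has finished; stop if it is still running or has failed
    String jobStatus = result.getJobStatus();
    if ("IN_PROGRESS".equals(jobStatus) || "FAILED".equals(jobStatus)) {
        throw new IllegalStateException("Textract job " + asyncJobId + " is " + jobStatus);
    }

    allDocumentBlocks.addAll(result.getBlocks());
    nextToken = result.getNextToken();
} while (nextToken != null);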
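
The pagination loop reads blocks but does not itself wait for the job to finish. Here is a minimal sketch of one way to wait, assuming simple polling with a capped, doubling delay. `JobPoller` and `waitForCompletion` are hypothetical names, not part of the AWS SDK, and the supplier stands in for whatever call returns the job's status string.

```java
import java.util.function.Supplier;

final class JobPoller {
    // Calls statusSupplier until it reports something other than IN_PROGRESS,
    // sleeping between attempts and doubling the delay up to maxDelayMillis.
    static String waitForCompletion(Supplier<String> statusSupplier,
                                    int maxAttempts,
                                    long initialDelayMillis,
                                    long maxDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSupplier.get();
            if (!"IN_PROGRESS".equals(status)) {
                return status; // a terminal status such as SUCCEEDED or FAILED
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("Interrupted while waiting for job", e);
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        throw new IllegalStateException("Job still IN_PROGRESS after " + maxAttempts + " attempts");
    }
}
```

With the real client, the supplier could be something like `() -> textractClient.getDocumentAnalysis(new GetDocumentAnalysisRequest().withJobId(asyncJobId)).getJobStatus()`. Treat the attempt count and delays as placeholders to tune for your document sizes.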

Step 2: Extract Key-Value Pairs with Confidence

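Textract represents form data as KEY_VALUE_SET blocks. Each of these blocks is either a KEY or a VALUE (check its EntityTypes). A KEY block points to its VALUE block through a relationship of type VALUE, and both point to the WORD blocks holding their text through CHILD relationships. We’ll iterate over the KEY blocks, resolve the text for each key and its linked value, and capture their confidence scores.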

First, add a helper function to fetch full text from blocks (since text might be split across child WORD blocks):

private static String extractFullText(Block targetBlock, List<Block> allBlocks) {
    StringBuilder textBuilder = new StringBuilder();
    
    // If it's a direct word block, grab the text
    if ("WORD".equals(targetBlock.getBlockType())) {
        textBuilder.append(targetBlock.getText());
        return textBuilder.toString();
    }
    
    // Recursively fetch text from child blocks
    if (targetBlock.getRelationships() != null) {
        for (var relationship : targetBlock.getRelationships()) {
            if ("CHILD".equals(relationship.getType())) {
                for (String childBlockId : relationship.getIds()) {
                    Block childBlock = allBlocks.stream()
                            .filter(block -> block.getId().equals(childBlockId))
                            .findFirst()
                            .orElse(null);
                    
                    if (childBlock != null) {
                        textBuilder.append(extractFullText(childBlock, allBlocks)).append(" ");
                    }
                }
            }
        }
    }
    return textBuilder.toString().trim();
}
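
The helper above scans the whole block list for every child ID, which can get slow on large multi-page documents. Below is a minimal sketch of the same traversal backed by an ID index. `SimpleBlock` is a hypothetical stand-in for the SDK's `Block`, reduced to the fields the traversal needs, and `BlockTextIndex` is an illustrative name rather than an SDK class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the SDK's Block, reduced to the fields used here.
final class SimpleBlock {
    final String id;
    final String blockType;
    final String text;           // null for non-WORD blocks
    final List<String> childIds; // IDs taken from the CHILD relationship

    SimpleBlock(String id, String blockType, String text, List<String> childIds) {
        this.id = id;
        this.blockType = blockType;
        this.text = text;
        this.childIds = childIds;
    }
}

final class BlockTextIndex {
    private final Map<String, SimpleBlock> byId = new HashMap<>();

    // Builds the ID lookup once, so each child resolution is a map lookup.
    BlockTextIndex(List<SimpleBlock> blocks) {
        for (SimpleBlock block : blocks) {
            byId.put(block.id, block);
        }
    }

    // Joins the text of all WORD descendants of the given block with single spaces.
    String textOf(String blockId) {
        List<String> words = new ArrayList<>();
        collectWords(byId.get(blockId), words);
        return String.join(" ", words);
    }

    private void collectWords(SimpleBlock block, List<String> words) {
        if (block == null) {
            return; // unknown ID: nothing to add
        }
        if ("WORD".equals(block.blockType)) {
            words.add(block.text);
            return;
        }
        for (String childId : block.childIds) {
            collectWords(byId.get(childId), words);
        }
    }
}
```

The same idea carries over to the real SDK types by keying a `Map<String, Block>` on `block.getId()` and reading child IDs from the CHILD relationship.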

Then, process the blocks to build your desired output:

import com.amazonaws.services.textract.model.Relationship;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Index blocks by ID so a KEY block's VALUE relationship can be resolved
Map<String, Block> blocksById = new HashMap<>();
for (Block block : allDocumentBlocks) {
    blocksById.put(block.getId(), block);
}

List<Map<String, Object>> formattedFields = new ArrayList<>();

for (Block keyBlock : allDocumentBlocks) {
    // Only process KEY_VALUE_SET blocks whose entity type is KEY
    if (!"KEY_VALUE_SET".equals(keyBlock.getBlockType())
            || keyBlock.getEntityTypes() == null
            || !keyBlock.getEntityTypes().contains("KEY")) {
        continue;
    }

    // Extract field name (key)
    String fieldName = extractFullText(keyBlock, allDocumentBlocks);

    // Follow the VALUE relationship to the linked VALUE block (may be absent)
    Block valueBlock = null;
    if (keyBlock.getRelationships() != null) {
        for (Relationship relationship : keyBlock.getRelationships()) {
            if ("VALUE".equals(relationship.getType()) && !relationship.getIds().isEmpty()) {
                valueBlock = blocksById.get(relationship.getIds().get(0));
            }
        }
    }

    // Extract field value (handle cases where value is missing)
    String fieldValue = valueBlock != null ? extractFullText(valueBlock, allDocumentBlocks) : "";

    // Get confidence score (use value's confidence if available, else key's)
    float confidence = valueBlock != null ? valueBlock.getConfidence() : keyBlock.getConfidence();
    String formattedConfidence = String.format("%.2f", confidence);

    // Build the field map
    Map<String, Object> fieldMap = new HashMap<>();
    fieldMap.put("Field", fieldName);
    fieldMap.put("Value", fieldValue);
    fieldMap.put("confidence Score", formattedConfidence);

    formattedFields.add(fieldMap);
}

// Convert to your desired JSON format (using Jackson);
// writeValueAsString throws the checked JsonProcessingException
ObjectMapper objectMapper = new ObjectMapper();
String jsonOutput = objectMapper.writerWithDefaultPrettyPrinter().writeValueAsString(formattedFields);
System.out.println(jsonOutput);

Key Notes to Consider

  • IAM Permissions: Ensure your execution role has textract:StartDocumentAnalysis and textract:GetDocumentAnalysis permissions.
  • Null Handling: The code accounts for missing values (sets empty string) and uses the key’s confidence if the value is absent. Adjust this logic based on your business rules.
  • Duplicate Fields: If your document has repeated field names, you may need to add logic to merge or deduplicate entries.
  • JSON Library: The example uses Jackson for JSON serialization. You can swap it for Gson if that’s your preference.
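
For the duplicate-fields note, here is a minimal sketch of one possible rule: keep a single entry per field name, preferring the one with the highest confidence. `FieldDeduplicator` is a hypothetical helper, and it assumes the map keys used in Step 2 ("Field", "Value", "confidence Score") with the confidence stored as a dot-decimal string such as "97.25".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FieldDeduplicator {
    // Keeps one entry per "Field" name: the entry with the highest confidence.
    // Output order follows the first appearance of each field name.
    static List<Map<String, Object>> keepHighestConfidence(List<Map<String, Object>> fields) {
        Map<String, Map<String, Object>> bestByName = new LinkedHashMap<>();
        for (Map<String, Object> field : fields) {
            String name = String.valueOf(field.get("Field"));
            Map<String, Object> current = bestByName.get(name);
            if (current == null || confidenceOf(field) > confidenceOf(current)) {
                bestByName.put(name, field);
            }
        }
        return new ArrayList<>(bestByName.values());
    }

    // Assumes the confidence was stored as a dot-decimal string, e.g. "97.25".
    private static double confidenceOf(Map<String, Object> field) {
        return Double.parseDouble(String.valueOf(field.get("confidence Score")));
    }
}
```

Calling `FieldDeduplicator.keepHighestConfidence(formattedFields)` before serializing would apply this rule. If repeated fields are legitimate in your documents (for example, line items), collecting the values into a list per field name may be the better choice.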

The question comes from Stack Exchange; asked by Mohan vel.
