How to extract fields, values, and confidence scores from AWS Textract async results with the Java SDK
I’ve worked with AWS Textract’s async APIs in Java before, so here is a step-by-step way to extract the structured key-value pairs you need, together with their confidence scores, with code examples.
Step 1: Retrieve and Merge Async Textract Results
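First, an async job has to finish before its blocks are available, and the results then come back in pages. Poll GetDocumentAnalysis until JobStatus is no longer IN_PROGRESS, then follow the NextToken pagination marker until you’ve collected all blocks.

    import com.amazonaws.services.textract.AmazonTextract;
    import com.amazonaws.services.textract.AmazonTextractClientBuilder;
    import com.amazonaws.services.textract.model.Block;
    import com.amazonaws.services.textract.model.GetDocumentAnalysisRequest;
    import com.amazonaws.services.textract.model.GetDocumentAnalysisResult;
    import java.util.ArrayList;
    import java.util.List;

    // Initialize Textract client (use proper region/credentials in production)
    AmazonTextract textractClient = AmazonTextractClientBuilder.defaultClient();
    String asyncJobId = "your-async-job-id-here"; // From StartDocumentAnalysis response

    // Wait for the job to finish (Thread.sleep throws InterruptedException)
    GetDocumentAnalysisResult result = textractClient.getDocumentAnalysis(
            new GetDocumentAnalysisRequest().withJobId(asyncJobId));
    while ("IN_PROGRESS".equals(result.getJobStatus())) {
        Thread.sleep(5000);
        result = textractClient.getDocumentAnalysis(
                new GetDocumentAnalysisRequest().withJobId(asyncJobId));
    }
    if (!"SUCCEEDED".equals(result.getJobStatus())) {
        throw new IllegalStateException("Textract job ended with status "
                + result.getJobStatus() + ": " + result.getStatusMessage());
    }

    // Fetch all paginated results, starting with the page already in hand
    List<Block> allDocumentBlocks = new ArrayList<>(result.getBlocks());
    String nextToken = result.getNextToken();
    while (nextToken != null) {
        result = textractClient.getDocumentAnalysis(new GetDocumentAnalysisRequest()
                .withJobId(asyncJobId)
                .withNextToken(nextToken));
        allDocumentBlocks.addAll(result.getBlocks());
        nextToken = result.getNextToken();
    }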
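If you poll for job completion, a capped backoff with an attempt limit is usually kinder to API rate limits than a tight loop. Here is a minimal sketch of that idea with no AWS dependency: `PollingSketch` and `waitForCompletion` are hypothetical names, and the `Supplier<String>` stands in for a call that returns the job status string (for example the `JobStatus` of a GetDocumentAnalysis response).

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class PollingSketch {
    // Polls statusSupplier until it returns something other than IN_PROGRESS,
    // doubling the delay after each attempt up to maxDelayMs.
    static String waitForCompletion(Supplier<String> statusSupplier,
                                    long initialDelayMs, long maxDelayMs, int maxAttempts) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String status = statusSupplier.get();
            if (!"IN_PROGRESS".equals(status)) {
                return status; // e.g. SUCCEEDED or FAILED
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
            delay = Math.min(delay * 2, maxDelayMs);
        }
        throw new IllegalStateException(
                "Job still IN_PROGRESS after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        // Fake status sequence standing in for repeated Textract calls
        Iterator<String> fake =
                List.of("IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED").iterator();
        System.out.println(waitForCompletion(fake::next, 10, 40, 5)); // SUCCEEDED
    }
}
```

In real code the supplier would wrap the Textract call, and the delays would be seconds rather than milliseconds.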
Step 2: Extract Key-Value Pairs with Confidence
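Textract represents form data as KEY_VALUE_SET blocks (returned when the job was started with the FORMS feature type). Each pair is made of two such blocks: one whose EntityTypes contains KEY and one whose EntityTypes contains VALUE. The KEY block points to its VALUE block through a relationship of type VALUE, and each of them points to its text through CHILD relationships. We’ll iterate through the KEY blocks, resolve their VALUE blocks, extract the text for both, and capture their confidence scores.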
First, add a helper function to fetch full text from blocks (since text might be split across child WORD blocks):
    private static String extractFullText(Block targetBlock, List<Block> allBlocks) {
        StringBuilder textBuilder = new StringBuilder();

        // If it's a direct word block, grab the text
        if ("WORD".equals(targetBlock.getBlockType())) {
            textBuilder.append(targetBlock.getText());
            return textBuilder.toString();
        }

        // Recursively fetch text from child blocks
        if (targetBlock.getRelationships() != null) {
            for (var relationship : targetBlock.getRelationships()) {
                if ("CHILD".equals(relationship.getType())) {
                    for (String childBlockId : relationship.getIds()) {
                        Block childBlock = allBlocks.stream()
                                .filter(block -> block.getId().equals(childBlockId))
                                .findFirst()
                                .orElse(null);
                        if (childBlock != null) {
                            textBuilder.append(extractFullText(childBlock, allBlocks)).append(" ");
                        }
                    }
                }
            }
        }
        return textBuilder.toString().trim();
    }
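The helper above scans the whole block list for every child ID, which gets slow on large documents. One possible alternative is to index the blocks by ID once and look children up in a map. The sketch below shows the idea on a stand-in type so it runs without the AWS SDK: `Blk`, `indexById`, and `textOf` are hypothetical names, not Textract classes, and the record syntax assumes Java 16 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

class BlockIndexSketch {
    // Hypothetical stand-in for a Textract Block:
    // id, text (set on WORD blocks only), and the ids from its CHILD relationship.
    record Blk(String id, String text, List<String> childIds) {}

    // Builds the id -> block lookup once, so each child lookup is O(1).
    static Map<String, Blk> indexById(List<Blk> blocks) {
        return blocks.stream().collect(Collectors.toMap(Blk::id, Function.identity()));
    }

    // Joins the text of a block's children with spaces, skipping unknown ids.
    static String textOf(Blk block, Map<String, Blk> byId) {
        if (block.text() != null) {
            return block.text();
        }
        return block.childIds().stream()
                .map(byId::get)
                .filter(Objects::nonNull)
                .map(child -> textOf(child, byId))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<Blk> blocks = List.of(
                new Blk("k1", null, List.of("w1", "w2", "missing")),
                new Blk("w1", "Invoice", List.of()),
                new Blk("w2", "Number:", List.of()));
        Map<String, Blk> byId = indexById(blocks);
        System.out.println(textOf(byId.get("k1"), byId)); // Invoice Number:
    }
}
```

With real Textract blocks the same map would be keyed by `Block.getId()`.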
Then, process the blocks to build your desired output:
    import com.amazonaws.services.textract.model.Block;
    import com.amazonaws.services.textract.model.Relationship;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Index blocks by ID so VALUE relationships can be resolved
    Map<String, Block> blocksById = new HashMap<>();
    for (Block block : allDocumentBlocks) {
        blocksById.put(block.getId(), block);
    }

    List<Map<String, Object>> formattedFields = new ArrayList<>();
    for (Block keyBlock : allDocumentBlocks) {
        // Only process the KEY half of each key-value set
        if (!"KEY_VALUE_SET".equals(keyBlock.getBlockType())
                || keyBlock.getEntityTypes() == null
                || !keyBlock.getEntityTypes().contains("KEY")) {
            continue;
        }

        // Follow the KEY block's VALUE relationship to its VALUE block (may be absent)
        Block valueBlock = null;
        if (keyBlock.getRelationships() != null) {
            for (Relationship relationship : keyBlock.getRelationships()) {
                if ("VALUE".equals(relationship.getType()) && !relationship.getIds().isEmpty()) {
                    valueBlock = blocksById.get(relationship.getIds().get(0));
                }
            }
        }

        // Extract field name (key) and field value (empty string if the value is missing)
        String fieldName = extractFullText(keyBlock, allDocumentBlocks);
        String fieldValue = valueBlock != null
                ? extractFullText(valueBlock, allDocumentBlocks)
                : "";

        // Get confidence score (use value's confidence if available, else key's)
        Float confidence = valueBlock != null
                ? valueBlock.getConfidence()
                : keyBlock.getConfidence();
        String formattedConfidence = confidence != null ? String.format("%.2f", confidence) : "";

        // Build the field map
        Map<String, Object> fieldMap = new HashMap<>();
        fieldMap.put("Field", fieldName);
        fieldMap.put("Value", fieldValue);
        fieldMap.put("confidence Score", formattedConfidence);
        formattedFields.add(fieldMap);
    }

    // Convert to your desired JSON format (using Jackson);
    // writeValueAsString throws JsonProcessingException
    ObjectMapper objectMapper = new ObjectMapper();
    String jsonOutput = objectMapper.writerWithDefaultPrettyPrinter().writeValueAsString(formattedFields);
    System.out.println(jsonOutput);
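Two details of the output step are worth checking. `String.format` without an explicit `Locale` uses the JVM default locale, which can print a decimal comma, and a `HashMap` does not guarantee the order of keys in the serialized JSON. The sketch below shows one way to pin both down using only the standard library. `FieldEntrySketch` and `toEntry` are hypothetical names, and the confidence values are made-up sample numbers.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class FieldEntrySketch {
    // Builds one output entry; prefers the value's confidence, falls back to the key's.
    static Map<String, Object> toEntry(String field, String value,
                                       Float valueConfidence, Float keyConfidence) {
        Float chosen = valueConfidence != null ? valueConfidence : keyConfidence;
        // LinkedHashMap keeps insertion order: Field, Value, confidence Score
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("Field", field);
        entry.put("Value", value);
        // Locale.ROOT forces a decimal point regardless of the default locale
        entry.put("confidence Score",
                chosen == null ? "" : String.format(Locale.ROOT, "%.2f", chosen));
        return entry;
    }

    public static void main(String[] args) {
        System.out.println(toEntry("Name", "Jane Doe", 98.456f, 99.1f));
        // {Field=Name, Value=Jane Doe, confidence Score=98.46}
    }
}
```

Jackson serializes a `LinkedHashMap` in insertion order, so the JSON keys come out in the order shown.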
Key Notes to Consider
- IAM Permissions: Ensure your execution role has textract:StartDocumentAnalysis and textract:GetDocumentAnalysis permissions.
- Null Handling: The code accounts for missing values (sets empty string) and uses the key’s confidence if the value is absent. Adjust this logic based on your business rules.
- Duplicate Fields: If your document has repeated field names, you may need to add logic to merge or deduplicate entries.
- JSON Library: The example uses Jackson for JSON serialization; you can swap it for Gson if you prefer.
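For the duplicate-fields note, one simple option is to group values under their field name rather than dropping repeats. This is a sketch under the assumption that each entry is a map with "Field" and "Value" keys, as in the output built in Step 2. `DuplicateFieldSketch` and `groupByField` are hypothetical names, and the phone numbers are sample data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateFieldSketch {
    // Groups values by field name, preserving first-seen order of names and values.
    static Map<String, List<String>> groupByField(List<Map<String, Object>> fields) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map<String, Object> field : fields) {
            String name = String.valueOf(field.get("Field"));
            String value = String.valueOf(field.get("Value"));
            grouped.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> fields = List.of(
                Map.<String, Object>of("Field", "Phone", "Value", "555-0100"),
                Map.<String, Object>of("Field", "Name", "Value", "Jane Doe"),
                Map.<String, Object>of("Field", "Phone", "Value", "555-0199"));
        System.out.println(groupByField(fields));
        // {Phone=[555-0100, 555-0199], Name=[Jane Doe]}
    }
}
```

Whether repeated names should be merged, numbered, or kept as separate entries depends on the document type, so treat the grouping rule as a starting point.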
The question originates from Stack Exchange; asked by Mohan vel.




