Java Map Function Requirement: Count Customers Who Bought Both Item ID 21 and Item ID 27
Map Function Implementation for Retail Dataset
Got it, let's build that Map function you need for your retail dataset. This will check each input line for both tokens "21" and "27", then emit a fixed key-value pair only when both are present.
Step-by-Step Implementation
1. Core Logic Overview
- Initialize two boolean flags, item_21 and item_27, to false.
- Split the input text line into individual tokens using StringTokenizer.
- Iterate through each token: set item_21 to true if the token matches "21", and set item_27 to true if it matches "27".
- After processing all tokens, if both flags are true, emit the key Both_21_27 with the value 1.
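To try the flag logic from the steps above without a Hadoop cluster, here is a minimal plain-Java sketch. The class name BothItemsCheck and the method containsBoth are hypothetical names used only for illustration, and the sketch assumes whitespace-delimited input lines.

```java
import java.util.StringTokenizer;

// Hypothetical helper class for illustration; not part of Hadoop.
final class BothItemsCheck {

    // Returns true when the whitespace-delimited line contains
    // both the exact token "21" and the exact token "27".
    static boolean containsBoth(String line) {
        boolean item21 = false;
        boolean item27 = false;
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (token.equals("21")) {
                item21 = true;
            } else if (token.equals("27")) {
                item27 = true;
            }
        }
        return item21 && item27;
    }

    public static void main(String[] args) {
        System.out.println(containsBoth("3 21 14 27")); // both tokens present
        System.out.println(containsBoth("21 210 270")); // "27" is missing
    }
}
```

The Mapper applies the same check per input line, but writes a key-value pair to the context instead of returning a boolean.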
2. Full Java Code Example
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RetailItemMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Define reusable constant objects to optimize performance
        private static final Text OUTPUT_KEY = new Text("Both_21_27");
        private static final IntWritable OUTPUT_VALUE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Initialize flags to track presence of target items
            boolean item_21 = false;
            boolean item_27 = false;

            // Convert input text to string and split into tokens
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            // Check each token for matches
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                if (token.equals("21")) {
                    item_21 = true;
                } else if (token.equals("27")) {
                    item_27 = true;
                }

                // Optional early exit to save processing once both items are found
                if (item_21 && item_27) {
                    break;
                }
            }

            // Emit output only if both items are present in the line
            if (item_21 && item_27) {
                context.write(OUTPUT_KEY, OUTPUT_VALUE);
            }
        }
    }
3. Key Details Explained
- Reusable Constants: We define OUTPUT_KEY and OUTPUT_VALUE as class-level constants to avoid creating new objects for every input line. This reduces object allocation and garbage-collection overhead across the MapReduce job.
- Exact Token Matching: Using equals() ensures we only match the exact tokens "21" and "27", so values like "210" or "27a" won't be incorrectly flagged. If your dataset uses a non-whitespace delimiter (like commas for CSV), update the StringTokenizer to use that delimiter (e.g., new StringTokenizer(line, ",")).
- Early Exit Optimization: Breaking out of the token loop once both flags are true skips the remaining tokens, which is especially helpful for large retail datasets with long lines.
- Efficient Output Types: We use IntWritable for the value instead of Text because it's more efficient for numeric data in Hadoop's serialization system.
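As a quick check of the exact-matching and delimiter points, the sketch below runs the same flag logic with a configurable delimiter. The class DelimiterCheck and its method are hypothetical names for illustration. The trim() call is an added assumption: a comma-delimited line such as "5, 21, 27" yields tokens with leading spaces, which equals() would otherwise reject.

```java
import java.util.StringTokenizer;

// Hypothetical helper class for illustration; not part of Hadoop.
final class DelimiterCheck {

    // Same flag logic as the Mapper, but with caller-supplied delimiter
    // characters and an early exit once both items have been seen.
    static boolean containsBoth(String line, String delimiters) {
        boolean item21 = false;
        boolean item27 = false;
        StringTokenizer tokenizer = new StringTokenizer(line, delimiters);
        while (tokenizer.hasMoreTokens() && !(item21 && item27)) {
            // trim() is an assumption: it tolerates spaces after the delimiter
            String token = tokenizer.nextToken().trim();
            if (token.equals("21")) {
                item21 = true;
            } else if (token.equals("27")) {
                item27 = true;
            }
        }
        return item21 && item27;
    }

    public static void main(String[] args) {
        System.out.println(containsBoth("5, 21, 27", ","));   // spaces are trimmed
        System.out.println(containsBoth("210,27a,21", ","));  // no exact "27" token
    }
}
```

If your data never contains padding around delimiters, the trim() call can be dropped.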
This question originally comes from Stack Exchange, asked by ElephantG.




