编写Hive/MySQL查询:为指定ID列表中每个ID返回单条关联数据
解决Hive/MySQL中按ID列表返回单条匹配记录的问题
我来帮你搞定这个需求!核心目标是让输入的每个ID,都能从books表中拿到恰好一条匹配global_parent_id的记录,而且所有输入ID都要覆盖到。之前只返回一条,大概率是没把输入ID列表作为独立数据集关联处理,或者没做分组取唯一的逻辑。
Hive 解决方案
方法1:使用窗口函数(推荐,Hive 2.3+支持VALUES子句)
先把你的ID列表构造成临时数据集,再关联books表,用ROW_NUMBER()窗口函数给每个global_parent_id的记录编号,最后取每组的第一条。
WITH input_ids AS ( -- 这里替换成你的实际ID列表 SELECT id AS target_id FROM VALUES (1001), (1002), (1003) AS t(id) ), ranked_books AS ( SELECT b.*, -- 按create_time倒序取最新的一条,可替换成你需要的排序字段(比如id、update_time) ROW_NUMBER() OVER (PARTITION BY b.global_parent_id ORDER BY b.create_time DESC) AS rn FROM books b -- 关联输入ID列表,确保只处理你需要的ID JOIN input_ids i ON b.global_parent_id = i.target_id ) -- 只取每组的第一条记录 SELECT * FROM ranked_books WHERE rn = 1;
方法2:兼容低版本Hive(不支持VALUES子句)
如果你的Hive版本较低,用UNION ALL构造ID列表:
WITH input_ids AS ( SELECT 1001 AS target_id UNION ALL SELECT 1002 AS target_id UNION ALL SELECT 1003 AS target_id ), ranked_books AS ( SELECT b.*, ROW_NUMBER() OVER (PARTITION BY b.global_parent_id ORDER BY b.create_time DESC) AS rn FROM books b JOIN input_ids i ON b.global_parent_id = i.target_id ) SELECT * FROM ranked_books WHERE rn = 1;
可选:保留无匹配的输入ID
如果某个输入ID在books里没有对应记录,你想保留该ID并返回空值,把JOIN改成LEFT JOIN即可:
WITH input_ids AS ( SELECT 1001 AS target_id UNION ALL SELECT 1002 AS target_id UNION ALL SELECT 1003 AS target_id ), ranked_books AS ( SELECT i.target_id, b.*, ROW_NUMBER() OVER (PARTITION BY i.target_id ORDER BY b.create_time DESC) AS rn FROM input_ids i LEFT JOIN books b ON b.global_parent_id = i.target_id ) SELECT target_id, id, title, global_parent_id -- 替换成你需要的字段 FROM ranked_books WHERE rn = 1 OR rn IS NULL;
MySQL 解决方案
方法1:窗口函数(MySQL 8.0+推荐)
逻辑和Hive一致,用窗口函数分组取唯一:
WITH input_ids AS ( SELECT 1001 AS target_id UNION ALL SELECT 1002 AS target_id UNION ALL SELECT 1003 AS target_id ), ranked_books AS ( SELECT b.*, ROW_NUMBER() OVER (PARTITION BY b.global_parent_id ORDER BY b.create_time DESC) AS rn FROM books b JOIN input_ids i ON b.global_parent_id = i.target_id ) SELECT * FROM ranked_books WHERE rn = 1;
方法2:兼容MySQL 5.x(无窗口函数)
用关联子查询实现分组取唯一:
SELECT b.* FROM books b JOIN ( SELECT 1001 AS target_id UNION ALL SELECT 1002 AS target_id UNION ALL SELECT 1003 AS target_id ) i ON b.global_parent_id = i.target_id -- 子查询取每个global_parent_id对应的最新记录ID WHERE b.id = ( SELECT id FROM books WHERE global_parent_id = b.global_parent_id ORDER BY create_time DESC LIMIT 1 );
关键注意点
- 排序字段可按需调整:如果不需要最新/最早,只是随便取一条,可以用
ORDER BY RAND()(但性能较差,不建议大数据量使用),或者按主键id排序。 - 必须关联输入ID列表:这样才能保证每个输入ID都被处理,避免只返回一条结果的问题。
内容的提问来源于stack exchange,提问作者user3802348




