You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

编写Hive/MySQL查询:为指定ID列表中每个ID返回单条关联数据

解决Hive/MySQL中按ID列表返回单条匹配记录的问题

我来帮你搞定这个需求!核心目标是让输入的每个ID,都能从books表中拿到恰好一条匹配global_parent_id的记录,而且所有输入ID都要覆盖到。之前只返回一条,大概率是没把输入ID列表作为独立数据集关联处理,或者没做分组取唯一的逻辑。

Hive 解决方案

方法1:使用窗口函数(推荐,Hive 2.3+支持VALUES子句)

先把你的ID列表构造成临时数据集,再关联books表,用ROW_NUMBER()窗口函数给每个global_parent_id的记录编号,最后取每组的第一条。

WITH input_ids AS (
    -- 这里替换成你的实际ID列表
    SELECT id AS target_id
    FROM VALUES (1001), (1002), (1003) AS t(id)
),
ranked_books AS (
    SELECT 
        b.*,
        -- 按create_time倒序取最新的一条,可替换成你需要的排序字段(比如id、update_time)
        ROW_NUMBER() OVER (PARTITION BY b.global_parent_id ORDER BY b.create_time DESC) AS rn
    FROM books b
    -- 关联输入ID列表,确保只处理你需要的ID
    JOIN input_ids i ON b.global_parent_id = i.target_id
)
-- 只取每组的第一条记录
SELECT *
FROM ranked_books
WHERE rn = 1;

方法2:兼容低版本Hive(不支持VALUES子句)

如果你的Hive版本较低,用UNION ALL构造ID列表:

WITH input_ids AS (
    SELECT 1001 AS target_id UNION ALL
    SELECT 1002 AS target_id UNION ALL
    SELECT 1003 AS target_id
),
ranked_books AS (
    SELECT 
        b.*,
        ROW_NUMBER() OVER (PARTITION BY b.global_parent_id ORDER BY b.create_time DESC) AS rn
    FROM books b
    JOIN input_ids i ON b.global_parent_id = i.target_id
)
SELECT *
FROM ranked_books
WHERE rn = 1;

可选:保留无匹配的输入ID

如果某个输入ID在books里没有对应记录,你想保留该ID并返回空值,把JOIN改成LEFT JOIN即可:

WITH input_ids AS (
    SELECT 1001 AS target_id UNION ALL
    SELECT 1002 AS target_id UNION ALL
    SELECT 1003 AS target_id
),
ranked_books AS (
    SELECT 
        i.target_id,
        b.*,
        ROW_NUMBER() OVER (PARTITION BY i.target_id ORDER BY b.create_time DESC) AS rn
    FROM input_ids i
    LEFT JOIN books b ON b.global_parent_id = i.target_id
)
SELECT 
    target_id,
    id, title, global_parent_id -- 替换成你需要的字段
FROM ranked_books
WHERE rn = 1 OR rn IS NULL;

MySQL 解决方案

方法1:窗口函数(MySQL 8.0+推荐)

逻辑和Hive一致,用窗口函数分组取唯一:

WITH input_ids AS (
    SELECT 1001 AS target_id UNION ALL
    SELECT 1002 AS target_id UNION ALL
    SELECT 1003 AS target_id
),
ranked_books AS (
    SELECT 
        b.*,
        ROW_NUMBER() OVER (PARTITION BY b.global_parent_id ORDER BY b.create_time DESC) AS rn
    FROM books b
    JOIN input_ids i ON b.global_parent_id = i.target_id
)
SELECT *
FROM ranked_books
WHERE rn = 1;

方法2:兼容MySQL 5.x(无窗口函数)

用关联子查询实现分组取唯一:

SELECT b.*
FROM books b
JOIN (
    SELECT 1001 AS target_id UNION ALL
    SELECT 1002 AS target_id UNION ALL
    SELECT 1003 AS target_id
) i ON b.global_parent_id = i.target_id
-- 子查询取每个global_parent_id对应的最新记录ID
WHERE b.id = (
    SELECT id 
    FROM books 
    WHERE global_parent_id = b.global_parent_id 
    ORDER BY create_time DESC 
    LIMIT 1
);

关键注意点

  • 排序字段可按需调整:如果不需要最新/最早,只是随便取一条,可以用ORDER BY RAND()(但性能较差,不建议大数据量使用),或者按主键id排序。
  • 必须关联输入ID列表:这样才能保证每个输入ID都被处理,避免只返回一条结果的问题。

内容的提问来源于stack exchange,提问作者user3802348

火山引擎 最新活动