如何在Apache Druid的MSQ去重查询中保留多值列格式?
问题背景与现象
我正在使用Apache Druid,已引入ingestionTimestamp作为第二时间戳列支持去重任务,同时存在tagSetA、tagSetB等多值VARCHAR类型列。
当前通过以下MSQ(Multi-Stage Query)执行去重:
REPLACE INTO "target-datasource" OVERWRITE ALL SELECT __time, LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId", LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName", LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA", LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB", MAX("ingestionTimestamp") AS ingestionTimestamp FROM "target-datasource" GROUP BY __time, "entityUID" PARTITIONED BY 'P1M';
执行该查询后,tagSetA、tagSetB等标签列不再保持多值格式,导致依赖其多值特性的下游查询失效。
我的理解:MSQ在使用LATEST_BY函数时,可能无法直接保留多值列的格式。
问询:如何在执行此类去重查询的同时保留多值列的格式?Druid中有推荐的方法或Workaround吗?
解决方案与Workaround
方法1:使用LATEST_BY的数组重载版本
Druid的LATEST_BY函数支持直接针对多值列(数组类型)的重载,无需额外转换。将查询中的LATEST_BY调用改为直接传入数值型的ingestionTimestamp,而非转换后的时间戳字符串,函数会正确保留数组结构:
REPLACE INTO "target-datasource" OVERWRITE ALL SELECT __time, LATEST_BY("entityId", "ingestionTimestamp") AS "entityId", LATEST_BY("entityName", "ingestionTimestamp") AS "entityName", LATEST_BY("tagSetA", "ingestionTimestamp") AS "tagSetA", LATEST_BY("tagSetB", "ingestionTimestamp") AS "tagSetB", MAX("ingestionTimestamp") AS ingestionTimestamp FROM "target-datasource" GROUP BY __time, "entityUID" PARTITIONED BY 'P1M';
原因:MILLIS_TO_TIMESTAMP会把数值型时间戳转为字符串,LATEST_BY处理字符串类型时会将多值数组当成普通字符串解析,破坏原有结构。直接传入原始数值型时间戳,函数会识别并保留数组类型。
方法2:手动序列化与反序列化多值列
如果方法1不生效,可以手动将多值列序列化为JSON字符串,经LATEST_BY处理后再反序列化为数组:
REPLACE INTO "target-datasource" OVERWRITE ALL SELECT __time, LATEST_BY("entityId", "ingestionTimestamp") AS "entityId", LATEST_BY("entityName", "ingestionTimestamp") AS "entityName", JSON_PARSE(LATEST_BY(JSON_STRINGIFY("tagSetA"), "ingestionTimestamp")) AS "tagSetA", JSON_PARSE(LATEST_BY(JSON_STRINGIFY("tagSetB"), "ingestionTimestamp")) AS "tagSetB", MAX("ingestionTimestamp") AS ingestionTimestamp FROM "target-datasource" GROUP BY __time, "entityUID" PARTITIONED BY 'P1M';
注意:需确保Druid查询配置启用了JSON函数支持,且多值列内容符合JSON数组格式。
方法3:改用摄取阶段去重
如果MSQ查询的去重方式难以兼容多值列,可考虑在数据摄取阶段完成去重:
- 在摄取规范中配置
dedup策略,以ingestionTimestamp作为排序键,保留最新版本的记录 - 这种方式能天然保留所有列的原始类型(包括多值列),避免查询阶段的类型转换问题
内容的提问来源于stack exchange,提问作者Aldulea Cristian




