如何在Apache Druid的MSQ去重查询中保留多值列格式？

阿华AIGC实验室

2026-6-12

问题背景与现象

我正在使用Apache Druid，已引入ingestionTimestamp作为第二时间戳列支持去重任务，同时存在tagSetA、tagSetB等多值VARCHAR类型列。

当前通过以下MSQ（Multi-Stage Query）执行去重：

REPLACE INTO "target-datasource" 
OVERWRITE ALL

  SELECT 
     __time,
     LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
     LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
     LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
     LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
     MAX("ingestionTimestamp") AS ingestionTimestamp
  FROM "target-datasource"
  GROUP BY 
       __time, 
       "entityUID"
PARTITIONED BY 'P1M';

执行该查询后，tagSetA、tagSetB等标签列不再保持多值格式，导致依赖其多值特性的下游查询失效。

我的理解：MSQ在使用LATEST_BY函数时，可能无法直接保留多值列的格式。

问询：如何在执行此类去重查询的同时保留多值列的格式？Druid中有推荐的方法或Workaround吗？

解决方案与Workaround

方法1：使用`LATEST_BY`的数组重载版本

Druid的LATEST_BY函数支持直接针对多值列（数组类型）的重载，无需额外转换。将查询中的LATEST_BY调用改为直接传入数值型的ingestionTimestamp，而非转换后的时间戳字符串，函数会正确保留数组结构：

REPLACE INTO "target-datasource" 
OVERWRITE ALL

  SELECT 
     __time,
     LATEST_BY("entityId", "ingestionTimestamp") AS "entityId",
     LATEST_BY("entityName", "ingestionTimestamp") AS "entityName",
     LATEST_BY("tagSetA", "ingestionTimestamp") AS "tagSetA",
     LATEST_BY("tagSetB", "ingestionTimestamp") AS "tagSetB",
     MAX("ingestionTimestamp") AS ingestionTimestamp
  FROM "target-datasource"
  GROUP BY 
       __time, 
       "entityUID"
PARTITIONED BY 'P1M';

原因：MILLIS_TO_TIMESTAMP会把数值型时间戳转为字符串，LATEST_BY处理字符串类型时会将多值数组当成普通字符串解析，破坏原有结构。直接传入原始数值型时间戳，函数会识别并保留数组类型。

方法2：手动序列化与反序列化多值列

如果方法1不生效，可以手动将多值列序列化为JSON字符串，经LATEST_BY处理后再反序列化为数组：

REPLACE INTO "target-datasource" 
OVERWRITE ALL

  SELECT 
     __time,
     LATEST_BY("entityId", "ingestionTimestamp") AS "entityId",
     LATEST_BY("entityName", "ingestionTimestamp") AS "entityName",
     JSON_PARSE(LATEST_BY(JSON_STRINGIFY("tagSetA"), "ingestionTimestamp")) AS "tagSetA",
     JSON_PARSE(LATEST_BY(JSON_STRINGIFY("tagSetB"), "ingestionTimestamp")) AS "tagSetB",
     MAX("ingestionTimestamp") AS ingestionTimestamp
  FROM "target-datasource"
  GROUP BY 
       __time, 
       "entityUID"
PARTITIONED BY 'P1M';

注意：需确保Druid查询配置启用了JSON函数支持，且多值列内容符合JSON数组格式。