You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

BigQuery中GROUP_CONCAT/STRING_AGG去重并保留顺序的技术问题

Fixing Deduplication + Order Preservation in BigQuery String Aggregation

Got it, let's work through this problem. The core pain point here is that BigQuery's Standard SQL restricts STRING_AGG(DISTINCT ... ORDER BY ...) to only ordering by the aggregated column—and Legacy SQL's UNIQUE() destroys the original occurrence order. Here's how to get both deduplication and preserved first-sequence order in Standard SQL:

Solution Approach

Instead of trying to force DISTINCT and ORDER BY into a single STRING_AGG, we first filter out duplicate device categories while retaining their first occurrence order, then aggregate the remaining unique values. We'll use window functions to mark the first time each device category appears for a user on a given date.

Updated Standard SQL Query

WITH user_device_events AS (
  SELECT 
    date,
    visitStartTime vstime,
    cd.value userId,
    totals.visits totalsvisits,
    device.deviceCategory devicecategory,
    -- Mark the first occurrence of each device category per user+date
    ROW_NUMBER() OVER (
      PARTITION BY userId, date, devicecategory 
      ORDER BY visitStartTime ASC
    ) AS occurrence_rank
  FROM `12314124123123.ga_sessions_*`, UNNEST(customDimensions) AS cd 
  WHERE cd.index = 1 
    AND cd.value = "hip|7e4fbce9-bbfb-4677-aab0-dcd02851fdb4" -- Remove for production
    AND cd.value IS NOT NULL 
  GROUP BY date, visitStartTime, userId, totals.visits, device.deviceCategory
),
unique_first_devices AS (
  SELECT *
  FROM user_device_events
  WHERE occurrence_rank = 1 -- Keep only the first time each device appears
  ORDER BY date ASC, vstime ASC -- Maintain original visit sequence
)
SELECT 
  date, 
  userId, 
  SUM(totalsvisits) AS totalvisits,
  STRING_AGG(devicecategory ORDER BY vstime ASC) AS deviceAgg
FROM unique_first_devices
GROUP BY date, userId

Breakdown of Changes

  1. Window Function for First Occurrence:
    The ROW_NUMBER() window function partitions data by userId, date, and devicecategory, then orders by visitStartTime. This assigns occurrence_rank = 1 to the first time a user encounters a device category on a specific date, and higher ranks to subsequent duplicates.

  2. Filter Duplicates:
    The unique_first_devices CTE filters out all rows where occurrence_rank > 1, leaving only the first instance of each device category in the user's visit sequence.

  3. Aggregate with Order Preservation:
    Finally, STRING_AGG uses the original vstime (visit start time) to order the unique device categories, giving you the desired sequence like tablet,mobile,desktop.

Note on Legacy SQL

If you still need to support Legacy SQL, the same logic applies—use a subquery to mark first occurrences with a ranking function, filter duplicates, then use GROUP_CONCAT on the filtered results. However, Google recommends migrating to Standard SQL for better functionality and support.

内容的提问来源于stack exchange,提问作者Meisam Ebrahimi

火山引擎 最新活动