BigQuery中GROUP_CONCAT/STRING_AGG去重并保留顺序的技术问题
Got it, let's work through this problem. The core pain point here is that BigQuery's Standard SQL restricts STRING_AGG(DISTINCT ... ORDER BY ...) to only ordering by the aggregated column—and Legacy SQL's UNIQUE() destroys the original occurrence order. Here's how to get both deduplication and preserved first-sequence order in Standard SQL:
Solution Approach
Instead of trying to force DISTINCT and ORDER BY into a single STRING_AGG, we first filter out duplicate device categories while retaining their first occurrence order, then aggregate the remaining unique values. We'll use window functions to mark the first time each device category appears for a user on a given date.
Updated Standard SQL Query
WITH user_device_events AS ( SELECT date, visitStartTime vstime, cd.value userId, totals.visits totalsvisits, device.deviceCategory devicecategory, -- Mark the first occurrence of each device category per user+date ROW_NUMBER() OVER ( PARTITION BY userId, date, devicecategory ORDER BY visitStartTime ASC ) AS occurrence_rank FROM `12314124123123.ga_sessions_*`, UNNEST(customDimensions) AS cd WHERE cd.index = 1 AND cd.value = "hip|7e4fbce9-bbfb-4677-aab0-dcd02851fdb4" -- Remove for production AND cd.value IS NOT NULL GROUP BY date, visitStartTime, userId, totals.visits, device.deviceCategory ), unique_first_devices AS ( SELECT * FROM user_device_events WHERE occurrence_rank = 1 -- Keep only the first time each device appears ORDER BY date ASC, vstime ASC -- Maintain original visit sequence ) SELECT date, userId, SUM(totalsvisits) AS totalvisits, STRING_AGG(devicecategory ORDER BY vstime ASC) AS deviceAgg FROM unique_first_devices GROUP BY date, userId
Breakdown of Changes
Window Function for First Occurrence:
TheROW_NUMBER()window function partitions data byuserId,date, anddevicecategory, then orders byvisitStartTime. This assignsoccurrence_rank = 1to the first time a user encounters a device category on a specific date, and higher ranks to subsequent duplicates.Filter Duplicates:
Theunique_first_devicesCTE filters out all rows whereoccurrence_rank > 1, leaving only the first instance of each device category in the user's visit sequence.Aggregate with Order Preservation:
Finally,STRING_AGGuses the originalvstime(visit start time) to order the unique device categories, giving you the desired sequence liketablet,mobile,desktop.
Note on Legacy SQL
If you still need to support Legacy SQL, the same logic applies—use a subquery to mark first occurrences with a ranking function, filter duplicates, then use GROUP_CONCAT on the filtered results. However, Google recommends migrating to Standard SQL for better functionality and support.
内容的提问来源于stack exchange,提问作者Meisam Ebrahimi




