You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

如何编写Hive SQL实现列数据拆分并生成多条记录(不使用默认函数)

Hive SQL Solution to Split Multiple Columns Without Built-in Split/Explode Functions

Got it, let's work through this problem step by step. You need to split both the colon-separated Phone column and comma-separated Address column, then pair each split value with the original Name to create individual rows—all without relying on Hive's standard split/explode functions. Here's a practical approach:

Step 1: Create a Number Sequence Helper

First, we need a way to target each split position (1st, 2nd, 3rd value) since your sample data has exactly 3 entries per column. We'll use a CTE to generate this sequence:

WITH number_seq AS (
    SELECT 1 AS idx UNION ALL
    SELECT 2 AS idx UNION ALL
    SELECT 3 AS idx
)

If your actual data has more than 3 split values, just add more SELECT n AS idx UNION ALL lines to cover the maximum number of splits you expect.

Step 2: Main Query to Extract Split Values

Next, we'll join our original table with this number sequence, then use basic string functions to extract the nth value from each delimited column:

WITH number_seq AS (
    SELECT 1 AS idx UNION ALL
    SELECT 2 AS idx UNION ALL
    SELECT 3 AS idx
)
SELECT 
    original.Name,
    -- Extract the nth Phone value (colon-separated)
    TRIM(
        SUBSTRING(
            original.Phone,
            -- Start position: 1 for first value, after (idx-1)th colon for others
            CASE idx
                WHEN 1 THEN 1
                ELSE INSTR(original.Phone, ':', 1, idx-1) + 1
            END,
            -- Length of the substring: from start to next colon (or end of string)
            CASE idx
                WHEN 3 THEN LENGTH(original.Phone) - INSTR(original.Phone, ':', 1, 2)
                ELSE INSTR(original.Phone, ':', 1, idx) - CASE idx WHEN 1 THEN 0 ELSE INSTR(original.Phone, ':', 1, idx-1) END - 1
            END
        )
    ) AS Phone,
    -- Extract the nth Address value (comma-separated)
    TRIM(
        SUBSTRING(
            original.Address,
            -- Start position: 1 for first value, after (idx-1)th comma for others
            CASE idx
                WHEN 1 THEN 1
                ELSE INSTR(original.Address, ',', 1, idx-1) + 1
            END,
            -- Length of the substring: from start to next comma (or end of string)
            CASE idx
                WHEN 3 THEN LENGTH(original.Address) - INSTR(original.Address, ',', 1, 2)
                ELSE INSTR(original.Address, ',', 1, idx) - CASE idx WHEN 1 THEN 0 ELSE INSTR(original.Address, ',', 1, idx-1) END - 1
            END
        )
    ) AS Address
FROM 
    your_table_name original
CROSS JOIN 
    number_seq
WHERE 
    -- Filter out rows where the nth value doesn't exist (for variable-length splits)
    (idx = 1 OR INSTR(original.Phone, ':', 1, idx-1) > 0)
    AND (idx = 1 OR INSTR(original.Address, ',', 1, idx-1) > 0)
ORDER BY 
    original.Name, idx;

How This Works

  • Number Sequence: The number_seq CTE gives us an index for each split position, letting us process each value in the delimited columns one by one.
  • String Extraction:
    • INSTR finds the position of the nth delimiter (colon for Phone, comma for Address).
    • SUBSTRING uses those positions to pull out the substring between the previous delimiter (or start of the string) and the next delimiter (or end of the string).
    • TRIM cleans up any accidental whitespace (a safe guard for real-world data).
  • Filtering: The WHERE clause ensures we don't generate rows for split positions that don't exist in the original data (e.g., if a row only has 2 Phone numbers, we won't create a 3rd row for it).

Notes for Adaptation

  • Replace your_table_name with the actual name of your Hive table containing the input data.
  • If your data has a variable number of splits (some rows have 2, some have 4, etc.), expand the number_seq CTE to cover the maximum number of splits you'll encounter. The WHERE clause will automatically exclude any unused indices for rows with fewer splits.

内容的提问来源于stack exchange,提问作者Manindar

火山引擎 最新活动