SAP HANA SQL正则匹配聚合问题：提取全部匹配项及多格式兼容

阿华AIGC实验室

2026-5-11

Hey Jesse, let's break down your problems one by one and give you workable solutions across SAP HANA, SAS, and Python since you mentioned you have access to all three:

Solutions for SAP HANA

HANA’s SUBSTR_REGEXPR only returns the first match by default, so we’ll use sequence generation and aggregation to grab multiple hits per row.

1. Extract up to 6 regex matches per row and aggregate

To capture up to 6 matches and join them into a clean string, we’ll generate a 1-6 sequence (for your max match limit), extract each occurrence, strip unwanted spaces, filter out empty results, then use STRING_AGG to combine everything:

WITH numbered_matches AS (
    SELECT 
        "Primary Key",
        "Text",
        -- Strip spaces from matched values to get clean output like ST43434
        REGEXP_REPLACE(
            SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n),
            '\s', ''
        ) AS match
    FROM 
        YourTable
    CROSS JOIN 
        SERIES_GENERATE_INTEGER(1, 6) AS s(n) -- Generate 1-6 to fetch up to 6 matches
    WHERE 
        SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n) IS NOT NULL
)
SELECT 
    "Primary Key",
    "Text",
    STRING_AGG(match, ', ') AS "Location"
FROM numbered_matches
GROUP BY "Primary Key", "Text";

For your sample text msdfmsfmdf PT2222, ST 43434 asdasdas, this will return PT2222, ST43434 exactly as you want.

2. Extract numbers only after the "Locations" prefix

Your original regex didn’t properly anchor to the "Locations" prefix, leading to false matches. We’ll first isolate the segment right after "Locations", then pull all valid 1-2 digit numbers from that segment (ignoring unrelated numbers like .282 or 002):

WITH locations_segment AS (
    SELECT 
        "Primary Key",
        "Text",
        -- Grab everything after "Locations " until we hit a space, dot, or end of text
        SUBSTR_REGEXPR('Locations\s+(.*?)(?=\s|\.|$)' IN "Text") AS loc_segment
    FROM YourTable
),
number_extracts AS (
    SELECT 
        "Primary Key",
        "Text",
        SUBSTR_REGEXPR('(\d{1,2})' IN loc_segment OCCURRENCE n) AS num_match
    FROM locations_segment
    CROSS JOIN SERIES_GENERATE_INTEGER(1,5) AS s(n) -- Match up to 5 numbers
    WHERE SUBSTR_REGEXPR('(\d{1,2})' IN loc_segment OCCURRENCE n) IS NOT NULL
)
SELECT 
    "Primary Key",
    "Text",
    STRING_AGG(num_match, ', ') AS "Location"
FROM number_extracts
GROUP BY "Primary Key", "Text";

For your sample text Locations 1, 2, 35 & 5 lkfaskjdlsaf .282 lkfdsklfjlkdsj 002, this will return 1, 2, 35, 5 (excluding the unwanted numbers).

3. Combine two regex patterns with OR logic

Your initial OR syntax was incorrect—HANA expects regex alternation using | inside the pattern. We’ll combine both patterns, extract all valid matches, then aggregate them:

WITH combined_matches AS (
    SELECT 
        "Primary Key",
        "Text",
        -- Clean PT/ST matches and keep valid numbers from Locations section
        CASE 
            WHEN SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n) IS NOT NULL
                THEN REGEXP_REPLACE(SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n), '\s', '')
            ELSE SUBSTR_REGEXPR('(\d{1,2})' IN "Text" OCCURRENCE n)
        END AS combined_match
    FROM YourTable
    CROSS JOIN SERIES_GENERATE_INTEGER(1,6) AS s(n)
    WHERE 
        SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})|(\d{1,2})' IN "Text" OCCURRENCE n) IS NOT NULL
        -- Optional: Ensure numbers only come from the Locations section
        AND (combined_match LIKE '[PpSs][Tt]%' OR EXISTS (
            SELECT 1 FROM YourTable t2 
            WHERE t2."Primary Key" = YourTable."Primary Key" 
            AND t2."Text" LIKE '%Locations%'
        ))
)
SELECT 
    "Primary Key",
    "Text",
    STRING_AGG(DISTINCT combined_match, ', ') AS "Location" -- Avoid duplicate matches
FROM combined_matches
GROUP BY "Primary Key", "Text";

This will capture both PT/ST-style codes and valid numbers from the "Locations" section in one aggregated string.

Solutions for SAS

SAS uses PRX functions for regex handling—we’ll loop through matches with PRXNEXT and aggregate results with CATX.

1. Extract multiple PT/ST-style matches

data want;
    set your_table;
    length Location $200;
    retain Location '';
    rx_id = prxparse('/([PpSs][Tt][Ss]?\w?\s?\d{2,6})/');
    start = 1;
    call prxnext(rx_id, start, length(Text), Text, pos, len);
    do while(pos > 0);
        match = compress(substr(Text, pos, len)); -- Strip spaces from matches
        Location = catx(', ', Location, match);
        call prxnext(rx_id, start, length(Text), Text, pos, len);
        if countw(Location, ', ') >=6 then leave; -- Stop after 6 matches
    end;
    drop rx_id start pos len match;
run;

2. Extract numbers after "Locations"

data want;
    set your_table;
    length Location $200 loc_segment $100;
    retain Location '';
    -- Isolate the segment right after "Locations"
    rx_loc = prxparse('/Locations\s+(.*?)(?=\s|\.|$)/');
    if prxmatch(rx_loc, Text) then do;
        loc_segment = prxposn(rx_loc, 1, Text);
        -- Extract valid numbers from the segment
        rx_num = prxparse('/(\d{1,2})/');
        start = 1;
        call prxnext(rx_num, start, length(loc_segment), loc_segment, pos, len);
        do while(pos > 0);
            match = substr(loc_segment, pos, len);
            Location = catx(', ', Location, match);
            call prxnext(rx_num, start, length(loc_segment), loc_segment, pos, len);
        end;
        drop rx_num start pos len match;
    end;
    drop rx_loc loc_segment;
run;

3. Combine both patterns

data want;
    set your_table;
    length Location $200;
    retain Location '';
    -- Combine both regex patterns into one
    rx_combined = prxparse('/([PpSs][Tt][Ss]?\w?\s?\d{2,6})|(\d{1,2})/');
    start = 1;
    call prxnext(rx_combined, start, length(Text), Text, pos, len);
    do while(pos > 0);
        match = compress(substr(Text, pos, len));
        -- Only keep numbers if they belong to the Locations section
        if not prxmatch('/^\d+$/', match) or index(Text, 'Locations') > 0 then do;
            Location = catx(', ', Location, match);
        end;
        call prxnext(rx_combined, start, length(Text), Text, pos, len);
        if countw(Location, ', ') >=6 then leave;
    end;
    drop rx_combined start pos len match;
run;

Solutions for Python

Python’s re.findall makes extracting all regex matches trivial, and pandas simplifies applying this logic to your dataset. This is likely the fastest and most flexible option for you, especially compared to Tableau’s slow aggregation.

1. Extract up to 6 PT/ST-style matches

import re
import pandas as pd

# Load your data (adjust based on your source)
df = pd.read_csv('your_table.csv')

def extract_locations(text):
    # Find all PT/ST-style matches, strip spaces, take first 6
    matches = [re.sub(r'\s', '', m) for m in re.findall(r'([PpSs][Tt][Ss]?\w?\s?\d{2,6})', text)]
    return ', '.join(matches[:6])

df['Location'] = df['Text'].apply(extract_locations)

2. Extract numbers after "Locations"

def extract_loc_numbers(text):
    # Isolate the segment after "Locations"
    loc_match = re.search(r'Locations\s+(.*?)(?=\s|\.|$)', text)
    if loc_match:
        loc_segment = loc_match.group(1)
        # Extract all valid 1-2 digit numbers from the segment
        numbers = re.findall(r'(\d{1,2})', loc_segment)
        return ', '.join(numbers)
    return ''

df['Location'] = df['Text'].apply(extract_loc_numbers)

3. Combine both patterns

def extract_combined(text):
    # Get cleaned PT/ST matches
    pt_st_matches = [re.sub(r'\s', '', m) for m in re.findall(r'([PpSs][Tt][Ss]?\w?\s?\d{2,6})', text)]
    # Get numbers from the Locations segment
    loc_numbers = []
    loc_match = re.search(r'Locations\s+(.*?)(?=\s|\.|$)', text)
    if loc_match:
        loc_numbers = re.findall(r'(\d{1,2})', loc_match.group(1))
    # Combine results and take up to 6 matches
    combined = pt_st_matches + loc_numbers
    return ', '.join(combined[:6])

df['Location'] = df['Text'].apply(extract_combined)

内容的提问来源于stack exchange，提问作者Jesse