SAP HANA SQL正则匹配聚合问题:提取全部匹配项及多格式兼容
Hey Jesse, let's break down your problems one by one and give you workable solutions across SAP HANA, SAS, and Python since you mentioned you have access to all three:
HANA’s SUBSTR_REGEXPR only returns the first match by default, so we’ll use sequence generation and aggregation to grab multiple hits per row.
1. Extract up to 6 regex matches per row and aggregate
To capture up to 6 matches and join them into a clean string, we’ll generate a 1-6 sequence (for your max match limit), extract each occurrence, strip unwanted spaces, filter out empty results, then use STRING_AGG to combine everything:
WITH numbered_matches AS ( SELECT "Primary Key", "Text", -- Strip spaces from matched values to get clean output like ST43434 REGEXP_REPLACE( SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n), '\s', '' ) AS match FROM YourTable CROSS JOIN SERIES_GENERATE_INTEGER(1, 6) AS s(n) -- Generate 1-6 to fetch up to 6 matches WHERE SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n) IS NOT NULL ) SELECT "Primary Key", "Text", STRING_AGG(match, ', ') AS "Location" FROM numbered_matches GROUP BY "Primary Key", "Text";
For your sample text msdfmsfmdf PT2222, ST 43434 asdasdas, this will return PT2222, ST43434 exactly as you want.
2. Extract numbers only after the "Locations" prefix
Your original regex didn’t properly anchor to the "Locations" prefix, leading to false matches. We’ll first isolate the segment right after "Locations", then pull all valid 1-2 digit numbers from that segment (ignoring unrelated numbers like .282 or 002):
WITH locations_segment AS ( SELECT "Primary Key", "Text", -- Grab everything after "Locations " until we hit a space, dot, or end of text SUBSTR_REGEXPR('Locations\s+(.*?)(?=\s|\.|$)' IN "Text") AS loc_segment FROM YourTable ), number_extracts AS ( SELECT "Primary Key", "Text", SUBSTR_REGEXPR('(\d{1,2})' IN loc_segment OCCURRENCE n) AS num_match FROM locations_segment CROSS JOIN SERIES_GENERATE_INTEGER(1,5) AS s(n) -- Match up to 5 numbers WHERE SUBSTR_REGEXPR('(\d{1,2})' IN loc_segment OCCURRENCE n) IS NOT NULL ) SELECT "Primary Key", "Text", STRING_AGG(num_match, ', ') AS "Location" FROM number_extracts GROUP BY "Primary Key", "Text";
For your sample text Locations 1, 2, 35 & 5 lkfaskjdlsaf .282 lkfdsklfjlkdsj 002, this will return 1, 2, 35, 5 (excluding the unwanted numbers).
3. Combine two regex patterns with OR logic
Your initial OR syntax was incorrect—HANA expects regex alternation using | inside the pattern. We’ll combine both patterns, extract all valid matches, then aggregate them:
WITH combined_matches AS ( SELECT "Primary Key", "Text", -- Clean PT/ST matches and keep valid numbers from Locations section CASE WHEN SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n) IS NOT NULL THEN REGEXP_REPLACE(SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})' IN "Text" OCCURRENCE n), '\s', '') ELSE SUBSTR_REGEXPR('(\d{1,2})' IN "Text" OCCURRENCE n) END AS combined_match FROM YourTable CROSS JOIN SERIES_GENERATE_INTEGER(1,6) AS s(n) WHERE SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\s?\d{2,6})|(\d{1,2})' IN "Text" OCCURRENCE n) IS NOT NULL -- Optional: Ensure numbers only come from the Locations section AND (combined_match LIKE '[PpSs][Tt]%' OR EXISTS ( SELECT 1 FROM YourTable t2 WHERE t2."Primary Key" = YourTable."Primary Key" AND t2."Text" LIKE '%Locations%' )) ) SELECT "Primary Key", "Text", STRING_AGG(DISTINCT combined_match, ', ') AS "Location" -- Avoid duplicate matches FROM combined_matches GROUP BY "Primary Key", "Text";
This will capture both PT/ST-style codes and valid numbers from the "Locations" section in one aggregated string.
SAS uses PRX functions for regex handling—we’ll loop through matches with PRXNEXT and aggregate results with CATX.
1. Extract multiple PT/ST-style matches
data want; set your_table; length Location $200; retain Location ''; rx_id = prxparse('/([PpSs][Tt][Ss]?\w?\s?\d{2,6})/'); start = 1; call prxnext(rx_id, start, length(Text), Text, pos, len); do while(pos > 0); match = compress(substr(Text, pos, len)); -- Strip spaces from matches Location = catx(', ', Location, match); call prxnext(rx_id, start, length(Text), Text, pos, len); if countw(Location, ', ') >=6 then leave; -- Stop after 6 matches end; drop rx_id start pos len match; run;
2. Extract numbers after "Locations"
data want; set your_table; length Location $200 loc_segment $100; retain Location ''; -- Isolate the segment right after "Locations" rx_loc = prxparse('/Locations\s+(.*?)(?=\s|\.|$)/'); if prxmatch(rx_loc, Text) then do; loc_segment = prxposn(rx_loc, 1, Text); -- Extract valid numbers from the segment rx_num = prxparse('/(\d{1,2})/'); start = 1; call prxnext(rx_num, start, length(loc_segment), loc_segment, pos, len); do while(pos > 0); match = substr(loc_segment, pos, len); Location = catx(', ', Location, match); call prxnext(rx_num, start, length(loc_segment), loc_segment, pos, len); end; drop rx_num start pos len match; end; drop rx_loc loc_segment; run;
3. Combine both patterns
data want; set your_table; length Location $200; retain Location ''; -- Combine both regex patterns into one rx_combined = prxparse('/([PpSs][Tt][Ss]?\w?\s?\d{2,6})|(\d{1,2})/'); start = 1; call prxnext(rx_combined, start, length(Text), Text, pos, len); do while(pos > 0); match = compress(substr(Text, pos, len)); -- Only keep numbers if they belong to the Locations section if not prxmatch('/^\d+$/', match) or index(Text, 'Locations') > 0 then do; Location = catx(', ', Location, match); end; call prxnext(rx_combined, start, length(Text), Text, pos, len); if countw(Location, ', ') >=6 then leave; end; drop rx_combined start pos len match; run;
Python’s re.findall makes extracting all regex matches trivial, and pandas simplifies applying this logic to your dataset. This is likely the fastest and most flexible option for you, especially compared to Tableau’s slow aggregation.
1. Extract up to 6 PT/ST-style matches
import re import pandas as pd # Load your data (adjust based on your source) df = pd.read_csv('your_table.csv') def extract_locations(text): # Find all PT/ST-style matches, strip spaces, take first 6 matches = [re.sub(r'\s', '', m) for m in re.findall(r'([PpSs][Tt][Ss]?\w?\s?\d{2,6})', text)] return ', '.join(matches[:6]) df['Location'] = df['Text'].apply(extract_locations)
2. Extract numbers after "Locations"
def extract_loc_numbers(text): # Isolate the segment after "Locations" loc_match = re.search(r'Locations\s+(.*?)(?=\s|\.|$)', text) if loc_match: loc_segment = loc_match.group(1) # Extract all valid 1-2 digit numbers from the segment numbers = re.findall(r'(\d{1,2})', loc_segment) return ', '.join(numbers) return '' df['Location'] = df['Text'].apply(extract_loc_numbers)
3. Combine both patterns
def extract_combined(text): # Get cleaned PT/ST matches pt_st_matches = [re.sub(r'\s', '', m) for m in re.findall(r'([PpSs][Tt][Ss]?\w?\s?\d{2,6})', text)] # Get numbers from the Locations segment loc_numbers = [] loc_match = re.search(r'Locations\s+(.*?)(?=\s|\.|$)', text) if loc_match: loc_numbers = re.findall(r'(\d{1,2})', loc_match.group(1)) # Combine results and take up to 6 matches combined = pt_st_matches + loc_numbers return ', '.join(combined[:6]) df['Location'] = df['Text'].apply(extract_combined)
内容的提问来源于stack exchange,提问作者Jesse




