如何基于含同义词分组的关键词表在通用文本中检索关键词?
Got it, let's walk through how to implement this keyword retrieval with synonym grouping using PostgreSQL's built-in text search tools. First, let's formalize your table structure to make the solution concrete, since the sample data you provided hints at a setup where shared codes link synonyms.
Step 1: Define & Populate Your Table
First, let's create a properly structured table (I'll add an auto-increment ID since multiple keywords share the same code):
CREATE TABLE keyword_synonyms ( id SERIAL PRIMARY KEY, code VARCHAR(10) NOT NULL, keyword VARCHAR(255) NOT NULL, tsv TSVector NOT NULL, -- Ensure no duplicate keyword-code pairs CONSTRAINT unique_code_keyword UNIQUE (code, keyword) );
Now insert your sample data (using to_tsvector to generate the tsvector values consistently):
INSERT INTO keyword_synonyms (code, keyword, tsv) VALUES ('C1000', 'AI', to_tsvector('english', 'AI')), ('C1000', 'Artificial intelligence', to_tsvector('english', 'Artificial intelligence')), ('C1001', 'Algorithms', to_tsvector('english', 'Algorithms')), ('C1002', 'Software Design', to_tsvector('english', 'Software Design')), ('C1003', 'ui design', to_tsvector('english', 'ui design')), ('C1003', 'User interface design', to_tsvector('english', 'User interface design')), ('C1003', 'user interface engineering', to_tsvector('english', 'user interface engineering'));
Step 2: Add an Index for Fast Searches
To make text searches efficient (especially with large datasets), create a GIN index on the tsv column:
CREATE INDEX idx_keyword_synonyms_tsv ON keyword_synonyms USING GIN(tsv);
Step 3: Query to Find Matching Synonym Groups
The goal is to take a general text input, find all keywords from your table that appear in it, then pull in all synonyms for those matching keywords (via their shared code). Here's a reusable query that does this:
-- Replace this with your actual input text WITH input_text AS ( SELECT 'We use AI to develop user interface design for our software projects' AS text ), -- Find all code groups that have at least one matching keyword matching_synonym_groups AS ( SELECT DISTINCT ks.code FROM keyword_synonyms ks CROSS JOIN input_text it -- Match the tsvector against a tsquery generated from the input text WHERE ks.tsv @@ plainto_tsquery('english', it.text) ) -- Fetch all keywords in the matching code groups SELECT ks.code, ks.keyword FROM keyword_synonyms ks JOIN matching_synonym_groups msg ON ks.code = msg.code ORDER BY ks.code, ks.keyword;
How This Works:
- Input Text CTE: Holds your target text (easy to swap out for different inputs).
- Matching Groups CTE: Uses PostgreSQL's
@@operator to check if the keyword's tsvector matches the input text (converted to a tsquery withplainto_tsquery, which handles stemming and stop words automatically). We grab distinct codes to get each synonym group once. - Final Select: Joins back to the main table to get all synonyms for the matching groups.
Customization Tips:
- Exact Phrase Matches: If you want to prioritize exact phrases (e.g., "user interface design" instead of just "design"), use
phraseto_tsqueryinstead ofplainto_tsquery:WHERE ks.tsv @@ phraseto_tsquery('english', it.text) - Any Word Match: If you want to match any word from the input text (not all), use
to_tsquerywith|(OR) instead ofplainto_tsquery:WHERE ks.tsv @@ to_tsquery('english', replace(it.text, ' ', ' | ')) - Case Insensitivity: PostgreSQL's text search is case-insensitive by default, so you don't need to manually lowercase inputs or keywords.
Example Output:
For the input text I used, the query would return:
code | keyword -------|--------------------------- C1000 | AI C1000 | Artificial intelligence C1002 | Software Design C1003 | ui design C1003 | User interface design C1003 | user interface engineering
This gives you every synonym for every keyword that matched your input text, grouped by their code.
内容的提问来源于stack exchange,提问作者Nicola Lepetit




