You need to enable JavaScript to run this app.
优惠活动
大模型
产品
解决方案
定价
更多
文档控制台
免费开始使用

如何基于含同义词分组的关键词表在通用文本中检索关键词?

Retrieve Synonym Keyword Groups from General Text

Got it, let's walk through how to implement this keyword retrieval with synonym grouping using PostgreSQL's built-in text search tools. First, let's formalize your table structure to make the solution concrete, since the sample data you provided hints at a setup where shared codes link synonyms.

Step 1: Define & Populate Your Table

First, let's create a properly structured table (I'll add an auto-increment ID since multiple keywords share the same code):

CREATE TABLE keyword_synonyms (
    id SERIAL PRIMARY KEY,
    code VARCHAR(10) NOT NULL,
    keyword VARCHAR(255) NOT NULL,
    tsv TSVector NOT NULL,
    -- Ensure no duplicate keyword-code pairs
    CONSTRAINT unique_code_keyword UNIQUE (code, keyword)
);

Now insert your sample data (using to_tsvector to generate the tsvector values consistently):

INSERT INTO keyword_synonyms (code, keyword, tsv) VALUES
('C1000', 'AI', to_tsvector('english', 'AI')),
('C1000', 'Artificial intelligence', to_tsvector('english', 'Artificial intelligence')),
('C1001', 'Algorithms', to_tsvector('english', 'Algorithms')),
('C1002', 'Software Design', to_tsvector('english', 'Software Design')),
('C1003', 'ui design', to_tsvector('english', 'ui design')),
('C1003', 'User interface design', to_tsvector('english', 'User interface design')),
('C1003', 'user interface engineering', to_tsvector('english', 'user interface engineering'));

Step 2: Add an Index for Fast Searches

To make text searches efficient (especially with large datasets), create a GIN index on the tsv column:

CREATE INDEX idx_keyword_synonyms_tsv ON keyword_synonyms USING GIN(tsv);

Step 3: Query to Find Matching Synonym Groups

The goal is to take a general text input, find all keywords from your table that appear in it, then pull in all synonyms for those matching keywords (via their shared code). Here's a reusable query that does this:

-- Replace this with your actual input text
WITH input_text AS (
    SELECT 'We use AI to develop user interface design for our software projects' AS text
),
-- Find all code groups that have at least one matching keyword
matching_synonym_groups AS (
    SELECT DISTINCT ks.code
    FROM keyword_synonyms ks
    CROSS JOIN input_text it
    -- Match the tsvector against a tsquery generated from the input text
    WHERE ks.tsv @@ plainto_tsquery('english', it.text)
)
-- Fetch all keywords in the matching code groups
SELECT ks.code, ks.keyword
FROM keyword_synonyms ks
JOIN matching_synonym_groups msg ON ks.code = msg.code
ORDER BY ks.code, ks.keyword;

How This Works:

  • Input Text CTE: Holds your target text (easy to swap out for different inputs).
  • Matching Groups CTE: Uses PostgreSQL's @@ operator to check if the keyword's tsvector matches the input text (converted to a tsquery with plainto_tsquery, which handles stemming and stop words automatically). We grab distinct codes to get each synonym group once.
  • Final Select: Joins back to the main table to get all synonyms for the matching groups.

Customization Tips:

  • Exact Phrase Matches: If you want to prioritize exact phrases (e.g., "user interface design" instead of just "design"), use phraseto_tsquery instead of plainto_tsquery:
    WHERE ks.tsv @@ phraseto_tsquery('english', it.text)
    
  • Any Word Match: If you want to match any word from the input text (not all), use to_tsquery with | (OR) instead of plainto_tsquery:
    WHERE ks.tsv @@ to_tsquery('english', replace(it.text, ' ', ' | '))
    
  • Case Insensitivity: PostgreSQL's text search is case-insensitive by default, so you don't need to manually lowercase inputs or keywords.

Example Output:

For the input text I used, the query would return:

code   | keyword
-------|---------------------------
C1000  | AI
C1000  | Artificial intelligence
C1002  | Software Design
C1003  | ui design
C1003  | User interface design
C1003  | user interface engineering

This gives you every synonym for every keyword that matched your input text, grouped by their code.

内容的提问来源于stack exchange,提问作者Nicola Lepetit

火山引擎 最新活动