searchByText

Last updated: 2024.04.16 13:11:53

First published: 2023.12.08 10:47:34

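Overview

searchByText performs retrieval over unstructured data: the vector database accepts raw unstructured data, so text can be searched directly with text. When a query text is submitted, the vector database measures the distance between texts to determine how similar they are, and returns each result together with its similarity score. Typical use cases include duplicate detection, text search and matching, and question answering.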

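Note

Currently only text-type unstructured data is supported.
After data is written to or deleted from a Collection, the Index can take up to 20 s to reflect the change, so the change may not be immediately visible to searches on the Index.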
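
Because of the index lag described in the note, a search issued right after a write may not see the new data yet. The following is a minimal sketch of one way to cope with that: retry a query until it returns an acceptable result or a timeout elapses. The class name IndexLagPoller and the method pollUntil are hypothetical helpers for illustration, not part of the SDK.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper (not part of the SDK): retries a query until the
// result is accepted or the timeout elapses.
final class IndexLagPoller {
    private IndexLagPoller() {}

    /**
     * Calls query repeatedly, sleeping for interval between attempts.
     * Returns the first result that accept approves, or Optional.empty()
     * on timeout or interruption. accept must reject null results.
     */
    static <T> Optional<T> pollUntil(Supplier<T> query, Predicate<T> accept,
                                     Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T result = query.get();
            if (accept.test(result)) {
                return Optional.of(result);
            }
            // Stop if another sleep would run past the deadline.
            if (deadline - System.nanoTime() < interval.toNanos()) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }
}
```

With the SDK, the query argument could be a lambda that calls index.searchByText(...) and the accept argument a check such as "the list is not empty", with a timeout somewhat above 20 s. If searchByText declares a checked exception, the lambda would need to wrap it.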

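Prerequisites

When the dataset was created with the createCollection interface, the fields definition included a text field with a pipelineName.
When data was written with the upsertData interface, the field name and field value of the text-type field with a pipelineName were included.
When the index was created with createIndex, a vectorIndex (vector index) was created.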

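Request parameters

The request parameter is a SearchByTextParam. The parameters of a SearchByTextParam instance are listed in the table below.

Parameter	Type	Required	Default	Description
text	string	Yes		The input text to search with.
filter	map	No		Filter condition; see the filter expression description. Empty by default, meaning no filtering. Filter conditions use five query operators (must, must_not, range, range_out, georange) and two logical operators (and, or) that combine them.
limit	int	No	10	Number of results to return, at most 5000.
dense_weight	float	No	0.5	For hybrid retrieval, dense_weight controls the weight of the dense vector in the search. The range is [0.2, 1]. Only takes effect when the index being searched is a hybrid index.
outputFields	list<string>	No		Fields to return: a list of scalar or vector fields to include in the results. If outputFields is omitted, all scalar fields are returned and vector fields are not. If outputFields is an empty list, the fields attribute is not returned. If outputFields is malformed or names a field that is not in the collection, the interface returns an error. If the index uses the cosine distance, returned vector fields contain the normalized vectors.
partition	string/int	No	"default"	Sub-index name. Its type matches the fieldType of partitionBy, and its value corresponds to a fieldValue of partitionBy. When fieldType is int64 or list<int64>, partition is an int64. When fieldType is string or list<string>, partition is a string that must match "^[a-zA-Z0-9._]+$".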
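
The constraints in the table above can be checked on the client before a request is sent. The sketch below is one possible way to do that; the class SearchByTextParamChecks and its validate method are hypothetical and not part of the SDK. The upper bound on limit, the dense_weight range, and the partition pattern come from the table; the lower bound of 1 for limit is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical client-side checks (not part of the SDK) for the documented
// constraints on limit, dense_weight, and a string-typed partition.
final class SearchByTextParamChecks {
    static final int MAX_LIMIT = 5000;            // from the parameter table
    static final double MIN_DENSE_WEIGHT = 0.2;   // from the parameter table
    static final double MAX_DENSE_WEIGHT = 1.0;   // from the parameter table
    static final Pattern PARTITION_NAME = Pattern.compile("^[a-zA-Z0-9._]+$");

    private SearchByTextParamChecks() {}

    /** Returns a list of human-readable problems; an empty list means no problem was found. */
    static List<String> validate(String text, int limit, double denseWeight, String partition) {
        List<String> problems = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            problems.add("text is required");
        }
        // Assumption: limit must be at least 1; the table only states the maximum.
        if (limit < 1 || limit > MAX_LIMIT) {
            problems.add("limit must be between 1 and " + MAX_LIMIT);
        }
        if (Double.isNaN(denseWeight)
                || denseWeight < MIN_DENSE_WEIGHT || denseWeight > MAX_DENSE_WEIGHT) {
            problems.add("dense_weight must be in [0.2, 1]");
        }
        if (partition == null || !PARTITION_NAME.matcher(partition).matches()) {
            problems.add("partition must match " + PARTITION_NAME.pattern());
        }
        return problems;
    }
}
```

For example, validate("hello", 10, 0.5, "default") reports no problems, while a limit of 6000 or a partition name containing a space is reported before any network call is made.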

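filter expressions

Operator	Description	Example
must	Applies to the specified field; the value must be in [...], i.e. "must in".	`{ "op": "must", "field": "region", "conds": ["cn", "sg"] }`
must_not	Applies to the specified field; the value must not be in [...], i.e. "must not in".	`{ "op": "must_not", "field": "data_type", "conds": [1,2,3] }`
range	Applies to the specified field; the value must fall within the given range. Use `gte` (greater than or equal), `gt` (greater than), `lte` (less than or equal), and `lt` (less than) to bound a one-dimensional range. A two-dimensional circular range can also be expressed with `center` and `radius`.	`// price in [100.0, 500.0) { "op": "range", "field": "price", "gte": 100.0, "lt": 500.0 } // price >= 100.0 { "op": "range", "field": "price", "gte": 100.0 } // inside the circle centered at center with radius 50 { "op": "range", "field": ["pos_x", "pos_y"], "center": [100.0, 123.4], "radius": 50.0 }`
range_out	Applies to the specified field; the value must fall outside the given range. Use `gte` (greater than or equal), `gt` (greater than), `lte` (less than or equal), and `lt` (less than) to bound a one-dimensional range.	`// select items priced below 100 or above 500 { "op": "range_out", "field": "price", "gt": 500.0, "lt": 100.0 }`
georange	Filters by geographic distance. Specify the longitude and latitude fields; data whose surface distance from center is within radius is selected.	`// within surface distance radius of center { "op": "georange", "field": ["longitude", "latitude"], "center": [100.12312, 22.4324], "radius": 50.0 }`
and	Logical operator; takes the intersection of multiple conditions.	`{ "op": "and", // operator name "conds": [ // condition list; nested logical operators and must/must_not operators are supported { "op": "must", "field": "type", "conds": [1] }, { ... // any number (>= 1) of conditions can be combined } ] }`
or	Logical operator; takes the union of multiple conditions.	`{ "op": "or", // operator name "conds": [ // condition list; nested logical operators and must/must_not operators are supported { "op": "must", "field": "type", "conds": [1] }, { ... // any number (>= 1) of conditions can be combined } ] }`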
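
The nested and/or structure in the table can be assembled in Java from plain maps and lists. The sketch below builds "region in (cn, sg) and (type in (1) or data_type not in (1, 2, 3))" using only the must, must_not, and, and or operators described above. The class FilterSketch and its methods are hypothetical helpers, not part of the SDK; the field names are taken from the table's examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helpers (not part of the SDK) that build filter maps
// in the shape shown in the filter expression table.
final class FilterSketch {
    private FilterSketch() {}

    /** Leaf condition for "must" or "must_not": field value (not) in the given values. */
    static Map<String, Object> membership(String op, String field, Object... values) {
        Map<String, Object> cond = new LinkedHashMap<>();
        cond.put("op", op);
        cond.put("field", field);
        cond.put("conds", new ArrayList<>(Arrays.asList(values)));
        return cond;
    }

    /** Logical node for "and" or "or" over a list of conditions. */
    static Map<String, Object> logical(String op, List<Map<String, Object>> conds) {
        Map<String, Object> node = new LinkedHashMap<>();
        node.put("op", op);
        node.put("conds", new ArrayList<>(conds));
        return node;
    }

    /** region in (cn, sg) AND (type in (1) OR data_type not in (1, 2, 3)). */
    static Map<String, Object> example() {
        Map<String, Object> regionIn = membership("must", "region", "cn", "sg");
        Map<String, Object> typeIn = membership("must", "type", 1);
        Map<String, Object> dataTypeNotIn = membership("must_not", "data_type", 1, 2, 3);
        Map<String, Object> either = logical("or", Arrays.asList(typeIn, dataTypeNotIn));
        return logical("and", Arrays.asList(regionIn, either));
    }
}
```

The resulting map has the same shape as the JSON examples in the table. If setFilter requires a HashMap, as the request example on this page suggests, the result could be copied with new HashMap<>(FilterSketch.example()).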

Example

Request

import java.util.HashMap;
import java.util.List;

// Assumes vikingDBService is an already-initialized client; SDK imports are omitted.
Index index = vikingDBService.getIndex("test_text", "test_index_text");

// Scalar filter: price < 4
HashMap<String, Object> filter = new HashMap<>();
filter.put("op", "range");
filter.put("field", "price");
filter.put("lt", 4);

// Query text
Text text = new Text().setText("this.is test").build();

SearchByTextParam searchByTextParam = new SearchByTextParam()
        .setText(text)
        .setFilter(filter)
        .setDenseWeight(0.5)
        .build();
List<DataObject> datas = index.searchByText(searchByTextParam);

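Return value

The Java call above returns a List<DataObject>. The attributes of a DataObject instance are listed in the table below.

Attribute	Description
id	Primary key id.
fields	The fields attribute of the response: the actual data, as a map.
score	How closely the retrieved vector matches the input vector.
text	Returned for unstructured text retrieval.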
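
A common follow-up step is to keep only the results whose score clears a threshold. The sketch below illustrates this with a stand-in type, since the accessor names of the SDK's DataObject are not shown on this page. ResultSketch, Hit, and keepAtLeast are all hypothetical; Hit only mirrors the four attributes in the table. It also assumes that a higher score means a closer match, which may depend on the distance type of the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not part of the SDK): filters search hits by score.
final class ResultSketch {
    private ResultSketch() {}

    /** Stand-in mirroring the documented attributes; this is not the SDK's DataObject. */
    static final class Hit {
        final Object id;                  // primary key
        final Map<String, Object> fields; // returned data
        final double score;               // match score
        final String text;                // returned for text retrieval

        Hit(Object id, Map<String, Object> fields, double score, String text) {
            this.id = id;
            this.fields = fields;
            this.score = score;
            this.text = text;
        }
    }

    /**
     * Keeps hits with score >= minScore, preserving the original order.
     * Assumption: a higher score means a closer match.
     */
    static List<Hit> keepAtLeast(List<Hit> hits, double minScore) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.score >= minScore) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

With the real SDK, the same loop would read the score from each DataObject through whatever accessor the class provides.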
