Last updated: 2024.04.16 13:11:54
First published: 2023.12.11 17:01:20
The /doc_chunking API parses document content and splits the original text into shorter text chunks according to the hierarchical relationships among its paragraphs (e.g., main title, subheadings, and body text). The caller submits the document to be parsed, and the API returns an array of text chunks, each containing its position in the original document and its text content. Splitting the text into smaller chunks makes it possible to retrieve finer-grained information downstream and to fit within the input-window limits of large language models.
Note
The current default chunking strategy: chunking based on the document's semantic labels (e.g., abstract, introduction).
Item | Value | Description
---|---|---
URI | https://viking-knowledge-demo.byte-test.com/api/doc_chunking | Uniform Resource Identifier
Method | POST | Operation type of the client request to the doc_chunking service
Request header | Content-Type: application/json | Request message type
| Authorization: HMAC-SHA256 *** | Authentication
Parameter | Sub-parameter | Type | Required | Default | Description
---|---|---|---|---|---
doc_infos | doc_type | string | Yes | - | Document type. Currently supports txt, pdf, markdown, doc, docx, pptx.
| url | string | One of url / data | - | Download link of the document. The document must not exceed 30 MB or 200 pages.
| data | byte | One of url / data | - | Binary stream of the document, uploaded in binary form.
chunk_size | | int | No | 500 | Chunk length.
parse_table_for_pdf | | bool | No | false | Whether to parse tables in PDF files. Setting this to true increases parsing time; enable only when needed.
parse_picture_for_pdf | | bool | No | false | Whether to parse images in PDF files. Setting this to true increases parsing time; enable only when needed.
pdf_with_ocr | | bool | No | false | Whether to parse scanned documents with OCR. true means parse them.
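As a minimal sketch of assembling the request body above, the helper below builds the JSON payload for the url variant. The function name and the example document URL are hypothetical; the HMAC-SHA256 signature for the Authorization header is not documented here, so sending the request is only indicated in the usage note.

```python
import json

API_URL = "https://viking-knowledge-demo.byte-test.com/api/doc_chunking"

def build_chunking_request(doc_url, doc_type="pdf", chunk_size=500,
                           parse_table_for_pdf=False):
    """Assemble the JSON body for /doc_chunking (url variant).

    doc_infos is an array, so several documents could be listed;
    chunk_size defaults to 500 as in the parameter table.
    """
    return {
        "doc_infos": [{"doc_type": doc_type, "url": doc_url}],
        "chunk_size": chunk_size,
        "parse_table_for_pdf": parse_table_for_pdf,
    }

# Hypothetical document URL, for illustration only
body = build_chunking_request("https://example.com/sample.pdf")
print(json.dumps(body))
```

To actually call the service you would POST this body with headers `Content-Type: application/json` and a valid `Authorization: HMAC-SHA256 ...` value, e.g. via `requests.post(API_URL, headers=..., json=body)`; computing the signature is outside the scope of this sketch.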
Parameter | Description
---|---
code | Status code
message | Response message
request_id | Unique identifier of each request
data | Result. Contains the field DocChunkingResults, an array in which each element holds DocChunks (the list of text chunks) and Status (the per-document parsing status); see the response examples below.
Status code | HTTP status code | Message | Description
---|---|---|---
0 | 200 | success | Request succeeded.
1000001 | 403 | check sign failed | Authentication information is missing from the request header.
1000002 | 403 | account[xx] has no permission | No permission for this API; contact the administrator to add it.
1000003 | 400 | account[xx] request parse failed... | Invalid parameters; the request format is incorrect.
1000004 | 200 | account[xx] has arrived .... | QPS limit reached; please retry.
...... | | |
1001000 | 500 | ...... | Internal server error; contact the administrator.
```shell
curl -X POST https://viking-knowledge-demo.byte-test.com/api/doc_chunking \
  -H "Content-Type: application/json" \
  -H "Authorization: xxxxxxxxx" \
  -d '{"doc_infos":[{"doc_type":"pdf","url":"{url}"}]}'
```
Response for a successful parse:
```json
// 1 Response for a successful parse
{
  "code": 0,
  "data": {
    "DocChunkingResults": [
      {
        "DocChunks": [
          "{\"id\": 0, \"type\": \"title\", \"label\": \"\", \"level\": -1, \"parent\": -1, \"children\": [2, 7], \"text\": \"DLSP: A Document Level Structure Parser for Multi-Page Digital Documents\", \"positions\": {\"page_no\": [0], \"bbox\": [[68.171, 94.55177, 543.8337, 112.88397]]}}",
          "{\"id\": 1, \"type\": \"section-text\", \"label\": \"author\", \"level\": -1, \"parent\": -1, \"children\": [], \"text\": \"Anonymous submission\", \"positions\": {\"page_no\": [0], \"bbox\": [[245.711, 131.17615, 366.29117, 147.953]]}}",
          "{\"id\": 2, \"type\": \"section-title\", \"label\": \"abstract\", \"level\": 0, \"parent\": 0, \"children\": [3, 4, 5, 6], \"text\": \"Abstract\", \"positions\": {\"page_no\": [0], \"bbox\": [[154.715, 214.48235, 191.78584, 229.96301]]}}",
          "{\"id\": 3, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Document AI aims to extract structure data from digital documents. However, existing research mainly focuses on extracting information at page level, which doesn’t solve the issues at document level, such as merging paragraphs across pages and organizing paragraphs into chapters and sections. Accurate structural information at the document level is critical for various applications, including search engines, DocQA systems and so on. To address this issue, we propose a novel transition-based parser, DLSP. Moreover, we introduce a new document-level structure parsing dataset, DocTree, which contains manually annotated paragraph structures of multi-page documents, with the longest docu- ment spanning up to 85 pages. We compare the impact of pure text and text-image multi-modal pretraining encoders on the results. Our experiments demonstrate that, in DocTree dataset, our approach outperforms existing methods by an improvement of 12% in accuracy.\", \"positions\": {\"page_no\": [0], \"bbox\": [[63.962997, 233.97977, 282.53696, 407.78415]]}}",
          "{\"id\": 4, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Introduction\", \"positions\": {\"page_no\": [0], \"bbox\": [[140.81, 421.24728, 205.69087, 438.02414]]}}",
          "{\"id\": 5, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Document AI is a research field that has emerged in recent years. It focuses on automating the reading, comprehension, and analysis of data in electronic documents, such as PDFs and Word files. These documents can be either scanned or digital-born files and may contain a variety of content, including receipts, forms, resumes, manuals, and textbooks, among others. Extracting information from these documents is a challenging task due to the presence of various layouts and templates. Although many documents are digital-born, their formats were designed for layout purposes, such as PDFs. As a result, the structural information retained within them is often incomplete, making the extraction work challenging.\", \"positions\": {\"page_no\": [0], \"bbox\": [[54.0, 438.44968, 292.50467, 584.95905]]}}",
          "{\"id\": 6, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Previous Document AI research mainly focused on an- alyzing individual pages, primarily divided into two tasks, entity labeling and entity linking. Entity labeling refers to the process of assigning labels to text tokens or segments in a document, such as titles, tables, figures, and so\", \"positions\": {\"page_no\": [0], \"bbox\": [[54.0, 581.4086, 292.50507, 640.2471]]}}",
          "{\"id\": 7, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 0, \"parent\": 0, \"children\": [8, 9, 10, 11, 12, 13, 40, 47], \"text\": \"on. In the early research, entity labeling is primarily achieved through visual features for image classification, DeepDeSRT(Schreiber et al. 2017), PDFTableDection(Hao et al. 2016), VisualDetection(Soto and Yoo 2019) and RVL CDIP(Harley, Ufkes, and Derpanis 2015). Subsequent studies aimed to enhance classification performance by\", \"positions\": {\"page_no\": [0], \"bbox\": [[53.999996, 636.2036, 292.50488, 706.0]]}}",
          "{\"id\": 8, \"type\": \"image\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \" Entity Labeling Entity Linking Document Level Structure Parsing title abstract paragraph section paragraph\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.0, 211.0, 562.0, 321.0]]}}",
          "{\"id\": 9, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \"Figure 1: Existing Document AI tasks, entity labeling and entity linking, focus on extracting information at page level. In this paper, we propose a new task, document level structure parsing, that can handle paragraphs across pages and organize paragraphs into sections.\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.5, 323.57852, 558.0047, 382.41693]]}}",
          "{\"id\": 10, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \"incorporating both text and visual multi-modal features, such as DocBank(Li et al. 2020), LayoutLM serials(Xu et al. 2020), (Xu et al. 2021b), (Xu et al. 2021a), Pub- LayNet(Zhong, Tang, and Yepes 2019). Apart from clas- sifying textual components, some studies have also pro- posed methods for categorizing reading orders, such as LayoutParser(Shen et al. 2021), LayoutReader(Wang et al. 2021), ERNIE-layout(Peng et al. 2022). Entity linking refers to predicting the relationships between text segments in a document, aiming to extract connections among them. In previous work, the focus was mainly on linking entities within a single page, achieved through pair-wise classifica- tion. Such works include dhSegment(Ares Oliveira, Seguin, and Kaplan 2018), FUNSD(Jaume, Kemal Ekenel, and Thiran 2019a), EPHOIE(Jaume, Kemal Ekenel, and Thiran 2019b), SROIE(Huang et al. 2019), DocStruct(Wang et al. 2020), SPADE(Hwang et al. 2021), StructuralLM(Li et al. 2021a), StrucTexT(Li et al. 2021b).\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.5, 403.32745, 558.00476, 604.63184]]}}",
          "{\"id\": 11, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \"Existing studies usually only deal with information within one single page. The issue of document level structure pars- ing is left unsolved. To extract valuable textual information from multi-page documents, it becomes essential to parse the document’s structure, dividing it into distinct sections and addressing the issue of paragraphs spanning multiple pages.\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.5, 601.9574, 558.00476, 682.7128]]}}"
        ],
        "Status": "{\"message\": \"chunking_success\", \"code\": 0, \"parser_engine\": \"docai\"}"
      }
    ]
  },
  "message": "success",
  "request_id": "021701174139144fdbddc0300ff0501c3c818b64edcd8f4bb487e"
}
```
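Note that each element of `DocChunks` (and the `Status` field) is itself a JSON-encoded string, so it must be decoded a second time after the outer response is parsed. A minimal sketch, with a hypothetical helper name and a shortened sample mirroring the response shape:

```python
import json

def parse_chunks(doc_chunking_result):
    """Decode one element of data.DocChunkingResults.

    Each DocChunks entry and the Status field are JSON strings,
    so json.loads is applied to each of them individually.
    """
    chunks = [json.loads(s) for s in doc_chunking_result["DocChunks"]]
    status = json.loads(doc_chunking_result["Status"])
    return chunks, status

# Shortened sample in the same shape as the successful response above
sample = {
    "DocChunks": [
        '{"id": 0, "type": "title", "parent": -1, "children": [1], "text": "Title"}',
        '{"id": 1, "type": "section-text", "parent": 0, "children": [], "text": "Body"}',
    ],
    "Status": '{"message": "chunking_success", "code": 0}',
}
chunks, status = parse_chunks(sample)
print(status["message"], [c["text"] for c in chunks])
```

Once decoded, the `parent` / `children` ids can be used to rebuild the heading hierarchy, and `positions.page_no` / `positions.bbox` locate each chunk in the source document.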
Response for a failed parse:
```json
// 1 Response for a failed parse
{
  "code": 0,
  "data": {
    "DocChunkingResults": [
      {
        "Status": "{\"message\": \"filetype xlsx not supported\", \"code\": 5003}"
      }
    ]
  },
  "message": "success",
  "request_id": "021701852892091fdbddc0300ff0501ce9aacf7979ab0a24b9e22"
}

// 2 Response for a failed parse
{
  "code": 0,
  "data": {
    "DocChunkingResults": [
      {
        "Status": "{\"message\": \"No /Root object! - Is this really a PDF?\", \"code\": 7000}"
      }
    ]
  },
  "message": "success",
  "request_id": "021701858018883fdbddc0300ff0501ce9aacf7979ab0a2997666"
}
```
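As the failure examples show, the top-level `code` stays 0 and `message` stays "success" even when a document fails to parse; the per-document error is carried in the nested `Status` string. A minimal sketch of distinguishing the two cases, with a hypothetical helper name:

```python
import json

def chunking_failed(result):
    """Return (failed, message) for one DocChunkingResults element.

    The per-document Status field is a JSON string whose code is 0
    on success and non-zero (e.g. 5003, 7000) on failure.
    """
    status = json.loads(result["Status"])
    return status.get("code", 0) != 0, status.get("message", "")

ok = {"Status": '{"message": "chunking_success", "code": 0, "parser_engine": "docai"}'}
bad = {"Status": '{"message": "filetype xlsx not supported", "code": 5003}'}
print(chunking_failed(ok), chunking_failed(bad))
```

Callers should therefore check every element of `data.DocChunkingResults` rather than relying on the top-level `code` alone.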