
doc_chunking

Last updated: 2024.04.16 13:11:54

First published: 2023.12.11 17:01:20

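Overview

The /doc_chunking endpoint parses a document and splits the original text into shorter text chunks, following the hierarchy of its paragraphs (for example main title, subheadings, and body text). To use the endpoint, pass in the document to be parsed; the result is an array of text chunks. Each chunk contains its position in the original document and its text content. Splitting the text into smaller chunks lets later retrieval return finer-grained information and helps the content fit the input-window limits of large language models.

Note

The current default chunking strategy splits on the document's semantic labels (such as abstract and introduction).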

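Request interface

| Item | Value | Description |
| --- | --- | --- |
| URI | https://viking-knowledge-demo.byte-test.com/api/doc_chunking | Uniform Resource Identifier |
| Method | POST | Operation type of the client's request to the doc_chunking service |
| Request headers | Content-Type: application/json | Request message type |
| | Authorization: HMAC-SHA256 *** | Authentication |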

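Request parameters

| Parameter | Sub-parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- | --- |
| doc_infos | doc_type | string | | | Document type. Currently supported: txt, pdf, markdown, doc, docx, pptx. |
| | url | string | One of url / data | | Download link for the document. The document must not exceed 30M in size or 200 pages. |
| | data | byte | One of url / data | | Binary stream of the document, uploaded in binary form. The whole request must not exceed 4M; larger files must be parsed via url. |
| chunk_size | | int | | 500 | Chunk length. |
| parse_table_for_pdf | | bool | | false | Whether to parse tables in pdf files. Setting this to true increases parsing time, so enable it only when needed. |
| parse_picture_for_pdf | | bool | | false | Whether to parse images in pdf files. Setting this to true increases parsing time, so enable it only when needed. |
| pdf_with_ocr | | bool | | false | Whether to parse scanned documents. Set to true to parse them. |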
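The url/data rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `build_chunking_body` is hypothetical, base64-encoding `data` is an assumption (the table does not say how the binary stream is serialized into JSON), and "4M" is read as 4 * 1024 * 1024 bytes.

```python
import base64
import json

SUPPORTED_TYPES = {"txt", "pdf", "markdown", "doc", "docx", "pptx"}
MAX_REQUEST_BYTES = 4 * 1024 * 1024  # assumption: "4M" means 4 MiB


def build_chunking_body(doc_type, url=None, data=None):
    """Return the JSON request body (as a str) for a single document."""
    if doc_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported doc_type: {doc_type}")
    # Exactly one of url / data must be supplied.
    if (url is None) == (data is None):
        raise ValueError("provide exactly one of url or data")

    doc_info = {"doc_type": doc_type}
    if url is not None:
        doc_info["url"] = url
    else:
        # Assumption: raw bytes travel base64-encoded inside the JSON body.
        doc_info["data"] = base64.b64encode(data).decode("ascii")

    body = json.dumps({"doc_infos": [doc_info]})
    if len(body.encode("utf-8")) > MAX_REQUEST_BYTES:
        raise ValueError("request exceeds 4M; pass the document by url instead")
    return body
```

Validating on the client side avoids a round trip for requests the service would presumably reject anyway.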

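Response message

| Parameter | Description |
| --- | --- |
| code | Status code |
| message | Returned message |
| request_id | Unique identifier of each request |
| data | Returned result. Each chunk contains the following fields: |

  • id: paragraph index.
  • type: the kind of document element the chunk belongs to. title is the document's main title, section-title a section heading, section-text section content, image an image, table a table, header a page header, footer a page footer, footnote a footnote, caption a figure or table caption, toc a table of contents, and others anything else.
  • label: semantic label of the paragraph. For a paper, for example: title, author, abstract, introduction, related works, and so on.
  • level: paragraph level, similar to the Hn heading levels in Feishu Docs.
  • parent: index of the current paragraph's parent node.
  • children: indexes of all child nodes of the current paragraph.
  • text: text content of the paragraph.
  • positions: position of the chunk; for pdf documents this includes bbox and page_no.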
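One possible way to use the parent field is to recover the heading path of a chunk. The Python sketch below assumes that each chunk arrives as a JSON-encoded string and that a parent of -1 means "no parent", as in the sample response in this document. The function names `parse_chunks` and `heading_path` are hypothetical.

```python
import json


def parse_chunks(doc_chunks):
    """Decode each chunk (a JSON string) and index the results by id."""
    chunks = [json.loads(raw) for raw in doc_chunks]
    return {chunk["id"]: chunk for chunk in chunks}


def heading_path(chunks_by_id, chunk_id):
    """Return the texts of a chunk's ancestors, outermost first."""
    path = []
    parent = chunks_by_id[chunk_id]["parent"]
    # -1 is never a key, so the walk stops at a chunk without a parent.
    while parent in chunks_by_id:
        path.append(chunks_by_id[parent]["text"])
        parent = chunks_by_id[parent]["parent"]
    return list(reversed(path))


sample = [
    '{"id": 0, "type": "title", "parent": -1, "children": [2], "text": "DLSP"}',
    '{"id": 2, "type": "section-title", "parent": 0, "children": [3], "text": "Abstract"}',
    '{"id": 3, "type": "section-text", "parent": 2, "children": [], "text": "Document AI aims ..."}',
]
chunks = parse_chunks(sample)
print(heading_path(chunks, 3))  # ['DLSP', 'Abstract']
```

Prepending such a path to a chunk's text before indexing is one way to keep section context attached to small chunks.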

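Status codes

| Status code | HTTP status code | Returned message | Description |
| --- | --- | --- | --- |
| 0 | 200 | success | Request succeeded. |
| 1000001 | 403 | check sign failed | Authentication information is missing from the request headers. |
| 1000002 | 403 | account[xx] has no permission | No permission for this endpoint; contact the administrator to have it added. |
| 1000003 | 400 | account[xx] request parse failed... | Invalid parameters; the request parameters are malformed. |
| 1000004 | 200 | account[xx] has arrived .... | QPS limit reached; retry. |
| ...... | | | |
| 1001000 | 500 | ...... | Internal server error; contact the administrator. |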
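Because status code 1000004 is listed with HTTP status 200, a client cannot rely on the HTTP status alone and has to inspect code in the response body. The Python sketch below shows one way to do that. The names `classify_response` and `call_with_retry` are hypothetical, and the exponential backoff is an assumption: the table only says to retry.

```python
import time

RETRYABLE_CODES = {1000004}  # QPS limit: the status table says to retry


def classify_response(code):
    """Map a top-level status code to 'ok', 'retry', or 'fail'."""
    if code == 0:
        return "ok"
    if code in RETRYABLE_CODES:
        return "retry"
    return "fail"


def call_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it stops reporting a QPS limit or attempts run out.

    `send` is any zero-argument callable returning the decoded response dict.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if classify_response(response["code"]) != "retry":
            return response
        if attempt < max_attempts - 1:
            # Assumption: exponential backoff; the docs do not prescribe a delay.
            sleep(base_delay * 2 ** attempt)
    return response
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.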

Complete example

Request message

curl -X POST https://viking-knowledge-demo.byte-test.com/api/doc_chunking \
        -H "Content-Type: application/json" \
        -H "Authorization:xxxxxxxxx" \
        -d '{"doc_infos":[{"doc_type":"pdf","url":"{url}"}]}'
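A rough Python equivalent of the curl call, using only the standard library, might look like the sketch below. It builds the request without sending it. The function name `make_request` is hypothetical, and the Authorization value is a caller-supplied placeholder, because this section does not describe how the HMAC-SHA256 signature is computed.

```python
import json
import urllib.request

ENDPOINT = "https://viking-knowledge-demo.byte-test.com/api/doc_chunking"


def make_request(doc_type, doc_url, authorization):
    """Build (but do not send) the POST request for one document."""
    body = json.dumps({"doc_infos": [{"doc_type": doc_type, "url": doc_url}]})
    return urllib.request.Request(
        ENDPOINT,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder: the signing procedure is not shown in this section.
            "Authorization": authorization,
        },
        method="POST",
    )


# To send it: urllib.request.urlopen(request), then json.load() the response.
```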

Response message

Response when parsing succeeds:

// 1  Response when parsing succeeds
{
    "code": 0,
    "data": {
        "DocChunkingResults": [
            {
                "DocChunks": [
                    "{\"id\": 0, \"type\": \"title\", \"label\": \"\", \"level\": -1, \"parent\": -1, \"children\": [2, 7], \"text\": \"DLSP: A Document Level Structure Parser for Multi-Page Digital Documents\", \"positions\": {\"page_no\": [0], \"bbox\": [[68.171, 94.55177, 543.8337, 112.88397]]}}",
                    "{\"id\": 1, \"type\": \"section-text\", \"label\": \"author\", \"level\": -1, \"parent\": -1, \"children\": [], \"text\": \"Anonymous submission\", \"positions\": {\"page_no\": [0], \"bbox\": [[245.711, 131.17615, 366.29117, 147.953]]}}",
                    "{\"id\": 2, \"type\": \"section-title\", \"label\": \"abstract\", \"level\": 0, \"parent\": 0, \"children\": [3, 4, 5, 6], \"text\": \"Abstract\", \"positions\": {\"page_no\": [0], \"bbox\": [[154.715, 214.48235, 191.78584, 229.96301]]}}",
                    "{\"id\": 3, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Document AI aims to extract structure data from digital documents. However, existing research mainly focuses on extracting information at page level, which doesn’t solve the issues at document level, such as merging paragraphs across pages and organizing paragraphs into chapters and sections. Accurate structural information at the document level is critical for various applications, including search engines, DocQA systems and so on. To address this issue, we propose a novel transition-based parser, DLSP. Moreover, we introduce a new document-level structure parsing dataset, DocTree, which contains manually annotated paragraph structures of multi-page documents, with the longest docu- ment spanning up to 85 pages. We compare the impact of pure text and text-image multi-modal pretraining encoders on the results. Our experiments demonstrate that, in DocTree dataset, our approach outperforms existing methods by an improvement of 12% in accuracy.\", \"positions\": {\"page_no\": [0], \"bbox\": [[63.962997, 233.97977, 282.53696, 407.78415]]}}",
                    "{\"id\": 4, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Introduction\", \"positions\": {\"page_no\": [0], \"bbox\": [[140.81, 421.24728, 205.69087, 438.02414]]}}",
                    "{\"id\": 5, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Document AI is a research field that has emerged in recent years. It focuses on automating the reading, comprehension, and analysis of data in electronic documents, such as PDFs and Word files. These documents can be either scanned or digital-born files and may contain a variety of content, including receipts, forms, resumes, manuals, and textbooks, among others. Extracting information from these documents is a challenging task due to the presence of various layouts and templates. Although many documents are digital-born, their formats were designed for layout purposes, such as PDFs. As a result, the structural information retained within them is often incomplete, making the extraction work challenging.\", \"positions\": {\"page_no\": [0], \"bbox\": [[54.0, 438.44968, 292.50467, 584.95905]]}}",
                    "{\"id\": 6, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 2, \"children\": [], \"text\": \"Previous Document AI research mainly focused on an- alyzing individual pages, primarily divided into two tasks, entity labeling and entity linking. Entity labeling refers to the process of assigning labels to text tokens or segments in a document, such as titles, tables, figures, and so\", \"positions\": {\"page_no\": [0], \"bbox\": [[54.0, 581.4086, 292.50507, 640.2471]]}}",
                    "{\"id\": 7, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 0, \"parent\": 0, \"children\": [8, 9, 10, 11, 12, 13, 40, 47], \"text\": \"on. In the early research, entity labeling is primarily achieved through visual features for image classification, DeepDeSRT(Schreiber et al. 2017), PDFTableDection(Hao et al. 2016), VisualDetection(Soto and Yoo 2019) and RVL CDIP(Harley, Ufkes, and Derpanis 2015). Subsequent studies aimed to enhance classification performance by\", \"positions\": {\"page_no\": [0], \"bbox\": [[53.999996, 636.2036, 292.50488, 706.0]]}}",
                    "{\"id\": 8, \"type\": \"image\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \" Entity Labeling Entity Linking Document Level Structure Parsing title abstract paragraph section paragraph\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.0, 211.0, 562.0, 321.0]]}}",
                    "{\"id\": 9, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \"Figure 1: Existing Document AI tasks, entity labeling and entity linking, focus on extracting information at page level. In this paper, we propose a new task, document level structure parsing, that can handle paragraphs across pages and organize paragraphs into sections.\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.5, 323.57852, 558.0047, 382.41693]]}}",
                    "{\"id\": 10, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \"incorporating both text and visual multi-modal features, such as DocBank(Li et al. 2020), LayoutLM serials(Xu et al. 2020), (Xu et al. 2021b), (Xu et al. 2021a), Pub- LayNet(Zhong, Tang, and Yepes 2019). Apart from clas- sifying textual components, some studies have also pro- posed methods for categorizing reading orders, such as LayoutParser(Shen et al. 2021), LayoutReader(Wang et al. 2021), ERNIE-layout(Peng et al. 2022). Entity linking refers to predicting the relationships between text segments in a document, aiming to extract connections among them. In previous work, the focus was mainly on linking entities within a single page, achieved through pair-wise classifica- tion. Such works include dhSegment(Ares Oliveira, Seguin, and Kaplan 2018), FUNSD(Jaume, Kemal Ekenel, and Thiran 2019a), EPHOIE(Jaume, Kemal Ekenel, and Thiran 2019b), SROIE(Huang et al. 2019), DocStruct(Wang et al. 2020), SPADE(Hwang et al. 2021), StructuralLM(Li et al. 2021a), StrucTexT(Li et al. 2021b).\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.5, 403.32745, 558.00476, 604.63184]]}}",
                    "{\"id\": 11, \"type\": \"section-text\", \"label\": \"abstract\", \"level\": 1, \"parent\": 7, \"children\": [], \"text\": \"Existing studies usually only deal with information within one single page. The issue of document level structure pars- ing is left unsolved. To extract valuable textual information from multi-page documents, it becomes essential to parse the document’s structure, dividing it into distinct sections and addressing the issue of paragraphs spanning multiple pages.\", \"positions\": {\"page_no\": [0], \"bbox\": [[319.5, 601.9574, 558.00476, 682.7128]]}}"
                ],
                "Status": "{\"message\": \"chunking_success\", \"code\": 0, \"parser_engine\": \"docai\"}"
            }
        ]
    },
    "message": "success",
    "request_id": "021701174139144fdbddc0300ff0501c3c818b64edcd8f4bb487e"
}

Responses when parsing fails:

// 1  Response when parsing fails
{
    "code": 0,
    "data": {
        "DocChunkingResults": [
            {
                "Status": "{\"message\": \"filetype xlsx not supported\", \"code\": 5003}"
            }
        ]
    },
    "message": "success",
    "request_id": "021701852892091fdbddc0300ff0501ce9aacf7979ab0a24b9e22"
}
// 2  Response when parsing fails
{
    "code": 0,
    "data": {
        "DocChunkingResults": [
            {
                "Status": "{\"message\": \"No /Root object! - Is this really a PDF?\", \"code\": 7000}"
            }
        ]
    },
    "message": "success",
    "request_id": "021701858018883fdbddc0300ff0501ce9aacf7979ab0a2997666"
}
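In the failure examples above, the top-level code is still 0 and message is still "success"; the error appears only inside the per-document Status string. A client therefore probably needs to decode Status for every entry. The Python sketch below shows one way to do so. The function name `failed_documents` is hypothetical, and treating a Status code of 0 as success is inferred from the successful sample rather than stated in the tables.

```python
import json


def failed_documents(response):
    """Return (index, code, message) for every entry whose parsing failed.

    `index` is the position of the entry within DocChunkingResults.
    """
    failures = []
    results = response["data"]["DocChunkingResults"]
    for index, result in enumerate(results):
        status = json.loads(result["Status"])  # Status is a JSON-encoded string
        # Assumption: code 0 inside Status means the document was chunked.
        if status["code"] != 0:
            failures.append((index, status["code"], status["message"]))
    return failures
```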