
add module doc for docbee2 and chart2table (#3969)

* add module doc for docbee2 and chart2table

* add docs
Zhang Zelun 6 months ago
parent
commit
ff6555b7d4

+ 221 - 0
docs/module_usage/tutorials/vlm_modules/chart_parsing.en.md

@@ -0,0 +1,221 @@
+---
+comments: true
+---
+
+# Chart Parsing Model Module Usage Tutorial
+
+## I. Overview
+Multimodal chart parsing is a cutting-edge technology in the OCR field that automatically converts various kinds of visual charts (bar charts, line charts, pie charts, etc.) into their underlying data tables and outputs them in a structured format. Traditional approaches rely on a complex pipeline of models such as chart key-point detectors, which involves many prior assumptions and lacks robustness. The models in this module instead use recent VLM technology and are fully data-driven, learning robust features from massive amounts of real-world data. Application scenarios include financial analysis, academic research, and business reports: for example, quickly extracting growth-trend data from financial statements, experimental comparison values from scientific papers, or user-distribution statistics from market research, helping users move from "viewing charts" to "using data".
+
+<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/modules/chart_parsing/chart_parsing_01.png"/>
+
+## II. Supported Model List
+
+<table>
+<tr>
+<th>Model</th><th>Model Download Link</th>
+<th>Model Storage Size (GB)</th>
+<th>Description</th>
+</tr>
+<tr>
+<td>PP-Chart2Table</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-Chart2Table_infer.tar">Inference Model</a></td>
+<td>1.4</td>
+<td>PP-Chart2Table is a multimodal model developed by the PaddlePaddle team that focuses on chart parsing and delivers outstanding performance on both Chinese and English chart parsing tasks. The team adopted a carefully designed data generation strategy to build a high-quality multimodal chart parsing dataset of nearly 700,000 samples, covering common chart types such as pie charts, bar charts, and stacked area charts, as well as a wide range of application scenarios. They also designed a two-stage training method that uses large-model distillation to make full use of massive amounts of unlabeled OOD data. In internal business tests on Chinese and English scenarios, PP-Chart2Table not only reached SOTA level among models of the same parameter scale, but also achieved accuracy comparable to 7B-parameter VLM models in key scenarios.</td>
+</tr>
+</table>
+
+## III. Quick Integration
+> ❗ Before quick integration, please install the PaddleX wheel package. For details, please refer to [PaddleX Local Installation Tutorial](../../../installation/installation.md)
+
+After installing the wheel package, you can complete inference for the chart parsing model module in just a few lines of code. You can switch freely between the models in this module, and you can also integrate the module's model inference into your own project. Before running the following code, please download the [sample image](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/chart_parsing_02.png) locally.
+
+```python
+from paddlex import create_model
+model = create_model('PP-Chart2Table')
+results = model.predict(
+    input={"image": "chart_parsing_02.png"},
+    batch_size=1
+)
+for res in results:
+    res.print()
+    res.save_to_json("./output/res.json")
+```
+
+After running, the result is:
+
+```bash
+{'res': {'image': 'chart_parsing_02.png', 'result': '年份 | 单家五星级旅游饭店年平均营收 (百万元) | 单家五星级旅游饭店年平均利润 (百万元)\n2018 | 104.22 | 9.87\n2019 | 99.11 | 7.47\n2020 | 57.87 | -3.87\n2021 | 68.99 | -2.9\n2022 | 56.29 | -9.48\n2023 | 87.99 | 5.96'}}
+```
+The meanings of the result parameters are as follows:
+- `image`: the path of the input image
+- `result`: the parsing result predicted by the model, returned as a single string in which columns are separated by `|` and rows by `\n` (a short sketch for splitting this string back into rows follows the printed example below)
+
+The prediction result printed in readable form is as follows:
+
+```bash
+年份 | 单家五星级旅游饭店年平均营收 (百万元) | 单家五星级旅游饭店年平均利润 (百万元)
+2018 | 104.22 | 9.87
+2019 | 99.11 | 7.47
+2020 | 57.87 | -3.87
+2021 | 68.99 | -2.9
+2022 | 56.29 | -9.48
+2023 | 87.99 | 5.96
+```
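+
+Since `result` is returned as a single string in which columns are separated by `|` and rows by `\n`, downstream code usually needs to split it back into rows and columns. The snippet below is a minimal, illustrative sketch rather than part of the PaddleX API; the sample string is copied from the output above, and all variable names are placeholders.
+
+```python
+# Sketch: convert the pipe-separated table string in `result` into row dicts
+result_str = (
+    "年份 | 单家五星级旅游饭店年平均营收 (百万元) | 单家五星级旅游饭店年平均利润 (百万元)\n"
+    "2018 | 104.22 | 9.87\n"
+    "2019 | 99.11 | 7.47"
+)
+
+lines = [line.strip() for line in result_str.split("\n") if line.strip()]
+header = [cell.strip() for cell in lines[0].split("|")]
+rows = [
+    dict(zip(header, (cell.strip() for cell in line.split("|"))))
+    for line in lines[1:]
+]
+print(rows)  # e.g. [{'年份': '2018', ...}, {'年份': '2019', ...}]
+```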
+
+Related methods, parameters, and descriptions are as follows:
+
+* `create_model` instantiates the chart parsing model (here using `PP-Chart2Table` as an example), with specific explanations as follows:
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Options</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td><code>model_name</code></td>
+<td>Model name</td>
+<td><code>str</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>model_dir</code></td>
+<td>Model storage path</td>
+<td><code>str</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>device</code></td>
+<td>Model inference device</td>
+<td><code>str</code></td>
+<td>Supports specifying a specific GPU card, e.g. "gpu:0", a specific card of other hardware, e.g. "npu:0", or the CPU as "cpu".</td>
+<td><code>gpu:0</code></td>
+</tr>
+<tr>
+<td><code>use_hpip</code></td>
+<td>Whether to enable high-performance inference plugins. Currently not supported.</td>
+<td><code>bool</code></td>
+<td>None</td>
+<td><code>False</code></td>
+</tr>
+<tr>
+<td><code>hpi_config</code></td>
+<td>High-performance inference configuration. Currently not supported.</td>
+<td><code>dict</code> | <code>None</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+</table>
+
+* `model_name` must be specified. Once `model_name` is specified, the PaddleX built-in model parameters are used by default; if `model_dir` is also specified, the user-defined model is used instead (see the sketch below).
+
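+For example, to run inference on the CPU or to load weights from a local directory, the `device` and `model_dir` parameters described above can be passed to `create_model`. This is a minimal sketch; the local path is a placeholder and is assumed to point to an exported PaddleX inference model directory.
+
+```python
+from paddlex import create_model
+
+# Built-in PP-Chart2Table weights, inference on the CPU
+model_cpu = create_model(model_name="PP-Chart2Table", device="cpu")
+
+# User-provided weights from a local directory (placeholder path)
+model_custom = create_model(
+    model_name="PP-Chart2Table",
+    model_dir="./my_models/PP-Chart2Table_infer",
+    device="gpu:0",
+)
+```
+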
+* Call the `predict()` method of the chart parsing model for inference. The `predict()` method accepts the parameters `input` and `batch_size`, explained as follows:
+
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Options</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td><code>input</code></td>
+<td>Data to be predicted</td>
+<td><code>dict</code></td>
+<td>
+<code>Dict</code>. Since multimodal models have different input requirements, the format depends on the specific model. Specifically:
+<li>The input form for PP-Chart2Table is <code>{'image': image_path}</code></li>
+</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>batch_size</code></td>
+<td>Batch size</td>
+<td><code>int</code></td>
+<td>Integer</td>
+<td>1</td>
+</tr>
+</table>
+
+* Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports printing and saving as a `json` file (a short usage sketch follows the table):
+
+<table>
+<thead>
+<tr>
+<th>Method</th>
+<th>Description</th>
+<th>Parameter</th>
+<th>Type</th>
+<th>Description</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td rowspan = "3"><code>print()</code></td>
+<td rowspan = "3">Print results to terminal</td>
+<td><code>format_json</code></td>
+<td><code>bool</code></td>
+<td>Whether to format the output content using <code>JSON</code> indentation</td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>indent</code></td>
+<td><code>int</code></td>
+<td>Specify the indentation level to beautify the output <code>JSON</code> data for better readability, only effective when <code>format_json</code> is <code>True</code></td>
+<td>4</td>
+</tr>
+<tr>
+<td><code>ensure_ascii</code></td>
+<td><code>bool</code></td>
+<td>Control whether to escape non-<code>ASCII</code> characters to <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> will keep the original characters, only effective when <code>format_json</code> is <code>True</code></td>
+<td><code>False</code></td>
+</tr>
+<tr>
+<td rowspan = "3"><code>save_to_json()</code></td>
+<td rowspan = "3">Save the result as a json formatted file</td>
+<td><code>save_path</code></td>
+<td><code>str</code></td>
+<td>Path to save the file. When a directory is given, the saved file is named consistently with the input file</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>indent</code></td>
+<td><code>int</code></td>
+<td>Specify the indentation level to beautify the output <code>JSON</code> data for better readability, only effective when <code>format_json</code> is <code>True</code></td>
+<td>4</td>
+</tr>
+<tr>
+<td><code>ensure_ascii</code></td>
+<td><code>bool</code></td>
+<td>Control whether to escape non-<code>ASCII</code> characters to <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> will keep the original characters, only effective when <code>format_json</code> is <code>True</code></td>
+<td><code>False</code></td>
+</tr>
+</table>
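+
+As an illustrative sketch, the formatting parameters listed above can be combined as follows, assuming `results` comes from a fresh `predict()` call as in the quick-start example and that `./output/` is a writable directory:
+
+```python
+for res in results:
+    # Pretty-print as indented JSON without escaping non-ASCII characters
+    res.print(format_json=True, indent=2, ensure_ascii=False)
+    # Save into a directory; the file name follows the input file name
+    res.save_to_json(save_path="./output/", indent=2, ensure_ascii=False)
+```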
+
+* Additionally, the prediction result can also be obtained through attributes, as follows (a short sketch follows the table):
+
+<table>
+<thead>
+<tr>
+<th>Attribute</th>
+<th>Description</th>
+</tr>
+</thead>
+<tr>
+<td rowspan = "1"><code>json</code></td>
+<td rowspan = "1">Get the prediction result in <code>json</code> format</td>
+</tr>
+</table>
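+
+For instance, under the same assumptions as above, each prediction can also be read back as a Python `dict` through the `json` attribute; the key layout shown below matches the printed result earlier in this tutorial.
+
+```python
+for res in results:
+    data = res.json                  # dict with the same content as res.print()
+    print(data["res"]["result"])     # the pipe-separated table string
+```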
+
+For more information on using the API for single model inference in PaddleX, you can refer to [PaddleX Single Model Python Script Usage Instructions](../../instructions/model_python_API.md).
+
+## IV. Secondary Development
+This module does not currently support fine-tuning training; only inference integration is supported. Support for fine-tuning training is planned for the future.

+ 225 - 0
docs/module_usage/tutorials/vlm_modules/chart_parsing.md

@@ -0,0 +1,225 @@
+---
+comments: true
+---
+
+# Chart Parsing Model Module Usage Tutorial
+
+## I. Overview
+Multimodal chart parsing is a cutting-edge technology in the OCR field that automatically converts various kinds of visual charts (bar charts, line charts, pie charts, etc.) into their underlying data tables and outputs them in a structured format. Traditional approaches rely on a complex pipeline of models such as chart key-point detectors, which involves many prior assumptions and lacks robustness. The models in this module instead use recent VLM technology and are fully data-driven, learning robust features from massive amounts of real-world data. Application scenarios include financial analysis, academic research, and business reports: for example, quickly extracting growth-trend data from financial statements, experimental comparison values from scientific papers, or user-distribution statistics from market research, helping users move from "viewing charts" to "using data".
+
+<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/main/images/modules/chart_parsing/chart_parsing_01.png"/>
+
+## II. Supported Model List
+
+
+<table>
+<tr>
+<th>Model</th><th>Model Download Link</th>
+<th>Model Storage Size (GB)</th>
+<th>Description</th>
+</tr>
+<tr>
+<td>PP-Chart2Table</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-Chart2Table_infer.tar">Inference Model</a></td>
+<td>1.4</td>
+<td>PP-Chart2Table is a multimodal model developed by the PaddlePaddle team that focuses on chart parsing and delivers outstanding performance on both Chinese and English chart parsing tasks. The team adopted a carefully designed data generation strategy to build a high-quality multimodal chart parsing dataset of nearly 700,000 samples, covering common chart types such as pie charts, bar charts, and stacked area charts, as well as a wide range of application scenarios. They also designed a two-stage training method that combines large-model distillation to make full use of massive amounts of unlabeled OOD data. In internal business tests on Chinese and English scenarios, PP-Chart2Table not only reached SOTA level among models of the same parameter scale, but also achieved accuracy comparable to 7B-parameter VLM models in key scenarios.</td>
+</tr>
+</table>
+
+
+## III. Quick Integration
+> ❗ Before quick integration, please install the PaddleX wheel package. For details, please refer to the [PaddleX Local Installation Tutorial](../../../installation/installation.md)
+
+After installing the wheel package, you can complete inference for the chart parsing model module in just a few lines of code. You can switch freely between the models in this module, and you can also integrate the module's model inference into your own project. Before running the following code, please download the [sample image](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/chart_parsing_02.png) locally.
+
+```python
+from paddlex import create_model
+model = create_model('PP-Chart2Table')
+results = model.predict(
+    input={"image": "chart_parsing_02.png"},
+    batch_size=1
+)
+for res in results:
+    res.print()
+    res.save_to_json("./output/res.json")
+```
+
+After running, the result is:
+
+```bash
+{'res': {'image': 'chart_parsing_02.png', 'result': '年份 | 单家五星级旅游饭店年平均营收 (百万元) | 单家五星级旅游饭店年平均利润 (百万元)\n2018 | 104.22 | 9.87\n2019 | 99.11 | 7.47\n2020 | 57.87 | -3.87\n2021 | 68.99 | -2.9\n2022 | 56.29 | -9.48\n2023 | 87.99 | 5.96'}}
+```
+The meanings of the result parameters are as follows:
+- `image`: the path of the input image
+- `result`: the parsing result predicted by the model, returned as a single string in which columns are separated by `|` and rows by `\n` (see the CSV sketch after the printed example below)
+
+The prediction result printed in readable form is as follows:
+
+```bash
+年份 | 单家五星级旅游饭店年平均营收 (百万元) | 单家五星级旅游饭店年平均利润 (百万元)
+2018 | 104.22 | 9.87
+2019 | 99.11 | 7.47
+2020 | 57.87 | -3.87
+2021 | 68.99 | -2.9
+2022 | 56.29 | -9.48
+2023 | 87.99 | 5.96
+```
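+
+Since `result` is a single string in which columns are separated by `|` and rows by `\n`, it can be converted to other tabular formats with a few lines of standard-library code. The following sketch is not part of the PaddleX API; the sample string is copied from the output above and the CSV file name is a placeholder.
+
+```python
+import csv
+
+# Sample string copied from the output above; replace it with your own result
+result_str = (
+    "年份 | 单家五星级旅游饭店年平均营收 (百万元) | 单家五星级旅游饭店年平均利润 (百万元)\n"
+    "2018 | 104.22 | 9.87\n"
+    "2019 | 99.11 | 7.47"
+)
+
+rows = [
+    [cell.strip() for cell in line.split("|")]
+    for line in result_str.split("\n") if line.strip()
+]
+
+# Write the header and data rows to a CSV file (placeholder file name)
+with open("chart_table.csv", "w", newline="", encoding="utf-8") as f:
+    csv.writer(f).writerows(rows)
+```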
+
+
+Related methods, parameters, and descriptions are as follows:
+
+* `create_model` instantiates the chart parsing model (here using `PP-Chart2Table` as an example), with specific explanations as follows:
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Options</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td><code>model_name</code></td>
+<td>Model name</td>
+<td><code>str</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>model_dir</code></td>
+<td>Model storage path</td>
+<td><code>str</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>device</code></td>
+<td>Model inference device</td>
+<td><code>str</code></td>
+<td>Supports specifying a specific GPU card, e.g. "gpu:0", a specific card of other hardware, e.g. "npu:0", or the CPU as "cpu".</td>
+<td><code>gpu:0</code></td>
+</tr>
+<tr>
+<td><code>use_hpip</code></td>
+<td>Whether to enable the high-performance inference plugin. Currently not supported.</td>
+<td><code>bool</code></td>
+<td>None</td>
+<td><code>False</code></td>
+</tr>
+<tr>
+<td><code>hpi_config</code></td>
+<td>High-performance inference configuration. Currently not supported.</td>
+<td><code>dict</code> | <code>None</code></td>
+<td>None</td>
+<td><code>None</code></td>
+</tr>
+</table>
+
+* `model_name` must be specified. Once `model_name` is specified, the PaddleX built-in model parameters are used by default; if `model_dir` is also specified, the user-defined model is used instead.
+
+* Call the `predict()` method of the chart parsing model for inference. The `predict()` method accepts the parameters `input` and `batch_size`, explained as follows (see the sketch after the table below):
+
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Options</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td><code>input</code></td>
+<td>Data to be predicted</td>
+<td><code>dict</code></td>
+<td>
+<code>Dict</code>. Since multimodal models have different input requirements, the format depends on the specific model. Specifically:
+<li>The input form for PP-Chart2Table is <code>{'image': image_path}</code></li>
+</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>batch_size</code></td>
+<td>Batch size</td>
+<td><code>int</code></td>
+<td>Integer</td>
+<td>1</td>
+</tr>
+</table>
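+
+As a sketch, several chart images can be processed by calling `predict()` once per image with the dict input form described above, assuming `model` was created with `create_model('PP-Chart2Table')` as shown earlier; the file names are placeholders.
+
+```python
+image_paths = ["chart_a.png", "chart_b.png"]  # placeholder file names
+
+for path in image_paths:
+    # PP-Chart2Table takes a dict of the form {'image': image_path}
+    for res in model.predict(input={"image": path}, batch_size=1):
+        res.print()
+```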
+
+* Process the prediction results. The prediction result for each sample is a corresponding Result object, which supports printing and saving as a `json` file:
+
+<table>
+<thead>
+<tr>
+<th>Method</th>
+<th>Description</th>
+<th>Parameter</th>
+<th>Type</th>
+<th>Description</th>
+<th>Default</th>
+</tr>
+</thead>
+<tr>
+<td rowspan = "3"><code>print()</code></td>
+<td rowspan = "3">打印结果到终端</td>
+<td><code>format_json</code></td>
+<td><code>bool</code></td>
+<td>是否对输出内容进行使用 <code>JSON</code> 缩进格式化</td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>indent</code></td>
+<td><code>int</code></td>
+<td>Specify the indentation level to beautify the output <code>JSON</code> data for better readability, only effective when <code>format_json</code> is <code>True</code></td>
+<td>4</td>
+</tr>
+<tr>
+<td><code>ensure_ascii</code></td>
+<td><code>bool</code></td>
+<td>Control whether to escape non-<code>ASCII</code> characters to <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> will keep the original characters, only effective when <code>format_json</code> is <code>True</code></td>
+<td><code>False</code></td>
+</tr>
+<tr>
+<td rowspan = "3"><code>save_to_json()</code></td>
+<td rowspan = "3">将结果保存为json格式的文件</td>
+<td><code>save_path</code></td>
+<td><code>str</code></td>
+<td>保存的文件路径,当为目录时,保存文件命名与输入文件类型命名一致</td>
+<td>无</td>
+</tr>
+<tr>
+<td><code>indent</code></td>
+<td><code>int</code></td>
+<td>Specify the indentation level to beautify the output <code>JSON</code> data for better readability, only effective when <code>format_json</code> is <code>True</code></td>
+<td>4</td>
+</tr>
+<tr>
+<td><code>ensure_ascii</code></td>
+<td><code>bool</code></td>
+<td>Control whether to escape non-<code>ASCII</code> characters to <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> will keep the original characters, only effective when <code>format_json</code> is <code>True</code></td>
+<td><code>False</code></td>
+</tr>
+</table>
+
+* Additionally, the prediction result can also be obtained through attributes, as follows:
+
+<table>
+<thead>
+<tr>
+<th>Attribute</th>
+<th>Description</th>
+</tr>
+</thead>
+<tr>
+<td rowspan = "1"><code>json</code></td>
+<td rowspan = "1">获取预测的<code>json</code>格式的结果</td>
+</tr>
+</table>
+
+
+For more information on using the single-model inference APIs in PaddleX, please refer to the [PaddleX Single Model Python Script Usage Instructions](../../instructions/model_python_API.md).
+
+## IV. Secondary Development
+This module does not currently support fine-tuning training; only inference integration is supported. Support for fine-tuning training is planned for the future.

+ 8 - 2
docs/module_usage/tutorials/vlm_modules/doc_vlm.en.md

@@ -21,6 +21,11 @@ The document visual-language model is a cutting-edge multimodal processing techn
 <td>PP-DocBee-7B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee-7B_infer.tar">Inference Model</a></td>
 <td>15.8</td>
 </tr>
+<tr>
+<td>PP-DocBee2-3B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee2-3B_infer.tar">Inference Model</a></td>
+<td>7.6</td>
+<td>PP-DocBee2 is a multimodal large model independently developed by the PaddlePaddle team, specifically tailored for document understanding. Building upon PP-DocBee, the team has further optimized the foundational model and introduced a new data optimization scheme to enhance data quality. With just a relatively small dataset of 470,000 samples generated using the team's proprietary data synthesis strategy, PP-DocBee2 demonstrates superior performance in Chinese document understanding tasks. In terms of internal business metrics for Chinese-language scenarios, PP-DocBee2 has achieved an approximately 11.4% improvement over PP-DocBee, outperforming both current popular open-source and closed-source models of a similar scale.</td>
+</tr>
 </table>
 
 ## 3. Quick Integration
@@ -143,7 +148,8 @@ The explanation of related methods and parameters are as follows:
 <td>Data to be predicted</td>
 <td><code>dict</code></td>
 <td>
-<code>Dict</code>, needs to be determined according to the specific model. For the PP-DocBee series, the input is {'image': image_path, 'query': query_text}
+<code>Dict</code>. Since multimodal models have different input requirements, the format depends on the specific model. Specifically:
+<li>The input format for the PP-DocBee series is <code>{'image': image_path, 'query': query_text}</code></li>
 </td>
 <td>None</td>
 </tr>
@@ -151,7 +157,7 @@ The explanation of related methods and parameters are as follows:
 <td><code>batch_size</code></td>
 <td>Batch size</td>
 <td><code>int</code></td>
-<td>Integer (currently only supports 1)</td>
+<td>Integer</td>
 <td>1</td>
 </tr>
 </table>

+ 8 - 2
docs/module_usage/tutorials/vlm_modules/doc_vlm.md

@@ -25,6 +25,11 @@ comments: true
 <td>PP-DocBee-7B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee-7B_infer.tar">推理模型</a></td>
 <td>15.8</td>
 </tr>
+<tr>
+<td>PP-DocBee2-3B</td><td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocBee2-3B_infer.tar">Inference Model</a></td>
+<td>7.6</td>
+<td>PP-DocBee2 is a multimodal large model developed by the PaddlePaddle team that focuses on document understanding. Building on PP-DocBee, it further optimizes the base model and introduces a new data optimization scheme to improve data quality. Using only 470,000 samples generated with the team's own data synthesis strategy, PP-DocBee2 performs even better on Chinese document understanding tasks. On internal business metrics for Chinese-language scenarios, PP-DocBee2 improves over PP-DocBee by about 11.4%, and also outperforms current popular open-source and closed-source models of a similar scale.</td>
+</tr>
 </table>
 
 
@@ -147,7 +152,8 @@ for res in results:
 <td>待预测数据</td>
 <td><code>dict</code></td>
 <td>
-<code>Dict</code>, 需要根据具体的模型确定,如PP-DocBee系列的输入为{'image': image_path, 'query': query_text}
+<code>Dict</code>. Since multimodal models have different input requirements, the format depends on the specific model. Specifically:
+<li>The input format for the PP-DocBee series is <code>{'image': image_path, 'query': query_text}</code></li>
 </td>
 <td>无</td>
 </tr>
@@ -155,7 +161,7 @@ for res in results:
 <td><code>batch_size</code></td>
 <td>批大小</td>
 <td><code>int</code></td>
-<td>整数(目前仅支持为1)</td>
+<td>Integer</td>
 <td>1</td>
 </tr>
 </table>