Merge pull request #3424 from myhloli/dev

Dev
Xiaomeng Zhao, 2 months ago
commit 05a9920ffe

+ 114 - 39
README.md

@@ -43,48 +43,122 @@
 </div>
 
 # Changelog
-- 2025/08/01 2.1.10 Released
-  - Fixed an issue in the `pipeline` backend where block overlap caused the parsing results to deviate from expectations #3232
-- 2025/07/30 2.1.9 Released
-  - `transformers` 4.54.1 version adaptation
-- 2025/07/28 2.1.8 Released
-  - `sglang` 0.4.9.post5 version adaptation
-- 2025/07/27 2.1.7 Released
-  - `transformers` 4.54.0 version adaptation
-- 2025/07/26 2.1.6 Released
-  - Fixed table parsing issues in handwritten documents when using `vlm` backend
-  - Fixed visualization box position drift issue when document is rotated #3175
-- 2025/07/24 2.1.5 Released
-  - `sglang` 0.4.9 version adaptation, synchronously upgrading the dockerfile base image to sglang 0.4.9.post3
-- 2025/07/23 2.1.4 Released
-  - Bug Fixes
-    - Fixed the issue of excessive memory consumption during the `MFR` step in the `pipeline` backend under certain scenarios #2771
-    - Fixed the inaccurate matching between `image`/`table` and `caption`/`footnote` under certain conditions #3129
-- 2025/07/16 2.1.1 Released
-  - Bug fixes
-    - Fixed text block content loss issue that could occur in certain `pipeline` scenarios #3005
-    - Fixed issue where `sglang-client` required unnecessary packages like `torch` #2968
-    - Updated `dockerfile` to fix incomplete text content parsing due to missing fonts in Linux #2915
-  - Usability improvements
-    - Updated `compose.yaml` to facilitate direct startup of `sglang-server`, `mineru-api`, and `mineru-gradio` services
-    - Launched brand new [online documentation site](https://opendatalab.github.io/MinerU/), simplified readme, providing better documentation experience
-- 2025/07/05 Version 2.1.0 Released
-  - This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The detailed update contents are as follows:
-  - **Performance Optimizations:**
-    - Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).
-    - Greatly enhanced post-processing speed when the `pipeline` backend handles batch processing of documents with fewer pages (<10 pages).
-    - Layout analysis speed of the `pipeline` backend has been increased by approximately 20%.
-  - **Experience Enhancements:**
-    - Built-in ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](https://opendatalab.github.io/MinerU/usage/quick_usage/#advanced-usage-via-api-webui-sglang-clientserver).
-    - Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer).
-    - Added transparent parameter passing for all commands related to `sglang`, allowing the `sglang-engine` backend to receive all `sglang` parameters consistently with the `sglang-server`.
-    - Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](https://opendatalab.github.io/MinerU/usage/quick_usage/#extending-mineru-functionality-with-configuration-files).
-  - **New Features:**
-    - Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
-    - Introduced limited support for vertical text layout in the `pipeline` backend.
+
+- 2025/09/05 2.2.0 Released
+  - Major Updates
+    - In this version, we focused on improving table parsing accuracy by introducing a new [wired table recognition model](https://github.com/RapidAI/TableStructureRec) and a brand-new hybrid table structure parsing algorithm, significantly enhancing the table recognition capabilities of the `pipeline` backend.
+    - We also added support for cross-page table merging, which is supported by both `pipeline` and `vlm` backends, further improving the completeness and accuracy of table parsing.
+  - Other Updates
+    - The `pipeline` backend now supports 270-degree rotated table parsing, bringing support for table parsing in 0/90/270-degree orientations
+    - `pipeline` added OCR capability support for Thai and Greek, and updated the English OCR model to the latest version. English recognition accuracy improved by 11%, Thai recognition model accuracy is 82.68%, and Greek recognition model accuracy is 89.28% (by PPOCRv5)
+    - Added `bbox` field (mapped to 0-1000 range) in the output `content_list.json`, making it convenient for users to directly obtain position information for each content block
+
 
 <details>
   <summary>History Log</summary>
+
+  <details>
+    <summary>2025/08/01 2.1.10 Released</summary>
+    <ul>
+      <li>Fixed an issue in the <code>pipeline</code> backend where block overlap caused the parsing results to deviate from expectations #3232</li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/30 2.1.9 Released</summary>
+    <ul>
+      <li><code>transformers</code> 4.54.1 version adaptation</li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/28 2.1.8 Released</summary>
+    <ul>
+      <li><code>sglang</code> 0.4.9.post5 version adaptation</li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/27 2.1.7 Released</summary>
+    <ul>
+      <li><code>transformers</code> 4.54.0 version adaptation</li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/26 2.1.6 Released</summary>
+    <ul>
+      <li>Fixed table parsing issues in handwritten documents when using <code>vlm</code> backend</li>
+      <li>Fixed visualization box position drift issue when document is rotated #3175</li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/24 2.1.5 Released</summary>
+    <ul>
+      <li><code>sglang</code> 0.4.9 version adaptation, synchronously upgrading the dockerfile base image to sglang 0.4.9.post3</li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/23 2.1.4 Released</summary>
+    <ul>
+      <li><strong>Bug Fixes</strong>
+        <ul>
+          <li>Fixed the issue of excessive memory consumption during the <code>MFR</code> step in the <code>pipeline</code> backend under certain scenarios #2771</li>
+          <li>Fixed the inaccurate matching between <code>image</code>/<code>table</code> and <code>caption</code>/<code>footnote</code> under certain conditions #3129</li>
+        </ul>
+      </li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/16 2.1.1 Released</summary>
+    <ul>
+      <li><strong>Bug fixes</strong>
+        <ul>
+          <li>Fixed text block content loss issue that could occur in certain <code>pipeline</code> scenarios #3005</li>
+          <li>Fixed issue where <code>sglang-client</code> required unnecessary packages like <code>torch</code> #2968</li>
+          <li>Updated <code>dockerfile</code> to fix incomplete text content parsing due to missing fonts in Linux #2915</li>
+        </ul>
+      </li>
+      <li><strong>Usability improvements</strong>
+        <ul>
+          <li>Updated <code>compose.yaml</code> to facilitate direct startup of <code>sglang-server</code>, <code>mineru-api</code>, and <code>mineru-gradio</code> services</li>
+          <li>Launched brand new <a href="https://opendatalab.github.io/MinerU/">online documentation site</a>, simplified readme, providing better documentation experience</li>
+        </ul>
+      </li>
+    </ul>
+  </details>  
+
+  <details>
+    <summary>2025/07/05 2.1.0 Released</summary>
+    <ul>
+      <li>This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The detailed update contents are as follows:</li>
+      <li><strong>Performance Optimizations:</strong>
+        <ul>
+          <li>Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).</li>
+          <li>Greatly enhanced post-processing speed when the <code>pipeline</code> backend handles batch processing of documents with fewer pages (&lt;10 pages).</li>
+          <li>Layout analysis speed of the <code>pipeline</code> backend has been increased by approximately 20%.</li>
+        </ul>
+      </li>
+      <li><strong>Experience Enhancements:</strong>
+        <ul>
+          <li>Built-in ready-to-use <code>fastapi service</code> and <code>gradio webui</code>. For detailed usage instructions, please refer to <a href="https://opendatalab.github.io/MinerU/usage/quick_usage/#advanced-usage-via-api-webui-sglang-clientserver">Documentation</a>.</li>
+          <li>Adapted to <code>sglang</code> version <code>0.4.8</code>, significantly reducing the GPU memory requirements for the <code>vlm-sglang</code> backend. It can now run on graphics cards with as little as <code>8GB GPU memory</code> (Turing architecture or newer).</li>
+          <li>Added transparent parameter passing for all commands related to <code>sglang</code>, allowing the <code>sglang-engine</code> backend to receive all <code>sglang</code> parameters consistently with the <code>sglang-server</code>.</li>
+          <li>Supports feature extensions based on configuration files, including <code>custom formula delimiters</code>, <code>enabling heading classification</code>, and <code>customizing local model directories</code>. For detailed usage instructions, please refer to <a href="https://opendatalab.github.io/MinerU/usage/quick_usage/#extending-mineru-functionality-with-configuration-files">Documentation</a>.</li>
+        </ul>
+      </li>
+      <li><strong>New Features:</strong>
+        <ul>
+          <li>Updated the <code>pipeline</code> backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. <a href="https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html">Details</a></li>
+          <li>Introduced limited support for vertical text layout in the <code>pipeline</code> backend.</li>
+        </ul>
+      </li>
+    </ul>
+  </details>
+
   <details>
     <summary>2025/06/20 2.0.6 Released</summary>
     <ul>
@@ -596,6 +670,7 @@ Currently, some models in this project are trained based on YOLO. However, since
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [UniMERNet](https://github.com/opendatalab/UniMERNet)
 - [RapidTable](https://github.com/RapidAI/RapidTable)
+- [TableStructureRec](https://github.com/RapidAI/TableStructureRec)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)
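
The 2.2.0 changelog entry above says the new `bbox` field in `content_list.json` is mapped to a 0-1000 range. As a minimal sketch of what that means for a consumer, the snippet below scales such a box back to pixel coordinates on a page image the reader has rendered themselves. Only `bbox` and `page_idx` come from the diff. The helper name `bbox_to_pixels`, the sample entries, and their `type` and `text` keys are illustrative assumptions, not MinerU API.

```python
import json


def bbox_to_pixels(bbox, image_width, image_height):
    """Scale a 0-1000 normalized [x0, y0, x1, y1] box to pixel coordinates."""
    x0, y0, x1, y1 = bbox
    return [
        round(x0 * image_width / 1000),
        round(y0 * image_height / 1000),
        round(x1 * image_width / 1000),
        round(y1 * image_height / 1000),
    ]


# Hypothetical excerpt of a content_list.json file (made-up values).
sample = json.loads("""
[
  {"type": "text", "text": "Example paragraph", "bbox": [83, 125, 500, 500], "page_idx": 0},
  {"type": "text", "text": "Block without a box", "page_idx": 0}
]
""")

# Assume the page was rendered at 1200 x 1600 pixels.
pixel_boxes = [
    bbox_to_pixels(entry["bbox"], 1200, 1600)
    for entry in sample
    if "bbox" in entry  # the diff only writes bbox when the block has one
]
print(pixel_boxes)
```

Because the stored values are already truncated to integers, the round trip is only approximate; a box recovered this way can differ from the original by a pixel or two.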

+ 114 - 39
README_zh-CN.md

@@ -43,48 +43,122 @@
 </div>
 
 # 更新记录
-- 2025/08/01 2.1.10 发布
-  - 修复`pipeline`后端因block覆盖导致的解析结果与预期不符  #3232
-- 2025/07/30 2.1.9 发布
-  - `transformers` 4.54.1 版本适配
-- 2025/07/28 2.1.8 发布
-  - `sglang` 0.4.9.post5 版本适配
-- 2025/07/27 2.1.7 发布
-  - `transformers` 4.54.0 版本适配
-- 2025/07/26 2.1.6 发布
-  - 修复`vlm`后端解析部分手写文档时的表格异常问题
-  - 修复文档旋转时可视化框位置漂移问题 #3175
-- 2025/07/24 2.1.5 发布
-  - `sglang` 0.4.9 版本适配,同步升级dockerfile基础镜像为sglang 0.4.9.post3
-- 2025/07/23 2.1.4 发布
-  - bug修复
-    - 修复`pipeline`后端中`MFR`步骤在某些情况下显存消耗过大的问题 #2771
-    - 修复某些情况下`image`/`table`与`caption`/`footnote`匹配不准确的问题 #3129
-- 2025/07/16 2.1.1 发布
-  - bug修复 
-    - 修复`pipeline`在某些情况可能发生的文本块内容丢失问题 #3005
-    - 修复`sglang-client`需要安装`torch`等不必要的包的问题 #2968
-    - 更新`dockerfile`以修复linux字体缺失导致的解析文本内容不完整问题 #2915
-  - 易用性更新
-    - 更新`compose.yaml`,便于用户直接启动`sglang-server`、`mineru-api`、`mineru-gradio`服务
-    - 启用全新的[在线文档站点](https://opendatalab.github.io/MinerU/zh/),简化readme,提供更好的文档体验
-- 2025/07/05 2.1.0 发布
-  - 这是 MinerU 2 的第一个大版本更新,包含了大量新功能和改进,包含众多性能优化、体验优化和bug修复,具体更新内容如下: 
-  - 性能优化: 
-    - 大幅提升某些特定分辨率(长边2000像素左右)文档的预处理速度
-    - 大幅提升`pipeline`后端批量处理大量页数较少(<10)文档时的后处理速度
-    - `pipeline`后端的layout分析速度提升约20%
-  - 体验优化:
-    - 内置开箱即用的`fastapi服务`和`gradio webui`,详细使用方法请参考[文档](https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#apiwebuisglang-clientserver)
-    - `sglang`适配`0.4.8`版本,大幅降低`vlm-sglang`后端的显存要求,最低可在`8G显存`(Turing及以后架构)的显卡上运行
-    - 对所有命令增加`sglang`的参数透传,使得`sglang-engine`后端可以与`sglang-server`一致,接收`sglang`的所有参数
-    - 支持基于配置文件的功能扩展,包含`自定义公式标识符`、`开启标题分级功能`、`自定义本地模型目录`,详细使用方法请参考[文档](https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#mineru_1)
-  - 新特性:  
-    - `pipeline`后端更新 PP-OCRv5 多语种文本识别模型,支持法语、西班牙语、葡萄牙语、俄语、韩语等 37 种语言的文字识别,平均精度涨幅超30%。[详情](https://paddlepaddle.github.io/PaddleOCR/latest/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
-    - `pipeline`后端增加对竖排文本的有限支持
+
+- 2025/09/05 2.2.0 发布
+  - 主要更新
+    - 在这个版本我们重点提升了表格的解析精度,通过引入新的[有线表识别模型](https://github.com/RapidAI/TableStructureRec)和全新的混合表格结构解析算法,显著提升了`pipeline`后端的表格识别能力。
+    - 另外我们增加了对跨页表格合并的支持,这一功能同时支持`pipeline`和`vlm`后端,进一步提升了表格解析的完整性和准确性。
+  - 其他更新
+    - `pipeline`后端增加270度旋转的表格解析能力,现已支持0/90/270度三个方向的表格解析
+    - `pipeline`增加对泰文、希腊文的ocr能力支持,并更新了英文ocr模型至最新,英文识别精度提升11%,泰文识别模型精度 82.68%,希腊文识别模型精度 89.28%(by PPOCRv5)
+    - 在输出的`content_list.json`中增加了`bbox`字段(映射至0-1000范围内),方便用户直接获取每个内容块的位置信息
+
 
 <details>
   <summary>历史日志</summary>
+
+  <details>
+    <summary>2025/08/01 2.1.10 发布</summary>
+    <ul>
+      <li>修复<code>pipeline</code>后端因block覆盖导致的解析结果与预期不符 #3232</li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/30 2.1.9 发布</summary>
+    <ul>
+      <li><code>transformers</code> 4.54.1 版本适配</li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/28 2.1.8 发布</summary>
+    <ul>
+      <li><code>sglang</code> 0.4.9.post5 版本适配</li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/27 2.1.7 发布</summary>
+    <ul>
+      <li><code>transformers</code> 4.54.0 版本适配</li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/26 2.1.6 发布</summary>
+    <ul>
+      <li>修复<code>vlm</code>后端解析部分手写文档时的表格异常问题</li>
+      <li>修复文档旋转时可视化框位置漂移问题 #3175</li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/24 2.1.5 发布</summary>
+    <ul>
+      <li><code>sglang</code> 0.4.9 版本适配,同步升级dockerfile基础镜像为sglang 0.4.9.post3</li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/23 2.1.4 发布</summary>
+    <ul>
+      <li><strong>bug修复</strong>
+        <ul>
+          <li>修复<code>pipeline</code>后端中<code>MFR</code>步骤在某些情况下显存消耗过大的问题 #2771</li>
+          <li>修复某些情况下<code>image</code>/<code>table</code>与<code>caption</code>/<code>footnote</code>匹配不准确的问题 #3129</li>
+        </ul>
+      </li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/16 2.1.1 发布</summary>
+    <ul>
+      <li><strong>bug修复</strong>
+        <ul>
+          <li>修复<code>pipeline</code>在某些情况可能发生的文本块内容丢失问题 #3005</li>
+          <li>修复<code>sglang-client</code>需要安装<code>torch</code>等不必要的包的问题 #2968</li>
+          <li>更新<code>dockerfile</code>以修复linux字体缺失导致的解析文本内容不完整问题 #2915</li>
+        </ul>
+      </li>
+      <li><strong>易用性更新</strong>
+        <ul>
+          <li>更新<code>compose.yaml</code>,便于用户直接启动<code>sglang-server</code>、<code>mineru-api</code>、<code>mineru-gradio</code>服务</li>
+          <li>启用全新的<a href="https://opendatalab.github.io/MinerU/zh/">在线文档站点</a>,简化readme,提供更好的文档体验</li>
+        </ul>
+      </li>
+    </ul>
+  </details>
+
+  <details>
+    <summary>2025/07/05 2.1.0 发布</summary>
+    <p>这是 MinerU 2 的第一个大版本更新,包含了大量新功能和改进,包含众多性能优化、体验优化和bug修复,具体更新内容如下:</p>
+    <ul>
+      <li><strong>性能优化:</strong>
+        <ul>
+          <li>大幅提升某些特定分辨率(长边2000像素左右)文档的预处理速度</li>
+          <li>大幅提升<code>pipeline</code>后端批量处理大量页数较少(&lt;10)文档时的后处理速度</li>
+          <li><code>pipeline</code>后端的layout分析速度提升约20%</li>
+        </ul>
+      </li>
+      <li><strong>体验优化:</strong>
+        <ul>
+          <li>内置开箱即用的<code>fastapi服务</code>和<code>gradio webui</code>,详细使用方法请参考<a href="https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#apiwebuisglang-clientserver">文档</a></li>
+          <li><code>sglang</code>适配<code>0.4.8</code>版本,大幅降低<code>vlm-sglang</code>后端的显存要求,最低可在<code>8G显存</code>(Turing及以后架构)的显卡上运行</li>
+          <li>对所有命令增加<code>sglang</code>的参数透传,使得<code>sglang-engine</code>后端可以与<code>sglang-server</code>一致,接收<code>sglang</code>的所有参数</li>
+          <li>支持基于配置文件的功能扩展,包含<code>自定义公式标识符</code>、<code>开启标题分级功能</code>、<code>自定义本地模型目录</code>,详细使用方法请参考<a href="https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#mineru_1">文档</a></li>
+        </ul>
+      </li>
+      <li><strong>新特性:</strong>
+        <ul>
+          <li><code>pipeline</code>后端更新 PP-OCRv5 多语种文本识别模型,支持法语、西班牙语、葡萄牙语、俄语、韩语等 37 种语言的文字识别,平均精度涨幅超30%。<a href="https://paddlepaddle.github.io/PaddleOCR/latest/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html">详情</a></li>
+          <li><code>pipeline</code>后端增加对竖排文本的有限支持</li>
+        </ul>
+      </li>
+    </ul>
+  </details>
+
   <details>
     <summary>2025/06/20 2.0.6发布</summary>
     <ul>
@@ -584,6 +658,7 @@ mineru -p <input_path> -o <output_path>
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [UniMERNet](https://github.com/opendatalab/UniMERNet)
 - [RapidTable](https://github.com/RapidAI/RapidTable)
+- [TableStructureRec](https://github.com/RapidAI/TableStructureRec)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

File diff suppressed because it is too large
+ 2 - 17
docs/en/reference/output_files.md


+ 1 - 1
docs/en/usage/cli_tools.md

@@ -13,7 +13,7 @@ Options:
   -m, --method [auto|txt|ocr]     Parsing method: auto (default), txt, ocr (pipeline backend only)
   -b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
                                   Parsing backend (default: pipeline)
-  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
+  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|th|el|latin|arabic|east_slavic|cyrillic|devanagari]
                                   Specify document language (improves OCR accuracy, pipeline backend only)
   -u, --url TEXT                  Service address when using sglang-client
   -s, --start INTEGER             Starting page number for parsing (0-based)

File diff suppressed because it is too large
+ 2 - 17
docs/zh/reference/output_files.md


+ 1 - 1
docs/zh/usage/cli_tools.md

@@ -13,7 +13,7 @@ Options:
   -m, --method [auto|txt|ocr]     解析方法:auto(默认)、txt、ocr(仅用于 pipeline 后端)
   -b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
                                   解析后端(默认为 pipeline)
-  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
+  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|th|el|latin|arabic|east_slavic|cyrillic|devanagari]
                                   指定文档语言(可提升 OCR 准确率,仅用于 pipeline 后端)
   -u, --url TEXT                  当使用 sglang-client 时,需指定服务地址
   -s, --start INTEGER             开始解析的页码(从 0 开始)

+ 14 - 2
mineru/backend/pipeline/pipeline_middle_json_mkcontent.py

@@ -188,7 +188,7 @@ def merge_para_with_text(para_block):
     return para_text
 
 
-def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
+def make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size):
     para_type = para_block['type']
     para_content = {}
     if para_type in [BlockType.TEXT, BlockType.LIST, BlockType.INDEX]:
@@ -245,6 +245,17 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
             if block['type'] == BlockType.TABLE_FOOTNOTE:
                 para_content[BlockType.TABLE_FOOTNOTE].append(merge_para_with_text(block))
 
+    page_weight, page_height = page_size
+    para_bbox = para_block.get('bbox')
+    if para_bbox:
+        x0, y0, x1, y1 = para_bbox
+        para_content['bbox'] = [
+            int(x0 * 1000 / page_weight),
+            int(y0 * 1000 / page_height),
+            int(x1 * 1000 / page_weight),
+            int(y1 * 1000 / page_height),
+        ]
+
     para_content['page_idx'] = page_idx
 
     return para_content
@@ -258,6 +269,7 @@ def union_make(pdf_info_dict: list,
     for page_info in pdf_info_dict:
         paras_of_layout = page_info.get('para_blocks')
         page_idx = page_info.get('page_idx')
+        page_size = page_info.get('page_size')
         if not paras_of_layout:
             continue
         if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
@@ -265,7 +277,7 @@ def union_make(pdf_info_dict: list,
             output_content.extend(page_markdown)
         elif make_mode == MakeMode.CONTENT_LIST:
             for para_block in paras_of_layout:
-                para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx)
+                para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size)
                 if para_content:
                     output_content.append(para_content)
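
The hunks above normalize each block's `bbox` against the page size. The following is a standalone sketch of that arithmetic, not the project's code: the function name `normalize_bbox` is hypothetical, the variable the diff calls `page_weight` is assumed to mean the page width, and the guard against a missing `page_size` is an addition made here (the diff unpacks `page_size` unconditionally, and `page_info.get('page_size')` could in principle return `None`).

```python
def normalize_bbox(bbox, page_size):
    """Map an absolute [x0, y0, x1, y1] box into the 0-1000 range.

    Returns None when either input is missing. This guard is an
    assumption of the sketch, not behaviour shown in the diff.
    """
    if not bbox or not page_size:
        return None
    page_width, page_height = page_size
    x0, y0, x1, y1 = bbox
    # int() truncates toward zero, matching the int(...) calls in the diff.
    return [
        int(x0 * 1000 / page_width),
        int(y0 * 1000 / page_height),
        int(x1 * 1000 / page_width),
        int(y1 * 1000 / page_height),
    ]


print(normalize_bbox([50, 100, 300, 400], (600, 800)))
```

Normalizing to a fixed range makes the coordinates independent of the page's native units, so a consumer can scale them to any rendering resolution.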
 

+ 6 - 2
mineru/backend/vlm/hf_predictor.py

@@ -4,7 +4,7 @@ from typing import Iterable, List, Optional, Union
 import torch
 from PIL import Image
 from tqdm import tqdm
-from transformers import AutoTokenizer, BitsAndBytesConfig
+from transformers import AutoTokenizer, BitsAndBytesConfig, __version__
 
 from ...model.vlm_hf_model import Mineru2QwenForCausalLM
 from ...model.vlm_hf_model.image_processing_mineru2 import process_images
@@ -66,7 +66,11 @@ class HuggingfacePredictor(BasePredictor):
                 bnb_4bit_quant_type="nf4",
             )
         else:
-            kwargs["torch_dtype"] = torch_dtype
+            from packaging import version
+            if version.parse(__version__) >= version.parse("4.56.0"):
+                kwargs["dtype"] = torch_dtype
+            else:
+                kwargs["torch_dtype"] = torch_dtype
 
         if use_flash_attn:
             kwargs["attn_implementation"] = "flash_attention_2"
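
The hunk above picks the keyword name (`dtype` or `torch_dtype`) based on the installed `transformers` version. The sketch below isolates that gating pattern so it can be run without `transformers` installed. It is a simplification under stated assumptions: the helper names are hypothetical, and it uses a small stdlib parser in place of `packaging.version`, so pre-release ordering (for example `4.56.0.dev0` sorting before `4.56.0`) is not handled.

```python
import re


def release_tuple(version_string):
    """Return the leading numeric release as a 3-tuple, e.g. '4.56' -> (4, 56, 0).

    Simplified stand-in for packaging.version.parse: suffixes such as
    '.dev0' or 'rc1' are ignored rather than ordered.
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))[:3]
    return parts + (0,) * (3 - len(parts))


def dtype_kwargs(transformers_version, torch_dtype):
    """Choose the keyword name by version, mirroring the branch in the diff."""
    if release_tuple(transformers_version) >= (4, 56, 0):
        return {"dtype": torch_dtype}
    return {"torch_dtype": torch_dtype}


print(dtype_kwargs("4.55.4", "float16"))
print(dtype_kwargs("4.56.0", "float16"))
```

In real code `packaging.version`, as used in the commit, is the safer choice because it implements the full version-ordering rules.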

+ 5 - 2
mineru/backend/vlm/token_to_middle_json.py

@@ -1,8 +1,9 @@
+import os
 import time
 from loguru import logger
 import numpy as np
 import cv2
-from mineru.utils.config_reader import get_llm_aided_config
+from mineru.utils.config_reader import get_llm_aided_config, get_table_enable
 from mineru.utils.cut_image import cut_image_and_table
 from mineru.utils.enum_class import ContentType
 from mineru.utils.hash_utils import str_md5
@@ -94,7 +95,9 @@ def result_to_middle_json(token_list, images_list, pdf_doc, image_writer):
         middle_json["pdf_info"].append(page_info)
 
     """表格跨页合并"""
-    merge_table(middle_json["pdf_info"])
+    table_enable = get_table_enable(os.getenv('MINERU_VLM_TABLE_ENABLE', 'True').lower() == 'true')
+    if table_enable:
+        merge_table(middle_json["pdf_info"])
 
     """llm优化标题分级"""
     if heading_level_import_success:
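
The hunk above gates cross-page table merging on the `MINERU_VLM_TABLE_ENABLE` environment variable. The sketch below reproduces only the string comparison from the diff, to show which values count as enabled. The helper `env_flag` and its `environ` parameter are hypothetical, and whatever `get_table_enable` does with the resulting boolean is not shown in the diff and is not modelled here.

```python
import os


def env_flag(name, default="True", environ=None):
    """Return True only when the variable equals 'true', ignoring case.

    Mirrors os.getenv(name, 'True').lower() == 'true' from the diff.
    The environ parameter exists only so the sketch can be tested
    without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name, default).lower() == "true"


print(env_flag("MINERU_VLM_TABLE_ENABLE", environ={}))
print(env_flag("MINERU_VLM_TABLE_ENABLE", environ={"MINERU_VLM_TABLE_ENABLE": "1"}))
```

One consequence of this comparison is that common truthy spellings such as `1` or `yes` are read as disabled; only `true` in any letter case, or leaving the variable unset, enables the flag.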

+ 14 - 2
mineru/backend/vlm/vlm_middle_json_mkcontent.py

@@ -125,7 +125,7 @@ def mk_blocks_to_markdown(para_blocks, make_mode, formula_enable, table_enable,
 
 
 
-def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
+def make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size):
     para_type = para_block['type']
     para_content = {}
     if para_type in [BlockType.TEXT, BlockType.LIST, BlockType.INDEX]:
@@ -179,6 +179,17 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
             if block['type'] == BlockType.TABLE_FOOTNOTE:
                 para_content[BlockType.TABLE_FOOTNOTE].append(merge_para_with_text(block))
 
+    page_weight, page_height = page_size
+    para_bbox = para_block.get('bbox')
+    if para_bbox:
+        x0, y0, x1, y1 = para_bbox
+        para_content['bbox'] = [
+            int(x0 * 1000 / page_weight),
+            int(y0 * 1000 / page_height),
+            int(x1 * 1000 / page_weight),
+            int(y1 * 1000 / page_height),
+        ]
+
     para_content['page_idx'] = page_idx
 
     return para_content
@@ -195,6 +206,7 @@ def union_make(pdf_info_dict: list,
     for page_info in pdf_info_dict:
         paras_of_layout = page_info.get('para_blocks')
         page_idx = page_info.get('page_idx')
+        page_size = page_info.get('page_size')
         if not paras_of_layout:
             continue
         if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
@@ -202,7 +214,7 @@ def union_make(pdf_info_dict: list,
             output_content.extend(page_markdown)
         elif make_mode == MakeMode.CONTENT_LIST:
             for para_block in paras_of_layout:
-                para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx)
+                para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size)
                 output_content.append(para_content)
 
     if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:

+ 1 - 1
mineru/cli/client.py

@@ -62,7 +62,7 @@ from .common import do_parse, read_fn, pdf_suffixes, image_suffixes
     '-l',
     '--lang',
     'lang',
-    type=click.Choice(['ch', 'ch_server', 'ch_lite', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka',
+    type=click.Choice(['ch', 'ch_server', 'ch_lite', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'th', 'el',
                        'latin', 'arabic', 'east_slavic', 'cyrillic', 'devanagari']),
     help="""
     Input the languages in the pdf (if known) to improve OCR accuracy.  Optional.

Some files were not shown because too many files changed in this diff