
Merge pull request #2891 from myhloli/dev

Dev
Xiaomeng Zhao 4 months ago
parent
commit
37eff6859a

+ 156 - 76
README.md

@@ -46,39 +46,81 @@
 </div>
 
 # Changelog
-- 2025/06/20 2.0.6 Released
-  - Fixed occasional parsing interruptions caused by invalid block content in `vlm` mode
-  - Fixed parsing interruptions caused by incomplete table structures in `vlm` mode
-- 2025/06/17 2.0.5 Released 
-  - Fixed the issue where models were still required to be downloaded in the `sglang-client` mode  
-  - Fixed the issue where the `sglang-client` mode unnecessarily depended on packages like `torch` during runtime.
-  - Fixed the issue where only the first instance would take effect when attempting to launch multiple `sglang-client` instances via multiple URLs within the same process
-- 2025/06/15 2.0.3 released
-  - Fixed a configuration file key-value update error that occurred when downloading model type was set to `all`
-  - Fixed the issue where the formula and table feature toggle switches were not working in `command line mode`, causing the features to remain enabled.
-  - Fixed compatibility issues with sglang version 0.4.7 in the `sglang-engine` mode.
-  - Updated Dockerfile and installation documentation for deploying the full version of MinerU in sglang environment
-- 2025/06/13 2.0.0 Released
-  - MinerU 2.0 represents a comprehensive reconstruction and upgrade from architecture to functionality, delivering a more streamlined design, enhanced performance, and more flexible user experience.
-    - **New Architecture**: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
-      - **Removal of Third-party Dependency Limitations**: Completely eliminated the dependency on `pymupdf`, moving the project toward a more open and compliant open-source direction.
-      - **Ready-to-use, Easy Configuration**: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.
-      - **Automatic Model Management**: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.
-      - **Offline Deployment Friendly**: Provides built-in model download commands, supporting deployment requirements in completely offline environments.
-      - **Streamlined Code Structure**: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.
-      - **Unified Intermediate Format Output**: Adopted standardized `middle_json` format, compatible with most secondary development scenarios based on this format, ensuring seamless ecosystem business migration.
-    - **New Model**: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
-      - **Small Model, Big Capabilities**: With parameters under 1B, yet surpassing traditional 72B-level vision-language models (VLMs) in parsing accuracy.
-      - **Multiple Functions in One**: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.
-      - **Ultimate Inference Speed**: Achieves peak throughput exceeding 10,000 tokens/s through `sglang` acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.
-      - **Online Experience**: You can experience our brand-new VLM model on [MinerU.net](https://mineru.net/OpenSourceTools/Extractor), [Hugging Face](https://huggingface.co/spaces/opendatalab/MinerU), and [ModelScope](https://www.modelscope.cn/studios/OpenDataLab/MinerU).
-    - **Incompatible Changes Notice**: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
-      - Python package name changed from `magic-pdf` to `mineru`, and the command-line tool changed from `magic-pdf` to `mineru`. Please update your scripts and command calls accordingly.
-      - For modular system design and ecosystem consistency considerations, MinerU 2.0 no longer includes the LibreOffice document conversion module. If you need to process Office documents, we recommend converting them to PDF format through an independently deployed LibreOffice service before proceeding with subsequent parsing operations.
+
+- 2025/07/05 Version 2.1.0 Released
+  - This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The details are as follows:
+  - **Performance Optimizations:**
+    - Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).
+    - Greatly enhanced post-processing speed when the `pipeline` backend handles batch processing of documents with fewer pages (<10 pages).
+    - Layout analysis speed of the `pipeline` backend has been increased by approximately 20%.
+  - **Experience Enhancements:**
+    - Built-in ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](#3-api-calls-or-visual-invocation).
+    - Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer).
+    - Added transparent parameter passing for all commands related to `sglang`, allowing the `sglang-engine` backend to receive all `sglang` parameters, consistent with `sglang-server`.
+    - Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](#4-extending-mineru-functionality-through-configuration-files).
+  - **New Features:**
+    - Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
+    - Introduced limited support for vertical text layout in the `pipeline` backend.
 
 <details>
   <summary>History Log</summary>
   <details>
+    <summary>2025/06/20 2.0.6 Released</summary>
+    <ul>
+      <li>Fixed occasional parsing interruptions caused by invalid block content in <code>vlm</code> mode</li>
+      <li>Fixed parsing interruptions caused by incomplete table structures in <code>vlm</code> mode</li>
+    </ul>
+  </details>
+  
+  <details>
+    <summary>2025/06/17 2.0.5 Released</summary>
+    <ul>
+      <li>Fixed the issue where models were still required to be downloaded in the <code>sglang-client</code> mode</li>
+      <li>Fixed the issue where the <code>sglang-client</code> mode unnecessarily depended on packages like <code>torch</code> during runtime.</li>
+      <li>Fixed the issue where only the first instance would take effect when attempting to launch multiple <code>sglang-client</code> instances via multiple URLs within the same process</li>
+    </ul>
+  </details>
+  
+  <details>
+    <summary>2025/06/15 2.0.3 released</summary>
+    <ul>
+      <li>Fixed a configuration file key-value update error that occurred when downloading model type was set to <code>all</code></li>
+      <li>Fixed the issue where the formula and table feature toggle switches were not working in <code>command line mode</code>, causing the features to remain enabled.</li>
+      <li>Fixed compatibility issues with sglang version 0.4.7 in the <code>sglang-engine</code> mode.</li>
+      <li>Updated Dockerfile and installation documentation for deploying the full version of MinerU in sglang environment</li>
+    </ul>
+  </details>
+  
+  <details>
+    <summary>2025/06/13 2.0.0 Released</summary>
+    <ul>
+      <li><strong>New Architecture</strong>: MinerU 2.0 has been deeply restructured in code organization and interaction methods, significantly improving system usability, maintainability, and extensibility.
+        <ul>
+          <li><strong>Removal of Third-party Dependency Limitations</strong>: Completely eliminated the dependency on <code>pymupdf</code>, moving the project toward a more open and compliant open-source direction.</li>
+          <li><strong>Ready-to-use, Easy Configuration</strong>: No need to manually edit JSON configuration files; most parameters can now be set directly via command line or API.</li>
+          <li><strong>Automatic Model Management</strong>: Added automatic model download and update mechanisms, allowing users to complete model deployment without manual intervention.</li>
+          <li><strong>Offline Deployment Friendly</strong>: Provides built-in model download commands, supporting deployment requirements in completely offline environments.</li>
+          <li><strong>Streamlined Code Structure</strong>: Removed thousands of lines of redundant code, simplified class inheritance logic, significantly improving code readability and development efficiency.</li>
+          <li><strong>Unified Intermediate Format Output</strong>: Adopted standardized <code>middle_json</code> format, compatible with most secondary development scenarios based on this format, ensuring seamless ecosystem business migration.</li>
+        </ul>
+      </li>
+      <li><strong>New Model</strong>: MinerU 2.0 integrates our latest small-parameter, high-performance multimodal document parsing model, achieving end-to-end high-speed, high-precision document understanding.
+        <ul>
+          <li><strong>Small Model, Big Capabilities</strong>: With parameters under 1B, yet surpassing traditional 72B-level vision-language models (VLMs) in parsing accuracy.</li>
+          <li><strong>Multiple Functions in One</strong>: A single model covers multilingual recognition, handwriting recognition, layout analysis, table parsing, formula recognition, reading order sorting, and other core tasks.</li>
+          <li><strong>Ultimate Inference Speed</strong>: Achieves peak throughput exceeding 10,000 tokens/s through <code>sglang</code> acceleration on a single NVIDIA 4090 card, easily handling large-scale document processing requirements.</li>
+          <li><strong>Online Experience</strong>: You can experience our brand-new VLM model on <a href="https://mineru.net/OpenSourceTools/Extractor">MinerU.net</a>, <a href="https://huggingface.co/spaces/opendatalab/MinerU">Hugging Face</a>, and <a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">ModelScope</a>.</li>
+        </ul>
+      </li>
+      <li><strong>Incompatible Changes Notice</strong>: To improve overall architectural rationality and long-term maintainability, this version contains some incompatible changes:
+        <ul>
+          <li>Python package name changed from <code>magic-pdf</code> to <code>mineru</code>, and the command-line tool changed from <code>magic-pdf</code> to <code>mineru</code>. Please update your scripts and command calls accordingly.</li>
+          <li>For modular system design and ecosystem consistency considerations, MinerU 2.0 no longer includes the LibreOffice document conversion module. If you need to process Office documents, we recommend converting them to PDF format through an independently deployed LibreOffice service before proceeding with subsequent parsing operations.</li>
+        </ul>
+      </li>
+    </ul>
+  </details>
+  <details>
   <summary>2025/05/24 Release 1.3.12</summary>
   <ul>
       <li>Added support for PPOCRv5 models, updated <code>ch_server</code> model to <code>PP-OCRv5_rec_server</code>, and <code>ch_lite</code> model to <code>PP-OCRv5_rec_mobile</code> (model update required)
@@ -382,8 +424,6 @@
     <li><a href="#acknowledgments">Acknowledgments</a></li>
     <li><a href="#citation">Citation</a></li>
     <li><a href="#star-history">Star History</a></li>
-    <li><a href="#magic-doc">Magic-doc</a></li>
-    <li><a href="#magic-html">Magic-html</a></li>
     <li><a href="#links">Links</a></li>
   </ol>
 </details>
@@ -453,7 +493,7 @@ There are three different ways to experience MinerU:
     <tr>
         <td>GPU Requirements</td>
         <td>Turing architecture or later, 6GB+ VRAM or Apple Silicon</td>
-        <td colspan="2">Ampere architecture or later, 8GB+ VRAM</td>
+        <td colspan="2">Turing architecture or later, 8GB+ VRAM</td>
     </tr>
     <tr>
         <td>Memory Requirements</td>
@@ -499,7 +539,7 @@ uv pip install -e .[core]
 > Linux and macOS systems automatically support CUDA/MPS acceleration after installation. For Windows users who want to use CUDA acceleration, 
 > please visit the [PyTorch official website](https://pytorch.org/get-started/locally/) to install PyTorch with the appropriate CUDA version.
 
-#### 1.3 Install Full Version (supports sglang acceleration) (requires device with Ampere or newer architecture and at least 24GB GPU memory)
+#### 1.3 Install Full Version (supports sglang acceleration) (requires device with Turing or newer architecture and at least 8GB GPU memory)
 
 If you need to use **sglang to accelerate VLM model inference**, you can choose any of the following methods to install the full version:
 
@@ -511,6 +551,10 @@ If you need to use **sglang to accelerate VLM model inference**, you can choose
   ```bash
   uv pip install -e .[all]
   ```
+
+> [!TIP]  
+> If any exceptions occur during the installation of `sglang`, please refer to the [official sglang documentation](https://docs.sglang.ai/start/install.html) for troubleshooting and solutions, or directly use Docker-based installation.
+
 - Build image using Dockerfile:
   ```bash
   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/global/Dockerfile
@@ -532,8 +576,8 @@ If you need to use **sglang to accelerate VLM model inference**, you can choose
   ```
   
 > [!TIP]
-> The Dockerfile uses `lmsysorg/sglang:v0.4.8.post1-cu126` as the default base image. If necessary, you can modify it to another platform version.
-
+> The Dockerfile uses `lmsysorg/sglang:v0.4.8.post1-cu126` as the default base image, which supports the Turing/Ampere/Ada Lovelace/Hopper platforms.  
+> If you are using the newer Blackwell platform, please change the base image to `lmsysorg/sglang:v0.4.8.post1-cu128-b200`.
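+>
+> For example, on a Blackwell machine you could swap the base image in place before building (a sketch using standard `sed`, with the two image tags taken from this tip):
+> ```bash
+> # Point the Dockerfile at the Blackwell (cu128 / B200) sglang base image
+> sed -i 's|lmsysorg/sglang:v0.4.8.post1-cu126|lmsysorg/sglang:v0.4.8.post1-cu128-b200|' Dockerfile
+> ```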
 
 #### 1.4 Install client  (for connecting to sglang-server on edge devices that require only CPU and network connectivity)
 
@@ -556,7 +600,7 @@ The simplest command line invocation is:
 mineru -p <input_path> -o <output_path>
 ```
 
-- `<input_path>`: Local PDF file or directory (supports pdf/png/jpg/jpeg)
+- `<input_path>`: Local PDF/Image file or directory (supports pdf/png/jpg/jpeg/webp/gif)
 - `<output_path>`: Output directory
 
 ##### View Help Information
@@ -579,14 +623,15 @@ Options:
   -m, --method [auto|txt|ocr]     Parsing method: auto (default), txt, ocr (pipeline backend only)
   -b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
                                   Parsing backend (default: pipeline)
-  -l, --lang [ch|ch_server|... ]  Specify document language (improves OCR accuracy, pipeline backend only)
+  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
+                                  Specify document language (improves OCR accuracy, pipeline backend only)
   -u, --url TEXT                  Service address when using sglang-client
   -s, --start INTEGER             Starting page number (0-based)
   -e, --end INTEGER               Ending page number (0-based)
-  -f, --formula BOOLEAN           Enable formula parsing (default: on, pipeline backend only)
-  -t, --table BOOLEAN             Enable table parsing (default: on, pipeline backend only)
+  -f, --formula BOOLEAN           Enable formula parsing (default: on)
+  -t, --table BOOLEAN             Enable table parsing (default: on)
   -d, --device TEXT               Inference device (e.g., cpu/cuda/cuda:0/npu/mps, pipeline backend only)
-  --vram INTEGER                  Maximum GPU VRAM usage per process (pipeline backend only)
+  --vram INTEGER                  Maximum GPU VRAM usage per process (GB) (pipeline backend only)
   --source [huggingface|modelscope|local]
                                   Model source, default: huggingface
   --help                          Show help information
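+
+For instance, a typical invocation combining several of the documented flags might look like the following (a sketch; the paths are placeholders):
+
+```bash
+# Parse an English document with the pipeline backend on GPU 0,
+# fetching models from ModelScope if they are not cached locally.
+mineru -p ./paper.pdf -o ./output -b pipeline -m auto -l en -d cuda:0 --source modelscope
+```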
@@ -658,15 +703,6 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
 mineru-sglang-server --port 30000
 ```
 
-> [!TIP]
-> sglang-server has some commonly used parameters for configuration:
-> - If you have two GPUs with `12GB` or `16GB` VRAM, you can use the Tensor Parallel (TP) mode: `--tp 2`
-> - If you have two GPUs with `11GB` VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size: `--tp 2 --mem-fraction-static 0.7`
-> - If you have more than two GPUs with `24GB` VRAM or above, you can use sglang's multi-GPU parallel mode to increase throughput: `--dp 2`
-> - You can also enable `torch.compile` to accelerate inference speed by approximately 15%: `--enable-torch-compile`
-> - If you want to learn more about the usage of `sglang` parameters, please refer to the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
-
-
 2. Use Client in another terminal:
 
 ```bash
@@ -678,26 +714,73 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
 
 ---
 
-### 3. API Usage
-
-You can also call MinerU through Python code, see example code at:
-👉 [Python Usage Example](demo/demo.py)
+### 3. API Calls or Visual Invocation
+
+1. Directly invoke using Python API: [Python Invocation Example](demo/demo.py)
+2. Invoke using FastAPI:
+   ```bash
+   mineru-api --host 127.0.0.1 --port 8000
+   ```
+   Visit http://127.0.0.1:8000/docs in your browser to view the API documentation.
+
+3. Use Gradio WebUI or Gradio API:
+   ```bash
+   # Using pipeline/vlm-transformers/vlm-sglang-client backend
+   mineru-gradio --server-name 127.0.0.1 --server-port 7860
+   # Or using vlm-sglang-engine/pipeline backend
+   mineru-gradio --server-name 127.0.0.1 --server-port 7860 --enable-sglang-engine
+   ```
+   Access http://127.0.0.1:7860 in your browser to use the Gradio WebUI, or visit http://127.0.0.1:7860/?view=api to use the Gradio API.
+
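+As a rough sketch of calling the FastAPI service once it is running (the `/file_parse` route and form-field names below are assumptions -- confirm the actual path and parameters at http://127.0.0.1:8000/docs):
+
+```bash
+# Hypothetical example: upload one document for parsing.
+# The route and field names are assumptions; check /docs for the real ones.
+curl -X POST "http://127.0.0.1:8000/file_parse" \
+     -F "files=@demo.pdf" \
+     -F "backend=pipeline"
+```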
+
+> [!TIP]  
+> Below are some suggestions and notes for using the sglang acceleration mode:  
+> - The sglang acceleration mode currently supports operation on Turing architecture GPUs with a minimum of 8GB VRAM, but you may encounter VRAM shortages on GPUs with less than 24GB VRAM. You can optimize VRAM usage with the following parameters:  
+>   - If running on a single GPU and encountering VRAM shortage, reduce the KV cache size by setting `--mem-fraction-static 0.5`. If VRAM issues persist, try lowering it further to `0.4` or below.  
+>   - If you have more than one GPU, you can expand available VRAM using tensor parallelism (TP) mode: `--tp 2`  
+> - If you are already successfully using sglang to accelerate VLM inference but wish to further improve inference speed, consider the following parameters:  
+>   - If using multiple GPUs, increase throughput using sglang's multi-GPU parallel mode: `--dp 2`  
+>   - You can also enable `torch.compile` to accelerate inference speed by about 15%: `--enable-torch-compile`  
+> - For more information on using sglang parameters, please refer to the [sglang official documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)  
+> - All sglang-supported parameters can be passed to MinerU via command-line arguments, including those used with the following commands: `mineru`, `mineru-sglang-server`, `mineru-gradio`, `mineru-api`
+
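+For example, since these flags pass straight through, either of the following works (a sketch with illustrative values):
+
+```bash
+# Single low-VRAM GPU: shrink the KV cache for the vlm-sglang-engine backend
+mineru -p ./doc.pdf -o ./out -b vlm-sglang-engine --mem-fraction-static 0.5
+# Two GPUs: serve with tensor parallelism and torch.compile enabled
+mineru-sglang-server --port 30000 --tp 2 --enable-torch-compile
+```
+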
+> [!TIP]  
+> - In any case, you can specify visible GPU devices by prepending the `CUDA_VISIBLE_DEVICES` environment variable to a command. For example:  
+>   ```bash
+>   CUDA_VISIBLE_DEVICES=1 mineru -p <input_path> -o <output_path>
+>   ```
+> - This method works for all command-line calls, including `mineru`, `mineru-sglang-server`, `mineru-gradio`, and `mineru-api`, and applies to both `pipeline` and `vlm` backends.  
+> - Below are some common `CUDA_VISIBLE_DEVICES` settings:  
+>   ```bash
+>   CUDA_VISIBLE_DEVICES=1        # only device 1 will be seen
+>   CUDA_VISIBLE_DEVICES=0,1      # devices 0 and 1 will be visible
+>   CUDA_VISIBLE_DEVICES="0,1"    # same as above, quotation marks are optional
+>   CUDA_VISIBLE_DEVICES=0,2,3    # devices 0, 2, 3 will be visible; device 1 is masked
+>   CUDA_VISIBLE_DEVICES=""      # no GPU will be visible
+>   ```
+> - Below are some possible use cases:  
+>   - If you have multiple GPUs and need to specify GPU 0 and GPU 1 to launch `sglang-server` in multi-GPU mode, you can use the following command:  
+>   ```bash
+>   CUDA_VISIBLE_DEVICES=0,1 mineru-sglang-server --port 30000 --dp 2
+>   ```
+>   - If you have multiple GPUs and need to launch two `fastapi` services on GPU 0 and GPU 1 respectively, listening on different ports, you can use the following commands:  
+>   ```bash
+>   # In terminal 1
+>   CUDA_VISIBLE_DEVICES=0 mineru-api --host 127.0.0.1 --port 8000
+>   # In terminal 2
+>   CUDA_VISIBLE_DEVICES=1 mineru-api --host 127.0.0.1 --port 8001
+>   ```
 
 ---
 
-### 4. Deploy Derivative Projects
-
-Community developers have created various extensions based on MinerU, including:
-
-- Graphical interface based on Gradio
-- Web API based on FastAPI
-- Client/server architecture with multi-GPU load balancing
-- MCP Server based on the official API
-
-These projects typically offer better user experience and additional features.
+### 4. Extending MinerU Functionality Through Configuration Files
 
-For detailed deployment instructions, please refer to:
-👉 [Derivative Projects Documentation](projects/README.md)
+- MinerU is designed to work out-of-the-box, but also supports extending functionality through configuration files. You can create a `mineru.json` file in your home directory and add custom configurations.
+- The `mineru.json` file will be automatically generated when you use the built-in model download command `mineru-models-download`. Alternatively, you can create it by copying the [configuration template file](./mineru.template.json) to your home directory and renaming it to `mineru.json`.
+- Below are some available configuration options:
+  - `latex-delimiter-config`: Used to configure LaTeX formula delimiters, defaults to the `$` symbol, and can be modified to other symbols or strings as needed.
+  - `llm-aided-config`: Used to configure related parameters for LLM-assisted heading level detection, compatible with all LLM models supporting the `OpenAI protocol`. It defaults to Alibaba Cloud Qwen's `qwen2.5-32b-instruct` model. You need to configure an API key yourself and set `enable` to `true` to activate this feature.
+  - `models-dir`: Used to specify local model storage directories. Please specify separate model directories for the `pipeline` and `vlm` backends. After specifying these directories, you can use local models by setting the environment variable `export MINERU_MODEL_SOURCE=local`.
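+
+Below is a sketch of what a `mineru.json` using all three options might look like. The top-level keys are the documented ones; the inner field layout is an assumption modeled on the configuration template, so prefer copying the real template file:
+
+```json
+{
+  "latex-delimiter-config": {
+    "display": {"left": "$$", "right": "$$"},
+    "inline": {"left": "$", "right": "$"}
+  },
+  "llm-aided-config": {
+    "title_aided": {
+      "api_key": "your-api-key",
+      "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+      "model": "qwen2.5-32b-instruct",
+      "enable": true
+    }
+  },
+  "models-dir": {
+    "pipeline": "/path/to/local/pipeline/models",
+    "vlm": "/path/to/local/vlm/models"
+  }
+}
+```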
 
 ---
 
@@ -714,7 +797,7 @@ For detailed deployment instructions, please refer to:
 # Known Issues
 
 - Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
-- Vertical text is not supported.
+- Limited support for vertical text.
 - Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
 - Code blocks are not yet supported in the layout model.
 - Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
@@ -724,9 +807,9 @@ For detailed deployment instructions, please refer to:
 
 # FAQ
 
-[FAQ in Chinese](docs/FAQ_zh_cn.md)
-
-[FAQ in English](docs/FAQ_en_us.md)
+- If you encounter any issues during usage, you can first check the [FAQ](docs/FAQ_en_us.md) for solutions.  
+- If your issue remains unresolved, you may also use [DeepWiki](https://deepwiki.com/opendatalab/MinerU) to interact with an AI assistant, which can address most common problems.  
+- If you still cannot resolve the issue, you are welcome to join our community via [Discord](https://discord.gg/Tdedn9GTXq) or [WeChat](http://mineru.space/s/V85Yl) to discuss with other users and developers.
 
 # All Thanks To Our Contributors
 
@@ -787,16 +870,13 @@ Currently, some models in this project are trained based on YOLO. However, since
  </picture>
 </a>
 
-# Magic-doc
-
-[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
-
-# Magic-html
-
-[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
 
 # Links
 
 - [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
 - [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
 - [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
+- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
+- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
+- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
+- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc) 

+ 153 - 75
README_zh-CN.md

@@ -46,40 +46,80 @@
 </div>
 
 # 更新记录
-- 2025/06/20 2.0.6发布
-  - 修复`vlm`模式下,某些偶发的无效块内容导致解析中断问题
-  - 修复`vlm`模式下,某些不完整的表结构导致的解析中断问题
-- 2025/06/17 2.0.5发布
-  - 修复了`sglang-client`模式下依然需要下载模型的问题
-  - 修复了`sglang-client`模式需要依赖`torch`等实际运行不需要的包的问题
-  - 修复了同一进程内尝试通过多个url启动多个`sglang-client`实例时,只有第一个生效的问题
-- 2025/06/15 2.0.3发布
-  - 修复了当下载模型类型设置为`all`时,配置文件出现键值更新错误的问题
-  - 修复了命令行模式下公式和表格功能开关不生效导致功能无法关闭的问题
-  - 修复了`sglang-engine`模式下,0.4.7版本sglang的兼容性问题
-  - 更新了sglang环境下部署完整版MinerU的Dockerfile和相关安装文档
-- 2025/06/13 2.0.0发布
-  - MinerU 2.0 是一次从架构到功能的全面重构与升级,带来了更简洁的设计、更强的性能以及更灵活的使用体验。
-    - **全新架构**:MinerU 2.0 在代码结构和交互方式上进行了深度重构,显著提升了系统的易用性、可维护性与扩展能力。
-      - **去除第三方依赖限制**:彻底移除对 `pymupdf` 的依赖,推动项目向更开放、合规的开源方向迈进。
-      - **开箱即用,配置便捷**:无需手动编辑 JSON 配置文件,绝大多数参数已支持命令行或 API 直接设置。
-      - **模型自动管理**:新增模型自动下载与更新机制,用户无需手动干预即可完成模型部署。
-      - **离线部署友好**:提供内置模型下载命令,支持完全断网环境下的部署需求。
-      - **代码结构精简**:移除数千行冗余代码,简化类继承逻辑,显著提升代码可读性与开发效率。
-      - **统一中间格式输出**:采用标准化的 `middle_json` 格式,兼容多数基于该格式的二次开发场景,确保生态业务无缝迁移。
-    - **全新模型**:MinerU 2.0 集成了我们最新研发的小参数量、高性能多模态文档解析模型,实现端到端的高速、高精度文档理解。
-      - **小模型,大能力**:模型参数不足 1B,却在解析精度上超越传统 72B 级别的视觉语言模型(VLM)。
-      - **多功能合一**:单模型覆盖多语言识别、手写识别、版面分析、表格解析、公式识别、阅读顺序排序等核心任务。
-      - **极致推理速度**:在单卡 NVIDIA 4090 上通过 `sglang` 加速,达到峰值吞吐量超过 10,000 token/s,轻松应对大规模文档处理需求。
-      - **在线体验**:您可以在[MinerU.net](https://mineru.net/OpenSourceTools/Extractor)、[Hugging Face](https://huggingface.co/spaces/opendatalab/MinerU), 以及[ModelScope](https://www.modelscope.cn/studios/OpenDataLab/MinerU)体验我们的全新VLM模型
-    - **不兼容变更说明**:为提升整体架构合理性与长期可维护性,本版本包含部分不兼容的变更:
-      - Python 包名从 `magic-pdf` 更改为 `mineru`,命令行工具也由 `magic-pdf` 改为 `mineru`,请同步更新脚本与调用命令。
-      - 出于对系统模块化设计与生态一致性的考虑,MinerU 2.0 已不再内置 LibreOffice 文档转换模块。如需处理 Office 文档,建议通过独立部署的 LibreOffice 服务先行转换为 PDF 格式,再进行后续解析操作。
-
+- 2025/07/05 2.1.0发布
+  - 这是 MinerU 2 的第一个大版本更新,包含大量新功能和改进,涵盖众多性能优化、体验优化和bug修复,具体更新内容如下:
+  - 性能优化: 
+    - 大幅提升某些特定分辨率(长边2000像素左右)文档的预处理速度
+    - 大幅提升`pipeline`后端批量处理大量页数较少(<10)文档时的后处理速度
+    - `pipeline`后端的layout分析速度提升约20%
+  - 体验优化:
+    - 内置开箱即用的`fastapi服务`和`gradio webui`,详细使用方法请参考[文档](#3-api-调用-或-可视化调用)
+    - `sglang`适配`0.4.8`版本,大幅降低`vlm-sglang`后端的显存要求,最低可在`8G显存`(Turing及以后架构)的显卡上运行
+    - 对所有命令增加`sglang`的参数透传,使得`sglang-engine`后端可以与`sglang-server`一致,接收`sglang`的所有参数
+    - 支持基于配置文件的功能扩展,包含`自定义公式标识符`、`开启标题分级功能`、`自定义本地模型目录`,详细使用方法请参考[文档](#4-基于配置文件扩展-mineru-功能)
+  - 新特性:  
+    - `pipeline`后端更新 PP-OCRv5 多语种文本识别模型,支持法语、西班牙语、葡萄牙语、俄语、韩语等 37 种语言的文字识别,平均精度涨幅超30%。[详情](https://paddlepaddle.github.io/PaddleOCR/latest/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
+    - `pipeline`后端增加对竖排文本的有限支持
 
 <details>
   <summary>历史日志</summary>
   <details>
+    <summary>2025/06/20 2.0.6发布</summary>
+    <ul>
+      <li>修复<code>vlm</code>模式下,某些偶发的无效块内容导致解析中断问题</li>
+      <li>修复<code>vlm</code>模式下,某些不完整的表结构导致的解析中断问题</li>
+    </ul>
+  </details>
+  
+  <details>
+    <summary>2025/06/17 2.0.5发布</summary>
+    <ul>
+      <li>修复了<code>sglang-client</code>模式下依然需要下载模型的问题</li>
+      <li>修复了<code>sglang-client</code>模式需要依赖<code>torch</code>等实际运行不需要的包的问题</li>
+      <li>修复了同一进程内尝试通过多个url启动多个<code>sglang-client</code>实例时,只有第一个生效的问题</li>
+    </ul>
+  </details>
+  
+  <details>
+    <summary>2025/06/15 2.0.3发布</summary>
+    <ul>
+      <li>修复了当下载模型类型设置为<code>all</code>时,配置文件出现键值更新错误的问题</li>
+      <li>修复了命令行模式下公式和表格功能开关不生效导致功能无法关闭的问题</li>
+      <li>修复了<code>sglang-engine</code>模式下,0.4.7版本sglang的兼容性问题</li>
+      <li>更新了sglang环境下部署完整版MinerU的Dockerfile和相关安装文档</li>
+    </ul>
+  </details>
+  
+  <details>
+    <summary>2025/06/13 2.0.0发布</summary>
+    <ul>
+      <li><strong>全新架构</strong>:MinerU 2.0 在代码结构和交互方式上进行了深度重构,显著提升了系统的易用性、可维护性与扩展能力。
+        <ul>
+          <li><strong>去除第三方依赖限制</strong>:彻底移除对 <code>pymupdf</code> 的依赖,推动项目向更开放、合规的开源方向迈进。</li>
+          <li><strong>开箱即用,配置便捷</strong>:无需手动编辑 JSON 配置文件,绝大多数参数已支持命令行或 API 直接设置。</li>
+          <li><strong>模型自动管理</strong>:新增模型自动下载与更新机制,用户无需手动干预即可完成模型部署。</li>
+          <li><strong>离线部署友好</strong>:提供内置模型下载命令,支持完全断网环境下的部署需求。</li>
+          <li><strong>代码结构精简</strong>:移除数千行冗余代码,简化类继承逻辑,显著提升代码可读性与开发效率。</li>
+          <li><strong>统一中间格式输出</strong>:采用标准化的 <code>middle_json</code> 格式,兼容多数基于该格式的二次开发场景,确保生态业务无缝迁移。</li>
+        </ul>
+      </li>
+      <li><strong>全新模型</strong>:MinerU 2.0 集成了我们最新研发的小参数量、高性能多模态文档解析模型,实现端到端的高速、高精度文档理解。
+        <ul>
+          <li><strong>小模型,大能力</strong>:模型参数不足 1B,却在解析精度上超越传统 72B 级别的视觉语言模型(VLM)。</li>
+          <li><strong>多功能合一</strong>:单模型覆盖多语言识别、手写识别、版面分析、表格解析、公式识别、阅读顺序排序等核心任务。</li>
+          <li><strong>极致推理速度</strong>:在单卡 NVIDIA 4090 上通过 <code>sglang</code> 加速,达到峰值吞吐量超过 10,000 token/s,轻松应对大规模文档处理需求。</li>
+          <li><strong>在线体验</strong>:您可以在<a href="https://mineru.net/OpenSourceTools/Extractor">MinerU.net</a>、<a href="https://huggingface.co/spaces/opendatalab/MinerU">Hugging Face</a>, 以及<a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">ModelScope</a>体验我们的全新VLM模型</li>
+        </ul>
+      </li>
+      <li><strong>不兼容变更说明</strong>:为提升整体架构合理性与长期可维护性,本版本包含部分不兼容的变更:
+        <ul>
+          <li>Python 包名从 <code>magic-pdf</code> 更改为 <code>mineru</code>,命令行工具也由 <code>magic-pdf</code> 改为 <code>mineru</code>,请同步更新脚本与调用命令。</li>
+          <li>出于对系统模块化设计与生态一致性的考虑,MinerU 2.0 已不再内置 LibreOffice 文档转换模块。如需处理 Office 文档,建议通过独立部署的 LibreOffice 服务先行转换为 PDF 格式,再进行后续解析操作。</li>
+        </ul>
+      </li>
+    </ul>
+  </details>
+  <details>
   <summary>2025/05/24 1.3.12 发布</summary>
   <ul>
       <li>增加ppocrv5模型的支持,将<code>ch_server</code>模型更新为<code>PP-OCRv5_rec_server</code>,<code>ch_lite</code>模型更新为<code>PP-OCRv5_rec_mobile</code>(需更新模型)
@@ -372,8 +412,6 @@
     <li><a href="#acknowledgments">Acknowledgements</a></li>
     <li><a href="#citation">Citation</a></li>
     <li><a href="#star-history">Star History</a></li>
-    <li><a href="#magic-doc">magic-doc快速提取PPT/DOC/PDF</a></li>
-    <li><a href="#magic-html">magic-html提取混合网页内容</a></li>
     <li><a href="#links">Links</a></li>
   </ol>
 </details>
@@ -444,7 +482,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
     <tr>
         <td>GPU要求</td>
         <td>Turing及以后架构,6G显存以上或Apple Silicon</td>
-        <td colspan="2">Ampere及以后架构,8G显存以上</td>
+        <td colspan="2">Turing及以后架构,8G显存以上</td>
     </tr>
     <tr>
         <td>内存要求</td>
@@ -490,7 +528,7 @@ uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
 > Linux和macOS系统安装后自动支持cuda/mps加速,Windows用户如需使用cuda加速,
 > 请前往 [Pytorch官网](https://pytorch.org/get-started/locally/) 选择合适的cuda版本安装pytorch。
 
-#### 1.3 安装完整版(支持 sglang 加速)(需确保设备有Ampere及以后架构,24G显存及以上显卡)
+#### 1.3 安装完整版(支持 sglang 加速)(需确保设备有Turing及以后架构,8G显存及以上显卡)
 
 如需使用 **sglang 加速 VLM 模型推理**,请选择合适的方式安装完整版本:
 
@@ -502,6 +540,10 @@ uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
   ```bash
   uv pip install -e .[all] -i https://mirrors.aliyun.com/pypi/simple
   ```
+  
+> [!TIP]
+> sglang安装过程中如发生异常,请参考[sglang官方文档](https://docs.sglang.ai/start/install.html)尝试解决或直接使用docker方式安装。
+
 - 使用 Dockerfile 构建镜像:
   ```bash
   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile
@@ -523,7 +565,8 @@ uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
   ```
   
 > [!TIP]
-> Dockerfile默认使用`lmsysorg/sglang:v0.4.8.post1-cu126`作为基础镜像,如有需要,您可以自行修改为其他平台版本。
+> Dockerfile默认使用`lmsysorg/sglang:v0.4.8.post1-cu126`作为基础镜像,支持Turing/Ampere/Ada Lovelace/Hopper平台,
+> 如您使用较新的Blackwell平台,请将基础镜像修改为`lmsysorg/sglang:v0.4.8.post1-cu128-b200`。
   
 #### 1.4 安装client(用于在仅需 CPU 和网络连接的边缘设备上连接 sglang-server)
 
@@ -546,7 +589,7 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://<host_ip>
 mineru -p <input_path> -o <output_path>
 ```
 
-- `<input_path>`:本地 PDF 文件或目录(支持 pdf/png/jpg/jpeg)
+- `<input_path>`:本地 PDF/图片 文件或目录(支持 pdf/png/jpg/jpeg/webp/gif)
 - `<output_path>`:输出目录
 
 ##### 查看帮助信息
@@ -569,14 +612,15 @@ Options:
   -m, --method [auto|txt|ocr]     解析方法:auto(默认)、txt、ocr(仅用于 pipeline 后端)
   -b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
                                   解析后端(默认为 pipeline)
-  -l, --lang [ch|ch_server|... ]  指定文档语言(可提升 OCR 准确率,仅用于 pipeline 后端)
+  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
+                                  指定文档语言(可提升 OCR 准确率,仅用于 pipeline 后端)
   -u, --url TEXT                  当使用 sglang-client 时,需指定服务地址
   -s, --start INTEGER             开始解析的页码(从 0 开始)
   -e, --end INTEGER               结束解析的页码(从 0 开始)
-  -f, --formula BOOLEAN           是否启用公式解析(默认开启,仅 pipeline 后端)
-  -t, --table BOOLEAN             是否启用表格解析(默认开启,仅 pipeline 后端)
+  -f, --formula BOOLEAN           是否启用公式解析(默认开启)
+  -t, --table BOOLEAN             是否启用表格解析(默认开启)
   -d, --device TEXT               推理设备(如 cpu/cuda/cuda:0/npu/mps,仅 pipeline 后端)
-  --vram INTEGER                  单进程最大 GPU 显存占用(仅 pipeline 后端)
+  --vram INTEGER                  单进程最大 GPU 显存占用(GB)(仅 pipeline 后端)
   --source [huggingface|modelscope|local]
                                   模型来源,默认 huggingface
   --help                          显示帮助信息
@@ -648,14 +692,6 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
 mineru-sglang-server --port 30000
 ```
 
-> [!TIP]
-> sglang-server 有一些常用参数可以配置:
-> - 如您有两张显存为`12G`或`16G`的显卡,可以通过张量并行(TP)模式使用:`--tp 2`
-> - 如您有两张`11G`显卡,除了张量并行外,还需要调低KV缓存大小,可以使用:`--tp 2 --mem-fraction-static 0.7`
-> - 如果您有超过多张`24G`以上显卡,可以使用sglang的多卡并行模式来增加吞吐量:`--dp 2`
-> - 同时您可以启用`torch.compile`来将推理速度加速约15%:`--enable-torch-compile`
-> - 如果您想了解更多有关`sglang`的参数使用方法,请参考 [sglang官方文档](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
-
 2. 在另一个终端中使用 Client 调用:
 
 ```bash
@@ -667,29 +703,75 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
 
 ---
 
-### 3. API 调用方式
+### 3. API 调用 或 可视化调用
 
-您也可以通过 Python 代码调用 MinerU,示例代码请参考:
-👉 [Python 调用示例](demo/demo.py)
+1. 使用 Python API 直接调用:[Python 调用示例](demo/demo.py)
+2. 使用 FastAPI 方式调用:
+    ```bash
+    mineru-api --host 127.0.0.1 --port 8000
+    ```
+    在浏览器中访问 http://127.0.0.1:8000/docs 查看API文档。
 
----
+3. 使用 Gradio WebUI 或 Gradio API 调用:
+    ```bash
+    # 使用 pipeline/vlm-transformers/vlm-sglang-client 后端
+    mineru-gradio --server-name 127.0.0.1 --server-port 7860
+    # 或使用 vlm-sglang-engine/pipeline 后端
+    mineru-gradio --server-name 127.0.0.1 --server-port 7860 --enable-sglang-engine
+    ```
+    在浏览器中访问 http://127.0.0.1:7860 使用 Gradio WebUI 或访问 http://127.0.0.1:7860/?view=api 使用 Gradio API。
 
-### 4. 部署衍生项目
 
-社区开发者基于 MinerU 进行了多种二次开发,包括:
+> [!TIP]
+> 以下是一些使用sglang加速模式的建议和注意事项:
+> - sglang加速模式目前支持在最低8G显存的Turing架构显卡上运行,但在显存<24G的显卡上可能会遇到显存不足的问题,可以通过以下参数来优化显存使用:
+>   - 如果您使用单张显卡且遇到显存不足的情况,可通过设置`--mem-fraction-static 0.5`调低KV缓存大小;如仍出现显存不足问题,可尝试进一步降低到`0.4`或更低。
+>   - 如您有两张以上显卡,可尝试通过张量并行(TP)模式简单扩充可用显存:`--tp 2`
+> - 如果您已经可以正常使用sglang对vlm模型进行加速推理,但仍然希望进一步提升推理速度,可以尝试以下参数:
+>   - 如果您有多张显卡,可以使用sglang的多卡并行模式来增加吞吐量:`--dp 2`
+>   - 同时您可以启用`torch.compile`来将推理速度加速约15%:`--enable-torch-compile`
+> - 如果您想了解更多有关`sglang`的参数使用方法,请参考 [sglang官方文档](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
+> - 所有sglang官方支持的参数都可以通过命令行参数传递给 MinerU,包括以下命令:`mineru`、`mineru-sglang-server`、`mineru-gradio`、`mineru-api`
 
-- 基于 Gradio 的图形界面
-- 基于 FastAPI 的 Web API
-- 多卡负载均衡的客户端/服务端架构
-- 基于官网API的MCP Server
+> [!TIP]
+> - 任何情况下,您都可以通过在命令行的开头添加`CUDA_VISIBLE_DEVICES` 环境变量来指定可见的 GPU 设备。例如:
+>   ```bash
+>   CUDA_VISIBLE_DEVICES=1 mineru -p <input_path> -o <output_path>
+>   ```
+> - 这种指定方式对所有的命令行调用都有效,包括 `mineru`、`mineru-sglang-server`、`mineru-gradio` 和 `mineru-api`,且对`pipeline`、`vlm`后端均适用。
+> - 以下是一些常见的 `CUDA_VISIBLE_DEVICES` 设置示例:
+>   ```bash
+>   CUDA_VISIBLE_DEVICES=1        # only device 1 will be seen
+>   CUDA_VISIBLE_DEVICES=0,1      # devices 0 and 1 will be visible
+>   CUDA_VISIBLE_DEVICES="0,1"    # same as above, quotation marks are optional
+>   CUDA_VISIBLE_DEVICES=0,2,3    # devices 0, 2, 3 will be visible; device 1 is masked
+>   CUDA_VISIBLE_DEVICES=""      # no GPU will be visible
+>   ```
+> - 以下是一些可能的使用场景:
+>   - 如果您有多张显卡,需要指定卡0和卡1,并使用多卡并行来启动`sglang-server`,可以使用以下命令:
+>   ```bash
+>   CUDA_VISIBLE_DEVICES=0,1 mineru-sglang-server --port 30000 --dp 2
+>   ```
+>   - 如果您有多张显卡,需要在卡0和卡1上启动两个`fastapi`服务,并分别监听不同的端口,可以使用以下命令:
+>   ```bash
+>   # 在终端1中
+>   CUDA_VISIBLE_DEVICES=0 mineru-api --host 127.0.0.1 --port 8000
+>   # 在终端2中
+>   CUDA_VISIBLE_DEVICES=1 mineru-api --host 127.0.0.1 --port 8001
+>   ```
 
-这些项目通常提供更好的用户体验和更多功能。
+---
 
-详细部署方式请参阅:
-👉 [衍生项目说明](projects/README_zh-CN.md)
+### 4. 基于配置文件扩展 MinerU 功能
 
---- 
+- MinerU 现已实现开箱即用,但也支持通过配置文件扩展功能。您可以在用户目录下创建 `mineru.json` 文件,添加自定义配置。
+- `mineru.json` 文件会在您使用内置模型下载命令 `mineru-models-download` 时自动生成,也可以通过将[配置模板文件](./mineru.template.json)复制到用户目录下并重命名为 `mineru.json` 来创建。
+- 以下是一些可用的配置选项:
+  - `latex-delimiter-config`:用于配置 LaTeX 公式的分隔符,默认为`$`符号,可根据需要修改为其他符号或字符串。
+  - `llm-aided-config`:用于配置 LLM 辅助标题分级的相关参数,兼容所有支持`openai协议`的 LLM 模型,默认使用`阿里云百炼`的`qwen2.5-32b-instruct`模型,您需要自行配置 API 密钥并将`enable`设置为`true`来启用此功能。
+  - `models-dir`:用于指定本地模型存储目录,请为`pipeline`和`vlm`后端分别指定模型目录,指定目录后您可通过配置环境变量`export MINERU_MODEL_SOURCE=local`来使用本地模型。
 
+---
 
 # TODO
 
@@ -704,7 +786,7 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
 # Known Issues
 
 - 阅读顺序基于模型对可阅读内容在空间中的分布进行排序,在极端复杂的排版下可能会部分区域乱序
-- 不支持竖排文字
+- 对竖排文字的支持较为有限
 - 目录和列表通过规则进行识别,少部分不常见的列表形式可能无法识别
 - 代码块在layout模型里还没有支持
 - 漫画书、艺术图册、小学教材、习题尚不能很好解析
@@ -713,11 +795,10 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
 - 部分公式可能会无法在markdown中渲染
 
 # FAQ
-
-[常见问题](docs/FAQ_zh_cn.md)
-
-
-[FAQ](docs/FAQ_en_us.md)
+ 
+- 如果您在使用过程中遇到问题,可以先查看[常见问题](docs/FAQ_zh_cn.md)是否有解答。  
+- 如果未能解决您的问题,您也可以使用[DeepWiki](https://deepwiki.com/opendatalab/MinerU)与AI助手交流,这可以解决大部分常见问题。  
+- 如果您仍然无法解决问题,您可通过[Discord](https://discord.gg/Tdedn9GTXq)或[WeChat](http://mineru.space/s/V85Yl)加入社区,与其他用户和开发者交流。
 
 # All Thanks To Our Contributors
 
@@ -778,16 +859,13 @@ mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1
  </picture>
 </a>
 
-# Magic-doc
-
-[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
-
-# Magic-html
-
-[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
 
 # Links
 
 - [LabelU (A Lightweight Multi-modal Data Annotation Tool)](https://github.com/opendatalab/labelU)
 - [LabelLLM (An Open-source LLM Dialogue Annotation Platform)](https://github.com/opendatalab/LabelLLM)
 - [PDF-Extract-Kit (A Comprehensive Toolkit for High-Quality PDF Content Extraction)](https://github.com/opendatalab/PDF-Extract-Kit)
+- [Vis3 (OSS browser based on s3)](https://github.com/opendatalab/Vis3)
+- [OmniDocBench (A Comprehensive Benchmark for Document Parsing and Evaluation)](https://github.com/opendatalab/OmniDocBench)
+- [Magic-HTML (Mixed web page extraction tool)](https://github.com/opendatalab/magic-html)
+- [Magic-Doc (Fast speed ppt/pptx/doc/docx/pdf extraction tool)](https://github.com/InternLM/magic-doc) 

+ 2 - 2
mineru/cli/fast_api.py

@@ -11,7 +11,7 @@ from typing import List, Optional
 from loguru import logger
 from base64 import b64encode
 
-from mineru.cli.common import aio_do_parse, read_fn
+from mineru.cli.common import aio_do_parse, read_fn, pdf_suffixes, image_suffixes
 from mineru.utils.cli_parser import arg_parse
 from mineru.version import __version__
 
@@ -69,7 +69,7 @@ async def parse_pdf(
             file_path = Path(file.filename)
 
             # If the file is an image or a PDF, process it with read_fn
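+            # NOTE: pdf_suffixes / image_suffixes are shared lists from mineru.cli.common;
+            # per the README they plausibly cover .pdf plus .png/.jpeg/.jpg/.webp/.gif.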
-            if file_path.suffix.lower() in [".pdf", ".png", ".jpeg", ".jpg"]:
+            if file_path.suffix.lower() in pdf_suffixes + image_suffixes:
                 # Create a temporary file so read_fn can be used
                 temp_path = Path(unique_dir) / file_path.name
                 with open(temp_path, "wb") as f:

+ 0 - 1
projects/README.md

@@ -3,5 +3,4 @@
 ## Project List
 
 - Projects not yet compatible with version 2.0:
-  - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
   - [mcp](./mcp/README.md): MCP server based on the official API

+ 0 - 1
projects/README_zh-CN.md

@@ -3,5 +3,4 @@
 ## 项目列表
 
 - 未兼容2.0版本的项目列表
-  - [multi_gpu](./multi_gpu/README.md): 基于 LitServe 的多 GPU 并行处理
   - [mcp](./mcp/README.md): 基于官方api的mcp server

+ 0 - 44
projects/multi_gpu/README.md

@@ -1,44 +0,0 @@
-## 项目简介
-本项目提供基于 LitServe 的多 GPU 并行处理方案。LitServe 是一个简便且灵活的 AI 模型服务引擎,基于 FastAPI 构建。它为 FastAPI 增强了批处理、流式传输和 GPU 自动扩展等功能,无需为每个模型单独重建 FastAPI 服务器。
-
-## 环境配置
-请使用以下命令配置所需的环境:
-```bash
-pip install -U magic-pdf[full] litserve python-multipart filetype
-```
-
-## 快速使用
-### 1. 启动服务端
-以下示例展示了如何启动服务端,支持自定义设置:
-```python
-server = ls.LitServer(
-    MinerUAPI(output_dir='/tmp'),  # 可自定义输出文件夹
-    accelerator='cuda',  # 启用 GPU 加速
-    devices='auto',  # "auto" 使用所有 GPU
-    workers_per_device=1,  # 每个 GPU 启动一个服务实例
-    timeout=False  # 设置为 False 以禁用超时
-)
-server.run(port=8000)  # 设定服务端口为 8000
-```
-
-启动服务端命令:
-```bash
-python server.py
-```
-
-### 2. 启动客户端
-以下代码展示了客户端的使用方式,可根据需求修改配置:
-```python
-files = ['demo/small_ocr.pdf']  # 替换为文件路径,支持 pdf、jpg/jpeg、png、doc、docx、ppt、pptx 文件
-n_jobs = np.clip(len(files), 1, 8)  # 设置并发线程数,此处最大为 8,可根据自身修改
-results = Parallel(n_jobs, prefer='threads', verbose=10)(
-    delayed(do_parse)(p) for p in files
-)
-print(results)
-```
-
-启动客户端命令:
-```bash
-python client.py
-```
-好了,你的文件会自动在多个 GPU 上并行处理!🍻🍻🍻

+ 0 - 39
projects/multi_gpu/client.py

@@ -1,39 +0,0 @@
-import base64
-import requests
-import numpy as np
-from loguru import logger
-from joblib import Parallel, delayed
-
-
-def to_b64(file_path):
-    try:
-        with open(file_path, 'rb') as f:
-            return base64.b64encode(f.read()).decode('utf-8')
-    except Exception as e:
-        raise Exception(f'File: {file_path} - Info: {e}')
-
-
-def do_parse(file_path, url='http://127.0.0.1:8000/predict', **kwargs):
-    try:
-        response = requests.post(url, json={
-            'file': to_b64(file_path),
-            'kwargs': kwargs
-        })
-
-        if response.status_code == 200:
-            output = response.json()
-            output['file_path'] = file_path
-            return output
-        else:
-            raise Exception(response.text)
-    except Exception as e:
-        logger.error(f'File: {file_path} - Info: {e}')
-
-
-if __name__ == '__main__':
-    files = ['demo/small_ocr.pdf']
-    n_jobs = np.clip(len(files), 1, 8)
-    results = Parallel(n_jobs, prefer='threads', verbose=10)(
-        delayed(do_parse)(p) for p in files
-    )
-    print(results)

+ 0 - 98
projects/multi_gpu/server.py

@@ -1,98 +0,0 @@
-import os
-import uuid
-import shutil
-import tempfile
-import gc
-import fitz
-import torch
-import base64
-import filetype
-import litserve as ls
-from pathlib import Path
-from fastapi import HTTPException
-
-
-class MinerUAPI(ls.LitAPI):
-    def __init__(self, output_dir='/tmp'):
-        self.output_dir = Path(output_dir)
-
-    def setup(self, device):
-        if device.startswith('cuda'):
-            os.environ['CUDA_VISIBLE_DEVICES'] = device.split(':')[-1]
-            if torch.cuda.device_count() > 1:
-                raise RuntimeError("Remove any CUDA actions before setting 'CUDA_VISIBLE_DEVICES'.")
-
-        from magic_pdf.tools.cli import do_parse, convert_file_to_pdf
-        from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
-
-        self.do_parse = do_parse
-        self.convert_file_to_pdf = convert_file_to_pdf
-
-        model_manager = ModelSingleton()
-        model_manager.get_model(True, False)
-        model_manager.get_model(False, False)
-        print(f'Model initialization complete on {device}!')
-
-    def decode_request(self, request):
-        file = request['file']
-        file = self.cvt2pdf(file)
-        opts = request.get('kwargs', {})
-        opts.setdefault('debug_able', False)
-        opts.setdefault('parse_method', 'auto')
-        return file, opts
-
-    def predict(self, inputs):
-        try:
-            pdf_name = str(uuid.uuid4())
-            output_dir = self.output_dir.joinpath(pdf_name)
-            self.do_parse(self.output_dir, pdf_name, inputs[0], [], **inputs[1])
-            return output_dir
-        except Exception as e:
-            shutil.rmtree(output_dir, ignore_errors=True)
-            raise HTTPException(status_code=500, detail=str(e))
-        finally:
-            self.clean_memory()
-
-    def encode_response(self, response):
-        return {'output_dir': response}
-
-    def clean_memory(self):
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
-            torch.cuda.ipc_collect()
-        gc.collect()
-
-    def cvt2pdf(self, file_base64):
-        try:
-            temp_dir = Path(tempfile.mkdtemp())
-            temp_file = temp_dir.joinpath('tmpfile')
-            file_bytes = base64.b64decode(file_base64)
-            file_ext = filetype.guess_extension(file_bytes)
-
-            if file_ext in ['pdf', 'jpg', 'png', 'doc', 'docx', 'ppt', 'pptx']:
-                if file_ext == 'pdf':
-                    return file_bytes
-                elif file_ext in ['jpg', 'png']:
-                    with fitz.open(stream=file_bytes, filetype=file_ext) as f:
-                        return f.convert_to_pdf()
-                else:
-                    temp_file.write_bytes(file_bytes)
-                    self.convert_file_to_pdf(temp_file, temp_dir)
-                    return temp_file.with_suffix('.pdf').read_bytes()
-            else:
-                raise Exception('Unsupported file format')
-        except Exception as e:
-            raise HTTPException(status_code=500, detail=str(e))
-        finally:
-            shutil.rmtree(temp_dir, ignore_errors=True)
-
-
-if __name__ == '__main__':
-    server = ls.LitServer(
-        MinerUAPI(output_dir='/tmp'),
-        accelerator='cuda',
-        devices='auto',
-        workers_per_device=1,
-        timeout=False
-    )
-    server.run(port=8000)