4 months ago · 944109620b
--- a/README.md
+++ b/README.md
@@ -439,7 +439,7 @@ There are three different ways to experience MinerU:
 
				         <td>Parsing Backend</td>
			
 
				         <td>pipeline</td>
			
 
				         <td>vlm-transformers</td>
			
 
				-        <td>vlm-sgslang</td>
			
 
				+        <td>vlm-sglang</td>
			
 
				     </tr>
			
 
				     <tr>
			
 
				         <td>Operating System</td>
			
@@ -502,7 +502,7 @@ cd MinerU
 
				 uv pip install -e .[core]
			
 
				 ```
			
 
				 
			
 
				-> [!TIP]  
			
 
				+> [!NOTE]  
			
 
				 > Linux and macOS systems automatically support CUDA/MPS acceleration after installation. For Windows users who want to use CUDA acceleration, 
			
 
				 > please visit the [PyTorch official website](https://pytorch.org/get-started/locally/) to install PyTorch with the appropriate CUDA version.
			
 
				 
			
@@ -651,13 +651,13 @@ mineru -p <input_path> -o <output_path>
 
				 
			
 
				 #### 2.3 Using sglang to Accelerate VLM Model Inference
			
 
				 
			
 
				-##### Start sglang-engine Mode
			
 
				+##### Using sglang-engine Mode
			
 
				 
			
 
				 ```bash
			
 
				 mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
			
 
				 ```
			
 
				 
			
 
				-##### Start sglang-server/client Mode
			
 
				+##### Using sglang-server/client Mode
			
 
				 
			
 
				 1. Start Server:
			
 
				 
			
@@ -666,10 +666,13 @@ mineru-sglang-server --port 30000
 
				 ```
			
 
				 
			
 
				 > [!TIP]
			
 
				-> sglang acceleration requires a GPU with Ampere architecture or newer, and at least 24GB VRAM. If you have two 12GB or 16GB GPUs, you can use Tensor Parallelism (TP) mode:  
			
 
				-> `mineru-sglang-server --port 30000 --tp 2`  
			
 
				-> 
			
 
				-> If you still encounter out-of-memory errors with two GPUs, or if you need to improve throughput or inference speed using multi-GPU parallelism, please refer to the [sglang official documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands).
			
 
				+> sglang-server supports several commonly used configuration parameters:
			
 
				+> - If you have two GPUs with `12GB` or `16GB` VRAM, you can use the Tensor Parallel (TP) mode: `--tp 2`
			
 
				+> - If you have two GPUs with `11GB` VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size: `--tp 2 --mem-fraction-static 0.7`
			
 
				+> - If you have two or more GPUs with `24GB` VRAM or above, you can use sglang's data parallel (DP) mode to increase throughput: `--dp 2`
			
 
				+> - You can also enable `torch.compile` to speed up inference by approximately 15%: `--enable-torch-compile`
			
 
				+> - If you want to learn more about the usage of `sglang` parameters, please refer to the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
			
 
				+
			
 
				 
			
 
				 2. Use Client in another terminal:
			
 
				 
			
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -429,7 +429,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
 
				         <td>Parsing Backend</td>
			
 
				         <td>pipeline</td>
			
 
				         <td>vlm-transformers</td>
			
 
				-        <td>vlm-sgslang</td>
			
 
				+        <td>vlm-sglang</td>
			
 
				     </tr>
			
 
				     <tr>
			
 
				         <td>Operating System</td>
			
@@ -492,7 +492,7 @@ cd MinerU
 
				 uv pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple
			
 
				 ```
			
 
				 
			
 
				-> [!TIP]
			
 
				+> [!NOTE]
			
 
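				 > Linux and macOS systems automatically support CUDA/MPS acceleration after installation. Windows users who want CUDA acceleration should
			
 
				 > visit the [PyTorch official website](https://pytorch.org/get-started/locally/) and install PyTorch with the appropriate CUDA version.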
			
 
				 
			
@@ -640,13 +640,13 @@ mineru -p <input_path> -o <output_path>
 
				 
			
 
				 #### 2.3 Using sglang to Accelerate VLM Model Inference
			
 
				 
			
 
				-##### Start sglang-engine Mode
			
 
				+##### Using sglang-engine Mode
			
 
				 
			
 
				 ```bash
			
 
				 mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
			
 
				 ```
			
 
				 
			
 
				-##### Start sglang-server/client Mode
			
 
				+##### Using sglang-server/client Mode
			
 
				 
			
 
				 1. Start Server:
			
 
				 
			
@@ -655,10 +655,12 @@ mineru-sglang-server --port 30000
 
				 ```
			
 
				 
			
 
				 > [!TIP]
			
 
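				-> sglang acceleration requires a GPU with Ampere architecture or newer and at least 24GB VRAM. If you have two 12GB or 16GB GPUs, you can use Tensor Parallelism (TP) mode:
			
 
				-> `mineru-sglang-server --port 30000 --tp 2`
			
 
				-> 
			
 
				-> If you still encounter out-of-memory errors with two GPUs, or need multi-GPU parallelism to increase throughput or inference speed, please refer to the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
			
 
				+> sglang-server supports several commonly used configuration parameters:
			
 
				+> - If you have two GPUs with `12GB` or `16GB` VRAM, you can use Tensor Parallel (TP) mode: `--tp 2`
			
 
				+> - If you have two GPUs with `11GB` VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size: `--tp 2 --mem-fraction-static 0.7`
			
 
				+> - If you have two or more GPUs with `24GB` VRAM or above, you can use sglang's data parallel (DP) mode to increase throughput: `--dp 2`
			
 
				+> - You can also enable `torch.compile` to speed up inference by approximately 15%: `--enable-torch-compile`
			
 
				+> - To learn more about `sglang` parameters, please refer to the [official sglang documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)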
			
 
				 
			
 
				 2. Use Client in another terminal:
			
 
				 
			
--- a/docker/compose.yaml
+++ b/docker/compose.yaml
@@ -1,3 +1,5 @@
 
				+# Documentation:
			
 
				+# https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands
			
 
				 services:
			
 
				   mineru-sglang:
			
 
				     image: mineru-sglang:latest
			
@@ -11,6 +13,10 @@ services:
 
				     command:
			
 
				       --host 0.0.0.0
			
 
				       --port 30000
			
 
				+      # --enable-torch-compile  # You can also enable torch.compile to speed up inference by approximately 15%
			
 
				+      # --dp 2  # If you have two or more GPUs with 24GB VRAM or above, you can use sglang's data parallel (DP) mode to increase throughput
			
 
				+      # --tp 2  # If you have two GPUs with 12GB or 16GB VRAM, you can use the Tensor Parallel (TP) mode
			
 
				+      # --mem-fraction-static 0.7  # If you have two GPUs with 11GB VRAM, in addition to Tensor Parallel mode, you need to reduce the KV cache size
			
 
				     ulimits:
			
 
				       memlock: -1
			
 
				       stack: 67108864
			
@@ -23,4 +29,4 @@ services:
 
				           devices:
			
 
				             - driver: nvidia
			
 
				               device_ids: ["0"]
			
 
				-              capabilities: [gpu]
			
 
				+              capabilities: [gpu]
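
The commented-out flags in `compose.yaml` repeat the GPU guidance from the README tip. As a rough illustration only, the mapping from hardware to flags could be sketched as below. The helper names `suggest_sglang_flags` and `build_command` are hypothetical and not part of MinerU or sglang, and the VRAM ranges between the sizes the tip names (11GB, 12GB, 16GB, 24GB) are an assumption.

```python
import shlex
from typing import List


def suggest_sglang_flags(gpu_count: int, vram_gb: int) -> List[str]:
    """Hypothetical helper: map a GPU setup to the flags named in the tip."""
    if gpu_count < 2:
        # The tip only covers multi-GPU setups; add nothing for a single GPU.
        return []
    if vram_gb >= 24:
        # Each GPU can hold the model on its own, so run data parallel.
        return ["--dp", "2"]
    if vram_gb >= 12:
        # Split the model across two GPUs with tensor parallelism.
        return ["--tp", "2"]
    if vram_gb >= 11:
        # Tensor parallelism plus a smaller static memory fraction (KV cache).
        return ["--tp", "2", "--mem-fraction-static", "0.7"]
    # Smaller cards are not covered by the tip.
    return []


def build_command(gpu_count: int, vram_gb: int, port: int = 30000) -> str:
    """Assemble the server command line shown in the README."""
    args = ["mineru-sglang-server", "--port", str(port)]
    args += suggest_sglang_flags(gpu_count, vram_gb)
    return shlex.join(args)


print(build_command(2, 16))  # mineru-sglang-server --port 30000 --tp 2
print(build_command(2, 11))
```

`--enable-torch-compile` is independent of the GPU layout, so it is left out of the mapping and can be appended to any of the results.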
			
--- a/mineru/backend/pipeline/batch_analyze.py
+++ b/mineru/backend/pipeline/batch_analyze.py
@@ -9,7 +9,7 @@ from ...utils.config_reader import get_formula_enable, get_table_enable
 
				 from ...utils.model_utils import crop_img, get_res_list_from_layout_res
			
 
				 from ...utils.ocr_utils import get_adjusted_mfdetrec_res, get_ocr_result_list, OcrConfidence
			
 
				 
			
 
				-YOLO_LAYOUT_BASE_BATCH_SIZE = 1
			
 
				+YOLO_LAYOUT_BASE_BATCH_SIZE = 8
			
 
				 MFD_BASE_BATCH_SIZE = 1
			
 
				 MFR_BASE_BATCH_SIZE = 16
			
 
				 
			
--- a/mineru/model/layout/doclayout_yolo.py
+++ b/mineru/model/layout/doclayout_yolo.py
@@ -1,64 +1,73 @@
 
				+from typing import List, Dict, Union
			
 
				 from doclayout_yolo import YOLOv10
			
 
				 from tqdm import tqdm
			
 
				+import numpy as np
			
 
				+from PIL import Image
			
 
				 
			
 
				 
			
 
				-class DocLayoutYOLOModel(object):
			
 
				-    def __init__(self, weight, device):
			
 
				-        self.model = YOLOv10(weight)
			
 
				+class DocLayoutYOLOModel:
			
 
				+    def __init__(
			
 
				+        self,
			
 
				+        weight: str,
			
 
				+        device: str = "cuda",
			
 
				+        imgsz: int = 1280,
			
 
				+        conf: float = 0.1,
			
 
				+        iou: float = 0.45,
			
 
				+    ):
			
 
				+        self.model = YOLOv10(weight).to(device)
			
 
				         self.device = device
			
 
				+        self.imgsz = imgsz
			
 
				+        self.conf = conf
			
 
				+        self.iou = iou
			
 
				 
			
 
				-    def predict(self, image):
			
 
				+    def _parse_prediction(self, prediction) -> List[Dict]:
			
 
				         layout_res = []
			
 
				-        doclayout_yolo_res = self.model.predict(
			
 
				-            image,
			
 
				-            imgsz=1280,
			
 
				-            conf=0.10,
			
 
				-            iou=0.45,
			
 
				-            verbose=False, device=self.device
			
 
				-        )[0]
			
 
				-        for xyxy, conf, cla in zip(
			
 
				-            doclayout_yolo_res.boxes.xyxy.cpu(),
			
 
				-            doclayout_yolo_res.boxes.conf.cpu(),
			
 
				-            doclayout_yolo_res.boxes.cls.cpu(),
			
 
				+
			
 
				+        # Fault tolerance: return an empty list when the prediction has no boxes
			
 
				+        if not hasattr(prediction, "boxes") or prediction.boxes is None:
			
 
				+            return layout_res
			
 
				+
			
 
				+        for xyxy, conf, cls in zip(
			
 
				+            prediction.boxes.xyxy.cpu(),
			
 
				+            prediction.boxes.conf.cpu(),
			
 
				+            prediction.boxes.cls.cpu(),
			
 
				         ):
			
 
				-            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
			
 
				-            new_item = {
			
 
				-                "category_id": int(cla.item()),
			
 
				+            coords = list(map(int, xyxy.tolist()))
			
 
				+            xmin, ymin, xmax, ymax = coords
			
 
				+            layout_res.append({
			
 
				+                "category_id": int(cls.item()),
			
 
				                 "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
			
 
				                 "score": round(float(conf.item()), 3),
			
 
				-            }
			
 
				-            layout_res.append(new_item)
			
 
				+            })
			
 
				         return layout_res
			
 
				 
			
 
				-    def batch_predict(self, images: list, batch_size: int) -> list:
			
 
				-        images_layout_res = []
			
 
				-        # for index in range(0, len(images), batch_size):
			
 
				-        for index in tqdm(range(0, len(images), batch_size), desc="Layout Predict"):
			
 
				-            doclayout_yolo_res = [
			
 
				-                image_res.cpu()
			
 
				-                for image_res in self.model.predict(
			
 
				-                    images[index : index + batch_size],
			
 
				-                    imgsz=1280,
			
 
				-                    conf=0.10,
			
 
				-                    iou=0.45,
			
 
				+    def predict(self, image: Union[np.ndarray, Image.Image]) -> List[Dict]:
			
 
				+        prediction = self.model.predict(
			
 
				+            image,
			
 
				+            imgsz=self.imgsz,
			
 
				+            conf=self.conf,
			
 
				+            iou=self.iou,
			
 
				+            verbose=False
			
 
				+        )[0]
			
 
				+        return self._parse_prediction(prediction)
			
 
				+
			
 
				+    def batch_predict(
			
 
				+        self,
			
 
				+        images: List[Union[np.ndarray, Image.Image]],
			
 
				+        batch_size: int = 4
			
 
				+    ) -> List[List[Dict]]:
			
 
				+        results = []
			
 
				+        with tqdm(total=len(images), desc="Layout Predict") as pbar:
			
 
				+            for idx in range(0, len(images), batch_size):
			
 
				+                batch = images[idx: idx + batch_size]
			
 
				+                predictions = self.model.predict(
			
 
				+                    batch,
			
 
				+                    imgsz=self.imgsz,
			
 
				+                    conf=self.conf,
			
 
				+                    iou=self.iou,
			
 
				                     verbose=False,
			
 
				-                    device=self.device,
			
 
				                 )
			
 
				-            ]
			
 
				-            for image_res in doclayout_yolo_res:
			
 
				-                layout_res = []
			
 
				-                for xyxy, conf, cla in zip(
			
 
				-                    image_res.boxes.xyxy,
			
 
				-                    image_res.boxes.conf,
			
 
				-                    image_res.boxes.cls,
			
 
				-                ):
			
 
				-                    xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
			
 
				-                    new_item = {
			
 
				-                        "category_id": int(cla.item()),
			
 
				-                        "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
			
 
				-                        "score": round(float(conf.item()), 3),
			
 
				-                    }
			
 
				-                    layout_res.append(new_item)
			
 
				-                images_layout_res.append(layout_res)
			
 
				-
			
 
				-        return images_layout_res
			
 
				+                for pred in predictions:
			
 
				+                    results.append(self._parse_prediction(pred))
			
 
				+                pbar.update(len(batch))
			
 
				+        return results
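
The new `_parse_prediction` turns each detected box into a dictionary with an eight-value `poly`. A minimal, torch-free sketch of that conversion follows. The function name `box_to_layout_item` and the sample numbers are hypothetical; only the output keys and the corner order are taken from the diff above.

```python
from typing import Dict, Sequence


def box_to_layout_item(xyxy: Sequence[float], score: float, category_id: int) -> Dict:
    """Convert one (xmin, ymin, xmax, ymax) box into the layout dict format."""
    # Truncate coordinates to integers, as int() does in the diff.
    xmin, ymin, xmax, ymax = (int(v) for v in xyxy)
    return {
        "category_id": int(category_id),
        # Corners in order: top-left, top-right, bottom-right, bottom-left.
        "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
        "score": round(float(score), 3),
    }


item = box_to_layout_item([10.7, 20.2, 110.9, 220.5], 0.98765, 1)
print(item["poly"])   # [10, 20, 110, 20, 110, 220, 10, 220]
print(item["score"])  # 0.988
```

Because `predict` and `batch_predict` now share this parsing step, the single-image and batch paths produce the same item format.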
			
--- a/mineru/model/mfd/yolo_v8.py
+++ b/mineru/model/mfd/yolo_v8.py
@@ -1,33 +1,53 @@
 
				+from typing import List, Union
			
 
				 from tqdm import tqdm
			
 
				 from ultralytics import YOLO
			
 
				+import numpy as np
			
 
				+from PIL import Image
			
 
				 
			
 
				 
			
 
				-class YOLOv8MFDModel(object):
			
 
				-    def __init__(self, weight, device="cpu"):
			
 
				-        self.mfd_model = YOLO(weight)
			
 
				+class YOLOv8MFDModel:
			
 
				+    def __init__(
			
 
				+        self,
			
 
				+        weight: str,
			
 
				+        device: str = "cpu",
			
 
				+        imgsz: int = 1888,
			
 
				+        conf: float = 0.25,
			
 
				+        iou: float = 0.45,
			
 
				+    ):
			
 
				+        self.model = YOLO(weight).to(device)
			
 
				         self.device = device
			
 
				+        self.imgsz = imgsz
			
 
				+        self.conf = conf
			
 
				+        self.iou = iou
			
 
				 
			
 
				-    def predict(self, image):
			
 
				-        mfd_res = self.mfd_model.predict(
			
 
				-            image, imgsz=1888, conf=0.25, iou=0.45, verbose=False, device=self.device
			
 
				-        )[0]
			
 
				-        return mfd_res
			
 
				+    def _run_predict(
			
 
				+        self,
			
 
				+        inputs: Union[np.ndarray, Image.Image, List],
			
 
				+        is_batch: bool = False
			
 
				+    ) -> List:
			
 
				+        preds = self.model.predict(
			
 
				+            inputs,
			
 
				+            imgsz=self.imgsz,
			
 
				+            conf=self.conf,
			
 
				+            iou=self.iou,
			
 
				+            verbose=False,
			
 
				+            device=self.device
			
 
				+        )
			
 
				+        return [pred.cpu() for pred in preds] if is_batch else preds[0].cpu()
			
 
				 
			
 
				-    def batch_predict(self, images: list, batch_size: int) -> list:
			
 
				-        images_mfd_res = []
			
 
				-        # for index in range(0, len(images), batch_size):
			
 
				-        for index in tqdm(range(0, len(images), batch_size), desc="MFD Predict"):
			
 
				-            mfd_res = [
			
 
				-                image_res.cpu()
			
 
				-                for image_res in self.mfd_model.predict(
			
 
				-                    images[index : index + batch_size],
			
 
				-                    imgsz=1888,
			
 
				-                    conf=0.25,
			
 
				-                    iou=0.45,
			
 
				-                    verbose=False,
			
 
				-                    device=self.device,
			
 
				-                )
			
 
				-            ]
			
 
				-            for image_res in mfd_res:
			
 
				-                images_mfd_res.append(image_res)
			
 
				-        return images_mfd_res
			
 
				+    def predict(self, image: Union[np.ndarray, Image.Image]):
			
 
				+        return self._run_predict(image)
			
 
				+
			
 
				+    def batch_predict(
			
 
				+        self,
			
 
				+        images: List[Union[np.ndarray, Image.Image]],
			
 
				+        batch_size: int = 4
			
 
				+    ) -> List:
			
 
				+        results = []
			
 
				+        with tqdm(total=len(images), desc="MFD Predict") as pbar:
			
 
				+            for idx in range(0, len(images), batch_size):
			
 
				+                batch = images[idx: idx + batch_size]
			
 
				+                batch_preds = self._run_predict(batch, is_batch=True)
			
 
				+                results.extend(batch_preds)
			
 
				+                pbar.update(len(batch))
			
 
				+        return results
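
Both `batch_predict` implementations now walk the image list in slices of `batch_size` and advance the progress bar by `len(batch)`, so a shorter final batch is counted correctly. The slicing pattern can be sketched without a model. Here `run_in_batches` and `fake_predict` are hypothetical stand-ins, and `tqdm` is omitted to keep the example dependency-free.

```python
from typing import Callable, List, Sequence


def run_in_batches(
    items: Sequence,
    batch_size: int,
    predict_fn: Callable[[Sequence], List],
) -> List:
    """Call predict_fn on consecutive slices and flatten the results."""
    results: List = []
    for idx in range(0, len(items), batch_size):
        # The last slice may be shorter than batch_size.
        batch = items[idx: idx + batch_size]
        results.extend(predict_fn(batch))
    return results


seen_batch_sizes: List[int] = []


def fake_predict(batch: Sequence) -> List[str]:
    """Stand-in for model.predict: records the batch size, returns one result per item."""
    seen_batch_sizes.append(len(batch))
    return [f"result-{item}" for item in batch]


outputs = run_in_batches(list(range(10)), batch_size=4, predict_fn=fake_predict)
print(seen_batch_sizes)  # [4, 4, 2]
print(len(outputs))      # 10
```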
			
--- a/mineru/utils/pdf_reader.py
+++ b/mineru/utils/pdf_reader.py
@@ -15,7 +15,7 @@ def page_to_image(
 
				     scale = dpi / 72
			
 
				 
			
 
				     long_side_length = max(*page.get_size())
			
 
				-    if long_side_length > max_width_or_height:
			
 
				+    if long_side_length * scale > max_width_or_height:
			
 
				         scale = max_width_or_height / long_side_length
			
 
				 
			
 
				     bitmap: PdfBitmap = page.render(scale=scale)  # type: ignore
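
The one-line change in `page_to_image` compares the rendered size in pixels (points times scale) against the limit, where the old check compared the size in points. A small worked example shows the difference. The function names and the input values (a 2000 pt page, 200 dpi, a 3500 px limit) are hypothetical and chosen only for illustration.

```python
def old_render_scale(long_side_pt: float, dpi: int, max_px: int) -> float:
    """Pre-fix check: compares the page size in points against a pixel limit."""
    scale = dpi / 72
    if long_side_pt > max_px:
        scale = max_px / long_side_pt
    return scale


def new_render_scale(long_side_pt: float, dpi: int, max_px: int) -> float:
    """Post-fix check: compares the rendered pixel size against the limit."""
    scale = dpi / 72
    if long_side_pt * scale > max_px:
        scale = max_px / long_side_pt
    return scale


# Hypothetical inputs: a large page rendered at 200 dpi with a 3500 px cap.
long_side_pt, dpi, max_px = 2000, 200, 3500

old_px = long_side_pt * old_render_scale(long_side_pt, dpi, max_px)
new_px = long_side_pt * new_render_scale(long_side_pt, dpi, max_px)
print(round(old_px))  # 5556, exceeds the cap because 2000 pt < 3500
print(round(new_px))  # 3500, clamped to the cap
```

For ordinary page sizes, such as an 842 pt long side, both versions return `dpi / 72`, so the fix only affects pages whose rendered size would exceed the limit.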