
Merge remote-tracking branch 'origin/dev' into dev

myhloli, 10 months ago
Commit 869cf0a609
30 changed files with 247 additions and 558 deletions
  1. README.md (+4 -2)
  2. README_zh-CN.md (+4 -2)
  3. docs/README_Ascend_NPU_Acceleration_zh_CN.md (+2 -1)
  4. magic_pdf/data/dataset.py (+1 -0)
  5. magic_pdf/model/model_list.py (+1 -0)
  6. magic_pdf/model/sub_modules/language_detection/utils.py (+22 -13)
  7. magic_pdf/model/sub_modules/language_detection/yolov11/YOLOv11.py (+10 -5)
  8. magic_pdf/model/sub_modules/model_init.py (+20 -1)
  9. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py (+5 -5)
  10. magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py (+18 -10)
  11. magic_pdf/pdf_parse_union_core_v2.py (+7 -0)
  12. magic_pdf/resources/model_config/model_configs.yaml (+1 -2)
  13. magic_pdf/resources/yolov11-langdetect/yolo_v11_ft.pt (BIN)
  14. next_docs/en/user_guide/quick_start.rst (+1 -4)
  15. next_docs/en/user_guide/quick_start/convert_docx.rst (+0 -58)
  16. next_docs/en/user_guide/quick_start/convert_image.rst (+0 -5)
  17. next_docs/en/user_guide/quick_start/convert_ms_office.rst (+11 -10)
  18. next_docs/en/user_guide/quick_start/convert_pdf.rst (+5 -4)
  19. next_docs/en/user_guide/quick_start/convert_ppt.rst (+0 -58)
  20. next_docs/en/user_guide/quick_start/convert_pptx.rst (+0 -61)
  21. projects/gradio_app/app.py (+2 -2)
  22. projects/gradio_app/examples/complex_layout.pdf (BIN)
  23. requirements.txt (+1 -1)
  24. setup.py (+1 -1)
  25. tests/retry_env.sh (+2 -1)
  26. tests/test_cli/pdf_dev/annotations/cleaned/cleaned_research_report_1f978cd81fb7260c8f7644039ec2c054.md (+0 -246)
  27. tests/test_cli/pdf_dev/doc/test_mineru.docx (BIN)
  28. tests/test_cli/pdf_dev/images/docstructbench.jpg (BIN)
  29. tests/test_cli/pdf_dev/ppt/small.pptx (BIN)
  30. tests/test_cli/test_cli_sdk.py (+129 -66)

+ 4 - 2
README.md

@@ -42,13 +42,15 @@
 </div>
 
 # Changelog
-- 2025/01/06 1.0.0 released. This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring:
+- 2025/01/06 1.0.0 released. This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:
   - New API Interface
     - For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.
     - For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.
   - Enhanced Compatibility
     - By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.
-    - We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China.
+    - We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. [Ascend NPU Acceleration](docs/README_Ascend_NPU_Acceleration_zh_CN.md)
+  - Automatic Language Identification
+    - By introducing a new language recognition model, setting the `lang` configuration to `auto` during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.
 - 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities,
   - Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
   - Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
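
A minimal sketch (not part of the commit) of how the new automatic language identification could be driven from the Python API. Passing `lang="auto"` to `PymuDocDataset` is inferred from the `dataset.py` change further down; the file name and output paths are placeholders.

```python
import os

from magic_pdf.data.data_reader_writer import FileBasedDataReader, FileBasedDataWriter
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

# read a scanned PDF (placeholder path)
pdf_bytes = FileBasedDataReader("").read("some_scanned_document.pdf")

os.makedirs("output/images", exist_ok=True)
image_writer = FileBasedDataWriter("output/images")
md_writer = FileBasedDataWriter("output")

# "auto" lets the new YOLOv11-based detector choose the OCR language model
# instead of the default one (assumed keyword, see the dataset.py diff below)
ds = PymuDocDataset(pdf_bytes, lang="auto")

ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
    md_writer, "some_scanned_document.md", "images"
)
```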

+ 4 - 2
README_zh-CN.md

@@ -42,13 +42,15 @@
 </div>
 
 # 更新记录
-- 2025/01/06 1.0.0 发布,这是我们的第一个正式版本,在这个版本中,我们通过大量重构带来了全新的API接口和更广泛的兼容性:
+- 2025/01/06 1.0.0 发布,这是我们的第一个正式版本,在这个版本中,我们通过大量重构带来了全新的API接口和更广泛的兼容性,以及全新的自动语言识别功能
   - 全新API接口 
     - 对于数据侧API,我们引入了Dataset类,旨在提供一个强大而灵活的数据处理框架。该框架当前支持包括图像(.jpg及.png)、PDF、Word(.doc及.docx)、以及PowerPoint(.ppt及.pptx)在内的多种文档格式,确保了从简单到复杂的数据处理任务都能得到有效的支持。
     - 针对用户侧API,我们将MinerU的处理流程精心设计为一系列可组合的Stage阶段。每个Stage代表了一个特定的处理步骤,用户可以根据自身需求自由地定义新的Stage,并通过创造性地组合这些阶段来定制专属的数据处理流程。
   - 更广泛的兼容性适配
     - 通过优化依赖环境和配置项,确保在ARM架构的Linux系统上能够稳定高效运行。
-    - 深度适配华为昇腾NPU加速,积极响应信创要求,提供自主可控的高性能计算能力,助力人工智能应用平台的国产化应用与发展。
+    - 深度适配华为昇腾NPU加速,积极响应信创要求,提供自主可控的高性能计算能力,助力人工智能应用平台的国产化应用与发展。[NPU加速教程](docs/README_Ascend_NPU_Acceleration_zh_CN.md)
+  - 自动语言识别
+    - 通过引入全新的语言识别模型, 在文档解析中将`lang`配置为`auto`,即可自动选择合适的OCR语言模型,提升扫描类文档解析的准确性。
 - 2024/11/22 0.10.0发布,通过引入混合OCR文本提取能力,
   - 在公式密集、span区域不规范、部分文本使用图像表现等复杂文本分布场景下获得解析效果的显著提升
   - 同时具备文本模式内容提取准确、速度更快与OCR模式span/line区域识别更准的双重优势

+ 2 - 1
docs/README_Ascend_NPU_Acceleration_zh_CN.md

@@ -51,6 +51,7 @@ magic-pdf --help
 
 ## 已知问题
 
-- paddleocr使用内嵌onnx模型,仅支持中英文ocr,不支持其他语言ocr
+- paddleocr使用内嵌onnx模型,仅在默认语言配置下能以较快速度对中英文进行识别
+- 自定义lang参数时,paddleocr速度会存在明显下降情况
 - layout模型使用layoutlmv3时会发生间歇性崩溃,建议使用默认配置的doclayout_yolo模型
 - 表格解析仅适配了rapid_table模型,其他模型可能会无法使用

+ 1 - 0
magic_pdf/data/dataset.py

@@ -153,6 +153,7 @@ class PymuDocDataset(Dataset):
             logger.info(f"lang: {lang}, detect_lang: {self._lang}")
         else:
             self._lang = lang
+            logger.info(f"lang: {lang}")
     def __len__(self) -> int:
         """The page number of the pdf."""
         return len(self._records)

+ 1 - 0
magic_pdf/model/model_list.py

@@ -9,3 +9,4 @@ class AtomicModel:
     MFR = "mfr"
     OCR = "ocr"
     Table = "table"
+    LangDetect = "langdetect"

+ 22 - 13
magic_pdf/model/sub_modules/language_detection/utils.py

@@ -12,7 +12,6 @@ from magic_pdf.data.utils import load_images_from_pdf
 from magic_pdf.libs.config_reader import get_local_models_dir, get_device
 from magic_pdf.libs.pdf_check import extract_pages
 from magic_pdf.model.model_list import AtomicModel
-from magic_pdf.model.sub_modules.language_detection.yolov11.YOLOv11 import YOLOv11LangDetModel
 from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
 
 
@@ -25,11 +24,11 @@ def get_model_config():
     config_path = os.path.join(model_config_dir, 'model_configs.yaml')
     with open(config_path, 'r', encoding='utf-8') as f:
         configs = yaml.load(f, Loader=yaml.FullLoader)
-    return local_models_dir, device, configs
+    return root_dir, local_models_dir, device, configs
 
 
 def get_text_images(simple_images):
-    local_models_dir, device, configs = get_model_config()
+    _, local_models_dir, device, configs = get_model_config()
     atom_model_manager = AtomModelSingleton()
     temp_layout_model = atom_model_manager.get_atom_model(
         atom_model_name=AtomicModel.Layout,
@@ -59,15 +58,25 @@ def get_text_images(simple_images):
 def auto_detect_lang(pdf_bytes: bytes):
     sample_docs = extract_pages(pdf_bytes)
     sample_pdf_bytes = sample_docs.tobytes()
-    simple_images = load_images_from_pdf(sample_pdf_bytes, dpi=96)
+    simple_images = load_images_from_pdf(sample_pdf_bytes, dpi=200)
     text_images = get_text_images(simple_images)
-    local_models_dir, device, configs = get_model_config()
-    # 用yolo11做语言分类
-    langdetect_model_weights = str(
-        os.path.join(
-            local_models_dir, configs['weights'][MODEL_NAME.YOLO_V11_LangDetect]
-        )
-    )
-    langdetect_model = YOLOv11LangDetModel(langdetect_model_weights, device)
+    langdetect_model = model_init(MODEL_NAME.YOLO_V11_LangDetect)
     lang = langdetect_model.do_detect(text_images)
-    return lang
+    return lang
+
+
+def model_init(model_name: str):
+    atom_model_manager = AtomModelSingleton()
+
+    if model_name == MODEL_NAME.YOLO_V11_LangDetect:
+        root_dir, _, device, _ = get_model_config()
+        model = atom_model_manager.get_atom_model(
+            atom_model_name=AtomicModel.LangDetect,
+            langdetect_model_name=MODEL_NAME.YOLO_V11_LangDetect,
+            langdetect_model_weight=str(os.path.join(root_dir, 'resources', 'yolov11-langdetect', 'yolo_v11_ft.pt')),
+            device=device,
+        )
+    else:
+        raise ValueError(f"model_name {model_name} not found")
+    return model
+
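
For orientation, a hypothetical call into the refactored helper above; the behaviour described in the comments is taken from `extract_pages`, `get_text_images` and `YOLOv11LangDetModel.do_detect` as shown in this commit, and the file name is a placeholder.

```python
from magic_pdf.model.sub_modules.language_detection.utils import auto_detect_lang

with open("scanned_document.pdf", "rb") as f:  # placeholder file name
    pdf_bytes = f.read()

# Samples a few pages, renders them at 200 dpi, crops text regions with the
# layout model, then lets the YOLOv11 classifier vote on the dominant language.
lang = auto_detect_lang(pdf_bytes)
print(lang)  # e.g. "en" or "ch"; None if no text regions were found
```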

+ 10 - 5
magic_pdf/model/sub_modules/language_detection/yolov11/YOLOv11.py

@@ -2,6 +2,7 @@
 from collections import Counter
 from uuid import uuid4
 
+import torch
 from PIL import Image
 from loguru import logger
 from ultralytics import YOLO
@@ -83,10 +84,14 @@ def resize_images_to_224(image):
 
 
 class YOLOv11LangDetModel(object):
-    def __init__(self, weight, device):
-        self.model = YOLO(weight)
-        self.device = device
+    def __init__(self, langdetect_model_weight, device):
 
+        self.model = YOLO(langdetect_model_weight)
+
+        if str(device).startswith("npu"):
+            self.device = torch.device(device)
+        else:
+            self.device = device
     def do_detect(self, images: list):
         all_images = []
         for image in images:
@@ -99,7 +104,7 @@ class YOLOv11LangDetModel(object):
                 all_images.append(resize_images_to_224(temp_image))
 
         images_lang_res = self.batch_predict(all_images, batch_size=8)
-        logger.info(f"images_lang_res: {images_lang_res}")
+        # logger.info(f"images_lang_res: {images_lang_res}")
         if len(images_lang_res) > 0:
             count_dict = Counter(images_lang_res)
             language = max(count_dict, key=count_dict.get)
@@ -107,7 +112,6 @@ class YOLOv11LangDetModel(object):
             language = None
         return language
 
-
     def predict(self, image):
         results = self.model.predict(image, verbose=False, device=self.device)
         predicted_class_id = int(results[0].probs.top1)
@@ -117,6 +121,7 @@ class YOLOv11LangDetModel(object):
 
     def batch_predict(self, images: list, batch_size: int) -> list:
         images_lang_res = []
+
         for index in range(0, len(images), batch_size):
             lang_res = [
                 image_res.cpu()

+ 20 - 1
magic_pdf/model/sub_modules/model_init.py

@@ -2,8 +2,8 @@ import torch
 from loguru import logger
 
 from magic_pdf.config.constants import MODEL_NAME
-from magic_pdf.libs.config_reader import get_device
 from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.language_detection.yolov11.YOLOv11 import YOLOv11LangDetModel
 from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import \
     DocLayoutYOLOModel
 from magic_pdf.model.sub_modules.layout.layoutlmv3.model_init import \
@@ -63,6 +63,13 @@ def doclayout_yolo_model_init(weight, device='cpu'):
     return model
 
 
+def langdetect_model_init(langdetect_model_weight, device='cpu'):
+    if str(device).startswith("npu"):
+        device = torch.device(device)
+    model = YOLOv11LangDetModel(langdetect_model_weight, device)
+    return model
+
+
 def ocr_model_init(show_log: bool = False,
                    det_db_box_thresh=0.3,
                    lang=None,
@@ -130,6 +137,9 @@ def atom_model_init(model_name: str, **kwargs):
                 kwargs.get('doclayout_yolo_weights'),
                 kwargs.get('device')
             )
+        else:
+            logger.error('layout model name not allow')
+            exit(1)
     elif model_name == AtomicModel.MFD:
         atom_model = mfd_model_init(
             kwargs.get('mfd_weights'),
@@ -155,6 +165,15 @@ def atom_model_init(model_name: str, **kwargs):
             kwargs.get('device'),
             kwargs.get('ocr_engine')
         )
+    elif model_name == AtomicModel.LangDetect:
+        if kwargs.get('langdetect_model_name') == MODEL_NAME.YOLO_V11_LangDetect:
+            atom_model = langdetect_model_init(
+                kwargs.get('langdetect_model_weight'),
+                kwargs.get('device')
+            )
+        else:
+            logger.error('langdetect model name not allow')
+            exit(1)
     else:
         logger.error('model name not allow')
         exit(1)
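
With `AtomicModel.LangDetect` registered here, the language-detection model can also be obtained through the shared atom-model manager. The call below mirrors the one added in `language_detection/utils.py` above; the weight path is spelled out relative to the package root purely for illustration.

```python
from magic_pdf.config.constants import MODEL_NAME
from magic_pdf.model.model_list import AtomicModel
from magic_pdf.model.sub_modules.model_init import AtomModelSingleton

# Obtain the langdetect atom model through the shared singleton manager;
# internally this routes to langdetect_model_init() -> YOLOv11LangDetModel.
langdetect_model = AtomModelSingleton().get_atom_model(
    atom_model_name=AtomicModel.LangDetect,
    langdetect_model_name=MODEL_NAME.YOLO_V11_LangDetect,
    langdetect_model_weight="magic_pdf/resources/yolov11-langdetect/yolo_v11_ft.pt",
    device="cpu",
)
```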

+ 5 - 5
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py

@@ -21,7 +21,7 @@ class ModifiedPaddleOCR(PaddleOCR):
     def __init__(self, *args, **kwargs):
 
         super().__init__(*args, **kwargs)
-
+        self.lang = kwargs.get('lang', 'ch')
         # 在cpu架构为arm且不支持cuda时调用onnx、
         if not torch.cuda.is_available() and platform.machine() in ['arm64', 'aarch64']:
             self.use_onnx = True
@@ -94,7 +94,7 @@ class ModifiedPaddleOCR(PaddleOCR):
             ocr_res = []
             for img in imgs:
                 img = preprocess_image(img)
-                if self.use_onnx:
+                if self.lang in ['ch'] and self.use_onnx:
                     dt_boxes, elapse = self.additional_ocr.text_detector(img)
                 else:
                     dt_boxes, elapse = self.text_detector(img)
@@ -124,7 +124,7 @@ class ModifiedPaddleOCR(PaddleOCR):
                     img, cls_res_tmp, elapse = self.text_classifier(img)
                     if not rec:
                         cls_res.append(cls_res_tmp)
-                if self.use_onnx:
+                if self.lang in ['ch'] and self.use_onnx:
                     rec_res, elapse = self.additional_ocr.text_recognizer(img)
                 else:
                     rec_res, elapse = self.text_recognizer(img)
@@ -142,7 +142,7 @@ class ModifiedPaddleOCR(PaddleOCR):
 
         start = time.time()
         ori_im = img.copy()
-        if self.use_onnx:
+        if self.lang in ['ch'] and self.use_onnx:
             dt_boxes, elapse = self.additional_ocr.text_detector(img)
         else:
             dt_boxes, elapse = self.text_detector(img)
@@ -183,7 +183,7 @@ class ModifiedPaddleOCR(PaddleOCR):
             time_dict['cls'] = elapse
             logger.debug("cls num  : {}, elapsed : {}".format(
                 len(img_crop_list), elapse))
-        if self.use_onnx:
+        if self.lang in ['ch'] and self.use_onnx:
             rec_res, elapse = self.additional_ocr.text_recognizer(img_crop_list)
         else:
             rec_res, elapse = self.text_recognizer(img_crop_list)

+ 18 - 10
magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py

@@ -8,17 +8,25 @@ from rapid_table import RapidTable
 class RapidTableModel(object):
     def __init__(self, ocr_engine):
         self.table_model = RapidTable()
-        if ocr_engine is None:
-            self.ocr_model_name = "RapidOCR"
-            if torch.cuda.is_available():
-                from rapidocr_paddle import RapidOCR
-                self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
-            else:
-                from rapidocr_onnxruntime import RapidOCR
-                self.ocr_engine = RapidOCR()
+        # if ocr_engine is None:
+        #     self.ocr_model_name = "RapidOCR"
+        #     if torch.cuda.is_available():
+        #         from rapidocr_paddle import RapidOCR
+        #         self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+        #     else:
+        #         from rapidocr_onnxruntime import RapidOCR
+        #         self.ocr_engine = RapidOCR()
+        # else:
+        #     self.ocr_model_name = "PaddleOCR"
+        #     self.ocr_engine = ocr_engine
+
+        self.ocr_model_name = "RapidOCR"
+        if torch.cuda.is_available():
+            from rapidocr_paddle import RapidOCR
+            self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
         else:
-            self.ocr_model_name = "PaddleOCR"
-            self.ocr_engine = ocr_engine
+            from rapidocr_onnxruntime import RapidOCR
+            self.ocr_engine = RapidOCR()
 
     def predict(self, image):
 

+ 7 - 0
magic_pdf/pdf_parse_union_core_v2.py

@@ -373,6 +373,8 @@ def cal_block_index(fix_blocks, sorted_bboxes):
         # 使用xycut排序
         block_bboxes = []
         for block in fix_blocks:
+            # 如果block['bbox']任意值小于0,将其置为0
+            block['bbox'] = [max(0, x) for x in block['bbox']]
             block_bboxes.append(block['bbox'])
 
             # 删除图表body block中的虚拟line信息, 并用real_lines信息回填
@@ -766,6 +768,11 @@ def parse_page_core(
     """重排block"""
     sorted_blocks = sorted(fix_blocks, key=lambda b: b['index'])
 
+    """block内重排(img和table的block内多个caption或footnote的排序)"""
+    for block in sorted_blocks:
+        if block['type'] in [BlockType.Image, BlockType.Table]:
+            block['blocks'] = sorted(block['blocks'], key=lambda b: b['index'])
+
     """获取QA需要外置的list"""
     images, tables, interline_equations = get_qa_need_list_v2(sorted_blocks)
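
Two small behavioural fixes land here: layout boxes with negative coordinates are clamped to zero before xycut sorting (the Chinese comment reads "if any value of block['bbox'] is below 0, set it to 0"), and captions/footnotes nested inside image and table blocks are re-ordered by their index. A tiny worked example of the clamping:

```python
bbox = [-3, 12, 480, 640]          # a layout box that slightly overflows the page edge
bbox = [max(0, x) for x in bbox]   # -> [0, 12, 480, 640], safe for xycut sorting
```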
 

+ 1 - 2
magic_pdf/resources/model_config/model_configs.yaml

@@ -5,5 +5,4 @@ weights:
   unimernet_small: MFR/unimernet_small
   struct_eqtable: TabRec/StructEqTable
   tablemaster: TabRec/TableMaster
-  rapid_table: TabRec/RapidTable
-  yolo_v11n_langdetect: LangDetect/YOLO/yolo_v11_cls_ft.pt
+  rapid_table: TabRec/RapidTable

BIN
magic_pdf/resources/yolov11-langdetect/yolo_v11_ft.pt


+ 1 - 4
next_docs/en/user_guide/quick_start.rst

@@ -9,7 +9,4 @@ Want to learn about the usage methods under different scenarios ? This page give
 
     quick_start/convert_pdf 
     quick_start/convert_image
-    quick_start/convert_ppt
-    quick_start/convert_pptx
-    quick_start/convert_doc
-    quick_start/convert_docx
+    quick_start/convert_ms_office

+ 0 - 58
next_docs/en/user_guide/quick_start/convert_docx.rst

@@ -1,58 +0,0 @@
-
-Convert DocX
-=============
-
-.. admonition:: Warning
-    :class: tip
-
-    When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
-
-    For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
-
-
-Command Line
-^^^^^^^^^^^^^
-
-.. code:: python
-
-    # make sure the file have correct suffix
-    magic-pdf -p a.docx -o output -m auto
-
-
-API
-^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_office
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_docx.docx"     # replace with real ms-office file
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_office(input_file)[0]
-
-    # ocr mode
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
-
-    # txt mode
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )

+ 0 - 5
next_docs/en/user_guide/quick_start/convert_image.rst

@@ -45,8 +45,3 @@ API
     ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
         md_writer, f"{input_file_name}.md", image_dir
     )
-
-    # txt mode
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )

+ 11 - 10
next_docs/en/user_guide/quick_start/convert_doc.rst → next_docs/en/user_guide/quick_start/convert_ms_office.rst

@@ -17,7 +17,7 @@ Command Line
 
 .. code:: python
 
-    # make sure the file have correct suffix
+    # replace with real ms-office file, we support MS-DOC, MS-DOCX, MS-PPT, MS-PPTX now
     magic-pdf -p a.doc -o output -m auto
 
 
@@ -30,6 +30,8 @@ API
     from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
     from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
     from magic_pdf.data.read_api import read_local_office
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+
 
     # prepare env
     local_image_dir, local_md_dir = "output/images", "output"
@@ -43,17 +45,16 @@ API
 
     # proc
     ## Create Dataset Instance
-    input_file = "some_doc.doc"     # replace with real ms-office file
+    input_file = "some_doc.doc"     # replace with real ms-office file, we support MS-DOC, MS-DOCX, MS-PPT, MS-PPTX now
 
     input_file_name = input_file.split(".")[0]
     ds = read_local_office(input_file)[0]
 
-    # ocr mode
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
 
-    # txt mode
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
+    ## inference
+    if ds.classify() == SupportedPdfParseMethod.OCR:
+        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+        md_writer, f"{input_file_name}.md", image_dir)
+    else:
+        ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
+        md_writer, f"{input_file_name}.md", image_dir)

+ 5 - 4
next_docs/en/user_guide/quick_start/convert_pdf.rst

@@ -44,12 +44,13 @@ API
     ## Create Dataset Instance
     ds = PymuDocDataset(pdf_bytes)
 
-    # ocr mode
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+    ## inference
+    if ds.classify() == SupportedPdfParseMethod.OCR:
+        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
         md_writer, f"{name_without_suff}.md", image_dir
     )
 
-    # txt mode
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
+    else:
+        ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
         md_writer, f"{name_without_suff}.md", image_dir
     )

+ 0 - 58
next_docs/en/user_guide/quick_start/convert_ppt.rst

@@ -1,58 +0,0 @@
-
-
-Convert PPT
-============
-
-.. admonition:: Warning
-    :class: tip
-
-    When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
-
-    For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
-
-Command Line
-^^^^^^^^^^^^^
-
-.. code:: python
-
-    # make sure the file have correct suffix
-    magic-pdf -p a.ppt -o output -m auto
-
-
-API
-^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_office
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_ppt.ppt"     # replace with real ms-office file
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_office(input_file)[0]
-
-    # ocr mode
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
-
-    # txt mode
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )

+ 0 - 61
next_docs/en/user_guide/quick_start/convert_pptx.rst

@@ -1,61 +0,0 @@
-
-
-Convert PPTX
-=================
-
-.. admonition:: Warning
-    :class: tip
-
-    When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
-
-    For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
-
-
-Command Line
-^^^^^^^^^^^^^
-
-.. code:: python
-
-    # make sure the file have correct suffix
-    magic-pdf -p a.pptx -o output -m auto
-
-
-
-
-API
-^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_office
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_pptx.pptx"     # replace with real ms-office file
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_office(input_file)[0]
-
-    # ocr mode
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
-
-    # txt mode
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )

+ 2 - 2
projects/gradio_app/app.py

@@ -193,7 +193,7 @@ if __name__ == '__main__':
                 max_pages = gr.Slider(1, 20, 10, step=1, label='Max convert pages')
                 with gr.Row():
                     layout_mode = gr.Dropdown(['layoutlmv3', 'doclayout_yolo'], label='Layout model', value='doclayout_yolo')
-                    language = gr.Dropdown(all_lang, label='Language', value='')
+                    language = gr.Dropdown(all_lang, label='Language', value='auto')
                 with gr.Row():
                     formula_enable = gr.Checkbox(label='Enable formula recognition', value=True)
                     is_ocr = gr.Checkbox(label='Force enable OCR', value=False)
@@ -221,6 +221,6 @@ if __name__ == '__main__':
         file.change(fn=to_pdf, inputs=file, outputs=pdf_show)
         change_bu.click(fn=to_markdown, inputs=[file, max_pages, is_ocr, layout_mode, formula_enable, table_enable, language],
                         outputs=[md, md_text, output_file, pdf_show])
-        clear_bu.add([file, md, pdf_show, md_text, output_file, is_ocr, language])
+        clear_bu.add([file, md, pdf_show, md_text, output_file, is_ocr])
 
     demo.launch(server_name='0.0.0.0')

BIN
projects/gradio_app/examples/complex_layout.pdf


+ 1 - 1
requirements.txt

@@ -1,7 +1,7 @@
 boto3>=1.28.43
 Brotli>=1.1.0
 click>=8.1.7
-fast-langdetect==0.2.0
+fast-langdetect>=0.2.3
 loguru>=0.6.0
 numpy>=1.21.6,<2.0.0
 pydantic>=2.7.2

+ 1 - 1
setup.py

@@ -51,7 +51,7 @@ if __name__ == '__main__':
                      "doclayout_yolo==0.0.2",  # doclayout_yolo
                      "rapidocr-paddle",  # rapidocr-paddle
                      "rapidocr_onnxruntime",
-                     "rapid_table",  # rapid_table
+                     "rapid_table==0.3.0",  # rapid_table
                      "PyYAML",  # yaml
                      "openai",  # openai SDK
                      "detectron2"

+ 2 - 1
tests/retry_env.sh

@@ -6,7 +6,8 @@ retry_count=0
 while true; do
     # prepare env
     #python -m pip install -r requirements-qa.txt
-    python -m pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
+    #python -m pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
+    pip install -e .
     python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
     pip install modelscope
     wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py -O download_models.py

+ 0 - 246
tests/test_cli/pdf_dev/annotations/cleaned/cleaned_research_report_1f978cd81fb7260c8f7644039ec2c054.md

@@ -1,246 +0,0 @@
-## 增持(维持)
-
-所属行业:机械设备
-
-当前价格(元): 82.42
-
-## 证券分析师
-
-倪正洋
-
-资格编号:S0120521020003
-
-邮箱: nizy@tebon.com.cn
-
-## 研究助理
-
-杨云道
-
-邮箱: yangyx@tebon.com.cn
-
-
-
-| 沪深 300 对比 | $1 \mathrm{M}$ | $2 \mathrm{M}$ | $3 \mathrm{M}$ |
-| :--- | ---: | ---: | ---: |
-| 绝对涨幅(\%) | 7.18 | 32.88 | 80.86 |
-| 相对涨幅(\%) | 8.10 | 25.93 | 78.39 |
-
-资料来源: 德邦研究所, 聚源数据
-
-## 相关研究
-
-1.《高测股份 (688556): 光伏金刚线及硅片切割代工业务推动公司 22Q1 业绩大超预期》, 2022.4.29
-
-2.《光伏设备: 光伏高效电池扩产提速,关键设备商各领风骚》, 2022.4.10 3. 《高测股份 (688556.SH): 再签建湖 10GW 硅片切割代工产能,强化代工业务成长逻辑》, 2022.4.7
-
-3.《高测股份 (688556.SH): 签订晶澳曲靖 2.2 亿元切割设备合同,看好 22 年代工业绩释放+HJT 切割工艺进步》, 2022.3.9
-
-4.《高测股份 (688556.SH): 21 年业绩预告超市场预期,关注切片代工利润释放》, 2022.1.24
-
-# 高测股份 $(688556.5 H):$ 扩产 4000 万公里金刚线,强化光伏碰片切割三元布局
-
-## 投资要点
-
-- 事件:公司拟与蓝关县人民政府签署的《壶关年产 12000 万千米金刚线项目投资协议书》,项目一期计划建设年产 4,000万千米金刚线产能,预计一期总投资额约 6.66 亿元; 后续年产 8,000 万千米金刚线项目尚未具体约定,存在较大不确定性。
-- 顺应下游需求扩张, 金刚线产能快速扩产, 保证公司内供+外销。光伏金刚线需求 22 年提升源于两方面:1)2022 年光伏产业链景气度高涨,1-5 月光伏装机同比 $+24.4 \%$, 带动产业链各环节开工率提升, 硅片前期扩产产能逐步落地, 金刚线需求释放;2)由于多晶硅料价格持续维持高位,细线化、薄片化趋势加速,其中细线化要求金刚线线径由 40 线、 38 线向 36 线、 35 线进步, 带动单 GW 切割线耗不断提升。目前 36 线单 GW 切割线耗约 50 万公里, 较 38 线提升约 $30 \%$ 。公司于 2021 年对金刚线进行 “ 1 机 12 线” 技改,技改完成后,公司 22 年 1 季度产能 712 万公里, 年化产能超 2500 万公里。公司目前切片代工产能约 47GW, 对应远期金刚线产能超 2300 万公里。本次扩产再一次扩充公司金刚线产能, 强化金刚线产能内供+外销布局。
-- 依托萦关低成本电价提升金刚线盈利能力, 顺应硅料节约持续推动细线化布局。公司在山西长治金刚线生产厂区采购电力的平均单价较青岛金刚线生产厂区采购电力的平均单价低, 2020 年度公司陆续将青岛的金刚线生产线搬迁到山西长治並关厂区,随着山西长治金刚线生产厂区金刚线产量增加,公司采购电力的平均单价呈下降趋势。目前公司电力采购单价从 2019 年 0.8 元/kwh 降低到 2022 年 Q1 的 0.39 元/kwh,並关后续拓展有望进一步降低公司金刚线电价成本。金刚线线径越细,锯㖓越小,切割时产生的锯㖓硅料损失越少,同样一根硅棒可切割加工出的硅片数量越多,制造硅片所需的硅材料越少。相同切割工艺下,金刚线越细,固结在钢线基体上的金刚石微粉颗粒越小,切割加工时对硅片的表面损伤越小,硅片表面质量越好,砝片 TTV 等质量指标表现也就越好。金刚线母线直径已由 2016 年的 80um 降至 2022 年上半年的 36、38、40um,此外高线速、柔性化和智能化等均是金刚线及切片技术进步方向, 公司在薄片、细线化、高线速、柔性智能化方面均有领先布局, 推动切割工艺持续进步。
-- 切割工艺的持续进步领先, 是保障公司利润释放的核心壁垒。公司光伏硅片切割三元布局包括硅片切割及机加工设备、砝片切割耗材 (金刚线) 以及切割代工业务。公司 2021 年依托前期设备+耗材布局切割代工业务, 目前已公布 47GW 产能 (乐山5GW 示范基地、乐山 20GW 大硅片及配套项目、建湖一期 10GW 项目,建湖二期 $12 \mathrm{GW}$ 项目), 客户包括通威、京运通、美科及建湖周边电池企业。22 年底公司有望实现超 20GW 切割代工产能, 且当前终端客户主要为下游电池企业。客户选择切割代工模式的核心在于凭借高测的专业化服务实现快速上产, 同时可获得较自建硅片切割产能或购买硅片更多的超额利润。超额利润的核心在于高测股份的切割代工技术领先, 可实现更多的硅片切割红利, 并与客户共享。未来随着金刚线扩产和切割技术进步, 公司光伏硅片切割代工利润弹性有望持续释放。
-- 盈利预测与投资建议:预计公司 2022-2024 年归母净利润 4.7、7.2、9.3 亿元,对应 PE 30、20、15 倍,维持 “增持” 评级。
-- 风险提示:硅片扩产不及预期,公司代工业务利润波动风险,市场竞争加剧。
-
-<table><thead><tr><th>股票数据</th><th></th></tr></thead><tr><td>总股本(百万股):</td><td>227.92</td></tr><tr><td>流通 A 股(百万股):</td><td>167.01</td></tr><tr><td>52 周内股价区间(元):</td><td>21.60-97.40</td></tr><tr><td>总市值(百万元):</td><td>18,785.44</td></tr><tr><td>总资产(百万元):</td><td>3,508.81</td></tr><tr><td>每股净资产(元):</td><td>5.50</td></tr><tr><td>咨料来源,公司公告</td><td></td></tr></table>
-
-<table><thead><tr><th>主要财务数据及预测</th><th></th><th></th><th></th><th></th><th></th></tr></thead><tr><td></td><td>2020</td><td>2021</td><td>2022E</td><td>2023E</td><td>2024E</td></tr><tr><td>营业收入(百万元)</td><td>746</td><td>1,567</td><td>3,684</td><td>5,056</td><td>5,752</td></tr><tr><td>(+/-)YOY(%)</td><td>4.5\%</td><td>110.0\%</td><td>135.1\%</td><td>37.2\%</td><td>13.8\%</td></tr><tr><td>净利润(百万元)</td><td>59</td><td>173</td><td>471</td><td>717</td><td>933</td></tr><tr><td>(+/-)YOY(%)</td><td>83.8\%</td><td>193.4\%</td><td>172.8\%</td><td>52.2\%</td><td>30.1\%</td></tr><tr><td>全面摊薄 EPS(元)</td><td>0.43</td><td>1.07</td><td>2.91</td><td>4.43</td><td>5.77</td></tr><tr><td>毛利率(\%)</td><td>35.3\%</td><td>33.7\%</td><td>35.0\%</td><td>36.0\%</td><td>38.0\%</td></tr><tr><td>净资产收益率(\%)</td><td>6.0\%</td><td>15.0\%</td><td>27.9\%</td><td>28.8\%</td><td>26.5\%</td></tr></table>
-
-资料来源: 公司年报 (2020-2021),德邦研究所
-
-备注: 净利润为归属母公司所有者的净利润
-
-## 财务报表分析和预测
-
-| 主要财务指标 | 2021 | $2022 E$ | $2023 E$ | $2024 E$ |
-| :--- | ---: | ---: | ---: | ---: |
-| 每股指标(元) |  |  |  |  |
-| 每股收益 | 1.07 | 2.91 | 4.43 | 5.77 |
-| 每股净资产 | 7.13 | 10.43 | 15.39 | 21.76 |
-| 每股经营现金流 | 0.47 | 1.27 | 4.07 | 5.02 |
-| 每股股利 | 0.11 | 0.11 | 0.11 | 0.11 |
-| 价值评估(倍) |  |  |  |  |
-| P/E | 82.90 | 30.47 | 20.02 | 15.38 |
-| P/B | 12.44 | 8.50 | 5.76 | 4.08 |
-| P/S | 8.52 | 3.62 | 2.64 | 2.32 |
-| EV/EBITDA | 49.85 | 24.12 | 15.68 | 11.46 |
-| 股息率\% | $0.1 \%$ | $0.1 \%$ | $0.1 \%$ | $0.1 \%$ |
-| 盈利能力指标(\%) |  |  |  |  |
-| 毛利率 | $33.7 \%$ | $35.0 \%$ | $36.0 \%$ | $38.0 \%$ |
-| 净利润率 | $11.0 \%$ | $12.8 \%$ | $14.2 \%$ | $16.2 \%$ |
-| 净资产收益率 | $15.0 \%$ | $27.9 \%$ | $28.8 \%$ | $26.5 \%$ |
-| 资产回报率 | $5.3 \%$ | $7.9 \%$ | $8.5 \%$ | $9.2 \%$ |
-| 投资回报率 | $15.3 \%$ | $25.9 \%$ | $24.6 \%$ | $23.7 \%$ |
-| 盈利增长(\%) |  |  |  |  |
-| 营业收入增长率 | $110.0 \%$ | $135.1 \%$ | $37.2 \%$ | $13.8 \%$ |
-| EBIT 增长率 | $233.7 \%$ | $150.7 \%$ | $52.3 \%$ | $31.9 \%$ |
-| 净利润增长率 | $193.4 \%$ | $172.8 \%$ | $52.2 \%$ | $30.1 \%$ |
-| 偿倩能力指标 |  |  |  |  |
-| 资产负债率 | $64.3 \%$ | $71.5 \%$ | $70.6 \%$ | $65.3 \%$ |
-| 流动比率 | 1.2 | 1.2 | 1.3 | 1.4 |
-| 速动比率 | 0.9 | 0.9 | 1.0 | 1.1 |
-| 现金比率 | 0.2 | 0.1 | 0.2 | 0.3 |
-| 经营效率指标 |  |  |  |  |
-| 应收怅款周转天数 | 161.7 | 165.1 | 164.9 | 164.4 |
-| 存货周转天数 | 196.1 | 170.0 | 180.0 | 190.0 |
-| 总资产周转率 | 0.5 | 0.6 | 0.6 | 0.6 |
-| 固定资产周转率 | 4.2 | 8.6 | 10.3 | 11.1 |
-
-| 现金流量表(百万元) | 2021 | $2022 E$ | 2023E | 2024E |
-| :--- | ---: | ---: | ---: | ---: |
-| 净利润 | 173 | 471 | 717 | 933 |
-| 少数股东损益 | 0 | 0 | 0 | 0 |
-| 非现金支出 | 107 | 114 | 133 | 147 |
-| 非经营收益 | 17 | 1 | 4 | 14 |
-| 营运资金变动 | -220 | -382 | -195 | -283 |
-| 经营活动现金流 | 76 | 205 | 658 | 812 |
-| 资产 | -83 | -184 | -203 | -169 |
-| 投资 | 229 | 0 | 0 | 0 |
-| 其他 | 6 | 9 | 13 | 14 |
-| 投资活动现金流 | 151 | -175 | -190 | -155 |
-| 债权募资 | -80 | 39 | 321 | 64 |
-| 股权募资 | 0 | 0 | 0 | 0 |
-| 其他活 | -21 | -3 | -14 | -25 |
-| 融资活动现金流 | -101 | 36 | 307 | 39 |
-| 现金净流量 | 127 | 66 | 775 | 696 |
-
-备注: 表中计算估值指标的收盘价日期为 7 月 19 日
-
-资料来源: 公司年报 (2020-2021), 德邦研究所
-
-| 利润表(百万元) | 2021 | 2022E | 2023E | 2024E |
-| :---: | :---: | :---: | :---: | :---: |
-| 营业总收入 | 1,567 | 3,684 | 5,056 | 5,752 |
-| 营业成本 | 1,038 | 2,394 | 3,236 | 3,567 |
-| 毛利率\% | $33.7 \%$ | $35.0 \%$ | $36.0 \%$ | $38.0 \%$ |
-| 营业税金及附加 | 6 | 18 | 25 | 29 |
-| 营业税金率\% | $0.4 \%$ | $0.5 \%$ | $0.5 \%$ | $0.5 \%$ |
-| 营业费用 | 63 | 147 | 193 | 209 |
-| 营业费用率\% | $4.0 \%$ | $4.0 \%$ | $3.8 \%$ | $3.6 \%$ |
-| 管理费用 | 131 | 313 | 409 | 444 |
-| 管理费用率\% | $8.4 \%$ | $8.5 \%$ | $8.1 \%$ | $7.7 \%$ |
-| 研发费用 | 117 | 276 | 379 | 431 |
-| 研发费用率\% | $7.5 \%$ | $7.5 \%$ | $7.5 \%$ | $7.5 \%$ |
-| EBIT | 213 | 534 | 814 | 1,074 |
-| 财务费用 | 7 | 1 | 11 | 19 |
-| 财务费用率\% | $0.4 \%$ | $0.0 \%$ | $0.2 \%$ | $0.3 \%$ |
-| 资产减值损失 | -33 | -63 | -86 | -98 |
-| 投资收益 | 5 | 9 | 13 | 14 |
-| 营业利润 | 212 | 531 | 800 | 1,040 |
-| 营业外收支 | -25 | -8 | -3 | -3 |
-| 利润总额 | 187 | 523 | 797 | 1,037 |
-| EBITDA | 282 | 582 | 865 | 1,129 |
-| 所得税 | 14 | 52 | 80 | 104 |
-| 有效所得税率\% | $7.7 \%$ | $10.0 \%$ | $10.0 \%$ | $10.0 \%$ |
-| 少数股东损益 | 0 | 0 | 0 | $\mathbf{0}-1-2$ |
-| 归属母公司所有者净利润 | 173 | 471 | 717 | 933 |
-
-| 资产负债表(百万元) | 2021 | 2022E | 2023E | $2024 E$ |
-| :---: | :---: | :---: | :---: | :---: |
-| 货币资金 | 427 | 494 | 1,269 | 1,965 |
-| 应收账款及应收票据 | 1,173 | 2,806 | 3,798 | 4,344 |
-| 存货 | 558 | 1,115 | 1,596 | 1,857 |
-| 其它流动资产 | 266 | 578 | 736 | 778 |
-| 流动资产合计 | 2,424 | 4,992 | 7,400 | 8,943 |
-| 长期股权投资 | 0 | 0 | 0 | 0 |
-| 固定资产 | 370 | 429 | 491 | 516 |
-| 在建工程 | 169 | 183 | 205 | 226 |
-| 无形资产 | 42 | 56 | 69 | 80 |
-| 非流动资产合计 | 811 | 940 | 1,087 | 1,198 |
-| 资产总计 | 3,235 | 5,932 | 8,487 | 10,141 |
-| 短期借款 | 28 | 68 | 388 | 452 |
-| 应付票据及应付账款 | 1,401 | 3,197 | 4,302 | 4,760 |
-| 预收账款 | 0 | 0 | 0 | 0 |
-| 其它流动负债 | 560 | 887 | 1,214 | 1,314 |
-| 流动负债合计 | 1,989 | 4,152 | 5,904 | 6,527 |
-| 长期借款 | 0 | 0 | 0 | 0 |
-| 其它长期负债 | 92 | 92 | 92 | 92 |
-| 非流动负债合计 | 92 | 92 | 92 | 92 |
-| 负债总计 | 2,081 | 4,243 | 5,996 | 6,619 |
-| 实收资本 | 162 | 162 | 162 | 162 |
-| 普通股股东权益 | 1,154 | 1,688 | 2,491 | 3,522 |
-| 少数股东权益 | 0 | 0 | 0 | 0 |
-| 负债和所有者权益合计 | 3,235 | 5,932 | 8,487 | 10,141 |
-
-## 信息披露
-
-## 分析师与研究助理简介
-
-倪正洋,2021 年加入德邦证券,任研究所大制造组组长、机械行业首席分析师,拥有 5 年机械研究经验,1 年高端装备产业经验,南京大学材料学学士、上海交通大学材料学硕士。2020 年获得 iFinD 机械行业最具人气分析师, 所在团队曾获机械行业 2019 年新财富第三名,2017 年新财富第二名,2017 年金牛奖第二名,2016 年新财富第四名。
-
-## 分析师声明
-
-本人具有中国证券业协会授予的证券投资咨询执业资格,以勤勉的职业态度,独立、客观地出具本报告。本报告所采用的数据和信息均来自市场公开信息, 本人不保证该等信息的准确性或完整性。分析逻辑基于作者的职业理解,清晰准确地反映了作者的研究观点,结论不受任何第三方的授意或影响,特此声明。
-
-## 投资评级说明
-
-1.投资评级的比较和评级标准:
-
-以报告发布后的 6 个月内的市场表现为比较标准,报告发布日后 6 个月内的公司股价(或行业指数)的张跌幅相对同期市场基准指数的涨跌幅;
-
-2.市场基准指数的比较标准:
-
-A 股市场以上证综指或深证成指为基准;香港市场以恒生指数为基准;美国市场以标普 500 或纳斯达克综合指数为基准。
-
-<table>
-    <tr>
-        <td rowspan="11">1. 投资评级的比较和评级标准: 以报告发布后的 6 个月内的市场表 现为比较标准,报告发布日后 6 个 月内的公司股价(或行业指数)的 涨跌幅相对同期市场基准指数的涨 跌幅:<br> 2. 市场基准指数的比较标准: A股市场以上证综指或深证成指为基 准; 香港市场以恒生指数为基准; 美 国市场以标普500或纳斯达克综合指 数为基准。</td>
-    </tr>
-    <tr>
-        <td>类型</td>
-        <td>评级</td>
-        <td>说明</td>
-    </tr>
-        <td rowspan="5">股票评级</td>
-    </tr>
-    <tr>
-        <td>买入</td>
-        <td>相对强于市场表现 20%以上;</td>
-    </tr>
-    <tr>
-        <td>增持</td>
-        <td>相对强于市场表现 5% 20%;</td>
-    </tr>
-    <tr>
-        <td>中性</td>
-        <td>相对市场表现在-5% +5%之间波动;</td>
-    </tr>
-    <tr>
-        <td>减持</td>
-        <td>相对弱于市场表现 5%以下。</td>
-    </tr>
-    <tr>
-        <td rowspan="4">行业投资评级</td>
-    </tr>
-    <tr>
-        <td>优于大市</td>
-        <td>预期行业整体回报高于基准指数整体水平10%以上;</td>
-    </tr>
-    <tr>
-        <td>中性</td>
-        <td>预期行业整体回报介于基准指数整体水平-10%与 10%之间;</td>
-    </tr>
-    <tr>
-        <td>弱于大市</td>
-        <td>预期行业整体回报低于基准指数整体水平 10%以下。</td>
-    </tr>
-    <tr>
-</table>
-
-## 法律声明
-
-本报告仅供德邦证券股份有限公司(以下简称 “本公司”)的客户使用。本公司不会因接收人收到本报告而视其为客户。在任何情况下,本报告中的信息或所表述的意见并不构成对任何人的投资建议。在任何情况下,本公司不对任何人因使用本报告中的任何内容所引致的任何损失负任何责任。
-
-本报告所载的资料、意见及推测仅反映本公司于发布本报告当日的判断,本报告所指的证券或投资标的的价格、价值及投资收入可能会波动。在不同时期,本公司可发出与本报告所载资料、意见及推测不一致的报告。
-
-市场有风险,投资需谨慎。本报告所载的信息、材料及结论只提供特定客户作参考,不构成投资建议,也没有考虑到个别客户特殊的投资目标、财务状况或需要。客户应考虑本报告中的任何意见或建议是否符合其特定状况。在法律许可的情况下,德邦证券及其所属关联机构可能会持有报告中提到的公司所发行的证券并进行交易,还可能为这些公司提供投资银行服务或其他服务。
-
-本报告仅向特定客户传送,未经德邦证券研究所书面授权,本研究报告的任何部分均不得以任何方式制作任何形式的拷贝、复印件或复制品,或再次分发给任何其他人,或以任何侵犯本公司版权的其他方式使用。所有本报告中使用的商标、服务标记及标记均为本公司的商标、服务标记及标记。如欲引用或转载本文内容, 务必联络德邦证券研究所并获得许可, 并需注明出处为德邦证券研究所,且不得对本文进行有悖原意的引用和删改。
-
-根据中国证监会核发的经营证券业务许可,德邦证券股份有限公司的经营范围包括证券投资咨询业务。

BIN
tests/test_cli/pdf_dev/doc/test_mineru.docx


BIN
tests/test_cli/pdf_dev/images/docstructbench.jpg


BIN
tests/test_cli/pdf_dev/ppt/small.pptx


+ 129 - 66
tests/test_cli/test_cli_sdk.py

@@ -6,15 +6,14 @@ from conf import conf
 from lib import common
 import time
 import magic_pdf.model as model_config
-import os
-from magic_pdf.data.data_reader_writer import FileBasedDataWriter
+from magic_pdf.data.read_api import read_local_images
+from magic_pdf.data.read_api import read_local_office
 from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
 from magic_pdf.config.make_content_config import DropMode, MakeMode
 from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
 from magic_pdf.data.dataset import PymuDocDataset
 from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
 from magic_pdf.config.enums import SupportedPdfParseMethod
-model_config.__use_inside_model__ = True
 pdf_res_path = conf.conf['pdf_res_path']
 code_path = conf.conf['code_path']
 pdf_dev_path = conf.conf['pdf_dev_path']
@@ -33,7 +32,7 @@ class TestCli:
         yield
 
     @pytest.mark.P0
-    def test_pdf_auto_sdk(self):
+    def test_pdf_local_sdk(self):
         """pdf sdk auto test."""
         demo_names = list()
         pdf_path = os.path.join(pdf_dev_path, 'pdf')
@@ -44,6 +43,7 @@ class TestCli:
             pdf_path = os.path.join(pdf_dev_path, 'pdf', f'{demo_name}.pdf')
             local_image_dir = os.path.join(pdf_dev_path, 'pdf', 'images')
             image_dir = str(os.path.basename(local_image_dir))
+            name_without_suff = os.path.basename(pdf_path).split(".pdf")[0]
             dir_path = os.path.join(pdf_dev_path, 'mineru')
             image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(dir_path)
             reader1 = FileBasedDataReader("")
@@ -59,13 +59,132 @@ class TestCli:
                 ## pipeline
                 pipe_result = infer_result.pipe_txt_mode(image_writer)
             common.delete_file(dir_path)
-            infer_result.draw_model(os.path.join(dir_path, f"{demo_name}_model.pdf"))
-            pipe_result.draw_layout(os.path.join(dir_path, f"{demo_name}_layout.pdf"))
-            pipe_result.draw_span(os.path.join(dir_path, f"{demo_name}_spans.pdf"))
-            pipe_result.dump_md(md_writer, f"{demo_name}.md", image_dir)
-            pipe_result.dump_content_list(md_writer, f"{demo_name}_content_list.json", image_dir)
+            ### draw model result on each page
+            infer_result.draw_model(os.path.join(dir_path, f"{name_without_suff}_model.pdf"))
+
+            ### get model inference result
+            model_inference_result = infer_result.get_infer_res()
+
+            ### draw layout result on each page
+            pipe_result.draw_layout(os.path.join(dir_path, f"{name_without_suff}_layout.pdf"))
+
+            ### draw spans result on each page
+            pipe_result.draw_span(os.path.join(dir_path, f"{name_without_suff}_spans.pdf"))
+
+            ### dump markdown
+            md_content = pipe_result.get_markdown(image_dir)
+            pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
+            ### get content list content
+            content_list_content = pipe_result.get_content_list(image_dir)
+            pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+            
+            ### get middle json
+            middle_json_content = pipe_result.get_middle_json()
+            ### dump middle json
+            pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')
+            common.sdk_count_folders_and_check_contents(dir_path)
+
+    @pytest.mark.P0
+    def test_pdf_s3_sdk(self):
+        """pdf s3 sdk test."""
+        demo_names = list()
+        pdf_path = os.path.join(pdf_dev_path, 'pdf')
+        for pdf_file in os.listdir(pdf_path):
+            if pdf_file.endswith('.pdf'):
+                demo_names.append(pdf_file.split('.')[0])
+        for demo_name in demo_names:
+            pdf_path = os.path.join(pdf_dev_path, 'pdf', f'{demo_name}.pdf')
+            local_image_dir = os.path.join(pdf_dev_path, 'pdf', 'images')
+            image_dir = str(os.path.basename(local_image_dir))
+            name_without_suff = os.path.basename(pdf_path).split(".pdf")[0]
+            dir_path = os.path.join(pdf_dev_path, 'mineru')
+            pass
+
+    @pytest.mark.P0
+    def test_pdf_local_ppt(self):
+        """pdf sdk auto test."""
+        demo_names = list()
+        pdf_path = os.path.join(pdf_dev_path, 'ppt')
+        for pdf_file in os.listdir(pdf_path):
+            if pdf_file.endswith('.pptx'):
+                demo_names.append(pdf_file.split('.')[0])
+        for demo_name in demo_names:
+            pdf_path = os.path.join(pdf_dev_path, 'ppt', f'{demo_name}.pptx')
+            local_image_dir = os.path.join(pdf_dev_path, 'mineru', 'images')
+            image_dir = str(os.path.basename(local_image_dir))
+            name_without_suff = os.path.basename(pdf_path).split(".pptx")[0]
+            dir_path = os.path.join(pdf_dev_path, 'mineru')
+            image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(dir_path)
+            ds = read_local_office(pdf_path)[0]
+            common.delete_file(dir_path)
+            
+            ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)          
+            common.sdk_count_folders_and_check_contents(dir_path)
+
+
+
+    @pytest.mark.P0
+    def test_pdf_local_image(self):
+        """pdf sdk auto test."""
+        demo_names = list()
+        pdf_path = os.path.join(pdf_dev_path, 'images')
+        for pdf_file in os.listdir(pdf_path):
+            if pdf_file.endswith('.jpg'):
+                demo_names.append(pdf_file.split('.')[0])
+        for demo_name in demo_names:
+            pdf_path = os.path.join(pdf_dev_path, 'images', f'{demo_name}.jpg')
+            local_image_dir = os.path.join(pdf_dev_path, 'mineru', 'images')
+            image_dir = str(os.path.basename(local_image_dir))
+            name_without_suff = os.path.basename(pdf_path).split(".jpg")[0]
+            dir_path = os.path.join(pdf_dev_path, 'mineru')
+            common.delete_file(dir_path)
+            image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(dir_path)
+            ds = read_local_images(pdf_path)[0]
+            ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+            md_writer, f"{name_without_suff}.md", image_dir)
             common.sdk_count_folders_and_check_contents(dir_path)
 
+
+    @pytest.mark.P0
+    def test_local_image_dir(self):
+        """local image dir."""
+        demo_names = list()
+        pdf_path = os.path.join(pdf_dev_path, 'images')
+        dir_path = os.path.join(pdf_dev_path, 'mineru')
+        local_image_dir = os.path.join(pdf_dev_path, 'mineru', 'images')
+        image_dir = str(os.path.basename(local_image_dir))
+        image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(dir_path)
+        common.delete_file(dir_path)
+        dss = read_local_images(pdf_path, suffixes=['.png', '.jpg'])
+        count = 0
+        for ds in dss:
+            ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{count}.md", image_dir)
+            count += 1
+        common.sdk_count_folders_and_check_contents(dir_path)
+
+    def test_local_doc_parse(self):
+        """
+        doc 解析
+        """
+        demo_names = list()
+        pdf_path = os.path.join(pdf_dev_path, 'doc')
+        for pdf_file in os.listdir(pdf_path):
+            if pdf_file.endswith('.docx'):
+                demo_names.append(pdf_file.split('.')[0])
+        for demo_name in demo_names:
+            pdf_path = os.path.join(pdf_dev_path, 'doc', f'{demo_name}.docx')
+            local_image_dir = os.path.join(pdf_dev_path, 'mineru', 'images')
+            image_dir = str(os.path.basename(local_image_dir))
+            name_without_suff = os.path.basename(pdf_path).split(".docx")[0]
+            dir_path = os.path.join(pdf_dev_path, 'mineru')
+            image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(dir_path)
+            ds = read_local_office(pdf_path)[0]
+            common.delete_file(dir_path)
+            
+            ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)          
+            common.sdk_count_folders_and_check_contents(dir_path)
+
+
     @pytest.mark.P0
     def test_pdf_cli_auto(self):
         """magic_pdf cli test auto."""
@@ -203,63 +322,7 @@ class TestCli:
         cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'auto')
         logging.info(cmd)
         os.system(cmd)
-   
-
-    @pytest.mark.P1
-    def test_s3_sdk_auto(self):
-        """
-        test s3 sdk auto.
-        """
-        time.sleep(2)
-        pdf_ak = os.getenv('pdf_ak')
-        print (pdf_ak)
-        pdf_sk = os.environ.get('pdf_sk', "")
-        pdf_bucket = os.environ.get('bucket', "")
-        pdf_endpoint = os.environ.get('pdf_endpoint', "")
-        s3_pdf_path = conf.conf["s3_pdf_path"]
-        image_dir = "s3://" + pdf_bucket + "/mineru/test/output"
-        prefix = "mineru/test/output"
-        reader = S3DataReader(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
-        writer = S3DataWriter(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
-        # = S3DataWriter(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
-        image_writer = S3DataWriter(prefix, pdf_bucket, pdf_ak, pdf_sk, pdf_endpoint)
-        local_dir = "output"
-        name_without_suff = os.path.basename(s3_pdf_path).split(".")[0]
-
-        # read bytes
-        pdf_bytes = reader.read(s3_pdf_path)  # read the pdf content
-
-        # proc
-        ## Create Dataset Instance
-        ds = PymuDocDataset(pdf_bytes)
-
-        ## inference
-        if ds.classify() == SupportedPdfParseMethod.OCR:
-            infer_result = ds.apply(doc_analyze, ocr=True)
-
-            ## pipeline
-            pipe_result = infer_result.pipe_ocr_mode(image_writer)
-        else:
-            infer_result = ds.apply(doc_analyze, ocr=False)
-
-            ## pipeline
-            pipe_result = infer_result.pipe_txt_mode(image_writer)
-
-        ### draw model result on each page
-        infer_result.draw_model(os.path.join(local_dir, f'{name_without_suff}_model.pdf'))  # dump to local
-
-        ### draw layout result on each page
-        pipe_result.draw_layout(os.path.join(local_dir, f'{name_without_suff}_layout.pdf'))  # dump to local
-
-        ### draw spans result on each page
-        pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf'))   # dump to local
-
-        ### dump markdown
-        pipe_result.dump_md(writer, f'{name_without_suff}.md', "unittest/tmp/images")    # dump to remote s3
-
-        ### dump content list
-        pipe_result.dump_content_list(writer, f"{name_without_suff}_content_list.json", image_dir)
-
+    
 
     @pytest.mark.P1
     def test_local_magic_pdf_open_st_table(self):