
Merge branch 'opendatalab:dev' into dev

linfeng 1 year ago
parent
commit
9d689790eb
59 changed files with 2267 additions and 299 deletions
  1. 1 0
      .github/ISSUE_TEMPLATE/bug_report.yml
  2. 4 3
      .github/workflows/cli.yml
  3. 1 0
      LICENSE.md
  4. 0 0
      README.md
  5. 0 0
      README_zh-CN.md
  6. 8 0
      docs/FAQ_en_us.md
  7. 8 0
      docs/FAQ_zh_cn.md
  8. 3 0
      docs/download_models_hf.py
  9. 1 51
      docs/how_to_download_models_en.md
  10. 2 50
      docs/how_to_download_models_zh_cn.md
  11. 51 31
      magic_pdf/dict2md/ocr_mkcontent.py
  12. 1 0
      magic_pdf/libs/MakeContentConfig.py
  13. BIN
      magic_pdf/libs/__pycache__/__init__.cpython-312.pyc
  14. BIN
      magic_pdf/libs/__pycache__/version.cpython-312.pyc
  15. 19 0
      magic_pdf/libs/boxbase.py
  16. 1 1
      magic_pdf/libs/version.py
  17. 10 8
      magic_pdf/model/doc_analyze_by_custom_model.py
  18. 96 45
      magic_pdf/model/magic_model.py
  19. 16 7
      magic_pdf/model/pdf_extract_kit.py
  20. 5 2
      magic_pdf/model/pp_structure_v2.py
  21. 8 3
      magic_pdf/pipe/AbsPipe.py
  22. 6 4
      magic_pdf/pipe/OCRPipe.py
  23. 6 4
      magic_pdf/pipe/TXTPipe.py
  24. 11 7
      magic_pdf/pipe/UNIPipe.py
  25. 7 7
      magic_pdf/resources/model_config/UniMERNet/demo.yaml
  26. 1 1
      magic_pdf/resources/model_config/model_configs.yaml
  27. 14 1
      magic_pdf/tools/cli.py
  28. 5 4
      magic_pdf/tools/common.py
  29. 16 5
      magic_pdf/user_api.py
  30. 2 0
      projects/README.md
  31. 2 0
      projects/README_zh-CN.md
  32. 24 0
      projects/gradio_app/README.md
  33. 24 0
      projects/gradio_app/README_zh-CN.md
  34. 23 18
      projects/gradio_app/app.py
  35. BIN
      projects/gradio_app/examples/academic_paper_formula.pdf
  36. BIN
      projects/gradio_app/examples/academic_paper_img_formula.pdf
  37. BIN
      projects/gradio_app/examples/garbled_formula.pdf
  38. BIN
      projects/gradio_app/examples/garbled_formula2.pdf
  39. BIN
      projects/gradio_app/examples/garbled_img_formula.pdf
  40. BIN
      projects/gradio_app/examples/scanned.pdf
  41. 109 0
      projects/gradio_app/header.html
  42. 3 0
      projects/gradio_app/requirements.txt
  43. 80 39
      projects/llama_index_rag/README_zh-CN.md
  44. BIN
      projects/llama_index_rag/rag_data_api.png
  45. 1 1
      requirements-docker.txt
  46. 1 1
      setup.py
  47. 24 0
      tests/clean_coverage.py
  48. 1 1
      tests/get_coverage.py
  49. 2 1
      tests/retry_env.sh
  50. 4 3
      tests/test_cli/conf/conf.py
  51. 0 0
      tests/test_cli/pdf_dev/line1.jsonl
  52. 1472 0
      tests/test_cli/pdf_dev/test_model.json
  53. 90 1
      tests/test_cli/test_cli_sdk.py
  54. 0 0
      tests/test_cli/test_magic-pdf-dev_cli.py
  55. 36 0
      tests/test_cli/test_performence.py
  56. 54 0
      tests/test_cli/test_table.py
  57. BIN
      tests/unittest/test_table/assets/table.jpg
  58. 14 0
      tests/unittest/test_table/test_tablemaster.py
  59. 0 0
      tests/unittest/test_unit.py

+ 1 - 0
.github/ISSUE_TEMPLATE/bug_report.yml

@@ -80,6 +80,7 @@ body:
         -
         - "0.6.x"
         - "0.7.x"
+        - "0.8.x"
     validations:
       required: true
 

+ 4 - 3
.github/workflows/cli.yml

@@ -37,12 +37,13 @@ jobs:
       run: |
         echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
     - name: unit test
-      run: |        
-        cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m  pytest  tests/test_unit.py --cov=magic_pdf/ --cov-report term-missing --cov-report html
+      run: | 
+        cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
+        cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m  pytest  tests/unittest --cov=magic_pdf/ --cov-report term-missing --cov-report html
         cd $GITHUB_WORKSPACE && python tests/get_coverage.py
     - name: cli test
       run: |
-        cd $GITHUB_WORKSPACE &&  pytest -s -v tests/test_cli/test_cli_sdk.py
+        source ~/.bashrc && cd $GITHUB_WORKSPACE &&  pytest -s -v tests/test_cli/test_cli.py
 
   notify_to_feishu:
     if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}

+ 1 - 0
LICENSE.md

@@ -659,3 +659,4 @@ specific requirements.
 if any, to sign a "copyright disclaimer" for the program, if necessary.
 For more information on this, and how to apply and follow the GNU AGPL, see
 <https://www.gnu.org/licenses/>.
+

File diff suppressed because it is too large
+ 0 - 0
README.md


File diff suppressed because it is too large
+ 0 - 0
README_zh-CN.md


+ 8 - 0
docs/FAQ_en_us.md

@@ -44,3 +44,11 @@ pip uninstall fairscale
 pip install fairscale
 ```
 Reference: https://github.com/opendatalab/MinerU/issues/411
+
+### 6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
+
+The compatibility of cuda11 with new graphics cards is poor, and the CUDA version used by Paddle needs to be upgraded.
+```bash
+pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+```
+Reference: https://github.com/opendatalab/MinerU/issues/558

+ 8 - 0
docs/FAQ_zh_cn.md

@@ -41,3 +41,11 @@ pip uninstall fairscale
 pip install fairscale
 ```
 参考:https://github.com/opendatalab/MinerU/issues/411
+
+### 6.在部分较新的设备如H100上,使用CUDA加速OCR时解析出的文字乱码。
+
+cuda11对新显卡的兼容性不好,需要升级paddle使用的cuda版本
+```bash
+pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+```
+参考:https://github.com/opendatalab/MinerU/issues/558

+ 3 - 0
docs/download_models_hf.py

@@ -0,0 +1,3 @@
+from huggingface_hub import snapshot_download
+model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
+print(f"model dir is: {model_dir}/models")

+ 1 - 51
docs/how_to_download_models_en.md

@@ -6,58 +6,8 @@ wget https://github.com/opendatalab/MinerU/raw/master/docs/download_models_hf.py
 python download_models_hf.py
 ```
 After the Python script finishes executing, it will output the directory where the models are downloaded.
-### 2. Additional steps
 
-#### 1. Check whether the model directory is downloaded completely.
-
-The structure of the model folder is as follows, including configuration files and weight files of different components:
-```
-../
-├── Layout
-│   ├── config.json
-│   └── model_final.pth
-├── MFD
-│   └── weights.pt
-├── MFR
-│   └── UniMERNet
-│       ├── config.json
-│       ├── preprocessor_config.json
-│       ├── pytorch_model.bin
-│       ├── README.md
-│       ├── tokenizer_config.json
-│       └── tokenizer.json
-│── TabRec
-│   └─StructEqTable
-│       ├── config.json
-│       ├── generation_config.json
-│       ├── model.safetensors
-│       ├── preprocessor_config.json
-│       ├── special_tokens_map.json
-│       ├── spiece.model
-│       ├── tokenizer.json
-│       └── tokenizer_config.json 
-│   └─ TableMaster 
-│       └─ ch_PP-OCRv3_det_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ ch_PP-OCRv3_rec_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ table_structure_tablemaster_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       ├── ppocr_keys_v1.txt
-│       └── table_master_structure_dict.txt
-└── README.md
-```
-#### 2. Check whether the model file is fully downloaded.
-
-Please check whether the size of the model file in the directory is consistent with the description on the web page. If possible, it is best to check whether the model is downloaded completely through sha256.
-
-#### 3. 
+### 2. To modify the model path address in the configuration file
 
 Additionally, in `~/magic-pdf.json`, update the model directory path to the absolute path of the `models` directory output by the previous Python script. Otherwise, you will encounter an error indicating that the model cannot be loaded.
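
Note: the path update described in this step can be scripted. The following is a minimal sketch, not part of the commit. It assumes the config file is plain JSON and that the field holding the model directory is named `models-dir` (an assumption; check your own `magic-pdf.json`). `set_models_dir` is a hypothetical helper name.

```python
import json
from pathlib import Path


def set_models_dir(config_path, models_dir, key="models-dir"):
    """Store the absolute model directory in a JSON config file.

    `key` is an assumed field name; adjust it to match your magic-pdf.json.
    """
    config_file = Path(config_path).expanduser()
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # The documentation asks for an absolute path, so resolve it first.
    config[key] = str(Path(models_dir).expanduser().resolve())
    config_file.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

A call such as `set_models_dir("~/magic-pdf.json", model_dir + "/models")` would use the directory printed by `download_models_hf.py`.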
 

+ 2 - 50
docs/how_to_download_models_zh_cn.md

@@ -21,55 +21,7 @@ wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py
 python download_models.py
 ```
 python脚本执行完毕后,会输出模型下载目录
-## 【❗️必须要做❗️】的额外步骤(模型下载完成后请务必完成以下操作)
 
-### 1.检查模型目录是否下载完整
-模型文件夹的结构如下,包含了不同组件的配置文件和权重文件:
-```
-./
-├── Layout  # 布局检测模型
-│   ├── config.json
-│   └── model_final.pth
-├── MFD  # 公式检测
-│   └── weights.pt
-├── MFR  # 公式识别模型
-│   └── UniMERNet
-│       ├── config.json
-│       ├── preprocessor_config.json
-│       ├── pytorch_model.bin
-│       ├── README.md
-│       ├── tokenizer_config.json
-│       └── tokenizer.json
-│── TabRec # 表格识别模型
-│   └─StructEqTable
-│       ├── config.json
-│       ├── generation_config.json
-│       ├── model.safetensors
-│       ├── preprocessor_config.json
-│       ├── special_tokens_map.json
-│       ├── spiece.model
-│       ├── tokenizer.json
-│       └── tokenizer_config.json 
-│   └─ TableMaster 
-│       └─ ch_PP-OCRv3_det_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ ch_PP-OCRv3_rec_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       └─ table_structure_tablemaster_infer
-│           ├── inference.pdiparams
-│           ├── inference.pdiparams.info
-│           └── inference.pdmodel
-│       ├── ppocr_keys_v1.txt
-│       └── table_master_structure_dict.txt
-└── README.md
-```
-
-### 2.检查模型文件是否下载完整
-请检查目录下的模型文件大小与网页上描述是否一致,如果可以的话,最好通过sha256校验模型是否下载完整
 
-### 3.修改magic-pdf.json中的模型路径
-此外在 `~/magic-pdf.json`里修改模型的目录指向之前python脚本输出的models目录的绝对路径,否则会报模型无法加载的错误。
+## 下载完成后的操作:修改magic-pdf.json中的模型路径
+在`~/magic-pdf.json`里修改模型的目录指向上一步脚本输出的models目录的绝对路径,否则会报模型无法加载的错误。

+ 51 - 31
magic_pdf/dict2md/ocr_mkcontent.py

@@ -116,17 +116,20 @@ def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=''):
 
 def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                                       mode,
-                                      img_buket_path=''):
+                                      img_buket_path='',
+                                      parse_type="auto",
+                                      lang=None
+                                      ):
     page_markdown = []
     for para_block in paras_of_layout:
         para_text = ''
         para_type = para_block['type']
         if para_type == BlockType.Text:
-            para_text = merge_para_with_text(para_block)
+            para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang)
         elif para_type == BlockType.Title:
-            para_text = f'# {merge_para_with_text(para_block)}'
+            para_text = f'# {merge_para_with_text(para_block, parse_type=parse_type, lang=lang)}'
         elif para_type == BlockType.InterlineEquation:
-            para_text = merge_para_with_text(para_block)
+            para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang)
         elif para_type == BlockType.Image:
             if mode == 'nlp':
                 continue
@@ -139,17 +142,17 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                                     para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
                 for block in para_block['blocks']:  # 2nd: append image_caption
                     if block['type'] == BlockType.ImageCaption:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
                 for block in para_block['blocks']:  # 3rd: append image_footnote
                     if block['type'] == BlockType.ImageFootnote:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
         elif para_type == BlockType.Table:
             if mode == 'nlp':
                 continue
             elif mode == 'mm':
                 for block in para_block['blocks']:  # 1st: append table_caption
                     if block['type'] == BlockType.TableCaption:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
                 for block in para_block['blocks']:  # 2nd: append table_body
                     if block['type'] == BlockType.TableBody:
                         for line in block['lines']:
@@ -164,7 +167,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                                         para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
                 for block in para_block['blocks']:  # 3rd: append table_footnote
                     if block['type'] == BlockType.TableFootnote:
-                        para_text += merge_para_with_text(block)
+                        para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
 
         if para_text.strip() == '':
             continue
@@ -174,7 +177,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
     return page_markdown
 
 
-def merge_para_with_text(para_block):
+def merge_para_with_text(para_block, parse_type="auto", lang=None):
 
     def detect_language(text):
         en_pattern = r'[a-zA-Z]+'
@@ -205,11 +208,15 @@ def merge_para_with_text(para_block):
                 content = span['content']
                 # language = detect_lang(content)
                 language = detect_language(content)
-                if language == 'en':  # Only split long English words; word-splitting Chinese would lose text
-                    content = ocr_escape_special_markdown_char(
-                        split_long_words(content))
-                else:
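+                # Check whether a minority (non-English) language was specified
+                if lang is not None and lang != 'en':
                     content = ocr_escape_special_markdown_char(content)
+                else:  # Logic when no minority language is specified
+                    if language == 'en' and parse_type == 'ocr':  # Only split long English words; word-splitting Chinese would lose text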
+                        content = ocr_escape_special_markdown_char(
+                            split_long_words(content))
+                    else:
+                        content = ocr_escape_special_markdown_char(content)
             elif span_type == ContentType.InlineEquation:
                 content = f" ${span['content']}$ "
             elif span_type == ContentType.InterlineEquation:
@@ -265,41 +272,39 @@ def para_to_standard_format(para, img_buket_path):
     return para_content
 
 
-def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
+def para_to_standard_format_v2(para_block, img_buket_path, page_idx, parse_type="auto", lang=None, drop_reason=None):
     para_type = para_block['type']
+    para_content = {}
     if para_type == BlockType.Text:
         para_content = {
             'type': 'text',
-            'text': merge_para_with_text(para_block),
-            'page_idx': page_idx,
+            'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
         }
     elif para_type == BlockType.Title:
         para_content = {
             'type': 'text',
-            'text': merge_para_with_text(para_block),
+            'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
             'text_level': 1,
-            'page_idx': page_idx,
         }
     elif para_type == BlockType.InterlineEquation:
         para_content = {
             'type': 'equation',
-            'text': merge_para_with_text(para_block),
+            'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
             'text_format': 'latex',
-            'page_idx': page_idx,
         }
     elif para_type == BlockType.Image:
-        para_content = {'type': 'image', 'page_idx': page_idx}
+        para_content = {'type': 'image'}
         for block in para_block['blocks']:
             if block['type'] == BlockType.ImageBody:
                 para_content['img_path'] = join_path(
                     img_buket_path,
                     block['lines'][0]['spans'][0]['image_path'])
             if block['type'] == BlockType.ImageCaption:
-                para_content['img_caption'] = merge_para_with_text(block)
+                para_content['img_caption'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
             if block['type'] == BlockType.ImageFootnote:
-                para_content['img_footnote'] = merge_para_with_text(block)
+                para_content['img_footnote'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
     elif para_type == BlockType.Table:
-        para_content = {'type': 'table', 'page_idx': page_idx}
+        para_content = {'type': 'table'}
         for block in para_block['blocks']:
             if block['type'] == BlockType.TableBody:
                 if block["lines"][0]["spans"][0].get('latex', ''):
@@ -308,9 +313,14 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
                     para_content['table_body'] = f"\n\n{block['lines'][0]['spans'][0]['html']}\n\n"
                 para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
             if block['type'] == BlockType.TableCaption:
-                para_content['table_caption'] = merge_para_with_text(block)
+                para_content['table_caption'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
             if block['type'] == BlockType.TableFootnote:
-                para_content['table_footnote'] = merge_para_with_text(block)
+                para_content['table_footnote'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
+
+    para_content['page_idx'] = page_idx
+
+    if drop_reason is not None:
+        para_content['drop_reason'] = drop_reason
 
     return para_content
 
@@ -394,13 +404,19 @@ def ocr_mk_mm_standard_format(pdf_info_dict: list):
 def union_make(pdf_info_dict: list,
                make_mode: str,
                drop_mode: str,
-               img_buket_path: str = ''):
+               img_buket_path: str = '',
+               parse_type: str = "auto",
+               lang=None):
     output_content = []
     for page_info in pdf_info_dict:
+        drop_reason_flag = False
+        drop_reason = None
         if page_info.get('need_drop', False):
             drop_reason = page_info.get('drop_reason')
             if drop_mode == DropMode.NONE:
                 pass
+            elif drop_mode == DropMode.NONE_WITH_REASON:
+                drop_reason_flag = True
             elif drop_mode == DropMode.WHOLE_PDF:
                 raise Exception((f'drop_mode is {DropMode.WHOLE_PDF} ,'
                                  f'drop_reason is {drop_reason}'))
@@ -417,16 +433,20 @@ def union_make(pdf_info_dict: list,
             continue
         if make_mode == MakeMode.MM_MD:
             page_markdown = ocr_mk_markdown_with_para_core_v2(
-                paras_of_layout, 'mm', img_buket_path)
+                paras_of_layout, 'mm', img_buket_path, parse_type=parse_type, lang=lang)
             output_content.extend(page_markdown)
         elif make_mode == MakeMode.NLP_MD:
             page_markdown = ocr_mk_markdown_with_para_core_v2(
-                paras_of_layout, 'nlp')
+                paras_of_layout, 'nlp', parse_type=parse_type, lang=lang)
             output_content.extend(page_markdown)
         elif make_mode == MakeMode.STANDARD_FORMAT:
             for para_block in paras_of_layout:
-                para_content = para_to_standard_format_v2(
-                    para_block, img_buket_path, page_idx)
+                if drop_reason_flag:
+                    para_content = para_to_standard_format_v2(
+                        para_block, img_buket_path, page_idx, parse_type=parse_type, lang=lang, drop_reason=drop_reason)
+                else:
+                    para_content = para_to_standard_format_v2(
+                        para_block, img_buket_path, page_idx, parse_type=parse_type, lang=lang)
                 output_content.append(para_content)
     if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
         return '\n\n'.join(output_content)
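
Note: the new branching in `merge_para_with_text` reduces to a single decision about when long words are split. The sketch below restates that condition as a standalone predicate. `should_split_long_words` is a hypothetical name used only for illustration and does not exist in the commit.

```python
def should_split_long_words(detected_language, parse_type="auto", lang=None):
    """Mirror the condition under which the diff calls split_long_words.

    detected_language: result of the per-span language check ('en' or other).
    parse_type:        parse method passed down from the pipe, e.g. 'ocr'.
    lang:              language explicitly requested by the caller, or None.
    """
    # An explicitly requested non-English language disables splitting.
    if lang is not None and lang != "en":
        return False
    # Otherwise split only English text that came from the OCR path.
    return detected_language == "en" and parse_type == "ocr"
```

Under this reading, English spans are split only when `parse_type` is `'ocr'`, so with the default `parse_type="auto"` no splitting takes place.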

+ 1 - 0
magic_pdf/libs/MakeContentConfig.py

@@ -8,3 +8,4 @@ class DropMode:
     WHOLE_PDF = "whole_pdf"
     SINGLE_PAGE = "single_page"
     NONE = "none"
+    NONE_WITH_REASON = "none_with_reason"
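
Note: a rough sketch of how the new `NONE_WITH_REASON` mode behaves for `STANDARD_FORMAT` output, based only on the hunks in `ocr_mkcontent.py` above. It covers the `NONE` and `NONE_WITH_REASON` branches and leaves out `WHOLE_PDF` and `SINGLE_PAGE`. `reason_to_attach` and `annotate` are hypothetical helper names.

```python
class DropMode:
    WHOLE_PDF = "whole_pdf"
    SINGLE_PAGE = "single_page"
    NONE = "none"
    NONE_WITH_REASON = "none_with_reason"


def reason_to_attach(page_info, drop_mode):
    """Return the drop_reason to record on each block, or None."""
    if not page_info.get("need_drop", False):
        return None
    if drop_mode == DropMode.NONE_WITH_REASON:
        return page_info.get("drop_reason")
    return None  # DropMode.NONE keeps the page without recording a reason


def annotate(para_content, page_idx, drop_reason=None):
    """Add page_idx, plus drop_reason when one is supplied."""
    result = dict(para_content)
    result["page_idx"] = page_idx
    if drop_reason is not None:
        result["drop_reason"] = drop_reason
    return result
```

In both modes the flagged page is still emitted; the difference is that `NONE_WITH_REASON` carries the reason along in every block of that page.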

BIN
magic_pdf/libs/__pycache__/__init__.cpython-312.pyc


BIN
magic_pdf/libs/__pycache__/version.cpython-312.pyc


+ 19 - 0
magic_pdf/libs/boxbase.py

@@ -426,3 +426,22 @@ def bbox_distance(bbox1, bbox2):
     elif top:
         return y2 - y1b
     return 0.0
+
+
+def box_area(bbox):
+    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
+
+
+def get_overlap_area(bbox1, bbox2):
+    """Compute the overlap area of bbox1 and bbox2 (the absolute area, not a ratio)."""
+    # Determine the coordinates of the intersection rectangle
+    x_left = max(bbox1[0], bbox2[0])
+    y_top = max(bbox1[1], bbox2[1])
+    x_right = min(bbox1[2], bbox2[2])
+    y_bottom = min(bbox1[3], bbox2[3])
+
+    if x_right < x_left or y_bottom < y_top:
+        return 0.0
+
+    # The area of overlap area
+    return (x_right - x_left) * (y_bottom - y_top)
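
Note: a small worked example of the two helpers added above, with the function bodies copied from the diff and boxes given as `[x0, y0, x1, y1]`. As written, `get_overlap_area` returns an absolute area, so a caller that wants a ratio has to divide by `box_area` itself. The sample coordinates are made up for illustration.

```python
def box_area(bbox):
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


def get_overlap_area(bbox1, bbox2):
    # Coordinates of the intersection rectangle
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])

    if x_right < x_left or y_bottom < y_top:
        return 0.0

    return (x_right - x_left) * (y_bottom - y_top)


a = [0, 0, 10, 10]
b = [5, 5, 15, 15]
c = [20, 20, 30, 30]

overlap_ab = get_overlap_area(a, b)    # 5 * 5 intersection
overlap_ac = get_overlap_area(a, c)    # disjoint boxes
ratio_ab = overlap_ab / box_area(a)    # share of box a covered by box b
```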

+ 1 - 1
magic_pdf/libs/version.py

@@ -1 +1 @@
-__version__ = "0.7.1"
+__version__ = "0.8.0"

+ 10 - 8
magic_pdf/model/doc_analyze_by_custom_model.py

@@ -57,14 +57,14 @@ class ModelSingleton:
             cls._instance = super().__new__(cls)
         return cls._instance
 
-    def get_model(self, ocr: bool, show_log: bool):
-        key = (ocr, show_log)
+    def get_model(self, ocr: bool, show_log: bool, lang=None):
+        key = (ocr, show_log, lang)
         if key not in self._models:
-            self._models[key] = custom_model_init(ocr=ocr, show_log=show_log)
+            self._models[key] = custom_model_init(ocr=ocr, show_log=show_log, lang=lang)
         return self._models[key]
 
 
-def custom_model_init(ocr: bool = False, show_log: bool = False):
+def custom_model_init(ocr: bool = False, show_log: bool = False, lang=None):
     model = None
 
     if model_config.__model_mode__ == "lite":
@@ -78,7 +78,7 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
         model_init_start = time.time()
         if model == MODEL.Paddle:
             from magic_pdf.model.pp_structure_v2 import CustomPaddleModel
-            custom_model = CustomPaddleModel(ocr=ocr, show_log=show_log)
+            custom_model = CustomPaddleModel(ocr=ocr, show_log=show_log, lang=lang)
         elif model == MODEL.PEK:
             from magic_pdf.model.pdf_extract_kit import CustomPEKModel
             # Read model-dir and device from the config file
@@ -89,7 +89,9 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
                            "show_log": show_log,
                            "models_dir": local_models_dir,
                            "device": device,
-                           "table_config": table_config}
+                           "table_config": table_config,
+                           "lang": lang,
+                           }
             custom_model = CustomPEKModel(**model_input)
         else:
             logger.error("Not allow model_name!")
@@ -104,10 +106,10 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
 
 
 def doc_analyze(pdf_bytes: bytes, ocr: bool = False, show_log: bool = False,
-                start_page_id=0, end_page_id=None):
+                start_page_id=0, end_page_id=None, lang=None):
 
     model_manager = ModelSingleton()
-    custom_model = model_manager.get_model(ocr, show_log)
+    custom_model = model_manager.get_model(ocr, show_log, lang)
 
     images = load_images_from_pdf(pdf_bytes)
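
Note: adding `lang` to the cache key means each language gets its own model instance. The sketch below reproduces the caching pattern with a stand-in constructor. `ModelCache` and `fake_model_init` are hypothetical names, and no real model is loaded.

```python
def fake_model_init(ocr=False, show_log=False, lang=None):
    """Stand-in for an expensive model constructor (hypothetical)."""
    return {"ocr": ocr, "show_log": show_log, "lang": lang}


class ModelCache:
    """Process-wide cache with one entry per (ocr, show_log, lang) combination."""

    _instance = None
    _models = {}

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, ocr, show_log, lang=None):
        key = (ocr, show_log, lang)
        if key not in self._models:
            # Built once per key, then reused on later calls.
            self._models[key] = fake_model_init(ocr=ocr, show_log=show_log, lang=lang)
        return self._models[key]
```

One consequence of this keying is that every distinct `lang` value creates and retains a separate model, and `lang=None` is a different key from `lang="en"`.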
 

+ 96 - 45
magic_pdf/model/magic_model.py

@@ -1,8 +1,9 @@
 import json
 
 from magic_pdf.libs.boxbase import (_is_in, _is_part_overlap, bbox_distance,
-                                    bbox_relative_pos, calculate_iou,
-                                    calculate_overlap_area_in_bbox1_area_ratio)
+                                    bbox_relative_pos, box_area, calculate_iou,
+                                    calculate_overlap_area_in_bbox1_area_ratio,
+                                    get_overlap_area)
 from magic_pdf.libs.commons import fitz, join_path
 from magic_pdf.libs.coordinate_transform import get_scale_ratio
 from magic_pdf.libs.local_math import float_gt
@@ -12,6 +13,7 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
 
 CAPATION_OVERLAP_AREA_RATIO = 0.6
+MERGE_BOX_OVERLAP_AREA_RATIO = 1.1
 
 
 class MagicModel:
@@ -124,49 +126,51 @@ class MagicModel:
                     tables.append(obj)
                 if len(footnotes) * len(figures) == 0:
                     continue
-                dis_figure_footnote = {}
-                dis_table_footnote = {}
-
-                for i in range(len(footnotes)):
-                    for j in range(len(figures)):
-                        pos_flag_count = sum(
-                            list(
-                                map(
-                                    lambda x: 1 if x else 0,
-                                    bbox_relative_pos(
-                                        footnotes[i]['bbox'], figures[j]['bbox']
-                                    ),
-                                )
+            dis_figure_footnote = {}
+            dis_table_footnote = {}
+
+            for i in range(len(footnotes)):
+                for j in range(len(figures)):
+                    pos_flag_count = sum(
+                        list(
+                            map(
+                                lambda x: 1 if x else 0,
+                                bbox_relative_pos(
+                                    footnotes[i]['bbox'], figures[j]['bbox']
+                                ),
                             )
                         )
-                        if pos_flag_count > 1:
-                            continue
-                        dis_figure_footnote[i] = min(
-                            bbox_distance(figures[j]['bbox'], footnotes[i]['bbox']),
-                            dis_figure_footnote.get(i, float('inf')),
-                        )
-                for i in range(len(footnotes)):
-                    for j in range(len(tables)):
-                        pos_flag_count = sum(
-                            list(
-                                map(
-                                    lambda x: 1 if x else 0,
-                                    bbox_relative_pos(
-                                        footnotes[i]['bbox'], tables[j]['bbox']
-                                    ),
-                                )
+                    )
+                    if pos_flag_count > 1:
+                        continue
+                    dis_figure_footnote[i] = min(
+                        bbox_distance(figures[j]['bbox'], footnotes[i]['bbox']),
+                        dis_figure_footnote.get(i, float('inf')),
+                    )
+            for i in range(len(footnotes)):
+                for j in range(len(tables)):
+                    pos_flag_count = sum(
+                        list(
+                            map(
+                                lambda x: 1 if x else 0,
+                                bbox_relative_pos(
+                                    footnotes[i]['bbox'], tables[j]['bbox']
+                                ),
                             )
                         )
-                        if pos_flag_count > 1:
-                            continue
+                    )
+                    if pos_flag_count > 1:
+                        continue
 
-                        dis_table_footnote[i] = min(
-                            bbox_distance(tables[j]['bbox'], footnotes[i]['bbox']),
-                            dis_table_footnote.get(i, float('inf')),
-                        )
-                for i in range(len(footnotes)):
-                    if dis_table_footnote.get(i, float('inf')) > dis_figure_footnote[i]:
-                        footnotes[i]['category_id'] = CategoryId.ImageFootnote
+                    dis_table_footnote[i] = min(
+                        bbox_distance(tables[j]['bbox'], footnotes[i]['bbox']),
+                        dis_table_footnote.get(i, float('inf')),
+                    )
+            for i in range(len(footnotes)):
+                if i not in dis_figure_footnote:
+                    continue
+                if dis_table_footnote.get(i, float('inf')) > dis_figure_footnote[i]:
+                    footnotes[i]['category_id'] = CategoryId.ImageFootnote
 
     def __reduct_overlap(self, bboxes):
         N = len(bboxes)
@@ -191,6 +195,44 @@ class MagicModel:
         Select all subjects that overlap the merged bbox and whose overlap area is larger than the object's area.
         Then compute the shortest distance between the selected subjects and the object.
         """
+        def search_overlap_between_boxes(
+            subject_idx, object_idx
+        ):
+            idxes = [subject_idx, object_idx]
+            x0s = [all_bboxes[idx]['bbox'][0] for idx in idxes]
+            y0s = [all_bboxes[idx]['bbox'][1] for idx in idxes]
+            x1s = [all_bboxes[idx]['bbox'][2] for idx in idxes]
+            y1s = [all_bboxes[idx]['bbox'][3] for idx in idxes]
+
+            merged_bbox = [
+                min(x0s),
+                min(y0s),
+                max(x1s),
+                max(y1s),
+            ]
+            ratio = 0
+
+            other_objects = list(
+                map(
+                    lambda x: {'bbox': x['bbox'], 'score': x['score']},
+                    filter(
+                        lambda x: x['category_id']
+                        not in (object_category_id, subject_category_id),
+                        self.__model_list[page_no]['layout_dets'],
+                    ),
+                )
+            )
+            for other_object in other_objects:
+                ratio = max(
+                    ratio,
+                    get_overlap_area(
+                        merged_bbox, other_object['bbox']
+                    ) * 1.0 / box_area(all_bboxes[object_idx]['bbox'])
+                )
+                if ratio >= MERGE_BOX_OVERLAP_AREA_RATIO:
+                    break
+
+            return ratio
 
         def may_find_other_nearest_bbox(subject_idx, object_idx):
             ret = float('inf')
@@ -299,6 +341,15 @@ class MagicModel:
                 ):
                     continue
 
+                subject_idx, object_idx = i, j
+                if all_bboxes[j]['category_id'] == subject_category_id:
+                    subject_idx, object_idx = j, i
+
+                if search_overlap_between_boxes(subject_idx, object_idx) >= MERGE_BOX_OVERLAP_AREA_RATIO:
+                    dis[i][j] = float('inf')
+                    dis[j][i] = dis[i][j]
+                    continue
+
                 dis[i][j] = bbox_distance(all_bboxes[i]['bbox'], all_bboxes[j]['bbox'])
                 dis[j][i] = dis[i][j]
 
@@ -627,13 +678,13 @@ class MagicModel:
                     span['type'] = ContentType.Image
                 elif category_id == 5:
                     # Get the table model results
-                    latex = layout_det.get("latex", None)
-                    html = layout_det.get("html", None)
+                    latex = layout_det.get('latex', None)
+                    html = layout_det.get('html', None)
                     if latex:
-                        span["latex"] = latex
+                        span['latex'] = latex
                     elif html:
-                        span["html"] = html
-                    span["type"] = ContentType.Table
+                        span['html'] = html
+                    span['type'] = ContentType.Table
                 elif category_id == 13:
                     span['content'] = layout_det['latex']
                     span['type'] = ContentType.InlineEquation
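
Review note: the `search_overlap_between_boxes` hunk above computes, for each other object, the overlap between `merged_bbox` and that object divided by the area of the candidate object, and the caller sets the pair's distance to infinity once that ratio reaches `MERGE_BOX_OVERLAP_AREA_RATIO`. The sketch below illustrates the idea under stated assumptions: `merged_bbox` is taken to be the smallest box enclosing the subject and the object (its construction is not visible in this hunk), and the helper bodies, the names `enclosing_box` and `worst_overlap_ratio`, and the threshold `1.1` are hypothetical stand-ins rather than the project's own `get_overlap_area`, `box_area`, or constant.

```python
# Illustrative sketch only; helpers and threshold are assumptions, not magic_pdf code.
EXAMPLE_THRESHOLD = 1.1  # hypothetical value for demonstration


def box_area(bbox):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes have area 0."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def enclosing_box(a, b):
    """Smallest box containing both a and b (assumed meaning of merged_bbox)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def worst_overlap_ratio(subject_bbox, object_bbox, other_bboxes, threshold):
    """Largest overlap(merged, other) / area(object), stopping early at threshold."""
    merged = enclosing_box(subject_bbox, object_bbox)
    ratio = 0.0
    for other in other_bboxes:
        ratio = max(ratio, overlap_area(merged, other) / box_area(object_bbox))
        if ratio >= threshold:
            break
    return ratio


subject = (0, 0, 10, 10)   # e.g. a figure
caption = (0, 20, 10, 30)  # candidate caption below it
blocker = (0, 5, 10, 25)   # another object sitting between the two
too_much = worst_overlap_ratio(subject, caption, [blocker], EXAMPLE_THRESHOLD)
print(too_much >= EXAMPLE_THRESHOLD)
```

The sketch divides by the area of the candidate object, as the hunk does, so a zero-area object box would raise `ZeroDivisionError`; a caller would need to filter such boxes first.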

+ 16 - 7
magic_pdf/model/pdf_extract_kit.py

@@ -58,7 +58,7 @@ def mfd_model_init(weight):
 def mfr_model_init(weight_dir, cfg_path, _device_='cpu'):
     args = argparse.Namespace(cfg_path=cfg_path, options=None)
     cfg = Config(args)
-    cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.bin")
+    cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
     cfg.config.model.model_config.model_name = weight_dir
     cfg.config.model.tokenizer_config.path = weight_dir
     task = tasks.setup_task(cfg)
@@ -74,8 +74,11 @@ def layout_model_init(weight, config_file, device):
     return model
 
 
-def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3):
-    model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh)
+def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3, lang=None):
+    if lang is not None:
+        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, lang=lang)
+    else:
+        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh)
     return model
 
 
@@ -134,7 +137,8 @@ def atom_model_init(model_name: str, **kwargs):
     elif model_name == AtomicModel.OCR:
         atom_model = ocr_model_init(
             kwargs.get("ocr_show_log"),
-            kwargs.get("det_db_box_thresh")
+            kwargs.get("det_db_box_thresh"),
+            kwargs.get("lang")
         )
     elif model_name == AtomicModel.Table:
         atom_model = table_model_init(
@@ -177,9 +181,10 @@ class CustomPEKModel:
         self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
         self.table_model_type = self.table_config.get("model", TABLE_MASTER)
         self.apply_ocr = ocr
+        self.lang = kwargs.get("lang", None)
         logger.info(
-            "DocAnalysis init, this may take some times. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}".format(
-                self.apply_layout, self.apply_formula, self.apply_ocr, self.apply_table
+            "DocAnalysis init, this may take some time. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}, lang: {}".format(
+                self.apply_layout, self.apply_formula, self.apply_ocr, self.apply_table, self.lang
             )
         )
         assert self.apply_layout, "DocAnalysis must contain layout model."
@@ -225,11 +230,13 @@ class CustomPEKModel:
         )
         # Initialize OCR
         if self.apply_ocr:
+
             # self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)
             self.ocr_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.OCR,
                 ocr_show_log=show_log,
-                det_db_box_thresh=0.3
+                det_db_box_thresh=0.3,
+                lang=self.lang
             )
         # init table model
         if self.apply_table:
@@ -243,6 +250,7 @@ class CustomPEKModel:
                 table_max_time=self.table_max_time,
                 device=self.device
             )
+
         logger.info('DocAnalysis init done!')
 
     def __call__(self, image):
@@ -383,6 +391,7 @@ class CustomPEKModel:
                         latex_code = self.table_model.image2latex(new_image)[0]
                 else:
                     html_code = self.table_model.img2html(new_image)
+
                 run_time = time.time() - single_table_start_time
                 logger.info(f"------------table recognition processing ends within {run_time}s-----")
                 if run_time > self.table_max_time:
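
Review note: `ocr_model_init` above (and `CustomPaddleModel.__init__` in the next file) passes `lang` to the OCR engine only when it is not `None`, presumably so that the engine's own default language still applies when the caller gives none. The same behaviour can be written without duplicating the constructor call by building the keyword arguments first. The sketch below uses a hypothetical `FakeOCR` class in place of `ModifiedPaddleOCR`; its constructor signature and its default language are assumptions made only so the example runs on its own.

```python
class FakeOCR:
    """Hypothetical stand-in for an OCR engine that has its own default language."""

    def __init__(self, show_log=False, det_db_box_thresh=0.3, lang="ch"):
        self.show_log = show_log
        self.det_db_box_thresh = det_db_box_thresh
        self.lang = lang


def ocr_model_init(show_log=False, det_db_box_thresh=0.3, lang=None):
    # Collect the arguments that are always passed.
    kwargs = {"show_log": show_log, "det_db_box_thresh": det_db_box_thresh}
    # Forward lang only when the caller supplied one, so the engine default survives.
    if lang is not None:
        kwargs["lang"] = lang
    return FakeOCR(**kwargs)


print(ocr_model_init().lang)           # engine default is kept
print(ocr_model_init(lang="en").lang)  # explicit language is forwarded
```

This keeps a single construction site, which matters once more optional arguments are added.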

+ 5 - 2
magic_pdf/model/pp_structure_v2.py

@@ -18,8 +18,11 @@ def region_to_bbox(region):
 
 
 class CustomPaddleModel:
-    def __init__(self, ocr: bool = False, show_log: bool = False):
-        self.model = PPStructure(table=False, ocr=ocr, show_log=show_log)
+    def __init__(self, ocr: bool = False, show_log: bool = False, lang=None):
+        if lang is not None:
+            self.model = PPStructure(table=False, ocr=ocr, show_log=show_log, lang=lang)
+        else:
+            self.model = PPStructure(table=False, ocr=ocr, show_log=show_log)
 
     def __call__(self, img):
         try:

+ 8 - 3
magic_pdf/pipe/AbsPipe.py

@@ -17,7 +17,7 @@ class AbsPipe(ABC):
     PIP_TXT = "txt"
 
     def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
-                 start_page_id=0, end_page_id=None):
+                 start_page_id=0, end_page_id=None, lang=None):
         self.pdf_bytes = pdf_bytes
         self.model_list = model_list
         self.image_writer = image_writer
@@ -25,6 +25,7 @@ class AbsPipe(ABC):
         self.is_debug = is_debug
         self.start_page_id = start_page_id
         self.end_page_id = end_page_id
+        self.lang = lang
     
     def get_compress_pdf_mid_data(self):
         return JsonCompressor.compress_json(self.pdf_mid_data)
@@ -94,7 +95,9 @@ class AbsPipe(ABC):
         """
         pdf_mid_data = JsonCompressor.decompress_json(compressed_pdf_mid_data)
         pdf_info_list = pdf_mid_data["pdf_info"]
-        content_list = union_make(pdf_info_list, MakeMode.STANDARD_FORMAT, drop_mode, img_buket_path)
+        parse_type = pdf_mid_data["_parse_type"]
+        lang = pdf_mid_data.get("_lang", None)
+        content_list = union_make(pdf_info_list, MakeMode.STANDARD_FORMAT, drop_mode, img_buket_path, parse_type, lang)
         return content_list
 
     @staticmethod
@@ -104,7 +107,9 @@ class AbsPipe(ABC):
         """
         pdf_mid_data = JsonCompressor.decompress_json(compressed_pdf_mid_data)
         pdf_info_list = pdf_mid_data["pdf_info"]
-        md_content = union_make(pdf_info_list, md_make_mode, drop_mode, img_buket_path)
+        parse_type = pdf_mid_data["_parse_type"]
+        lang = pdf_mid_data.get("_lang", None)
+        md_content = union_make(pdf_info_list, md_make_mode, drop_mode, img_buket_path, parse_type, lang)
         return md_content
 
 

+ 6 - 4
magic_pdf/pipe/OCRPipe.py

@@ -10,19 +10,21 @@ from magic_pdf.user_api import parse_ocr_pdf
 class OCRPipe(AbsPipe):
 
     def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
-                 start_page_id=0, end_page_id=None):
-        super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id)
+                 start_page_id=0, end_page_id=None, lang=None):
+        super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id, lang)
 
     def pipe_classify(self):
         pass
 
     def pipe_analyze(self):
         self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
-                                      start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                      start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                      lang=self.lang)
 
     def pipe_parse(self):
         self.pdf_mid_data = parse_ocr_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
-                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                          lang=self.lang)
 
     def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
         result = super().pipe_mk_uni_format(img_parent_path, drop_mode)

+ 6 - 4
magic_pdf/pipe/TXTPipe.py

@@ -11,19 +11,21 @@ from magic_pdf.user_api import parse_txt_pdf
 class TXTPipe(AbsPipe):
 
     def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
-                 start_page_id=0, end_page_id=None):
-        super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id)
+                 start_page_id=0, end_page_id=None, lang=None):
+        super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id, lang)
 
     def pipe_classify(self):
         pass
 
     def pipe_analyze(self):
         self.model_list = doc_analyze(self.pdf_bytes, ocr=False,
-                                      start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                      start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                      lang=self.lang)
 
     def pipe_parse(self):
         self.pdf_mid_data = parse_txt_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
-                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                          lang=self.lang)
 
     def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
         result = super().pipe_mk_uni_format(img_parent_path, drop_mode)

+ 11 - 7
magic_pdf/pipe/UNIPipe.py

@@ -14,9 +14,9 @@ from magic_pdf.user_api import parse_union_pdf, parse_ocr_pdf
 class UNIPipe(AbsPipe):
 
     def __init__(self, pdf_bytes: bytes, jso_useful_key: dict, image_writer: AbsReaderWriter, is_debug: bool = False,
-                 start_page_id=0, end_page_id=None):
+                 start_page_id=0, end_page_id=None, lang=None):
         self.pdf_type = jso_useful_key["_pdf_type"]
-        super().__init__(pdf_bytes, jso_useful_key["model_list"], image_writer, is_debug, start_page_id, end_page_id)
+        super().__init__(pdf_bytes, jso_useful_key["model_list"], image_writer, is_debug, start_page_id, end_page_id, lang)
         if len(self.model_list) == 0:
             self.input_model_is_empty = True
         else:
@@ -28,22 +28,26 @@ class UNIPipe(AbsPipe):
     def pipe_analyze(self):
         if self.pdf_type == self.PIP_TXT:
             self.model_list = doc_analyze(self.pdf_bytes, ocr=False,
-                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                          lang=self.lang)
         elif self.pdf_type == self.PIP_OCR:
             self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
-                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                          lang=self.lang)
 
     def pipe_parse(self):
         if self.pdf_type == self.PIP_TXT:
             self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
                                                 is_debug=self.is_debug, input_model_is_empty=self.input_model_is_empty,
-                                                start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                                start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                                lang=self.lang)
         elif self.pdf_type == self.PIP_OCR:
             self.pdf_mid_data = parse_ocr_pdf(self.pdf_bytes, self.model_list, self.image_writer,
                                               is_debug=self.is_debug,
-                                              start_page_id=self.start_page_id, end_page_id=self.end_page_id)
+                                              start_page_id=self.start_page_id, end_page_id=self.end_page_id,
+                                              lang=self.lang)
 
-    def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
+    def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.NONE_WITH_REASON):
         result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
         logger.info("uni_pipe mk content list finished")
         return result

+ 7 - 7
magic_pdf/resources/model_config/UniMERNet/demo.yaml

@@ -2,13 +2,13 @@ model:
   arch: unimernet
   model_type: unimernet
   model_config:
-    model_name: ./models
-    max_seq_len: 1024
-    length_aware: False
+    model_name: ./models/unimernet_base
+    max_seq_len: 1536
+
   load_pretrained: True
-  pretrained: ./models/pytorch_model.bin
+  pretrained: './models/unimernet_base/pytorch_model.pth'
   tokenizer_config:
-    path: ./models
+    path: ./models/unimernet_base
 
 datasets:
   formula_rec_eval:
@@ -18,7 +18,7 @@ datasets:
         image_size:
           - 192
           - 672
-   
+
 run:
   runner: runner_iter
   task: unimernet_train
@@ -43,4 +43,4 @@ run:
   distributed_type: ddp  # or fsdp when train llm
 
   generate_cfg:
-    temperature: 0.0
+    temperature: 0.0

+ 1 - 1
magic_pdf/resources/model_config/model_configs.yaml

@@ -10,6 +10,6 @@ config:
 weights:
   layout: Layout/model_final.pth
   mfd: MFD/weights.pt
-  mfr: MFR/UniMERNet
+  mfr: MFR/unimernet_base
   struct_eqtable: TabRec/StructEqTable
   TableMaster: TabRec/TableMaster

+ 14 - 1
magic_pdf/tools/cli.py

@@ -45,6 +45,18 @@ without method specified, auto will be used by default.""",
     default='auto',
 )
 @click.option(
+    '-l',
+    '--lang',
+    'lang',
+    type=str,
+    help="""
+    Input the language of the PDF (if known) to improve OCR accuracy. Optional.
+    Use the "Abbreviation" value for the language from this URL:
+    https://paddlepaddle.github.io/PaddleOCR/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations
+    """,
+    default=None,
+)
+@click.option(
     '-d',
     '--debug',
     'debug_able',
@@ -68,7 +80,7 @@ without method specified, auto will be used by default.""",
     help='The ending page for PDF parsing, beginning from 0.',
     default=None,
 )
-def cli(path, output_dir, method, debug_able, start_page_id, end_page_id):
+def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
     model_config.__use_inside_model__ = True
     model_config.__model_mode__ = 'full'
     os.makedirs(output_dir, exist_ok=True)
@@ -90,6 +102,7 @@ def cli(path, output_dir, method, debug_able, start_page_id, end_page_id):
                 debug_able,
                 start_page_id=start_page_id,
                 end_page_id=end_page_id,
+                lang=lang
             )
 
         except Exception as e:

+ 5 - 4
magic_pdf/tools/common.py

@@ -44,9 +44,10 @@ def do_parse(
     f_draw_model_bbox=False,
     start_page_id=0,
     end_page_id=None,
+    lang=None,
 ):
     if debug_able:
-        logger.warning("debug mode is on")
+        logger.warning('debug mode is on')
         f_dump_content_list = True
         f_draw_model_bbox = True
 
@@ -61,13 +62,13 @@ def do_parse(
     if parse_method == 'auto':
         jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
         pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id)
+                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang)
     elif parse_method == 'txt':
         pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id)
+                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang)
     elif parse_method == 'ocr':
         pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id)
+                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang)
     else:
         logger.error('unknown parse method')
         exit(1)

+ 16 - 5
magic_pdf/user_api.py

@@ -26,7 +26,7 @@ PARSE_TYPE_OCR = "ocr"
 
 
 def parse_txt_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWriter, is_debug=False,
-                  start_page_id=0, end_page_id=None,
+                  start_page_id=0, end_page_id=None, lang=None,
                   *args, **kwargs):
     """
     Parse a text-based PDF
@@ -44,11 +44,14 @@ def parse_txt_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWrit
 
     pdf_info_dict["_version_name"] = __version__
 
+    if lang is not None:
+        pdf_info_dict["_lang"] = lang
+
     return pdf_info_dict
 
 
 def parse_ocr_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWriter, is_debug=False,
-                  start_page_id=0, end_page_id=None,
+                  start_page_id=0, end_page_id=None, lang=None,
                   *args, **kwargs):
     """
     Parse an OCR-based PDF
@@ -66,12 +69,15 @@ def parse_ocr_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWrit
 
     pdf_info_dict["_version_name"] = __version__
 
+    if lang is not None:
+        pdf_info_dict["_lang"] = lang
+
     return pdf_info_dict
 
 
 def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWriter, is_debug=False,
                     input_model_is_empty: bool = False,
-                    start_page_id=0, end_page_id=None,
+                    start_page_id=0, end_page_id=None, lang=None,
                     *args, **kwargs):
     """
     Parse a PDF that mixes OCR and text content, extracting everything
@@ -95,9 +101,11 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
     if pdf_info_dict is None or pdf_info_dict.get("_need_drop", False):
         logger.warning(f"parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr")
         if input_model_is_empty:
-            pdf_models = doc_analyze(pdf_bytes, ocr=True,
+            pdf_models = doc_analyze(pdf_bytes,
+                                     ocr=True,
                                      start_page_id=start_page_id,
-                                     end_page_id=end_page_id)
+                                     end_page_id=end_page_id,
+                                     lang=lang)
         pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
         if pdf_info_dict is None:
             raise Exception("Both parse_pdf_by_txt and parse_pdf_by_ocr failed.")
@@ -108,4 +116,7 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
 
     pdf_info_dict["_version_name"] = __version__
 
+    if lang is not None:
+        pdf_info_dict["_lang"] = lang
+
     return pdf_info_dict
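
Review note: the three parse functions above store `_lang` in the result only when a language was given, and the `AbsPipe` static methods earlier in this diff read it back with `pdf_mid_data.get("_lang", None)` while indexing `_parse_type` directly. The sketch below walks through that round trip. It is a simplified model: `compress_json` and `decompress_json` are hypothetical stand-ins built on `zlib` and `base64`, not the project's `JsonCompressor`, and `finish_parse` and `read_back` are illustrative names.

```python
import base64
import json
import zlib


def compress_json(data):
    """Stand-in compressor (assumption): JSON -> zlib -> base64 text."""
    raw = json.dumps(data).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def decompress_json(text):
    """Inverse of compress_json."""
    raw = zlib.decompress(base64.b64decode(text.encode("ascii")))
    return json.loads(raw.decode("utf-8"))


def finish_parse(pdf_info, parse_type, lang=None):
    """Build the mid data; `_lang` is written only when a language is known."""
    mid = {"pdf_info": pdf_info, "_parse_type": parse_type}
    if lang is not None:
        mid["_lang"] = lang
    return mid


def read_back(compressed):
    """Mirror the reader side: required key indexed, optional key via .get()."""
    mid = decompress_json(compressed)
    return mid["_parse_type"], mid.get("_lang", None)


packed = compress_json(finish_parse([{"page_idx": 0}], "ocr", lang="en"))
print(read_back(packed))
```

Because `_lang` is optional on the writer side, reading it with `.get()` keeps mid data produced without a language loadable; a mid-data dict lacking `_parse_type` would raise `KeyError` in this sketch.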

+ 2 - 0
projects/README.md

@@ -3,4 +3,6 @@
 ## Project List
 
 - [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
+- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
+
 

+ 2 - 0
projects/README_zh-CN.md

@@ -3,3 +3,5 @@
 ## 项目列表
 
 - [llama_index_rag](./llama_index_rag/README_zh-CN.md): 基于 llama_index 构建轻量级 RAG 系统
+- [gradio_app](./gradio_app/README_zh-CN.md): 基于 Gradio 的 Web 应用
+

+ 24 - 0
projects/gradio_app/README.md

@@ -0,0 +1,24 @@
+## Installation
+
+MinerU (>=0.8.0)
+> If you already have a functioning MinerU environment, you can skip this step.
+
+[Deploy in CPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#quick-cpu-demo)
+
+[Deploy in GPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#using-gpu)
+
+Third-party Software
+
+```bash
+pip install gradio gradio-pdf
+```
+
+## Start Gradio App
+
+```bash
+python app.py
+```
+
+## Use Gradio App
+
+Access http://127.0.0.1:7860 in your web browser

+ 24 - 0
projects/gradio_app/README_zh-CN.md

@@ -0,0 +1,24 @@
+## 安装
+
+MinerU(>=0.8.0)
+ >如已有正常运行的MinerU环境则可以跳过此步骤
+> 
+[在CPU环境部署](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8cpu%E5%BF%AB%E9%80%9F%E4%BD%93%E9%AA%8C)
+
+[在GPU环境部署](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8gpu)
+
+第三方软件
+
+```bash
+pip install gradio gradio-pdf
+```
+
+## 启动gradio应用
+
+```bash
+python app.py
+```
+
+## 使用gradio应用
+
+在浏览器中访问 http://127.0.0.1:7860

+ 23 - 18
app.py → projects/gradio_app/app.py

@@ -14,8 +14,6 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
 from magic_pdf.tools.common import do_parse, prepare_env
 
-os.system("pip install gradio")
-os.system("pip install gradio-pdf")
 import gradio as gr
 from gradio_pdf import PDF
 
@@ -25,13 +23,16 @@ def read_fn(path):
     return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
 
 
-def parse_pdf(doc_path, output_dir, end_page_id):
+def parse_pdf(doc_path, output_dir, end_page_id, is_ocr):
     os.makedirs(output_dir, exist_ok=True)
 
     try:
         file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
         pdf_data = read_fn(doc_path)
-        parse_method = "auto"
+        if is_ocr:
+            parse_method = "ocr"
+        else:
+            parse_method = "auto"
         local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
         do_parse(
             output_dir,
@@ -92,9 +93,9 @@ def replace_image_with_base64(markdown_text, image_dir_path):
     return re.sub(pattern, replace, markdown_text)
 
 
-def to_markdown(file_path, end_pages):
+def to_markdown(file_path, end_pages, is_ocr):
     # Get the paths of the recognized Markdown file and the zip archive
-    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
+    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr)
     archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
     zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
     if zip_archive_success == 0:
@@ -111,14 +112,6 @@ def to_markdown(file_path, end_pages):
     return md_content, txt_content, archive_zip_path, new_pdf_path
 
 
-# def show_pdf(file_path):
-#     with open(file_path, "rb") as f:
-#         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
-#     pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
-#                   f'width="100%" height="1000" type="application/pdf">'
-#     return pdf_display
-
-
 latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
                     {"left": '$', "right": '$', "display": False}]
 
@@ -141,16 +134,29 @@ model_init = init_model()
 logger.info(f"model_init: {model_init}")
 
 
+with open("header.html", "r") as file:
+    header = file.read()
+
+
 if __name__ == "__main__":
     with gr.Blocks() as demo:
+        gr.HTML(header)
         with gr.Row():
             with gr.Column(variant='panel', scale=5):
                 pdf_show = gr.Markdown()
                 max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
                 with gr.Row() as bu_flow:
+                    is_ocr = gr.Checkbox(label="Force enable OCR")
                     change_bu = gr.Button("Convert")
                     clear_bu = gr.ClearButton([pdf_show], value="Clear")
                 pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
+                with gr.Accordion("Examples:"):
+                    example_root = os.path.join(os.path.dirname(__file__), "examples")
+                    gr.Examples(
+                        examples=[os.path.join(example_root, _) for _ in os.listdir(example_root) if
+                                  _.endswith("pdf")],
+                        inputs=pdf_show,
+                    )
 
             with gr.Column(variant='panel', scale=5):
                 output_file = gr.File(label="convert result", interactive=False)
@@ -160,8 +166,7 @@ if __name__ == "__main__":
                                          latex_delimiters=latex_delimiters, line_breaks=True)
                     with gr.Tab("Markdown text"):
                         md_text = gr.TextArea(lines=45, show_copy_button=True)
-        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
-        clear_bu.add([md, pdf_show, md_text, output_file])
-
-    demo.launch()
+        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages, is_ocr], outputs=[md, md_text, output_file, pdf_show])
+        clear_bu.add([md, pdf_show, md_text, output_file, is_ocr])
 
+    demo.launch()
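
Review note: in the `app.py` hunks above, the example PDFs are located relative to `os.path.dirname(__file__)`, while `header.html` is opened by bare name and therefore resolved against the current working directory. The example filter `_.endswith("pdf")` is also case-sensitive and accepts any name that merely ends in the letters `pdf`. The sketch below shows one way to treat both consistently; `resource_path` and `pick_pdfs` are hypothetical helper names, not functions from the project.

```python
from pathlib import Path


def resource_path(name, base=None):
    """Resolve a resource next to this script unless another base is given."""
    if base is None:
        base = Path(__file__).resolve().parent
    return Path(base) / name


def pick_pdfs(names):
    """Keep only names with a .pdf extension, ignoring case, in sorted order."""
    return sorted(n for n in names if n.lower().endswith(".pdf"))


print(resource_path("header.html", "/srv/app"))
print(pick_pdfs(["scanned.pdf", "notes.txt", "notapdf", "paper.PDF"]))
```

With helpers like these, launching the app from another directory would still find `header.html`, assuming the file stays beside `app.py`.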

BIN
projects/gradio_app/examples/academic_paper_formula.pdf


BIN
projects/gradio_app/examples/academic_paper_img_formula.pdf


BIN
projects/gradio_app/examples/garbled_formula.pdf


BIN
projects/gradio_app/examples/garbled_formula2.pdf


BIN
projects/gradio_app/examples/garbled_img_formula.pdf


BIN
projects/gradio_app/examples/scanned.pdf


+ 109 - 0
projects/gradio_app/header.html

@@ -0,0 +1,109 @@
+<html><head>
+  <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.15.4/css/all.css">
+<style>
+  .link-block {
+    border: 1px solid transparent;
+    border-radius: 24px;
+    background-color: rgba(54, 54, 54, 1);
+    cursor: pointer !important;
+  }
+  .link-block:hover {
+    background-color: rgba(54, 54, 54, 0.75) !important;
+    cursor: pointer !important;
+  }
+  .external-link {
+    display: inline-flex;
+    align-items: center;
+    height: 36px;
+    line-height: 36px;
+    padding: 0 16px;
+    cursor: pointer !important;
+  }
+  .external-link,
+  .external-link:hover {
+    cursor: pointer !important;
+  }
+  a {
+    text-decoration: none;
+  }
+</style></head>
+
+<body>
+  <div style="
+      display: flex;
+      flex-direction: column;
+      justify-content: center;
+      align-items: center;
+      text-align: center;
+      background: linear-gradient(45deg, #007bff 0%, #0056b3 100%);
+      padding: 24px;
+      gap: 24px;
+      border-radius: 8px;
+    ">
+    <div style="
+        display: flex;
+        flex-direction: column;
+        align-items: center;
+        gap: 16px;
+      ">
+      <div style="display: flex; flex-direction: column; gap: 8px">
+        <h1 style="
+            font-size: 48px;
+            color: #fafafa;
+            margin: 0;
+            font-family: 'Trebuchet MS', 'Lucida Sans Unicode',
+              'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
+          ">
+          MinerU: PDF Extraction Demo
+        </h1>
+      </div>
+    </div>
+
+    <p style="
+        margin: 0;
+        line-height: 1.6rem;
+        font-size: 16px;
+        color: #fafafa;
+        opacity: 0.8;
+      ">
+      A one-stop, open-source, high-quality data extraction tool that supports
+      PDF/webpage/e-book extraction.<br>
+    </p>
+    <style>
+      .link-block {
+        display: inline-block;
+      }
+      .link-block + .link-block {
+        margin-left: 20px;
+      }
+    </style>
+
+    <div class="column has-text-centered">
+      <div class="publication-links">
+        <!-- Code Link. -->
+        <span class="link-block">
+          <a href="https://github.com/opendatalab/MinerU" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
+            <span class="icon" style="margin-right: 4px">
+              <i class="fab fa-github" style="color: white; margin-right: 4px"></i>
+            </span>
+            <span style="color: white">Code</span>
+          </a>
+        </span>
+
+        <!-- Homepage Link. -->
+        <span class="link-block">
+          <a href="https://opendatalab.com/" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
+            <span class="icon" style="margin-right: 8px">
+              <i class="fas fa-globe" style="color: white"></i>
+            </span>
+            <span style="color: white">Homepage</span>
+          </a>
+        </span>
+      </div>
+    </div>
+
+    <!-- New Demo Links -->
+  </div>
+
+
+</body></html>

+ 3 - 0
projects/gradio_app/requirements.txt

@@ -0,0 +1,3 @@
+magic-pdf[full]>=0.8.0
+gradio
+gradio-pdf

+ 80 - 39
projects/llama_index_rag/README_zh-CN.md

@@ -1,20 +1,64 @@
-## 安装
+<details open="open">
+  <summary><h2 style="display: inline-block">目录</h2></summary>
+  <ol>
+    <li><a href="#介绍">介绍</a></li>
+    <li><a href="#安装">安装</a></li>
+    <li><a href="#示例">示例</a></li>
+    <li><a href="#开发">开发</a></li>
+  </ol>
+</details>
 
-MinerU
+## 介绍
 
-```bash
-git clone https://github.com/opendatalab/MinerU.git
-cd MinerU
+`MinerU` 提供数据 `API接口` 以支持用户导入数据到 `RAG` 系统。本项目将基于`通义千问`展示如何构建一个轻量级的 `RAG` 系统。
+
+<p align="center">
+  <img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
+</p>
+
+## 安装
 
-conda create -n MinerU python=3.10
-conda activate MinerU
-pip install .[full] --extra-index-url https://wheels.myhloli.com
+环境要求
+
+```text
+NVIDIA A100 80GB,
+Centos 7 3.10.0-957.el7.x86_64
+
+Client: Docker Engine - Community
+ Version:           24.0.5
+ API version:       1.43
+ Go version:        go1.20.6
+ Git commit:        ced0996
+ Built:             Fri Jul 21 20:39:02 2023
+ OS/Arch:           linux/amd64
+ Context:           default
+
+Server: Docker Engine - Community
+ Engine:
+  Version:          24.0.5
+  API version:      1.43 (minimum version 1.12)
+  Go version:       go1.20.6
+  Git commit:       a61e2b4
+  Built:            Fri Jul 21 20:38:05 2023
+  OS/Arch:          linux/amd64
+  Experimental:     false
+ containerd:
+  Version:          1.6.25
+  GitCommit:        d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
+ runc:
+  Version:          1.1.10
+  GitCommit:        v1.1.10-0-g18a0cb0
+ docker-init:
+  Version:          0.19.0
+  GitCommit:        de40ad0
 ```
 
+请参考[文档](../../README_zh-CN.md) 安装 MinerU
+
 第三方软件
 
 ```bash
 # install
+pip install modelscope==1.14.0
 pip install llama-index-vector-stores-elasticsearch==0.2.0
 pip install llama-index-embeddings-dashscope==0.2.0
 pip install llama-index-core==0.10.68
@@ -26,39 +70,12 @@ pip install accelerate==0.33.0
 pip uninstall transformer-engine
 ```
 
-## 环境配置
-
-```
-export DASHSCOPE_API_KEY={some_key}
-export ES_USER={some_es_user}
-export ES_PASSWORD={some_es_password}
-export ES_URL=http://{es_url}:9200
-```
-
-DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
-
-## 使用
-
-### 导入数据
-
-```bash
-python data_ingestion.py -p some.pdf  # load data from pdf
-
-    or
-
-python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}
-```
-
-### 查询
-
-```bash
-python query.py --question '{the_question_you_want_to_ask}'
-```
 
 ## 示例
 
 ````bash
-# 启动 es 服务
+cd  projects/llama_index_rag
+
 docker compose up -d
 
 or
@@ -67,17 +84,41 @@ docker-compose up -d
 
 
 # 配置环境变量
+
 export ES_USER=elastic
 export ES_PASSWORD=llama_index
 export ES_URL=http://127.0.0.1:9200
 export DASHSCOPE_API_KEY={some_key}
 
 
+# DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
+
+# 未导入数据,查询问题。返回通义千问默认答案
+python query.py -q 'how about the rights of men'
+
+## outputs
+question: how about the rights of men
+answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
+
+    Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
+    Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
+    Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
+    Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
+    Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
+    Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
+    Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
+
+
 # Import data
-python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
+python data_ingestion.py -p example/data/
+
+or
+
+python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
+
 
+# Query after importing data. The Tongyi Qianwen model answers using the RAG system's retrieval results as context.
 
-# Ask a question
 python query.py -q 'how about the rights of men'
 
 ## outputs

BIN
projects/llama_index_rag/rag_data_api.png


+ 1 - 1
requirements-docker.txt

@@ -8,7 +8,7 @@ fast-langdetect==0.2.0
 wordninja>=2.0.0
 scikit-learn>=1.0.2
 pdfminer.six==20231228
-unimernet==0.1.6
+unimernet==0.2.1
 matplotlib
 ultralytics
 paddleocr==2.7.3

+ 1 - 1
setup.py

@@ -36,7 +36,7 @@ if __name__ == '__main__':
                      "paddlepaddle==3.0.0b1;platform_system=='Linux'",
                      "paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
                      ],
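-            "full": ["unimernet==0.1.6",  # version 0.1.6 greatly trims the dependency range; this version is recommended
+            "full": ["unimernet==0.2.1",  # upgrade unimernet to 0.2.1
                      "matplotlib<=3.9.0;platform_system=='Windows'",  # 3.9.1 and later ship no prebuilt Windows packages; cap the version so Windows machines without a build environment do not fail to install
                      "matplotlib;platform_system=='Linux' or platform_system=='Darwin'",  # do not cap matplotlib on Linux and macOS, to avoid bugs caused by being unable to upgrade
                      "ultralytics",  # yolov8, formula detection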

+ 24 - 0
tests/clean_coverage.py

@@ -0,0 +1,24 @@
+"""
+clean coverage
+"""
+import os
+import shutil
+
+def delete_file(path):
+    """delete file."""
+    # A missing path matches neither branch, so it is silently skipped.
+    if os.path.isfile(path):
+        try:
+            os.remove(path)
+            print(f"File '{path}' deleted.")
+        except OSError as e:
+            print(f"Error deleting file '{path}': {e}")
+    elif os.path.isdir(path):
+        try:
+            shutil.rmtree(path)
+            print(f"Directory '{path}' and its contents deleted.")
+        except OSError as e:
+            print(f"Error deleting directory '{path}': {e}")
+
+if __name__ == "__main__":
+    delete_file("htmlcov")

+ 1 - 1
tests/get_coverage.py

@@ -2,7 +2,7 @@
 get cov
 """
 from bs4 import BeautifulSoup
-
+import shutil
 def get_covrage():
     """get covrage"""
     # Send a request to fetch the page content

+ 2 - 1
tests/retry_env.sh

@@ -8,7 +8,8 @@ while true; do
     # prepare env
     source activate MinerU
     pip install -r requirements-qa.txt
-    pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
+    pip uninstall -y magic-pdf
+    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
     pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
     exit_code=$?
     if [ $exit_code -eq 0 ]; then

+ 4 - 3
tests/test_cli/conf/conf.py

@@ -2,6 +2,7 @@ import os
 conf = {
 "code_path": os.environ.get('GITHUB_WORKSPACE'),
 "pdf_dev_path" : os.environ.get('GITHUB_WORKSPACE') + "/tests/test_cli/pdf_dev",
-"pdf_res_path": "/tmp/magic-pdf"
-}
-
+"pdf_res_path": "/tmp/magic-pdf",
+"jsonl_path": "s3://llm-qatest-pnorm/mineru/test/line1.jsonl",
+"s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test.pdf"
+}
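
Note on the config above: `os.environ.get('GITHUB_WORKSPACE')` returns `None` when the variable is unset (for example, outside GitHub Actions), so the string concatenation in `pdf_dev_path` raises `TypeError` as soon as the module is imported. A minimal sketch of a more forgiving variant follows. It assumes that falling back to the current working directory is acceptable; `build_conf` is a hypothetical helper and is not part of the repository.

```python
import os


def build_conf(environ=None):
    """Build the test config dict (hypothetical helper, not in the repo).

    Assumption: when GITHUB_WORKSPACE is unset or empty, the current
    working directory is used as the code path.
    """
    environ = os.environ if environ is None else environ
    code_path = environ.get("GITHUB_WORKSPACE") or os.getcwd()
    return {
        "code_path": code_path,
        "pdf_dev_path": os.path.join(code_path, "tests", "test_cli", "pdf_dev"),
        "pdf_res_path": "/tmp/magic-pdf",
        "jsonl_path": "s3://llm-qatest-pnorm/mineru/test/line1.jsonl",
        "s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test.pdf",
    }


conf = build_conf()
```

Passing a plain dict as `environ` makes the fallback easy to check without touching the real environment.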

File diff suppressed because it is too large
+ 0 - 0
tests/test_cli/pdf_dev/line1.jsonl


+ 1472 - 0
tests/test_cli/pdf_dev/test_model.json
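
Reading aid for this fixture: each page is an object with a `layout_dets` list and a `page_info` block, and each detection stores its box as an 8-number `poly` (four x, y corner pairs) plus a `score`. In this fixture, the entries with `category_id` 13 also carry a `latex` string. A minimal sketch of how such a file could be summarized is shown here. The helper names and the `min_score=0.5` threshold are assumptions for illustration, not something the repository defines.

```python
from collections import defaultdict


def poly_to_bbox(poly):
    """Collapse [x0, y0, x1, y1, x2, y2, x3, y3] into (xmin, ymin, xmax, ymax)."""
    xs, ys = poly[0::2], poly[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def group_boxes(pages, min_score=0.5):
    """Map (page_no, category_id) to bboxes whose score is >= min_score.

    min_score is an illustrative threshold chosen for this sketch.
    """
    grouped = defaultdict(list)
    for page in pages:
        page_no = page["page_info"]["page_no"]
        for det in page["layout_dets"]:
            if det["score"] >= min_score:
                key = (page_no, det["category_id"])
                grouped[key].append(poly_to_bbox(det["poly"]))
    return dict(grouped)


# Two detections copied from page 0 of the fixture.
sample = [{
    "layout_dets": [
        {"category_id": 13, "score": 0.91, "latex": "+24.4\\%",
         "poly": [590, 747, 688, 747, 688, 778, 590, 778]},
        {"category_id": 13, "score": 0.34, "latex": "@",
         "poly": [238, 689, 264, 689, 264, 717, 238, 717]},
    ],
    "page_info": {"page_no": 0, "height": 2339, "width": 1654},
}]

print(group_boxes(sample))  # {(0, 13): [(590, 747, 688, 778)]}
```

To run it against the full file, load the file with `json.load` and pass the resulting list as `pages`.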

@@ -0,0 +1,1472 @@
+[
+    {
+        "layout_dets": [
+            {
+                "category_id": 1,
+                "poly": [
+                    578.2055053710938,
+                    672.8831787109375,
+                    1579.973388671875,
+                    672.8831787109375,
+                    1579.973388671875,
+                    1034.681640625,
+                    578.2055053710938,
+                    1034.681640625
+                ],
+                "score": 0.9999963045120239
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    583.6041259765625,
+                    1067.1112060546875,
+                    1579.822265625,
+                    1067.1112060546875,
+                    1579.822265625,
+                    1537.1324462890625,
+                    583.6041259765625,
+                    1537.1324462890625
+                ],
+                "score": 0.9999961853027344
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    585.4341430664062,
+                    1568.220703125,
+                    1578.5487060546875,
+                    1568.220703125,
+                    1578.5487060546875,
+                    1931.516845703125,
+                    585.4341430664062,
+                    1931.516845703125
+                ],
+                "score": 0.9999949336051941
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    578.491455078125,
+                    532.0020141601562,
+                    1577.96337890625,
+                    532.0020141601562,
+                    1577.96337890625,
+                    641.0128784179688,
+                    578.491455078125,
+                    641.0128784179688
+                ],
+                "score": 0.999992847442627
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    66.43791961669922,
+                    1776.6951904296875,
+                    530.4810180664062,
+                    1776.6951904296875,
+                    530.4810180664062,
+                    1883.127685546875,
+                    66.43791961669922,
+                    1883.127685546875
+                ],
+                "score": 0.9999925494194031
+            },
+            {
+                "category_id": 3,
+                "poly": [
+                    70.23656463623047,
+                    818.9393920898438,
+                    517.8253784179688,
+                    818.9393920898438,
+                    517.8253784179688,
+                    1076.5823974609375,
+                    70.23656463623047,
+                    1076.5823974609375
+                ],
+                "score": 0.9999912977218628
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    64.99957275390625,
+                    651.9596557617188,
+                    436.5134582519531,
+                    651.9596557617188,
+                    436.5134582519531,
+                    723.5758056640625,
+                    64.99957275390625,
+                    723.5758056640625
+                ],
+                "score": 0.9999804496765137
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    556.2775268554688,
+                    270.2123107910156,
+                    1577.8211669921875,
+                    270.2123107910156,
+                    1577.8211669921875,
+                    408.9685974121094,
+                    556.2775268554688,
+                    408.9685974121094
+                ],
+                "score": 0.9999696016311646
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    67.8562240600586,
+                    1342.2239990234375,
+                    530.5654296875,
+                    1342.2239990234375,
+                    530.5654296875,
+                    1447.843017578125,
+                    67.8562240600586,
+                    1447.843017578125
+                ],
+                "score": 0.9999648928642273
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    65.74958801269531,
+                    1631.3671875,
+                    530.32861328125,
+                    1631.3671875,
+                    530.32861328125,
+                    1772.413818359375,
+                    65.74958801269531,
+                    1772.413818359375
+                ],
+                "score": 0.9999628067016602
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    588.5570068359375,
+                    2068.54931640625,
+                    1525.3253173828125,
+                    2068.54931640625,
+                    1525.3253173828125,
+                    2103.89013671875,
+                    588.5570068359375,
+                    2103.89013671875
+                ],
+                "score": 0.9999607801437378
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    586.5548706054688,
+                    1963.105712890625,
+                    1556.578125,
+                    1963.105712890625,
+                    1556.578125,
+                    2034.8116455078125,
+                    586.5548706054688,
+                    2034.8116455078125
+                ],
+                "score": 0.9999469518661499
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    59.96487045288086,
+                    1110.6282958984375,
+                    529.9209594726562,
+                    1110.6282958984375,
+                    529.9209594726562,
+                    1225.2921142578125,
+                    59.96487045288086,
+                    1225.2921142578125
+                ],
+                "score": 0.999945878982544
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    70.25292205810547,
+                    103.42201232910156,
+                    420.4892578125,
+                    103.42201232910156,
+                    420.4892578125,
+                    223.39370727539062,
+                    70.25292205810547,
+                    223.39370727539062
+                ],
+                "score": 0.9999405145645142
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1081.0203857421875,
+                    2244.87890625,
+                    1554.669189453125,
+                    2244.87890625,
+                    1554.669189453125,
+                    2275.28662109375,
+                    1081.0203857421875,
+                    2275.28662109375
+                ],
+                "score": 0.9999217987060547
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    68.85404968261719,
+                    345.9093017578125,
+                    307.9080810546875,
+                    345.9093017578125,
+                    307.9080810546875,
+                    409.0098876953125,
+                    68.85404968261719,
+                    409.0098876953125
+                ],
+                "score": 0.9999183416366577
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    65.58759307861328,
+                    1295.9366455078125,
+                    180.4149932861328,
+                    1295.9366455078125,
+                    180.4149932861328,
+                    1328.867919921875,
+                    65.58759307861328,
+                    1328.867919921875
+                ],
+                "score": 0.9998926520347595
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1245.0789794921875,
+                    108.83513641357422,
+                    1576.3131103515625,
+                    108.83513641357422,
+                    1576.3131103515625,
+                    219.29042053222656,
+                    1245.0789794921875,
+                    219.29042053222656
+                ],
+                "score": 0.9995975494384766
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    65.75041961669922,
+                    483.5210266113281,
+                    428.6028137207031,
+                    483.5210266113281,
+                    428.6028137207031,
+                    586.8894653320312,
+                    65.75041961669922,
+                    586.8894653320312
+                ],
+                "score": 0.9993270635604858
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    65.02926635742188,
+                    445.02288818359375,
+                    208.3317108154297,
+                    445.02288818359375,
+                    208.3317108154297,
+                    476.65252685546875,
+                    65.02926635742188,
+                    476.65252685546875
+                ],
+                "score": 0.9992279410362244
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    556.96630859375,
+                    453.08447265625,
+                    673.0485229492188,
+                    453.08447265625,
+                    673.0485229492188,
+                    490.60455322265625,
+                    556.96630859375,
+                    490.60455322265625
+                ],
+                "score": 0.9949817657470703
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    66.26518249511719,
+                    1524.234130859375,
+                    530.2540283203125,
+                    1524.234130859375,
+                    530.2540283203125,
+                    1627.5291748046875,
+                    66.26518249511719,
+                    1627.5291748046875
+                ],
+                "score": 0.9919581413269043
+            },
+            {
+                "category_id": 7,
+                "poly": [
+                    62.5564079284668,
+                    1227.41943359375,
+                    380.10693359375,
+                    1227.41943359375,
+                    380.10693359375,
+                    1252.8614501953125,
+                    62.5564079284668,
+                    1252.8614501953125
+                ],
+                "score": 0.9918426275253296
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    66.80464935302734,
+                    1451.4775390625,
+                    527.3795166015625,
+                    1451.4775390625,
+                    527.3795166015625,
+                    1519.5836181640625,
+                    66.80464935302734,
+                    1519.5836181640625
+                ],
+                "score": 0.9883899688720703
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    65.36080932617188,
+                    605.3754272460938,
+                    181.24375915527344,
+                    605.3754272460938,
+                    181.24375915527344,
+                    637.0076904296875,
+                    65.36080932617188,
+                    637.0076904296875
+                ],
+                "score": 0.9870840311050415
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    178.82904052734375,
+                    264.6627197265625,
+                    396.52825927734375,
+                    264.6627197265625,
+                    396.52825927734375,
+                    315.41900634765625,
+                    178.82904052734375,
+                    315.41900634765625
+                ],
+                "score": 0.9779323935508728
+            },
+            {
+                "category_id": 4,
+                "poly": [
+                    66.15127563476562,
+                    767.24658203125,
+                    181.25694274902344,
+                    767.24658203125,
+                    181.25694274902344,
+                    799.7832641601562,
+                    66.15127563476562,
+                    799.7832641601562
+                ],
+                "score": 0.8932801485061646
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    590,
+                    747,
+                    688,
+                    747,
+                    688,
+                    778,
+                    590,
+                    778
+                ],
+                "score": 0.91,
+                "latex": "+24.4\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1433,
+                    855,
+                    1492,
+                    855,
+                    1492,
+                    886,
+                    1433,
+                    886
+                ],
+                "score": 0.86,
+                "latex": "30\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    238,
+                    689,
+                    264,
+                    689,
+                    264,
+                    717,
+                    238,
+                    717
+                ],
+                "score": 0.34,
+                "latex": "@"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    702,
+                    1002,
+                    722,
+                    1002,
+                    722,
+                    1026,
+                    702,
+                    1026
+                ],
+                "score": 0.33,
+                "latex": "^+"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    177,
+                    1154,
+                    223,
+                    1154,
+                    223,
+                    1185,
+                    177,
+                    1185
+                ],
+                "score": 0.28,
+                "latex": "(\\%)"
+            }
+        ],
+        "page_info": {
+            "page_no": 0,
+            "height": 2339,
+            "width": 1654
+        }
+    },
+    {
+        "layout_dets": [
+            {
+                "category_id": 2,
+                "poly": [
+                    88.00849151611328,
+                    31.891826629638672,
+                    300.7432861328125,
+                    31.891826629638672,
+                    300.7432861328125,
+                    113.5999755859375,
+                    88.00849151611328,
+                    113.5999755859375
+                ],
+                "score": 0.9999986886978149
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    771.0192260742188,
+                    2213.479248046875,
+                    827.4273681640625,
+                    2213.479248046875,
+                    827.4273681640625,
+                    2239.40185546875,
+                    771.0192260742188,
+                    2239.40185546875
+                ],
+                "score": 0.9999963641166687
+            },
+            {
+                "category_id": 7,
+                "poly": [
+                    544.2962646484375,
+                    488.5493469238281,
+                    988.3958129882812,
+                    488.5493469238281,
+                    988.3958129882812,
+                    541.0634155273438,
+                    544.2962646484375,
+                    541.0634155273438
+                ],
+                "score": 0.9999918341636658
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1082.88232421875,
+                    82.37471771240234,
+                    1519.4150390625,
+                    82.37471771240234,
+                    1519.4150390625,
+                    114.9271011352539,
+                    1082.88232421875,
+                    114.9271011352539
+                ],
+                "score": 0.9999632835388184
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1009.1597900390625,
+                    2210.9462890625,
+                    1535.9239501953125,
+                    2210.9462890625,
+                    1535.9239501953125,
+                    2241.830322265625,
+                    1009.1597900390625,
+                    2241.830322265625
+                ],
+                "score": 0.9999324679374695
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    537.349365234375,
+                    156.8784637451172,
+                    1584.9866943359375,
+                    156.8784637451172,
+                    1584.9866943359375,
+                    485.3042907714844,
+                    537.349365234375,
+                    485.3042907714844
+                ],
+                "score": 0.9985955953598022
+            },
+            {
+                "category_id": 7,
+                "poly": [
+                    62.69784927368164,
+                    443.4034118652344,
+                    249.9097137451172,
+                    443.4034118652344,
+                    249.9097137451172,
+                    467.4612731933594,
+                    62.69784927368164,
+                    467.4612731933594
+                ],
+                "score": 0.9873980283737183
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    61.374210357666016,
+                    138.51153564453125,
+                    528.30517578125,
+                    138.51153564453125,
+                    528.30517578125,
+                    443.5376281738281,
+                    61.374210357666016,
+                    443.5376281738281
+                ],
+                "score": 0.9232220649719238
+            },
+            {
+                "category_id": 6,
+                "poly": [
+                    548.1119384765625,
+                    148.7312774658203,
+                    797.3070678710938,
+                    148.7312774658203,
+                    797.3070678710938,
+                    180.74609375,
+                    548.1119384765625,
+                    180.74609375
+                ],
+                "score": 0.6074804663658142
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    864,
+                    455,
+                    922,
+                    455,
+                    922,
+                    482,
+                    864,
+                    482
+                ],
+                "score": 0.74,
+                "latex": "6.0\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    850,
+                    418,
+                    922,
+                    418,
+                    922,
+                    445,
+                    850,
+                    445
+                ],
+                "score": 0.64,
+                "latex": "35.3\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1501,
+                    270,
+                    1571,
+                    270,
+                    1571,
+                    298,
+                    1501,
+                    298
+                ],
+                "score": 0.54,
+                "latex": "13.8\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1013,
+                    454,
+                    1083,
+                    454,
+                    1083,
+                    482,
+                    1013,
+                    482
+                ],
+                "score": 0.52,
+                "latex": "15.0\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1012,
+                    417,
+                    1083,
+                    417,
+                    1083,
+                    444,
+                    1012,
+                    444
+                ],
+                "score": 0.52,
+                "latex": "33.7\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    689,
+                    456,
+                    725,
+                    456,
+                    725,
+                    482,
+                    689,
+                    482
+                ],
+                "score": 0.48,
+                "latex": "(\\%)"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    850,
+                    344,
+                    922,
+                    344,
+                    922,
+                    372,
+                    850,
+                    372
+                ],
+                "score": 0.4,
+                "latex": "83.8\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    863,
+                    270,
+                    922,
+                    270,
+                    922,
+                    298,
+                    863,
+                    298
+                ],
+                "score": 0.4,
+                "latex": "4.5\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1334,
+                    270,
+                    1406,
+                    270,
+                    1406,
+                    298,
+                    1334,
+                    298
+                ],
+                "score": 0.35,
+                "latex": "37.2\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    618,
+                    419,
+                    656,
+                    419,
+                    656,
+                    446,
+                    618,
+                    446
+                ],
+                "score": 0.35,
+                "latex": "(\\%)"
+            }
+        ],
+        "page_info": {
+            "page_no": 1,
+            "height": 2339,
+            "width": 1654
+        }
+    },
+    {
+        "layout_dets": [
+            {
+                "category_id": 2,
+                "poly": [
+                    87.9037094116211,
+                    31.59800148010254,
+                    300.9930419921875,
+                    31.59800148010254,
+                    300.9930419921875,
+                    113.4053955078125,
+                    87.9037094116211,
+                    113.4053955078125
+                ],
+                "score": 0.9999939799308777
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1008.992919921875,
+                    2209.248779296875,
+                    1534.9334716796875,
+                    2209.248779296875,
+                    1534.9334716796875,
+                    2242.77294921875,
+                    1008.992919921875,
+                    2242.77294921875
+                ],
+                "score": 0.9999377131462097
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    770.6600341796875,
+                    2212.857666015625,
+                    827.4126586914062,
+                    2212.857666015625,
+                    827.4126586914062,
+                    2239.77197265625,
+                    770.6600341796875,
+                    2239.77197265625
+                ],
+                "score": 0.9998395442962646
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1082.096923828125,
+                    82.25012969970703,
+                    1518.9267578125,
+                    82.25012969970703,
+                    1518.9267578125,
+                    114.52576446533203,
+                    1082.096923828125,
+                    114.52576446533203
+                ],
+                "score": 0.9996457099914551
+            },
+            {
+                "category_id": 7,
+                "poly": [
+                    95.39900970458984,
+                    1846.6380615234375,
+                    564.4166870117188,
+                    1846.6380615234375,
+                    564.4166870117188,
+                    1899.209716796875,
+                    95.39900970458984,
+                    1899.209716796875
+                ],
+                "score": 0.9908766746520996
+            },
+            {
+                "category_id": 6,
+                "poly": [
+                    95.4662094116211,
+                    173.42832946777344,
+                    470.21905517578125,
+                    173.42832946777344,
+                    470.21905517578125,
+                    217.74632263183594,
+                    95.4662094116211,
+                    217.74632263183594
+                ],
+                "score": 0.9437939524650574
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    854.1142578125,
+                    1043.93603515625,
+                    1592.0174560546875,
+                    1043.93603515625,
+                    1592.0174560546875,
+                    1846.16552734375,
+                    854.1142578125,
+                    1846.16552734375
+                ],
+                "score": 0.8844046592712402
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    92.02946472167969,
+                    1331.8909912109375,
+                    814.2915649414062,
+                    1331.8909912109375,
+                    814.2915649414062,
+                    1842.6195068359375,
+                    92.02946472167969,
+                    1842.6195068359375
+                ],
+                "score": 0.8743430972099304
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    851.83984375,
+                    224.99559020996094,
+                    1592.4068603515625,
+                    224.99559020996094,
+                    1592.4068603515625,
+                    1018.7105712890625,
+                    851.83984375,
+                    1018.7105712890625
+                ],
+                "score": 0.8650150299072266
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    91.79800415039062,
+                    224.10838317871094,
+                    816.58154296875,
+                    224.10838317871094,
+                    816.58154296875,
+                    1248.422607421875,
+                    91.79800415039062,
+                    1248.422607421875
+                ],
+                "score": 0.8604844808578491
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    85.19661712646484,
+                    220.71524047851562,
+                    1602.3074951171875,
+                    220.71524047851562,
+                    1602.3074951171875,
+                    1844.488525390625,
+                    85.19661712646484,
+                    1844.488525390625
+                ],
+                "score": 0.6638449430465698
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    737,
+                    704,
+                    804,
+                    704,
+                    804,
+                    730,
+                    737,
+                    730
+                ],
+                "score": 0.56,
+                "latex": "\\pmb{26.5\\%}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    738,
+                    673,
+                    804,
+                    673,
+                    804,
+                    699,
+                    738,
+                    699
+                ],
+                "score": 0.48,
+                "latex": "\\pmb{16.2\\%}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    736,
+                    767,
+                    805,
+                    767,
+                    805,
+                    795,
+                    736,
+                    795
+                ],
+                "score": 0.48,
+                "latex": "\\mathbf{\\lambda_{23.7\\%}}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    231,
+                    611,
+                    268,
+                    611,
+                    268,
+                    638,
+                    231,
+                    638
+                ],
+                "score": 0.47,
+                "latex": "(\\%)"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    749,
+                    736,
+                    804,
+                    736,
+                    804,
+                    763,
+                    749,
+                    763
+                ],
+                "score": 0.41,
+                "latex": "\\pmb{9.2\\%}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    737,
+                    641,
+                    804,
+                    641,
+                    804,
+                    668,
+                    737,
+                    668
+                ],
+                "score": 0.41,
+                "latex": "{\\bf38.0\\%}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    748,
+                    577,
+                    805,
+                    577,
+                    805,
+                    606,
+                    748,
+                    606
+                ],
+                "score": 0.35,
+                "latex": "0.1\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    187,
+                    800,
+                    222,
+                    800,
+                    222,
+                    827,
+                    187,
+                    827
+                ],
+                "score": 0.32,
+                "latex": "(\\%)"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    738,
+                    830,
+                    805,
+                    830,
+                    805,
+                    857,
+                    738,
+                    857
+                ],
+                "score": 0.28,
+                "latex": "\\mathbf{13.8\\%}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    737,
+                    862,
+                    805,
+                    862,
+                    805,
+                    889,
+                    737,
+                    889
+                ],
+                "score": 0.27,
+                "latex": "\\mathbf{31.9\\%}"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    736,
+                    955,
+                    804,
+                    955,
+                    804,
+                    983,
+                    736,
+                    983
+                ],
+                "score": 0.26,
+                "latex": "\\pmb{65.3\\%}"
+            }
+        ],
+        "page_info": {
+            "page_no": 2,
+            "height": 2339,
+            "width": 1654
+        }
+    },
+    {
+        "layout_dets": [
+            {
+                "category_id": 2,
+                "poly": [
+                    86.3010025024414,
+                    32.05937194824219,
+                    303.65325927734375,
+                    32.05937194824219,
+                    303.65325927734375,
+                    114.77494049072266,
+                    86.3010025024414,
+                    114.77494049072266
+                ],
+                "score": 0.9999954700469971
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    108.4952392578125,
+                    590.2026977539062,
+                    1536.75146484375,
+                    590.2026977539062,
+                    1536.75146484375,
+                    688.4915771484375,
+                    108.4952392578125,
+                    688.4915771484375
+                ],
+                "score": 0.9999932646751404
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    95.94864654541016,
+                    1205.4134521484375,
+                    252.92477416992188,
+                    1205.4134521484375,
+                    252.92477416992188,
+                    1246.0015869140625,
+                    95.94864654541016,
+                    1246.0015869140625
+                ],
+                "score": 0.999992847442627
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    106.48407745361328,
+                    338.27471923828125,
+                    1568.86328125,
+                    338.27471923828125,
+                    1568.86328125,
+                    437.84783935546875,
+                    106.48407745361328,
+                    437.84783935546875
+                ],
+                "score": 0.9999897480010986
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    767.6918334960938,
+                    2212.269287109375,
+                    830.787353515625,
+                    2212.269287109375,
+                    830.787353515625,
+                    2239.28515625,
+                    767.6918334960938,
+                    2239.28515625
+                ],
+                "score": 0.9999850988388062
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    96.18482208251953,
+                    508.36334228515625,
+                    291.4427490234375,
+                    508.36334228515625,
+                    291.4427490234375,
+                    549.4661865234375,
+                    96.18482208251953,
+                    549.4661865234375
+                ],
+                "score": 0.9999837875366211
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1082.2672119140625,
+                    81.18732452392578,
+                    1520.2149658203125,
+                    81.18732452392578,
+                    1520.2149658203125,
+                    116.55751037597656,
+                    1082.2672119140625,
+                    116.55751037597656
+                ],
+                "score": 0.9999496340751648
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    96.45167541503906,
+                    157.92835998535156,
+                    319.21392822265625,
+                    157.92835998535156,
+                    319.21392822265625,
+                    213.8436279296875,
+                    96.45167541503906,
+                    213.8436279296875
+                ],
+                "score": 0.9999274015426636
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    96.99238586425781,
+                    257.6522216796875,
+                    483.6472473144531,
+                    257.6522216796875,
+                    483.6472473144531,
+                    301.53717041015625,
+                    96.99238586425781,
+                    301.53717041015625
+                ],
+                "score": 0.9999104738235474
+            },
+            {
+                "category_id": 2,
+                "poly": [
+                    1008.8760986328125,
+                    2208.609375,
+                    1536.0474853515625,
+                    2208.609375,
+                    1536.0474853515625,
+                    2243.414306640625,
+                    1008.8760986328125,
+                    2243.414306640625
+                ],
+                "score": 0.9998928308486938
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    108.46533203125,
+                    1288.0927734375,
+                    1546.7518310546875,
+                    1288.0927734375,
+                    1546.7518310546875,
+                    1383.8438720703125,
+                    108.46533203125,
+                    1383.8438720703125
+                ],
+                "score": 0.9997898936271667
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    107.81462860107422,
+                    1678.24609375,
+                    1227.880615234375,
+                    1678.24609375,
+                    1227.880615234375,
+                    1711.37255859375,
+                    107.81462860107422,
+                    1711.37255859375
+                ],
+                "score": 0.99957275390625
+            },
+            {
+                "category_id": 5,
+                "poly": [
+                    109.75360107421875,
+                    810.0169677734375,
+                    1579.9549560546875,
+                    810.0169677734375,
+                    1579.9549560546875,
+                    1171.6383056640625,
+                    109.75360107421875,
+                    1171.6383056640625
+                ],
+                "score": 0.9994542598724365
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    106.46218872070312,
+                    1548.299072265625,
+                    1540.3388671875,
+                    1548.299072265625,
+                    1540.3388671875,
+                    1676.67919921875,
+                    106.46218872070312,
+                    1676.67919921875
+                ],
+                "score": 0.9886452555656433
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    107.52558898925781,
+                    1386.4000244140625,
+                    1540.886962890625,
+                    1386.4000244140625,
+                    1540.886962890625,
+                    1447.8128662109375,
+                    107.52558898925781,
+                    1447.8128662109375
+                ],
+                "score": 0.9709398150444031
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    107.66414642333984,
+                    1451.8369140625,
+                    1537.99169921875,
+                    1451.8369140625,
+                    1537.99169921875,
+                    1546.690185546875,
+                    107.66414642333984,
+                    1546.690185546875
+                ],
+                "score": 0.9590120315551758
+            },
+            {
+                "category_id": 6,
+                "poly": [
+                    95.90371704101562,
+                    728.2855224609375,
+                    328.1967468261719,
+                    728.2855224609375,
+                    328.1967468261719,
+                    768.121826171875,
+                    95.90371704101562,
+                    768.121826171875
+                ],
+                "score": 0.6999977827072144
+            },
+            {
+                "category_id": 1,
+                "poly": [
+                    106.67481994628906,
+                    1371.857421875,
+                    1544.84814453125,
+                    1371.857421875,
+                    1544.84814453125,
+                    1678.67236328125,
+                    106.67481994628906,
+                    1678.67236328125
+                ],
+                "score": 0.5645973086357117
+            },
+            {
+                "category_id": 0,
+                "poly": [
+                    95.94171142578125,
+                    728.264404296875,
+                    328.1947937011719,
+                    728.264404296875,
+                    328.1947937011719,
+                    768.1663818359375,
+                    95.94171142578125,
+                    768.1663818359375
+                ],
+                "score": 0.30702608823776245
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1247,
+                    887,
+                    1353,
+                    887,
+                    1353,
+                    914,
+                    1247,
+                    914
+                ],
+                "score": 0.91,
+                "latex": "5\\%{\\sim}20\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1181,
+                    923,
+                    1290,
+                    923,
+                    1290,
+                    950,
+                    1181,
+                    950
+                ],
+                "score": 0.9,
+                "latex": "-5\\%{+}5\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1416,
+                    1047,
+                    1469,
+                    1047,
+                    1469,
+                    1077,
+                    1416,
+                    1077
+                ],
+                "score": 0.87,
+                "latex": "10\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1254,
+                    963,
+                    1296,
+                    963,
+                    1296,
+                    991,
+                    1254,
+                    991
+                ],
+                "score": 0.86,
+                "latex": "5\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1373,
+                    1003,
+                    1428,
+                    1003,
+                    1428,
+                    1032,
+                    1373,
+                    1032
+                ],
+                "score": 0.86,
+                "latex": "10\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1332,
+                    1047,
+                    1388,
+                    1047,
+                    1388,
+                    1076,
+                    1332,
+                    1076
+                ],
+                "score": 0.86,
+                "latex": "\\cdot10\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1373,
+                    1112,
+                    1428,
+                    1112,
+                    1428,
+                    1141,
+                    1373,
+                    1141
+                ],
+                "score": 0.85,
+                "latex": "10\\%"
+            },
+            {
+                "category_id": 13,
+                "poly": [
+                    1248,
+                    854,
+                    1302,
+                    854,
+                    1302,
+                    880,
+                    1248,
+                    880
+                ],
+                "score": 0.85,
+                "latex": "z0\\%"
+            }
+        ],
+        "page_info": {
+            "page_no": 3,
+            "height": 2339,
+            "width": 1654
+        }
+    }
+]
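Each detection in the fixture above stores `poly` as eight numbers. Judging from the values, these look like the four corners in the order top-left, top-right, bottom-right, bottom-left. The sketch below assumes that ordering and shows one way to collapse a `poly` into an axis-aligned box and to drop low-score detections. The helper names `poly_to_bbox` and `filter_by_score` are hypothetical and are not part of `magic_pdf`.

```python
from typing import Dict, List, Tuple


def poly_to_bbox(poly: List[float]) -> Tuple[float, float, float, float]:
    """Collapse an 8-number polygon into (x_min, y_min, x_max, y_max).

    Assumes the list alternates x and y coordinates, as in the fixture.
    """
    xs = poly[0::2]  # every even index is an x coordinate
    ys = poly[1::2]  # every odd index is a y coordinate
    return (min(xs), min(ys), max(xs), max(ys))


def filter_by_score(dets: List[Dict], min_score: float) -> List[Dict]:
    """Keep only detections whose score is at least min_score."""
    return [d for d in dets if d["score"] >= min_score]


# Two entries copied from the fixture (category_id 13 carries a "latex" field).
sample_dets = [
    {"category_id": 13, "poly": [737, 704, 804, 704, 804, 730, 737, 730],
     "score": 0.56, "latex": "\\pmb{26.5\\%}"},
    {"category_id": 13, "poly": [736, 955, 804, 955, 804, 983, 736, 983],
     "score": 0.26, "latex": "\\pmb{65.3\\%}"},
]

first_bbox = poly_to_bbox(sample_dets[0]["poly"])
confident = filter_by_score(sample_dets, 0.5)
```

Taking the min and max of each axis keeps the helper correct even if the corner order differs from the assumed one, as long as x and y alternate.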

+ 90 - 1
tests/test_cli/test_cli_sdk.py

@@ -9,7 +9,7 @@ from lib import common
 import magic_pdf.model as model_config
 from magic_pdf.pipe.UNIPipe import UNIPipe
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
-
+from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
 model_config.__use_inside_model__ = True
 pdf_res_path = conf.conf['pdf_res_path']
 code_path = conf.conf['code_path']
@@ -178,6 +178,95 @@ class TestCli:
             common.cli_count_folders_and_check_contents(
                 os.path.join(res_path, demo_name, 'ocr'))
 
+    @pytest.mark.P1
+    def test_pdf_dev_cli_local_jsonl_txt(self):
+        """magic_pdf_dev cli local txt."""
+        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
+        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, "txt")
+        logging.info(cmd)
+        os.system(cmd)
+
+
+    @pytest.mark.P1
+    def test_pdf_dev_cli_local_jsonl_ocr(self):
+        """magic_pdf_dev cli local ocr."""
+        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
+        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'ocr')
+        logging.info(cmd)
+        os.system(cmd)
+
+    @pytest.mark.P1
+    def test_pdf_dev_cli_local_jsonl_auto(self):
+        """magic_pdf_dev cli local auto."""
+        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
+        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'auto')
+        logging.info(cmd)
+        os.system(cmd)
+
+    @pytest.mark.P1
+    def test_pdf_dev_cli_s3_jsonl_txt(self):
+        """magic_pdf_dev cli s3 txt."""
+        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
+        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, "txt")
+        logging.info(cmd)
+        os.system(cmd)
+
+
+    @pytest.mark.P1
+    def test_pdf_dev_cli_s3_jsonl_ocr(self):
+        """magic_pdf_dev cli s3 ocr."""
+        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
+        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'ocr')
+        logging.info(cmd)
+        os.system(cmd)
+
+    @pytest.mark.P1
+    def test_pdf_dev_cli_s3_jsonl_auto(self):
+        """magic_pdf_dev cli s3 auto."""
+        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
+        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'auto')
+        logging.info(cmd)
+        os.system(cmd)
+
+
+    @pytest.mark.P1
+    def test_pdf_dev_cli_pdf_json_auto(self):
+        """magic_pdf_dev cli pdf+json auto."""
+        json_path = os.path.join(pdf_dev_path, 'test_model.json')
+        pdf_path = os.path.join(pdf_dev_path, 'pdf', 'research_report_1f978cd81fb7260c8f7644039ec2c054.pdf')
+        cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'auto')
+        logging.info(cmd)
+        os.system(cmd)
+    
+    @pytest.mark.P1
+    def test_pdf_dev_cli_pdf_json_ocr(self):
+        """magic_pdf_dev cli pdf+json ocr."""
+        json_path = os.path.join(pdf_dev_path, 'test_model.json')
+        pdf_path = os.path.join(pdf_dev_path, 'pdf', 'research_report_1f978cd81fb7260c8f7644039ec2c054.pdf')
+        cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'ocr')
+        logging.info(cmd)
+        os.system(cmd)
+
+
+    @pytest.mark.P1
+    def test_s3_sdk_auto(self):
+        pdf_ak = os.environ.get('pdf_ak', "")
+        pdf_sk = os.environ.get('pdf_sk', "")
+        pdf_bucket = os.environ.get('bucket', "")
+        pdf_endpoint = os.environ.get('pdf_endpoint', "")
+        s3_pdf_path = conf.conf["s3_pdf_path"]
+        image_dir = "s3://" + pdf_bucket + "/mineru/test/test.md"
+        s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
+        s3image_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint, parent_path=image_dir)
+        pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
+        jso_useful_key = {"_pdf_type": "", "model_list": []}
+        pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
+        pipe.pipe_classify()
+        pipe.pipe_analyze()
+        pipe.pipe_parse()
+        md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
+        assert len(md_content) > 0
+
 
 if __name__ == '__main__':
     pytest.main()
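The CLI tests above launch the command with `os.system` and discard its exit status, so a failing `magic-pdf-dev` run would still let the test pass. The sketch below shows one way to capture and assert on the exit code with `subprocess.run`. The helper name `run_cli` is hypothetical, and the demo runs the Python interpreter rather than `magic-pdf-dev` so that it stays self-contained.

```python
import subprocess
import sys
from typing import List


def run_cli(args: List[str]) -> subprocess.CompletedProcess:
    """Run a command given as an argument list and capture its output.

    Passing a list (instead of one shell string) avoids quoting problems
    when paths contain spaces.
    """
    return subprocess.run(args, capture_output=True, text=True, check=False)


# Stand-in commands: one that succeeds and one that exits with status 3.
ok = run_cli([sys.executable, "-c", "print('ok')"])
bad = run_cli([sys.executable, "-c", "import sys; sys.exit(3)"])

print(ok.returncode, ok.stdout.strip())
print(bad.returncode)
```

Inside a test this could be used as `result = run_cli(['magic-pdf-dev', '--jsonl', jsonl_path, '--method', 'txt'])` followed by `assert result.returncode == 0`, assuming the CLI returns a non-zero status on failure.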

+ 0 - 0
tests/test_cli/test_magic-pdf-dev_cli.py


+ 36 - 0
tests/test_cli/test_performence.py

@@ -0,0 +1,36 @@
+"""
+test performance
+"""
+import os
+import shutil
+import json
+from lib import calculate_score
+import pytest
+from conf import conf
+
+code_path = os.environ.get('GITHUB_WORKSPACE')
+pdf_dev_path = conf.conf["pdf_dev_path"]
+pdf_res_path = conf.conf["pdf_res_path"]
+
+class TestTable():
+    """
+    test table
+    """
+    def test_perf_close_table(self):
+        """
+        test performance with table recognition disabled
+        """
+
+
+
+
+def get_score():
+    """
+    get score
+    """
+    score = calculate_score.Scoring(os.path.join(pdf_dev_path, "result.json"))
+    score.calculate_similarity_total("mineru", pdf_dev_path)
+    res = score.summary_scores()
+    return res
+
+

+ 54 - 0
tests/test_cli/test_table.py

@@ -0,0 +1,54 @@
+"""
+test table case
+"""
+import os
+import shutil
+import json
+from lib import calculate_score
+import pytest
+from conf import conf
+
+code_path = os.environ.get('GITHUB_WORKSPACE')
+pdf_dev_path = conf.conf["pdf_dev_path"]
+pdf_res_path = conf.conf["pdf_res_path"]
+
+class TestTable():
+    """
+    test table
+    """
+    def test_paddle_table_master_cuda(self):
+        """
+        select table: paddle table master, mode is cuda
+        """
+    def test_paddle_table_master_cpu(self):
+        """
+        select table: paddle table master, mode is cpu
+        """
+    def test_st_table_cuda(self):
+        """
+        select table: ST, mode is cuda
+        """
+
+    def test_st_table_cpu(self):
+        """
+        select table: ST, mode is cpu
+        """
+
+    def test_close_table_cuda(self):
+        """
+        close table, mode is cuda
+        """
+    
+
+
+
+def get_score():
+    """
+    get score
+    """
+    score = calculate_score.Scoring(os.path.join(pdf_dev_path, "result.json"))
+    score.calculate_similarity_total("mineru", pdf_dev_path)
+    res = score.summary_scores()
+    return res
+
+

BIN
tests/unittest/test_table/assets/table.jpg


+ 14 - 0
tests/unittest/test_table/test_tablemaster.py

@@ -0,0 +1,14 @@
+import pytest
+from PIL import Image
+from magic_pdf.model.ppTableModel import ppTableModel
+
+class TestppTableModel:
+    def test_image2html(self):
+        img = Image.open("tests/unittest/test_table/assets/table.jpg")
+        # Change this to the path of your table model
+        config = {"device": "cuda",
+                  "model_dir": "/home/quyuan/PDF-Extract-Kit/models/TabRec/TableMaster"}
+        table_model = ppTableModel(config)
+        res = table_model.img2html(img)
+        true_value = """<td><table  border="1"><thead><tr><td><b>Methods</b></td><td><b>R</b></td><td><b>P</b></td><td><b>F</b></td><td><b>FPS</b></td></tr></thead><tbody><tr><td>SegLink [26]</td><td>70.0</td><td>86.0</td><td>77.0</td><td>8.9</td></tr><tr><td>PixelLink [4]</td><td>73.2</td><td>83.0</td><td>77.8</td><td>-</td></tr><tr><td>TextSnake [18]</td><td>73.9</td><td>83.2</td><td>78.3</td><td>1.1</td></tr><tr><td>TextField [37]</td><td>75.9</td><td>87.4</td><td>81.3</td><td>5.2 </td></tr><tr><td>MSR[38]</td><td>76.7</td><td>87.4</td><td>81.7</td><td>-</td></tr><tr><td>FTSN[3]</td><td>77.1</td><td>87.6</td><td>82.0</td><td>-</td></tr><tr><td>LSE[30]</td><td>81.7</td><td>84.2</td><td>82.9</td><td>-</td></tr><tr><td>CRAFT [2]</td><td>78.2</td><td>88.2</td><td>82.9</td><td>8.6</td></tr><tr><td>MCN [16]</td><td>79</td><td>88.</td><td>83</td><td>-</td></tr><tr><td>ATRR[35]</td><td>82.1</td><td>85.2</td><td>83.6</td><td>-</td></tr><tr><td>PAN [34]</td><td>83.8</td><td>84.4</td><td>84.1</td><td>30.2</td></tr><tr><td>DB[12]</td><td>79.2</td><td>91.5</td><td>84.9</td><td>32.0</td></tr><tr><td>DRRG [41]</td><td>82.30</td><td>88.05</td><td>85.08</td><td>-</td></tr><tr><td>Ours (SynText)</td><td>80.68</td><td>85.40</td><td>82.97</td><td>12.68</td></tr><tr><td>Ours (MLT-17)</td><td>84.54</td><td>86.62</td><td>85.57</td><td>12.31</td></tr></tbody></table></td>\n"""
+        assert res == true_value

+ 0 - 0
tests/test_unit.py → tests/unittest/test_unit.py


Some files were not shown because too many files have changed in this diff