This is a step-by-step guide to reproducing the OmniDocBench benchmark results. Here we use the Docker image for the environment setup.
# dots.ocr parser env
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install -e .
# model setup and register
python3 tools/download_model.py
export hf_model_path=./weights/DotsOCR  # Path to your downloaded model weights. Use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path; this is a temporary workaround pending our integration with Transformers.
export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
from DotsOCR import modeling_dots_ocr_vllm' `which vllm` # If you downloaded the model weights yourself, replace `DotsOCR` with the directory name you saved them under, and remember to use a name without periods (e.g., `DotsOCR` instead of `dots.ocr`).
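# After this patch, the vllm launcher script (the file reported by `which vllm`) should
# contain the DotsOCR import right below vllm's CLI entry import, roughly:
#   from vllm.entrypoints.cli.main import main
#   from DotsOCR import modeling_dots_ocr_vllm
# (The second line shows whatever directory name you registered; if it is missing, re-run the sed command.)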
# launch vllm server
CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
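To confirm the server is ready before running inference, you can query its OpenAI-compatible model-listing endpoint. A minimal sketch, assuming the default localhost:8000 address used above; it should print the served model name (`model`):

import json, urllib.request

# Minimal sanity check: list the models served by the vLLM server started above.
with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    served = json.loads(resp.read())
print([m["id"] for m in served["data"]])  # expect: ['model']

Once the server responds, run the inference script below; it feeds every OmniDocBench page image to DotsOCRParser through the vLLM server: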
from tqdm import tqdm
import json
import argparse
from multiprocessing.pool import ThreadPool
import shutil
import os


if __name__ == "__main__":
    from dots_ocr import DotsOCRParser

    parser = argparse.ArgumentParser(
        description="dots.ocr Multilingual Document Layout Parser",
    )
    parser.add_argument(
        '--bbox',
        type=int,
        nargs=4,
        metavar=('x1', 'y1', 'x2', 'y2'),
        help='give this argument if you want to use prompt_grounding_ocr'
    )
    parser.add_argument(
        "--ip", type=str, default="localhost",
        help="IP address of the vLLM server"
    )
    parser.add_argument(
        "--port", type=int, default=8000,
        help="port of the vLLM server"
    )
    parser.add_argument(
        "--model_name", type=str, default="model",
        help="served model name (matches --served-model-name used in vllm serve)"
    )
    parser.add_argument(
        "--temperature", type=float, default=0.1,
        help="sampling temperature"
    )
    parser.add_argument(
        "--top_p", type=float, default=1.0,
        help="nucleus sampling top_p"
    )
    parser.add_argument(
        "--dpi", type=int, default=200,
        help="rendering DPI for PDF pages"
    )
    parser.add_argument(
        "--max_completion_tokens", type=int, default=16384,
        help="maximum completion tokens per request"
    )
    parser.add_argument(
        "--num_thread", type=int, default=128,
        help="number of worker threads used by DotsOCRParser"
    )
    # parser.add_argument(
    #     "--fitz_preprocess", type=bool, default=False,
    #     help="False uses the fitz DPI-upsample pipeline, which is good for images rendered at low DPI but may increase computational cost"
    # )
    parser.add_argument(
        "--min_pixels", type=int, default=None,
        help="lower bound on image pixels after resizing"
    )
    parser.add_argument(
        "--max_pixels", type=int, default=None,
        help="upper bound on image pixels after resizing"
    )
    args = parser.parse_args()

    dots_ocr_parser = DotsOCRParser(
        ip=args.ip,
        port=args.port,
        model_name=args.model_name,
        temperature=args.temperature,
        top_p=args.top_p,
        max_completion_tokens=args.max_completion_tokens,
        num_thread=args.num_thread,
        dpi=args.dpi,
        # output_dir=args.output,
        min_pixels=args.min_pixels,
        max_pixels=args.max_pixels,
    )

    # Download the OmniDocBench dataset from https://huggingface.co/datasets/opendatalab/OmniDocBench
    # and reformat it into a JSONL file so the page images can be fed to the model (one entry per image).
    filepath = "/path/to/OmniDocBench.jsonl"
    with open(filepath, 'r') as f:
        list_items = [json.loads(line) for line in f]

    results = []
    output_path = "./output_omni.jsonl"
    f_out = open(output_path, 'w')
    tasks = [[item['file_path'], f_out] for item in list_items]

    def _execute(task):
        image_path, f_out = task
        result = dots_ocr_parser.parse_file(
            image_path,
            prompt_mode="prompt_layout_all_en",
            # prompt_mode="prompt_ocr",
            fitz_preprocess=True,
        )
        results.append(result)
        f_out.write(f"{json.dumps(result, ensure_ascii=False)}\n")
        f_out.flush()

    # Run inference over all benchmark pages with a thread pool.
    with ThreadPool(128) as pool:
        with tqdm(total=len(tasks)) as pbar:
            for result in pool.imap(_execute, tasks):
                pbar.update(1)
        pool.close()
        pool.join()
    f_out.close()

    # Collect the per-page markdown predictions (header/footer-free versions preferred) into one directory for evaluation.
    eval_result_save_dir = "./output_omni/"
    os.makedirs(eval_result_save_dir, exist_ok=True)
    with open(output_path, "r") as f:
        for line in f.readlines():
            item = json.loads(line)[0]
            if 'md_content_nohf_path' in item:
                file_name = os.path.basename(item['md_content_nohf_path']).replace("_nohf", "")
                shutil.copy2(item['md_content_nohf_path'], os.path.join(eval_result_save_dir, file_name))
            else:
                shutil.copy2(item['md_content_path'], eval_result_save_dir)
    print(f"md results saved to {eval_result_save_dir}")
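The script above expects /path/to/OmniDocBench.jsonl to contain one JSON object per line with a file_path field pointing at a benchmark page image. A minimal sketch for building it, assuming you have downloaded the dataset from https://huggingface.co/datasets/opendatalab/OmniDocBench and placed the page images in a local directory (the paths and extensions below are illustrative; adjust them to your layout):

import glob
import json
import os

# Hypothetical local paths; adjust to wherever you placed the downloaded dataset.
image_dir = "/path/to/OmniDocBench/images"
jsonl_path = "/path/to/OmniDocBench.jsonl"

with open(jsonl_path, "w") as f_out:
    image_paths = sorted(glob.glob(os.path.join(image_dir, "*.jpg")) +
                         glob.glob(os.path.join(image_dir, "*.png")))
    for image_path in image_paths:
        # Each line only needs the file_path field consumed by the inference script above.
        f_out.write(json.dumps({"file_path": image_path}, ensure_ascii=False) + "\n")

Save the inference script above under any name (e.g. run_omnidocbench.py, a name chosen here only for illustration) and run it with python3 once the vLLM server is up; it writes ./output_omni.jsonl and copies the per-page markdown predictions into ./output_omni/.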
We use OmniDocBench to evaluate the performance. Prepare the OmniDocBench environment yourself by following the official steps.
git clone https://github.com/opendatalab/OmniDocBench.git
cd OmniDocBench
# Follow https://github.com/opendatalab/OmniDocBench?tab=readme-ov-file#environment-setup-and-running
conda create -n omnidocbench python=3.10
conda activate omnidocbench
pip install -r requirements.txt
# Eval. Change the ground-truth and prediction paths in end2end.yaml to your own; following the inference steps above, the prediction data_path is /path/to/dots.ocr/output_omni/
python pdf_validation.py --config ./end2end.yaml > evaluation_output.log
cat evaluation_output.log
The `./end2end.yaml` config looks like this:
end2end_eval:
  metrics:
    text_block:
      metric:
        - Edit_dist
    display_formula:
      metric:
        - Edit_dist
        - CDM
    table:
      metric:
        - TEDS
        - Edit_dist
    reading_order:
      metric:
        - Edit_dist
  dataset:
    dataset_name: end2end_dataset
    ground_truth:
      data_path: ./OmniDocBench.json # change to the OmniDocBench official gt
    prediction:
      data_path: /path/to/dots.ocr/output_omni/ # change to your own output dir
    match_method: quick_match
The eval results are as follows:
DATASET_REGISTRY: ['recogition_text_dataset', 'omnidocbench_single_module_dataset', 'recogition_formula_dataset', 'recogition_table_dataset', 'end2end_dataset', 'recogition_end2end_base_dataset', 'recogition_end2end_table_dataset', 'detection_dataset', 'detection_dataset_simple_format', 'md2md_dataset']
METRIC_REGISTRY: ['TEDS', 'BLEU', 'METEOR', 'Edit_dist', 'CDM']
EVAL_TASK_REGISTRY: ['detection_eval', 'end2end_eval', 'recogition_eval']
###### Process: _quick_match
【Overall】
display_formula CDM is not found
display_formula CDM is not found
---------------------------- --------------------
text_block_Edit_dist_EN 0.031039851583834904
text_block_Edit_dist_CH 0.0643426705744435
display_formula_Edit_dist_EN 0.32843522681423176
display_formula_Edit_dist_CH 0.42557920974720154
display_formula_CDM_EN -
display_formula_CDM_CH -
table_TEDS_EN 88.91012727754615
table_TEDS_CH 89.0531009325606
table_Edit_dist_EN 0.0943878061222165
table_Edit_dist_CH 0.09173810770062703
reading_order_Edit_dist_EN 0.04079293927450415
reading_order_Edit_dist_CH 0.06625588944827145
overall_EN 0.12366395594869685
overall_CH 0.16197896936763587
---------------------------- --------------------
【PDF types】
-------------------------------- ---------
data_source: book 0.0183191
data_source: PPT2PDF 0.0470068
data_source: research_report 0.0107441
data_source: colorful_textbook 0.0710044
data_source: exam_paper 0.0763102
data_source: magazine 0.0278807
data_source: academic_literature 0.0279
data_source: note 0.112103
data_source: newspaper 0.0787516
ALL 0.055183
-------------------------------- ---------
【Layout】
Layout Mean Var
--------------------- --------- ----------
layout: single_column 0.0267498 0.0187775
layout: double_column 0.0789817 0.0283393
layout: three_column 0.0738766 0.00154036
layout: other_layout 0.115941 0.0336075
【Text Attribute】
-------------------------------------- ---------
text_language: text_english 0.0296053
text_language: text_simplified_chinese 0.106577
text_language: text_en_ch_mixed 0.0888106
text_background: white 0.0880222
text_background: single_colored 0.0752833
text_background: multi_colored 0.0723353
-------------------------------------- ---------
【Table Attribute】
---------------------------------- --------
language: table_en 0.876843
language: table_simplified_chinese 0.872133
language: table_en_ch_mixed 0.941139
line: full_line 0.895071
line: less_line 0.897401
line: fewer_line 0.842987
line: wireless_line 0.847398
with_span: True 0.865542
with_span: False 0.881582
include_equation: True 0.839543
include_equation: False 0.884461
include_background: True 0.886555
include_background: False 0.8707
table_layout: vertical 0.884036
table_layout: horizontal 0.875826
---------------------------------- --------
Notes:
- The metrics reported in the README.md are averages over 5 runs. Individual runs may vary by about ±0.005 in the overall_EN and overall_CH scores.
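- If you want to reproduce the averaging, here is a minimal sketch (assuming you kept one log file per run, with hypothetical names like evaluation_output_run1.log) that pulls the overall_EN / overall_CH rows out of each log and averages them:

import glob
import re

# Collect overall scores from each run's evaluation log (file naming is an assumption; adjust the glob).
scores = {"overall_EN": [], "overall_CH": []}
for log_path in glob.glob("evaluation_output_run*.log"):
    with open(log_path) as f:
        for line in f:
            m = re.match(r"(overall_EN|overall_CH)\s+([\d.]+)", line.strip())
            if m:
                scores[m.group(1)].append(float(m.group(2)))

for key, values in scores.items():
    if values:
        print(key, sum(values) / len(values))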