3 ヶ月前 · f1eadcb83d
--- a/zhch/OmniDocBench-说明.md
+++ b/zhch/OmniDocBench-说明.md
@@ -0,0 +1,282 @@
 
				+# OmniDocBench
			
 
				+
			
 
				+[English](https://opendatalab.com/OpenDataLab/OmniDocBench/blob/main/README_EN.md) | 简体中文
			
 
				+
			
 
				+**OmniDocBench** is a benchmark for evaluating diverse document parsing in real-world scenarios, with the following features:
			
 
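				+- **Diverse document types**: The benchmark comprises 981 PDF pages covering 9 document types, 4 layout types, and 3 language types. Coverage is broad, including academic literature, financial reports, newspapers, textbooks, and handwritten notes;
				+- **Rich annotations**: It contains **localization information** for 15 block-level (text paragraphs, titles, tables, etc., over 20k in total) and 4 span-level (text lines, inline formulas, superscripts/subscripts, etc., over 80k in total) document elements, as well as the **recognition result** for each element region (Text annotations for text, LaTeX annotations for formulas, and both LaTeX and HTML annotations for tables). OmniDocBench also annotates the **reading order** of document components. In addition, it includes attribute labels at the page and block levels: 5 **page attribute labels**, 3 **text attribute labels**, and 6 **table attribute labels**.
				+- **High annotation quality**: The data went through manual screening, intelligent annotation, manual annotation, and full quality checks by both experts and large models.
				+- **Accompanying evaluation code**: End-to-end and single-module evaluation code is provided to keep evaluation fair and accurate. The evaluation code is available at [OmniDocBench](https://github.com/opendatalab/OmniDocBench).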
			
 
				+
			
 
				+## Updates
			
 
				+
			
 
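				+- [2024/12/25] Added a PDF version of the benchmark for models that require PDF input. Added the original PDF slices, which retain metadata.
				+- [2024/12/10] Corrected the height and width fields of some samples. The fix only touches the page-level height and width fields and does not affect the correctness of other annotations.
				+- [2024/12/04] Released the OmniDocBench benchmark.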
			
 
				+
			
 
				+## Benchmark Introduction
			
 
				+
			
 
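				+The benchmark comprises 981 PDF pages covering 9 document types, 4 layout types, and 3 language types. OmniDocBench has rich annotations, including 15 block-level annotations (text paragraphs, titles, tables, etc.) and 4 span-level annotations (text lines, inline formulas, superscripts/subscripts, etc.). All text-related annotation boxes carry text recognition annotations; formulas carry LaTeX annotations, and tables carry both LaTeX and HTML annotations. OmniDocBench also annotates the reading order of document components. In addition, it includes attribute labels at the page and block levels: 5 page attribute labels, 3 text attribute labels, and 6 table attribute labels.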
			
 
				+
			
 
				+![](https://huggingface.co/datasets/opendatalab/OmniDocBench/resolve/main/data_diversity.png)
			
 
				+
			
 
				+## Usage
			
 
				+
			
 
				+You can evaluate with the [evaluation scripts](https://github.com/opendatalab/OmniDocBench) we provide, along the following dimensions:
			
 
				+
			
 
				+- End-to-end evaluation: includes the end2end and md2md evaluation methods
				+- Layout detection
				+- Table recognition
				+- Formula recognition
				+- Text OCR
			
 
				+
			
 
				+The benchmark files include:
			
 
				+
			
 
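				+- [OmniDocBench.json](OmniDocBench.json) is the benchmark's annotation file, stored in JSON format. It supports the end2end evaluation method; its structure and fields are explained below.
				+- [images](./images/) contains the corresponding benchmark images, for models that require image input.
				+- [pdfs](./pdfs/) contains PDFs converted from the images, with filenames matching the benchmark images one-to-one, for models that require PDF input.
				+- [ori_pdfs](./ori_pdfs/) contains PDF pages extracted directly from the original PDFs, with filenames matching the benchmark images one-to-one; these PDFs retain the original metadata. Note that for evaluation we masked certain regions on some pages: abandon-class regions (special graphics in headers and footers) on 368 PDF pages, and unparseable regions (such as tables containing images) on 22 pages. The affected pages are recorded in [with_mask.json](with_mask.json). However, masking content inside the original PDF metadata is difficult, **so this data is not masked and differs from the images used for evaluation. For a fairer comparison, please use [pdfs](./pdfs/) or [images](./images/) as input for evaluation.**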
			
 
				+
			
 
				+
			
 
				+<details>
			
 
				+  <summary>Benchmark data format</summary>
			
 
				+
			
 
				+The benchmark data is in JSON format. Its structure and fields are explained below:
			
 
				+
			
 
				+```json
			
 
				+[{
				+    "layout_dets": [    // list of page elements
				+        {
				+            "category_type": "text_block",  // category name
				+            "poly": [
				+                136.0, // position: x,y coordinates of the top-left, top-right, bottom-right, and bottom-left corners
				+                781.0,
				+                340.0,
				+                781.0,
				+                340.0,
				+                806.0,
				+                136.0,
				+                806.0
				+            ],
				+            "ignore": false,        // whether to skip this element during evaluation
				+            "order": 0,             // reading order
				+            "anno_id": 0,           // annotation ID, unique for each layout box
				+            "text": "xxx",          // optional; the OCR result is written here
				+            "latex": "$xxx$",       // optional; LaTeX for formulas and tables is written here
				+            "html": "xxx",          // optional; HTML for tables is written here
				+            "attribute": {"xxx": "xxx"},         // classification attributes of the layout element, detailed below
				+            "line_with_spans": [   // span-level annotation boxes
				+                {
				+                    "category_type": "text_span",
				+                    "poly": [...],
				+                    "ignore": false,
				+                    "text": "xxx",
				+                    "latex": "$xxx$",
				+                 },
				+                 ...
				+            ],
				+            "merge_list": [    // present only on boxes with a merge relationship; whether merge logic applies depends on whether the box contains small paragraphs split by single line breaks, e.g. list types
				+                {
				+                    "category_type": "text_block",
				+                    "poly": [...],
				+                    ...   // same fields as block-level annotations
				+                    "line_with_spans": [...]
				+                    ...
				+                 },
				+                 ...
				+            ]
				+        },
				+        ...
				+    ],
				+    "page_info": {
				+        "page_no": 0,            // page number
				+        "height": 1684,          // page height
				+        "width": 1200,           // page width
				+        "image_path": "xx/xx/",  // file name of the annotated page
				+        "page_attribute": {"xxx": "xxx"}     // page attribute labels
				+    },
				+    "extra": {
				+        "relation": [ // annotations that are related to each other
				+            {
				+                "source_anno_id": 1,
				+                "target_anno_id": 2,
				+                "relation_type": "parent_son"  // relation between a figure/table and its caption/footnote
				+            },
				+            {
				+                "source_anno_id": 5,
				+                "target_anno_id": 6,
				+                "relation_type": "truncated"  // a paragraph split by the layout gets a truncated relation; the parts are concatenated and evaluated as one paragraph
				+            },
				+        ]
				+    }
				+},
				+...
				+]
			
 
				+```
			
 
				+
			
 
				+</details>
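
The format above can be read with a few lines of Python. The following is a minimal sketch, not the official evaluation code: it uses a small in-memory sample with made-up values instead of the real `OmniDocBench.json`, and the helper names `poly_to_bbox` and `blocks_in_reading_order` are hypothetical. It assumes that blocks without an `order` field (or with `ignore` set) should be left out of the reading order.

```python
import json
from typing import Any, Dict, List, Tuple

# A tiny sample that mimics the documented structure (values are made up).
SAMPLE = json.loads("""
[{
  "layout_dets": [
    {"category_type": "text_block",
     "poly": [136.0, 781.0, 340.0, 781.0, 340.0, 806.0, 136.0, 806.0],
     "ignore": false, "order": 1, "anno_id": 0, "text": "second"},
    {"category_type": "title",
     "poly": [100.0, 50.0, 400.0, 50.0, 400.0, 90.0, 100.0, 90.0],
     "ignore": false, "order": 0, "anno_id": 1, "text": "first"},
    {"category_type": "abandon",
     "poly": [0.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 10.0],
     "ignore": true, "anno_id": 2}
  ],
  "page_info": {"page_no": 0, "height": 1684, "width": 1200,
                "image_path": "sample.jpg", "page_attribute": {}},
  "extra": {"relation": []}
}]
""")


def poly_to_bbox(poly: List[float]) -> Tuple[float, float, float, float]:
    """Collapse an 8-value corner polygon into (x_min, y_min, x_max, y_max)."""
    xs = poly[0::2]  # even positions hold x coordinates
    ys = poly[1::2]  # odd positions hold y coordinates
    return (min(xs), min(ys), max(xs), max(ys))


def blocks_in_reading_order(page: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return non-ignored blocks that carry an `order`, sorted by that order."""
    blocks = [
        block for block in page["layout_dets"]
        if not block.get("ignore", False) and block.get("order") is not None
    ]
    return sorted(blocks, key=lambda block: block["order"])


if __name__ == "__main__":
    for block in blocks_in_reading_order(SAMPLE[0]):
        print(block["order"], block["category_type"], poly_to_bbox(block["poly"]))
```

To run this against the real annotations, replace `SAMPLE` with the list returned by `json.load` on the annotation file.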
			
 
				+
			
 
				+<details>
			
 
				+  <summary>Evaluation set categories</summary>
			
 
				+
			
 
				+The evaluation set categories include:
			
 
				+
			
 
				+```
			
 
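				+# Block-level annotation boxes
				+'title'               # title
				+'text_block'          # paragraph-level plain text
				+'figure',             # figures
				+'figure_caption',     # figure captions and titles
				+'figure_footnote',    # figure notes
				+'table',              # table body
				+'table_caption',      # table captions and titles
				+'table_footnote',     # table notes
				+'equation_isolated',  # display formulas
				+'equation_caption',   # formula numbers
				+'header'              # page header
				+'footer'              # page footer
				+'page_number'         # page number
				+'page_footnote'       # page notes
				+'abandon',            # other discarded content (e.g. irrelevant information in the middle of a page)
				+'code_txt',           # code blocks
				+'code_txt_caption',   # code block captions
				+'reference',          # references
				+
				+# Span-level annotation boxes
				+'text_span'           # span-level plain text
				+'equation_ignore',    # formulas to be ignored
				+'equation_inline',    # inline formulas
				+'footnote_mark',      # superscripts and subscripts in the text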
			
 
				+```
			
 
				+
			
 
				+</details>
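
The category names listed above can be used to sanity-check a prediction or annotation file. The sketch below is illustrative only: the two sets are copied from the list above, while `count_categories` and `unknown_categories` are hypothetical helpers, not part of the official evaluation scripts.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Set, Tuple

# Category names copied from the list above.
BLOCK_CATEGORIES: Set[str] = {
    "title", "text_block", "figure", "figure_caption", "figure_footnote",
    "table", "table_caption", "table_footnote", "equation_isolated",
    "equation_caption", "header", "footer", "page_number", "page_footnote",
    "abandon", "code_txt", "code_txt_caption", "reference",
}
SPAN_CATEGORIES: Set[str] = {
    "text_span", "equation_ignore", "equation_inline", "footnote_mark",
}


def count_categories(pages: Iterable[Dict[str, Any]]) -> Tuple[Counter, Counter]:
    """Tally block-level and span-level category names across all pages."""
    block_counts: Counter = Counter()
    span_counts: Counter = Counter()
    for page in pages:
        for block in page.get("layout_dets", []):
            block_counts[block["category_type"]] += 1
            for span in block.get("line_with_spans", []):
                span_counts[span["category_type"]] += 1
    return block_counts, span_counts


def unknown_categories(block_counts: Counter, span_counts: Counter) -> Set[str]:
    """Return names that appear in the data but not in the documented lists."""
    return (set(block_counts) - BLOCK_CATEGORIES) | (set(span_counts) - SPAN_CATEGORIES)
```

A non-empty result from `unknown_categories` usually points to a label that was not mapped to an OmniDocBench category before export.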
			
 
				+
			
 
				+<details>
			
 
				+  <summary>Evaluation set attribute labels</summary>
			
 
				+
			
 
				+Page classification attributes include:
			
 
				+```
			
 
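				+'data_source': # PDF type
				+    academic_literature  # academic literature
				+    PPT2PDF # PPT converted to PDF
				+    book # black-and-white books and textbooks
				+    colorful_textbook # colorful textbooks with images
				+    exam_paper # exam papers
				+    note # handwritten notes
				+    magazine # magazines
				+    research_report # research reports and financial reports
				+    newspaper # newspapers
				+
				+'language': # language
				+    en # English
				+    simplified_chinese # Simplified Chinese
				+    en_ch_mixed # mixed English and Chinese
				+
				+'layout': # page layout type
				+    single_column # single column
				+    double_column # double column
				+    three_column # three columns
				+    1andmore_column # one column mixed with multiple columns, common in academic literature
				+    other_layout # other
				+
				+'watermark': # whether the page contains a watermark
				+    true
				+    false
				+
				+'fuzzy_scan': # whether the page is a blurry scan
				+    true
				+    false
				+
				+'colorful_backgroud': # whether the page has a colorful background, i.e. the content to be recognized sits on more than two background colors
				+    true
				+    false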
			
 
				+```
			
 
				+
			
 
				+Box-level attributes, table-related:
			
 
				+
			
 
				+```
			
 
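				+'table_layout': # table orientation
				+    vertical # vertical table
				+    horizontal # horizontal table
				+
				+'with_span': # merged cells
				+    False
				+    True
				+
				+'line': # table borders
				+    full_line # full borders
				+    less_line # partial borders
				+    fewer_line # three-line borders
				+    wireless_line # no borders
				+
				+'language': # table language
				+    table_en  # English table
				+    table_simplified_chinese  # Simplified Chinese table
				+    table_en_ch_mixed  # mixed English-Chinese table
				+
				+'include_equation': # whether the table contains formulas
				+    False
				+    True
				+
				+'include_backgroud': # whether the table has a background color
				+    False
				+    True
				+
				+'table_vertical': # whether the table is rotated by 90 or 270 degrees
				+    False
				+    True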
			
 
				+```
			
 
				+
			
 
				+Box-level attributes, text-paragraph-related:
			
 
				+```
			
 
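				+'text_language': # language within the text paragraph
				+    text_en  # English
				+    text_simplified_chinese # Simplified Chinese
				+    text_en_ch_mixed  # mixed English and Chinese
				+
				+'text_background':  # text background color
				+    white # default, white background
				+    single_colored # single background color other than white
				+    multi_colored  # mixed background colors
				+
				+'text_rotate': # rotation of the text within the paragraph
				+    normal # default, horizontal text, no rotation
				+    rotate90  # rotated 90 degrees clockwise
				+    rotate180 # rotated 180 degrees clockwise
				+    rotate270 # rotated 270 degrees clockwise
				+    horizontal # characters are upright, but the layout is vertical text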
			
 
				+```
			
 
				+
			
 
				+Box-level attributes, formula-related:
			
 
				+```
			
 
				+'formula_type': # formula type
				+    print  # printed
				+    handwriting # handwritten
			
 
				+```
			
 
				+
			
 
				+</details>
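
The attribute labels above make it straightforward to slice the benchmark into subsets, for example all double-column English pages or all tables with merged cells. The sketch below assumes the JSON structure documented earlier; `filter_pages` and `tables_with_attribute` are hypothetical helper names, not functions from the official evaluation scripts.

```python
from typing import Any, Dict, List


def filter_pages(pages: List[Dict[str, Any]], **wanted: Any) -> List[Dict[str, Any]]:
    """Keep pages whose page_attribute matches every requested key/value pair."""
    selected = []
    for page in pages:
        attrs = page.get("page_info", {}).get("page_attribute", {})
        if all(attrs.get(key) == value for key, value in wanted.items()):
            selected.append(page)
    return selected


def tables_with_attribute(pages: List[Dict[str, Any]], key: str, value: Any) -> List[Dict[str, Any]]:
    """Collect table blocks whose block-level attribute `key` equals `value`."""
    return [
        block
        for page in pages
        for block in page.get("layout_dets", [])
        if block.get("category_type") == "table"
        and block.get("attribute", {}).get(key) == value
    ]
```

For example, `filter_pages(pages, language="en", layout="double_column")` selects the double-column English pages, and `tables_with_attribute(pages, "with_span", True)` collects the tables that have merged cells.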
			
 
				+
			
 
				+## Data Display
			
 
				+![](https://huggingface.co/datasets/opendatalab/OmniDocBench/resolve/main/show_pdf_types_1.png)
			
 
				+![](https://huggingface.co/datasets/opendatalab/OmniDocBench/resolve/main/show_pdf_types_2.png)
			
 
				+
			
 
				+## Acknowledgement
			
 
				+
			
 
				+- Thanks to [Abaka AI](https://abaka.ai) for supporting the dataset annotation.
			
 
				+
			
 
				+## Copyright Statement
			
 
				+  
			
 
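				+The PDFs were collected from public online sources and contributed by community users. Content that may not be distributed has been removed. The data is for research use only, not for commercial use. If there is any infringement, please contact OpenDataLab@pjlab.org.cn.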
			
 
				+
			
 
				+## Citation
			
 
				+
			
 
				+```bibtex
			
 
				+@misc{ouyang2024omnidocbenchbenchmarkingdiversepdf,
			
 
				+      title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations}, 
			
 
				+      author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
			
 
				+      year={2024},
			
 
				+      eprint={2412.07626},
			
 
				+      archivePrefix={arXiv},
			
 
				+      primaryClass={cs.CV},
			
 
				+      url={https://arxiv.org/abs/2412.07626}, 
			
 
				+}
			
 
				+```
			
--- a/zhch/omnidocbench_eval.py
+++ b/zhch/omnidocbench_eval.py
@@ -0,0 +1,404 @@
 
				+# zhch/omnidocbench_eval.py
			
 
				+import json
			
 
				+import time
			
 
				+from pathlib import Path
			
 
				+from typing import List, Dict, Any, Tuple
			
 
				+import cv2
			
 
				+import numpy as np
			
 
				+from paddlex import create_pipeline
			
 
				+
			
 
				+class OmniDocBenchEvaluator:
			
 
				+    """OmniDocBench evaluator (fixed version) that produces results in the benchmark's evaluation format"""
			
 
				+    
			
 
				+    def __init__(self, pipeline_config_path: str = "./PP-StructureV3-zhch.yaml"):
			
 
				+        """
				+        Initialize the evaluator.
				+
				+        Args:
				+            pipeline_config_path: path to the PaddleX pipeline config file
				+        """
			
 
				+        self.pipeline = create_pipeline(pipeline=pipeline_config_path)
			
 
				+        self.category_mapping = self._get_category_mapping()
			
 
				+        
			
 
				+    def _get_category_mapping(self) -> Dict[str, str]:
			
 
				+        """Return the mapping from PaddleX categories to OmniDocBench categories"""
			
 
				+        return {
			
 
				+            # PaddleX -> OmniDocBench 类别映射
			
 
				+            'title': 'title',
			
 
				+            'text': 'text_block',
			
 
				+            'figure': 'figure',
			
 
				+            'figure_caption': 'figure_caption',
			
 
				+            'table': 'table',
			
 
				+            'table_caption': 'table_caption',
			
 
				+            'equation': 'equation_isolated',
			
 
				+            'header': 'header',
			
 
				+            'footer': 'footer',
			
 
				+            'reference': 'reference',
			
 
				+            'seal': 'abandon',  # 印章通常作为舍弃类
			
 
				+            'number': 'page_number',
			
 
				+            # 添加更多映射关系
			
 
				+        }
			
 
				+    
			
 
				+    def evaluate_single_image(self, image_path: str, 
			
 
				+                            use_gpu: bool = True,
			
 
				+                            **kwargs) -> Dict[str, Any]:
			
 
				+        """
				+        Evaluate a single image.
				+
				+        Args:
				+            image_path: path to the image
				+            use_gpu: whether to use the GPU
				+            **kwargs: additional pipeline arguments
				+
				+        Returns:
				+            Result dict in OmniDocBench format
				+        """
			
 
				+        print(f"正在处理图像: {image_path}")
			
 
				+        
			
 
				+        # 读取图像获取尺寸信息
			
 
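				+        image = cv2.imread(image_path)
				+        if image is None:
				+            # cv2.imread returns None instead of raising when the file is missing or unreadable
				+            raise FileNotFoundError(f"Cannot read image: {image_path}")
				+        height, width = image.shape[:2]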
			
 
				+        
			
 
				+        # 运行PaddleX pipeline
			
 
				+        start_time = time.time()
			
 
				+        
			
 
				+        output = list(self.pipeline.predict(
			
 
				+            input=image_path,
			
 
				+            device="gpu" if use_gpu else "cpu",
			
 
				+            use_doc_orientation_classify=True,
			
 
				+            use_doc_unwarping=False,
			
 
				+            use_seal_recognition=True,
			
 
				+            use_chart_recognition=True,
			
 
				+            use_table_recognition=True,
			
 
				+            use_formula_recognition=True,
			
 
				+            **kwargs
			
 
				+        ))
			
 
				+        
			
 
				+        process_time = time.time() - start_time
			
 
				+        print(f"处理耗时: {process_time:.2f}秒")
			
 
				+        
			
 
				+        # 转换为OmniDocBench格式
			
 
				+        result = self._convert_to_omnidocbench_format(
			
 
				+            output, image_path, width, height
			
 
				+        )
			
 
				+        
			
 
				+        return result
			
 
				+    
			
 
				+    def _convert_to_omnidocbench_format(self, 
			
 
				+                                      paddlex_output: List, 
			
 
				+                                      image_path: str,
			
 
				+                                      width: int, 
			
 
				+                                      height: int) -> Dict[str, Any]:
			
 
				+        """
				+        Convert PaddleX output to OmniDocBench format.
				+
				+        Args:
				+            paddlex_output: list of PaddleX output results
				+            image_path: path to the image
				+            width: image width
				+            height: image height
				+
				+        Returns:
				+            Result in OmniDocBench format
				+        """
			
 
				+        layout_dets = []
			
 
				+        anno_id_counter = 0
			
 
				+        
			
 
				+        # 处理PaddleX的输出
			
 
				+        for res in paddlex_output:
			
 
				+            # 从parsing_res_list中提取布局信息
			
 
				+            if hasattr(res, 'parsing_res_list') and res.parsing_res_list:
			
 
				+                parsing_list = res.parsing_res_list
			
 
				+                
			
 
				+                for item in parsing_list:
			
 
				+                    # 提取边界框和类别
			
 
				+                    bbox = item.get('block_bbox', [])
			
 
				+                    category = item.get('block_label', 'text_block')
			
 
				+                    content = item.get('block_content', '')
			
 
				+                    
			
 
				+                    # 转换bbox格式 [x1, y1, x2, y2] -> [x1, y1, x2, y1, x2, y2, x1, y2]
			
 
				+                    if len(bbox) == 4:
			
 
				+                        x1, y1, x2, y2 = bbox
			
 
				+                        poly = [x1, y1, x2, y1, x2, y2, x1, y2]
			
 
				+                    else:
			
 
				+                        poly = bbox
			
 
				+                    
			
 
				+                    # 映射类别
			
 
				+                    omni_category = self.category_mapping.get(category, 'text_block')
			
 
				+                    
			
 
				+                    # 创建layout检测结果
			
 
				+                    layout_det = {
			
 
				+                        "category_type": omni_category,
			
 
				+                        "poly": poly,
			
 
				+                        "ignore": False,
			
 
				+                        "order": anno_id_counter,
			
 
				+                        "anno_id": anno_id_counter,
			
 
				+                    }
			
 
				+                    
			
 
				+                    # 添加文本识别结果
			
 
				+                    if content and content.strip():
			
 
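				+                        if omni_category == 'table':
				+                            # Table content is stored as HTML
				+                            layout_det["html"] = content
				+                        elif omni_category == 'equation_isolated':
				+                            # Formula content is stored as LaTeX, matching the "latex" field of the annotation format
				+                            layout_det["latex"] = content.strip()
				+                        else:
				+                            # Everything else is stored as plain text
				+                            layout_det["text"] = content.strip()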
			
 
				+                    
			
 
				+                    # 添加span级别的标注（从OCR结果中提取）
			
 
				+                    layout_det["line_with_spans"] = self._extract_spans_from_ocr(
			
 
				+                        res, bbox, omni_category
			
 
				+                    )
			
 
				+                    
			
 
				+                    # 添加属性标签
			
 
				+                    layout_det["attribute"] = self._extract_attributes(item, omni_category)
			
 
				+                    
			
 
				+                    layout_dets.append(layout_det)
			
 
				+                    anno_id_counter += 1
			
 
				+        
			
 
				+        # 构建完整结果
			
 
				+        result = {
			
 
				+            "layout_dets": layout_dets,
			
 
				+            "page_info": {
			
 
				+                "page_no": 0,
			
 
				+                "height": height,
			
 
				+                "width": width,
			
 
				+                "image_path": Path(image_path).name,
			
 
				+                "page_attribute": self._extract_page_attributes(paddlex_output)
			
 
				+            },
			
 
				+            "extra": {
			
 
				+                "relation": []  # 关系信息，需要根据具体情况提取
			
 
				+            }
			
 
				+        }
			
 
				+        
			
 
				+        return result
			
 
				+    
			
 
				+    def _extract_spans_from_ocr(self, res, block_bbox: List, category: str) -> List[Dict]:
			
 
				+        """Extract span-level annotations from the OCR result"""
			
 
				+        spans = []
			
 
				+        
			
 
				+        # 如果有OCR结果，提取相关的文本行
			
 
				+        if hasattr(res, 'overall_ocr_res') and res.overall_ocr_res:
			
 
				+            ocr_res = res.overall_ocr_res
			
 
				+            
			
 
				+            if hasattr(ocr_res, 'rec_texts') and hasattr(ocr_res, 'rec_boxes'):
			
 
				+                texts = ocr_res.rec_texts
			
 
				+                boxes = ocr_res.rec_boxes
			
 
				+                scores = getattr(ocr_res, 'rec_scores', [1.0] * len(texts))
			
 
				+                
			
 
				+                # 检查哪些OCR结果在当前block内
			
 
				+                if len(block_bbox) == 4:
			
 
				+                    x1, y1, x2, y2 = block_bbox
			
 
				+                    
			
 
				+                    for i, (text, box, score) in enumerate(zip(texts, boxes, scores)):
			
 
				+                        if len(box) >= 4:
			
 
				+                            # 检查OCR框是否在block内
			
 
				+                            ocr_x1, ocr_y1, ocr_x2, ocr_y2 = box[:4]
			
 
				+                            
			
 
				+                            # 简单的包含检查
			
 
				+                            if (ocr_x1 >= x1 and ocr_y1 >= y1 and 
			
 
				+                                ocr_x2 <= x2 and ocr_y2 <= y2):
			
 
				+                                
			
 
				+                                span = {
			
 
				+                                    "category_type": "text_span",
			
 
				+                                    "poly": [ocr_x1, ocr_y1, ocr_x2, ocr_y1, 
			
 
				+                                            ocr_x2, ocr_y2, ocr_x1, ocr_y2],
			
 
				+                                    "ignore": False,
			
 
				+                                    "text": text,
			
 
				+                                }
			
 
				+                                
			
 
				+                                # 如果置信度太低，可能需要忽略
			
 
				+                                if score < 0.5:
			
 
				+                                    span["ignore"] = True
			
 
				+                                
			
 
				+                                spans.append(span)
			
 
				+        
			
 
				+        return spans
			
 
				+    
			
 
				+    def _extract_attributes(self, item: Dict, category: str) -> Dict:
			
 
				+        """Extract attribute labels"""
			
 
				+        attributes = {}
			
 
				+        
			
 
				+        # 根据类别提取不同的属性
			
 
				+        if category == 'table':
			
 
				+            # 表格属性
			
 
				+            attributes.update({
			
 
				+                "table_layout": "vertical",  # 需要根据实际情况判断
			
 
				+                "with_span": False,          # 需要检查是否有合并单元格
			
 
				+                "line": "full_line",         # 需要检查线框类型
			
 
				+                "language": "table_simplified_chinese",  # 需要语言检测
			
 
				+                "include_equation": False,
			
 
				+                "include_backgroud": False,
			
 
				+                "table_vertical": False
			
 
				+            })
			
 
				+            
			
 
				+            # 检查表格内容是否有合并单元格
			
 
				+            content = item.get('block_content', '')
			
 
				+            if 'colspan' in content or 'rowspan' in content:
			
 
				+                attributes["with_span"] = True
			
 
				+                
			
 
				+        elif category in ['text_block', 'title']:
			
 
				+            # 文本属性
			
 
				+            attributes.update({
			
 
				+                "text_language": "text_simplified_chinese",
			
 
				+                "text_background": "white",
			
 
				+                "text_rotate": "normal"
			
 
				+            })
			
 
				+            
			
 
				+        elif 'equation' in category:
			
 
				+            # 公式属性
			
 
				+            attributes.update({
			
 
				+                "formula_type": "print"
			
 
				+            })
			
 
				+        
			
 
				+        return attributes
			
 
				+    
			
 
				+    def _extract_page_attributes(self, paddlex_output) -> Dict:
			
 
				+        """Extract page-level attributes"""
			
 
				+        return {
			
 
				+            "data_source": "research_report",  # 需要根据实际情况判断
			
 
				+            "language": "simplified_chinese",
			
 
				+            "layout": "single_column",
			
 
				+            "watermark": False,
			
 
				+            "fuzzy_scan": False,
			
 
				+            "colorful_backgroud": False
			
 
				+        }
			
 
				+    
			
 
				+    def load_existing_result(self, result_path: str) -> Dict[str, Any]:
			
 
				+        """
				+        Load an existing PaddleX result file and convert it.
				+
				+        Args:
				+            result_path: path to the PaddleX result JSON file
				+
				+        Returns:
				+            Result dict in OmniDocBench format
				+        """
			
 
				+        with open(result_path, 'r', encoding='utf-8') as f:
			
 
				+            data = json.load(f)
			
 
				+        
			
 
				+        # 从结果文件中提取图像信息
			
 
				+        input_path = data.get('input_path', '')
			
 
				+        
			
 
				+        # 读取图像获取尺寸
			
 
				+        image = cv2.imread(input_path) if input_path and Path(input_path).exists() else None
				+        if image is not None:
				+            height, width = image.shape[:2]
				+            image_name = Path(input_path).name
				+        else:
				+            # Fall back to defaults when the image is missing or unreadable (cv2.imread returns None)
				+            height, width = 1600, 1200
				+            image_name = "unknown.png"
			
 
				+        
			
 
				+        # 转换格式
			
 
				+        result = self._convert_paddlex_result_to_omnidocbench(
			
 
				+            data, image_name, width, height
			
 
				+        )
			
 
				+        
			
 
				+        return result
			
 
				+    
			
 
				+    def _convert_paddlex_result_to_omnidocbench(self, 
			
 
				+                                              paddlex_result: Dict,
			
 
				+                                              image_name: str,
			
 
				+                                              width: int, 
			
 
				+                                              height: int) -> Dict[str, Any]:
			
 
				+        """
				+        Convert an existing PaddleX result to OmniDocBench format.
				+        """
			
 
				+        layout_dets = []
			
 
				+        anno_id_counter = 0
			
 
				+        
			
 
				+        # 从parsing_res_list中提取布局信息
			
 
				+        parsing_list = paddlex_result.get('parsing_res_list', [])
			
 
				+        
			
 
				+        for item in parsing_list:
			
 
				+            # 提取边界框和类别
			
 
				+            bbox = item.get('block_bbox', [])
			
 
				+            category = item.get('block_label', 'text_block')
			
 
				+            content = item.get('block_content', '')
			
 
				+            
			
 
				+            # 转换bbox格式
			
 
				+            if len(bbox) == 4:
			
 
				+                x1, y1, x2, y2 = bbox
			
 
				+                poly = [x1, y1, x2, y1, x2, y2, x1, y2]
			
 
				+            else:
			
 
				+                poly = bbox
			
 
				+            
			
 
				+            # 映射类别
			
 
				+            omni_category = self.category_mapping.get(category, 'text_block')
			
 
				+            
			
 
				+            # 创建layout检测结果
			
 
				+            layout_det = {
			
 
				+                "category_type": omni_category,
			
 
				+                "poly": poly,
			
 
				+                "ignore": False,
			
 
				+                "order": anno_id_counter,
			
 
				+                "anno_id": anno_id_counter,
			
 
				+            }
			
 
				+            
			
 
				+            # 添加内容
			
 
				+            if content and content.strip():
			
 
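				+                if omni_category == 'table':
				+                    # Table content is stored as HTML
				+                    layout_det["html"] = content
				+                elif omni_category == 'equation_isolated':
				+                    # Formula content is stored as LaTeX, matching the "latex" field of the annotation format
				+                    layout_det["latex"] = content.strip()
				+                else:
				+                    layout_det["text"] = content.strip()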
			
 
				+            
			
 
				+            # 添加属性
			
 
				+            layout_det["attribute"] = self._extract_attributes(item, omni_category)
			
 
				+            layout_det["line_with_spans"] = []  # 简化处理
			
 
				+            
			
 
				+            layout_dets.append(layout_det)
			
 
				+            anno_id_counter += 1
			
 
				+        
			
 
				+        # 构建完整结果
			
 
				+        result = {
			
 
				+            "layout_dets": layout_dets,
			
 
				+            "page_info": {
			
 
				+                "page_no": 0,
			
 
				+                "height": height,
			
 
				+                "width": width,
			
 
				+                "image_path": image_name,
			
 
				+                "page_attribute": {
			
 
				+                    "data_source": "research_report",
			
 
				+                    "language": "simplified_chinese",
			
 
				+                    "layout": "single_column",
			
 
				+                    "watermark": False,
			
 
				+                    "fuzzy_scan": False,
			
 
				+                    "colorful_backgroud": False
			
 
				+                }
			
 
				+            },
			
 
				+            "extra": {
			
 
				+                "relation": []
			
 
				+            }
			
 
				+        }
			
 
				+        
			
 
				+        return result
			
 
				+
			
 
				+def convert_existing_results():
			
 
				+    """Convert existing PaddleX results"""
			
 
				+    evaluator = OmniDocBenchEvaluator()
			
 
				+    
			
 
				+    # 示例：转换单个结果文件
			
 
				+    result_file = "./sample_data/single_pipeline_output/PP-StructureV3-zhch/300674-母公司现金流量表-扫描_res.json"
			
 
				+    
			
 
				+    if Path(result_file).exists():
			
 
				+        print(f"正在转换结果文件: {result_file}")
			
 
				+        
			
 
				+        omnidocbench_result = evaluator.load_existing_result(result_file)
			
 
				+        
			
 
				+        # 保存转换后的结果
			
 
				+        output_file = "./omnidocbench_converted_result.json"
			
 
				+        with open(output_file, 'w', encoding='utf-8') as f:
			
 
				+            json.dump([omnidocbench_result], f, ensure_ascii=False, indent=2)
			
 
				+        
			
 
				+        print(f"转换完成，结果保存至: {output_file}")
			
 
				+        print(f"检测到的布局元素数量: {len(omnidocbench_result['layout_dets'])}")
			
 
				+        
			
 
				+        # 显示检测到的元素
			
 
				+        for i, item in enumerate(omnidocbench_result['layout_dets']):
			
 
				+            print(f"  {i+1}. {item['category_type']}: {item.get('text', item.get('html', ''))[:50]}...")
			
 
				+    
			
 
				+    else:
			
 
				+        print(f"结果文件不存在: {result_file}")
			
 
				+
			
 
				+if __name__ == "__main__":
			
 
				+    convert_existing_results()
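
`_extract_spans_from_ocr` above keeps an OCR line only when its box lies entirely inside the block box, so a line that sticks out by a pixel is dropped. One possible alternative, shown here as a standalone sketch and not as part of the script, assigns a line to a block when most of its area overlaps the block. The helper names and the `min_ratio=0.8` threshold are assumptions chosen for illustration.

```python
from typing import List, Sequence


def bbox_to_poly(bbox: Sequence[float]) -> List[float]:
    """[x1, y1, x2, y2] -> corner polygon in top-left, top-right, bottom-right, bottom-left order."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2, y1, x2, y2, x1, y2]


def overlap_ratio(inner: Sequence[float], outer: Sequence[float]) -> float:
    """Fraction of the `inner` box area that is covered by the `outer` box."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def spans_in_block(block_bbox: Sequence[float],
                   ocr_boxes: Sequence[Sequence[float]],
                   min_ratio: float = 0.8) -> List[int]:
    """Return indices of OCR boxes whose overlap with the block reaches min_ratio."""
    return [
        index for index, box in enumerate(ocr_boxes)
        if overlap_ratio(box[:4], block_bbox) >= min_ratio
    ]
```

A ratio-based check tolerates small misalignments between the layout and OCR boxes, at the cost of occasionally assigning a line to two adjacent blocks when they overlap.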