9 commits 8b7897cee9 ... 19be083b28

Author SHA1 Message Date
  zhch158_admin 19be083b28 feat: rework the initializer to accept a config dict, removing the dependency on load_config 1 week ago
  zhch158_admin 2643734c43 feat: initialize the config manager and polish display info for documents and OCR tools 1 week ago
  zhch158_admin afc9e3d481 feat: add a config manager with layered configuration, data-source auto-discovery, and Jinja2 template variables 1 week ago
  zhch158_admin 206d52f443 Implement code changes to enhance functionality and improve performance 1 week ago
  zhch158_admin 776f9654da feat: remove the MinerU/PaddleOCR result-merging script; streamline code structure 1 week ago
  zhch158_admin 21757ecf65 feat: add per-document OCR tool config files to manage OCR results for different documents 1 week ago
  zhch158_admin a9a8e8cf3b feat: add log subdirectories and global logging config; improve processor log management 1 week ago
  zhch158_admin 586f15b189 feat: add log-redirection support; improve log management in the PDF batch processor 1 week ago
  zhch158_admin 9c8d546753 feat: add batch merging of OCR results with log redirection and automatic processor detection 1 week ago

+ 0 - 587
STREAMLIT_GUIDE.md

@@ -1,587 +0,0 @@
-# 🚀 Streamlit OCR可视化校验工具使用指南
-
-## 🎯 工具介绍
-
-基于Streamlit开发的OCR可视化校验工具,提供现代化的Web界面和丰富的交互体验,让OCR结果校验变得直观高效。特别新增了强大的表格数据分析功能。
-
-### 🔧 核心功能
-
-- ✅ **实时交互**: 点击文本即时高亮图片位置
-- ✅ **动态过滤**: 搜索、类别筛选、条件过滤
-- ✅ **数据表格**: 可排序的详细数据视图
-- ✅ **统计信息**: 实时统计和进度跟踪
-- ✅ **错误标记**: 一键标记和管理识别错误
-- ✅ **报告导出**: 生成详细的校验报告
-- ⭐ **表格分析**: HTML表格转DataFrame,支持过滤、排序、导出
-- ⭐ **多种渲染**: HTML/Markdown/DataFrame/原始文本四种显示模式
-
-## 🚀 快速启动
-
-### 1. 安装依赖
-
-```bash
-# 安装Streamlit和相关依赖
-pip install streamlit plotly pandas pillow numpy opencv-python openpyxl
-```
-
-### 2. 启动应用
-
-```bash
-# 方法1: 完整功能版本
-python -m streamlit run streamlit_ocr_validator.py
-
-# 方法2: 使用启动脚本
-python run_streamlit_validator.py
-
-# 方法3: 开发模式(自动重载)
-streamlit run streamlit_ocr_validator.py --server.runOnSave true
-```
-
-### 3. 访问界面
-
-浏览器会自动打开 http://localhost:8501,如果没有自动打开,请手动访问该地址。
-
-## 🖥️ 界面使用指南
-
-### 主界面布局
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│ 🔍 OCR可视化校验工具                                                 │
-├─────────────────────────────────────────────────────────────────────┤
-│ 📊 总文本块: 13  🔗 可点击: 9  ❌ 标记错误: 2  ✅ 准确率: 85.7%      │
-├─────────────────────────┬───────────────────────────────────────────┤
-│ 📄 OCR识别内容            │ 🖼️ 原图标注                               │
-│                        │                                           │
-│ 🔍 搜索框               │ [显示选中位置的红框标注]                   │
-│ 📍 选择文本下拉框        │ [图片缩放和详细信息]                       │
-│                        │                                           │
-│ 📝 MD内容预览           │ 📍 选中文本详情                           │
-│ [4种渲染模式选择]        │ - 文本内容: xxx                           │
-│ ○ HTML渲染             │ - 位置: [x1,y1,x2,y2]                    │
-│ ● Markdown渲染         │ - 宽度: xxx px                            │
-│ ○ DataFrame表格 ⭐      │ - 高度: xxx px                            │
-│ ○ 原始文本             │                                           │
-│                        │                                           │
-│ 🎯 可点击文本列表        │                                           │
-│ [📍 文本1] [❌] [✅]     │                                           │
-│ [📍 文本2] [❌] [✅]     │                                           │
-└─────────────────────────┴───────────────────────────────────────────┘
-```
-
-### 侧边栏功能
-
-```
-┌─────────────────────┐
-│ 📁 文件选择          │
-│ [选择OCR结果文件]    │
-│ [🔄 加载文件]       │
-│                    │
-│ 🎛️ 控制面板         │
-│ [🧹 清除选择]       │
-│ [❌ 清除错误标记]   │
-│                    │
-│ 📊 表格快捷操作 ⭐   │
-│ [🔍 快速预览表格]   │
-│ [📥 一键导出所有表格] │
-│                    │
-│ 🔧 调试信息         │
-│ [调试信息开关]      │
-└─────────────────────┘
-```
-
-### 使用步骤
-
-1. **选择文件**
-   - 从侧边栏下拉框中选择OCR结果JSON文件
-   - 点击"🔄 加载文件"按钮加载数据
-   - 查看顶部统计信息确认加载成功
-
-2. **浏览统计信息**
-   - 查看总文本块数量
-   - 了解可点击文本数量
-   - 确认图片是否正确加载
-   - 查看当前准确率
-
-3. **交互校验**
-   - 使用下拉框选择要校验的文本
-   - 点击左侧的"📍 文本内容"按钮
-   - 观察右侧图片上的红色框标注
-   - 查看右下角显示的详细位置信息
-
-4. **搜索过滤**
-   - 使用搜索框快速定位特定文本
-   - 在MD内容预览中查看完整识别结果
-   - 通过搜索结果快速定位问题文本
-
-5. **表格数据分析** ⭐ 新增
-   - 选择"DataFrame表格"渲染模式
-   - 查看自动解析的HTML表格
-   - 使用过滤、排序功能分析数据
-   - 查看表格统计信息
-   - 导出CSV或Excel文件
-
-6. **错误标记管理**
-   - 点击文本旁边的"❌"按钮标记错误
-   - 点击"✅"按钮取消错误标记
-   - 观察准确率的实时变化
-   - 使用侧边栏批量清除标记
-
-## 🎨 高级功能
-
-### 完整版本独有功能 (`streamlit_ocr_validator.py`)
-
-#### 📊 错误标记系统
-- **标记错误**: 点击文本旁边的"❌"按钮标记识别错误
-- **取消标记**: 点击"✅"按钮取消错误标记
-- **统计准确率**: 自动计算识别准确率
-- **错误过滤**: 只显示标记为错误的文本
-- **批量操作**: 侧边栏提供批量清除功能
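-
-A minimal sketch of how error marks could be tracked in `st.session_state` (the `error_keys` set and the block ids are illustrative, not the tool's actual state layout):
-
-```python
-import streamlit as st
-
-# Keep marked errors as a set of text-block ids that survives reruns
-if 'error_keys' not in st.session_state:
-    st.session_state.error_keys = set()
-
-def toggle_error(block_id: str):
-    """Mark or unmark a text block as an OCR error."""
-    if block_id in st.session_state.error_keys:
-        st.session_state.error_keys.discard(block_id)
-    else:
-        st.session_state.error_keys.add(block_id)
-
-total_blocks = 13                          # e.g. total text blocks on the page
-errors = len(st.session_state.error_keys)
-st.metric("准确率", f"{(total_blocks - errors) / total_blocks * 100:.1f}%")
-```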
-
-#### 📈 表格数据分析 ⭐ 核心新功能
-- **智能表格检测**: 自动识别HTML表格内容
-- **DataFrame转换**: 将HTML表格转换为可操作的pandas DataFrame
-- **多维度操作**: 支持过滤、排序、搜索等操作
-- **统计分析**: 自动生成表格行列数、数值列统计等信息
-- **数据导出**: 支持CSV、Excel格式导出
-- **可视化图表**: 基于表格数据生成统计图表
-
-#### 🎛️ 多种渲染模式
-- **HTML渲染**: 原生HTML表格显示,保持格式
-- **Markdown渲染**: 转换为Markdown表格格式
-- **DataFrame表格**: 转换为可交互的数据表格 ⭐
-- **原始文本**: 纯文本格式显示
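-
-A minimal sketch of how the mode switch could be wired with `st.radio` (assuming `md_content` holds the recognized Markdown/HTML and `validator` is the loaded validator instance):
-
-```python
-import streamlit as st
-
-mode = st.radio("渲染模式", ["HTML渲染", "Markdown渲染", "DataFrame表格", "原始文本"])
-
-if mode == "HTML渲染":
-    st.markdown(md_content, unsafe_allow_html=True)        # raw HTML, keeps table formatting
-elif mode == "Markdown渲染":
-    st.markdown(md_content)                                # rendered as Markdown
-elif mode == "DataFrame表格":
-    validator.display_html_table_as_dataframe(md_content)  # interactive table (see below)
-else:
-    st.text(md_content)                                    # plain text
-```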
-
-#### 🔧 侧边栏控制
-- **文件管理**: 侧边栏选择和管理OCR文件
-- **控制面板**: 清除选择、清除错误标记等操作
-- **表格快捷操作**: 快速预览和导出表格功能 ⭐
-- **调试信息**: 详细的系统状态和数据信息
-
-#### 📊 过滤和筛选
-- **多条件过滤**: 按类别、错误状态、尺寸等多重筛选
-- **实时搜索**: 动态搜索文本内容
-- **数据表格**: 可排序、可筛选的完整数据视图
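-
-A minimal filtering sketch over the text-block DataFrame (the `df` variable and the column names `text`, `category`, `is_error` are assumptions about the data layout):
-
-```python
-import streamlit as st
-
-query = st.text_input("🔍 搜索文本")
-categories = st.multiselect("类别", options=sorted(df["category"].unique()))
-only_errors = st.checkbox("只看标记错误")
-
-filtered = df
-if query:
-    filtered = filtered[filtered["text"].str.contains(query, na=False)]
-if categories:
-    filtered = filtered[filtered["category"].isin(categories)]
-if only_errors:
-    filtered = filtered[filtered["is_error"]]
-
-st.dataframe(filtered)
-```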
-
-### 表格分析功能详解 ⭐
-
-#### 功能特性
-
-```python
-# 表格检测和转换
-if '<table' in display_content.lower():  # also matches <table class="..."> etc.
-    st.session_state.validator.display_html_table_as_dataframe(display_content)
-else:
-    st.info("当前内容中没有检测到HTML表格")
-```
-
-#### 支持的表格操作
-
-1. **基础操作**
-   - 自动检测HTML表格
-   - 转换为pandas DataFrame
-   - 表格信息统计(行数、列数、列名等)
-
-2. **数据过滤**
-   - 按列内容过滤
-   - 支持文本搜索
-   - 条件筛选
-
-3. **数据排序**
-   - 按任意列排序
-   - 升序/降序选择
-   - 多列排序
-
-4. **统计分析**
-   - 数值列描述性统计
-   - 数据类型分析
-   - 缺失值统计
-
-5. **数据导出**
-   - CSV格式导出
-   - Excel格式导出
-   - 支持过滤后数据导出
-
-#### 使用示例
-
-```python
-# 在streamlit界面中
-def display_html_table_as_dataframe(self, html_content: str, enable_editing: bool = False):
-    """将HTML表格解析为DataFrame显示"""
-    import pandas as pd
-    from io import StringIO, BytesIO
-    
-    try:
-        # 使用pandas直接读取HTML表格
-        tables = pd.read_html(StringIO(html_content))
-        if tables:
-            for i, table in enumerate(tables):
-                st.subheader(f"📊 表格 {i+1}")
-                
-                # 创建表格操作按钮
-                col1, col2, col3, col4 = st.columns(4)
-                with col1:
-                    show_info = st.checkbox(f"显示表格信息", key=f"info_{i}")
-                with col2:
-                    show_stats = st.checkbox(f"显示统计信息", key=f"stats_{i}")
-                with col3:
-                    enable_filter = st.checkbox(f"启用过滤", key=f"filter_{i}")
-                with col4:
-                    enable_sort = st.checkbox(f"启用排序", key=f"sort_{i}")
-                
-                # 显示表格
-                st.dataframe(table, width='stretch')
-```
-
-## 🔧 自定义开发
-
-### 扩展功能开发
-
-#### 1. 添加新的表格处理功能 ⭐
-
-```python
-import numpy as np
-import plotly.express as px
-import streamlit as st
-
-def create_table_visualization(df):
-    """创建表格数据可视化"""
-    if not df.empty:
-        numeric_cols = df.select_dtypes(include=[np.number]).columns
-        
-        if len(numeric_cols) > 0:
-            # 创建统计图表
-            fig = px.bar(
-                x=df.index,
-                y=df[numeric_cols[0]],
-                title=f"{numeric_cols[0]} 分布"
-            )
-            st.plotly_chart(fig, width='stretch')
-            
-            # 创建散点图
-            if len(numeric_cols) > 1:
-                fig_scatter = px.scatter(
-                    df, 
-                    x=numeric_cols[0], 
-                    y=numeric_cols[1],
-                    title=f"{numeric_cols[0]} vs {numeric_cols[1]}"
-                )
-                st.plotly_chart(fig_scatter, width='stretch')
-
-# 在主应用中使用
-if st.checkbox("显示数据可视化"):
-    create_table_visualization(filtered_table)
-```
-
-#### 2. 高级表格编辑功能
-
-```python
-def advanced_table_editor(df):
-    """高级表格编辑器"""
-    st.subheader("🔧 高级编辑")
-    
-    # 数据编辑
-    edited_df = st.data_editor(
-        df,
-        width='stretch',
-        num_rows="dynamic",  # 允许添加删除行
-        key="advanced_editor"
-    )
-    
-    # 数据验证
-    if not edited_df.equals(df):
-        st.success("✏️ 数据已修改")
-        
-        # 显示变更统计
-        changes = len(edited_df) - len(df)
-        st.info(f"行数变化: {changes:+d}")
-        
-        # 导出修改后的数据
-        if st.button("💾 保存修改"):
-            csv_data = edited_df.to_csv(index=False)
-            st.download_button(
-                "下载修改后的数据",
-                csv_data,
-                "modified_table.csv",
-                "text/csv"
-            )
-    
-    return edited_df
-```
-
-#### 3. 批量表格处理
-
-```python
-from io import StringIO
-
-import pandas as pd
-import streamlit as st
-
-def batch_table_processing():
-    """批量表格处理功能"""
-    st.subheader("📦 批量表格处理")
-    
-    uploaded_files = st.file_uploader(
-        "上传多个包含表格的文件", 
-        type=['md', 'html'], 
-        accept_multiple_files=True
-    )
-    
-    if uploaded_files and st.button("开始批量处理"):
-        all_tables = []
-        progress_bar = st.progress(0)
-        
-        for i, file in enumerate(uploaded_files):
-            content = file.read().decode('utf-8')
-            
-            if '<table' in content.lower():
-                tables = pd.read_html(StringIO(content))
-                for j, table in enumerate(tables):
-                    table['source_file'] = file.name
-                    table['table_index'] = j
-                    all_tables.append(table)
-            
-            progress_bar.progress((i + 1) / len(uploaded_files))
-        
-        if all_tables:
-            st.success(f"✅ 共处理 {len(all_tables)} 个表格")
-            
-            # 合并所有表格
-            if st.checkbox("合并所有表格"):
-                try:
-                    merged_df = pd.concat(all_tables, ignore_index=True)
-                    st.dataframe(merged_df)
-                    
-                    # 导出合并结果
-                    csv_data = merged_df.to_csv(index=False)
-                    st.download_button(
-                        "下载合并表格",
-                        csv_data,
-                        "merged_tables.csv",
-                        "text/csv"
-                    )
-                except Exception as e:
-                    st.error(f"合并失败: {e}")
-```
-
-#### 4. 表格数据质量检查 ⭐
-
-```python
-def table_quality_check(df):
-    """表格数据质量检查"""
-    st.subheader("🔍 数据质量检查")
-    
-    # 基础统计
-    col1, col2, col3 = st.columns(3)
-    with col1:
-        st.metric("总行数", len(df))
-    with col2:
-        st.metric("总列数", len(df.columns))
-    with col3:
-        null_percent = (df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100
-        st.metric("缺失值比例", f"{null_percent:.1f}%")
-    
-    # 详细质量报告
-    quality_issues = []
-    
-    # 检查空值
-    null_cols = df.columns[df.isnull().any()].tolist()
-    if null_cols:
-        quality_issues.append(f"发现 {len(null_cols)} 列存在空值: {', '.join(null_cols)}")
-    
-    # 检查重复行
-    duplicate_rows = df.duplicated().sum()
-    if duplicate_rows > 0:
-        quality_issues.append(f"发现 {duplicate_rows} 行重复数据")
-    
-    # 检查数据类型一致性
-    for col in df.columns:
-        if df[col].dtype == 'object':
-            # 检查是否应该是数值类型
-            numeric_like = df[col].astype(str).str.replace(',', '', regex=False).str.replace('$', '', regex=False)
-            try:
-                pd.to_numeric(numeric_like, errors='raise')
-                quality_issues.append(f"列 '{col}' 可能应该是数值类型")
-            except (ValueError, TypeError):
-                pass
-    
-    if quality_issues:
-        st.warning("⚠️ 发现数据质量问题:")
-        for issue in quality_issues:
-            st.write(f"- {issue}")
-    else:
-        st.success("✅ 数据质量良好")
-```
-
-## 📊 性能优化
-
-### 1. 缓存优化
-
-```python
-@st.cache_data
-def load_and_process_ocr_data(file_path: str):
-    """缓存OCR数据加载和处理"""
-    with open(file_path, 'r') as f:
-        ocr_data = json.load(f)
-    
-    # 处理数据
-    processed_data = process_ocr_data(ocr_data)
-    return processed_data
-
-@st.cache_resource
-def load_image(image_path: str):
-    """缓存图片加载"""
-    return Image.open(image_path)
-
-@st.cache_data
-def parse_html_tables(html_content: str):
-    """缓存表格解析结果"""
-    try:
-        tables = pd.read_html(StringIO(html_content))
-        return tables
-    except ValueError:
-        # pd.read_html raises ValueError when no tables are found
-        return []
-```
-
-### 2. 大文件处理
-
-```python
-def handle_large_tables(df):
-    """处理大型表格"""
-    if 'page_size' not in st.session_state:
-        st.session_state.page_size = 100
-    
-    # 分页显示表格
-    if not df.empty:
-        total_rows = len(df)
-        pages = (total_rows - 1) // st.session_state.page_size + 1
-        
-        col1, col2, col3 = st.columns([1, 2, 1])
-        with col2:
-            current_page = st.slider("页数", 1, pages, 1) if pages > 1 else 1
-        
-        # 显示当前页数据
-        start_idx = (current_page - 1) * st.session_state.page_size
-        end_idx = min(start_idx + st.session_state.page_size, total_rows)
-        current_df = df.iloc[start_idx:end_idx]
-        
-        st.dataframe(current_df, width='stretch')
-        st.info(f"显示第 {start_idx+1}-{end_idx} 行,共 {total_rows} 行")
-```
-
-### 3. 内存优化
-
-```python
-def optimize_dataframe_memory(df):
-    """优化DataFrame内存使用"""
-    initial_memory = df.memory_usage(deep=True).sum()
-    
-    # 优化数值类型
-    for col in df.select_dtypes(include=['int']).columns:
-        df[col] = pd.to_numeric(df[col], downcast='integer')
-    
-    for col in df.select_dtypes(include=['float']).columns:
-        df[col] = pd.to_numeric(df[col], downcast='float')
-    
-    # 优化字符串类型
-    for col in df.select_dtypes(include=['object']).columns:
-        if df[col].nunique() < len(df) * 0.5:  # 如果唯一值少于50%,转换为category
-            df[col] = df[col].astype('category')
-    
-    final_memory = df.memory_usage(deep=True).sum()
-    reduction = (initial_memory - final_memory) / initial_memory * 100
-    
-    st.info(f"内存优化:减少 {reduction:.1f}% ({initial_memory/1024/1024:.1f}MB → {final_memory/1024/1024:.1f}MB)")
-    
-    return df
-```
-
-## 🚀 部署指南
-
-### 本地开发部署
-
-```bash
-# 开发模式运行(自动重载)
-streamlit run streamlit_ocr_validator.py --server.runOnSave true
-
-# 指定端口运行
-streamlit run streamlit_ocr_validator.py --server.port 8502
-
-# 指定主机运行(局域网访问)
-streamlit run streamlit_ocr_validator.py --server.address 0.0.0.0
-```
-
-### Docker部署
-
-```dockerfile
-FROM python:3.9-slim
-
-WORKDIR /app
-
-# 安装系统依赖
-RUN apt-get update && apt-get install -y \
-    libgl1-mesa-glx \
-    libglib2.0-0 \
-    libsm6 \
-    libxext6 \
-    libxrender-dev \
-    libgomp1 \
-    && rm -rf /var/lib/apt/lists/*
-
-# 安装Python依赖
-COPY requirements.txt .
-RUN pip install -r requirements.txt
-
-COPY . .
-
-EXPOSE 8501
-CMD ["streamlit", "run", "streamlit_ocr_validator.py", "--server.address=0.0.0.0"]
-```
-
-### Streamlit Cloud部署
-
-1. 将代码推送到GitHub仓库
-2. 访问 https://share.streamlit.io/
-3. 连接GitHub仓库并部署
-4. 设置环境变量(如API密钥)
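-
-On Streamlit Cloud, values configured under the app's secrets are read through `st.secrets` rather than plain environment variables; a minimal sketch (the key name `API_KEY` is illustrative):
-
-```python
-import streamlit as st
-
-api_key = st.secrets.get("API_KEY", "")  # falls back to "" when the secret is absent
-```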
-
-## 💡 最佳实践
-
-### 1. 用户体验优化
-
-- **加载状态**: 使用`st.spinner()`显示加载状态
-- **错误处理**: 使用`st.error()`友好地显示错误信息
-- **进度提示**: 使用`st.progress()`显示处理进度
-- **数据缓存**: 合理使用`@st.cache_data`提升性能
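-
-A short sketch combining these pieces (the file path and the "blocks" key are placeholders, not the tool's actual schema):
-
-```python
-import json
-import streamlit as st
-
-@st.cache_data
-def load_ocr(path: str) -> dict:
-    with open(path, "r", encoding="utf-8") as f:
-        return json.load(f)
-
-try:
-    with st.spinner("加载OCR结果..."):
-        data = load_ocr("result.json")      # placeholder path
-    blocks = data.get("blocks", [])         # "blocks" key is an assumption
-    progress = st.progress(0)
-    for i, block in enumerate(blocks):
-        ...                                 # process one text block
-        progress.progress((i + 1) / max(len(blocks), 1))
-except Exception as e:
-    st.error(f"加载失败: {e}")
-```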
-
-### 2. 界面设计
-
-- **布局清晰**: 使用`st.columns()`合理分布内容
-- **视觉层次**: 使用不同级别的标题和分隔符
-- **交互反馈**: 及时响应用户操作
-- **移动友好**: 考虑不同屏幕尺寸的适配
-
-### 3. 表格处理最佳实践 ⭐
-
-- **大表格处理**: 对超过1000行的表格启用分页显示
-- **内存管理**: 使用数据类型优化减少内存使用
-- **导出优化**: 大表格导出时显示进度条
-- **错误处理**: 优雅处理表格解析失败的情况
-
-### 4. 数据安全
-
-- **输入验证**: 验证上传文件的格式和内容
-- **错误处理**: 妥善处理异常情况
-- **资源清理**: 及时清理临时文件和内存
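-
-A minimal validation sketch for an uploaded OCR JSON file (the 50 MB limit is an illustrative choice):
-
-```python
-import json
-import streamlit as st
-
-uploaded = st.file_uploader("上传OCR结果", type=["json"])
-if uploaded is not None:
-    if uploaded.size > 50 * 1024 * 1024:   # reject files over 50 MB
-        st.error("文件过大")
-    else:
-        try:
-            data = json.loads(uploaded.read().decode("utf-8"))
-        except (UnicodeDecodeError, json.JSONDecodeError) as e:
-            st.error(f"文件格式无效: {e}")
-```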
-
-## 🎉 总结
-
-Streamlit版本的OCR校验工具经过升级后提供了更加强大的功能:
-
-✅ **基础功能**:实时交互、动态更新、错误管理  
-✅ **表格分析**:HTML表格转DataFrame、多种操作、导出功能 ⭐  
-✅ **数据处理**:过滤、排序、统计分析、可视化 ⭐  
-✅ **批量操作**:多文件处理、批量导出、合并功能 ⭐  
-✅ **质量检查**:数据质量分析、问题检测、优化建议 ⭐  
-✅ **扩展性**:易于添加新功能和自定义组件  
-✅ **用户体验**:现代化界面、响应式设计、直观操作  
-
-新增的表格分析功能使其不仅能够校验OCR结果,更能深入分析表格数据,成为一个完整的OCR数据处理工作台!
-
----
-
-> 🌟 **特别推荐**:使用DataFrame表格模式分析财务报表等结构化数据,体验完整的数据处理工作流程。

+ 730 - 0
batch_ocr/batch_merge_results.py

@@ -0,0 +1,730 @@
+#!/usr/bin/env python3
+"""
+批量合并 OCR 结果
+自动读取配置文件,对所有 VL 处理器的输出进行 bbox 合并
+支持执行器输出日志重定向
+"""
+
+import sys
+import time
+import yaml
+import argparse
+import subprocess
+from pathlib import Path
+from datetime import datetime
+from typing import Dict, List, Optional, Any
+from dataclasses import dataclass
+import logging
+from tqdm import tqdm
+
+# 添加 merger 模块路径
+sys.path.insert(0, str(Path(__file__).parent.parent / 'merger'))
+
+
+@dataclass
+class MergeTask:
+    """合并任务"""
+    processor_name: str
+    vl_result_dir: Path
+    paddle_result_dir: Path
+    output_dir: Path
+    merger_script: str
+    description: str
+    log_file: str = ""  # 🎯 新增:日志文件路径
+
+
+class BatchMerger:
+    """批量合并器"""
+    
+    # VL 处理器类型映射到合并脚本
+    MERGER_SCRIPTS = {
+        'paddleocr_vl': 'merge_paddleocr_vl_paddleocr.py',
+        'mineru': 'merge_mineru_paddle_ocr.py',
+        'dotsocr': 'merge_mineru_paddle_ocr.py',  # DotsOCR 也用 MinerU 格式
+    }
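+    # To support a new VL processor type, add a mapping here and drop the merger
+    # script into the merger/ directory, e.g. (hypothetical):
+    #   'my_vl': 'merge_my_vl_paddleocr.py'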
+    
+    def __init__(self, config_file: str, base_dir: Optional[str] = None):
+        """
+        Args:
+            config_file: processor_configs.yaml 路径
+            base_dir: PDF 基础目录,覆盖配置文件中的设置
+        """
+        self.config_file = Path(config_file)
+        self.config = self._load_config()
+        self.base_dir = Path(base_dir) if base_dir else Path(self.config['global']['base_dir'])
+        
+        # 🎯 日志基础目录
+        self.log_base_dir = self.base_dir / self.config['global'].get('log_dir', 'logs')
+        
+        # 设置日志
+        self.logger = self._setup_logger()
+        
+        # merger 脚本目录
+        self.merger_dir = Path(__file__).parent.parent / 'merger'
+        
+        # 🎯 统计信息
+        self.merge_results: List[Dict[str, Any]] = []
+    
+    def _load_config(self) -> Dict:
+        """加载配置文件"""
+        with open(self.config_file, 'r', encoding='utf-8') as f:
+            return yaml.safe_load(f)
+    
+    def _setup_logger(self) -> logging.Logger:
+        """设置日志"""
+        logger = logging.getLogger('BatchMerger')
+        logger.setLevel(logging.INFO)
+        
+        if not logger.handlers:
+            console_handler = logging.StreamHandler()
+            console_handler.setLevel(logging.INFO)
+            formatter = logging.Formatter(
+                '%(asctime)s - %(levelname)s - %(message)s',
+                datefmt='%Y-%m-%d %H:%M:%S'
+            )
+            console_handler.setFormatter(formatter)
+            logger.addHandler(console_handler)
+        
+        return logger
+    
+    def _detect_processor_type(self, processor_name: str) -> str:
+        """
+        检测处理器类型
+        
+        Returns:
+            'paddleocr_vl', 'mineru', 'dotsocr', 'ppstructure' 等
+        """
+        name_lower = processor_name.lower()
+        
+        if 'paddleocr_vl' in name_lower or 'paddleocr-vl' in name_lower:
+            return 'paddleocr_vl'
+        elif 'mineru' in name_lower:
+            return 'mineru'
+        elif 'dotsocr' in name_lower or 'dots' in name_lower:
+            return 'dotsocr'
+        elif 'ppstructure' in name_lower or 'pp-structure' in name_lower:
+            return 'ppstructure'
+        else:
+            return 'unknown'
+    
+    def _get_merger_script(self, processor_type: str) -> Optional[str]:
+        """获取合并脚本路径"""
+        script_name = self.MERGER_SCRIPTS.get(processor_type)
+        if not script_name:
+            return None
+        
+        script_path = self.merger_dir / script_name
+        return str(script_path) if script_path.exists() else None
+    
+    def _find_paddle_result_dir(self, pdf_dir: Path) -> Optional[Path]:
+        """
+        查找对应的 PaddleOCR 结果目录
+        
+        优先级:
+        1. ppstructurev3_client_results (HTTP API 客户端结果)
+        2. ppstructurev3_single_process_results (本地单进程结果)
+        """
+        candidates = [
+            pdf_dir / 'ppstructurev3_client_results',
+            pdf_dir / 'ppstructurev3_single_process_results',
+        ]
+        
+        for candidate in candidates:
+            if candidate.exists():
+                return candidate
+        
+        return None
+    
+    def _get_log_file_path(self, pdf_dir: Path, processor_name: str) -> Path:
+        """
+        🎯 获取合并任务的日志文件路径
+        
+        日志结构:
+        PDF目录/
+        └── logs/
+            └── merge_processor_name/
+                └── PDF名称_merge_YYYYMMDD_HHMMSS.log
+        """
+        # 日志目录
+        log_dir = pdf_dir / 'logs' / f'merge_{processor_name}'
+        log_dir.mkdir(parents=True, exist_ok=True)
+        
+        # 日志文件名
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        log_file = log_dir / f"{pdf_dir.name}_merge_{timestamp}.log"
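+        # e.g. <pdf_dir>/logs/merge_mineru_vllm/德_内蒙古银行照_merge_20250101_120000.log (illustrative)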
+        
+        return log_file
+    
+    def discover_merge_tasks(
+        self, 
+        pdf_list: Optional[List[str]] = None,
+        processors: Optional[List[str]] = None
+    ) -> List[MergeTask]:
+        """
+        自动发现需要合并的任务
+        
+        Args:
+            pdf_list: PDF 文件列表(不含扩展名),如 ['德_内蒙古银行照', ...]
+            processors: 处理器列表,如 ['paddleocr_vl_single_process', ...]
+        
+        Returns:
+            MergeTask 列表
+        """
+        tasks = []
+        
+        # 如果没有指定处理器,扫描所有 VL 类型的处理器
+        if not processors:
+            processors = []
+            for proc_name in self.config['processors']:
+                proc_type = self._detect_processor_type(proc_name)
+                if proc_type in ['paddleocr_vl', 'mineru', 'dotsocr']:
+                    processors.append(proc_name)
+        
+        # 如果没有指定 PDF 列表,扫描基础目录
+        if not pdf_list:
+            pdf_list = [d.name for d in self.base_dir.iterdir() if d.is_dir()]
+        
+        self.logger.info(f"📂 基础目录: {self.base_dir}")
+        self.logger.info(f"🔍 发现 {len(pdf_list)} 个 PDF 目录")
+        self.logger.info(f"⚙️  发现 {len(processors)} 个 VL 处理器")
+        
+        # 遍历每个 PDF 目录和处理器组合
+        for pdf_name in pdf_list:
+            pdf_dir = self.base_dir / pdf_name
+            
+            if not pdf_dir.exists():
+                self.logger.warning(f"⚠️  目录不存在: {pdf_dir}")
+                continue
+            
+            # 查找 PaddleOCR 结果目录
+            paddle_result_dir = self._find_paddle_result_dir(pdf_dir)
+            
+            if not paddle_result_dir:
+                self.logger.warning(f"⚠️  未找到 PaddleOCR 结果: {pdf_name}")
+                continue
+            
+            # 遍历每个 VL 处理器
+            for proc_name in processors:
+                if proc_name not in self.config['processors']:
+                    self.logger.warning(f"⚠️  处理器不存在: {proc_name}")
+                    continue
+                
+                proc_config = self.config['processors'][proc_name]
+                proc_type = self._detect_processor_type(proc_name)
+                
+                # 获取合并脚本
+                merger_script = self._get_merger_script(proc_type)
+                if not merger_script:
+                    self.logger.warning(f"⚠️  不支持的处理器类型: {proc_name} ({proc_type})")
+                    continue
+                
+                # VL 结果目录
+                vl_output_subdir = proc_config.get('output_subdir', f'{proc_name}_results')
+                vl_result_dir = pdf_dir / vl_output_subdir
+                
+                if not vl_result_dir.exists():
+                    self.logger.debug(f"⏭️  VL 结果不存在: {vl_result_dir}")
+                    continue
+                
+                # 输出目录
+                output_dir = pdf_dir / f"{vl_output_subdir}_cell_bbox"
+                
+                # 🎯 日志文件路径
+                log_file = self._get_log_file_path(pdf_dir, proc_name)
+                
+                # 创建任务
+                task = MergeTask(
+                    processor_name=proc_name,
+                    vl_result_dir=vl_result_dir,
+                    paddle_result_dir=paddle_result_dir,
+                    output_dir=output_dir,
+                    merger_script=merger_script,
+                    description=proc_config.get('description', proc_name),
+                    log_file=str(log_file)  # 🎯 新增
+                )
+                
+                tasks.append(task)
+        
+        return tasks
+    
+    def execute_merge_task(
+        self, 
+        task: MergeTask,
+        window: int = 15,
+        threshold: int = 85,
+        output_type: str = 'both',
+        dry_run: bool = False
+    ) -> Dict[str, Any]:
+        """
+        🎯 执行单个合并任务(支持日志重定向)
+        
+        Args:
+            task: 合并任务
+            window: 查找窗口
+            threshold: 相似度阈值
+            output_type: 输出格式
+            dry_run: 模拟运行
+        
+        Returns:
+            执行结果字典
+        """
+        self.logger.info(f"\n{'='*80}")
+        self.logger.info(f"📄 处理: {task.vl_result_dir.parent.name}")
+        self.logger.info(f"🔧 处理器: {task.description}")
+        self.logger.info(f"📂 VL 结果: {task.vl_result_dir}")
+        self.logger.info(f"📂 PaddleOCR 结果: {task.paddle_result_dir}")
+        self.logger.info(f"📂 输出目录: {task.output_dir}")
+        self.logger.info(f"📄 日志文件: {task.log_file}")
+        self.logger.info(f"{'='*80}")
+        
+        # 构建命令
+        cmd = [
+            sys.executable,  # 当前 Python 解释器
+            task.merger_script,
+            f"--{self._get_vl_arg_name(task.merger_script)}-dir", str(task.vl_result_dir),
+            '--paddle-dir', str(task.paddle_result_dir),
+            '--output-dir', str(task.output_dir),
+            '--output-type', output_type,
+            '--window', str(window),
+            '--threshold', str(threshold)
+        ]
+        
+        if dry_run:
+            self.logger.info(f"[DRY RUN] 命令: {' '.join(cmd)}")
+            return {
+                'task': task,
+                'success': True,
+                'duration': 0,
+                'error': '',
+                'dry_run': True
+            }
+        
+        # 🎯 执行命令并重定向输出到日志文件
+        start_time = time.time()
+        
+        try:
+            with open(task.log_file, 'w', encoding='utf-8') as log_f:
+                # 写入日志头
+                log_f.write(f"{'='*80}\n")
+                log_f.write(f"合并任务日志\n")
+                log_f.write(f"{'='*80}\n\n")
+                log_f.write(f"PDF 目录: {task.vl_result_dir.parent}\n")
+                log_f.write(f"处理器: {task.description}\n")
+                log_f.write(f"处理器名称: {task.processor_name}\n")
+                log_f.write(f"VL 结果目录: {task.vl_result_dir}\n")
+                log_f.write(f"PaddleOCR 结果目录: {task.paddle_result_dir}\n")
+                log_f.write(f"输出目录: {task.output_dir}\n")
+                log_f.write(f"合并脚本: {task.merger_script}\n")
+                log_f.write(f"查找窗口: {window}\n")
+                log_f.write(f"相似度阈值: {threshold}\n")
+                log_f.write(f"输出格式: {output_type}\n")
+                log_f.write(f"开始时间: {datetime.now()}\n")
+                log_f.write(f"{'='*80}\n\n")
+                log_f.flush()
+                
+                # 执行命令
+                result = subprocess.run(
+                    cmd,
+                    stdout=log_f,  # 🎯 重定向 stdout
+                    stderr=subprocess.STDOUT,  # 🎯 合并 stderr 到 stdout
+                    text=True,
+                    check=True
+                )
+                
+                # 写入日志尾
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 成功\n")
+                log_f.write(f"{'='*80}\n")
+            
+            duration = time.time() - start_time
+            self.logger.info(f"✅ 合并成功 (耗时: {duration:.2f}秒)")
+            
+            return {
+                'task': task,
+                'success': True,
+                'duration': duration,
+                'error': '',
+                'dry_run': False
+            }
+            
+        except subprocess.CalledProcessError as e:
+            duration = time.time() - start_time
+            error_msg = f"命令执行失败 (退出码: {e.returncode})"
+            
+            # 🎯 在日志文件中追加错误信息
+            with open(task.log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 失败\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
+            
+            self.logger.error(f"❌ 合并失败 (耗时: {duration:.2f}秒)")
+            self.logger.error(f"错误信息: {error_msg}")
+            self.logger.error(f"详细日志: {task.log_file}")
+            
+            return {
+                'task': task,
+                'success': False,
+                'duration': duration,
+                'error': error_msg,
+                'dry_run': False
+            }
+            
+        except Exception as e:
+            duration = time.time() - start_time
+            error_msg = str(e)
+            
+            with open(task.log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 异常\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
+            
+            self.logger.error(f"❌ 合并异常 (耗时: {duration:.2f}秒)")
+            self.logger.error(f"错误信息: {error_msg}")
+            self.logger.error(f"详细日志: {task.log_file}")
+            
+            return {
+                'task': task,
+                'success': False,
+                'duration': duration,
+                'error': error_msg,
+                'dry_run': False
+            }
+    
+    def _get_vl_arg_name(self, merger_script: str) -> str:
+        """获取 VL 参数名称"""
+        script_name = Path(merger_script).stem
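+        # e.g. 'merge_paddleocr_vl_paddleocr.py' -> 'paddleocr-vl', which
+        # execute_merge_task expands to the CLI flag '--paddleocr-vl-dir'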
+        
+        if 'paddleocr_vl' in script_name:
+            return 'paddleocr-vl'
+        elif 'mineru' in script_name:
+            return 'mineru'
+        else:
+            return 'vl'
+    
+    def _save_summary_log(self, stats: Dict[str, Any]):
+        """🎯 保存汇总日志"""
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        summary_log_file = self.log_base_dir / f"merge_batch_summary_{timestamp}.log"
+        
+        # 确保目录存在
+        summary_log_file.parent.mkdir(parents=True, exist_ok=True)
+        
+        with open(summary_log_file, 'w', encoding='utf-8') as f:
+            f.write("OCR 结果批量合并汇总日志\n")
+            f.write("=" * 80 + "\n\n")
+            
+            f.write(f"配置文件: {self.config_file}\n")
+            f.write(f"基础目录: {self.base_dir}\n")
+            f.write(f"日志目录: {self.log_base_dir}\n")
+            f.write(f"开始时间: {datetime.now()}\n")
+            f.write(f"总耗时: {stats['total_duration']:.2f} 秒\n\n")
+            
+            f.write("统计信息:\n")
+            f.write(f"  总任务数: {stats['total']}\n")
+            f.write(f"  成功: {stats['success']}\n")
+            f.write(f"  失败: {stats['failed']}\n\n")
+            
+            if stats['failed_tasks']:
+                f.write("失败的任务:\n")
+                for item in stats['failed_tasks']:
+                    f.write(f"  ✗ {item['pdf_dir']} / {item['processor']}\n")
+                    f.write(f"    错误: {item['error']}\n")
+                    f.write(f"    日志: {item['log']}\n\n")
+            
+            f.write("详细结果:\n")
+            for result in self.merge_results:
+                task = result['task']
+                status = "✓" if result['success'] else "✗"
+                f.write(f"{status} {task.vl_result_dir.parent.name} / {task.processor_name} ({result['duration']:.2f}s)\n")
+                f.write(f"   日志: {task.log_file}\n")
+                if result['error']:
+                    f.write(f"   错误: {result['error']}\n")
+        
+        self.logger.info(f"汇总日志已保存: {summary_log_file}")
+    
+    def batch_merge(
+        self,
+        pdf_list: Optional[List[str]] = None,
+        processors: Optional[List[str]] = None,
+        window: int = 15,
+        threshold: int = 85,
+        output_type: str = 'both',
+        dry_run: bool = False
+    ) -> Dict:
+        """
+        批量合并
+        
+        Returns:
+            统计信息字典
+        """
+        # 发现任务
+        tasks = self.discover_merge_tasks(pdf_list, processors)
+        
+        if not tasks:
+            self.logger.warning("⚠️  没有发现任何合并任务")
+            return {
+                'total': 0,
+                'success': 0,
+                'failed': 0,
+                'total_duration': 0,
+                'failed_tasks': []
+            }
+        
+        self.logger.info(f"\n🎯 发现 {len(tasks)} 个合并任务\n")
+        
+        # 显示任务列表
+        for i, task in enumerate(tasks, 1):
+            self.logger.info(f"{i}. {task.vl_result_dir.parent.name} / {task.processor_name}")
+        
+        # 确认执行
+        if not dry_run:
+            confirm = input(f"\n是否继续执行 {len(tasks)} 个合并任务? [Y/n]: ")
+            if confirm.lower() not in ['', 'y', 'yes']:
+                self.logger.info("❌ 已取消")
+                return {
+                    'total': 0,
+                    'success': 0,
+                    'failed': 0,
+                    'total_duration': 0,
+                    'failed_tasks': []
+                }
+        
+        # 执行任务
+        batch_start_time = time.time()
+        success_count = 0
+        failed_count = 0
+        
+        with tqdm(total=len(tasks), desc="合并进度", unit="task") as pbar:
+            for task in tasks:
+                result = self.execute_merge_task(
+                    task,
+                    window=window,
+                    threshold=threshold,
+                    output_type=output_type,
+                    dry_run=dry_run
+                )
+                
+                self.merge_results.append(result)
+                
+                if result['success']:
+                    success_count += 1
+                else:
+                    failed_count += 1
+                
+                pbar.update(1)
+                pbar.set_postfix({
+                    'success': success_count,
+                    'failed': failed_count
+                })
+        
+        total_duration = time.time() - batch_start_time
+        
+        # 统计失败任务
+        failed_tasks = [
+            {
+                'pdf_dir': r['task'].vl_result_dir.parent.name,
+                'processor': r['task'].processor_name,
+                'error': r['error'],
+                'log': r['task'].log_file
+            }
+            for r in self.merge_results if not r['success']
+        ]
+        
+        # 统计信息
+        stats = {
+            'total': len(tasks),
+            'success': success_count,
+            'failed': failed_count,
+            'total_duration': total_duration,
+            'failed_tasks': failed_tasks
+        }
+        
+        # 🎯 保存汇总日志
+        self._save_summary_log(stats)
+        
+        # 打印总结
+        self.logger.info(f"\n{'='*80}")
+        self.logger.info("📊 合并完成")
+        self.logger.info(f"  总任务数: {stats['total']}")
+        self.logger.info(f"  ✅ 成功: {stats['success']}")
+        self.logger.info(f"  ❌ 失败: {stats['failed']}")
+        self.logger.info(f"  ⏱️  总耗时: {stats['total_duration']:.2f} 秒")
+        self.logger.info(f"{'='*80}")
+        
+        if failed_tasks:
+            self.logger.info(f"\n失败的任务:")
+            for item in failed_tasks:
+                self.logger.info(f"  ✗ {item['pdf_dir']} / {item['processor']}")
+                self.logger.info(f"    错误: {item['error']}")
+                self.logger.info(f"    日志: {item['log']}")
+        
+        return stats
+
+
+def create_parser() -> argparse.ArgumentParser:
+    """创建命令行参数解析器"""
+    parser = argparse.ArgumentParser(
+        description='批量合并 OCR 结果(VL + PaddleOCR)',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+示例用法:
+
+  1. 合并配置文件中所有 VL 处理器的结果:
+     python batch_merge_results.py
+
+  2. 合并指定 PDF 的结果:
+     python batch_merge_results.py -f pdf_list.txt
+
+  3. 合并指定处理器的结果:
+     python batch_merge_results.py -p paddleocr_vl_single_process -p mineru_vllm
+
+  4. 自定义参数:
+     python batch_merge_results.py -w 20 -t 90
+
+  5. 模拟运行(不实际执行):
+     python batch_merge_results.py --dry-run
+        """
+    )
+    
+    # 配置文件
+    parser.add_argument(
+        '-c', '--config',
+        default='processor_configs.yaml',
+        help='配置文件路径 (默认: processor_configs.yaml)'
+    )
+    
+    # PDF 和处理器
+    parser.add_argument(
+        '-d', '--base-dir',
+        help='PDF 基础目录(覆盖配置文件)'
+    )
+    
+    parser.add_argument(
+        '-f', '--file-list',
+        help='PDF 列表文件(每行一个 PDF 名称,不含扩展名)'
+    )
+    
+    parser.add_argument(
+        '-l', '--pdf-list',
+        nargs='+',
+        help='PDF 名称列表(不含扩展名)'
+    )
+    
+    parser.add_argument(
+        '-p', '--processors',
+        nargs='+',
+        help='处理器列表(不指定则自动检测所有 VL 处理器)'
+    )
+    
+    # 合并参数
+    parser.add_argument(
+        '-w', '--window',
+        type=int,
+        default=15,
+        help='查找窗口大小 (默认: 15)'
+    )
+    
+    parser.add_argument(
+        '-t', '--threshold',
+        type=int,
+        default=85,
+        help='相似度阈值 (默认: 85)'
+    )
+    
+    parser.add_argument(
+        '--output-type',
+        choices=['json', 'markdown', 'both'],
+        default='both',
+        help='输出格式 (默认: both)'
+    )
+    
+    # 工具选项
+    parser.add_argument(
+        '--dry-run',
+        action='store_true',
+        help='模拟运行,不实际执行'
+    )
+    
+    parser.add_argument(
+        '-v', '--verbose',
+        action='store_true',
+        help='详细输出'
+    )
+    
+    return parser
+
+
+def main():
+    """主函数"""
+    parser = create_parser()
+    args = parser.parse_args()
+    
+    # 设置日志级别
+    if args.verbose:
+        logging.getLogger().setLevel(logging.DEBUG)
+    
+    # 读取 PDF 列表
+    pdf_list = None
+    if args.file_list:
+        pdf_list = []
+        with open(args.file_list, 'r', encoding='utf-8') as f:
+            for line in f:
+                line = line.strip()
+                if line and not line.startswith('#'):
+                    # 移除 .pdf 扩展名
+                    pdf_name = line.removesuffix('.pdf')  # removesuffix (Python 3.9+) only strips a trailing '.pdf'
+                    pdf_list.append(pdf_name)
+    elif args.pdf_list:
+        pdf_list = [p.removesuffix('.pdf') for p in args.pdf_list]
+    
+    # 创建批量合并器
+    merger = BatchMerger(
+        config_file=args.config,
+        base_dir=args.base_dir
+    )
+    
+    # 执行批量合并
+    stats = merger.batch_merge(
+        pdf_list=pdf_list,
+        processors=args.processors,
+        window=args.window,
+        threshold=args.threshold,
+        output_type=args.output_type,
+        dry_run=args.dry_run
+    )
+    
+    return 0 if stats['failed'] == 0 else 1
+
+
+if __name__ == '__main__':
+    print("🚀 启动批量OCR bbox 合并程序...")
+    
+    if len(sys.argv) == 1:
+        # 如果没有命令行参数,使用默认配置运行
+        print("ℹ️  未提供命令行参数,使用默认配置运行...")
+        
+        # 默认配置
+        default_config = {
+            "file-list": "pdf_list.txt",
+        }
+        
+        print("⚙️  默认参数:")
+        for key, value in default_config.items():
+            print(f"  --{key}: {value}")
+        # 构造参数
+        sys.argv = [sys.argv[0]]
+        for key, value in default_config.items():
+            sys.argv.extend([f"--{key}", str(value)])
+        sys.argv.append("--dry-run")
+        sys.argv.append("--verbose")  # 添加详细输出参数 
+    sys.exit(main())

+ 181 - 49
batch_ocr/batch_process_pdf.py

@@ -3,6 +3,7 @@
 PDF 批量处理脚本
 支持多种处理器,配置文件驱动
 支持自动切换虚拟环境
+支持执行器输出日志重定向
 """
 
 import os
@@ -32,7 +33,8 @@ class ProcessorConfig:
     output_arg: str = "--output_dir"
     extra_args: List[str] = field(default_factory=list)
     output_subdir: str = "results"
-    venv: Optional[str] = None  # 虚拟环境激活命令
+    log_subdir: str = "logs"  # 🎯 新增:日志子目录
+    venv: Optional[str] = None
     description: str = ""
 
 
@@ -43,6 +45,7 @@ class ProcessResult:
     success: bool
     duration: float
     error_message: str = ""
+    log_file: str = ""  # 🎯 新增:日志文件路径
 
 
 # ============================================================================
@@ -64,7 +67,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'paddleocr_vl_results',
                 'venv': 'source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate',
-                'description': 'PaddleOCR-VL 处理器'
+                'description': 'PaddleOCR-VL 处理器',
+                'log_subdir': 'logs/paddleocr_vl_single_process'  # 🎯 新增
             },
             'ppstructurev3_single_process': {
                 'script': '/Users/zhch158/workspace/repository.git/PaddleX/zhch/ppstructurev3_single_process.py',
@@ -75,7 +79,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'ppstructurev3_results',
                 'venv': 'conda activate paddle',
-                'description': 'PP-StructureV3 处理器'
+                'description': 'PP-StructureV3 处理器',
+                'log_subdir': 'logs/ppstructurev3_single_process'  # 🎯 新增
             },
             'ppstructurev3_single_client': {
                 'script': '/Users/zhch158/workspace/repository.git/PaddleX/zhch/ppstructurev3_single_client.py',
@@ -87,7 +92,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'ppstructurev3_client_results',
                 'venv': 'source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate',
-                'description': 'PP-StructureV3 HTTP API 客户端'
+                'description': 'PP-StructureV3 HTTP API 客户端',
+                'log_subdir': 'logs/ppstructurev3_single_client'  # 🎯 新增
             },
             'mineru_vllm': {
                 'script': '/Users/zhch158/workspace/repository.git/MinerU/zhch/mineru2_vllm_multthreads.py',
@@ -100,7 +106,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'mineru_vllm_results',
                 'venv': 'conda activate mineru2',
-                'description': 'MinerU vLLM 处理器'
+                'description': 'MinerU vLLM 处理器',
+                'log_subdir': 'logs/mineru_vllm'  # 🎯 新增
             },
             'dotsocr_vllm': {
                 'script': '/Users/zhch158/workspace/repository.git/dots.ocr/zhch/dotsocr_vllm_multthreads.py',
@@ -117,12 +124,16 @@ class ConfigManager:
                 ],
                 'output_subdir': 'dotsocr_vllm_results',
                 'venv': 'conda activate py312',
-                'description': 'DotsOCR vLLM 处理器 - 支持PDF和图片'
+                'description': 'DotsOCR vLLM 处理器 - 支持PDF和图片',
+                'log_subdir': 'logs/dotsocr_vllm'  # 🎯 新增
             }
         },
         'global': {
             'base_dir': '/Users/zhch158/workspace/data/流水分析',
-            'output_subdir': 'results'
+            'output_subdir': 'results',
+            'log_dir': 'logs',
+            'log_retention_days': 30,
+            'log_level': 'INFO'
         }
     }
     
@@ -153,6 +164,7 @@ class ConfigManager:
             output_arg=proc_config.get('output_arg', '--output_dir'),
             extra_args=proc_config.get('extra_args', []),
             output_subdir=proc_config.get('output_subdir', processor_name + '_results'),
+            log_subdir=proc_config.get('log_subdir', f'logs/{processor_name}'),  # 🎯 新增
             venv=proc_config.get('venv'),
             description=proc_config.get('description', '')
         )
@@ -250,11 +262,13 @@ class PDFBatchProcessor:
         self,
         processor_config: ProcessorConfig,
         output_subdir: Optional[str] = None,
+        log_base_dir: Optional[str] = None,  # 🎯 新增:日志基础目录
         dry_run: bool = False
     ):
         self.processor_config = processor_config
         # 如果指定了output_subdir,使用指定的;否则使用处理器配置中的
         self.output_subdir = output_subdir or processor_config.output_subdir
+        self.log_base_dir = Path(log_base_dir) if log_base_dir else Path('logs')  # 🎯 新增
         self.dry_run = dry_run
         
         # 设置日志
@@ -282,12 +296,37 @@ class PDFBatchProcessor:
         
         return logger
     
+    def _get_log_file_path(self, pdf_file: Path) -> Path:
+        """
+        🎯 获取日志文件路径
+        
+        日志结构:
+        base_dir/
+        └── PDF名称/
+            └── logs/
+                └── processor_name/
+                    └── PDF名称_YYYYMMDD_HHMMSS.log
+        """
+        # PDF 目录
+        pdf_dir = pdf_file.parent / pdf_file.stem
+        
+        # 日志目录: pdf_dir / logs / processor_name
+        log_dir = pdf_dir / self.processor_config.log_subdir
+        log_dir.mkdir(parents=True, exist_ok=True)
+        
+        # 日志文件名: PDF名称_时间戳.log
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        log_file = log_dir / f"{pdf_file.stem}_{timestamp}.log"
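+        # e.g. <base_dir>/sample/logs/mineru_vllm/sample_20250101_120000.log ("sample" is illustrative)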
+        
+        return log_file
+    
     def process_files(self, pdf_files: List[Path]) -> Dict[str, Any]:
         """批量处理文件"""
         self.logger.info(f"开始处理 {len(pdf_files)} 个文件")
         self.logger.info(f"处理器: {self.processor_config.description}")
         self.logger.info(f"脚本: {self.processor_config.script}")
         self.logger.info(f"输出目录: {self.output_subdir}")
+        self.logger.info(f"日志目录: {self.processor_config.log_subdir}")
         
         if self.processor_config.venv:
             self.logger.info(f"虚拟环境: {self.processor_config.venv}")
@@ -312,14 +351,12 @@ class PDFBatchProcessor:
         
         # 生成统计信息
         stats = self._generate_stats(total_duration)
-        
-        # 保存日志
-        self._save_log(stats)
+        self._save_summary_log(stats)
         
         return stats
     
     def _process_single_file(self, pdf_file: Path) -> ProcessResult:
-        """处理单个文件"""
+        """🎯 处理单个文件(支持日志重定向)"""
         self.logger.info(f"处理: {pdf_file}")
         
         # 检查文件是否存在
@@ -335,10 +372,14 @@ class PDFBatchProcessor:
         # 确定输出目录
         output_dir = pdf_file.parent / pdf_file.stem / self.output_subdir
         
+        # 🎯 获取日志文件路径
+        log_file = self._get_log_file_path(pdf_file)
+        
         # 构建命令
         cmd = self._build_command(pdf_file, output_dir)
         
         self.logger.debug(f"执行命令: {cmd if isinstance(cmd, str) else ' '.join(cmd)}")
+        self.logger.info(f"日志输出: {log_file}")
         
         if self.dry_run:
             self.logger.info(f"[DRY RUN] 将执行: {cmd if isinstance(cmd, str) else ' '.join(cmd)}")
@@ -346,53 +387,103 @@ class PDFBatchProcessor:
                 pdf_file=str(pdf_file),
                 success=True,
                 duration=0,
-                error_message=""
+                error_message="",
+                log_file=str(log_file)
             )
         
-        # 执行命令
+        # 🎯 执行命令并重定向输出到日志文件
         start_time = time.time()
         try:
-            # 如果是 shell 命令(包含 venv),使用 shell=True
-            if isinstance(cmd, str):
-                result = subprocess.run(
-                    cmd,
-                    shell=True,
-                    executable='/bin/bash',  # 使用 bash
-                    capture_output=True,
-                    text=True,
-                    check=True
-                )
-            else:
-                result = subprocess.run(
-                    cmd,
-                    capture_output=True,
-                    text=True,
-                    check=True
-                )
+            with open(log_file, 'w', encoding='utf-8') as log_f:
+                # 写入日志头
+                log_f.write(f"{'='*80}\n")
+                log_f.write(f"处理器: {self.processor_config.description}\n")
+                log_f.write(f"PDF 文件: {pdf_file}\n")
+                log_f.write(f"输出目录: {output_dir}\n")
+                log_f.write(f"开始时间: {datetime.now()}\n")
+                log_f.write(f"{'='*80}\n\n")
+                log_f.flush()
+                
+                # 执行命令
+                if isinstance(cmd, str):
+                    result = subprocess.run(
+                        cmd,
+                        shell=True,
+                        executable='/bin/bash',
+                        stdout=log_f,  # 🎯 重定向 stdout
+                        stderr=subprocess.STDOUT,  # 🎯 合并 stderr 到 stdout
+                        text=True,
+                        check=True
+                    )
+                else:
+                    result = subprocess.run(
+                        cmd,
+                        stdout=log_f,  # 🎯 重定向 stdout
+                        stderr=subprocess.STDOUT,  # 🎯 合并 stderr
+                        text=True,
+                        check=True
+                    )
+                
+                # 写入日志尾
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 成功\n")
+                log_f.write(f"{'='*80}\n")
             
             duration = time.time() - start_time
-            
             self.logger.info(f"✓ 成功 (耗时: {duration:.2f}秒)")
             
             return ProcessResult(
                 pdf_file=str(pdf_file),
                 success=True,
                 duration=duration,
-                error_message=""
+                error_message="",
+                log_file=str(log_file)
             )
             
         except subprocess.CalledProcessError as e:
             duration = time.time() - start_time
-            error_msg = e.stderr if e.stderr else str(e)
+            error_msg = f"命令执行失败 (退出码: {e.returncode})"
+            
+            # 🎯 在日志文件中追加错误信息
+            with open(log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 失败\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
             
             self.logger.error(f"✗ 失败 (耗时: {duration:.2f}秒)")
             self.logger.error(f"错误信息: {error_msg}")
+            self.logger.error(f"详细日志: {log_file}")
+            
+            return ProcessResult(
+                pdf_file=str(pdf_file),
+                success=False,
+                duration=duration,
+                error_message=error_msg,
+                log_file=str(log_file)
+            )
+        except Exception as e:
+            duration = time.time() - start_time
+            error_msg = str(e)
+            
+            with open(log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 异常\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
+            
+            self.logger.error(f"✗ 异常 (耗时: {duration:.2f}秒)")
+            self.logger.error(f"错误信息: {error_msg}")
             
             return ProcessResult(
                 pdf_file=str(pdf_file),
                 success=False,
                 duration=duration,
-                error_message=error_msg
+                error_message=error_msg,
+                log_file=str(log_file)
             )
     
     def _build_command(self, pdf_file: Path, output_dir: Path):
@@ -451,7 +542,14 @@ eval "$(conda shell.bash hook)"
         success_count = sum(1 for r in self.results if r.success)
         failed_count = len(self.results) - success_count
         
-        failed_files = [r.pdf_file for r in self.results if not r.success]
+        failed_files = [
+            {
+                'file': r.pdf_file,
+                'error': r.error_message,
+                'log': r.log_file
+            }
+            for r in self.results if not r.success
+        ]
         
         stats = {
             'total': len(self.results),
@@ -464,7 +562,8 @@ eval "$(conda shell.bash hook)"
                     'file': r.pdf_file,
                     'success': r.success,
                     'duration': r.duration,
-                    'error': r.error_message
+                    'error': r.error_message,
+                    'log': r.log_file
                 }
                 for r in self.results
             ]
@@ -472,19 +571,23 @@ eval "$(conda shell.bash hook)"
         
         return stats
     
-    def _save_log(self, stats: Dict[str, Any]):
-        """保存日志"""
+    def _save_summary_log(self, stats: Dict[str, Any]):
+        """🎯 保存汇总日志"""
         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
-        log_file = f"batch_process_{self.processor_config.name}_{timestamp}.log"
+        summary_log_file = self.log_base_dir / f"batch_summary_{self.processor_config.name}_{timestamp}.log"
+        
+        # 确保目录存在
+        summary_log_file.parent.mkdir(parents=True, exist_ok=True)
         
-        with open(log_file, 'w', encoding='utf-8') as f:
-            f.write("PDF 批量处理日志\n")
+        with open(summary_log_file, 'w', encoding='utf-8') as f:
+            f.write("PDF 批量处理汇总日志\n")
             f.write("=" * 80 + "\n\n")
             
             f.write(f"处理器: {self.processor_config.description}\n")
             f.write(f"处理器名称: {self.processor_config.name}\n")
             f.write(f"脚本: {self.processor_config.script}\n")
             f.write(f"输出目录: {self.output_subdir}\n")
+            f.write(f"日志目录: {self.processor_config.log_subdir}\n")
             
             if self.processor_config.venv:
                 f.write(f"虚拟环境: {self.processor_config.venv}\n")
@@ -499,18 +602,20 @@ eval "$(conda shell.bash hook)"
             
             if stats['failed_files']:
                 f.write("失败的文件:\n")
-                for file in stats['failed_files']:
-                    f.write(f"  - {file}\n")
-                f.write("\n")
+                for item in stats['failed_files']:
+                    f.write(f"  ✗ {item['file']}\n")
+                    f.write(f"    错误: {item['error']}\n")
+                    f.write(f"    日志: {item['log']}\n\n")
             
             f.write("详细结果:\n")
             for result in stats['results']:
                 status = "✓" if result['success'] else "✗"
                 f.write(f"{status} {result['file']} ({result['duration']:.2f}s)\n")
+                f.write(f"   日志: {result['log']}\n")
                 if result['error']:
                     f.write(f"   错误: {result['error']}\n")
         
-        self.logger.info(f"日志已保存: {log_file}")
+        self.logger.info(f"汇总日志已保存: {summary_log_file}")
 
 
 # ============================================================================
@@ -684,6 +789,7 @@ def main():
     base_dir = args.base_dir or config_manager.get_global_config('base_dir')
     if not base_dir:
         parser.error("必须指定 -d 参数或在配置文件中设置 base_dir")
+    log_base_dir = Path(base_dir) / config_manager.get_global_config('log_dir', 'logs')
     
     # 查找 PDF 文件
     finder = PDFFileFinder(base_dir)
@@ -727,6 +833,7 @@ def main():
     processor = PDFBatchProcessor(
         processor_config=processor_config,
         output_subdir=args.output_subdir,
+        log_base_dir=log_base_dir,  # 🎯 传递日志目录
         dry_run=args.dry_run
     )
     
@@ -739,8 +846,7 @@ def main():
     print(f"\n📊 统计信息:")
     print(f"  处理器: {processor_config.description}")
     print(f"  输出目录: {processor.output_subdir}")
-    if stats.get('venv'):
-        print(f"  虚拟环境: {stats['venv']}")
+    print(f"  日志目录: {processor.processor_config.log_subdir}")
     print(f"  总文件数: {stats['total']}")
     print(f"  ✓ 成功: {stats['success']}")
     print(f"  ✗ 失败: {stats['failed']}")
@@ -748,11 +854,37 @@ def main():
     
     if stats['failed_files']:
         print(f"\n失败的文件:")
-        for file in stats['failed_files']:
-            print(f"  ✗ {file}")
+        for item in stats['failed_files']:
+            print(f"  ✗ {item['file']}")
+            print(f"    错误: {item['error']}")
+            print(f"    日志: {item['log']}")
     
     return 0 if stats['failed'] == 0 else 1
 
 
 if __name__ == '__main__':
+    print("🚀 启动批量OCR程序...")
+    
+    if len(sys.argv) == 1:
+        # 如果没有命令行参数,使用默认配置运行
+        print("ℹ️  未提供命令行参数,使用默认配置运行...")
+        
+        # 默认配置
+        default_config = {
+            "processor": "mineru_vllm",
+            "file-list": "pdf_list.txt",
+        }
+        
+        print("⚙️  默认参数:")
+        for key, value in default_config.items():
+            print(f"  --{key}: {value}")
+        # 构造参数
+        sys.argv = [sys.argv[0]]
+        for key, value in default_config.items():
+            sys.argv.extend([f"--{key}", str(value)])
+        sys.argv.append("--dry-run")
+        sys.argv.append("--verbose")  # 添加详细输出参数 
+
     sys.exit(main())

+ 19 - 8
batch_ocr/processor_configs.yaml

@@ -6,7 +6,6 @@
 processors:
   # -------------------------------------------------------------------------
   # PaddleOCR-VL 处理器
-  # 用于视觉语言模型的 OCR 处理
   # -------------------------------------------------------------------------
   paddleocr_vl_single_process:
     script: "/Users/zhch158/workspace/repository.git/PaddleX/zhch/paddleocr_vl_single_process.py"
@@ -14,14 +13,15 @@ processors:
     output_arg: "--output_dir"
     extra_args:
       - "--pipeline=/Users/zhch158/workspace/repository.git/PaddleX/zhch/my_config/PaddleOCR-VL-Client-RT-DETR-H_layout_17cls.yaml"
-      - "--no-adapter"
+      - "--device=cpu"
+      # - "--no-adapter"
     output_subdir: "paddleocr_vl_results"
+    log_subdir: "logs/paddleocr_vl"  # 🎯 新增:日志子目录
     venv: "source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate"
     description: "PaddleOCR-VL 处理器 - 视觉语言模型OCR"
 
   # -------------------------------------------------------------------------
   # PP-StructureV3 本地处理器
-  # 用于文档结构化分析(本地GPU/CPU处理)
   # -------------------------------------------------------------------------
   ppstructurev3_single_process:
     script: "/home/ubuntu/zhch/PaddleX/zhch/ppstructurev3_single_process.py"
@@ -29,22 +29,25 @@ processors:
     output_arg: "--output_dir"
     extra_args:
       - "--pipeline=/home/ubuntu/zhch/PaddleX/zhch/my_config/PP-StructureV3.yaml"
+      - "--device=cpu"
     output_subdir: "ppstructurev3_results"
+    log_subdir: "logs/ppstructurev3"
     venv: "conda activate paddle"
     description: "PP-StructureV3 处理器 - 本地处理"
 
-  # -------------------------------------------------------------------------
-  # PP-StructureV3 GPU 处理器
-  # 明确使用 GPU 加速
-  # -------------------------------------------------------------------------
   ppstructurev3_gpu:
     script: "/home/ubuntu/zhch/PaddleX/zhch/ppstructurev3_single_process.py"
     input_arg: "--input_file"
     output_arg: "--output_dir"
     extra_args:
       - "--pipeline=/home/ubuntu/zhch/PaddleX/zhch/my_config/PP-StructureV3.yaml"
       - "--device=gpu"
     output_subdir: "ppstructurev3_gpu_results"
+    log_subdir: "logs/ppstructurev3_gpu"
     venv: "conda activate paddle"
     description: "PP-StructureV3 处理器 - GPU加速"
 
@@ -60,6 +63,7 @@ processors:
       - "--pipeline=/Users/zhch158/workspace/repository.git/PaddleX/zhch/my_config/PP-StructureV3-zhch.yaml"
       - "--device=cpu"
     output_subdir: "ppstructurev3_cpu_results"
+    log_subdir: "logs/ppstructurev3_cpu"
     venv: "source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate"
     description: "PP-StructureV3 处理器 - CPU处理"
 
@@ -75,6 +79,7 @@ processors:
       - "--api_url=http://10.192.72.11:8111/layout-parsing"
       - "--timeout=300"
     output_subdir: "ppstructurev3_client_results"
+    log_subdir: "logs/ppstructurev3_client"
     venv: "source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate"
     description: "PP-StructureV3 HTTP API 客户端 - 远程服务"
 
@@ -91,6 +96,7 @@ processors:
       - "--timeout=300"
       - "--batch_size=1"
     output_subdir: "mineru_vllm_results"
+    log_subdir: "logs/mineru_vllm"
     venv: "conda activate mineru2"
     description: "MinerU vLLM 处理器 - 支持PDF和图片"
 
@@ -111,6 +117,7 @@ processors:
       - "--max_workers=1"
       - "--dpi=200"
     output_subdir: "dotsocr_vllm_results"
+    log_subdir: "logs/dotsocr_vllm"
     venv: "conda activate py312"
     description: "DotsOCR vLLM 处理器 - 支持PDF和图片"
 
@@ -123,4 +130,8 @@ global:
   
   # 默认输出子目录名称(如果处理器未指定)
   output_subdir: "results"
-  
+  
+  # 🎯 新增:全局日志配置
+  log_dir: "logs"  # 全局日志目录(相对于 base_dir)
+  log_retention_days: 30  # 日志保留天数
+  log_level: "INFO"  # 日志级别: DEBUG, INFO, WARNING, ERROR
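
global 段新增了 `log_retention_days`,但对应的清理逻辑未出现在本次提交中。下面是一个按保留天数删除过期 `.log` 文件的最小示意(`cleanup_old_logs` 为假设的函数名,并假设日志均为日志目录下的 `.log` 文件):

```python
import time
from pathlib import Path

def cleanup_old_logs(log_dir: str, retention_days: int = 30) -> int:
    """删除修改时间早于保留期限的 .log 文件,返回删除数量(示意实现)"""
    cutoff = time.time() - retention_days * 86400
    removed = 0
    for log_file in Path(log_dir).rglob("*.log"):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            removed += 1
    return removed

# 用法示意:cleanup_old_logs("/path/to/base_dir/logs", retention_days=30)
```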

+ 48 - 0
config/A用户_单元格扫描流水.yaml

@@ -0,0 +1,48 @@
+document:
+  name: "A用户_单元格扫描流水"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/A用户_单元格扫描流水"
+
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true
+  
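
上面 `image_dir` 中的 `{{name}}` 是 Jinja2 模板变量,加载配置时由下文 config_manager.py 的 `_render_template` 以 `document.name` 渲染。最小示意:

```python
from jinja2 import Template

# 示意:{{name}} 在加载文档配置时被替换为文档名
context = {"name": "A用户_单元格扫描流水"}
print(Template("ppstructurev3_client_results/{{name}}").render(context))
# 输出: ppstructurev3_client_results/A用户_单元格扫描流水
```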

+ 47 - 0
config/B用户_扫描流水.yaml

@@ -0,0 +1,47 @@
+document:
+  name: "B用户_扫描流水"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/B用户_扫描流水"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 153 - 0
config/global.yaml

@@ -0,0 +1,153 @@
+# OCR验证工具配置文件
+
+# 样式配置
+styles:
+  font_size: 8
+  
+  colors:
+    primary: "#0288d1"
+    secondary: "#ff9800"
+    success: "#4caf50"
+    error: "#f44336"
+    warning: "#ff9800"
+    background: "#fafafa"
+    text: "#333333"
+  
+  layout:
+    default_zoom: 1.0
+    default_height: 800
+    sidebar_width: 1
+    content_width: 0.65
+
+# 界面配置
+ui:
+  page_title: "OCR可视化校验工具"
+  page_icon: "🔍"
+  layout: "wide"
+  sidebar_state: "expanded"
+  
+# OCR数据配置
+ocr:
+  min_text_length: 2
+  default_confidence: 1.0
+  exclude_texts: ["Picture", ""]
+  
+  # 图片方向检测配置
+  orientation_detection:
+    enabled: true
+    confidence_threshold: 0.3  # 置信度阈值
+    methods: ["opencv_analysis"]  # 检测方法
+    cache_results: true  # 缓存检测结果
+  
+  # OCR工具类型配置
+  tools:
+    dots_ocr:
+      name: "Dots OCR"
+      description: "专业VLM OCR"
+      json_structure: "array"  # JSON为数组格式
+      text_field: "text"
+      bbox_field: "bbox"
+      category_field: "category"
+      confidence_field: "confidence"
+      # 旋转处理配置
+      rotation:
+        coordinates_are_pre_rotated: false  # 坐标不是预旋转的
+        
+    ppstructv3:
+      name: "PPStructV3"
+      description: "PaddleOCR PP-StructureV3"
+      json_structure: "object"  # JSON为对象格式
+      parsing_results_field: "parsing_res_list"
+      text_field: "block_content"
+      bbox_field: "block_bbox"
+      rec_texts_field: "overall_ocr_res.rec_texts" # 针对表格中的文字块
+      rec_boxes_field: "overall_ocr_res.rec_boxes" # 针对表格中的文字块
+      category_field: "block_label"
+      confidence_field: "confidence"
+      # 旋转处理配置
+      rotation:
+        coordinates_are_pre_rotated: true  # 坐标已经是预旋转的
+      
+    table_recognition_v2:
+      name: "TableRecognitionV2"
+      description: "PaddleOCR Table Recognition V2"
+      json_structure: "object"
+      parsing_results_field: "table_res_list"
+      text_field: "pred_html"
+      bbox_field: "cell_box_list"            # 原先的 cell_box_listox 为笔误
+      rec_texts_field: "table_ocr_pred.rec_texts" # 针对表格中的文字块
+      rec_boxes_field: "table_ocr_pred.rec_boxes" # 针对表格中的文字块
+      category_field: "type"
+      confidence_field: "confidence"
+      rotation:
+        coordinates_are_pre_rotated: true
+    
+    mineru:
+      name: "MinerU"
+      description: "MinerU OCR"
+      json_structure: "array"  # JSON为数组格式
+      text_field: "text"
+      bbox_field: "bbox"
+      category_field: "type"
+      confidence_field: "confidence"
+      # 表格相关字段
+      table_body_field: "table_body"
+      table_cells_field: "table_cells"
+      img_path_field: "img_path"
+      # 旋转处理配置
+      rotation:
+        coordinates_are_pre_rotated: false
+  
+  # 自动检测工具类型的规则(按优先级从高到低)
+  auto_detection:
+    enabled: true
+    rules:
+      # Table Recognition V2 - 最高优先级
+      - tool_type: "table_recognition_v2"
+        conditions:
+          - type: "field_exists"
+            field: "table_res_list"
+          - type: "field_not_exists"
+            field: "parsing_res_list"
+        priority: 4
+      
+      # PPStructV3 - 第二优先级
+      - tool_type: "ppstructv3"
+        conditions:
+          - type: "field_exists"
+            field: "parsing_res_list"
+          - type: "field_exists"
+            field: "doc_preprocessor_res"
+        priority: 2
+      
+      # MinerU - 第三优先级
+      - tool_type: "mineru"
+        conditions:
+          - type: "field_exists"
+            field: "page_idx"
+          - type: "field_exists"
+            field: "type"
+          - type: "json_structure"
+            structure: "array"
+        priority: 1
+      
+      # Dots OCR - 最低优先级(默认)
+      - tool_type: "dots_ocr"
+        conditions:
+          - type: "json_structure"
+            structure: "array"
+          - type: "field_exists"
+            field: "category"
+        priority: 3
+
+# 预校验结果文件路径
+pre_validation:
+  out_dir: "./output/pre_validation/"
+
+data_sources:
+  - 德_内蒙古银行照.yaml
+  - 对公_招商银行图.yaml
+  - A用户_单元格扫描流水.yaml
+  - B用户_扫描流水.yaml
+  - 至远彩色_2023年报.yaml
+
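
auto_detection 规则依据 JSON 结果的字段与整体结构推断工具类型。下面是按配置注释所述"从高到低"的列表顺序逐条评估规则的最小示意(`detect_tool_type` 为假设的函数名,仅覆盖 field_exists / field_not_exists / json_structure 三类条件):

```python
from typing import Any, Dict, List, Optional

def detect_tool_type(data: Any, rules: List[Dict]) -> Optional[str]:
    """按规则列表顺序匹配 OCR 工具类型(示意实现)"""
    # 数组格式的结果取第一个元素作为字段检查样本
    sample = data[0] if isinstance(data, list) and data else data

    def check(cond: Dict) -> bool:
        if cond["type"] == "field_exists":
            return isinstance(sample, dict) and cond["field"] in sample
        if cond["type"] == "field_not_exists":
            return not (isinstance(sample, dict) and cond["field"] in sample)
        if cond["type"] == "json_structure":
            return (cond["structure"] == "array") == isinstance(data, list)
        return False

    for rule in rules:  # 假设按列表顺序(从高到低)评估
        if all(check(c) for c in rule["conditions"]):
            return rule["tool_type"]
    return None
```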

+ 47 - 0
config/对公_招商银行图.yaml

@@ -0,0 +1,47 @@
+document:
+  name: "对公_招商银行图"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/对公_招商银行图"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 48 - 0
config/德_内蒙古银行照.yaml

@@ -0,0 +1,48 @@
+# 文档: 德_内蒙古银行照
+document:
+  name: "德_内蒙古银行照"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 47 - 0
config/至远彩色_2023年报.yaml

@@ -0,0 +1,47 @@
+document:
+  name: "至远彩色_2023年报"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/至远彩色_2023年报"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 360 - 0
config_manager.py

@@ -0,0 +1,360 @@
+"""
+配置管理器
+支持分层配置和自动发现数据源
+支持 Jinja2 模板变量
+"""
+
+import yaml
+from pathlib import Path
+from typing import Dict, List, Optional, Any
+from dataclasses import dataclass, field
+import logging
+from jinja2 import Template  # 🎯 新增
+
+
+@dataclass
+class OCRToolConfig:
+    """OCR 工具配置"""
+    name: str
+    description: str
+    json_structure: str
+    text_field: str
+    bbox_field: str
+    category_field: str
+    confidence_field: str = "confidence"
+    parsing_results_field: Optional[str] = None
+    rec_texts_field: Optional[str] = None
+    rec_boxes_field: Optional[str] = None
+    table_body_field: Optional[str] = None
+    table_cells_field: Optional[str] = None
+    img_path_field: Optional[str] = None
+    rotation: Dict[str, Any] = field(default_factory=dict)
+    
+    @classmethod
+    def from_dict(cls, tool_id: str, data: Dict) -> 'OCRToolConfig':
+        """从字典创建"""
+        return cls(
+            name=data.get('name', tool_id),
+            description=data.get('description', ''),
+            json_structure=data.get('json_structure', 'object'),
+            text_field=data.get('text_field', 'text'),
+            bbox_field=data.get('bbox_field', 'bbox'),
+            category_field=data.get('category_field', 'category'),
+            confidence_field=data.get('confidence_field', 'confidence'),
+            parsing_results_field=data.get('parsing_results_field'),
+            rec_texts_field=data.get('rec_texts_field'),
+            rec_boxes_field=data.get('rec_boxes_field'),
+            table_body_field=data.get('table_body_field'),
+            table_cells_field=data.get('table_cells_field'),
+            img_path_field=data.get('img_path_field'),
+            rotation=data.get('rotation', {})
+        )
+    
+    def to_dict(self) -> Dict:
+        """转换为字典(用于 OCRValidator)"""
+        config_dict = {
+            'name': self.name,
+            'description': self.description,
+            'json_structure': self.json_structure,
+            'text_field': self.text_field,
+            'bbox_field': self.bbox_field,
+            'category_field': self.category_field,
+            'confidence_field': self.confidence_field,
+            'rotation': self.rotation
+        }
+        
+        # 添加可选字段
+        if self.parsing_results_field:
+            config_dict['parsing_results_field'] = self.parsing_results_field
+        if self.rec_texts_field:
+            config_dict['rec_texts_field'] = self.rec_texts_field
+        if self.rec_boxes_field:
+            config_dict['rec_boxes_field'] = self.rec_boxes_field
+        if self.table_body_field:
+            config_dict['table_body_field'] = self.table_body_field
+        if self.table_cells_field:
+            config_dict['table_cells_field'] = self.table_cells_field
+        if self.img_path_field:
+            config_dict['img_path_field'] = self.img_path_field
+        
+        return config_dict
+
+
+@dataclass
+class OCRResultConfig:
+    """OCR 结果配置"""
+    tool: str
+    result_dir: str
+    image_dir: Optional[str]
+    description: str = ""
+    enabled: bool = True
+    
+    @classmethod
+    def from_dict(cls, data: Dict, context: Dict = None) -> 'OCRResultConfig':
+        """
+        🎯 从字典创建(支持 Jinja2 模板)
+        
+        Args:
+            data: 配置数据
+            context: 模板上下文(如 {'name': '德_内蒙古银行照'})
+        """
+        # 🎯 渲染模板
+        if context:
+            result_dir = cls._render_template(data['result_dir'], context)
+            image_dir = cls._render_template(data.get('image_dir'), context) if data.get('image_dir') else None
+            description = cls._render_template(data.get('description', ''), context)
+        else:
+            result_dir = data['result_dir']
+            image_dir = data.get('image_dir')
+            description = data.get('description', '')
+        
+        return cls(
+            tool=data['tool'],
+            result_dir=result_dir,
+            image_dir=image_dir,
+            description=description,
+            enabled=data.get('enabled', True)
+        )
+    
+    @staticmethod
+    def _render_template(template_str: Optional[str], context: Dict) -> Optional[str]:
+        """🎯 渲染 Jinja2 模板"""
+        if not template_str:
+            return None
+        
+        try:
+            template = Template(template_str)
+            return template.render(context)
+        except Exception as e:
+            logging.warning(f"模板渲染失败: {template_str}, 错误: {e}")
+            return template_str
+
+
+@dataclass
+class DocumentConfig:
+    """文档配置"""
+    name: str
+    base_dir: str
+    ocr_results: List[OCRResultConfig] = field(default_factory=list)
+    
+    @classmethod
+    def from_dict(cls, data: Dict) -> 'DocumentConfig':
+        """从字典创建(支持 Jinja2 模板)"""
+        doc_data = data.get('document', data)
+        
+        # 🎯 构建模板上下文
+        context = {
+            'name': doc_data['name'],
+            'base_dir': doc_data['base_dir']
+        }
+        
+        return cls(
+            name=doc_data['name'],
+            base_dir=doc_data['base_dir'],
+            ocr_results=[
+                OCRResultConfig.from_dict(r, context) 
+                for r in doc_data.get('ocr_results', [])
+            ]
+        )
+
+
+@dataclass
+class DataSource:
+    """数据源(用于 OCRValidator)"""
+    name: str
+    ocr_tool: str
+    ocr_out_dir: str
+    src_img_dir: str
+    description: str = ""
+
+
+class ConfigManager:
+    """配置管理器"""
+    
+    def __init__(self, config_dir: str = "config"):
+        """
+        Args:
+            config_dir: 配置文件目录
+        """
+        self.config_dir = Path(config_dir)
+        self.logger = logging.getLogger(__name__)
+        
+        # 加载配置
+        self.global_config = self._load_global_config()
+        self.ocr_tools = self._load_ocr_tools()
+        self.documents = self._load_documents()
+    
+    def _load_global_config(self) -> Dict:
+        """加载全局配置"""
+        config_file = self.config_dir / "global.yaml"
+        
+        if not config_file.exists():
+            self.logger.warning(f"全局配置文件不存在: {config_file}")
+            return {}
+        
+        with open(config_file, 'r', encoding='utf-8') as f:
+            return yaml.safe_load(f) or {}
+    
+    def _load_ocr_tools(self) -> Dict[str, OCRToolConfig]:
+        """加载 OCR 工具配置(从 global.yaml)"""
+        tools_data = self.global_config.get('ocr', {}).get('tools', {})
+        
+        tools = {}
+        for tool_id, tool_data in tools_data.items():
+            tools[tool_id] = OCRToolConfig.from_dict(tool_id, tool_data)
+        
+        return tools
+    
+    def _load_documents(self) -> Dict[str, DocumentConfig]:
+        """加载文档配置(支持 Jinja2 模板)"""
+        documents = {}
+        
+        # 从 global.yaml 读取文档配置文件列表
+        doc_files = self.global_config.get('data_sources', [])
+        
+        for doc_file in doc_files:
+            # 支持相对路径和绝对路径
+            if not doc_file.endswith('.yaml'):
+                doc_file = f"{doc_file}.yaml"
+            
+            yaml_path = self.config_dir / doc_file
+            
+            if not yaml_path.exists():
+                self.logger.warning(f"文档配置文件不存在: {yaml_path}")
+                continue
+            
+            try:
+                with open(yaml_path, 'r', encoding='utf-8') as f:
+                    data = yaml.safe_load(f)
+                
+                # 🎯 使用支持 Jinja2 的解析方法
+                doc_config = DocumentConfig.from_dict(data)
+                documents[doc_config.name] = doc_config
+                
+                self.logger.info(f"✅ 加载文档配置: {doc_config.name} ({len(doc_config.ocr_results)} 个 OCR 结果)")
+                
+            except Exception as e:
+                self.logger.error(f"加载文档配置失败: {yaml_path}, 错误: {e}")
+        
+        return documents
+    
+    def get_ocr_tool(self, tool_id: str) -> Optional[OCRToolConfig]:
+        """获取 OCR 工具配置"""
+        return self.ocr_tools.get(tool_id)
+    
+    def get_document(self, doc_name: str) -> Optional[DocumentConfig]:
+        """获取文档配置"""
+        return self.documents.get(doc_name)
+    
+    def list_documents(self) -> List[str]:
+        """列出所有文档"""
+        return list(self.documents.keys())
+    
+    def list_ocr_tools(self) -> List[str]:
+        """列出所有 OCR 工具"""
+        return list(self.ocr_tools.keys())
+    
+    def get_data_sources(self) -> List[DataSource]:
+        """
+        生成数据源列表(供 OCRValidator 使用)
+        
+        从文档配置自动生成 data_sources
+        """
+        data_sources = []
+        
+        for doc_name, doc_config in self.documents.items():
+            base_dir = Path(doc_config.base_dir)
+            
+            for ocr_result in doc_config.ocr_results:
+                if not ocr_result.enabled:
+                    continue
+                
+                # 构建完整路径
+                ocr_out_dir = str(base_dir / ocr_result.result_dir)
+                
+                if ocr_result.image_dir:
+                    src_img_dir = str(base_dir / ocr_result.image_dir)
+                else:
+                    # 如果未指定图片目录,使用结果目录
+                    src_img_dir = str(base_dir / ocr_result.result_dir / doc_name)
+                
+                # 生成数据源名称
+                if ocr_result.description:
+                    source_name = f"{doc_name}_{ocr_result.description}"
+                else:
+                    tool_config = self.get_ocr_tool(ocr_result.tool)
+                    tool_name = tool_config.name if tool_config else ocr_result.tool
+                    source_name = f"{doc_name}_{tool_name}"
+                
+                data_source = DataSource(
+                    name=source_name,
+                    ocr_tool=ocr_result.tool,
+                    ocr_out_dir=ocr_out_dir,
+                    src_img_dir=src_img_dir,
+                    description=ocr_result.description or f"{doc_name} 使用 {ocr_result.tool}"
+                )
+                
+                data_sources.append(data_source)
+        
+        return data_sources
+    
+    def get_config_value(self, key_path: str, default=None):
+        """
+        获取配置值(支持点号路径)
+        
+        Examples:
+            get_config_value('styles.font_size')
+            get_config_value('ocr.min_text_length')
+        """
+        keys = key_path.split('.')
+        value = self.global_config
+        
+        for key in keys:
+            if isinstance(value, dict):
+                value = value.get(key)
+            else:
+                return default
+        
+        return value if value is not None else default
+    
+    def to_validator_config(self) -> Dict:
+        """
+        转换为 OCRValidator 所需的配置格式
+        
+        Returns:
+            包含 data_sources 和 ocr.tools 的配置字典
+        """
+        # 构建 data_sources 列表
+        data_sources_list = []
+        for ds in self.get_data_sources():
+            data_sources_list.append({
+                'name': ds.name,
+                'ocr_tool': ds.ocr_tool,
+                'ocr_out_dir': ds.ocr_out_dir,
+                'src_img_dir': ds.src_img_dir
+            })
+        
+        # 构建 ocr.tools 字典
+        ocr_tools_dict = {}
+        for tool_id, tool_config in self.ocr_tools.items():
+            ocr_tools_dict[tool_id] = tool_config.to_dict()
+        
+        # 返回完整配置
+        config = self.global_config.copy()
+        config['data_sources'] = data_sources_list
+        
+        # 确保 ocr.tools 存在
+        if 'ocr' not in config:
+            config['ocr'] = {}
+        config['ocr']['tools'] = ocr_tools_dict
+        
+        return config
+
+
+# ============================================================================
+# 便捷函数
+# ============================================================================
+
+def load_config(config_dir: str = "config") -> ConfigManager:
+    """加载配置"""
+    return ConfigManager(config_dir)
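
ConfigManager 的典型调用流程示意(config 目录路径以实际部署为准):

```python
import logging
from config_manager import load_config

logging.basicConfig(level=logging.INFO)

cm = load_config(config_dir="config")
print(cm.list_documents())                # 所有文档名
print(cm.list_ocr_tools())                # 所有 OCR 工具 ID
for ds in cm.get_data_sources():          # 由文档配置自动生成的数据源
    print(ds.name, "->", ds.ocr_out_dir)

font_size = cm.get_config_value("styles.font_size", 8)  # 点号路径取配置值
validator_config = cm.to_validator_config()             # 供 OCRValidator 直接使用
```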

+ 0 - 816
merger/merge_mineru_paddle_ocr.1.py

@@ -1,816 +0,0 @@
-"""
-合并 MinerU 和 PaddleOCR 的结果
-使用 MinerU 的表格结构识别 + PaddleOCR 的文字框坐标
-"""
-import json
-import re
-import argparse
-from pathlib import Path
-from typing import List, Dict, Tuple, Optional
-from bs4 import BeautifulSoup
-from fuzzywuzzy import fuzz
-import shutil
-
-class MinerUPaddleOCRMerger:
-    """合并 MinerU 和 PaddleOCR 的结果"""
-    
-    def __init__(self, look_ahead_window: int = 10, similarity_threshold: int = 90):
-        """
-        Args:
-            look_ahead_window: 向前查找的窗口大小
-            similarity_threshold: 文本相似度阈值
-        """
-        self.look_ahead_window = look_ahead_window
-        self.similarity_threshold = similarity_threshold
-    
-    def merge_table_with_bbox(self, mineru_json_path: str, paddle_json_path: str) -> List[Dict]:
-        """
-        合并 MinerU 和 PaddleOCR 的结果
-        
-        Args:
-            mineru_json_path: MinerU 输出的 JSON 路径
-            paddle_json_path: PaddleOCR 输出的 JSON 路径
-            output_path: 输出路径(可选)
-        
-        Returns:
-            合并后的结果字典
-        """
-        merged_data = None
-        # 加载数据
-        with open(mineru_json_path, 'r', encoding='utf-8') as f:
-            mineru_data = json.load(f)
-        
-        with open(paddle_json_path, 'r', encoding='utf-8') as f:
-            paddle_data = json.load(f)
-        
-        # 提取 PaddleOCR 的文字框信息
-        paddle_text_boxes = self._extract_paddle_text_boxes(paddle_data)
-        
-        # 处理 MinerU 的数据
-        merged_data = self._process_mineru_data(mineru_data, paddle_text_boxes)
-        
-        return merged_data
-    
-    def _extract_paddle_text_boxes(self, paddle_data: Dict) -> List[Dict]:
-        """提取 PaddleOCR 的文字框信息"""
-        text_boxes = []
-        
-        if 'overall_ocr_res' in paddle_data:
-            ocr_res = paddle_data['overall_ocr_res']
-            rec_texts = ocr_res.get('rec_texts', [])
-            rec_polys = ocr_res.get('rec_polys', [])
-            rec_scores = ocr_res.get('rec_scores', [])
-
-            for i, (text, poly, score) in enumerate(zip(rec_texts, rec_polys, rec_scores)):
-                if text and text.strip():
-                    # 计算 bbox (x_min, y_min, x_max, y_max)
-                    xs = [p[0] for p in poly]
-                    ys = [p[1] for p in poly]
-                    bbox = [min(xs), min(ys), max(xs), max(ys)]
-                    
-                    text_boxes.append({
-                        'text': text,
-                        'bbox': bbox,
-                        'poly': poly,
-                        'score': score,
-                        'paddle_bbox_index': i,
-                        'used': False  # 标记是否已被使用
-                    })
-
-        return text_boxes
-    
-    def _process_mineru_data(self, mineru_data: List[Dict], 
-                            paddle_text_boxes: List[Dict]) -> List[Dict]:
-        """处理 MinerU 数据,添加 bbox 信息
-
-        Args:
-            mineru_data (List[Dict]): _description_
-            paddle_text_boxes (List[Dict]): _description_
-
-        Returns:
-            List[Dict]: _description_
-        """ 
-
-        merged_data = []
-        cells = None  # 存储所有表格单元格信息
-        paddle_pointer = 0  # PaddleOCR 文字框指针
-        last_matched_index = 0  # 上次匹配成功的索引
-
-        # 对mineru_data按bbox从上到下排序,从左到右确保顺序一致
-        mineru_data.sort(key=lambda x: (x['bbox'][1], x['bbox'][0]) if 'bbox' in x else (float('inf'), float('inf')))
-
-        for item in mineru_data:
-            if item['type'] == 'table':
-                # 处理表格
-                merged_item = item.copy()
-                table_html = item.get('table_body', '')
-                
-                # 解析 HTML 表格并添加 bbox
-                enhanced_html, cells, paddle_pointer = self._enhance_table_html_with_bbox(
-                    table_html, paddle_text_boxes, paddle_pointer
-                )
-                
-                merged_item['table_body'] = enhanced_html
-                merged_item['table_body_with_bbox'] = enhanced_html
-                merged_item['bbox_mapping'] = 'merged_from_paddle_ocr'
-                merged_item['table_cells'] = cells if cells else []
-                
-                merged_data.append(merged_item)
-            
-            elif item['type'] in ['text', 'title']:
-                # 处理普通文本
-                merged_item = item.copy()
-                text = item.get('text', '')
-                
-                # 查找匹配的 bbox
-                matched_bbox, paddle_pointer, last_matched_index = self._find_matching_bbox(
-                    text, paddle_text_boxes, paddle_pointer, last_matched_index
-                )
-                
-                if matched_bbox:
-                    # merged_item['bbox'] = matched_bbox['bbox']
-                    # merged_item['bbox_source'] = 'paddle_ocr'
-                    # merged_item['text_score'] = matched_bbox['score']
-
-                    # 沿用mineru的bbox, 就是要移动位置paddle_pointer, last_matched_index
-                    # 标记为已使用
-                    matched_bbox['used'] = True
-                
-                merged_data.append(merged_item)
-            elif item['type'] == 'list':
-                # 处理列表项
-                merged_item = item.copy()
-                list_items = item.get('list_items', [])
-                sub_type = item.get('sub_type', 'unordered')  # 有序或无序
-
-                for list_item in list_items:
-                    # 查找匹配的 bbox
-                    matched_bbox, paddle_pointer, last_matched_index = self._find_matching_bbox(
-                        list_item, paddle_text_boxes, paddle_pointer, last_matched_index
-                    )
-                    
-                    if matched_bbox:
-                        # 沿用mineru的bbox, 就是要移动位置paddle_pointer, last_matched_index
-                        # 标记为已使用
-                        matched_bbox['used'] = True
-                
-                merged_data.append(merged_item)
-            else:
-                # 其他类型直接复制
-                merged_data.append(item.copy())
-        
-        return merged_data
-    
-    def _enhance_table_html_with_bbox(self, html: str, paddle_text_boxes: List[Dict], 
-                                      start_pointer: int) -> Tuple[str, List[Dict], int]:
-        """
-        为 HTML 表格添加 bbox 信息
-        
-        Args:
-            html: 原始 HTML 表格
-            paddle_text_boxes: PaddleOCR 文字框列表
-            start_pointer: 起始指针位置
-        
-        Returns:
-            (增强后的 HTML, 单元格数组, 新的指针位置)
-        """
-        # 需要处理minerU识别为2个连着的cell,如: -741.00|357,259.63, paddle识别为一个cell,如: -741.00357,259.63
-        soup = BeautifulSoup(html, 'html.parser')
-        current_pointer = start_pointer
-        last_matched_index = start_pointer
-        cells = []  # 存储单元格的 bbox 信息
-
-        # 遍历所有行
-        for row_idx, row in enumerate(soup.find_all('tr')):
-            # 遍历所有单元格
-            for col_idx, cell in enumerate(row.find_all(['td', 'th'])):
-                cell_text = cell.get_text(strip=True)
-            
-                if not cell_text:
-                    continue
-                
-                # 查找匹配的 bbox
-                matched_bbox, current_pointer, last_matched_index = self._find_matching_bbox(
-                    cell_text, paddle_text_boxes, current_pointer, last_matched_index
-                )
-                
-                if matched_bbox:
-                    # 添加 data-bbox 属性
-                    bbox = matched_bbox['bbox']
-                    cell['data-bbox'] = f"[{bbox[0]},{bbox[1]},{bbox[2]},{bbox[3]}]"
-                    cell['data-score'] = f"{matched_bbox['score']:.4f}"
-                    cell['data-paddle-index'] = str(matched_bbox['paddle_bbox_index'])
-
-                    cells.append({
-                        'type': 'table_cell',
-                        'text': cell_text,
-                        'bbox': bbox,
-                        'row': row_idx+1,
-                        'col': col_idx+1,
-                        'score': matched_bbox['score'],
-                        'paddle_bbox_index': matched_bbox['paddle_bbox_index']
-                    })
-                    # 标记为已使用
-                    matched_bbox['used'] = True
-        
-        return str(soup), cells, current_pointer
-    
-    def _find_matching_bbox(self, target_text: str, text_boxes: List[Dict], 
-                           start_index: int, last_match_index: int) -> tuple[Optional[Dict], int, int]:
-        """
-        查找匹配的文字框
-        
-        Args:
-            target_text: 目标文本
-            text_boxes: 文字框列表
-            start_index: 起始索引, 是最后一个used=True的位置+1 
-            last_match_index: 上次匹配成功的索引, 可能比start_index小
-        
-        Returns:
-            (匹配的文字框信息, 新的指针位置, last_match_index)
-        """
-        target_text = self._normalize_text(target_text)
-        
-        # 过滤过短的目标文本
-        if len(target_text) < 2:
-            return None, start_index, last_match_index
-
-        # 由于minerU和Paddle的顺序基本一致, 也有不一致的地方, 所以需要向前找第一个未使用的位置
-        # MinerU和Paddle都可能识别错误,所以需要一个look_ahead_window来避免漏掉匹配
-        # 匹配时会遇到一些特殊情况,比如Paddle把两个连着的cell识别为一个字符串,MinerU将单元格上下2行识别为一行
-        # 	'1|2024-08-11|扫二维码付'   minerU识别为“扫二维码付款”,Paddle识别为'12024-08-11扫二维码付'  
-        #                  款
-        # 字符串的顺序极大概率是一致的,所以如果短字符串是长字符串的子串,可以增加相似权重
-
-        search_start = last_match_index - 1
-        unused_count = 0
-        while search_start >= 0:
-            if text_boxes[search_start]['used'] == False:
-                unused_count += 1
-            if unused_count >= self.look_ahead_window:
-                break
-            search_start -= 1
-        if search_start < 0:
-            search_start = 0
-            while search_start < start_index and text_boxes[search_start]['used']:
-                search_start += 1
-        search_end = min(start_index + self.look_ahead_window, len(text_boxes))
-        
-        best_match = None
-        best_index = start_index
-        
-        for i in range(search_start, search_end):
-            if text_boxes[i]['used']:
-                continue
-            
-            box_text = self._normalize_text(text_boxes[i]['text'])
-            # 精确匹配优先
-            if target_text == box_text:
-                if i >= start_index:
-                    return text_boxes[i], i + 1, i
-                else:
-                    return text_boxes[i], start_index, i
-            
-            # 过滤过短的候选文本(避免单字符匹配)
-            if len(box_text) < 2:
-                continue
-            
-            # 长度比例检查 - 避免长度差异过大的匹配
-            length_ratio = min(len(target_text), len(box_text)) / max(len(target_text), len(box_text))
-            if length_ratio < 0.3:  # 长度差异超过70%则跳过
-                continue
-
-            # 子串检查
-            shorter = target_text if len(target_text) < len(box_text) else box_text
-            longer = box_text if len(target_text) < len(box_text) else target_text
-            is_substring = shorter in longer        
-
-            # 计算多种相似度
-            # token_sort_ratio = fuzz.token_sort_ratio(target_text, box_text)
-            partial_ratio = fuzz.partial_ratio(target_text, box_text)
-            if is_substring:
-                partial_ratio += 10  # 子串时提升相似度
-            
-            # 综合相似度 - 两种算法都要达到阈值
-            if (partial_ratio >= self.similarity_threshold):
-                if i >= start_index:
-                    return text_boxes[i], i + 1, last_match_index
-                else:
-                    return text_boxes[i], start_index, last_match_index
-
-        return best_match, best_index, last_match_index
-
-    def _normalize_text(self, text: str) -> str:
-        """标准化文本(去除空格、标点等)"""
-        # 移除所有空白字符
-        text = re.sub(r'\s+', '', text)
-        # 转换全角数字和字母为半角
-        text = self._full_to_half(text)
-        return text.lower()
-    
-    def _full_to_half(self, text: str) -> str:
-        """全角转半角"""
-        result = []
-        for char in text:
-            code = ord(char)
-            if code == 0x3000:  # 全角空格
-                code = 0x0020
-            elif 0xFF01 <= code <= 0xFF5E:  # 全角字符
-                code -= 0xFEE0
-            result.append(chr(code))
-        return ''.join(result)
-    
-    def generate_enhanced_markdown(self, merged_data: List[Dict], 
-                                   output_path: Optional[str] = None, mineru_file: Optional[str] = None) -> str:
-        """
-        生成增强的 Markdown(包含 bbox 信息的注释)
-        参考 MinerU 的实现,支持标题、列表、表格标题等
-        
-        Args:
-            merged_data: 合并后的数据
-            output_path: 输出路径(可选)
-            mineru_file: MinerU 源文件路径(用于复制图片)
-        
-        Returns:
-            Markdown 内容
-        """
-        md_lines = []
-        
-        for item in merged_data:
-            item_type = item.get('type', '')
-            bbox = item.get('bbox', [])
-            
-            # 添加 bbox 注释
-            if bbox:
-                md_lines.append(f"<!-- bbox: {bbox} -->")
-            
-            # 根据类型处理
-            if item_type == 'title':
-                # 标题 - 使用 text_level 确定标题级别
-                text = item.get('text', '')
-                text_level = item.get('text_level', 1)
-                heading = '#' * min(text_level, 6)  # 最多6级标题
-                md_lines.append(f"{heading} {text}\n")
-            
-            elif item_type == 'text':
-                # 普通文本 - 可能也有 text_level
-                text = item.get('text', '')
-                text_level = item.get('text_level', 0)
-                
-                if text_level > 0:
-                    # 作为标题处理
-                    heading = '#' * min(text_level, 6)
-                    md_lines.append(f"{heading} {text}\n")
-                else:
-                    # 普通段落
-                    md_lines.append(f"{text}\n")
-            
-            elif item_type == 'list':
-                # 列表
-                sub_type = item.get('sub_type', 'text')
-                list_items = item.get('list_items', [])
-                
-                for list_item in list_items:
-                    md_lines.append(f"{list_item}\n")
-                
-                md_lines.append("")  # 列表后添加空行
-            
-            elif item_type == 'table':
-                # 表格标题
-                table_caption = item.get('table_caption', [])
-                if table_caption:
-                    for caption in table_caption:
-                        if caption:  # 跳过空标题
-                            md_lines.append(f"**{caption}**\n")
-                
-                # 表格内容
-                table_body = item.get('table_body_with_bbox', item.get('table_body', ''))
-                if table_body:
-                    md_lines.append(table_body)
-                    md_lines.append("")
-                
-                # 表格脚注
-                table_footnote = item.get('table_footnote', [])
-                if table_footnote:
-                    for footnote in table_footnote:
-                        if footnote:
-                            md_lines.append(f"*{footnote}*")
-                    md_lines.append("")
-            
-            elif item_type == 'image':
-                # 图片
-                img_path = item.get('img_path', '')
-                
-                # 复制图片到输出目录
-                if img_path and mineru_file and output_path:
-                    mineru_dir = Path(mineru_file).parent
-                    img_full_path = mineru_dir / img_path
-                    if img_full_path.exists():
-                        output_img_path = Path(output_path).parent / img_path
-                        output_img_path.parent.mkdir(parents=True, exist_ok=True)
-                        shutil.copy(img_full_path, output_img_path)
-                
-                # 图片标题
-                image_caption = item.get('image_caption', [])
-                if image_caption:
-                    for caption in image_caption:
-                        if caption:
-                            md_lines.append(f"**{caption}**\n")
-                
-                # 插入图片
-                md_lines.append(f"![Image]({img_path})\n")
-                
-                # 图片脚注
-                image_footnote = item.get('image_footnote', [])
-                if image_footnote:
-                    for footnote in image_footnote:
-                        if footnote:
-                            md_lines.append(f"*{footnote}*")
-                    md_lines.append("")
-            
-            elif item_type == 'equation':
-                # 公式
-                latex = item.get('latex', '')
-                if latex:
-                    md_lines.append(f"$$\n{latex}\n$$\n")
-            
-            elif item_type == 'inline_equation':
-                # 行内公式
-                latex = item.get('latex', '')
-                if latex:
-                    md_lines.append(f"${latex}$\n")
-            
-            elif item_type == 'page_number':
-                # 页码 - 通常跳过或作为注释
-                text = item.get('text', '')
-                md_lines.append(f"<!-- 页码: {text} -->\n")
-            
-            elif item_type == 'header':
-                # 页眉
-                text = item.get('text', '')
-                md_lines.append(f"<!-- 页眉: {text} -->\n")
-            
-            elif item_type == 'footer':
-                # 页脚
-                text = item.get('text', '')
-                if text:
-                    md_lines.append(f"<!-- 页脚: {text} -->\n")
-            
-            elif item_type == 'reference':
-                # 参考文献
-                text = item.get('text', '')
-                md_lines.append(f"> {text}\n")
-            
-            else:
-                # 未知类型 - 尝试提取文本
-                text = item.get('text', '')
-                if text:
-                    md_lines.append(f"{text}\n")
-        
-        markdown_content = '\n'.join(md_lines)
-        
-        # 保存文件
-        if output_path:
-            with open(output_path, 'w', encoding='utf-8') as f:
-                f.write(markdown_content)
-        
-        return markdown_content
-
-    def extract_table_cells_with_bbox(self, merged_data: List[Dict]) -> List[Dict]:
-        """
-        提取所有表格单元格及其 bbox 信息
-        
-        Returns:
-            单元格列表,每个包含 text, bbox, row, col 等信息
-        """
-        cells = []
-        
-        for item in merged_data:
-            if item['type'] != 'table':
-                continue
-            
-            html = item.get('table_body_with_bbox', item.get('table_body', ''))
-            soup = BeautifulSoup(html, 'html.parser')
-            
-            # 遍历所有行
-            for row_idx, row in enumerate(soup.find_all('tr')):
-                # 遍历所有单元格
-                for col_idx, cell in enumerate(row.find_all(['td', 'th'])):
-                    cell_text = cell.get_text(strip=True)
-                    bbox_str = cell.get('data-bbox', '')
-                    
-                    if bbox_str:
-                        try:
-                            bbox = json.loads(bbox_str)
-                            cells.append({
-                                'text': cell_text,
-                                'bbox': bbox,
-                                'row': row_idx,
-                                'col': col_idx,
-                                'score': float(cell.get('data-score', 0)),
-                                'paddle_index': int(cell.get('data-paddle-index', -1))
-                            })
-                        except (json.JSONDecodeError, ValueError):
-                            pass
-        
-        return cells
-
-
-def merge_single_file(mineru_file: Path, paddle_file: Path, output_dir: Path, 
-                     output_format: str, merger: MinerUPaddleOCRMerger) -> bool:
-    """
-    合并单个文件
-    
-    Args:
-        mineru_file: MinerU JSON 文件路径
-        paddle_file: PaddleOCR JSON 文件路径
-        output_dir: 输出目录
-        merger: 合并器实例
-    
-    Returns:
-        是否成功
-    """
-    print(f"📄 处理: {mineru_file.name}")
-    
-    # 输出文件路径
-    merged_md_path = output_dir / f"{mineru_file.stem}.md"
-    merged_json_path = output_dir / f"{mineru_file.stem}.json"
-    
-    try:
-        # 合并数据
-        merged_data = merger.merge_table_with_bbox(
-            str(mineru_file),
-            str(paddle_file)
-        )
-        
-        # 生成 Markdown
-        if output_format in ['markdown', 'both']:
-            merger.generate_enhanced_markdown(merged_data, str(merged_md_path), mineru_file)
-        
-        # 提取单元格信息
-        # cells = merger.extract_table_cells_with_bbox(merged_data)
-        if output_format in ['json', 'both']:
-            with open(merged_json_path, 'w', encoding='utf-8') as f:
-                json.dump(merged_data, f, ensure_ascii=False, indent=2)
-
-        print(f"  ✅ 合并完成")
-        print(f"  📊 共处理了 {len(merged_data)} 个对象")
-        print(f"  💾 输出文件:")
-        if output_format in ['markdown', 'both']:
-            print(f"    - {merged_md_path.name}")
-        if output_format in ['json', 'both']:
-            print(f"    - {merged_json_path.name}")
-
-        return True
-        
-    except Exception as e:
-        print(f"  ❌ 处理失败: {e}")
-        import traceback
-        traceback.print_exc()
-        return False
-
-
-def merge_mineru_paddle_batch(mineru_dir: str, paddle_dir: str, output_dir: str, output_format: str = 'both',
-                              look_ahead_window: int = 10, 
-                              similarity_threshold: int = 80):
-    """
-    批量合并 MinerU 和 PaddleOCR 的结果
-    
-    Args:
-        mineru_dir: MinerU 结果目录
-        paddle_dir: PaddleOCR 结果目录
-        output_dir: 输出目录
-        look_ahead_window: 向前查找窗口大小
-        similarity_threshold: 相似度阈值
-    """
-    mineru_path = Path(mineru_dir)
-    paddle_path = Path(paddle_dir)
-    output_path = Path(output_dir)
-    output_path.mkdir(parents=True, exist_ok=True)
-    
-    merger = MinerUPaddleOCRMerger(
-        look_ahead_window=look_ahead_window, 
-        similarity_threshold=similarity_threshold
-    )
-    
-    # 查找所有 MinerU 的 JSON 文件
-    mineru_files = list(mineru_path.glob('*_page_*[0-9].json'))
-    mineru_files.sort()
-    
-    print(f"\n🔍 找到 {len(mineru_files)} 个 MinerU 文件")
-    print(f"📂 MinerU 目录: {mineru_dir}")
-    print(f"📂 PaddleOCR 目录: {paddle_dir}")
-    print(f"📂 输出目录: {output_dir}")
-    print(f"⚙️  查找窗口: {look_ahead_window}")
-    print(f"⚙️  相似度阈值: {similarity_threshold}%\n")
-    
-    success_count = 0
-    failed_count = 0
-    
-    for mineru_file in mineru_files:
-        # 查找对应的 PaddleOCR 文件
-        paddle_file = paddle_path / mineru_file.name
-        
-        if not paddle_file.exists():
-            print(f"⚠️  跳过: 未找到对应的 PaddleOCR 文件: {paddle_file.name}\n")
-            failed_count += 1
-            continue
-
-        if merge_single_file(mineru_file, paddle_file, output_path, output_format, merger):
-            success_count += 1
-        else:
-            failed_count += 1
-        
-        print()  # 空行分隔
-    
-    # 打印统计信息
-    print("=" * 60)
-    print(f"✅ 处理完成!")
-    print(f"📊 统计信息:")
-    print(f"  - 总文件数: {len(mineru_files)}")
-    print(f"  - 成功: {success_count}")
-    print(f"  - 失败: {failed_count}")
-    print("=" * 60)
-
-
-def main():
-    """主函数"""
-    parser = argparse.ArgumentParser(
-        description='合并 MinerU 和 PaddleOCR 的识别结果,添加 bbox 坐标信息',
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-示例用法:
-
-  1. 批量处理整个目录:
-     python merge_mineru_paddle_ocr.py \\
-         --mineru-dir /path/to/mineru/results \\
-         --paddle-dir /path/to/paddle/results \\
-         --output-dir /path/to/output
-
-  2. 处理单个文件:
-     python merge_mineru_paddle_ocr.py \\
-         --mineru-file /path/to/file_page_001.json \\
-         --paddle-file /path/to/file_page_001.json \\
-         --output-dir /path/to/output
-
-  3. 自定义参数:
-     python merge_mineru_paddle_ocr.py \\
-         --mineru-dir /path/to/mineru \\
-         --paddle-dir /path/to/paddle \\
-         --output-dir /path/to/output \\
-         --window 15 \\
-         --threshold 85
-        """
-    )
-    
-    # 文件/目录参数
-    file_group = parser.add_argument_group('文件参数')
-    file_group.add_argument(
-        '--mineru-file', 
-        type=str,
-        help='MinerU 输出的 JSON 文件路径(单文件模式)'
-    )
-    file_group.add_argument(
-        '--paddle-file', 
-        type=str,
-        help='PaddleOCR 输出的 JSON 文件路径(单文件模式)'
-    )
-    
-    dir_group = parser.add_argument_group('目录参数')
-    dir_group.add_argument(
-        '--mineru-dir', 
-        type=str,
-        help='MinerU 结果目录(批量模式)'
-    )
-    dir_group.add_argument(
-        '--paddle-dir', 
-        type=str,
-        help='PaddleOCR 结果目录(批量模式)'
-    )
-    
-    # 输出参数
-    output_group = parser.add_argument_group('输出参数')
-    output_group.add_argument(
-        '-o', '--output-dir',
-        type=str,
-        required=True,
-        help='输出目录(必需)'
-    )
-    output_group.add_argument(
-        '-f', '--format', 
-        choices=['json', 'markdown', 'both'], 
-        default='both', help='输出格式'
-    )
-
-    # 算法参数
-    algo_group = parser.add_argument_group('算法参数')
-    algo_group.add_argument(
-        '-w', '--window',
-        type=int,
-        default=15,
-        help='向前查找的窗口大小(默认: 10)'
-    )
-    algo_group.add_argument(
-        '-t', '--threshold',
-        type=int,
-        default=80,
-        help='文本相似度阈值(0-100,默认: 80)'
-    )
-    
-    args = parser.parse_args()
-    output_format = args.format.lower()
-    
-    # 验证参数
-    if args.mineru_file and args.paddle_file:
-        # 单文件模式
-        mineru_file = Path(args.mineru_file)
-        paddle_file = Path(args.paddle_file)
-        output_dir = Path(args.output_dir)
-        
-        if not mineru_file.exists():
-            print(f"❌ 错误: MinerU 文件不存在: {mineru_file}")
-            return
-        
-        if not paddle_file.exists():
-            print(f"❌ 错误: PaddleOCR 文件不存在: {paddle_file}")
-            return
-        
-        output_dir.mkdir(parents=True, exist_ok=True)
-        
-        print("\n🔧 单文件处理模式")
-        print(f"📄 MinerU 文件: {mineru_file}")
-        print(f"📄 PaddleOCR 文件: {paddle_file}")
-        print(f"📂 输出目录: {output_dir}")
-        print(f"⚙️  查找窗口: {args.window}")
-        print(f"⚙️  相似度阈值: {args.threshold}%\n")
-        
-        merger = MinerUPaddleOCRMerger(
-            look_ahead_window=args.window,
-            similarity_threshold=args.threshold
-        )
-        
-        success = merge_single_file(mineru_file, paddle_file, output_dir, output_format, merger)
-        
-        if success:
-            print("\n✅ 处理完成!")
-        else:
-            print("\n❌ 处理失败!")
-    
-    elif args.mineru_dir and args.paddle_dir:
-        # 批量模式
-        if not Path(args.mineru_dir).exists():
-            print(f"❌ 错误: MinerU 目录不存在: {args.mineru_dir}")
-            return
-        
-        if not Path(args.paddle_dir).exists():
-            print(f"❌ 错误: PaddleOCR 目录不存在: {args.paddle_dir}")
-            return
-        
-        print("\n🔧 批量处理模式")
-        
-        merge_mineru_paddle_batch(
-            args.mineru_dir,
-            args.paddle_dir,
-            args.output_dir,
-            output_format=output_format,
-            look_ahead_window=args.window,
-            similarity_threshold=args.threshold
-        )
-    
-    else:
-        parser.print_help()
-        print("\n❌ 错误: 请指定单文件模式或批量模式的参数")
-        print("  单文件模式: --mineru-file 和 --paddle-file")
-        print("  批量模式: --mineru-dir 和 --paddle-dir")
-
-if __name__ == "__main__":
-    print("🚀 启动 MinerU + PaddleOCR 合并程序...")
-    
-    import sys
-    
-    if len(sys.argv) == 1:
-        # 如果没有命令行参数,使用默认配置运行
-        print("ℹ️  未提供命令行参数,使用默认配置运行...")
-        
-        # 默认配置
-        default_config = {
-            "mineru-file": "/Users/zhch158/workspace/data/流水分析/对公_招商银行图/mineru-vlm-2.5.3_Results/对公_招商银行图_page_001.json",
-            "paddle-file": "/Users/zhch158/workspace/data/流水分析/对公_招商银行图/data_PPStructureV3_Results/对公_招商银行图_page_001.json",
-            "output-dir": "/Users/zhch158/workspace/data/流水分析/对公_招商银行图/merged_results",
-            # "mineru-dir": "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照/mineru-vlm-2.5.3_Results",
-            # "paddle-dir": "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照/data_PPStructureV3_Results",
-            # "output-dir": "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照/merged_results",
-            "format": "both",
-            "window": "15",
-            "threshold": "85"
-        }
-        
-        print("⚙️  默认参数:")
-        for key, value in default_config.items():
-            print(f"  --{key}: {value}")
-        # 构造参数
-        sys.argv = [sys.argv[0]]
-        for key, value in default_config.items():
-            sys.argv.extend([f"--{key}", str(value)])
-    
-    sys.exit(main())

+ 114 - 10
streamlit_ocr_validator.py

@@ -17,6 +17,7 @@ from streamlit_validator_cross import (
 )
 from streamlit_validator_result import display_single_page_cross_validation
 from ocr_validator_utils import get_data_source_display_name
+from config_manager import load_config  # 🎯 使用新配置管理器
 
 
 def reset_cross_validation_results():
@@ -28,22 +29,37 @@ def reset_cross_validation_results():
 
 def main():
     """主应用"""
+    # 🎯 初始化配置管理器
+    if 'config_manager' not in st.session_state:
+        try:
+            st.session_state.config_manager = load_config(config_dir="config")
+            # 🎯 生成 OCRValidator 所需的配置
+            st.session_state.validator_config = st.session_state.config_manager.to_validator_config()
+            print("✅ 配置管理器初始化成功")
+            print(f"📄 发现 {len(st.session_state.config_manager.list_documents())} 个文档配置")
+            print(f"🔧 发现 {len(st.session_state.config_manager.list_ocr_tools())} 个 OCR 工具")
+        except Exception as e:
+            st.error(f"❌ 配置加载失败: {e}")
+            st.stop()
+    
+    config_manager = st.session_state.config_manager
+    validator_config = config_manager.to_validator_config()
+    
     # 初始化应用
     if 'validator' not in st.session_state:
-        validator = StreamlitOCRValidator()
+        # 🎯 直接传递配置字典给 OCRValidator
+        validator = StreamlitOCRValidator(config_dict=validator_config)
         st.session_state.validator = validator
-        setup_page_config(validator.config)
+        setup_page_config(validator_config)
         
         # 页面标题
-        config = st.session_state.validator.config
-        st.title(config['ui']['page_title'])
+        st.title(validator_config['ui']['page_title'])
         
         # 初始化数据源追踪
         st.session_state.current_ocr_source = validator.current_source_key
         st.session_state.current_verify_source = validator.verify_source_key
     else:
         validator = st.session_state.validator
-        config = st.session_state.validator.config
     
     if 'selected_text' not in st.session_state:
         st.session_state.selected_text = None
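这个 hunk 把配置加载挪进 `st.session_state`,保证整个会话只加载一次、rerun 时直接复用。该缓存模式的独立示意(`expensive_load` 为假设的占位函数,实际项目中对应 `config_manager.load_config`):

```python
import streamlit as st

def expensive_load() -> dict:
    """假设的加载函数,代表一次开销较大的配置读取"""
    return {"ui": {"page_title": "demo"}}

# Streamlit 每次交互都会重跑脚本,所以把结果缓存进 session_state
if "config" not in st.session_state:
    try:
        st.session_state.config = expensive_load()
    except Exception as e:
        st.error(f"配置加载失败: {e}")
        st.stop()  # 加载失败时终止本次渲染

config = st.session_state.config
st.title(config["ui"]["page_title"])
```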
@@ -84,6 +100,44 @@ def main():
     
     # 如果没有可用的数据源,提前返回
     if not validator.all_sources:
+        st.warning("⚠️ 未找到任何数据源,请检查配置文件")
+        
+        # 🎯 显示配置信息帮助调试
+        with st.expander("🔍 配置信息", expanded=True):
+            st.write("**已加载的文档:**")
+            docs = config_manager.list_documents()
+            if docs:
+                for doc in docs:
+                    doc_config = config_manager.get_document(doc)
+                    st.write(f"- **{doc}**")
+                    st.write(f"  - 基础目录: `{doc_config.base_dir}`")
+                    st.write(f"  - OCR 结果: {len([r for r in doc_config.ocr_results if r.enabled])} 个已启用")
+            else:
+                st.write("无")
+            
+            st.write("**已加载的 OCR 工具:**")
+            tools = config_manager.list_ocr_tools()
+            if tools:
+                for tool in tools:
+                    tool_config = config_manager.get_ocr_tool(tool)
+                    st.write(f"- **{tool_config.name}** (`{tool}`)")
+            else:
+                st.write("无")
+            
+            st.write("**配置文件路径:**")
+            st.code(str(config_manager.config_dir / "global.yaml"))
+            
+            st.write("**生成的数据源:**")
+            data_sources = config_manager.get_data_sources()
+            if data_sources:
+                for ds in data_sources:
+                    st.write(f"- `{ds.name}`")
+                    st.write(f"  - 工具: {ds.ocr_tool}")
+                    st.write(f"  - 结果目录: {ds.ocr_out_dir}")
+                    st.write(f"  - 图片目录: {ds.src_img_dir}")
+            else:
+                st.write("无")
+        
         st.stop()
     
     # 文件选择区域
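上面的调试面板依赖 `config_manager` 的一组查询接口。按这些调用方式可以反推出如下接口草图(字段与方法签名均由调用处推断,仅为假设,并非真实实现):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

@dataclass
class OCRResult:
    tool: str                # OCR 工具 ID
    result_dir: str          # 结果目录
    enabled: bool = True
    description: str = ""

@dataclass
class DocumentConfig:
    name: str
    base_dir: str
    ocr_results: List[OCRResult] = field(default_factory=list)

@dataclass
class OCRToolConfig:
    name: str
    description: str = ""

@dataclass
class DataSource:
    name: str
    ocr_tool: str
    ocr_out_dir: str
    src_img_dir: str

class ConfigManagerSketch:
    """按上文调用反推的接口草图,方法体省略"""
    config_dir: Path

    def list_documents(self) -> List[str]: ...
    def get_document(self, name: str) -> Optional[DocumentConfig]: ...
    def list_ocr_tools(self) -> List[str]: ...
    def get_ocr_tool(self, tool_id: str) -> Optional[OCRToolConfig]: ...
    def get_data_sources(self) -> List[DataSource]: ...
```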
@@ -170,7 +224,7 @@ def main():
             show_batch_cross_validation_results_dialog()
 
     # 显示当前数据源统计信息
-    with st.expander("🔧 OCR工具统计信息", expanded=False):
+    with st.expander("统� OCR工具计信息", expanded=False):
         stats = validator.get_statistics()
         col1, col2, col3, col4, col5 = st.columns(5)
         
@@ -184,20 +238,70 @@ def main():
             st.metric("✅ 准确率", f"{stats['accuracy_rate']:.1f}%")
         with col5:
             if validator.current_source_config:
-                tool_display = validator.current_source_config['ocr_tool'].upper()
+                tool_id = validator.current_source_config['ocr_tool']
+                # 🎯 从配置管理器获取工具名称
+                tool_config = config_manager.get_ocr_tool(tool_id)
+                tool_display = tool_config.name if tool_config else tool_id.upper()
                 st.metric("🔧 OCR工具", tool_display)
         
         if stats['tool_info']:
             st.write("**详细信息:**", stats['tool_info'])
+        
+        # 🎯 显示当前文档和 OCR 结果信息
+        if validator.current_source_config:
+            source_name = validator.current_source_config['name']
+            # 解析数据源名称,提取文档名(更精确的解析)
+            parts = source_name.split('_', 1)
+            doc_name = parts[0] if parts else source_name
+            
+            doc_config = config_manager.get_document(doc_name)
+            if doc_config:
+                st.write("**文档信息:**")
+                st.write(f"- 文档名称: {doc_config.name}")
+                st.write(f"- 基础目录: {doc_config.base_dir}")
+                st.write(f"- 可用 OCR 工具: {len([r for r in doc_config.ocr_results if r.enabled])} 个")
+    
+    # 🎯 添加配置管理面板
+    with st.expander("⚙️ 配置管理", expanded=False):
+        col1, col2 = st.columns(2)
+        
+        with col1:
+            st.subheader("📄 已加载文档")
+            docs = config_manager.list_documents()
+            for doc_name in docs:
+                doc_config = config_manager.get_document(doc_name)
+                enabled_count = len([r for r in doc_config.ocr_results if r.enabled])
+                total_count = len(doc_config.ocr_results)
+                
+                with st.container():
+                    st.write(f"✅ **{doc_name}**")
+                    st.caption(f"📊 {enabled_count}/{total_count} 工具已启用")
+                    
+                    # 显示每个 OCR 工具的状态
+                    for ocr_result in doc_config.ocr_results:
+                        status_icon = "🟢" if ocr_result.enabled else "⚪"
+                        tool_config = config_manager.get_ocr_tool(ocr_result.tool)
+                        tool_name = tool_config.name if tool_config else ocr_result.tool
+                        st.caption(f"  {status_icon} {tool_name} - {ocr_result.description or ocr_result.result_dir}")
+        
+        with col2:
+            st.subheader("🔧 已加载 OCR 工具")
+            tools = config_manager.list_ocr_tools()
+            for tool_id in tools:
+                tool_config = config_manager.get_ocr_tool(tool_id)
+                with st.container():
+                    st.write(f"🔧 **{tool_config.name}**")
+                    st.caption(f"ID: `{tool_id}`")
+                    st.caption(f"描述: {tool_config.description}")
     
     tab1, tab2, tab3 = st.tabs(["📄 内容人工检查", "🔍 交叉验证结果", "📊 表格分析"])
     
     with tab1:
-        validator.create_compact_layout(config)
+        validator.create_compact_layout(validator_config)
 
     with tab2:
         # ✅ 使用封装的函数显示单页交叉验证结果
-        display_single_page_cross_validation(validator, config)
+        display_single_page_cross_validation(validator, validator_config)
 
     with tab3:
         st.header("📊 表格数据分析")
@@ -207,7 +311,7 @@ def main():
             display_html_table_as_dataframe(validator.md_content)
         else:
             st.info("当前OCR结果中没有检测到表格数据")
-    
+
 
 if __name__ == "__main__":
     main()

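`to_validator_config()` 返回的字典至少要满足上文出现过的访问路径。一个覆盖这些读取的最小结构示意(只列出 diff 中实际用到的键,具体字段组织为假设):

```python
# 最小示意:满足 validator_config['ui']['page_title'] 等访问路径
validator_config = {
    "ui": {
        "page_title": "OCR可视化校验工具",  # setup_page_config / st.title 读取
    },
    # 数据源等其余配置由 OCRValidator 内部消费,结构未在本 diff 中出现,为假设
}
print(validator_config["ui"]["page_title"])
```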
+ 0 - 1496
streamlit_ocr_validator_v1.py

@@ -1,1496 +0,0 @@
-#!/usr/bin/env python3
-"""
-基于Streamlit的OCR可视化校验工具(重构版)
-提供丰富的交互组件和更好的用户体验
-"""
-
-import streamlit as st
-from pathlib import Path
-from PIL import Image
-from typing import Dict, List, Optional
-import plotly.graph_objects as go
-from io import BytesIO
-import pandas as pd
-import numpy as np
-import plotly.express as px
-import json
-
-# 导入工具模块
-from ocr_validator_utils import (
-    load_config, load_ocr_data_file, process_ocr_data,
-    get_ocr_statistics,
-    find_available_ocr_files,
-    group_texts_by_category,
-    find_available_ocr_files_multi_source, get_data_source_display_name
-)
-from ocr_validator_file_utils import (
-    load_css_styles,
-    draw_bbox_on_image,
-    convert_html_table_to_markdown,
-    parse_html_tables, 
-    create_dynamic_css,
-    export_tables_to_excel, 
-    get_table_statistics,
-)
-from ocr_validator_layout import OCRLayoutManager
-from ocr_by_vlm import ocr_with_vlm
-from compare_ocr_results import compare_ocr_results
-
-
-class StreamlitOCRValidator:
-    def __init__(self):
-        self.config = load_config()
-        self.ocr_data = []
-        self.md_content = ""
-        self.image_path = ""
-        self.text_bbox_mapping = {}
-        self.selected_text = None
-        self.marked_errors = set()
-        
-        # 多数据源相关
-        self.all_sources = {}
-        self.current_source_key = None
-        self.current_source_config = None
-        self.file_info = []
-        self.selected_file_index = -1
-        self.display_options = []
-        self.file_paths = []
-        
-        # ✅ 新增:交叉验证数据源
-        self.verify_source_key = None
-        self.verify_source_config = None
-        self.verify_file_info = []
-        self.verify_display_options = []
-        self.verify_file_paths = []
-
-        # 初始化布局管理器
-        self.layout_manager = OCRLayoutManager(self)
-
-        # 加载多数据源文件信息
-        self.load_multi_source_info()
-        
-    def load_multi_source_info(self):
-        """加载多数据源文件信息"""
-        self.all_sources = find_available_ocr_files_multi_source(self.config)
-        
-        # 如果有数据源,默认选择第一个作为OCR源
-        if self.all_sources:
-            source_keys = list(self.all_sources.keys())
-            first_source_key = source_keys[0]
-            self.switch_to_source(first_source_key)
-            
-            # 如果有第二个数据源,默认作为验证源
-            if len(source_keys) > 1:
-                self.switch_to_verify_source(source_keys[1])
-    
-    def switch_to_source(self, source_key: str):
-        """切换到指定OCR数据源"""
-        if source_key in self.all_sources:
-            self.current_source_key = source_key
-            source_data = self.all_sources[source_key]
-            self.current_source_config = source_data['config']
-            self.file_info = source_data['files']
-            
-            if self.file_info:
-                # 创建显示选项列表
-                self.display_options = [f"{info['display_name']}" for info in self.file_info]
-                self.file_paths = [info['path'] for info in self.file_info]
-                
-                # 重置文件选择
-                self.selected_file_index = -1
-                print(f"✅ 切换到OCR数据源: {source_key}")
-            else:
-                print(f"⚠️ 数据源 {source_key} 没有可用文件")
-    
-    def switch_to_verify_source(self, source_key: str):
-        """切换到指定验证数据源"""
-        if source_key in self.all_sources:
-            self.verify_source_key = source_key
-            source_data = self.all_sources[source_key]
-            self.verify_source_config = source_data['config']
-            self.verify_file_info = source_data['files']
-            
-            if self.verify_file_info:
-                self.verify_display_options = [f"{info['display_name']}" for info in self.verify_file_info]
-                self.verify_file_paths = [info['path'] for info in self.verify_file_info]
-                print(f"✅ 切换到验证数据源: {source_key}")
-            else:
-                print(f"⚠️ 验证数据源 {source_key} 没有可用文件")
-
-    def setup_page_config(self):
-        """设置页面配置"""
-        ui_config = self.config['ui']
-        st.set_page_config(
-            page_title=ui_config['page_title'],
-            page_icon=ui_config['page_icon'],
-            layout=ui_config['layout'],
-            initial_sidebar_state=ui_config['sidebar_state']
-        )
-        
-        # 加载CSS样式
-        css_content = load_css_styles()
-        st.markdown(f"<style>{css_content}</style>", unsafe_allow_html=True)
-
-    def create_data_source_selector(self):
-        """创建双数据源选择器 - 支持交叉验证"""
-        if not self.all_sources:
-            st.warning("❌ 未找到任何数据源,请检查配置文件")
-            return
-        
-        # 准备数据源选项
-        source_options = {}
-        for source_key, source_data in self.all_sources.items():
-            display_name = get_data_source_display_name(source_data['config'])
-            source_options[display_name] = source_key
-        
-        # 创建两列布局
-        col1, col2 = st.columns(2)
-        
-        with col1:
-            st.markdown("#### 📊 OCR数据源")
-            # OCR数据源选择
-            current_display_name = None
-            if self.current_source_key:
-                for display_name, key in source_options.items():
-                    if key == self.current_source_key:
-                        current_display_name = display_name
-                        break
-            
-            selected_ocr_display = st.selectbox(
-                "选择OCR数据源",
-                options=list(source_options.keys()),
-                index=list(source_options.keys()).index(current_display_name) if current_display_name else 0,
-                key="ocr_source_selector",
-                label_visibility="collapsed",
-                help="选择要分析的OCR数据源"
-            )
-            
-            selected_ocr_key = source_options[selected_ocr_display]
-            
-            # 如果OCR数据源发生变化,切换数据源
-            if selected_ocr_key != self.current_source_key:
-                self.switch_to_source(selected_ocr_key)
-                if 'selected_file_index' in st.session_state:
-                    st.session_state.selected_file_index = 0
-                st.rerun()
-            
-            # 显示OCR数据源信息
-            if self.current_source_config:
-                with st.expander("📋 OCR数据源详情", expanded=False):
-                    st.write(f"**工具:** {self.current_source_config['ocr_tool']}")
-                    st.write(f"**文件数:** {len(self.file_info)}")
-        
-        with col2:
-            st.markdown("#### 🔍 验证数据源")
-            # 验证数据源选择
-            verify_display_name = None
-            if self.verify_source_key:
-                for display_name, key in source_options.items():
-                    if key == self.verify_source_key:
-                        verify_display_name = display_name
-                        break
-            
-            selected_verify_display = st.selectbox(
-                "选择验证数据源",
-                options=list(source_options.keys()),
-                index=list(source_options.keys()).index(verify_display_name) if verify_display_name else (1 if len(source_options) > 1 else 0),
-                key="verify_source_selector",
-                label_visibility="collapsed",
-                help="选择用于交叉验证的数据源"
-            )
-            
-            selected_verify_key = source_options[selected_verify_display]
-            
-            # 如果验证数据源发生变化,切换数据源
-            if selected_verify_key != self.verify_source_key:
-                self.switch_to_verify_source(selected_verify_key)
-                st.rerun()
-            
-            # 显示验证数据源信息
-            if self.verify_source_config:
-                with st.expander("📋 验证数据源详情", expanded=False):
-                    st.write(f"**工具:** {self.verify_source_config['ocr_tool']}")
-                    st.write(f"**文件数:** {len(self.verify_file_info)}")
-        
-        # 数据源对比提示
-        if self.current_source_key == self.verify_source_key:
-            st.warning("⚠️ OCR数据源和验证数据源相同,建议选择不同的数据源进行交叉验证")
-        else:
-            st.success(f"✅ 已选择 {selected_ocr_display} 与 {selected_verify_display} 进行交叉验证")    
-    
-    def load_ocr_data(self, json_path: str, md_path: Optional[str] = None, image_path: Optional[str] = None):
-        """加载OCR相关数据 - 支持多数据源配置"""
-        try:
-            # 使用当前数据源的配置加载数据
-            if self.current_source_config:
-                # 临时修改config以使用当前数据源的配置
-                temp_config = self.config.copy()
-                temp_config['paths'] = {
-                    'ocr_out_dir': self.current_source_config['ocr_out_dir'],
-                    'src_img_dir': self.current_source_config.get('src_img_dir', ''),
-                    'pre_validation_dir': self.config['pre_validation']['out_dir']
-                }
-                
-                # 设置OCR工具类型
-                temp_config['current_ocr_tool'] = self.current_source_config['ocr_tool']
-                
-                self.ocr_data, self.md_content, self.image_path = load_ocr_data_file(json_path, temp_config)
-            else:
-                self.ocr_data, self.md_content, self.image_path = load_ocr_data_file(json_path, self.config)
-                
-            self.process_data()
-        except Exception as e:
-            st.error(f"❌ 加载失败: {e}")
-            st.exception(e)
-    
-    def process_data(self):
-        """处理OCR数据"""
-        self.text_bbox_mapping = process_ocr_data(self.ocr_data, self.config)
-    
-    def get_statistics(self) -> Dict:
-        """获取统计信息"""
-        return get_ocr_statistics(self.ocr_data, self.text_bbox_mapping, self.marked_errors)
-    
-    def display_html_table_as_dataframe(self, html_content: str, enable_editing: bool = False):
-        """将HTML表格解析为DataFrame显示 - 增强版本支持横向滚动"""
-        tables = parse_html_tables(html_content)
-        wide_table_threshold = 15  # 超宽表格列数阈值
-        
-        if not tables:
-            st.warning("未找到可解析的表格")
-            # 对于无法解析的HTML表格,使用自定义CSS显示
-            st.markdown("""
-            <style>
-            .scrollable-table {
-                overflow-x: auto;
-                white-space: nowrap;
-                border: 1px solid #ddd;
-                border-radius: 5px;
-                margin: 10px 0;
-            }
-            .scrollable-table table {
-                width: 100%;
-                border-collapse: collapse;
-            }
-            .scrollable-table th, .scrollable-table td {
-                border: 1px solid #ddd;
-                padding: 8px;
-                text-align: left;
-                min-width: 100px;
-            }
-            .scrollable-table th {
-                background-color: #f5f5f5;
-                font-weight: bold;
-            }
-            </style>
-            """, unsafe_allow_html=True)
-            
-            st.markdown(f'<div class="scrollable-table">{html_content}</div>', unsafe_allow_html=True)
-            return
-            
-        for i, table in enumerate(tables):
-            st.subheader(f"📊 表格 {i+1}")
-            
-            # 表格信息显示
-            col_info1, col_info2, col_info3, col_info4 = st.columns(4)
-            with col_info1:
-                st.metric("行数", len(table))
-            with col_info2:
-                st.metric("列数", len(table.columns))
-            with col_info3:
-                # 检查是否有超宽表格
-                is_wide_table = len(table.columns) > wide_table_threshold
-                st.metric("表格类型", "超宽表格" if is_wide_table else "普通表格")
-            with col_info4:
-                # 表格操作模式选择
-                display_mode = st.selectbox(
-                    f"显示模式 (表格{i+1})",
-                    ["完整显示", "分页显示", "筛选列显示"],
-                    key=f"display_mode_{i}"
-                )
-            
-            # 创建表格操作按钮
-            col1, col2, col3, col4 = st.columns(4)
-            with col1:
-                show_info = st.checkbox(f"显示详细信息", key=f"info_{i}")
-            with col2:
-                show_stats = st.checkbox(f"显示统计信息", key=f"stats_{i}")
-            with col3:
-                enable_filter = st.checkbox(f"启用过滤", key=f"filter_{i}")
-            with col4:
-                enable_sort = st.checkbox(f"启用排序", key=f"sort_{i}")
-            
-            # 根据显示模式处理表格
-            display_table = self._process_table_display_mode(table, i, display_mode)
-            
-            # 数据过滤和排序逻辑
-            filtered_table = self._apply_table_filters_and_sorts(display_table, i, enable_filter, enable_sort)
-            
-            # 显示表格 - 使用自定义CSS支持横向滚动
-            st.markdown("""
-            <style>
-            .dataframe-container {
-                overflow-x: auto;
-                border: 1px solid #ddd;
-                border-radius: 5px;
-                margin: 10px 0;
-            }
-            
-            /* 为超宽表格特殊样式 */
-            .wide-table-container {
-                overflow-x: auto;
-                max-height: 500px;
-                overflow-y: auto;
-                border: 2px solid #0288d1;
-                border-radius: 8px;
-                background: linear-gradient(90deg, #f8f9fa 0%, #ffffff 100%);
-            }
-            
-            .dataframe thead th {
-                position: sticky;
-                top: 0;
-                background-color: #f5f5f5 !important;
-                z-index: 10;
-                border-bottom: 2px solid #0288d1;
-            }
-            
-            .dataframe tbody td {
-                white-space: nowrap;
-                min-width: 100px;
-                max-width: 300px;
-                overflow: hidden;
-                text-overflow: ellipsis;
-            }
-            </style>
-            """, unsafe_allow_html=True)
-            
-            # 根据表格宽度选择显示容器
-            container_class = "wide-table-container" if len(table.columns) > wide_table_threshold else "dataframe-container"
-            
-            if enable_editing:
-                st.markdown(f'<div class="{container_class}">', unsafe_allow_html=True)
-                edited_table = st.data_editor(
-                    filtered_table, 
-                    width='stretch', 
-                    key=f"editor_{i}",
-                    height=400 if len(table.columns) > 8 else None
-                )
-                st.markdown('</div>', unsafe_allow_html=True)
-                
-                if not edited_table.equals(filtered_table):
-                    st.success("✏️ 表格已编辑,可以导出修改后的数据")
-            else:
-                st.markdown(f'<div class="{container_class}">', unsafe_allow_html=True)
-                st.dataframe(
-                    filtered_table, 
-                    width=400 if len(table.columns) > wide_table_threshold else "stretch"
-                )
-                st.markdown('</div>', unsafe_allow_html=True)
-            
-            # 显示表格信息和统计
-            self._display_table_info_and_stats(table, filtered_table, show_info, show_stats, i)
-            
-            st.markdown("---")
-    
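`display_html_table_as_dataframe` 依赖项目自带的 `parse_html_tables`;若只需要 HTML 表格转 DataFrame 这一步,pandas 内置的 `read_html` 也能完成(需要安装 lxml 或 html5lib 解析器),示意如下:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>日期</th><th>金额</th></tr>
  <tr><td>2024-01-01</td><td>100.00</td></tr>
  <tr><td>2024-01-02</td><td>250.50</td></tr>
</table>
"""

# read_html 返回页面中所有表格解析出的 DataFrame 列表
tables = pd.read_html(io.StringIO(html))
print(tables[0])
```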
-    def _apply_table_filters_and_sorts(self, table: pd.DataFrame, table_index: int, enable_filter: bool, enable_sort: bool) -> pd.DataFrame:
-        """应用表格过滤和排序"""
-        filtered_table = table.copy()
-        
-        # 数据过滤
-        if enable_filter and not table.empty:
-            filter_col = st.selectbox(
-                f"选择过滤列 (表格 {table_index+1})", 
-                options=['无'] + list(table.columns),
-                key=f"filter_col_{table_index}"
-            )
-            
-            if filter_col != '无':
-                filter_value = st.text_input(f"过滤值 (表格 {table_index+1})", key=f"filter_value_{table_index}")
-                if filter_value:
-                    filtered_table = table[table[filter_col].astype(str).str.contains(filter_value, na=False)]
-        
-        # 数据排序
-        if enable_sort and not filtered_table.empty:
-            sort_col = st.selectbox(
-                f"选择排序列 (表格 {table_index+1})", 
-                options=['无'] + list(filtered_table.columns),
-                key=f"sort_col_{table_index}"
-            )
-            
-            if sort_col != '无':
-                sort_order = st.radio(
-                    f"排序方式 (表格 {table_index+1})",
-                    options=['升序', '降序'],
-                    horizontal=True,
-                    key=f"sort_order_{table_index}"
-                )
-                ascending = (sort_order == '升序')
-                filtered_table = filtered_table.sort_values(sort_col, ascending=ascending)
-        
-        return filtered_table
-    
-    def _display_table_info_and_stats(self, original_table: pd.DataFrame, filtered_table: pd.DataFrame, 
-                                     show_info: bool, show_stats: bool, table_index: int):
-        """显示表格信息和统计数据"""
-        if show_info:
-            st.write("**表格信息:**")
-            st.write(f"- 原始行数: {len(original_table)}")
-            st.write(f"- 过滤后行数: {len(filtered_table)}")
-            st.write(f"- 列数: {len(original_table.columns)}")
-            st.write(f"- 列名: {', '.join(original_table.columns)}")
-        
-        if show_stats:
-            st.write("**统计信息:**")
-            numeric_cols = filtered_table.select_dtypes(include=[np.number]).columns
-            if len(numeric_cols) > 0:
-                st.dataframe(filtered_table[numeric_cols].describe())
-            else:
-                st.info("表格中没有数值列")
-        
-        # 导出功能
-        if st.button(f"📥 导出表格 {table_index+1}", key=f"export_{table_index}"):
-            self._create_export_buttons(filtered_table, table_index)
-    
-    def _create_export_buttons(self, table: pd.DataFrame, table_index: int):
-        """创建导出按钮"""
-        # CSV导出
-        csv_data = table.to_csv(index=False)
-        st.download_button(
-            label=f"下载CSV (表格 {table_index+1})",
-            data=csv_data,
-            file_name=f"table_{table_index+1}.csv",
-            mime="text/csv",
-            key=f"download_csv_{table_index}"
-        )
-        
-        # Excel导出
-        excel_buffer = BytesIO()
-        table.to_excel(excel_buffer, index=False)
-        st.download_button(
-            label=f"下载Excel (表格 {table_index+1})",
-            data=excel_buffer.getvalue(),
-            file_name=f"table_{table_index+1}.xlsx",
-            mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-            key=f"download_excel_{table_index}"
-        )
-    
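导出按钮的关键是用 `BytesIO` 在内存中生成 Excel,再把字节交给 `st.download_button`,全程不写临时文件。脱离 Streamlit 的最小示意(xlsx 写入依赖 openpyxl):

```python
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"日期": ["2024-01-01"], "金额": [100.0]})

buffer = BytesIO()
df.to_excel(buffer, index=False, sheet_name="导出")  # 写入内存缓冲区
excel_bytes = buffer.getvalue()  # 可直接作为 st.download_button 的 data 参数

print(f"生成 {len(excel_bytes)} 字节的 xlsx 内容")
```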
-    def _process_table_display_mode(self, table: pd.DataFrame, table_index: int, display_mode: str) -> pd.DataFrame:
-        """根据显示模式处理表格"""
-        if display_mode == "分页显示":
-            # 分页显示
-            page_size = st.selectbox(
-                f"每页显示行数 (表格 {table_index+1})",
-                [10, 20, 50, 100],
-                key=f"page_size_{table_index}"
-            )
-            
-            total_pages = (len(table) - 1) // page_size + 1
-            
-            if total_pages > 1:
-                page_number = st.selectbox(
-                    f"页码 (表格 {table_index+1})",
-                    range(1, total_pages + 1),
-                    key=f"page_number_{table_index}"
-                )
-                
-                start_idx = (page_number - 1) * page_size
-                end_idx = start_idx + page_size
-                return table.iloc[start_idx:end_idx]
-            
-            return table
-            
-        elif display_mode == "筛选列显示":
-            # 列筛选显示
-            if len(table.columns) > 5:
-                selected_columns = st.multiselect(
-                    f"选择要显示的列 (表格 {table_index+1})",
-                    table.columns.tolist(),
-                    default=table.columns.tolist()[:5],  # 默认显示前5列
-                    key=f"selected_columns_{table_index}"
-                )
-                
-                if selected_columns:
-                    return table[selected_columns]
-            
-            return table
-            
-        else:  # 完整显示
-            return table
-
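分页显示的页数公式是 `total_pages = (行数 - 1) // 每页行数 + 1`,第 p 页取 `iloc[(p-1)*size : p*size]`。用一个小例子验证边界(23 行、每页 10 行应得 3 页,末页 3 行):

```python
import pandas as pd

df = pd.DataFrame({"v": range(23)})
page_size = 10
total_pages = (len(df) - 1) // page_size + 1  # (23-1)//10 + 1 = 3

for page_number in range(1, total_pages + 1):
    start = (page_number - 1) * page_size
    chunk = df.iloc[start:start + page_size]
    print(f"第 {page_number}/{total_pages} 页: {len(chunk)} 行")
```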
-    def find_verify_md_path(self, selected_file_index: int) -> Optional[Path]:
-        """查找当前OCR文件对应的验证文件路径"""
-        current_page = self.file_info[selected_file_index]['page']
-        verify_md_path = None
-
-        for i, info in enumerate(self.verify_file_info):
-            if info['page'] == current_page:
-                verify_md_path = Path(self.verify_file_paths[i]).with_suffix('.md')
-                break
-
-        return verify_md_path
-
-    @st.dialog("交叉验证", width="large", dismissible=True, on_dismiss="rerun")
-    def cross_validation(self):
-        """交叉验证功能 - 批量比对两个数据源的所有OCR结果"""
-        
-        if self.current_source_key == self.verify_source_key:
-            st.error("❌ OCR数据源和验证数据源不能相同")
-            return
-        
-        # 初始化对比结果存储
-        if 'cross_validation_batch_result' not in st.session_state:
-            st.session_state.cross_validation_batch_result = None
-        
-        st.header("🔄 批量交叉验证")
-        
-        # 显示数据源信息
-        col1, col2 = st.columns(2)
-        with col1:
-            st.info(f"**OCR数据源:** {get_data_source_display_name(self.current_source_config)}")
-            st.write(f"📁 文件数量: {len(self.file_info)}")
-        with col2:
-            st.info(f"**验证数据源:** {get_data_source_display_name(self.verify_source_config)}")
-            st.write(f"📁 文件数量: {len(self.verify_file_info)}")
-        
-        # 批量验证选项
-        with st.expander("⚙️ 验证选项", expanded=True):
-            col1, col2 = st.columns(2)
-            with col1:
-                table_mode = st.selectbox(
-                    "表格比对模式",
-                    options=['standard', 'flow_list'],
-                    index=1,  # 默认使用flow_list
-                    format_func=lambda x: '流水表格模式' if x == 'flow_list' else '标准模式',
-                    help="选择表格比对算法"
-                )
-            with col2:
-                similarity_algorithm = st.selectbox(
-                    "相似度算法",
-                    options=['ratio', 'partial_ratio', 'token_sort_ratio', 'token_set_ratio'],
-                    index=0,
-                    help="选择文本相似度计算算法"
-                )
-        
-        # 开始批量验证按钮
-        if st.button("🚀 开始批量验证", type="primary", width='stretch'):
-            self._run_batch_cross_validation(table_mode, similarity_algorithm)
-        
-        # 显示历史批量验证结果
-        if 'cross_validation_batch_result' in st.session_state and st.session_state.cross_validation_batch_result:
-            st.markdown("---")
-            self._display_batch_validation_results(st.session_state.cross_validation_batch_result)
-    
-    def _generate_batch_validation_markdown(self, batch_results: dict, output_path: str):
-        """生成批量验证的Markdown报告"""
-        
-        with open(output_path, "w", encoding="utf-8") as f:
-            f.write("# 批量交叉验证报告\n\n")
-            
-            # 基本信息
-            f.write("## 📋 基本信息\n\n")
-            f.write(f"- **OCR数据源:** {batch_results['ocr_source']}\n")
-            f.write(f"- **验证数据源:** {batch_results['verify_source']}\n")
-            f.write(f"- **表格模式:** {batch_results['table_mode']}\n")
-            f.write(f"- **相似度算法:** {batch_results['similarity_algorithm']}\n")
-            f.write(f"- **验证时间:** {batch_results['timestamp']}\n\n")
-            
-            # 汇总统计
-            summary = batch_results['summary']
-            f.write("## 📊 汇总统计\n\n")
-            f.write(f"- **总页数:** {summary['total_pages']}\n")
-            f.write(f"- **成功页数:** {summary['successful_pages']}\n")
-            f.write(f"- **失败页数:** {summary['failed_pages']}\n")
-            f.write(f"- **总差异数:** {summary['total_differences']}\n")
-            f.write(f"- **表格差异:** {summary['total_table_differences']}\n")
-            f.write(f"  - 金额差异: {summary.get('total_amount_differences', 0)}\n")
-            f.write(f"  - 日期差异: {summary.get('total_datetime_differences', 0)}\n")
-            f.write(f"  - 文本差异: {summary.get('total_text_differences', 0)}\n")
-            f.write(f"  - 表头前差异: {summary.get('total_table_pre_header', 0)}\n")
-            f.write(f"  - 表头位置差异: {summary.get('total_table_header_position', 0)}\n")
-            f.write(f"  - 表头严重错误: {summary.get('total_table_header_critical', 0)}\n")
-            f.write(f"  - 行缺失: {summary.get('total_table_row_missing', 0)}\n")
-            f.write(f"- **段落差异:** {summary['total_paragraph_differences']}\n")
-            f.write(f"- **严重程度统计:**\n")
-            f.write(f"  - 高严重度: {summary.get('total_high_severity', 0)}\n")
-            f.write(f"  - 中严重度: {summary.get('total_medium_severity', 0)}\n")
-            f.write(f"  - 低严重度: {summary.get('total_low_severity', 0)}\n\n")
-            
-            # 详细结果表格
-            f.write("## 📄 各页差异统计\n\n")
-            f.write("| 页码 | 状态 | 总差异 | 表格差异 | 金额 | 日期 | 文本 | 段落 | 表头前 | 表头位置 | 表头错误 | 行缺失 | 高 | 中 | 低 |\n")
-            f.write("|------|------|--------|----------|------|------|------|------|--------|----------|----------|--------|----|----|----|\n")
-            
-            for page in batch_results['pages']:
-                if page['status'] == 'success':
-                    status_icon = "✅" if page['total_differences'] == 0 else "⚠️"
-                    f.write(f"| {page['page_num']} | {status_icon} | ")
-                    f.write(f"{page['total_differences']} | ")
-                    f.write(f"{page['table_differences']} | ")
-                    f.write(f"{page.get('amount_differences', 0)} | ")
-                    f.write(f"{page.get('datetime_differences', 0)} | ")
-                    f.write(f"{page.get('text_differences', 0)} | ")
-                    f.write(f"{page['paragraph_differences']} | ")
-                    f.write(f"{page.get('table_pre_header', 0)} | ")
-                    f.write(f"{page.get('table_header_position', 0)} | ")
-                    f.write(f"{page.get('table_header_critical', 0)} | ")
-                    f.write(f"{page.get('table_row_missing', 0)} | ")
-                    f.write(f"{page.get('high_severity', 0)} | ")
-                    f.write(f"{page.get('medium_severity', 0)} | ")
-                    f.write(f"{page.get('low_severity', 0)} |\n")
-                else:
-                    f.write(f"| {page['page_num']} | ❌ | - | - | - | - | - | - | - | - | - | - | - | - | - |\n")
-            
-            f.write("\n")
-            
-            # 问题汇总
-            f.write("## 🔍 问题汇总\n\n")
-            
-            high_diff_pages = [p for p in batch_results['pages'] 
-                             if p['status'] == 'success' and p['total_differences'] > 10]
-            if high_diff_pages:
-                f.write("### ⚠️ 高差异页面(差异>10)\n\n")
-                for page in high_diff_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page['total_differences']} 个差异\n")
-                f.write("\n")
-            
-            amount_error_pages = [p for p in batch_results['pages'] 
-                                if p['status'] == 'success' and p.get('amount_differences', 0) > 0]
-            if amount_error_pages:
-                f.write("### 💰 金额差异页面\n\n")
-                for page in amount_error_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page.get('amount_differences', 0)} 个金额差异\n")
-                f.write("\n")
-            
-            header_error_pages = [p for p in batch_results['pages'] 
-                                if p['status'] == 'success' and p.get('table_header_critical', 0) > 0]
-            if header_error_pages:
-                f.write("### ❌ 表头严重错误页面\n\n")
-                for page in header_error_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page['table_header_critical']} 个表头错误\n")
-                f.write("\n")
-            
-            failed_pages = [p for p in batch_results['pages'] if p['status'] == 'failed']
-            if failed_pages:
-                f.write("### 💥 验证失败页面\n\n")
-                for page in failed_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page.get('error', '未知错误')}\n")
-                f.write("\n")
-
-    def _run_batch_cross_validation(self, table_mode: str, similarity_algorithm: str):
-        """执行批量交叉验证"""
-        
-        # 准备输出目录
-        pre_validation_dir = Path(self.config['pre_validation'].get('out_dir', './output/pre_validation/')).resolve()
-        pre_validation_dir.mkdir(parents=True, exist_ok=True)
-        
-        # ✅ 批量结果存储 - 更新统计字段
-        batch_results = {
-            'ocr_source': get_data_source_display_name(self.current_source_config),
-            'verify_source': get_data_source_display_name(self.verify_source_config),
-            'table_mode': table_mode,
-            'similarity_algorithm': similarity_algorithm,
-            'timestamp': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
-            'pages': [],
-            'summary': {
-                'total_pages': 0,
-                'successful_pages': 0,
-                'failed_pages': 0,
-                'total_differences': 0,
-                'total_table_differences': 0,
-                'total_amount_differences': 0,
-                'total_datetime_differences': 0,
-                'total_text_differences': 0,
-                'total_paragraph_differences': 0,
-                'total_table_pre_header': 0,
-                'total_table_header_position': 0,
-                'total_table_header_critical': 0,
-                'total_table_row_missing': 0,
-                'total_high_severity': 0,
-                'total_medium_severity': 0,
-                'total_low_severity': 0
-            }
-        }
-        
-        # 创建进度条
-        progress_bar = st.progress(0)
-        status_text = st.empty()
-        
-        # 建立页码映射
-        ocr_page_map = {info['page']: i for i, info in enumerate(self.file_info)}
-        verify_page_map = {info['page']: i for i, info in enumerate(self.verify_file_info)}
-        
-        # 找出两个数据源共同的页码
-        common_pages = sorted(set(ocr_page_map.keys()) & set(verify_page_map.keys()))
-        
-        if not common_pages:
-            st.error("❌ 两个数据源没有共同的页码,无法进行对比")
-            return
-        
-        batch_results['summary']['total_pages'] = len(common_pages)
-        
-        # 创建详细日志区域
-        with st.expander("📋 详细对比日志", expanded=True):
-            log_container = st.container()
-        
-        # 逐页对比
-        for idx, page_num in enumerate(common_pages):
-            try:
-                # 更新进度
-                progress = (idx + 1) / len(common_pages)
-                progress_bar.progress(progress)
-                status_text.text(f"正在对比第 {page_num} 页... ({idx + 1}/{len(common_pages)})")
-                
-                # 获取文件路径
-                ocr_file_index = ocr_page_map[page_num]
-                verify_file_index = verify_page_map[page_num]
-                
-                ocr_md_path = Path(self.file_paths[ocr_file_index]).with_suffix('.md')
-                verify_md_path = Path(self.verify_file_paths[verify_file_index]).with_suffix('.md')
-                
-                if not ocr_md_path.exists() or not verify_md_path.exists():
-                    with log_container:
-                        st.warning(f"⚠️ 第 {page_num} 页:文件不存在,跳过")
-                    batch_results['summary']['failed_pages'] += 1
-                    continue
-                
-                # 执行对比
-                comparison_result_path = pre_validation_dir / f"{ocr_md_path.stem}_cross_validation"
-                
-                # 捕获对比输出
-                import io
-                import contextlib
-                
-                output_buffer = io.StringIO()
-                
-                with contextlib.redirect_stdout(output_buffer):
-                    comparison_result = compare_ocr_results(
-                        file1_path=str(ocr_md_path),
-                        file2_path=str(verify_md_path),
-                        output_file=str(comparison_result_path),
-                        output_format='both',
-                        ignore_images=True,
-                        table_mode=table_mode,
-                        similarity_algorithm=similarity_algorithm
-                    )
-                
-                # ✅ 提取统计信息 - 更新字段
-                stats = comparison_result['statistics']
-                
-                page_result = {
-                    'page_num': page_num,
-                    'ocr_file': str(ocr_md_path.name),
-                    'verify_file': str(verify_md_path.name),
-                    'total_differences': stats['total_differences'],
-                    'table_differences': stats['table_differences'],
-                    'amount_differences': stats.get('amount_differences', 0),
-                    'datetime_differences': stats.get('datetime_differences', 0),
-                    'text_differences': stats.get('text_differences', 0),
-                    'paragraph_differences': stats['paragraph_differences'],
-                    'table_pre_header': stats.get('table_pre_header', 0),
-                    'table_header_position': stats.get('table_header_position', 0),
-                    'table_header_critical': stats.get('table_header_critical', 0),
-                    'table_row_missing': stats.get('table_row_missing', 0),
-                    'high_severity': stats.get('high_severity', 0),
-                    'medium_severity': stats.get('medium_severity', 0),
-                    'low_severity': stats.get('low_severity', 0),
-                    'status': 'success',
-                    'comparison_json': f"{comparison_result_path}.json",
-                    'comparison_md': f"{comparison_result_path}.md"
-                }
-                
-                batch_results['pages'].append(page_result)
-                batch_results['summary']['successful_pages'] += 1
-                batch_results['summary']['total_differences'] += stats['total_differences']
-                batch_results['summary']['total_table_differences'] += stats['table_differences']
-                batch_results['summary']['total_amount_differences'] += stats.get('amount_differences', 0)
-                batch_results['summary']['total_datetime_differences'] += stats.get('datetime_differences', 0)
-                batch_results['summary']['total_text_differences'] += stats.get('text_differences', 0)
-                batch_results['summary']['total_paragraph_differences'] += stats['paragraph_differences']
-                batch_results['summary']['total_table_pre_header'] += stats.get('table_pre_header', 0)
-                batch_results['summary']['total_table_header_position'] += stats.get('table_header_position', 0)
-                batch_results['summary']['total_table_header_critical'] += stats.get('table_header_critical', 0)
-                batch_results['summary']['total_table_row_missing'] += stats.get('table_row_missing', 0)
-                batch_results['summary']['total_high_severity'] += stats.get('high_severity', 0)
-                batch_results['summary']['total_medium_severity'] += stats.get('medium_severity', 0)
-                batch_results['summary']['total_low_severity'] += stats.get('low_severity', 0)
-                
-                # 显示当前页对比结果
-                with log_container:
-                    if stats['total_differences'] == 0:
-                        st.success(f"✅ 第 {page_num} 页:完全匹配")
-                    else:
-                        st.warning(f"⚠️ 第 {page_num} 页:发现 {stats['total_differences']} 个差异")
-                
-            except Exception as e:
-                with log_container:
-                    st.error(f"❌ 第 {page_num} 页:对比失败 - {str(e)}")
-                
-                page_result = {
-                    'page_num': page_num,
-                    'status': 'failed',
-                    'error': str(e)
-                }
-                batch_results['pages'].append(page_result)
-                batch_results['summary']['failed_pages'] += 1
-        
-        # 保存批量结果
-        batch_result_path = pre_validation_dir / f"{self.current_source_config['name']}_{self.current_source_config['ocr_tool']}_vs_{self.verify_source_config['ocr_tool']}_batch_cross_validation"
-        
-        # 保存JSON
-        with open(f"{batch_result_path}.json", "w", encoding="utf-8") as f:
-            json.dump(batch_results, f, ensure_ascii=False, indent=2)
-        
-        # 生成Markdown报告
-        self._generate_batch_validation_markdown(batch_results, f"{batch_result_path}.md")
-        
-        # 保存到session state
-        st.session_state.cross_validation_batch_result = batch_results
-        
-        # 完成提示
-        progress_bar.progress(1.0)
-        status_text.text("✅ 批量验证完成!")
-        
-        st.success(f"🎉 批量验证完成!成功: {batch_results['summary']['successful_pages']}, 失败: {batch_results['summary']['failed_pages']}")
-
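批量验证中用 `contextlib.redirect_stdout` 截获 `compare_ocr_results` 的打印输出,避免刷进页面日志。该标准库模式的独立示意(`noisy_compare` 为假设的占位函数):

```python
import contextlib
import io

def noisy_compare() -> int:
    """假设的对比函数,会向 stdout 打印过程信息"""
    print("比对中...")
    print("发现 2 处差异")
    return 2

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diff_count = noisy_compare()  # print 输出被写入 buffer 而非终端

print(f"差异数: {diff_count}")
print(f"捕获的日志:\n{buffer.getvalue()}")
```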
-    def _display_batch_validation_results(self, batch_results: dict):
-        """显示批量验证结果"""
-        
-        st.header("📊 批量验证结果")
-        
-        # 汇总统计
-        summary = batch_results['summary']
-        
-        col1, col2, col3, col4 = st.columns(4)
-        with col1:
-            st.metric("总页数", summary['total_pages'])
-        with col2:
-            st.metric("成功页数", summary['successful_pages'], 
-                     delta=f"{summary['successful_pages']/summary['total_pages']*100:.1f}%")
-        with col3:
-            st.metric("失败页数", summary['failed_pages'],
-                     delta=f"-{summary['failed_pages']}" if summary['failed_pages'] > 0 else "0")
-        with col4:
-            st.metric("总差异数", summary['total_differences'])
-        
-        # ✅ 详细差异类型统计 - 更新展示
-        st.subheader("📈 差异类型统计")
-        
-        col1, col2, col3 = st.columns(3)
-        with col1:
-            st.metric("表格差异", summary['total_table_differences'])
-            st.caption(f"金额: {summary.get('total_amount_differences', 0)} | 日期: {summary.get('total_datetime_differences', 0)} | 文本: {summary.get('total_text_differences', 0)}")
-        with col2:
-            st.metric("段落差异", summary['total_paragraph_differences'])
-        with col3:
-            st.metric("严重度", f"高:{summary.get('total_high_severity', 0)} 中:{summary.get('total_medium_severity', 0)} 低:{summary.get('total_low_severity', 0)}")
-        
-        # 表格结构差异统计
-        with st.expander("📋 表格结构差异详情", expanded=False):
-            col1, col2, col3, col4 = st.columns(4)
-            with col1:
-                st.metric("表头前", summary.get('total_table_pre_header', 0))
-            with col2:
-                st.metric("表头位置", summary.get('total_table_header_position', 0))
-            with col3:
-                st.metric("表头错误", summary.get('total_table_header_critical', 0))
-            with col4:
-                st.metric("行缺失", summary.get('total_table_row_missing', 0))
-        
-        # ✅ 各页详细结果表格 - 更新列
-        st.subheader("📄 各页详细结果")
-        
-        # 准备DataFrame
-        page_data = []
-        for page in batch_results['pages']:
-            if page['status'] == 'success':
-                page_data.append({
-                    '页码': page['page_num'],
-                    '状态': '✅ 成功' if page['total_differences'] == 0 else '⚠️ 有差异',
-                    '总差异': page['total_differences'],
-                    '表格差异': page['table_differences'],
-                    '金额': page.get('amount_differences', 0),
-                    '日期': page.get('datetime_differences', 0),
-                    '文本': page.get('text_differences', 0),
-                    '段落': page['paragraph_differences'],
-                    '表头前': page.get('table_pre_header', 0),
-                    '表头位置': page.get('table_header_position', 0),
-                    '表头错误': page.get('table_header_critical', 0),
-                    '行缺失': page.get('table_row_missing', 0),
-                    '高': page.get('high_severity', 0),
-                    '中': page.get('medium_severity', 0),
-                    '低': page.get('low_severity', 0)
-                })
-            else:
-                page_data.append({
-                    '页码': page['page_num'],
-                    '状态': '❌ 失败',
-                    '总差异': '-', '表格差异': '-', '金额': '-', '日期': '-', 
-                    '文本': '-', '段落': '-', '表头前': '-', '表头位置': '-',
-                    '表头错误': '-', '行缺失': '-', '高': '-', '中': '-', '低': '-'
-                })
-        
-        df_pages = pd.DataFrame(page_data)
-        
-        # 显示表格
-        st.dataframe(
-            df_pages,
-            width='stretch',
-            hide_index=True,
-            column_config={
-                "页码": st.column_config.NumberColumn("页码", width="small"),
-                "状态": st.column_config.TextColumn("状态", width="small"),
-                "总差异": st.column_config.NumberColumn("总差异", width="small"),
-                "表格差异": st.column_config.NumberColumn("表格", width="small"),
-                "金额": st.column_config.NumberColumn("金额", width="small"),
-                "日期": st.column_config.NumberColumn("日期", width="small"),
-                "文本": st.column_config.NumberColumn("文本", width="small"),
-                "段落": st.column_config.NumberColumn("段落", width="small"),
-            }
-        )
-        
-        # 下载选项
-        st.subheader("📥 导出报告")
-        
-        col1, col2 = st.columns(2)
-        
-        with col1:
-            # 导出Excel
-            excel_buffer = BytesIO()
-            df_pages.to_excel(excel_buffer, index=False, sheet_name='验证结果')
-            
-            st.download_button(
-                label="📊 下载Excel报告",
-                data=excel_buffer.getvalue(),
-                file_name=f"batch_validation_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx",
-                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
-            )
-        
-        with col2:
-            # 导出JSON
-            json_data = json.dumps(batch_results, ensure_ascii=False, indent=2)
-            
-            st.download_button(
-                label="📄 下载JSON报告",
-                data=json_data,
-                file_name=f"batch_validation_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.json",
-                mime="application/json"
-            )
-
-    @st.dialog("查看交叉验证结果", width="large", dismissible=True, on_dismiss="rerun")
-    def show_batch_cross_validation_results_dialog(self):
-        if 'cross_validation_batch_result' in st.session_state and st.session_state.cross_validation_batch_result:
-            self._display_batch_validation_results(st.session_state.cross_validation_batch_result)
-            
-        else:
-            st.info("暂无交叉验证结果,请先运行交叉验证")
-
-    def display_comparison_results(self, comparison_result: dict, detailed: bool = True):
-        """显示对比结果摘要 - 使用DataFrame展示"""
-        
-        st.header("📊 VLM预校验结果")
-        
-        # 统计信息
-        stats = comparison_result['statistics']
-        
-        # 统计信息概览
-        col1, col2, col3, col4 = st.columns(4)
-        with col1:
-            st.metric("总差异数", stats['total_differences'])
-        with col2:
-            st.metric("表格差异", stats['table_differences'])
-        with col3:
-            st.metric("其中表格金额差异", stats['amount_differences'])
-        with col4:
-            st.metric("段落差异", stats['paragraph_differences'])
-        
-        # 结果判断
-        if stats['total_differences'] == 0:
-            st.success("🎉 完美匹配!VLM识别结果与原OCR结果完全一致")
-        else:
-            st.warning(f"⚠️ 发现 {stats['total_differences']} 个差异,建议人工检查")
-            
-            # 使用DataFrame显示差异详情
-            if comparison_result['differences']:
-                st.subheader("🔍 差异详情对比")
-                
-                # 准备DataFrame数据
-                diff_data = []
-                for i, diff in enumerate(comparison_result['differences'], 1):
-                    diff_data.append({
-                        '序号': i,
-                        '位置': diff['position'],
-                        '类型': diff['type'],
-                        '原OCR结果': diff['file1_value'][:100] + ('...' if len(diff['file1_value']) > 100 else ''),
-                        'VLM识别结果': diff['file2_value'][:100] + ('...' if len(diff['file2_value']) > 100 else ''),
-                        '描述': diff['description'][:80] + ('...' if len(diff['description']) > 80 else ''),
-                        '严重程度': self._get_severity_level(diff)
-                    })
-                
-                # 创建DataFrame
-                df_differences = pd.DataFrame(diff_data)
-                
-                # 添加样式
-                def highlight_severity(val):
-                    """根据严重程度添加颜色"""
-                    if val == '高':
-                        return 'background-color: #ffebee; color: #c62828'
-                    elif val == '中':
-                        return 'background-color: #fff3e0; color: #ef6c00'
-                    elif val == '低':
-                        return 'background-color: #e8f5e8; color: #2e7d32'
-                    return ''
-                
-                # 显示DataFrame
-                styled_df = df_differences.style.applymap(
-                    highlight_severity, 
-                    subset=['严重程度']
-                ).format({
-                    '序号': '{:d}',
-                })
-                
-                st.dataframe(
-                    styled_df, 
-                    width='stretch',
-                    height=400,
-                    hide_index=True,
-                    column_config={
-                        "序号": st.column_config.NumberColumn(
-                            "序号", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异项序号"
-                        ),
-                        "位置": st.column_config.TextColumn(
-                            "位置", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异在文档中的位置"
-                        ),
-                        "类型": st.column_config.TextColumn(
-                            "类型", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异类型"
-                        ),
-                        "原OCR结果": st.column_config.TextColumn(
-                            "原OCR结果", 
-                            width="large",  # 自动调整宽度
-                            pinned=True,
-                            help="原始OCR识别结果"
-                        ),
-                        "VLM识别结果": st.column_config.TextColumn(
-                            "VLM识别结果", 
-                            width="large",  # 自动调整宽度
-                            help="VLM重新识别的结果"
-                        ),
-                        "描述": st.column_config.TextColumn(
-                            "描述", 
-                            width="medium",  # 自动调整宽度
-                            help="差异详细描述"
-                        ),
-                        "严重程度": st.column_config.TextColumn(
-                            "严重程度", 
-                            width=None,  # 自动调整宽度
-                            help="差异严重程度评级"
-                        )
-                    }
-                )
-                
-                # 详细差异查看
-                st.subheader("🔍 详细差异查看")
-                
-                if detailed:
-                    # 选择要查看的差异
-                    selected_diff_index = st.selectbox(
-                        "选择要查看的差异:",
-                        options=range(len(comparison_result['differences'])),
-                        format_func=lambda x: f"差异 {x+1}: {comparison_result['differences'][x]['position']} - {comparison_result['differences'][x]['type']}",
-                        key="selected_diff"
-                    )
-                    
-                    if selected_diff_index is not None:
-                        diff = comparison_result['differences'][selected_diff_index]
-                        
-                        # 并排显示完整内容
-                        col1, col2 = st.columns(2)
-                        
-                        with col1:
-                            st.write("**原OCR结果:**")
-                            st.text_area(
-                                "原OCR结果详情",
-                                value=diff['file1_value'],
-                                height=200,
-                                key=f"original_{selected_diff_index}",
-                                label_visibility="collapsed"
-                            )
-                        
-                        with col2:
-                            st.write("**验证数据源识别结果:**")
-                            st.text_area(
-                                "验证数据源识别结果详情",
-                                value=diff['file2_value'],
-                                height=200,
-                                key=f"vlm_{selected_diff_index}",
-                                label_visibility="collapsed"
-                            )
-                        
-                        # 差异详细信息
-                        st.info(f"**位置:** {diff['position']}")
-                        st.info(f"**类型:** {diff['type']}")
-                        st.info(f"**描述:** {diff['description']}")
-                        st.info(f"**严重程度:** {self._get_severity_level(diff)}")
-                
-                # 差异统计图表
-                st.subheader("📈 差异类型分布")
-                
-                # 按类型统计差异
-                type_counts = {}
-                severity_counts = {'高': 0, '中': 0, '低': 0}
-                
-                for diff in comparison_result['differences']:
-                    diff_type = diff['type']
-                    type_counts[diff_type] = type_counts.get(diff_type, 0) + 1
-                    
-                    severity = self._get_severity_level(diff)
-                    severity_counts[severity] += 1
-                
-                col1, col2 = st.columns(2)
-                
-                with col1:
-                    # 类型分布饼图
-                    if type_counts:
-                        fig_type = px.pie(
-                            values=list(type_counts.values()),
-                            names=list(type_counts.keys()),
-                            title="差异类型分布"
-                        )
-                        st.plotly_chart(fig_type, width='stretch')
-                
-                with col2:
-                    # Bar chart of severity levels
-                    fig_severity = px.bar(
-                        x=list(severity_counts.keys()),
-                        y=list(severity_counts.values()),
-                        title="差异严重程度分布",
-                        color=list(severity_counts.keys()),
-                        color_discrete_map={'高': '#f44336', '中': '#ff9800', '低': '#4caf50'}
-                    )
-                    st.plotly_chart(fig_severity, width='stretch')
-        
-        # Download options
-        if detailed:
-            self._provide_download_options_in_results(comparison_result)
-
-    def _get_severity_level(self, diff: dict) -> str:
-        """根据差异类型和内容判断严重程度"""
-        # 如果差异中已经包含严重程度,直接使用
-        if 'severity' in diff:
-            severity_map = {'high': '高', 'medium': '中', 'low': '低'}
-            return severity_map.get(diff['severity'], '中')
-        
-        # Fall back to the heuristic rules below
-        diff_type = diff['type'].lower()
-        
-        # Amount/number differences are high severity
-        if 'amount' in diff_type or 'number' in diff_type:
-            return '高'
-        
-        # Table structure differences are medium severity
-        if 'table' in diff_type or 'structure' in diff_type:
-            return '中'
-        
-        # Otherwise rate by text similarity when available
-        if 'similarity' in diff:
-            similarity = diff['similarity']
-            if similarity < 50:
-                return '高'
-            elif similarity < 85:
-                return '中'
-            else:
-                return '低'
-        
-        # Finally, rate by how much the content length differs
-        len_diff = abs(len(diff['file1_value']) - len(diff['file2_value']))
-        if len_diff > 50:
-            return '高'
-        elif len_diff > 10:
-            return '中'
-        else:
-            return '低'
-
-    def _provide_download_options_in_results(self, comparison_result: dict):
-        """在结果页面提供下载选项"""
-        
-        st.subheader("📥 导出预校验结果")
-        
-        col1, col2, col3 = st.columns(3)
-        
-        with col1:
-            # Export the diff details as Excel
-            if comparison_result['differences']:
-                diff_data = []
-                for i, diff in enumerate(comparison_result['differences'], 1):
-                    diff_data.append({
-                        '序号': i,
-                        '位置': diff['position'],
-                        '类型': diff['type'],
-                        '原OCR结果': diff['file1_value'],
-                        'VLM识别结果': diff['file2_value'],
-                        '描述': diff['description'],
-                        '严重程度': self._get_severity_level(diff)
-                    })
-                
-                df_export = pd.DataFrame(diff_data)
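-                # Write the sheet into an in-memory buffer; pandas needs an Excel
-                # engine such as openpyxl installed for .xlsx output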
-                excel_buffer = BytesIO()
-                df_export.to_excel(excel_buffer, index=False, sheet_name='差异详情')
-                
-                st.download_button(
-                    label="📊 下载差异详情(Excel)",
-                    data=excel_buffer.getvalue(),
-                    file_name=f"vlm_comparison_differences_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx",
-                    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-                    key="download_differences_excel"
-                )
-        
-        with col2:
-            # Export the statistics report as CSV
-            stats_data = {
-                '统计项目': ['总差异数', '表格差异', '其中表格金额差异', '段落差异'],
-                '数量': [
-                    comparison_result['statistics']['total_differences'],
-                    comparison_result['statistics']['table_differences'],
-                    comparison_result['statistics']['amount_differences'],
-                    comparison_result['statistics']['paragraph_differences']
-                ]
-            }
-            
-            df_stats = pd.DataFrame(stats_data)
-            csv_stats = df_stats.to_csv(index=False)
-            
-            st.download_button(
-                label="📈 下载统计报告(CSV)",
-                data=csv_stats,
-                file_name=f"vlm_comparison_stats_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.csv",
-                mime="text/csv",
-                key="download_stats_csv"
-            )
-        
-        with col3:
-            # Export the full report as JSON (json is already imported at module level)
-            
-            report_json = json.dumps(comparison_result, ensure_ascii=False, indent=2)
-            
-            st.download_button(
-                label="📄 下载完整报告(JSON)",
-                data=report_json,
-                file_name=f"vlm_comparison_full_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.json",
-                mime="application/json",
-                key="download_full_json"
-            )
-        
-        # Follow-up suggestions
-        st.subheader("🚀 后续操作建议")
-        
-        total_diffs = comparison_result['statistics']['total_differences']
-        if total_diffs == 0:
-            st.success("✅ VLM识别结果与原OCR完全一致,可信度很高,无需人工校验")
-        elif total_diffs <= 5:
-            st.warning("⚠️ 发现少量差异,建议重点检查高严重程度的差异项")
-        elif total_diffs <= 20:
-            st.warning("🔍 发现中等数量差异,建议详细检查差异表格中标红的项目")
-        else:
-            st.error("❌ 发现大量差异,建议重新进行OCR识别或检查原始图片质量")
-    
-    def create_compact_layout(self, config):
-        """创建滚动凑布局"""
-        return self.layout_manager.create_compact_layout(config)
-
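-# Modal helper: st.dialog turns message_box into a dialog that opens whenever
-# the function is called and triggers a rerun when dismissed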
-@st.dialog("message", width="small", dismissible=True, on_dismiss="rerun")
-def message_box(msg: str, msg_type: str = "info"):
-    if msg_type == "info":
-        st.info(msg)
-    elif msg_type == "warning":
-        st.warning(msg)
-    elif msg_type == "error":
-        st.error(msg)
-
-def main():
-    """主应用"""
-    # 初始化应用
-    if 'validator' not in st.session_state:
-        validator = StreamlitOCRValidator()
-        st.session_state.validator = validator
-        st.session_state.validator.setup_page_config()
-        
-        # 页面标题
-        config = st.session_state.validator.config
-        st.title(config['ui']['page_title'])
-    else:
-        validator = st.session_state.validator
-        config = st.session_state.validator.config
-    
-    if 'selected_text' not in st.session_state:
-        st.session_state.selected_text = None
-    
-    if 'marked_errors' not in st.session_state:
-        st.session_state.marked_errors = set()
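-    # selected_text and marked_errors live in session_state so they survive
-    # the script re-runs Streamlit performs on every interaction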
-    
-    # Data source selector
-    validator.create_data_source_selector()
-    
-    # Stop this run early when no data source is available
-    if not validator.all_sources:
-        st.stop()
-    
-    # File selection area
-    with st.container(height=75, horizontal=True, horizontal_alignment='left', gap="medium"):
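-        # A horizontal container lays its children out in a row; the horizontal /
-        # horizontal_alignment keywords assume a recent Streamlit release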
-        # Initialize the selection index in session_state
-        if 'selected_file_index' not in st.session_state:
-            st.session_state.selected_file_index = 0
-            
-        if validator.display_options:
-            # File selection dropdown
-            selected_index = st.selectbox(
-                "选择OCR结果文件", 
-                range(len(validator.display_options)),
-                format_func=lambda i: validator.display_options[i],
-                index=st.session_state.selected_file_index,
-                key="selected_selectbox",
-                label_visibility="collapsed"
-            )
-            
-            # Keep session_state in sync with the widget value
-            if selected_index != st.session_state.selected_file_index:
-                st.session_state.selected_file_index = selected_index
-
-            selected_file = validator.file_paths[selected_index]
-
-            # Page number input
-            current_page = validator.file_info[selected_index]['page']
-            page_input = st.number_input(
-                "输入页码", 
-                placeholder="输入页码", 
-                label_visibility="collapsed",
-                min_value=1, 
-                max_value=len(validator.display_options), 
-                value=current_page, 
-                step=1,
-                key="page_input"
-            )
-            
-            # When the page number changes, switch to the matching file
-            if page_input != current_page:
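-                # Find the file whose page matches, then rerun so the
-                # selectbox above picks up the new index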
-                for i, info in enumerate(validator.file_info):
-                    if info['page'] == page_input:
-                        st.session_state.selected_file_index = i
-                        selected_file = validator.file_paths[i]
-                        st.rerun()
-                        break
-
-            # Auto-load the file only when the selection actually changed
-            if (st.session_state.selected_file_index >= 0
-                and validator.selected_file_index != st.session_state.selected_file_index
-                and selected_file):
-                validator.selected_file_index = st.session_state.selected_file_index
-                st.session_state.validator.load_ocr_data(selected_file)
-                
-                # Report which source and page were loaded
-                current_source_name = get_data_source_display_name(validator.current_source_config)
-                st.success(f"✅ 已加载 {current_source_name} - 第{validator.file_info[st.session_state.selected_file_index]['page']}页")
-                st.rerun()
-        else:
-            st.warning("当前数据源中未找到OCR结果文件")
-
-        # Cross-validation button (needs a loaded image and markdown content)
-        if st.button("交叉验证", type="primary", icon=":material/compare_arrows:"):
-            if validator.image_path and validator.md_content:
-                validator.cross_validation()
-            else:
-                message_box("❌ 请先选择OCR数据文件", "error")
-
-        # Button to view the stored batch pre-validation results
-        if st.button("查看验证结果", type="secondary", icon=":material/quick_reference_all:"):
-            validator.show_batch_cross_validation_results_dialog()
-
-    # Statistics for the current data source
-    with st.expander("🔧 OCR工具统计信息", expanded=False):
-        stats = validator.get_statistics()
-        col1, col2, col3, col4, col5 = st.columns(5)
-        
-        with col1:
-            st.metric("📊 总文本块", stats['total_texts'])
-        with col2:
-            st.metric("🔗 可点击文本", stats['clickable_texts'])
-        with col3:
-            st.metric("❌ 标记错误", stats['marked_errors'])
-        with col4:
-            st.metric("✅ 准确率", f"{stats['accuracy_rate']:.1f}%")
-        with col5:
-            # Current data source info
-            if validator.current_source_config:
-                tool_display = validator.current_source_config['ocr_tool'].upper()
-                st.metric("🔧 OCR工具", tool_display)
-        
-        # Detailed tool info
-        if stats['tool_info']:
-            st.write("**详细信息:**", stats['tool_info'])
-    
-    # Main tab layout
-    tab1, tab2, tab3 = st.tabs(["📄 内容人工检查", "🔍 交叉验证结果", "📊 表格分析"])
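-    # tab1: manual content review · tab2: cross-validation diff view · tab3: table analysis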
-    
-    with tab1:
-        validator.create_compact_layout(config)
-
-    with tab2:
-        current_md_path = Path(validator.file_paths[validator.selected_file_index]).with_suffix('.md')
-        pre_validation_dir = Path(validator.config['pre_validation'].get('out_dir', './output/pre_validation/')).resolve()
-        comparison_result_path = pre_validation_dir / f"{current_md_path.stem}_cross_validation.json"
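-        # The cross-validation step stores its output as
-        # <page-stem>_cross_validation.json in the pre-validation directory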
-        verify_md_path = validator.find_verify_md_path(validator.selected_file_index)
-        
-        if comparison_result_path.exists():
-            # Load and display the stored comparison result
-            with open(comparison_result_path, "r", encoding="utf-8") as f:
-                comparison_result = json.load(f)
-
-            # OCR result on the left, verification (VLM) result on the right;
-            # the render settings are shared, so read them once
-            font_size = config['styles'].get('font_size', 10)
-            height = config['styles']['layout'].get('default_height', 800)
-            layout_type = "compact"
-            col1, col2 = st.columns([1, 1])
-            with col1:
-                st.subheader("🤖 原OCR识别结果")
-                with open(current_md_path, "r", encoding="utf-8") as f:
-                    original_md_content = f.read()
-                validator.layout_manager.render_content_by_mode(original_md_content, "HTML渲染", font_size, height, layout_type)
-            with col2:
-                st.subheader("🤖 验证识别结果")
-                with open(str(verify_md_path), "r", encoding="utf-8") as f:
-                    verify_md_content = f.read()
-                validator.layout_manager.render_content_by_mode(verify_md_content, "HTML渲染", font_size, height, layout_type)
-
-            # Show the diff statistics
-            st.markdown("---")
-            validator.display_comparison_results(comparison_result, detailed=True)
-        else:
-            st.info("暂无预校验结果,请先运行VLM预校验")
-
-    with tab3:
-        # Table analysis page
-        st.header("📊 表格数据分析")
-        
-        if validator.md_content and '<table' in validator.md_content.lower():
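-            # A cheap substring check is enough to detect tables; the
-            # DataFrame view below does the real HTML parsing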
-            st.subheader("🔍 表格数据预览")
-            validator.display_html_table_as_dataframe(validator.md_content)
-            
-        else:
-            st.info("当前OCR结果中没有检测到表格数据")
-    
-if __name__ == "__main__":
-    main()

+ 9 - 3
streamlit_validator_core.py

@@ -7,7 +7,7 @@ from typing import Dict, List, Optional
 import json
 
 from ocr_validator_utils import (
-    load_config, load_ocr_data_file, process_ocr_data,
+    load_ocr_data_file, process_ocr_data,
     get_ocr_statistics, find_available_ocr_files_multi_source, 
     get_data_source_display_name
 )
@@ -17,8 +17,14 @@ from ocr_validator_layout import OCRLayoutManager
 class StreamlitOCRValidator:
     """核心验证器类"""
     
-    def __init__(self):
-        self.config = load_config()
+    def __init__(self, config_dict: Optional[Dict] = None):
+        """
+        Initialize the validator.
+        
+        Args:
+            config_dict: configuration dict, typically produced by ConfigManager.to_validator_config()
+        """
+        self.config = config_dict or {}  # injected config replaces the old load_config() call
         self.ocr_data = []
         self.md_content = ""
         self.image_path = ""