9 commits 8b7897cee9 ... 19be083b28

Author SHA1 Message Date
  zhch158_admin 19be083b28 feat: rework the initializer to accept a config dict, removing the dependency on load_config 1 week ago
  zhch158_admin 2643734c43 feat: initialize the config manager and polish display info for documents and OCR tools 1 week ago
  zhch158_admin afc9e3d481 feat: add a config manager with layered configuration, data-source auto-discovery, and Jinja2 template variables 1 week ago
  zhch158_admin 206d52f443 Implement code changes to enhance functionality and improve performance 1 week ago
  zhch158_admin 776f9654da feat: remove the MinerU/PaddleOCR result-merging script; streamline code structure 1 week ago
  zhch158_admin 21757ecf65 feat: add per-document OCR tool config files to manage OCR results for different documents 1 week ago
  zhch158_admin a9a8e8cf3b feat: add log subdirectories and global logging config; improve processor log management 1 week ago
  zhch158_admin 586f15b189 feat: add log-redirection support; improve log management in the PDF batch processor 1 week ago
  zhch158_admin 9c8d546753 feat: add batch merging of OCR results with log redirection and automatic processor detection 1 week ago

+ 0 - 587
STREAMLIT_GUIDE.md

@@ -1,587 +0,0 @@
-# 🚀 Streamlit OCR可视化校验工具使用指南
-
-## 🎯 工具介绍
-
-基于Streamlit开发的OCR可视化校验工具,提供现代化的Web界面和丰富的交互体验,让OCR结果校验变得直观高效。特别新增了强大的表格数据分析功能。
-
-### 🔧 核心功能
-
-- ✅ **实时交互**: 点击文本即时高亮图片位置
-- ✅ **动态过滤**: 搜索、类别筛选、条件过滤
-- ✅ **数据表格**: 可排序的详细数据视图
-- ✅ **统计信息**: 实时统计和进度跟踪
-- ✅ **错误标记**: 一键标记和管理识别错误
-- ✅ **报告导出**: 生成详细的校验报告
-- ⭐ **表格分析**: HTML表格转DataFrame,支持过滤、排序、导出
-- ⭐ **多种渲染**: HTML/Markdown/DataFrame/原始文本四种显示模式
-
-## 🚀 快速启动
-
-### 1. 安装依赖
-
-```bash
-# 安装Streamlit和相关依赖
-pip install streamlit plotly pandas pillow numpy opencv-python openpyxl
-```
-
-### 2. 启动应用
-
-```bash
-# 方法1: 完整功能版本
-python -m streamlit run streamlit_ocr_validator.py
-
-# 方法2: 使用启动脚本
-python run_streamlit_validator.py
-
-# 方法3: 开发模式(自动重载)
-streamlit run streamlit_ocr_validator.py --server.runOnSave true
-```
-
-### 3. 访问界面
-
-浏览器会自动打开 http://localhost:8501,如果没有自动打开,请手动访问该地址。
-
-## 🖥️ 界面使用指南
-
-### 主界面布局
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│ 🔍 OCR可视化校验工具                                                 │
-├─────────────────────────────────────────────────────────────────────┤
-│ 📊 总文本块: 13  🔗 可点击: 9  ❌ 标记错误: 2  ✅ 准确率: 85.7%      │
-├─────────────────────────┬───────────────────────────────────────────┤
-│ 📄 OCR识别内容            │ 🖼️ 原图标注                               │
-│                        │                                           │
-│ 🔍 搜索框               │ [显示选中位置的红框标注]                   │
-│ 📍 选择文本下拉框        │ [图片缩放和详细信息]                       │
-│                        │                                           │
-│ 📝 MD内容预览           │ 📍 选中文本详情                           │
-│ [4种渲染模式选择]        │ - 文本内容: xxx                           │
-│ ○ HTML渲染             │ - 位置: [x1,y1,x2,y2]                    │
-│ ● Markdown渲染         │ - 宽度: xxx px                            │
-│ ○ DataFrame表格 ⭐      │ - 高度: xxx px                            │
-│ ○ 原始文本             │                                           │
-│                        │                                           │
-│ 🎯 可点击文本列表        │                                           │
-│ [📍 文本1] [❌] [✅]     │                                           │
-│ [📍 文本2] [❌] [✅]     │                                           │
-└─────────────────────────┴───────────────────────────────────────────┘
-```
-
-### 侧边栏功能
-
-```
-┌─────────────────────┐
-│ 📁 文件选择          │
-│ [选择OCR结果文件]    │
-│ [🔄 加载文件]       │
-│                    │
-│ 🎛️ 控制面板         │
-│ [🧹 清除选择]       │
-│ [❌ 清除错误标记]   │
-│                    │
-│ 📊 表格快捷操作 ⭐   │
-│ [🔍 快速预览表格]   │
-│ [📥 一键导出所有表格] │
-│                    │
-│ 🔧 调试信息         │
-│ [调试信息开关]      │
-└─────────────────────┘
-```
-
-### 使用步骤
-
-1. **选择文件**
-   - 从侧边栏下拉框中选择OCR结果JSON文件
-   - 点击"🔄 加载文件"按钮加载数据
-   - 查看顶部统计信息确认加载成功
-
-2. **浏览统计信息**
-   - 查看总文本块数量
-   - 了解可点击文本数量
-   - 确认图片是否正确加载
-   - 查看当前准确率
-
-3. **交互校验**
-   - 使用下拉框选择要校验的文本
-   - 点击左侧的"📍 文本内容"按钮
-   - 观察右侧图片上的红色框标注
-   - 查看右下角显示的详细位置信息
-
-4. **搜索过滤**
-   - 使用搜索框快速定位特定文本
-   - 在MD内容预览中查看完整识别结果
-   - 通过搜索结果快速定位问题文本
-
-5. **表格数据分析** ⭐ 新增
-   - 选择"DataFrame表格"渲染模式
-   - 查看自动解析的HTML表格
-   - 使用过滤、排序功能分析数据
-   - 查看表格统计信息
-   - 导出CSV或Excel文件
-
-6. **错误标记管理**
-   - 点击文本旁边的"❌"按钮标记错误
-   - 点击"✅"按钮取消错误标记
-   - 观察准确率的实时变化
-   - 使用侧边栏批量清除标记
-
-## 🎨 高级功能
-
-### 完整版本独有功能 (`streamlit_ocr_validator.py`)
-
-#### 📊 错误标记系统
-- **标记错误**: 点击文本旁边的"❌"按钮标记识别错误
-- **取消标记**: 点击"✅"按钮取消错误标记
-- **统计准确率**: 自动计算识别准确率
-- **错误过滤**: 只显示标记为错误的文本
-- **批量操作**: 侧边栏提供批量清除功能
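-
-A minimal sketch of how error marks could be tracked in `st.session_state` (the `error_keys` set and the block ids are illustrative, not the tool's actual state layout):
-
-```python
-import streamlit as st
-
-# Keep marked errors as a set of text-block ids that survives reruns
-if 'error_keys' not in st.session_state:
-    st.session_state.error_keys = set()
-
-def toggle_error(block_id: str):
-    """Mark or unmark a text block as an OCR error."""
-    if block_id in st.session_state.error_keys:
-        st.session_state.error_keys.discard(block_id)
-    else:
-        st.session_state.error_keys.add(block_id)
-
-total_blocks = 13                          # e.g. total text blocks on the page
-errors = len(st.session_state.error_keys)
-st.metric("准确率", f"{(total_blocks - errors) / total_blocks * 100:.1f}%")
-```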
-
-#### 📈 表格数据分析 ⭐ 核心新功能
-- **智能表格检测**: 自动识别HTML表格内容
-- **DataFrame转换**: 将HTML表格转换为可操作的pandas DataFrame
-- **多维度操作**: 支持过滤、排序、搜索等操作
-- **统计分析**: 自动生成表格行列数、数值列统计等信息
-- **数据导出**: 支持CSV、Excel格式导出
-- **可视化图表**: 基于表格数据生成统计图表
-
-#### 🎛️ 多种渲染模式
-- **HTML渲染**: 原生HTML表格显示,保持格式
-- **Markdown渲染**: 转换为Markdown表格格式
-- **DataFrame表格**: 转换为可交互的数据表格 ⭐
-- **原始文本**: 纯文本格式显示
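-
-A minimal sketch of how the mode switch could be wired with `st.radio` (assuming `md_content` holds the recognized Markdown/HTML and `validator` is the loaded validator instance):
-
-```python
-import streamlit as st
-
-mode = st.radio("渲染模式", ["HTML渲染", "Markdown渲染", "DataFrame表格", "原始文本"])
-
-if mode == "HTML渲染":
-    st.markdown(md_content, unsafe_allow_html=True)        # raw HTML, keeps table formatting
-elif mode == "Markdown渲染":
-    st.markdown(md_content)                                # rendered as Markdown
-elif mode == "DataFrame表格":
-    validator.display_html_table_as_dataframe(md_content)  # interactive table (see below)
-else:
-    st.text(md_content)                                    # plain text
-```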
-
-#### 🔧 侧边栏控制
-- **文件管理**: 侧边栏选择和管理OCR文件
-- **控制面板**: 清除选择、清除错误标记等操作
-- **表格快捷操作**: 快速预览和导出表格功能 ⭐
-- **调试信息**: 详细的系统状态和数据信息
-
-#### 📊 过滤和筛选
-- **多条件过滤**: 按类别、错误状态、尺寸等多重筛选
-- **实时搜索**: 动态搜索文本内容
-- **数据表格**: 可排序、可筛选的完整数据视图
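-
-A minimal filtering sketch over the text-block DataFrame (the `df` variable and the column names `text`, `category`, `is_error` are assumptions about the data layout):
-
-```python
-import streamlit as st
-
-query = st.text_input("🔍 搜索文本")
-categories = st.multiselect("类别", options=sorted(df["category"].unique()))
-only_errors = st.checkbox("只看标记错误")
-
-filtered = df
-if query:
-    filtered = filtered[filtered["text"].str.contains(query, na=False)]
-if categories:
-    filtered = filtered[filtered["category"].isin(categories)]
-if only_errors:
-    filtered = filtered[filtered["is_error"]]
-
-st.dataframe(filtered)
-```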
-
-### 表格分析功能详解 ⭐
-
-#### 功能特性
-
-```python
-# 表格检测和转换
-if '<table' in display_content.lower():  # also matches <table class="..."> etc.
-    st.session_state.validator.display_html_table_as_dataframe(display_content)
-else:
-    st.info("当前内容中没有检测到HTML表格")
-```
-
-#### 支持的表格操作
-
-1. **基础操作**
-   - 自动检测HTML表格
-   - 转换为pandas DataFrame
-   - 表格信息统计(行数、列数、列名等)
-
-2. **数据过滤**
-   - 按列内容过滤
-   - 支持文本搜索
-   - 条件筛选
-
-3. **数据排序**
-   - 按任意列排序
-   - 升序/降序选择
-   - 多列排序
-
-4. **统计分析**
-   - 数值列描述性统计
-   - 数据类型分析
-   - 缺失值统计
-
-5. **数据导出**
-   - CSV格式导出
-   - Excel格式导出
-   - 支持过滤后数据导出
-
-#### 使用示例
-
-```python
-# 在streamlit界面中
-def display_html_table_as_dataframe(self, html_content: str, enable_editing: bool = False):
-    """将HTML表格解析为DataFrame显示"""
-    import pandas as pd
-    from io import StringIO, BytesIO
-    
-    try:
-        # 使用pandas直接读取HTML表格
-        tables = pd.read_html(StringIO(html_content))
-        if tables:
-            for i, table in enumerate(tables):
-                st.subheader(f"📊 表格 {i+1}")
-                
-                # 创建表格操作按钮
-                col1, col2, col3, col4 = st.columns(4)
-                with col1:
-                    show_info = st.checkbox(f"显示表格信息", key=f"info_{i}")
-                with col2:
-                    show_stats = st.checkbox(f"显示统计信息", key=f"stats_{i}")
-                with col3:
-                    enable_filter = st.checkbox(f"启用过滤", key=f"filter_{i}")
-                with col4:
-                    enable_sort = st.checkbox(f"启用排序", key=f"sort_{i}")
-                
-                # 显示表格
-                st.dataframe(table, width='stretch')
-```
-
-## 🔧 自定义开发
-
-### 扩展功能开发
-
-#### 1. 添加新的表格处理功能 ⭐
-
-```python
-import numpy as np
-import plotly.express as px
-import streamlit as st
-
-def create_table_visualization(df):
-    """创建表格数据可视化"""
-    if not df.empty:
-        numeric_cols = df.select_dtypes(include=[np.number]).columns
-        
-        if len(numeric_cols) > 0:
-            # 创建统计图表
-            fig = px.bar(
-                x=df.index,
-                y=df[numeric_cols[0]],
-                title=f"{numeric_cols[0]} 分布"
-            )
-            st.plotly_chart(fig, width='stretch')
-            
-            # 创建散点图
-            if len(numeric_cols) > 1:
-                fig_scatter = px.scatter(
-                    df, 
-                    x=numeric_cols[0], 
-                    y=numeric_cols[1],
-                    title=f"{numeric_cols[0]} vs {numeric_cols[1]}"
-                )
-                st.plotly_chart(fig_scatter, width='stretch')
-
-# 在主应用中使用
-if st.checkbox("显示数据可视化"):
-    create_table_visualization(filtered_table)
-```
-
-#### 2. 高级表格编辑功能
-
-```python
-def advanced_table_editor(df):
-    """高级表格编辑器"""
-    st.subheader("🔧 高级编辑")
-    
-    # 数据编辑
-    edited_df = st.data_editor(
-        df,
-        width='stretch',
-        num_rows="dynamic",  # 允许添加删除行
-        key="advanced_editor"
-    )
-    
-    # 数据验证
-    if not edited_df.equals(df):
-        st.success("✏️ 数据已修改")
-        
-        # 显示变更统计
-        changes = len(edited_df) - len(df)
-        st.info(f"行数变化: {changes:+d}")
-        
-        # 导出修改后的数据
-        if st.button("💾 保存修改"):
-            csv_data = edited_df.to_csv(index=False)
-            st.download_button(
-                "下载修改后的数据",
-                csv_data,
-                "modified_table.csv",
-                "text/csv"
-            )
-    
-    return edited_df
-```
-
-#### 3. 批量表格处理
-
-```python
-from io import StringIO
-
-import pandas as pd
-import streamlit as st
-
-def batch_table_processing():
-    """批量表格处理功能"""
-    st.subheader("📦 批量表格处理")
-    
-    uploaded_files = st.file_uploader(
-        "上传多个包含表格的文件", 
-        type=['md', 'html'], 
-        accept_multiple_files=True
-    )
-    
-    if uploaded_files and st.button("开始批量处理"):
-        all_tables = []
-        progress_bar = st.progress(0)
-        
-        for i, file in enumerate(uploaded_files):
-            content = file.read().decode('utf-8')
-            
-            if '<table' in content.lower():
-                tables = pd.read_html(StringIO(content))
-                for j, table in enumerate(tables):
-                    table['source_file'] = file.name
-                    table['table_index'] = j
-                    all_tables.append(table)
-            
-            progress_bar.progress((i + 1) / len(uploaded_files))
-        
-        if all_tables:
-            st.success(f"✅ 共处理 {len(all_tables)} 个表格")
-            
-            # 合并所有表格
-            if st.checkbox("合并所有表格"):
-                try:
-                    merged_df = pd.concat(all_tables, ignore_index=True)
-                    st.dataframe(merged_df)
-                    
-                    # 导出合并结果
-                    csv_data = merged_df.to_csv(index=False)
-                    st.download_button(
-                        "下载合并表格",
-                        csv_data,
-                        "merged_tables.csv",
-                        "text/csv"
-                    )
-                except Exception as e:
-                    st.error(f"合并失败: {e}")
-```
-
-#### 4. 表格数据质量检查 ⭐
-
-```python
-def table_quality_check(df):
-    """表格数据质量检查"""
-    st.subheader("🔍 数据质量检查")
-    
-    # 基础统计
-    col1, col2, col3 = st.columns(3)
-    with col1:
-        st.metric("总行数", len(df))
-    with col2:
-        st.metric("总列数", len(df.columns))
-    with col3:
-        null_percent = (df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100
-        st.metric("缺失值比例", f"{null_percent:.1f}%")
-    
-    # 详细质量报告
-    quality_issues = []
-    
-    # 检查空值
-    null_cols = df.columns[df.isnull().any()].tolist()
-    if null_cols:
-        quality_issues.append(f"发现 {len(null_cols)} 列存在空值: {', '.join(null_cols)}")
-    
-    # 检查重复行
-    duplicate_rows = df.duplicated().sum()
-    if duplicate_rows > 0:
-        quality_issues.append(f"发现 {duplicate_rows} 行重复数据")
-    
-    # 检查数据类型一致性
-    for col in df.columns:
-        if df[col].dtype == 'object':
-            # 检查是否应该是数值类型
-            numeric_like = df[col].astype(str).str.replace(',', '', regex=False).str.replace('$', '', regex=False)
-            try:
-                pd.to_numeric(numeric_like, errors='raise')
-                quality_issues.append(f"列 '{col}' 可能应该是数值类型")
-            except (ValueError, TypeError):
-                pass
-    
-    if quality_issues:
-        st.warning("⚠️ 发现数据质量问题:")
-        for issue in quality_issues:
-            st.write(f"- {issue}")
-    else:
-        st.success("✅ 数据质量良好")
-```
-
-## 📊 性能优化
-
-### 1. 缓存优化
-
-```python
-@st.cache_data
-def load_and_process_ocr_data(file_path: str):
-    """缓存OCR数据加载和处理"""
-    with open(file_path, 'r') as f:
-        ocr_data = json.load(f)
-    
-    # 处理数据
-    processed_data = process_ocr_data(ocr_data)
-    return processed_data
-
-@st.cache_resource
-def load_image(image_path: str):
-    """缓存图片加载"""
-    return Image.open(image_path)
-
-@st.cache_data
-def parse_html_tables(html_content: str):
-    """缓存表格解析结果"""
-    try:
-        tables = pd.read_html(StringIO(html_content))
-        return tables
-    except ValueError:
-        # pd.read_html raises ValueError when no tables are found
-        return []
-```
-
-### 2. 大文件处理
-
-```python
-def handle_large_tables(df):
-    """处理大型表格"""
-    if 'page_size' not in st.session_state:
-        st.session_state.page_size = 100
-    
-    # 分页显示表格
-    if not df.empty:
-        total_rows = len(df)
-        pages = (total_rows - 1) // st.session_state.page_size + 1
-        
-        col1, col2, col3 = st.columns([1, 2, 1])
-        with col2:
-            current_page = st.slider("页数", 1, pages, 1) if pages > 1 else 1
-        
-        # 显示当前页数据
-        start_idx = (current_page - 1) * st.session_state.page_size
-        end_idx = min(start_idx + st.session_state.page_size, total_rows)
-        current_df = df.iloc[start_idx:end_idx]
-        
-        st.dataframe(current_df, width='stretch')
-        st.info(f"显示第 {start_idx+1}-{end_idx} 行,共 {total_rows} 行")
-```
-
-### 3. 内存优化
-
-```python
-def optimize_dataframe_memory(df):
-    """优化DataFrame内存使用"""
-    initial_memory = df.memory_usage(deep=True).sum()
-    
-    # 优化数值类型
-    for col in df.select_dtypes(include=['int']).columns:
-        df[col] = pd.to_numeric(df[col], downcast='integer')
-    
-    for col in df.select_dtypes(include=['float']).columns:
-        df[col] = pd.to_numeric(df[col], downcast='float')
-    
-    # 优化字符串类型
-    for col in df.select_dtypes(include=['object']).columns:
-        if df[col].nunique() < len(df) * 0.5:  # 如果唯一值少于50%,转换为category
-            df[col] = df[col].astype('category')
-    
-    final_memory = df.memory_usage(deep=True).sum()
-    reduction = (initial_memory - final_memory) / initial_memory * 100
-    
-    st.info(f"内存优化:减少 {reduction:.1f}% ({initial_memory/1024/1024:.1f}MB → {final_memory/1024/1024:.1f}MB)")
-    
-    return df
-```
-
-## 🚀 部署指南
-
-### 本地开发部署
-
-```bash
-# 开发模式运行(自动重载)
-streamlit run streamlit_ocr_validator.py --server.runOnSave true
-
-# 指定端口运行
-streamlit run streamlit_ocr_validator.py --server.port 8502
-
-# 指定主机运行(局域网访问)
-streamlit run streamlit_ocr_validator.py --server.address 0.0.0.0
-```
-
-### Docker部署
-
-```dockerfile
-FROM python:3.9-slim
-
-WORKDIR /app
-
-# 安装系统依赖
-RUN apt-get update && apt-get install -y \
-    libgl1-mesa-glx \
-    libglib2.0-0 \
-    libsm6 \
-    libxext6 \
-    libxrender-dev \
-    libgomp1 \
-    && rm -rf /var/lib/apt/lists/*
-
-# 安装Python依赖
-COPY requirements.txt .
-RUN pip install -r requirements.txt
-
-COPY . .
-
-EXPOSE 8501
-CMD ["streamlit", "run", "streamlit_ocr_validator.py", "--server.address=0.0.0.0"]
-```
-
-### Streamlit Cloud部署
-
-1. 将代码推送到GitHub仓库
-2. 访问 https://share.streamlit.io/
-3. 连接GitHub仓库并部署
-4. 设置环境变量(如API密钥)
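-
-On Streamlit Cloud, values configured under the app's secrets are read through `st.secrets` rather than plain environment variables; a minimal sketch (the key name `API_KEY` is illustrative):
-
-```python
-import streamlit as st
-
-api_key = st.secrets.get("API_KEY", "")  # falls back to "" when the secret is absent
-```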
-
-## 💡 最佳实践
-
-### 1. 用户体验优化
-
-- **加载状态**: 使用`st.spinner()`显示加载状态
-- **错误处理**: 使用`st.error()`友好地显示错误信息
-- **进度提示**: 使用`st.progress()`显示处理进度
-- **数据缓存**: 合理使用`@st.cache_data`提升性能
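-
-A short sketch combining these pieces (the file path and the "blocks" key are placeholders, not the tool's actual schema):
-
-```python
-import json
-import streamlit as st
-
-@st.cache_data
-def load_ocr(path: str) -> dict:
-    with open(path, "r", encoding="utf-8") as f:
-        return json.load(f)
-
-try:
-    with st.spinner("加载OCR结果..."):
-        data = load_ocr("result.json")      # placeholder path
-    blocks = data.get("blocks", [])         # "blocks" key is an assumption
-    progress = st.progress(0)
-    for i, block in enumerate(blocks):
-        ...                                 # process one text block
-        progress.progress((i + 1) / max(len(blocks), 1))
-except Exception as e:
-    st.error(f"加载失败: {e}")
-```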
-
-### 2. 界面设计
-
-- **布局清晰**: 使用`st.columns()`合理分布内容
-- **视觉层次**: 使用不同级别的标题和分隔符
-- **交互反馈**: 及时响应用户操作
-- **移动友好**: 考虑不同屏幕尺寸的适配
-
-### 3. 表格处理最佳实践 ⭐
-
-- **大表格处理**: 对超过1000行的表格启用分页显示
-- **内存管理**: 使用数据类型优化减少内存使用
-- **导出优化**: 大表格导出时显示进度条
-- **错误处理**: 优雅处理表格解析失败的情况
-
-### 4. 数据安全
-
-- **输入验证**: 验证上传文件的格式和内容
-- **错误处理**: 妥善处理异常情况
-- **资源清理**: 及时清理临时文件和内存
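-
-A minimal validation sketch for an uploaded OCR JSON file (the 50 MB limit is an illustrative choice):
-
-```python
-import json
-import streamlit as st
-
-uploaded = st.file_uploader("上传OCR结果", type=["json"])
-if uploaded is not None:
-    if uploaded.size > 50 * 1024 * 1024:   # reject files over 50 MB
-        st.error("文件过大")
-    else:
-        try:
-            data = json.loads(uploaded.read().decode("utf-8"))
-        except (UnicodeDecodeError, json.JSONDecodeError) as e:
-            st.error(f"文件格式无效: {e}")
-```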
-
-## 🎉 总结
-
-Streamlit版本的OCR校验工具经过升级后提供了更加强大的功能:
-
-✅ **基础功能**:实时交互、动态更新、错误管理  
-✅ **表格分析**:HTML表格转DataFrame、多种操作、导出功能 ⭐  
-✅ **数据处理**:过滤、排序、统计分析、可视化 ⭐  
-✅ **批量操作**:多文件处理、批量导出、合并功能 ⭐  
-✅ **质量检查**:数据质量分析、问题检测、优化建议 ⭐  
-✅ **扩展性**:易于添加新功能和自定义组件  
-✅ **用户体验**:现代化界面、响应式设计、直观操作  
-
-新增的表格分析功能使其不仅能够校验OCR结果,更能深入分析表格数据,成为一个完整的OCR数据处理工作台!
-
----
-
-> 🌟 **特别推荐**:使用DataFrame表格模式分析财务报表等结构化数据,体验完整的数据处理工作流程。

+ 730 - 0
batch_ocr/batch_merge_results.py

@@ -0,0 +1,730 @@
+#!/usr/bin/env python3
+"""
+批量合并 OCR 结果
+自动读取配置文件,对所有 VL 处理器的输出进行 bbox 合并
+支持执行器输出日志重定向
+"""
+
+import sys
+import time
+import yaml
+import argparse
+import subprocess
+from pathlib import Path
+from datetime import datetime
+from typing import Dict, List, Optional, Any
+from dataclasses import dataclass
+import logging
+from tqdm import tqdm
+
+# 添加 merger 模块路径
+sys.path.insert(0, str(Path(__file__).parent.parent / 'merger'))
+
+
+@dataclass
+class MergeTask:
+    """合并任务"""
+    processor_name: str
+    vl_result_dir: Path
+    paddle_result_dir: Path
+    output_dir: Path
+    merger_script: str
+    description: str
+    log_file: str = ""  # 🎯 新增:日志文件路径
+
+
+class BatchMerger:
+    """批量合并器"""
+    
+    # VL 处理器类型映射到合并脚本
+    MERGER_SCRIPTS = {
+        'paddleocr_vl': 'merge_paddleocr_vl_paddleocr.py',
+        'mineru': 'merge_mineru_paddle_ocr.py',
+        'dotsocr': 'merge_mineru_paddle_ocr.py',  # DotsOCR 也用 MinerU 格式
+    }
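+    # To support a new VL processor type, add a mapping here and drop the merger
+    # script into the merger/ directory, e.g. (hypothetical):
+    #   'my_vl': 'merge_my_vl_paddleocr.py'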
+    
+    def __init__(self, config_file: str, base_dir: Optional[str] = None):
+        """
+        Args:
+            config_file: processor_configs.yaml 路径
+            base_dir: PDF 基础目录,覆盖配置文件中的设置
+        """
+        self.config_file = Path(config_file)
+        self.config = self._load_config()
+        self.base_dir = Path(base_dir) if base_dir else Path(self.config['global']['base_dir'])
+        
+        # 🎯 日志基础目录
+        self.log_base_dir = self.base_dir / self.config['global'].get('log_dir', 'logs')
+        
+        # 设置日志
+        self.logger = self._setup_logger()
+        
+        # merger 脚本目录
+        self.merger_dir = Path(__file__).parent.parent / 'merger'
+        
+        # 🎯 统计信息
+        self.merge_results: List[Dict[str, Any]] = []
+    
+    def _load_config(self) -> Dict:
+        """加载配置文件"""
+        with open(self.config_file, 'r', encoding='utf-8') as f:
+            return yaml.safe_load(f)
+    
+    def _setup_logger(self) -> logging.Logger:
+        """设置日志"""
+        logger = logging.getLogger('BatchMerger')
+        logger.setLevel(logging.INFO)
+        
+        if not logger.handlers:
+            console_handler = logging.StreamHandler()
+            console_handler.setLevel(logging.INFO)
+            formatter = logging.Formatter(
+                '%(asctime)s - %(levelname)s - %(message)s',
+                datefmt='%Y-%m-%d %H:%M:%S'
+            )
+            console_handler.setFormatter(formatter)
+            logger.addHandler(console_handler)
+        
+        return logger
+    
+    def _detect_processor_type(self, processor_name: str) -> str:
+        """
+        检测处理器类型
+        
+        Returns:
+            'paddleocr_vl', 'mineru', 'dotsocr', 'ppstructure' 等
+        """
+        name_lower = processor_name.lower()
+        
+        if 'paddleocr_vl' in name_lower or 'paddleocr-vl' in name_lower:
+            return 'paddleocr_vl'
+        elif 'mineru' in name_lower:
+            return 'mineru'
+        elif 'dotsocr' in name_lower or 'dots' in name_lower:
+            return 'dotsocr'
+        elif 'ppstructure' in name_lower or 'pp-structure' in name_lower:
+            return 'ppstructure'
+        else:
+            return 'unknown'
+    
+    def _get_merger_script(self, processor_type: str) -> Optional[str]:
+        """获取合并脚本路径"""
+        script_name = self.MERGER_SCRIPTS.get(processor_type)
+        if not script_name:
+            return None
+        
+        script_path = self.merger_dir / script_name
+        return str(script_path) if script_path.exists() else None
+    
+    def _find_paddle_result_dir(self, pdf_dir: Path) -> Optional[Path]:
+        """
+        查找对应的 PaddleOCR 结果目录
+        
+        优先级:
+        1. ppstructurev3_client_results (HTTP API 客户端结果)
+        2. ppstructurev3_single_process_results (本地单进程结果)
+        """
+        candidates = [
+            pdf_dir / 'ppstructurev3_client_results',
+            pdf_dir / 'ppstructurev3_single_process_results',
+        ]
+        
+        for candidate in candidates:
+            if candidate.exists():
+                return candidate
+        
+        return None
+    
+    def _get_log_file_path(self, pdf_dir: Path, processor_name: str) -> Path:
+        """
+        🎯 获取合并任务的日志文件路径
+        
+        日志结构:
+        PDF目录/
+        └── logs/
+            └── merge_processor_name/
+                └── PDF名称_merge_YYYYMMDD_HHMMSS.log
+        """
+        # 日志目录
+        log_dir = pdf_dir / 'logs' / f'merge_{processor_name}'
+        log_dir.mkdir(parents=True, exist_ok=True)
+        
+        # 日志文件名
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        log_file = log_dir / f"{pdf_dir.name}_merge_{timestamp}.log"
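+        # e.g. <pdf_dir>/logs/merge_mineru_vllm/德_内蒙古银行照_merge_20250101_120000.log (illustrative)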
+        
+        return log_file
+    
+    def discover_merge_tasks(
+        self, 
+        pdf_list: Optional[List[str]] = None,
+        processors: Optional[List[str]] = None
+    ) -> List[MergeTask]:
+        """
+        自动发现需要合并的任务
+        
+        Args:
+            pdf_list: PDF 文件列表(不含扩展名),如 ['德_内蒙古银行照', ...]
+            processors: 处理器列表,如 ['paddleocr_vl_single_process', ...]
+        
+        Returns:
+            MergeTask 列表
+        """
+        tasks = []
+        
+        # 如果没有指定处理器,扫描所有 VL 类型的处理器
+        if not processors:
+            processors = []
+            for proc_name in self.config['processors']:
+                proc_type = self._detect_processor_type(proc_name)
+                if proc_type in ['paddleocr_vl', 'mineru', 'dotsocr']:
+                    processors.append(proc_name)
+        
+        # 如果没有指定 PDF 列表,扫描基础目录
+        if not pdf_list:
+            pdf_list = [d.name for d in self.base_dir.iterdir() if d.is_dir()]
+        
+        self.logger.info(f"📂 基础目录: {self.base_dir}")
+        self.logger.info(f"🔍 发现 {len(pdf_list)} 个 PDF 目录")
+        self.logger.info(f"⚙️  发现 {len(processors)} 个 VL 处理器")
+        
+        # 遍历每个 PDF 目录和处理器组合
+        for pdf_name in pdf_list:
+            pdf_dir = self.base_dir / pdf_name
+            
+            if not pdf_dir.exists():
+                self.logger.warning(f"⚠️  目录不存在: {pdf_dir}")
+                continue
+            
+            # 查找 PaddleOCR 结果目录
+            paddle_result_dir = self._find_paddle_result_dir(pdf_dir)
+            
+            if not paddle_result_dir:
+                self.logger.warning(f"⚠️  未找到 PaddleOCR 结果: {pdf_name}")
+                continue
+            
+            # 遍历每个 VL 处理器
+            for proc_name in processors:
+                if proc_name not in self.config['processors']:
+                    self.logger.warning(f"⚠️  处理器不存在: {proc_name}")
+                    continue
+                
+                proc_config = self.config['processors'][proc_name]
+                proc_type = self._detect_processor_type(proc_name)
+                
+                # 获取合并脚本
+                merger_script = self._get_merger_script(proc_type)
+                if not merger_script:
+                    self.logger.warning(f"⚠️  不支持的处理器类型: {proc_name} ({proc_type})")
+                    continue
+                
+                # VL 结果目录
+                vl_output_subdir = proc_config.get('output_subdir', f'{proc_name}_results')
+                vl_result_dir = pdf_dir / vl_output_subdir
+                
+                if not vl_result_dir.exists():
+                    self.logger.debug(f"⏭️  VL 结果不存在: {vl_result_dir}")
+                    continue
+                
+                # 输出目录
+                output_dir = pdf_dir / f"{vl_output_subdir}_cell_bbox"
+                
+                # 🎯 日志文件路径
+                log_file = self._get_log_file_path(pdf_dir, proc_name)
+                
+                # 创建任务
+                task = MergeTask(
+                    processor_name=proc_name,
+                    vl_result_dir=vl_result_dir,
+                    paddle_result_dir=paddle_result_dir,
+                    output_dir=output_dir,
+                    merger_script=merger_script,
+                    description=proc_config.get('description', proc_name),
+                    log_file=str(log_file)  # 🎯 新增
+                )
+                
+                tasks.append(task)
+        
+        return tasks
+    
+    def execute_merge_task(
+        self, 
+        task: MergeTask,
+        window: int = 15,
+        threshold: int = 85,
+        output_type: str = 'both',
+        dry_run: bool = False
+    ) -> Dict[str, Any]:
+        """
+        🎯 执行单个合并任务(支持日志重定向)
+        
+        Args:
+            task: 合并任务
+            window: 查找窗口
+            threshold: 相似度阈值
+            output_type: 输出格式
+            dry_run: 模拟运行
+        
+        Returns:
+            执行结果字典
+        """
+        self.logger.info(f"\n{'='*80}")
+        self.logger.info(f"📄 处理: {task.vl_result_dir.parent.name}")
+        self.logger.info(f"🔧 处理器: {task.description}")
+        self.logger.info(f"📂 VL 结果: {task.vl_result_dir}")
+        self.logger.info(f"📂 PaddleOCR 结果: {task.paddle_result_dir}")
+        self.logger.info(f"📂 输出目录: {task.output_dir}")
+        self.logger.info(f"📄 日志文件: {task.log_file}")
+        self.logger.info(f"{'='*80}")
+        
+        # 构建命令
+        cmd = [
+            sys.executable,  # 当前 Python 解释器
+            task.merger_script,
+            f"--{self._get_vl_arg_name(task.merger_script)}-dir", str(task.vl_result_dir),
+            '--paddle-dir', str(task.paddle_result_dir),
+            '--output-dir', str(task.output_dir),
+            '--output-type', output_type,
+            '--window', str(window),
+            '--threshold', str(threshold)
+        ]
+        
+        if dry_run:
+            self.logger.info(f"[DRY RUN] 命令: {' '.join(cmd)}")
+            return {
+                'task': task,
+                'success': True,
+                'duration': 0,
+                'error': '',
+                'dry_run': True
+            }
+        
+        # 🎯 执行命令并重定向输出到日志文件
+        start_time = time.time()
+        
+        try:
+            with open(task.log_file, 'w', encoding='utf-8') as log_f:
+                # 写入日志头
+                log_f.write(f"{'='*80}\n")
+                log_f.write(f"合并任务日志\n")
+                log_f.write(f"{'='*80}\n\n")
+                log_f.write(f"PDF 目录: {task.vl_result_dir.parent}\n")
+                log_f.write(f"处理器: {task.description}\n")
+                log_f.write(f"处理器名称: {task.processor_name}\n")
+                log_f.write(f"VL 结果目录: {task.vl_result_dir}\n")
+                log_f.write(f"PaddleOCR 结果目录: {task.paddle_result_dir}\n")
+                log_f.write(f"输出目录: {task.output_dir}\n")
+                log_f.write(f"合并脚本: {task.merger_script}\n")
+                log_f.write(f"查找窗口: {window}\n")
+                log_f.write(f"相似度阈值: {threshold}\n")
+                log_f.write(f"输出格式: {output_type}\n")
+                log_f.write(f"开始时间: {datetime.now()}\n")
+                log_f.write(f"{'='*80}\n\n")
+                log_f.flush()
+                
+                # 执行命令
+                result = subprocess.run(
+                    cmd,
+                    stdout=log_f,  # 🎯 重定向 stdout
+                    stderr=subprocess.STDOUT,  # 🎯 合并 stderr 到 stdout
+                    text=True,
+                    check=True
+                )
+                
+                # 写入日志尾
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 成功\n")
+                log_f.write(f"{'='*80}\n")
+            
+            duration = time.time() - start_time
+            self.logger.info(f"✅ 合并成功 (耗时: {duration:.2f}秒)")
+            
+            return {
+                'task': task,
+                'success': True,
+                'duration': duration,
+                'error': '',
+                'dry_run': False
+            }
+            
+        except subprocess.CalledProcessError as e:
+            duration = time.time() - start_time
+            error_msg = f"命令执行失败 (退出码: {e.returncode})"
+            
+            # 🎯 在日志文件中追加错误信息
+            with open(task.log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 失败\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
+            
+            self.logger.error(f"❌ 合并失败 (耗时: {duration:.2f}秒)")
+            self.logger.error(f"错误信息: {error_msg}")
+            self.logger.error(f"详细日志: {task.log_file}")
+            
+            return {
+                'task': task,
+                'success': False,
+                'duration': duration,
+                'error': error_msg,
+                'dry_run': False
+            }
+            
+        except Exception as e:
+            duration = time.time() - start_time
+            error_msg = str(e)
+            
+            with open(task.log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 异常\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
+            
+            self.logger.error(f"❌ 合并异常 (耗时: {duration:.2f}秒)")
+            self.logger.error(f"错误信息: {error_msg}")
+            self.logger.error(f"详细日志: {task.log_file}")
+            
+            return {
+                'task': task,
+                'success': False,
+                'duration': duration,
+                'error': error_msg,
+                'dry_run': False
+            }
+    
+    def _get_vl_arg_name(self, merger_script: str) -> str:
+        """获取 VL 参数名称"""
+        script_name = Path(merger_script).stem
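+        # e.g. 'merge_paddleocr_vl_paddleocr.py' -> 'paddleocr-vl', which
+        # execute_merge_task expands to the CLI flag '--paddleocr-vl-dir'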
+        
+        if 'paddleocr_vl' in script_name:
+            return 'paddleocr-vl'
+        elif 'mineru' in script_name:
+            return 'mineru'
+        else:
+            return 'vl'
+    
+    def _save_summary_log(self, stats: Dict[str, Any]):
+        """🎯 保存汇总日志"""
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        summary_log_file = self.log_base_dir / f"merge_batch_summary_{timestamp}.log"
+        
+        # 确保目录存在
+        summary_log_file.parent.mkdir(parents=True, exist_ok=True)
+        
+        with open(summary_log_file, 'w', encoding='utf-8') as f:
+            f.write("OCR 结果批量合并汇总日志\n")
+            f.write("=" * 80 + "\n\n")
+            
+            f.write(f"配置文件: {self.config_file}\n")
+            f.write(f"基础目录: {self.base_dir}\n")
+            f.write(f"日志目录: {self.log_base_dir}\n")
+            f.write(f"开始时间: {datetime.now()}\n")
+            f.write(f"总耗时: {stats['total_duration']:.2f} 秒\n\n")
+            
+            f.write("统计信息:\n")
+            f.write(f"  总任务数: {stats['total']}\n")
+            f.write(f"  成功: {stats['success']}\n")
+            f.write(f"  失败: {stats['failed']}\n\n")
+            
+            if stats['failed_tasks']:
+                f.write("失败的任务:\n")
+                for item in stats['failed_tasks']:
+                    f.write(f"  ✗ {item['pdf_dir']} / {item['processor']}\n")
+                    f.write(f"    错误: {item['error']}\n")
+                    f.write(f"    日志: {item['log']}\n\n")
+            
+            f.write("详细结果:\n")
+            for result in self.merge_results:
+                task = result['task']
+                status = "✓" if result['success'] else "✗"
+                f.write(f"{status} {task.vl_result_dir.parent.name} / {task.processor_name} ({result['duration']:.2f}s)\n")
+                f.write(f"   日志: {task.log_file}\n")
+                if result['error']:
+                    f.write(f"   错误: {result['error']}\n")
+        
+        self.logger.info(f"汇总日志已保存: {summary_log_file}")
+    
+    def batch_merge(
+        self,
+        pdf_list: Optional[List[str]] = None,
+        processors: Optional[List[str]] = None,
+        window: int = 15,
+        threshold: int = 85,
+        output_type: str = 'both',
+        dry_run: bool = False
+    ) -> Dict:
+        """
+        批量合并
+        
+        Returns:
+            统计信息字典
+        """
+        # 发现任务
+        tasks = self.discover_merge_tasks(pdf_list, processors)
+        
+        if not tasks:
+            self.logger.warning("⚠️  没有发现任何合并任务")
+            return {
+                'total': 0,
+                'success': 0,
+                'failed': 0,
+                'total_duration': 0,
+                'failed_tasks': []
+            }
+        
+        self.logger.info(f"\n🎯 发现 {len(tasks)} 个合并任务\n")
+        
+        # 显示任务列表
+        for i, task in enumerate(tasks, 1):
+            self.logger.info(f"{i}. {task.vl_result_dir.parent.name} / {task.processor_name}")
+        
+        # 确认执行
+        if not dry_run:
+            confirm = input(f"\n是否继续执行 {len(tasks)} 个合并任务? [Y/n]: ")
+            if confirm.lower() not in ['', 'y', 'yes']:
+                self.logger.info("❌ 已取消")
+                return {
+                    'total': 0,
+                    'success': 0,
+                    'failed': 0,
+                    'total_duration': 0,
+                    'failed_tasks': []
+                }
+        
+        # 执行任务
+        batch_start_time = time.time()
+        success_count = 0
+        failed_count = 0
+        
+        with tqdm(total=len(tasks), desc="合并进度", unit="task") as pbar:
+            for task in tasks:
+                result = self.execute_merge_task(
+                    task,
+                    window=window,
+                    threshold=threshold,
+                    output_type=output_type,
+                    dry_run=dry_run
+                )
+                
+                self.merge_results.append(result)
+                
+                if result['success']:
+                    success_count += 1
+                else:
+                    failed_count += 1
+                
+                pbar.update(1)
+                pbar.set_postfix({
+                    'success': success_count,
+                    'failed': failed_count
+                })
+        
+        total_duration = time.time() - batch_start_time
+        
+        # 统计失败任务
+        failed_tasks = [
+            {
+                'pdf_dir': r['task'].vl_result_dir.parent.name,
+                'processor': r['task'].processor_name,
+                'error': r['error'],
+                'log': r['task'].log_file
+            }
+            for r in self.merge_results if not r['success']
+        ]
+        
+        # 统计信息
+        stats = {
+            'total': len(tasks),
+            'success': success_count,
+            'failed': failed_count,
+            'total_duration': total_duration,
+            'failed_tasks': failed_tasks
+        }
+        
+        # 🎯 保存汇总日志
+        self._save_summary_log(stats)
+        
+        # 打印总结
+        self.logger.info(f"\n{'='*80}")
+        self.logger.info("📊 合并完成")
+        self.logger.info(f"  总任务数: {stats['total']}")
+        self.logger.info(f"  ✅ 成功: {stats['success']}")
+        self.logger.info(f"  ❌ 失败: {stats['failed']}")
+        self.logger.info(f"  ⏱️  总耗时: {stats['total_duration']:.2f} 秒")
+        self.logger.info(f"{'='*80}")
+        
+        if failed_tasks:
+            self.logger.info(f"\n失败的任务:")
+            for item in failed_tasks:
+                self.logger.info(f"  ✗ {item['pdf_dir']} / {item['processor']}")
+                self.logger.info(f"    错误: {item['error']}")
+                self.logger.info(f"    日志: {item['log']}")
+        
+        return stats
+
+
+def create_parser() -> argparse.ArgumentParser:
+    """创建命令行参数解析器"""
+    parser = argparse.ArgumentParser(
+        description='批量合并 OCR 结果(VL + PaddleOCR)',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+示例用法:
+
+  1. 合并配置文件中所有 VL 处理器的结果:
+     python batch_merge_results.py
+
+  2. 合并指定 PDF 的结果:
+     python batch_merge_results.py -f pdf_list.txt
+
+  3. 合并指定处理器的结果:
+     python batch_merge_results.py -p paddleocr_vl_single_process -p mineru_vllm
+
+  4. 自定义参数:
+     python batch_merge_results.py -w 20 -t 90
+
+  5. 模拟运行(不实际执行):
+     python batch_merge_results.py --dry-run
+        """
+    )
+    
+    # 配置文件
+    parser.add_argument(
+        '-c', '--config',
+        default='processor_configs.yaml',
+        help='配置文件路径 (默认: processor_configs.yaml)'
+    )
+    
+    # PDF 和处理器
+    parser.add_argument(
+        '-d', '--base-dir',
+        help='PDF 基础目录(覆盖配置文件)'
+    )
+    
+    parser.add_argument(
+        '-f', '--file-list',
+        help='PDF 列表文件(每行一个 PDF 名称,不含扩展名)'
+    )
+    
+    parser.add_argument(
+        '-l', '--pdf-list',
+        nargs='+',
+        help='PDF 名称列表(不含扩展名)'
+    )
+    
+    parser.add_argument(
+        '-p', '--processors',
+        nargs='+',
+        help='处理器列表(不指定则自动检测所有 VL 处理器)'
+    )
+    
+    # 合并参数
+    parser.add_argument(
+        '-w', '--window',
+        type=int,
+        default=15,
+        help='查找窗口大小 (默认: 15)'
+    )
+    
+    parser.add_argument(
+        '-t', '--threshold',
+        type=int,
+        default=85,
+        help='相似度阈值 (默认: 85)'
+    )
+    
+    parser.add_argument(
+        '--output-type',
+        choices=['json', 'markdown', 'both'],
+        default='both',
+        help='输出格式 (默认: both)'
+    )
+    
+    # 工具选项
+    parser.add_argument(
+        '--dry-run',
+        action='store_true',
+        help='模拟运行,不实际执行'
+    )
+    
+    parser.add_argument(
+        '-v', '--verbose',
+        action='store_true',
+        help='详细输出'
+    )
+    
+    return parser
+
+
+def main():
+    """主函数"""
+    parser = create_parser()
+    args = parser.parse_args()
+    
+    # 设置日志级别
+    if args.verbose:
+        logging.getLogger().setLevel(logging.DEBUG)
+    
+    # 读取 PDF 列表
+    pdf_list = None
+    if args.file_list:
+        pdf_list = []
+        with open(args.file_list, 'r', encoding='utf-8') as f:
+            for line in f:
+                line = line.strip()
+                if line and not line.startswith('#'):
+                    # 移除 .pdf 扩展名
+                    pdf_name = line.removesuffix('.pdf')  # removesuffix (Python 3.9+) only strips a trailing '.pdf'
+                    pdf_list.append(pdf_name)
+    elif args.pdf_list:
+        pdf_list = [p.removesuffix('.pdf') for p in args.pdf_list]
+    
+    # 创建批量合并器
+    merger = BatchMerger(
+        config_file=args.config,
+        base_dir=args.base_dir
+    )
+    
+    # 执行批量合并
+    stats = merger.batch_merge(
+        pdf_list=pdf_list,
+        processors=args.processors,
+        window=args.window,
+        threshold=args.threshold,
+        output_type=args.output_type,
+        dry_run=args.dry_run
+    )
+    
+    return 0 if stats['failed'] == 0 else 1
+
+
+if __name__ == '__main__':
+    print("🚀 启动批量OCR bbox 合并程序...")
+    
+    if len(sys.argv) == 1:
+        # 如果没有命令行参数,使用默认配置运行
+        print("ℹ️  未提供命令行参数,使用默认配置运行...")
+        
+        # 默认配置
+        default_config = {
+            "file-list": "pdf_list.txt",
+        }
+        
+        print("⚙️  默认参数:")
+        for key, value in default_config.items():
+            print(f"  --{key}: {value}")
+        # 构造参数
+        sys.argv = [sys.argv[0]]
+        for key, value in default_config.items():
+            sys.argv.extend([f"--{key}", str(value)])
+        sys.argv.append("--dry-run")
+        sys.argv.append("--verbose")  # 添加详细输出参数 
+    sys.exit(main())

+ 181 - 49
batch_ocr/batch_process_pdf.py

@@ -3,6 +3,7 @@
 PDF 批量处理脚本
 支持多种处理器,配置文件驱动
 支持自动切换虚拟环境
+支持执行器输出日志重定向
 """
 
 import os
@@ -32,7 +33,8 @@ class ProcessorConfig:
     output_arg: str = "--output_dir"
     extra_args: List[str] = field(default_factory=list)
     output_subdir: str = "results"
-    venv: Optional[str] = None  # 虚拟环境激活命令
+    log_subdir: str = "logs"  # 🎯 新增:日志子目录
+    venv: Optional[str] = None
     description: str = ""
 
 
@@ -43,6 +45,7 @@ class ProcessResult:
     success: bool
     duration: float
     error_message: str = ""
+    log_file: str = ""  # 🎯 新增:日志文件路径
 
 
 # ============================================================================
@@ -64,7 +67,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'paddleocr_vl_results',
                 'venv': 'source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate',
-                'description': 'PaddleOCR-VL 处理器'
+                'description': 'PaddleOCR-VL 处理器',
+                'log_subdir': 'logs/paddleocr_vl_single_process'  # 🎯 新增
             },
             'ppstructurev3_single_process': {
                 'script': '/Users/zhch158/workspace/repository.git/PaddleX/zhch/ppstructurev3_single_process.py',
@@ -75,7 +79,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'ppstructurev3_results',
                 'venv': 'conda activate paddle',
-                'description': 'PP-StructureV3 处理器'
+                'description': 'PP-StructureV3 处理器',
+                'log_subdir': 'logs/ppstructurev3_single_process'  # 🎯 新增
             },
             'ppstructurev3_single_client': {
                 'script': '/Users/zhch158/workspace/repository.git/PaddleX/zhch/ppstructurev3_single_client.py',
@@ -87,7 +92,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'ppstructurev3_client_results',
                 'venv': 'source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate',
-                'description': 'PP-StructureV3 HTTP API 客户端'
+                'description': 'PP-StructureV3 HTTP API 客户端',
+                'log_subdir': 'logs/ppstructurev3_single_client'  # 🎯 新增
             },
             'mineru_vllm': {
                 'script': '/Users/zhch158/workspace/repository.git/MinerU/zhch/mineru2_vllm_multthreads.py',
@@ -100,7 +106,8 @@ class ConfigManager:
                 ],
                 'output_subdir': 'mineru_vllm_results',
                 'venv': 'conda activate mineru2',
-                'description': 'MinerU vLLM 处理器'
+                'description': 'MinerU vLLM 处理器',
+                'log_subdir': 'logs/mineru_vllm'  # 🎯 新增
             },
             'dotsocr_vllm': {
                 'script': '/Users/zhch158/workspace/repository.git/dots.ocr/zhch/dotsocr_vllm_multthreads.py',
@@ -117,12 +124,16 @@ class ConfigManager:
                 ],
                 'output_subdir': 'dotsocr_vllm_results',
                 'venv': 'conda activate py312',
-                'description': 'DotsOCR vLLM 处理器 - 支持PDF和图片'
+                'description': 'DotsOCR vLLM 处理器 - 支持PDF和图片',
+                'log_subdir': 'logs/dotsocr_vllm'  # 🎯 新增
             }
         },
         'global': {
             'base_dir': '/Users/zhch158/workspace/data/流水分析',
-            'output_subdir': 'results'
+            'output_subdir': 'results',
+            'log_dir': 'logs',
+            'log_retention_days': 30,
+            'log_level': 'INFO'
         }
     }
     
@@ -153,6 +164,7 @@ class ConfigManager:
             output_arg=proc_config.get('output_arg', '--output_dir'),
             extra_args=proc_config.get('extra_args', []),
             output_subdir=proc_config.get('output_subdir', processor_name + '_results'),
+            log_subdir=proc_config.get('log_subdir', f'logs/{processor_name}'),  # 🎯 新增
             venv=proc_config.get('venv'),
             description=proc_config.get('description', '')
         )
@@ -250,11 +262,13 @@ class PDFBatchProcessor:
         self,
         processor_config: ProcessorConfig,
         output_subdir: Optional[str] = None,
+        log_base_dir: Optional[str] = None,  # 🎯 新增:日志基础目录
         dry_run: bool = False
     ):
         self.processor_config = processor_config
         # 如果指定了output_subdir,使用指定的;否则使用处理器配置中的
         self.output_subdir = output_subdir or processor_config.output_subdir
+        self.log_base_dir = Path(log_base_dir) if log_base_dir else Path('logs')  # 🎯 新增
         self.dry_run = dry_run
         
         # 设置日志
@@ -282,12 +296,37 @@ class PDFBatchProcessor:
         
         return logger
     
+    def _get_log_file_path(self, pdf_file: Path) -> Path:
+        """
+        🎯 获取日志文件路径
+        
+        日志结构:
+        base_dir/
+        └── PDF名称/
+            └── logs/
+                └── processor_name/
+                    └── PDF名称_YYYYMMDD_HHMMSS.log
+        """
+        # PDF 目录
+        pdf_dir = pdf_file.parent / pdf_file.stem
+        
+        # 日志目录: pdf_dir / logs / processor_name
+        log_dir = pdf_dir / self.processor_config.log_subdir
+        log_dir.mkdir(parents=True, exist_ok=True)
+        
+        # 日志文件名: PDF名称_时间戳.log
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        log_file = log_dir / f"{pdf_file.stem}_{timestamp}.log"
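+        # e.g. <base_dir>/sample/logs/mineru_vllm/sample_20250101_120000.log ("sample" is illustrative)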
+        
+        return log_file
+    
     def process_files(self, pdf_files: List[Path]) -> Dict[str, Any]:
         """批量处理文件"""
         self.logger.info(f"开始处理 {len(pdf_files)} 个文件")
         self.logger.info(f"处理器: {self.processor_config.description}")
         self.logger.info(f"脚本: {self.processor_config.script}")
         self.logger.info(f"输出目录: {self.output_subdir}")
+        self.logger.info(f"日志目录: {self.processor_config.log_subdir}")
         
         if self.processor_config.venv:
             self.logger.info(f"虚拟环境: {self.processor_config.venv}")
@@ -312,14 +351,12 @@ class PDFBatchProcessor:
         
         # 生成统计信息
         stats = self._generate_stats(total_duration)
-        
-        # 保存日志
-        self._save_log(stats)
+        self._save_summary_log(stats)
         
         return stats
     
     def _process_single_file(self, pdf_file: Path) -> ProcessResult:
-        """处理单个文件"""
+        """🎯 处理单个文件(支持日志重定向)"""
         self.logger.info(f"处理: {pdf_file}")
         
         # 检查文件是否存在
@@ -335,10 +372,14 @@ class PDFBatchProcessor:
         # 确定输出目录
         output_dir = pdf_file.parent / pdf_file.stem / self.output_subdir
         
+        # 🎯 获取日志文件路径
+        log_file = self._get_log_file_path(pdf_file)
+        
         # 构建命令
         cmd = self._build_command(pdf_file, output_dir)
         
         self.logger.debug(f"执行命令: {cmd if isinstance(cmd, str) else ' '.join(cmd)}")
+        self.logger.info(f"日志输出: {log_file}")
         
         if self.dry_run:
             self.logger.info(f"[DRY RUN] 将执行: {cmd if isinstance(cmd, str) else ' '.join(cmd)}")
@@ -346,53 +387,103 @@ class PDFBatchProcessor:
                 pdf_file=str(pdf_file),
                 success=True,
                 duration=0,
-                error_message=""
+                error_message="",
+                log_file=str(log_file)
             )
         
-        # 执行命令
+        # 🎯 执行命令并重定向输出到日志文件
         start_time = time.time()
         try:
-            # 如果是 shell 命令(包含 venv),使用 shell=True
-            if isinstance(cmd, str):
-                result = subprocess.run(
-                    cmd,
-                    shell=True,
-                    executable='/bin/bash',  # 使用 bash
-                    capture_output=True,
-                    text=True,
-                    check=True
-                )
-            else:
-                result = subprocess.run(
-                    cmd,
-                    capture_output=True,
-                    text=True,
-                    check=True
-                )
+            with open(log_file, 'w', encoding='utf-8') as log_f:
+                # 写入日志头
+                log_f.write(f"{'='*80}\n")
+                log_f.write(f"处理器: {self.processor_config.description}\n")
+                log_f.write(f"PDF 文件: {pdf_file}\n")
+                log_f.write(f"输出目录: {output_dir}\n")
+                log_f.write(f"开始时间: {datetime.now()}\n")
+                log_f.write(f"{'='*80}\n\n")
+                log_f.flush()
+                
+                # 执行命令
+                if isinstance(cmd, str):
+                    result = subprocess.run(
+                        cmd,
+                        shell=True,
+                        executable='/bin/bash',
+                        stdout=log_f,  # 🎯 重定向 stdout
+                        stderr=subprocess.STDOUT,  # 🎯 合并 stderr 到 stdout
+                        text=True,
+                        check=True
+                    )
+                else:
+                    result = subprocess.run(
+                        cmd,
+                        stdout=log_f,  # 🎯 重定向 stdout
+                        stderr=subprocess.STDOUT,  # 🎯 合并 stderr
+                        text=True,
+                        check=True
+                    )
+                
+                # 写入日志尾
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 成功\n")
+                log_f.write(f"{'='*80}\n")
             
             duration = time.time() - start_time
-            
             self.logger.info(f"✓ 成功 (耗时: {duration:.2f}秒)")
             
             return ProcessResult(
                 pdf_file=str(pdf_file),
                 success=True,
                 duration=duration,
-                error_message=""
+                error_message="",
+                log_file=str(log_file)
             )
             
         except subprocess.CalledProcessError as e:
             duration = time.time() - start_time
-            error_msg = e.stderr if e.stderr else str(e)
+            error_msg = f"命令执行失败 (退出码: {e.returncode})"
+            
+            # 🎯 在日志文件中追加错误信息
+            with open(log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 失败\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
             
             self.logger.error(f"✗ 失败 (耗时: {duration:.2f}秒)")
             self.logger.error(f"错误信息: {error_msg}")
+            self.logger.error(f"详细日志: {log_file}")
+            
+            return ProcessResult(
+                pdf_file=str(pdf_file),
+                success=False,
+                duration=duration,
+                error_message=error_msg,
+                log_file=str(log_file)
+            )
+        except Exception as e:
+            duration = time.time() - start_time
+            error_msg = str(e)
+            
+            with open(log_file, 'a', encoding='utf-8') as log_f:
+                log_f.write(f"\n{'='*80}\n")
+                log_f.write(f"结束时间: {datetime.now()}\n")
+                log_f.write(f"状态: 异常\n")
+                log_f.write(f"错误: {error_msg}\n")
+                log_f.write(f"{'='*80}\n")
+            
+            self.logger.error(f"✗ 异常 (耗时: {duration:.2f}秒)")
+            self.logger.error(f"错误信息: {error_msg}")
             
             return ProcessResult(
                 pdf_file=str(pdf_file),
                 success=False,
                 duration=duration,
-                error_message=error_msg
+                error_message=error_msg,
+                log_file=str(log_file)
             )
     
     def _build_command(self, pdf_file: Path, output_dir: Path):
@@ -451,7 +542,14 @@ eval "$(conda shell.bash hook)"
         success_count = sum(1 for r in self.results if r.success)
         failed_count = len(self.results) - success_count
         
-        failed_files = [r.pdf_file for r in self.results if not r.success]
+        failed_files = [
+            {
+                'file': r.pdf_file,
+                'error': r.error_message,
+                'log': r.log_file
+            }
+            for r in self.results if not r.success
+        ]
         
         stats = {
             'total': len(self.results),
@@ -464,7 +562,8 @@ eval "$(conda shell.bash hook)"
                     'file': r.pdf_file,
                     'success': r.success,
                     'duration': r.duration,
-                    'error': r.error_message
+                    'error': r.error_message,
+                    'log': r.log_file
                 }
                 for r in self.results
             ]
@@ -472,19 +571,23 @@ eval "$(conda shell.bash hook)"
         
         return stats
     
-    def _save_log(self, stats: Dict[str, Any]):
-        """保存日志"""
+    def _save_summary_log(self, stats: Dict[str, Any]):
+        """🎯 保存汇总日志"""
         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
-        log_file = f"batch_process_{self.processor_config.name}_{timestamp}.log"
+        summary_log_file = self.log_base_dir / f"batch_summary_{self.processor_config.name}_{timestamp}.log"
+        
+        # 确保目录存在
+        summary_log_file.parent.mkdir(parents=True, exist_ok=True)
         
-        with open(log_file, 'w', encoding='utf-8') as f:
-            f.write("PDF 批量处理日志\n")
+        with open(summary_log_file, 'w', encoding='utf-8') as f:
+            f.write("PDF 批量处理汇总日志\n")
             f.write("=" * 80 + "\n\n")
             
             f.write(f"处理器: {self.processor_config.description}\n")
             f.write(f"处理器名称: {self.processor_config.name}\n")
             f.write(f"脚本: {self.processor_config.script}\n")
             f.write(f"输出目录: {self.output_subdir}\n")
+            f.write(f"日志目录: {self.processor_config.log_subdir}\n")
             
             if self.processor_config.venv:
                 f.write(f"虚拟环境: {self.processor_config.venv}\n")
@@ -499,18 +602,20 @@ eval "$(conda shell.bash hook)"
             
             if stats['failed_files']:
                 f.write("失败的文件:\n")
-                for file in stats['failed_files']:
-                    f.write(f"  - {file}\n")
-                f.write("\n")
+                for item in stats['failed_files']:
+                    f.write(f"  ✗ {item['file']}\n")
+                    f.write(f"    错误: {item['error']}\n")
+                    f.write(f"    日志: {item['log']}\n\n")
             
             f.write("详细结果:\n")
             for result in stats['results']:
                 status = "✓" if result['success'] else "✗"
                 f.write(f"{status} {result['file']} ({result['duration']:.2f}s)\n")
+                f.write(f"   日志: {result['log']}\n")
                 if result['error']:
                     f.write(f"   错误: {result['error']}\n")
         
-        self.logger.info(f"日志已保存: {log_file}")
+        self.logger.info(f"汇总日志已保存: {summary_log_file}")
 
 
 # ============================================================================
@@ -684,6 +789,7 @@ def main():
     base_dir = args.base_dir or config_manager.get_global_config('base_dir')
     if not base_dir:
         parser.error("必须指定 -d 参数或在配置文件中设置 base_dir")
+    log_base_dir = Path(base_dir) / config_manager.get_global_config('log_dir', 'logs')
     
     # 查找 PDF 文件
     finder = PDFFileFinder(base_dir)
@@ -727,6 +833,7 @@ def main():
     processor = PDFBatchProcessor(
         processor_config=processor_config,
         output_subdir=args.output_subdir,
+        log_base_dir=log_base_dir,  # 🎯 传递日志目录
         dry_run=args.dry_run
     )
     
@@ -739,8 +846,7 @@ def main():
     print(f"\n📊 统计信息:")
     print(f"  处理器: {processor_config.description}")
     print(f"  输出目录: {processor.output_subdir}")
-    if stats.get('venv'):
-        print(f"  虚拟环境: {stats['venv']}")
+    print(f"  日志目录: {processor.processor_config.log_subdir}")
     print(f"  总文件数: {stats['total']}")
     print(f"  ✓ 成功: {stats['success']}")
     print(f"  ✗ 失败: {stats['failed']}")
@@ -748,11 +854,37 @@ def main():
     
     if stats['failed_files']:
         print(f"\n失败的文件:")
-        for file in stats['failed_files']:
-            print(f"  ✗ {file}")
+        for item in stats['failed_files']:
+            print(f"  ✗ {item['file']}")
+            print(f"    错误: {item['error']}")
+            print(f"    日志: {item['log']}")
     
     return 0 if stats['failed'] == 0 else 1
 
 
 if __name__ == '__main__':
+    print("🚀 启动批量OCR程序...")
+    
+    if len(sys.argv) == 1:
+        # 如果没有命令行参数,使用默认配置运行
+        print("ℹ️  未提供命令行参数,使用默认配置运行...")
+        
+        # 默认配置
+        default_config = {
+            "processor": "mineru_vllm",
+            "file-list": "pdf_list.txt",
+        }
+        
+        print("⚙️  默认参数:")
+        for key, value in default_config.items():
+            print(f"  --{key}: {value}")
+        # 构造参数
+        sys.argv = [sys.argv[0]]
+        for key, value in default_config.items():
+            sys.argv.extend([f"--{key}", str(value)])
+        sys.argv.append("--dry-run")
+        sys.argv.append("--verbose")  # 添加详细输出参数 
+
     sys.exit(main())

+ 19 - 8
batch_ocr/processor_configs.yaml

@@ -6,7 +6,6 @@
 processors:
   # -------------------------------------------------------------------------
   # PaddleOCR-VL 处理器
-  # 用于视觉语言模型的 OCR 处理
   # -------------------------------------------------------------------------
   paddleocr_vl_single_process:
     script: "/Users/zhch158/workspace/repository.git/PaddleX/zhch/paddleocr_vl_single_process.py"
@@ -14,14 +13,15 @@ processors:
     output_arg: "--output_dir"
     extra_args:
       - "--pipeline=/Users/zhch158/workspace/repository.git/PaddleX/zhch/my_config/PaddleOCR-VL-Client-RT-DETR-H_layout_17cls.yaml"
-      - "--no-adapter"
+      - "--device=cpu"
+      # - "--no-adapter"
     output_subdir: "paddleocr_vl_results"
+    log_subdir: "logs/paddleocr_vl"  # 🎯 新增:日志子目录
     venv: "source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate"
     description: "PaddleOCR-VL 处理器 - 视觉语言模型OCR"
 
   # -------------------------------------------------------------------------
   # PP-StructureV3 本地处理器
-  # 用于文档结构化分析(本地GPU/CPU处理)
   # -------------------------------------------------------------------------
   ppstructurev3_single_process:
     script: "/home/ubuntu/zhch/PaddleX/zhch/ppstructurev3_single_process.py"
@@ -29,22 +29,25 @@ processors:
     output_arg: "--output_dir"
     extra_args:
       - "--pipeline=/home/ubuntu/zhch/PaddleX/zhch/my_config/PP-StructureV3.yaml"
+      - "--device=cpu"
     output_subdir: "ppstructurev3_results"
+    log_subdir: "logs/ppstructurev3"
     venv: "conda activate paddle"
     description: "PP-StructureV3 处理器 - 本地处理"
 
-  # -------------------------------------------------------------------------
-  # PP-StructureV3 GPU 处理器
-  # 明确使用 GPU 加速
-  # -------------------------------------------------------------------------
   ppstructurev3_gpu:
     script: "/home/ubuntu/zhch/PaddleX/zhch/ppstructurev3_single_process.py"
     input_arg: "--input_file"
     output_arg: "--output_dir"
     extra_args:
       - "--pipeline=/home/ubuntu/zhch/PaddleX/zhch/my_config/PP-StructureV3.yaml"
       - "--device=gpu"
     output_subdir: "ppstructurev3_gpu_results"
+    log_subdir: "logs/ppstructurev3_gpu"
     venv: "conda activate paddle"
     description: "PP-StructureV3 处理器 - GPU加速"
 
@@ -60,6 +63,7 @@ processors:
       - "--pipeline=/Users/zhch158/workspace/repository.git/PaddleX/zhch/my_config/PP-StructureV3-zhch.yaml"
       - "--device=cpu"
     output_subdir: "ppstructurev3_cpu_results"
+    log_subdir: "logs/ppstructurev3_cpu"
     venv: "source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate"
     description: "PP-StructureV3 处理器 - CPU处理"
 
@@ -75,6 +79,7 @@ processors:
       - "--api_url=http://10.192.72.11:8111/layout-parsing"
       - "--timeout=300"
     output_subdir: "ppstructurev3_client_results"
+    log_subdir: "logs/ppstructurev3_client"
     venv: "source /Users/zhch158/workspace/repository.git/PaddleX/paddle_env/bin/activate"
     description: "PP-StructureV3 HTTP API 客户端 - 远程服务"
 
@@ -91,6 +96,7 @@ processors:
       - "--timeout=300"
       - "--batch_size=1"
     output_subdir: "mineru_vllm_results"
+    log_subdir: "logs/mineru_vllm"
     venv: "conda activate mineru2"
     description: "MinerU vLLM 处理器 - 支持PDF和图片"
 
@@ -111,6 +117,7 @@ processors:
       - "--max_workers=1"
       - "--dpi=200"
     output_subdir: "dotsocr_vllm_results"
+    log_subdir: "logs/dotsocr_vllm"
     venv: "conda activate py312"
     description: "DotsOCR vLLM 处理器 - 支持PDF和图片"
 
@@ -123,4 +130,8 @@ global:
   
   # 默认输出子目录名称(如果处理器未指定)
   output_subdir: "results"
-  
+  
+  # 🎯 新增:全局日志配置
+  log_dir: "logs"  # 全局日志目录(相对于 base_dir)
+  log_retention_days: 30  # 日志保留天数
+  log_level: "INFO"  # 日志级别: DEBUG, INFO, WARNING, ERROR
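
global 段新增了 `log_retention_days`,但对应的清理逻辑未出现在本次提交中。下面是一个按保留天数删除过期 `.log` 文件的最小示意(`cleanup_old_logs` 为假设的函数名,并假设日志均为日志目录下的 `.log` 文件):

```python
import time
from pathlib import Path

def cleanup_old_logs(log_dir: str, retention_days: int = 30) -> int:
    """删除修改时间早于保留期限的 .log 文件,返回删除数量(示意实现)"""
    cutoff = time.time() - retention_days * 86400
    removed = 0
    for log_file in Path(log_dir).rglob("*.log"):
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            removed += 1
    return removed

# 用法示意:cleanup_old_logs("/path/to/base_dir/logs", retention_days=30)
```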

+ 48 - 0
config/A用户_单元格扫描流水.yaml

@@ -0,0 +1,48 @@
+document:
+  name: "A用户_单元格扫描流水"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/A用户_单元格扫描流水"
+
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true
+  
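
上面 `image_dir` 中的 `{{name}}` 是 Jinja2 模板变量,加载配置时由下文 config_manager.py 的 `_render_template` 以 `document.name` 渲染。最小示意:

```python
from jinja2 import Template

# 示意:{{name}} 在加载文档配置时被替换为文档名
context = {"name": "A用户_单元格扫描流水"}
print(Template("ppstructurev3_client_results/{{name}}").render(context))
# 输出: ppstructurev3_client_results/A用户_单元格扫描流水
```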

+ 47 - 0
config/B用户_扫描流水.yaml

@@ -0,0 +1,47 @@
+document:
+  name: "B用户_扫描流水"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/B用户_扫描流水"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 153 - 0
config/global.yaml

@@ -0,0 +1,153 @@
+# OCR验证工具配置文件
+
+# 样式配置
+styles:
+  font_size: 8
+  
+  colors:
+    primary: "#0288d1"
+    secondary: "#ff9800"
+    success: "#4caf50"
+    error: "#f44336"
+    warning: "#ff9800"
+    background: "#fafafa"
+    text: "#333333"
+  
+  layout:
+    default_zoom: 1.0
+    default_height: 800
+    sidebar_width: 1
+    content_width: 0.65
+
+# 界面配置
+ui:
+  page_title: "OCR可视化校验工具"
+  page_icon: "🔍"
+  layout: "wide"
+  sidebar_state: "expanded"
+  
+# OCR数据配置
+ocr:
+  min_text_length: 2
+  default_confidence: 1.0
+  exclude_texts: ["Picture", ""]
+  
+  # 图片方向检测配置
+  orientation_detection:
+    enabled: true
+    confidence_threshold: 0.3  # 置信度阈值
+    methods: ["opencv_analysis"]  # 检测方法
+    cache_results: true  # 缓存检测结果
+  
+  # OCR工具类型配置
+  tools:
+    dots_ocr:
+      name: "Dots OCR"
+      description: "专业VLM OCR"
+      json_structure: "array"  # JSON为数组格式
+      text_field: "text"
+      bbox_field: "bbox"
+      category_field: "category"
+      confidence_field: "confidence"
+      # 旋转处理配置
+      rotation:
+        coordinates_are_pre_rotated: false  # 坐标不是预旋转的
+        
+    ppstructv3:
+      name: "PPStructV3"
+      description: "PaddleOCR PP-StructureV3"
+      json_structure: "object"  # JSON为对象格式
+      parsing_results_field: "parsing_res_list"
+      text_field: "block_content"
+      bbox_field: "block_bbox"
+      rec_texts_field: "overall_ocr_res.rec_texts" # 针对表格中的文字块
+      rec_boxes_field: "overall_ocr_res.rec_boxes" # 针对表格中的文字块
+      category_field: "block_label"
+      confidence_field: "confidence"
+      # 旋转处理配置
+      rotation:
+        coordinates_are_pre_rotated: true  # 坐标已经是预旋转的
+      
+    table_recognition_v2:
+      name: "TableRecognitionV2"
+      description: "PaddleOCR Table Recognition V2"
+      json_structure: "object"
+      parsing_results_field: "table_res_list"
+      text_field: "pred_html"
+      bbox_field: "cell_box_list"            # 原先的 cell_box_listox 为笔误
+      rec_texts_field: "table_ocr_pred.rec_texts" # 针对表格中的文字块
+      rec_boxes_field: "table_ocr_pred.rec_boxes" # 针对表格中的文字块
+      category_field: "type"
+      confidence_field: "confidence"
+      rotation:
+        coordinates_are_pre_rotated: true
+    
+    mineru:
+      name: "MinerU"
+      description: "MinerU OCR"
+      json_structure: "array"  # JSON为数组格式
+      text_field: "text"
+      bbox_field: "bbox"
+      category_field: "type"
+      confidence_field: "confidence"
+      # 表格相关字段
+      table_body_field: "table_body"
+      table_cells_field: "table_cells"
+      img_path_field: "img_path"
+      # 旋转处理配置
+      rotation:
+        coordinates_are_pre_rotated: false
+  
+  # 自动检测工具类型的规则(按优先级从高到低)
+  auto_detection:
+    enabled: true
+    rules:
+      # Table Recognition V2 - 最高优先级
+      - tool_type: "table_recognition_v2"
+        conditions:
+          - type: "field_exists"
+            field: "table_res_list"
+          - type: "field_not_exists"
+            field: "parsing_res_list"
+        priority: 4
+      
+      # PPStructV3 - 第二优先级
+      - tool_type: "ppstructv3"
+        conditions:
+          - type: "field_exists"
+            field: "parsing_res_list"
+          - type: "field_exists"
+            field: "doc_preprocessor_res"
+        priority: 2
+      
+      # MinerU - 第三优先级
+      - tool_type: "mineru"
+        conditions:
+          - type: "field_exists"
+            field: "page_idx"
+          - type: "field_exists"
+            field: "type"
+          - type: "json_structure"
+            structure: "array"
+        priority: 1
+      
+      # Dots OCR - 最低优先级(默认)
+      - tool_type: "dots_ocr"
+        conditions:
+          - type: "json_structure"
+            structure: "array"
+          - type: "field_exists"
+            field: "category"
+        priority: 3
+
+# 预校验结果文件路径
+pre_validation:
+  out_dir: "./output/pre_validation/"
+
+data_sources:
+  - 德_内蒙古银行照.yaml
+  - 对公_招商银行图.yaml
+  - A用户_单元格扫描流水.yaml
+  - B用户_扫描流水.yaml
+  - 至远彩色_2023年报.yaml
+
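
auto_detection 规则依据 JSON 结果的字段与整体结构推断工具类型。下面是按配置注释所述"从高到低"的列表顺序逐条评估规则的最小示意(`detect_tool_type` 为假设的函数名,仅覆盖 field_exists / field_not_exists / json_structure 三类条件):

```python
from typing import Any, Dict, List, Optional

def detect_tool_type(data: Any, rules: List[Dict]) -> Optional[str]:
    """按规则列表顺序匹配 OCR 工具类型(示意实现)"""
    # 数组格式的结果取第一个元素作为字段检查样本
    sample = data[0] if isinstance(data, list) and data else data

    def check(cond: Dict) -> bool:
        if cond["type"] == "field_exists":
            return isinstance(sample, dict) and cond["field"] in sample
        if cond["type"] == "field_not_exists":
            return not (isinstance(sample, dict) and cond["field"] in sample)
        if cond["type"] == "json_structure":
            return (cond["structure"] == "array") == isinstance(data, list)
        return False

    for rule in rules:  # 假设按列表顺序(从高到低)评估
        if all(check(c) for c in rule["conditions"]):
            return rule["tool_type"]
    return None
```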

+ 47 - 0
config/对公_招商银行图.yaml

@@ -0,0 +1,47 @@
+document:
+  name: "对公_招商银行图"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/对公_招商银行图"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 48 - 0
config/德_内蒙古银行照.yaml

@@ -0,0 +1,48 @@
+# 文档: 德_内蒙古银行照
+document:
+  name: "德_内蒙古银行照"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 47 - 0
config/至远彩色_2023年报.yaml

@@ -0,0 +1,47 @@
+document:
+  name: "至远彩色_2023年报"
+  base_dir: "/Users/zhch158/workspace/data/流水分析/至远彩色_2023年报"
+  
+  # 🎯 关键改进:定义该文档使用的 OCR 工具及其结果目录
+  ocr_results:
+    # PPStructV3
+    - tool: "ppstructv3"
+      result_dir: "ppstructurev3_client_results"
+      image_dir: "ppstructurev3_client_results/{{name}}"
+      description: "PPStructV3 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL
+    - tool: "paddleocr_vl"
+      result_dir: "paddleocr_vl_results"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM 图片合成结果"
+      enabled: true
+    
+    # PaddleOCR-VL (带 cell bbox)
+    - tool: "mineru"  # 格式同 MinerU
+      result_dir: "paddleocr_vl_results_cell_bbox"
+      image_dir: "paddleocr_vl_results/{{name}}"
+      description: "PaddleOCR VLM + PaddleOCR 坐标"
+      enabled: true
+    
+    # MinerU
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU 图片合成结果"
+      enabled: true
+    
+    # MinerU (带 cell bbox)
+    - tool: "mineru"
+      result_dir: "mineru_vllm_results_cell_bbox"
+      image_dir: "mineru_vllm_results/{{name}}"
+      description: "MinerU + PaddleOCR 坐标"
+      enabled: true
+    
+    # DotsOCR
+    - tool: "dots_ocr"
+      result_dir: "dotsocr_vllm_results"
+      image_dir: "dotsocr_vllm_results/{{name}}"
+      description: "Dots OCR 图片合成结果"
+      enabled: true

+ 360 - 0
config_manager.py

@@ -0,0 +1,360 @@
+"""
+配置管理器
+支持分层配置和自动发现数据源
+支持 Jinja2 模板变量
+"""
+
+import yaml
+from pathlib import Path
+from typing import Dict, List, Optional, Any
+from dataclasses import dataclass, field
+import logging
+from jinja2 import Template  # 🎯 新增
+
+
+@dataclass
+class OCRToolConfig:
+    """OCR 工具配置"""
+    name: str
+    description: str
+    json_structure: str
+    text_field: str
+    bbox_field: str
+    category_field: str
+    confidence_field: str = "confidence"
+    parsing_results_field: Optional[str] = None
+    rec_texts_field: Optional[str] = None
+    rec_boxes_field: Optional[str] = None
+    table_body_field: Optional[str] = None
+    table_cells_field: Optional[str] = None
+    img_path_field: Optional[str] = None
+    rotation: Dict[str, Any] = field(default_factory=dict)
+    
+    @classmethod
+    def from_dict(cls, tool_id: str, data: Dict) -> 'OCRToolConfig':
+        """从字典创建"""
+        return cls(
+            name=data.get('name', tool_id),
+            description=data.get('description', ''),
+            json_structure=data.get('json_structure', 'object'),
+            text_field=data.get('text_field', 'text'),
+            bbox_field=data.get('bbox_field', 'bbox'),
+            category_field=data.get('category_field', 'category'),
+            confidence_field=data.get('confidence_field', 'confidence'),
+            parsing_results_field=data.get('parsing_results_field'),
+            rec_texts_field=data.get('rec_texts_field'),
+            rec_boxes_field=data.get('rec_boxes_field'),
+            table_body_field=data.get('table_body_field'),
+            table_cells_field=data.get('table_cells_field'),
+            img_path_field=data.get('img_path_field'),
+            rotation=data.get('rotation', {})
+        )
+    
+    def to_dict(self) -> Dict:
+        """转换为字典(用于 OCRValidator)"""
+        config_dict = {
+            'name': self.name,
+            'description': self.description,
+            'json_structure': self.json_structure,
+            'text_field': self.text_field,
+            'bbox_field': self.bbox_field,
+            'category_field': self.category_field,
+            'confidence_field': self.confidence_field,
+            'rotation': self.rotation
+        }
+        
+        # 添加可选字段
+        if self.parsing_results_field:
+            config_dict['parsing_results_field'] = self.parsing_results_field
+        if self.rec_texts_field:
+            config_dict['rec_texts_field'] = self.rec_texts_field
+        if self.rec_boxes_field:
+            config_dict['rec_boxes_field'] = self.rec_boxes_field
+        if self.table_body_field:
+            config_dict['table_body_field'] = self.table_body_field
+        if self.table_cells_field:
+            config_dict['table_cells_field'] = self.table_cells_field
+        if self.img_path_field:
+            config_dict['img_path_field'] = self.img_path_field
+        
+        return config_dict
+
+
+@dataclass
+class OCRResultConfig:
+    """OCR 结果配置"""
+    tool: str
+    result_dir: str
+    image_dir: Optional[str]
+    description: str = ""
+    enabled: bool = True
+    
+    @classmethod
+    def from_dict(cls, data: Dict, context: Dict = None) -> 'OCRResultConfig':
+        """
+        🎯 从字典创建(支持 Jinja2 模板)
+        
+        Args:
+            data: 配置数据
+            context: 模板上下文(如 {'name': '德_内蒙古银行照'})
+        """
+        # 🎯 渲染模板
+        if context:
+            result_dir = cls._render_template(data['result_dir'], context)
+            image_dir = cls._render_template(data.get('image_dir'), context) if data.get('image_dir') else None
+            description = cls._render_template(data.get('description', ''), context)
+        else:
+            result_dir = data['result_dir']
+            image_dir = data.get('image_dir')
+            description = data.get('description', '')
+        
+        return cls(
+            tool=data['tool'],
+            result_dir=result_dir,
+            image_dir=image_dir,
+            description=description,
+            enabled=data.get('enabled', True)
+        )
+    
+    @staticmethod
+    def _render_template(template_str: Optional[str], context: Dict) -> Optional[str]:
+        """🎯 渲染 Jinja2 模板"""
+        if not template_str:
+            return None
+        
+        try:
+            template = Template(template_str)
+            return template.render(context)
+        except Exception as e:
+            logging.warning(f"模板渲染失败: {template_str}, 错误: {e}")
+            return template_str
+
+
+@dataclass
+class DocumentConfig:
+    """文档配置"""
+    name: str
+    base_dir: str
+    ocr_results: List[OCRResultConfig] = field(default_factory=list)
+    
+    @classmethod
+    def from_dict(cls, data: Dict) -> 'DocumentConfig':
+        """从字典创建(支持 Jinja2 模板)"""
+        doc_data = data.get('document', data)
+        
+        # 🎯 构建模板上下文
+        context = {
+            'name': doc_data['name'],
+            'base_dir': doc_data['base_dir']
+        }
+        
+        return cls(
+            name=doc_data['name'],
+            base_dir=doc_data['base_dir'],
+            ocr_results=[
+                OCRResultConfig.from_dict(r, context) 
+                for r in doc_data.get('ocr_results', [])
+            ]
+        )
+
+
+@dataclass
+class DataSource:
+    """数据源(用于 OCRValidator)"""
+    name: str
+    ocr_tool: str
+    ocr_out_dir: str
+    src_img_dir: str
+    description: str = ""
+
+
+class ConfigManager:
+    """配置管理器"""
+    
+    def __init__(self, config_dir: str = "config"):
+        """
+        Args:
+            config_dir: 配置文件目录
+        """
+        self.config_dir = Path(config_dir)
+        self.logger = logging.getLogger(__name__)
+        
+        # 加载配置
+        self.global_config = self._load_global_config()
+        self.ocr_tools = self._load_ocr_tools()
+        self.documents = self._load_documents()
+    
+    def _load_global_config(self) -> Dict:
+        """加载全局配置"""
+        config_file = self.config_dir / "global.yaml"
+        
+        if not config_file.exists():
+            self.logger.warning(f"全局配置文件不存在: {config_file}")
+            return {}
+        
+        with open(config_file, 'r', encoding='utf-8') as f:
+            return yaml.safe_load(f) or {}
+    
+    def _load_ocr_tools(self) -> Dict[str, OCRToolConfig]:
+        """加载 OCR 工具配置(从 global.yaml)"""
+        tools_data = self.global_config.get('ocr', {}).get('tools', {})
+        
+        tools = {}
+        for tool_id, tool_data in tools_data.items():
+            tools[tool_id] = OCRToolConfig.from_dict(tool_id, tool_data)
+        
+        return tools
+    
+    def _load_documents(self) -> Dict[str, DocumentConfig]:
+        """加载文档配置(支持 Jinja2 模板)"""
+        documents = {}
+        
+        # 从 global.yaml 读取文档配置文件列表
+        doc_files = self.global_config.get('data_sources', [])
+        
+        for doc_file in doc_files:
+            # 支持相对路径和绝对路径
+            if not doc_file.endswith('.yaml'):
+                doc_file = f"{doc_file}.yaml"
+            
+            yaml_path = self.config_dir / doc_file
+            
+            if not yaml_path.exists():
+                self.logger.warning(f"文档配置文件不存在: {yaml_path}")
+                continue
+            
+            try:
+                with open(yaml_path, 'r', encoding='utf-8') as f:
+                    data = yaml.safe_load(f)
+                
+                # 🎯 使用支持 Jinja2 的解析方法
+                doc_config = DocumentConfig.from_dict(data)
+                documents[doc_config.name] = doc_config
+                
+                self.logger.info(f"✅ 加载文档配置: {doc_config.name} ({len(doc_config.ocr_results)} 个 OCR 结果)")
+                
+            except Exception as e:
+                self.logger.error(f"加载文档配置失败: {yaml_path}, 错误: {e}")
+        
+        return documents
+    
+    def get_ocr_tool(self, tool_id: str) -> Optional[OCRToolConfig]:
+        """获取 OCR 工具配置"""
+        return self.ocr_tools.get(tool_id)
+    
+    def get_document(self, doc_name: str) -> Optional[DocumentConfig]:
+        """获取文档配置"""
+        return self.documents.get(doc_name)
+    
+    def list_documents(self) -> List[str]:
+        """列出所有文档"""
+        return list(self.documents.keys())
+    
+    def list_ocr_tools(self) -> List[str]:
+        """列出所有 OCR 工具"""
+        return list(self.ocr_tools.keys())
+    
+    def get_data_sources(self) -> List[DataSource]:
+        """
+        生成数据源列表(供 OCRValidator 使用)
+        
+        从文档配置自动生成 data_sources
+        """
+        data_sources = []
+        
+        for doc_name, doc_config in self.documents.items():
+            base_dir = Path(doc_config.base_dir)
+            
+            for ocr_result in doc_config.ocr_results:
+                if not ocr_result.enabled:
+                    continue
+                
+                # 构建完整路径
+                ocr_out_dir = str(base_dir / ocr_result.result_dir)
+                
+                if ocr_result.image_dir:
+                    src_img_dir = str(base_dir / ocr_result.image_dir)
+                else:
+                    # 如果未指定图片目录,使用结果目录
+                    src_img_dir = str(base_dir / ocr_result.result_dir / doc_name)
+                
+                # 生成数据源名称
+                if ocr_result.description:
+                    source_name = f"{doc_name}_{ocr_result.description}"
+                else:
+                    tool_config = self.get_ocr_tool(ocr_result.tool)
+                    tool_name = tool_config.name if tool_config else ocr_result.tool
+                    source_name = f"{doc_name}_{tool_name}"
+                
+                data_source = DataSource(
+                    name=source_name,
+                    ocr_tool=ocr_result.tool,
+                    ocr_out_dir=ocr_out_dir,
+                    src_img_dir=src_img_dir,
+                    description=ocr_result.description or f"{doc_name} 使用 {ocr_result.tool}"
+                )
+                
+                data_sources.append(data_source)
+        
+        return data_sources
+    
+    def get_config_value(self, key_path: str, default=None):
+        """
+        获取配置值(支持点号路径)
+        
+        Examples:
+            get_config_value('styles.font_size')
+            get_config_value('ocr.min_text_length')
+        """
+        keys = key_path.split('.')
+        value = self.global_config
+        
+        for key in keys:
+            if isinstance(value, dict):
+                value = value.get(key)
+            else:
+                return default
+        
+        return value if value is not None else default
+    
+    def to_validator_config(self) -> Dict:
+        """
+        转换为 OCRValidator 所需的配置格式
+        
+        Returns:
+            包含 data_sources 和 ocr.tools 的配置字典
+        """
+        # 构建 data_sources 列表
+        data_sources_list = []
+        for ds in self.get_data_sources():
+            data_sources_list.append({
+                'name': ds.name,
+                'ocr_tool': ds.ocr_tool,
+                'ocr_out_dir': ds.ocr_out_dir,
+                'src_img_dir': ds.src_img_dir
+            })
+        
+        # 构建 ocr.tools 字典
+        ocr_tools_dict = {}
+        for tool_id, tool_config in self.ocr_tools.items():
+            ocr_tools_dict[tool_id] = tool_config.to_dict()
+        
+        # 返回完整配置
+        config = self.global_config.copy()
+        config['data_sources'] = data_sources_list
+        
+        # 确保 ocr.tools 存在
+        if 'ocr' not in config:
+            config['ocr'] = {}
+        config['ocr']['tools'] = ocr_tools_dict
+        
+        return config
+
+
+# ============================================================================
+# 便捷函数
+# ============================================================================
+
+def load_config(config_dir: str = "config") -> ConfigManager:
+    """加载配置"""
+    return ConfigManager(config_dir)
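
ConfigManager 的典型调用流程示意(config 目录路径以实际部署为准):

```python
import logging
from config_manager import load_config

logging.basicConfig(level=logging.INFO)

cm = load_config(config_dir="config")
print(cm.list_documents())                # 所有文档名
print(cm.list_ocr_tools())                # 所有 OCR 工具 ID
for ds in cm.get_data_sources():          # 由文档配置自动生成的数据源
    print(ds.name, "->", ds.ocr_out_dir)

font_size = cm.get_config_value("styles.font_size", 8)  # 点号路径取配置值
validator_config = cm.to_validator_config()             # 供 OCRValidator 直接使用
```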

+ 0 - 816
merger/merge_mineru_paddle_ocr.1.py

@@ -1,816 +0,0 @@
-"""
-合并 MinerU 和 PaddleOCR 的结果
-使用 MinerU 的表格结构识别 + PaddleOCR 的文字框坐标
-"""
-import json
-import re
-import argparse
-from pathlib import Path
-from typing import List, Dict, Tuple, Optional
-from bs4 import BeautifulSoup
-from fuzzywuzzy import fuzz
-import shutil
-
-class MinerUPaddleOCRMerger:
-    """合并 MinerU 和 PaddleOCR 的结果"""
-    
-    def __init__(self, look_ahead_window: int = 10, similarity_threshold: int = 90):
-        """
-        Args:
-            look_ahead_window: 向前查找的窗口大小
-            similarity_threshold: 文本相似度阈值
-        """
-        self.look_ahead_window = look_ahead_window
-        self.similarity_threshold = similarity_threshold
-    
-    def merge_table_with_bbox(self, mineru_json_path: str, paddle_json_path: str) -> List[Dict]:
-        """
-        合并 MinerU 和 PaddleOCR 的结果
-        
-        Args:
-            mineru_json_path: MinerU 输出的 JSON 路径
-            paddle_json_path: PaddleOCR 输出的 JSON 路径
-            output_path: 输出路径(可选)
-        
-        Returns:
-            合并后的结果字典
-        """
-        merged_data = None
-        # 加载数据
-        with open(mineru_json_path, 'r', encoding='utf-8') as f:
-            mineru_data = json.load(f)
-        
-        with open(paddle_json_path, 'r', encoding='utf-8') as f:
-            paddle_data = json.load(f)
-        
-        # 提取 PaddleOCR 的文字框信息
-        paddle_text_boxes = self._extract_paddle_text_boxes(paddle_data)
-        
-        # 处理 MinerU 的数据
-        merged_data = self._process_mineru_data(mineru_data, paddle_text_boxes)
-        
-        return merged_data
-    
-    def _extract_paddle_text_boxes(self, paddle_data: Dict) -> List[Dict]:
-        """提取 PaddleOCR 的文字框信息"""
-        text_boxes = []
-        
-        if 'overall_ocr_res' in paddle_data:
-            ocr_res = paddle_data['overall_ocr_res']
-            rec_texts = ocr_res.get('rec_texts', [])
-            rec_polys = ocr_res.get('rec_polys', [])
-            rec_scores = ocr_res.get('rec_scores', [])
-
-            for i, (text, poly, score) in enumerate(zip(rec_texts, rec_polys, rec_scores)):
-                if text and text.strip():
-                    # 计算 bbox (x_min, y_min, x_max, y_max)
-                    xs = [p[0] for p in poly]
-                    ys = [p[1] for p in poly]
-                    bbox = [min(xs), min(ys), max(xs), max(ys)]
-                    
-                    text_boxes.append({
-                        'text': text,
-                        'bbox': bbox,
-                        'poly': poly,
-                        'score': score,
-                        'paddle_bbox_index': i,
-                        'used': False  # 标记是否已被使用
-                    })
-
-        return text_boxes
-    
-    def _process_mineru_data(self, mineru_data: List[Dict], 
-                            paddle_text_boxes: List[Dict]) -> List[Dict]:
-        """处理 MinerU 数据,添加 bbox 信息
-
-        Args:
-            mineru_data (List[Dict]): _description_
-            paddle_text_boxes (List[Dict]): _description_
-
-        Returns:
-            List[Dict]: _description_
-        """ 
-
-        merged_data = []
-        cells = None  # 存储所有表格单元格信息
-        paddle_pointer = 0  # PaddleOCR 文字框指针
-        last_matched_index = 0  # 上次匹配成功的索引
-
-        # 对mineru_data按bbox从上到下排序,从左到右确保顺序一致
-        mineru_data.sort(key=lambda x: (x['bbox'][1], x['bbox'][0]) if 'bbox' in x else (float('inf'), float('inf')))
-
-        for item in mineru_data:
-            if item['type'] == 'table':
-                # 处理表格
-                merged_item = item.copy()
-                table_html = item.get('table_body', '')
-                
-                # 解析 HTML 表格并添加 bbox
-                enhanced_html, cells, paddle_pointer = self._enhance_table_html_with_bbox(
-                    table_html, paddle_text_boxes, paddle_pointer
-                )
-                
-                merged_item['table_body'] = enhanced_html
-                merged_item['table_body_with_bbox'] = enhanced_html
-                merged_item['bbox_mapping'] = 'merged_from_paddle_ocr'
-                merged_item['table_cells'] = cells if cells else []
-                
-                merged_data.append(merged_item)
-            
-            elif item['type'] in ['text', 'title']:
-                # 处理普通文本
-                merged_item = item.copy()
-                text = item.get('text', '')
-                
-                # 查找匹配的 bbox
-                matched_bbox, paddle_pointer, last_matched_index = self._find_matching_bbox(
-                    text, paddle_text_boxes, paddle_pointer, last_matched_index
-                )
-                
-                if matched_bbox:
-                    # merged_item['bbox'] = matched_bbox['bbox']
-                    # merged_item['bbox_source'] = 'paddle_ocr'
-                    # merged_item['text_score'] = matched_bbox['score']
-
-                    # 沿用mineru的bbox, 就是要移动位置paddle_pointer, last_matched_index
-                    # 标记为已使用
-                    matched_bbox['used'] = True
-                
-                merged_data.append(merged_item)
-            elif item['type'] == 'list':
-                # 处理列表项
-                merged_item = item.copy()
-                list_items = item.get('list_items', [])
-                sub_type = item.get('sub_type', 'unordered')  # 有序或无序
-
-                for list_item in list_items:
-                    # 查找匹配的 bbox
-                    matched_bbox, paddle_pointer, last_matched_index = self._find_matching_bbox(
-                        list_item, paddle_text_boxes, paddle_pointer, last_matched_index
-                    )
-                    
-                    if matched_bbox:
-                        # 沿用mineru的bbox, 就是要移动位置paddle_pointer, last_matched_index
-                        # 标记为已使用
-                        matched_bbox['used'] = True
-                
-                merged_data.append(merged_item)
-            else:
-                # 其他类型直接复制
-                merged_data.append(item.copy())
-        
-        return merged_data
-    
-    def _enhance_table_html_with_bbox(self, html: str, paddle_text_boxes: List[Dict], 
-                                      start_pointer: int) -> Tuple[str, List[Dict], int]:
-        """
-        为 HTML 表格添加 bbox 信息
-        
-        Args:
-            html: 原始 HTML 表格
-            paddle_text_boxes: PaddleOCR 文字框列表
-            start_pointer: 起始指针位置
-        
-        Returns:
-            (增强后的 HTML, 单元格数组, 新的指针位置)
-        """
-        # 需要处理minerU识别为2个连着的cell,如: -741.00|357,259.63, paddle识别为一个cell,如: -741.00357,259.63
-        soup = BeautifulSoup(html, 'html.parser')
-        current_pointer = start_pointer
-        last_matched_index = start_pointer
-        cells = []  # 存储单元格的 bbox 信息
-
-        # 遍历所有行
-        for row_idx, row in enumerate(soup.find_all('tr')):
-            # 遍历所有单元格
-            for col_idx, cell in enumerate(row.find_all(['td', 'th'])):
-                cell_text = cell.get_text(strip=True)
-            
-                if not cell_text:
-                    continue
-                
-                # 查找匹配的 bbox
-                matched_bbox, current_pointer, last_matched_index = self._find_matching_bbox(
-                    cell_text, paddle_text_boxes, current_pointer, last_matched_index
-                )
-                
-                if matched_bbox:
-                    # 添加 data-bbox 属性
-                    bbox = matched_bbox['bbox']
-                    cell['data-bbox'] = f"[{bbox[0]},{bbox[1]},{bbox[2]},{bbox[3]}]"
-                    cell['data-score'] = f"{matched_bbox['score']:.4f}"
-                    cell['data-paddle-index'] = str(matched_bbox['paddle_bbox_index'])
-
-                    cells.append({
-                        'type': 'table_cell',
-                        'text': cell_text,
-                        'bbox': bbox,
-                        'row': row_idx+1,
-                        'col': col_idx+1,
-                        'score': matched_bbox['score'],
-                        'paddle_bbox_index': matched_bbox['paddle_bbox_index']
-                    })
-                    # 标记为已使用
-                    matched_bbox['used'] = True
-        
-        return str(soup), cells, current_pointer
-    
-    def _find_matching_bbox(self, target_text: str, text_boxes: List[Dict], 
-                           start_index: int, last_match_index: int) -> tuple[Optional[Dict], int, int]:
-        """
-        查找匹配的文字框
-        
-        Args:
-            target_text: 目标文本
-            text_boxes: 文字框列表
-            start_index: 起始索引, 是最后一个used=True的位置+1 
-            last_match_index: 上次匹配成功的索引, 可能比start_index小
-        
-        Returns:
-            (匹配的文字框信息, 新的指针位置, last_match_index)
-        """
-        target_text = self._normalize_text(target_text)
-        
-        # 过滤过短的目标文本
-        if len(target_text) < 2:
-            return None, start_index, last_match_index
-
-        # 由于minerU和Paddle的顺序基本一致, 也有不一致的地方, 所以需要向前找第一个未使用的位置
-        # MinerU和Paddle都可能识别错误,所以需要一个look_ahead_window来避免漏掉匹配
-        # 匹配时会遇到一些特殊情况,比如Paddle把两个连着的cell识别为一个字符串,MinerU将单元格上下2行识别为一行
-        # 	'1|2024-08-11|扫二维码付'   minerU识别为“扫二维码付款”,Paddle识别为'12024-08-11扫二维码付'  
-        #                  款
-        # 字符串的顺序极大概率是一致的,所以如果短字符串是长字符串的子串,可以增加相似权重
-
-        search_start = last_match_index - 1
-        unused_count = 0
-        while search_start >= 0:
-            if text_boxes[search_start]['used'] == False:
-                unused_count += 1
-            if unused_count >= self.look_ahead_window:
-                break
-            search_start -= 1
-        if search_start < 0:
-            search_start = 0
-            while search_start < start_index and text_boxes[search_start]['used']:
-                search_start += 1
-        search_end = min(start_index + self.look_ahead_window, len(text_boxes))
-        
-        best_match = None
-        best_index = start_index
-        
-        for i in range(search_start, search_end):
-            if text_boxes[i]['used']:
-                continue
-            
-            box_text = self._normalize_text(text_boxes[i]['text'])
-            # 精确匹配优先
-            if target_text == box_text:
-                if i >= start_index:
-                    return text_boxes[i], i + 1, i
-                else:
-                    return text_boxes[i], start_index, i
-            
-            # 过滤过短的候选文本(避免单字符匹配)
-            if len(box_text) < 2:
-                continue
-            
-            # 长度比例检查 - 避免长度差异过大的匹配
-            length_ratio = min(len(target_text), len(box_text)) / max(len(target_text), len(box_text))
-            if length_ratio < 0.3:  # 长度差异超过70%则跳过
-                continue
-
-            # 子串检查
-            shorter = target_text if len(target_text) < len(box_text) else box_text
-            longer = box_text if len(target_text) < len(box_text) else target_text
-            is_substring = shorter in longer        
-
-            # 计算多种相似度
-            # token_sort_ratio = fuzz.token_sort_ratio(target_text, box_text)
-            partial_ratio = fuzz.partial_ratio(target_text, box_text)
-            if is_substring:
-                partial_ratio += 10  # 子串时提升相似度
-            
-            # 综合相似度 - 两种算法都要达到阈值
-            if (partial_ratio >= self.similarity_threshold):
-                if i >= start_index:
-                    return text_boxes[i], i + 1, last_match_index
-                else:
-                    return text_boxes[i], start_index, last_match_index
-
-        return best_match, best_index, last_match_index
-
-    def _normalize_text(self, text: str) -> str:
-        """标准化文本(去除空格、标点等)"""
-        # 移除所有空白字符
-        text = re.sub(r'\s+', '', text)
-        # 转换全角数字和字母为半角
-        text = self._full_to_half(text)
-        return text.lower()
-    
-    def _full_to_half(self, text: str) -> str:
-        """全角转半角"""
-        result = []
-        for char in text:
-            code = ord(char)
-            if code == 0x3000:  # 全角空格
-                code = 0x0020
-            elif 0xFF01 <= code <= 0xFF5E:  # 全角字符
-                code -= 0xFEE0
-            result.append(chr(code))
-        return ''.join(result)
-    
-    def generate_enhanced_markdown(self, merged_data: List[Dict], 
-                                   output_path: Optional[str] = None, mineru_file: Optional[str] = None) -> str:
-        """
-        生成增强的 Markdown(包含 bbox 信息的注释)
-        参考 MinerU 的实现,支持标题、列表、表格标题等
-        
-        Args:
-            merged_data: 合并后的数据
-            output_path: 输出路径(可选)
-            mineru_file: MinerU 源文件路径(用于复制图片)
-        
-        Returns:
-            Markdown 内容
-        """
-        md_lines = []
-        
-        for item in merged_data:
-            item_type = item.get('type', '')
-            bbox = item.get('bbox', [])
-            
-            # 添加 bbox 注释
-            if bbox:
-                md_lines.append(f"<!-- bbox: {bbox} -->")
-            
-            # 根据类型处理
-            if item_type == 'title':
-                # 标题 - 使用 text_level 确定标题级别
-                text = item.get('text', '')
-                text_level = item.get('text_level', 1)
-                heading = '#' * min(text_level, 6)  # 最多6级标题
-                md_lines.append(f"{heading} {text}\n")
-            
-            elif item_type == 'text':
-                # 普通文本 - 可能也有 text_level
-                text = item.get('text', '')
-                text_level = item.get('text_level', 0)
-                
-                if text_level > 0:
-                    # 作为标题处理
-                    heading = '#' * min(text_level, 6)
-                    md_lines.append(f"{heading} {text}\n")
-                else:
-                    # 普通段落
-                    md_lines.append(f"{text}\n")
-            
-            elif item_type == 'list':
-                # 列表
-                sub_type = item.get('sub_type', 'text')
-                list_items = item.get('list_items', [])
-                
-                for list_item in list_items:
-                    md_lines.append(f"{list_item}\n")
-                
-                md_lines.append("")  # 列表后添加空行
-            
-            elif item_type == 'table':
-                # 表格标题
-                table_caption = item.get('table_caption', [])
-                if table_caption:
-                    for caption in table_caption:
-                        if caption:  # 跳过空标题
-                            md_lines.append(f"**{caption}**\n")
-                
-                # 表格内容
-                table_body = item.get('table_body_with_bbox', item.get('table_body', ''))
-                if table_body:
-                    md_lines.append(table_body)
-                    md_lines.append("")
-                
-                # 表格脚注
-                table_footnote = item.get('table_footnote', [])
-                if table_footnote:
-                    for footnote in table_footnote:
-                        if footnote:
-                            md_lines.append(f"*{footnote}*")
-                    md_lines.append("")
-            
-            elif item_type == 'image':
-                # 图片
-                img_path = item.get('img_path', '')
-                
-                # 复制图片到输出目录
-                if img_path and mineru_file and output_path:
-                    mineru_dir = Path(mineru_file).parent
-                    img_full_path = mineru_dir / img_path
-                    if img_full_path.exists():
-                        output_img_path = Path(output_path).parent / img_path
-                        output_img_path.parent.mkdir(parents=True, exist_ok=True)
-                        shutil.copy(img_full_path, output_img_path)
-                
-                # 图片标题
-                image_caption = item.get('image_caption', [])
-                if image_caption:
-                    for caption in image_caption:
-                        if caption:
-                            md_lines.append(f"**{caption}**\n")
-                
-                # 插入图片
-                md_lines.append(f"![Image]({img_path})\n")
-                
-                # 图片脚注
-                image_footnote = item.get('image_footnote', [])
-                if image_footnote:
-                    for footnote in image_footnote:
-                        if footnote:
-                            md_lines.append(f"*{footnote}*")
-                    md_lines.append("")
-            
-            elif item_type == 'equation':
-                # 公式
-                latex = item.get('latex', '')
-                if latex:
-                    md_lines.append(f"$$\n{latex}\n$$\n")
-            
-            elif item_type == 'inline_equation':
-                # 行内公式
-                latex = item.get('latex', '')
-                if latex:
-                    md_lines.append(f"${latex}$\n")
-            
-            elif item_type == 'page_number':
-                # 页码 - 通常跳过或作为注释
-                text = item.get('text', '')
-                md_lines.append(f"<!-- 页码: {text} -->\n")
-            
-            elif item_type == 'header':
-                # 页眉
-                text = item.get('text', '')
-                md_lines.append(f"<!-- 页眉: {text} -->\n")
-            
-            elif item_type == 'footer':
-                # 页脚
-                text = item.get('text', '')
-                if text:
-                    md_lines.append(f"<!-- 页脚: {text} -->\n")
-            
-            elif item_type == 'reference':
-                # 参考文献
-                text = item.get('text', '')
-                md_lines.append(f"> {text}\n")
-            
-            else:
-                # 未知类型 - 尝试提取文本
-                text = item.get('text', '')
-                if text:
-                    md_lines.append(f"{text}\n")
-        
-        markdown_content = '\n'.join(md_lines)
-        
-        # 保存文件
-        if output_path:
-            with open(output_path, 'w', encoding='utf-8') as f:
-                f.write(markdown_content)
-        
-        return markdown_content
-
-    def extract_table_cells_with_bbox(self, merged_data: List[Dict]) -> List[Dict]:
-        """
-        提取所有表格单元格及其 bbox 信息
-        
-        Returns:
-            单元格列表,每个包含 text, bbox, row, col 等信息
-        """
-        cells = []
-        
-        for item in merged_data:
-            if item['type'] != 'table':
-                continue
-            
-            html = item.get('table_body_with_bbox', item.get('table_body', ''))
-            soup = BeautifulSoup(html, 'html.parser')
-            
-            # 遍历所有行
-            for row_idx, row in enumerate(soup.find_all('tr')):
-                # 遍历所有单元格
-                for col_idx, cell in enumerate(row.find_all(['td', 'th'])):
-                    cell_text = cell.get_text(strip=True)
-                    bbox_str = cell.get('data-bbox', '')
-                    
-                    if bbox_str:
-                        try:
-                            bbox = json.loads(bbox_str)
-                            cells.append({
-                                'text': cell_text,
-                                'bbox': bbox,
-                                'row': row_idx,
-                                'col': col_idx,
-                                'score': float(cell.get('data-score', 0)),
-                                'paddle_index': int(cell.get('data-paddle-index', -1))
-                            })
-                        except (json.JSONDecodeError, ValueError):
-                            pass
-        
-        return cells
-
-
-def merge_single_file(mineru_file: Path, paddle_file: Path, output_dir: Path, 
-                     output_format: str, merger: MinerUPaddleOCRMerger) -> bool:
-    """
-    合并单个文件
-    
-    Args:
-        mineru_file: MinerU JSON 文件路径
-        paddle_file: PaddleOCR JSON 文件路径
-        output_dir: 输出目录
-        merger: 合并器实例
-    
-    Returns:
-        是否成功
-    """
-    print(f"📄 处理: {mineru_file.name}")
-    
-    # 输出文件路径
-    merged_md_path = output_dir / f"{mineru_file.stem}.md"
-    merged_json_path = output_dir / f"{mineru_file.stem}.json"
-    
-    try:
-        # 合并数据
-        merged_data = merger.merge_table_with_bbox(
-            str(mineru_file),
-            str(paddle_file)
-        )
-        
-        # 生成 Markdown
-        if output_format in ['markdown', 'both']:
-            merger.generate_enhanced_markdown(merged_data, str(merged_md_path), mineru_file)
-        
-        # 提取单元格信息
-        # cells = merger.extract_table_cells_with_bbox(merged_data)
-        if output_format in ['json', 'both']:
-            with open(merged_json_path, 'w', encoding='utf-8') as f:
-                json.dump(merged_data, f, ensure_ascii=False, indent=2)
-
-        print(f"  ✅ 合并完成")
-        print(f"  📊 共处理了 {len(merged_data)} 个对象")
-        print(f"  💾 输出文件:")
-        if output_format in ['markdown', 'both']:
-            print(f"    - {merged_md_path.name}")
-        if output_format in ['json', 'both']:
-            print(f"    - {merged_json_path.name}")
-
-        return True
-        
-    except Exception as e:
-        print(f"  ❌ 处理失败: {e}")
-        import traceback
-        traceback.print_exc()
-        return False
-
-
-def merge_mineru_paddle_batch(mineru_dir: str, paddle_dir: str, output_dir: str, output_format: str = 'both',
-                              look_ahead_window: int = 10, 
-                              similarity_threshold: int = 80):
-    """
-    批量合并 MinerU 和 PaddleOCR 的结果
-    
-    Args:
-        mineru_dir: MinerU 结果目录
-        paddle_dir: PaddleOCR 结果目录
-        output_dir: 输出目录
-        look_ahead_window: 向前查找窗口大小
-        similarity_threshold: 相似度阈值
-    """
-    mineru_path = Path(mineru_dir)
-    paddle_path = Path(paddle_dir)
-    output_path = Path(output_dir)
-    output_path.mkdir(parents=True, exist_ok=True)
-    
-    merger = MinerUPaddleOCRMerger(
-        look_ahead_window=look_ahead_window, 
-        similarity_threshold=similarity_threshold
-    )
-    
-    # 查找所有 MinerU 的 JSON 文件
-    mineru_files = list(mineru_path.glob('*_page_*[0-9].json'))
-    mineru_files.sort()
-    
-    print(f"\n🔍 找到 {len(mineru_files)} 个 MinerU 文件")
-    print(f"📂 MinerU 目录: {mineru_dir}")
-    print(f"📂 PaddleOCR 目录: {paddle_dir}")
-    print(f"📂 输出目录: {output_dir}")
-    print(f"⚙️  查找窗口: {look_ahead_window}")
-    print(f"⚙️  相似度阈值: {similarity_threshold}%\n")
-    
-    success_count = 0
-    failed_count = 0
-    
-    for mineru_file in mineru_files:
-        # 查找对应的 PaddleOCR 文件
-        paddle_file = paddle_path / mineru_file.name
-        
-        if not paddle_file.exists():
-            print(f"⚠️  跳过: 未找到对应的 PaddleOCR 文件: {paddle_file.name}\n")
-            failed_count += 1
-            continue
-
-        if merge_single_file(mineru_file, paddle_file, output_path, output_format, merger):
-            success_count += 1
-        else:
-            failed_count += 1
-        
-        print()  # 空行分隔
-    
-    # 打印统计信息
-    print("=" * 60)
-    print(f"✅ 处理完成!")
-    print(f"📊 统计信息:")
-    print(f"  - 总文件数: {len(mineru_files)}")
-    print(f"  - 成功: {success_count}")
-    print(f"  - 失败: {failed_count}")
-    print("=" * 60)
-
-
-def main():
-    """主函数"""
-    parser = argparse.ArgumentParser(
-        description='合并 MinerU 和 PaddleOCR 的识别结果,添加 bbox 坐标信息',
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-示例用法:
-
-  1. 批量处理整个目录:
-     python merge_mineru_paddle_ocr.py \\
-         --mineru-dir /path/to/mineru/results \\
-         --paddle-dir /path/to/paddle/results \\
-         --output-dir /path/to/output
-
-  2. 处理单个文件:
-     python merge_mineru_paddle_ocr.py \\
-         --mineru-file /path/to/file_page_001.json \\
-         --paddle-file /path/to/file_page_001.json \\
-         --output-dir /path/to/output
-
-  3. 自定义参数:
-     python merge_mineru_paddle_ocr.py \\
-         --mineru-dir /path/to/mineru \\
-         --paddle-dir /path/to/paddle \\
-         --output-dir /path/to/output \\
-         --window 15 \\
-         --threshold 85
-        """
-    )
-    
-    # 文件/目录参数
-    file_group = parser.add_argument_group('文件参数')
-    file_group.add_argument(
-        '--mineru-file', 
-        type=str,
-        help='MinerU 输出的 JSON 文件路径(单文件模式)'
-    )
-    file_group.add_argument(
-        '--paddle-file', 
-        type=str,
-        help='PaddleOCR 输出的 JSON 文件路径(单文件模式)'
-    )
-    
-    dir_group = parser.add_argument_group('目录参数')
-    dir_group.add_argument(
-        '--mineru-dir', 
-        type=str,
-        help='MinerU 结果目录(批量模式)'
-    )
-    dir_group.add_argument(
-        '--paddle-dir', 
-        type=str,
-        help='PaddleOCR 结果目录(批量模式)'
-    )
-    
-    # 输出参数
-    output_group = parser.add_argument_group('输出参数')
-    output_group.add_argument(
-        '-o', '--output-dir',
-        type=str,
-        required=True,
-        help='输出目录(必需)'
-    )
-    output_group.add_argument(
-        '-f', '--format', 
-        choices=['json', 'markdown', 'both'], 
-        default='both', help='输出格式'
-    )
-
-    # 算法参数
-    algo_group = parser.add_argument_group('算法参数')
-    algo_group.add_argument(
-        '-w', '--window',
-        type=int,
-        default=15,
-        help='向前查找的窗口大小(默认: 10)'
-    )
-    algo_group.add_argument(
-        '-t', '--threshold',
-        type=int,
-        default=80,
-        help='文本相似度阈值(0-100,默认: 80)'
-    )
-    
-    args = parser.parse_args()
-    output_format = args.format.lower()
-    
-    # 验证参数
-    if args.mineru_file and args.paddle_file:
-        # 单文件模式
-        mineru_file = Path(args.mineru_file)
-        paddle_file = Path(args.paddle_file)
-        output_dir = Path(args.output_dir)
-        
-        if not mineru_file.exists():
-            print(f"❌ 错误: MinerU 文件不存在: {mineru_file}")
-            return
-        
-        if not paddle_file.exists():
-            print(f"❌ 错误: PaddleOCR 文件不存在: {paddle_file}")
-            return
-        
-        output_dir.mkdir(parents=True, exist_ok=True)
-        
-        print("\n🔧 单文件处理模式")
-        print(f"📄 MinerU 文件: {mineru_file}")
-        print(f"📄 PaddleOCR 文件: {paddle_file}")
-        print(f"📂 输出目录: {output_dir}")
-        print(f"⚙️  查找窗口: {args.window}")
-        print(f"⚙️  相似度阈值: {args.threshold}%\n")
-        
-        merger = MinerUPaddleOCRMerger(
-            look_ahead_window=args.window,
-            similarity_threshold=args.threshold
-        )
-        
-        success = merge_single_file(mineru_file, paddle_file, output_dir, output_format, merger)
-        
-        if success:
-            print("\n✅ 处理完成!")
-        else:
-            print("\n❌ 处理失败!")
-    
-    elif args.mineru_dir and args.paddle_dir:
-        # 批量模式
-        if not Path(args.mineru_dir).exists():
-            print(f"❌ 错误: MinerU 目录不存在: {args.mineru_dir}")
-            return
-        
-        if not Path(args.paddle_dir).exists():
-            print(f"❌ 错误: PaddleOCR 目录不存在: {args.paddle_dir}")
-            return
-        
-        print("\n🔧 批量处理模式")
-        
-        merge_mineru_paddle_batch(
-            args.mineru_dir,
-            args.paddle_dir,
-            args.output_dir,
-            output_format=output_format,
-            look_ahead_window=args.window,
-            similarity_threshold=args.threshold
-        )
-    
-    else:
-        parser.print_help()
-        print("\n❌ 错误: 请指定单文件模式或批量模式的参数")
-        print("  单文件模式: --mineru-file 和 --paddle-file")
-        print("  批量模式: --mineru-dir 和 --paddle-dir")
-
-if __name__ == "__main__":
-    print("🚀 启动 MinerU + PaddleOCR 合并程序...")
-    
-    import sys
-    
-    if len(sys.argv) == 1:
-        # 如果没有命令行参数,使用默认配置运行
-        print("ℹ️  未提供命令行参数,使用默认配置运行...")
-        
-        # 默认配置
-        default_config = {
-            "mineru-file": "/Users/zhch158/workspace/data/流水分析/对公_招商银行图/mineru-vlm-2.5.3_Results/对公_招商银行图_page_001.json",
-            "paddle-file": "/Users/zhch158/workspace/data/流水分析/对公_招商银行图/data_PPStructureV3_Results/对公_招商银行图_page_001.json",
-            "output-dir": "/Users/zhch158/workspace/data/流水分析/对公_招商银行图/merged_results",
-            # "mineru-dir": "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照/mineru-vlm-2.5.3_Results",
-            # "paddle-dir": "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照/data_PPStructureV3_Results",
-            # "output-dir": "/Users/zhch158/workspace/data/流水分析/德_内蒙古银行照/merged_results",
-            "format": "both",
-            "window": "15",
-            "threshold": "85"
-        }
-        
-        print("⚙️  默认参数:")
-        for key, value in default_config.items():
-            print(f"  --{key}: {value}")
-        # 构造参数
-        sys.argv = [sys.argv[0]]
-        for key, value in default_config.items():
-            sys.argv.extend([f"--{key}", str(value)])
-    
-    sys.exit(main())

+ 114 - 10
streamlit_ocr_validator.py

@@ -17,6 +17,7 @@ from streamlit_validator_cross import (
 )
 from streamlit_validator_result import display_single_page_cross_validation
 from ocr_validator_utils import get_data_source_display_name
+from config_manager import load_config  # 🎯 使用新配置管理器
 
 
 def reset_cross_validation_results():
@@ -28,22 +29,37 @@ def reset_cross_validation_results():
 
 def main():
     """主应用"""
+    # 🎯 初始化配置管理器
+    if 'config_manager' not in st.session_state:
+        try:
+            st.session_state.config_manager = load_config(config_dir="config")
+            # 🎯 生成 OCRValidator 所需的配置
+            st.session_state.validator_config = st.session_state.config_manager.to_validator_config()
+            print("✅ 配置管理器初始化成功")
+            print(f"📄 发现 {len(st.session_state.config_manager.list_documents())} 个文档配置")
+            print(f"🔧 发现 {len(st.session_state.config_manager.list_ocr_tools())} 个 OCR 工具")
+        except Exception as e:
+            st.error(f"❌ 配置加载失败: {e}")
+            st.stop()
+    
+    config_manager = st.session_state.config_manager
+    validator_config = config_manager.to_validator_config()
+    
     # 初始化应用
     if 'validator' not in st.session_state:
-        validator = StreamlitOCRValidator()
+        # 🎯 直接传递配置字典给 OCRValidator
+        validator = StreamlitOCRValidator(config_dict=validator_config)
         st.session_state.validator = validator
-        setup_page_config(validator.config)
+        setup_page_config(validator_config)
         
         # 页面标题
-        config = st.session_state.validator.config
-        st.title(config['ui']['page_title'])
+        st.title(validator_config['ui']['page_title'])
         
         # 初始化数据源追踪
         st.session_state.current_ocr_source = validator.current_source_key
         st.session_state.current_verify_source = validator.verify_source_key
     else:
         validator = st.session_state.validator
-        config = st.session_state.validator.config
     
     if 'selected_text' not in st.session_state:
         st.session_state.selected_text = None
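这个 hunk 把配置加载挪进 `st.session_state`,保证整个会话只加载一次、rerun 时直接复用。该缓存模式的独立示意(`expensive_load` 为假设的占位函数,实际项目中对应 `config_manager.load_config`):

```python
import streamlit as st

def expensive_load() -> dict:
    """假设的加载函数,代表一次开销较大的配置读取"""
    return {"ui": {"page_title": "demo"}}

# Streamlit 每次交互都会重跑脚本,所以把结果缓存进 session_state
if "config" not in st.session_state:
    try:
        st.session_state.config = expensive_load()
    except Exception as e:
        st.error(f"配置加载失败: {e}")
        st.stop()  # 加载失败时终止本次渲染

config = st.session_state.config
st.title(config["ui"]["page_title"])
```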
@@ -84,6 +100,44 @@ def main():
     
     # 如果没有可用的数据源,提前返回
     if not validator.all_sources:
+        st.warning("⚠️ 未找到任何数据源,请检查配置文件")
+        
+        # 🎯 显示配置信息帮助调试
+        with st.expander("🔍 配置信息", expanded=True):
+            st.write("**已加载的文档:**")
+            docs = config_manager.list_documents()
+            if docs:
+                for doc in docs:
+                    doc_config = config_manager.get_document(doc)
+                    st.write(f"- **{doc}**")
+                    st.write(f"  - 基础目录: `{doc_config.base_dir}`")
+                    st.write(f"  - OCR 结果: {len([r for r in doc_config.ocr_results if r.enabled])} 个已启用")
+            else:
+                st.write("无")
+            
+            st.write("**已加载的 OCR 工具:**")
+            tools = config_manager.list_ocr_tools()
+            if tools:
+                for tool in tools:
+                    tool_config = config_manager.get_ocr_tool(tool)
+                    st.write(f"- **{tool_config.name}** (`{tool}`)")
+            else:
+                st.write("无")
+            
+            st.write("**配置文件路径:**")
+            st.code(str(config_manager.config_dir / "global.yaml"))
+            
+            st.write("**生成的数据源:**")
+            data_sources = config_manager.get_data_sources()
+            if data_sources:
+                for ds in data_sources:
+                    st.write(f"- `{ds.name}`")
+                    st.write(f"  - 工具: {ds.ocr_tool}")
+                    st.write(f"  - 结果目录: {ds.ocr_out_dir}")
+                    st.write(f"  - 图片目录: {ds.src_img_dir}")
+            else:
+                st.write("无")
+        
         st.stop()
     
     # 文件选择区域
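上面的调试面板依赖 `config_manager` 的一组查询接口。按这些调用方式可以反推出如下接口草图(字段与方法签名均由调用处推断,仅为假设,并非真实实现):

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

@dataclass
class OCRResult:
    tool: str                # OCR 工具 ID
    result_dir: str          # 结果目录
    enabled: bool = True
    description: str = ""

@dataclass
class DocumentConfig:
    name: str
    base_dir: str
    ocr_results: List[OCRResult] = field(default_factory=list)

@dataclass
class OCRToolConfig:
    name: str
    description: str = ""

@dataclass
class DataSource:
    name: str
    ocr_tool: str
    ocr_out_dir: str
    src_img_dir: str

class ConfigManagerSketch:
    """按上文调用反推的接口草图,方法体省略"""
    config_dir: Path

    def list_documents(self) -> List[str]: ...
    def get_document(self, name: str) -> Optional[DocumentConfig]: ...
    def list_ocr_tools(self) -> List[str]: ...
    def get_ocr_tool(self, tool_id: str) -> Optional[OCRToolConfig]: ...
    def get_data_sources(self) -> List[DataSource]: ...
```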
@@ -170,7 +224,7 @@ def main():
             show_batch_cross_validation_results_dialog()
 
     # 显示当前数据源统计信息
-    with st.expander("🔧 OCR工具统计信息", expanded=False):
+    with st.expander("统� OCR工具计信息", expanded=False):
         stats = validator.get_statistics()
         col1, col2, col3, col4, col5 = st.columns(5)
         
@@ -184,20 +238,70 @@ def main():
             st.metric("✅ 准确率", f"{stats['accuracy_rate']:.1f}%")
         with col5:
             if validator.current_source_config:
-                tool_display = validator.current_source_config['ocr_tool'].upper()
+                tool_id = validator.current_source_config['ocr_tool']
+                # 🎯 从配置管理器获取工具名称
+                tool_config = config_manager.get_ocr_tool(tool_id)
+                tool_display = tool_config.name if tool_config else tool_id.upper()
                 st.metric("🔧 OCR工具", tool_display)
         
         if stats['tool_info']:
             st.write("**详细信息:**", stats['tool_info'])
+        
+        # 🎯 显示当前文档和 OCR 结果信息
+        if validator.current_source_config:
+            source_name = validator.current_source_config['name']
+            # 解析数据源名称,提取文档名(更精确的解析)
+            parts = source_name.split('_', 1)
+            doc_name = parts[0] if parts else source_name
+            
+            doc_config = config_manager.get_document(doc_name)
+            if doc_config:
+                st.write("**文档信息:**")
+                st.write(f"- 文档名称: {doc_config.name}")
+                st.write(f"- 基础目录: {doc_config.base_dir}")
+                st.write(f"- 可用 OCR 工具: {len([r for r in doc_config.ocr_results if r.enabled])} 个")
+    
+    # 🎯 添加配置管理面板
+    with st.expander("⚙️ 配置管理", expanded=False):
+        col1, col2 = st.columns(2)
+        
+        with col1:
+            st.subheader("📄 已加载文档")
+            docs = config_manager.list_documents()
+            for doc_name in docs:
+                doc_config = config_manager.get_document(doc_name)
+                enabled_count = len([r for r in doc_config.ocr_results if r.enabled])
+                total_count = len(doc_config.ocr_results)
+                
+                with st.container():
+                    st.write(f"✅ **{doc_name}**")
+                    st.caption(f"📊 {enabled_count}/{total_count} 工具已启用")
+                    
+                    # 显示每个 OCR 工具的状态
+                    for ocr_result in doc_config.ocr_results:
+                        status_icon = "🟢" if ocr_result.enabled else "⚪"
+                        tool_config = config_manager.get_ocr_tool(ocr_result.tool)
+                        tool_name = tool_config.name if tool_config else ocr_result.tool
+                        st.caption(f"  {status_icon} {tool_name} - {ocr_result.description or ocr_result.result_dir}")
+        
+        with col2:
+            st.subheader("🔧 已加载 OCR 工具")
+            tools = config_manager.list_ocr_tools()
+            for tool_id in tools:
+                tool_config = config_manager.get_ocr_tool(tool_id)
+                with st.container():
+                    st.write(f"🔧 **{tool_config.name}**")
+                    st.caption(f"ID: `{tool_id}`")
+                    st.caption(f"描述: {tool_config.description}")
     
     tab1, tab2, tab3 = st.tabs(["📄 内容人工检查", "🔍 交叉验证结果", "📊 表格分析"])
     
     with tab1:
-        validator.create_compact_layout(config)
+        validator.create_compact_layout(validator_config)
 
     with tab2:
         # ✅ 使用封装的函数显示单页交叉验证结果
-        display_single_page_cross_validation(validator, config)
+        display_single_page_cross_validation(validator, validator_config)
 
     with tab3:
         st.header("📊 表格数据分析")
@@ -207,7 +311,7 @@ def main():
             display_html_table_as_dataframe(validator.md_content)
         else:
             st.info("当前OCR结果中没有检测到表格数据")
-    
+
 
 if __name__ == "__main__":
     main()

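`to_validator_config()` 返回的字典至少要满足上文出现过的访问路径。一个覆盖这些读取的最小结构示意(只列出 diff 中实际用到的键,具体字段组织为假设):

```python
# 最小示意:满足 validator_config['ui']['page_title'] 等访问路径
validator_config = {
    "ui": {
        "page_title": "OCR可视化校验工具",  # setup_page_config / st.title 读取
    },
    # 数据源等其余配置由 OCRValidator 内部消费,结构未在本 diff 中出现,为假设
}
print(validator_config["ui"]["page_title"])
```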
+ 0 - 1496
streamlit_ocr_validator_v1.py

@@ -1,1496 +0,0 @@
-#!/usr/bin/env python3
-"""
-基于Streamlit的OCR可视化校验工具(重构版)
-提供丰富的交互组件和更好的用户体验
-"""
-
-import streamlit as st
-from pathlib import Path
-from PIL import Image
-from typing import Dict, List, Optional
-import plotly.graph_objects as go
-from io import BytesIO
-import pandas as pd
-import numpy as np
-import plotly.express as px
-import json
-
-# 导入工具模块
-from ocr_validator_utils import (
-    load_config, load_ocr_data_file, process_ocr_data,
-    get_ocr_statistics,
-    find_available_ocr_files,
-    group_texts_by_category,
-    find_available_ocr_files_multi_source, get_data_source_display_name
-)
-from ocr_validator_file_utils import (
-    load_css_styles,
-    draw_bbox_on_image,
-    convert_html_table_to_markdown,
-    parse_html_tables, 
-    create_dynamic_css,
-    export_tables_to_excel, 
-    get_table_statistics,
-)
-from ocr_validator_layout import OCRLayoutManager
-from ocr_by_vlm import ocr_with_vlm
-from compare_ocr_results import compare_ocr_results
-
-
-class StreamlitOCRValidator:
-    def __init__(self):
-        self.config = load_config()
-        self.ocr_data = []
-        self.md_content = ""
-        self.image_path = ""
-        self.text_bbox_mapping = {}
-        self.selected_text = None
-        self.marked_errors = set()
-        
-        # 多数据源相关
-        self.all_sources = {}
-        self.current_source_key = None
-        self.current_source_config = None
-        self.file_info = []
-        self.selected_file_index = -1
-        self.display_options = []
-        self.file_paths = []
-        
-        # ✅ 新增:交叉验证数据源
-        self.verify_source_key = None
-        self.verify_source_config = None
-        self.verify_file_info = []
-        self.verify_display_options = []
-        self.verify_file_paths = []
-
-        # 初始化布局管理器
-        self.layout_manager = OCRLayoutManager(self)
-
-        # 加载多数据源文件信息
-        self.load_multi_source_info()
-        
-    def load_multi_source_info(self):
-        """加载多数据源文件信息"""
-        self.all_sources = find_available_ocr_files_multi_source(self.config)
-        
-        # 如果有数据源,默认选择第一个作为OCR源
-        if self.all_sources:
-            source_keys = list(self.all_sources.keys())
-            first_source_key = source_keys[0]
-            self.switch_to_source(first_source_key)
-            
-            # 如果有第二个数据源,默认作为验证源
-            if len(source_keys) > 1:
-                self.switch_to_verify_source(source_keys[1])
-    
-    def switch_to_source(self, source_key: str):
-        """切换到指定OCR数据源"""
-        if source_key in self.all_sources:
-            self.current_source_key = source_key
-            source_data = self.all_sources[source_key]
-            self.current_source_config = source_data['config']
-            self.file_info = source_data['files']
-            
-            if self.file_info:
-                # 创建显示选项列表
-                self.display_options = [f"{info['display_name']}" for info in self.file_info]
-                self.file_paths = [info['path'] for info in self.file_info]
-                
-                # 重置文件选择
-                self.selected_file_index = -1
-                print(f"✅ 切换到OCR数据源: {source_key}")
-            else:
-                print(f"⚠️ 数据源 {source_key} 没有可用文件")
-    
-    def switch_to_verify_source(self, source_key: str):
-        """切换到指定验证数据源"""
-        if source_key in self.all_sources:
-            self.verify_source_key = source_key
-            source_data = self.all_sources[source_key]
-            self.verify_source_config = source_data['config']
-            self.verify_file_info = source_data['files']
-            
-            if self.verify_file_info:
-                self.verify_display_options = [f"{info['display_name']}" for info in self.verify_file_info]
-                self.verify_file_paths = [info['path'] for info in self.verify_file_info]
-                print(f"✅ 切换到验证数据源: {source_key}")
-            else:
-                print(f"⚠️ 验证数据源 {source_key} 没有可用文件")
-
-    def setup_page_config(self):
-        """设置页面配置"""
-        ui_config = self.config['ui']
-        st.set_page_config(
-            page_title=ui_config['page_title'],
-            page_icon=ui_config['page_icon'],
-            layout=ui_config['layout'],
-            initial_sidebar_state=ui_config['sidebar_state']
-        )
-        
-        # 加载CSS样式
-        css_content = load_css_styles()
-        st.markdown(f"<style>{css_content}</style>", unsafe_allow_html=True)
-
-    def create_data_source_selector(self):
-        """创建双数据源选择器 - 支持交叉验证"""
-        if not self.all_sources:
-            st.warning("❌ 未找到任何数据源,请检查配置文件")
-            return
-        
-        # 准备数据源选项
-        source_options = {}
-        for source_key, source_data in self.all_sources.items():
-            display_name = get_data_source_display_name(source_data['config'])
-            source_options[display_name] = source_key
-        
-        # 创建两列布局
-        col1, col2 = st.columns(2)
-        
-        with col1:
-            st.markdown("#### 📊 OCR数据源")
-            # OCR数据源选择
-            current_display_name = None
-            if self.current_source_key:
-                for display_name, key in source_options.items():
-                    if key == self.current_source_key:
-                        current_display_name = display_name
-                        break
-            
-            selected_ocr_display = st.selectbox(
-                "选择OCR数据源",
-                options=list(source_options.keys()),
-                index=list(source_options.keys()).index(current_display_name) if current_display_name else 0,
-                key="ocr_source_selector",
-                label_visibility="collapsed",
-                help="选择要分析的OCR数据源"
-            )
-            
-            selected_ocr_key = source_options[selected_ocr_display]
-            
-            # 如果OCR数据源发生变化,切换数据源
-            if selected_ocr_key != self.current_source_key:
-                self.switch_to_source(selected_ocr_key)
-                if 'selected_file_index' in st.session_state:
-                    st.session_state.selected_file_index = 0
-                st.rerun()
-            
-            # 显示OCR数据源信息
-            if self.current_source_config:
-                with st.expander("📋 OCR数据源详情", expanded=False):
-                    st.write(f"**工具:** {self.current_source_config['ocr_tool']}")
-                    st.write(f"**文件数:** {len(self.file_info)}")
-        
-        with col2:
-            st.markdown("#### 🔍 验证数据源")
-            # 验证数据源选择
-            verify_display_name = None
-            if self.verify_source_key:
-                for display_name, key in source_options.items():
-                    if key == self.verify_source_key:
-                        verify_display_name = display_name
-                        break
-            
-            selected_verify_display = st.selectbox(
-                "选择验证数据源",
-                options=list(source_options.keys()),
-                index=list(source_options.keys()).index(verify_display_name) if verify_display_name else (1 if len(source_options) > 1 else 0),
-                key="verify_source_selector",
-                label_visibility="collapsed",
-                help="选择用于交叉验证的数据源"
-            )
-            
-            selected_verify_key = source_options[selected_verify_display]
-            
-            # 如果验证数据源发生变化,切换数据源
-            if selected_verify_key != self.verify_source_key:
-                self.switch_to_verify_source(selected_verify_key)
-                st.rerun()
-            
-            # 显示验证数据源信息
-            if self.verify_source_config:
-                with st.expander("📋 验证数据源详情", expanded=False):
-                    st.write(f"**工具:** {self.verify_source_config['ocr_tool']}")
-                    st.write(f"**文件数:** {len(self.verify_file_info)}")
-        
-        # 数据源对比提示
-        if self.current_source_key == self.verify_source_key:
-            st.warning("⚠️ OCR数据源和验证数据源相同,建议选择不同的数据源进行交叉验证")
-        else:
-            st.success(f"✅ 已选择 {selected_ocr_display} 与 {selected_verify_display} 进行交叉验证")    
-    
-    def load_ocr_data(self, json_path: str, md_path: Optional[str] = None, image_path: Optional[str] = None):
-        """加载OCR相关数据 - 支持多数据源配置"""
-        try:
-            # 使用当前数据源的配置加载数据
-            if self.current_source_config:
-                # 临时修改config以使用当前数据源的配置
-                temp_config = self.config.copy()
-                temp_config['paths'] = {
-                    'ocr_out_dir': self.current_source_config['ocr_out_dir'],
-                    'src_img_dir': self.current_source_config.get('src_img_dir', ''),
-                    'pre_validation_dir': self.config['pre_validation']['out_dir']
-                }
-                
-                # 设置OCR工具类型
-                temp_config['current_ocr_tool'] = self.current_source_config['ocr_tool']
-                
-                self.ocr_data, self.md_content, self.image_path = load_ocr_data_file(json_path, temp_config)
-            else:
-                self.ocr_data, self.md_content, self.image_path = load_ocr_data_file(json_path, self.config)
-                
-            self.process_data()
-        except Exception as e:
-            st.error(f"❌ 加载失败: {e}")
-            st.exception(e)
-    
-    def process_data(self):
-        """处理OCR数据"""
-        self.text_bbox_mapping = process_ocr_data(self.ocr_data, self.config)
-    
-    def get_statistics(self) -> Dict:
-        """获取统计信息"""
-        return get_ocr_statistics(self.ocr_data, self.text_bbox_mapping, self.marked_errors)
-    
-    def display_html_table_as_dataframe(self, html_content: str, enable_editing: bool = False):
-        """将HTML表格解析为DataFrame显示 - 增强版本支持横向滚动"""
-        tables = parse_html_tables(html_content)
-        wide_table_threshold = 15  # 超宽表格列数阈值
-        
-        if not tables:
-            st.warning("未找到可解析的表格")
-            # 对于无法解析的HTML表格,使用自定义CSS显示
-            st.markdown("""
-            <style>
-            .scrollable-table {
-                overflow-x: auto;
-                white-space: nowrap;
-                border: 1px solid #ddd;
-                border-radius: 5px;
-                margin: 10px 0;
-            }
-            .scrollable-table table {
-                width: 100%;
-                border-collapse: collapse;
-            }
-            .scrollable-table th, .scrollable-table td {
-                border: 1px solid #ddd;
-                padding: 8px;
-                text-align: left;
-                min-width: 100px;
-            }
-            .scrollable-table th {
-                background-color: #f5f5f5;
-                font-weight: bold;
-            }
-            </style>
-            """, unsafe_allow_html=True)
-            
-            st.markdown(f'<div class="scrollable-table">{html_content}</div>', unsafe_allow_html=True)
-            return
-            
-        for i, table in enumerate(tables):
-            st.subheader(f"📊 表格 {i+1}")
-            
-            # 表格信息显示
-            col_info1, col_info2, col_info3, col_info4 = st.columns(4)
-            with col_info1:
-                st.metric("行数", len(table))
-            with col_info2:
-                st.metric("列数", len(table.columns))
-            with col_info3:
-                # 检查是否有超宽表格
-                is_wide_table = len(table.columns) > wide_table_threshold
-                st.metric("表格类型", "超宽表格" if is_wide_table else "普通表格")
-            with col_info4:
-                # 表格操作模式选择
-                display_mode = st.selectbox(
-                    f"显示模式 (表格{i+1})",
-                    ["完整显示", "分页显示", "筛选列显示"],
-                    key=f"display_mode_{i}"
-                )
-            
-            # 创建表格操作按钮
-            col1, col2, col3, col4 = st.columns(4)
-            with col1:
-                show_info = st.checkbox(f"显示详细信息", key=f"info_{i}")
-            with col2:
-                show_stats = st.checkbox(f"显示统计信息", key=f"stats_{i}")
-            with col3:
-                enable_filter = st.checkbox(f"启用过滤", key=f"filter_{i}")
-            with col4:
-                enable_sort = st.checkbox(f"启用排序", key=f"sort_{i}")
-            
-            # 根据显示模式处理表格
-            display_table = self._process_table_display_mode(table, i, display_mode)
-            
-            # 数据过滤和排序逻辑
-            filtered_table = self._apply_table_filters_and_sorts(display_table, i, enable_filter, enable_sort)
-            
-            # 显示表格 - 使用自定义CSS支持横向滚动
-            st.markdown("""
-            <style>
-            .dataframe-container {
-                overflow-x: auto;
-                border: 1px solid #ddd;
-                border-radius: 5px;
-                margin: 10px 0;
-            }
-            
-            /* 为超宽表格特殊样式 */
-            .wide-table-container {
-                overflow-x: auto;
-                max-height: 500px;
-                overflow-y: auto;
-                border: 2px solid #0288d1;
-                border-radius: 8px;
-                background: linear-gradient(90deg, #f8f9fa 0%, #ffffff 100%);
-            }
-            
-            .dataframe thead th {
-                position: sticky;
-                top: 0;
-                background-color: #f5f5f5 !important;
-                z-index: 10;
-                border-bottom: 2px solid #0288d1;
-            }
-            
-            .dataframe tbody td {
-                white-space: nowrap;
-                min-width: 100px;
-                max-width: 300px;
-                overflow: hidden;
-                text-overflow: ellipsis;
-            }
-            </style>
-            """, unsafe_allow_html=True)
-            
-            # 根据表格宽度选择显示容器
-            container_class = "wide-table-container" if len(table.columns) > wide_table_threshold else "dataframe-container"
-            
-            if enable_editing:
-                st.markdown(f'<div class="{container_class}">', unsafe_allow_html=True)
-                edited_table = st.data_editor(
-                    filtered_table, 
-                    width='stretch', 
-                    key=f"editor_{i}",
-                    height=400 if len(table.columns) > 8 else None
-                )
-                st.markdown('</div>', unsafe_allow_html=True)
-                
-                if not edited_table.equals(filtered_table):
-                    st.success("✏️ 表格已编辑,可以导出修改后的数据")
-            else:
-                st.markdown(f'<div class="{container_class}">', unsafe_allow_html=True)
-                st.dataframe(
-                    filtered_table, 
-                    width=400 if len(table.columns) > wide_table_threshold else "stretch"
-                )
-                st.markdown('</div>', unsafe_allow_html=True)
-            
-            # 显示表格信息和统计
-            self._display_table_info_and_stats(table, filtered_table, show_info, show_stats, i)
-            
-            st.markdown("---")
-    
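`display_html_table_as_dataframe` 依赖项目自带的 `parse_html_tables`;若只需要 HTML 表格转 DataFrame 这一步,pandas 内置的 `read_html` 也能完成(需要安装 lxml 或 html5lib 解析器),示意如下:

```python
import io
import pandas as pd

html = """
<table>
  <tr><th>日期</th><th>金额</th></tr>
  <tr><td>2024-01-01</td><td>100.00</td></tr>
  <tr><td>2024-01-02</td><td>250.50</td></tr>
</table>
"""

# read_html 返回页面中所有表格解析出的 DataFrame 列表
tables = pd.read_html(io.StringIO(html))
print(tables[0])
```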
-    def _apply_table_filters_and_sorts(self, table: pd.DataFrame, table_index: int, enable_filter: bool, enable_sort: bool) -> pd.DataFrame:
-        """应用表格过滤和排序"""
-        filtered_table = table.copy()
-        
-        # 数据过滤
-        if enable_filter and not table.empty:
-            filter_col = st.selectbox(
-                f"选择过滤列 (表格 {table_index+1})", 
-                options=['无'] + list(table.columns),
-                key=f"filter_col_{table_index}"
-            )
-            
-            if filter_col != '无':
-                filter_value = st.text_input(f"过滤值 (表格 {table_index+1})", key=f"filter_value_{table_index}")
-                if filter_value:
-                    filtered_table = table[table[filter_col].astype(str).str.contains(filter_value, na=False)]
-        
-        # 数据排序
-        if enable_sort and not filtered_table.empty:
-            sort_col = st.selectbox(
-                f"选择排序列 (表格 {table_index+1})", 
-                options=['无'] + list(filtered_table.columns),
-                key=f"sort_col_{table_index}"
-            )
-            
-            if sort_col != '无':
-                sort_order = st.radio(
-                    f"排序方式 (表格 {table_index+1})",
-                    options=['升序', '降序'],
-                    horizontal=True,
-                    key=f"sort_order_{table_index}"
-                )
-                ascending = (sort_order == '升序')
-                filtered_table = filtered_table.sort_values(sort_col, ascending=ascending)
-        
-        return filtered_table
-    
-    def _display_table_info_and_stats(self, original_table: pd.DataFrame, filtered_table: pd.DataFrame, 
-                                     show_info: bool, show_stats: bool, table_index: int):
-        """显示表格信息和统计数据"""
-        if show_info:
-            st.write("**表格信息:**")
-            st.write(f"- 原始行数: {len(original_table)}")
-            st.write(f"- 过滤后行数: {len(filtered_table)}")
-            st.write(f"- 列数: {len(original_table.columns)}")
-            st.write(f"- 列名: {', '.join(original_table.columns)}")
-        
-        if show_stats:
-            st.write("**统计信息:**")
-            numeric_cols = filtered_table.select_dtypes(include=[np.number]).columns
-            if len(numeric_cols) > 0:
-                st.dataframe(filtered_table[numeric_cols].describe())
-            else:
-                st.info("表格中没有数值列")
-        
-        # 导出功能
-        if st.button(f"📥 导出表格 {table_index+1}", key=f"export_{table_index}"):
-            self._create_export_buttons(filtered_table, table_index)
-    
-    def _create_export_buttons(self, table: pd.DataFrame, table_index: int):
-        """创建导出按钮"""
-        # CSV导出
-        csv_data = table.to_csv(index=False)
-        st.download_button(
-            label=f"下载CSV (表格 {table_index+1})",
-            data=csv_data,
-            file_name=f"table_{table_index+1}.csv",
-            mime="text/csv",
-            key=f"download_csv_{table_index}"
-        )
-        
-        # Excel导出
-        excel_buffer = BytesIO()
-        table.to_excel(excel_buffer, index=False)
-        st.download_button(
-            label=f"下载Excel (表格 {table_index+1})",
-            data=excel_buffer.getvalue(),
-            file_name=f"table_{table_index+1}.xlsx",
-            mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-            key=f"download_excel_{table_index}"
-        )
-    
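导出按钮的关键是用 `BytesIO` 在内存中生成 Excel,再把字节交给 `st.download_button`,全程不写临时文件。脱离 Streamlit 的最小示意(xlsx 写入依赖 openpyxl):

```python
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"日期": ["2024-01-01"], "金额": [100.0]})

buffer = BytesIO()
df.to_excel(buffer, index=False, sheet_name="导出")  # 写入内存缓冲区
excel_bytes = buffer.getvalue()  # 可直接作为 st.download_button 的 data 参数

print(f"生成 {len(excel_bytes)} 字节的 xlsx 内容")
```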
-    def _process_table_display_mode(self, table: pd.DataFrame, table_index: int, display_mode: str) -> pd.DataFrame:
-        """根据显示模式处理表格"""
-        if display_mode == "分页显示":
-            # 分页显示
-            page_size = st.selectbox(
-                f"每页显示行数 (表格 {table_index+1})",
-                [10, 20, 50, 100],
-                key=f"page_size_{table_index}"
-            )
-            
-            total_pages = (len(table) - 1) // page_size + 1
-            
-            if total_pages > 1:
-                page_number = st.selectbox(
-                    f"页码 (表格 {table_index+1})",
-                    range(1, total_pages + 1),
-                    key=f"page_number_{table_index}"
-                )
-                
-                start_idx = (page_number - 1) * page_size
-                end_idx = start_idx + page_size
-                return table.iloc[start_idx:end_idx]
-            
-            return table
-            
-        elif display_mode == "筛选列显示":
-            # 列筛选显示
-            if len(table.columns) > 5:
-                selected_columns = st.multiselect(
-                    f"选择要显示的列 (表格 {table_index+1})",
-                    table.columns.tolist(),
-                    default=table.columns.tolist()[:5],  # 默认显示前5列
-                    key=f"selected_columns_{table_index}"
-                )
-                
-                if selected_columns:
-                    return table[selected_columns]
-            
-            return table
-            
-        else:  # 完整显示
-            return table
-
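分页显示的页数公式是 `total_pages = (行数 - 1) // 每页行数 + 1`,第 p 页取 `iloc[(p-1)*size : p*size]`。用一个小例子验证边界(23 行、每页 10 行应得 3 页,末页 3 行):

```python
import pandas as pd

df = pd.DataFrame({"v": range(23)})
page_size = 10
total_pages = (len(df) - 1) // page_size + 1  # (23-1)//10 + 1 = 3

for page_number in range(1, total_pages + 1):
    start = (page_number - 1) * page_size
    chunk = df.iloc[start:start + page_size]
    print(f"第 {page_number}/{total_pages} 页: {len(chunk)} 行")
```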
-    def find_verify_md_path(self, selected_file_index: int) -> Optional[Path]:
-        """查找当前OCR文件对应的验证文件路径"""
-        current_page = self.file_info[selected_file_index]['page']
-        verify_md_path = None
-
-        for i, info in enumerate(self.verify_file_info):
-            if info['page'] == current_page:
-                verify_md_path = Path(self.verify_file_paths[i]).with_suffix('.md')
-                break
-
-        return verify_md_path
-
-    @st.dialog("交叉验证", width="large", dismissible=True, on_dismiss="rerun")
-    def cross_validation(self):
-        """交叉验证功能 - 批量比对两个数据源的所有OCR结果"""
-        
-        if self.current_source_key == self.verify_source_key:
-            st.error("❌ OCR数据源和验证数据源不能相同")
-            return
-        
-        # 初始化对比结果存储
-        if 'cross_validation_batch_result' not in st.session_state:
-            st.session_state.cross_validation_batch_result = None
-        
-        st.header("🔄 批量交叉验证")
-        
-        # 显示数据源信息
-        col1, col2 = st.columns(2)
-        with col1:
-            st.info(f"**OCR数据源:** {get_data_source_display_name(self.current_source_config)}")
-            st.write(f"📁 文件数量: {len(self.file_info)}")
-        with col2:
-            st.info(f"**验证数据源:** {get_data_source_display_name(self.verify_source_config)}")
-            st.write(f"📁 文件数量: {len(self.verify_file_info)}")
-        
-        # 批量验证选项
-        with st.expander("⚙️ 验证选项", expanded=True):
-            col1, col2 = st.columns(2)
-            with col1:
-                table_mode = st.selectbox(
-                    "表格比对模式",
-                    options=['standard', 'flow_list'],
-                    index=1,  # 默认使用flow_list
-                    format_func=lambda x: '流水表格模式' if x == 'flow_list' else '标准模式',
-                    help="选择表格比对算法"
-                )
-            with col2:
-                similarity_algorithm = st.selectbox(
-                    "相似度算法",
-                    options=['ratio', 'partial_ratio', 'token_sort_ratio', 'token_set_ratio'],
-                    index=0,
-                    help="选择文本相似度计算算法"
-                )
-        
-        # 开始批量验证按钮
-        if st.button("🚀 开始批量验证", type="primary", width='stretch'):
-            self._run_batch_cross_validation(table_mode, similarity_algorithm)
-        
-        # 显示历史批量验证结果
-        if 'cross_validation_batch_result' in st.session_state and st.session_state.cross_validation_batch_result:
-            st.markdown("---")
-            self._display_batch_validation_results(st.session_state.cross_validation_batch_result)
-    
-    def _generate_batch_validation_markdown(self, batch_results: dict, output_path: str):
-        """生成批量验证的Markdown报告"""
-        
-        with open(output_path, "w", encoding="utf-8") as f:
-            f.write("# 批量交叉验证报告\n\n")
-            
-            # 基本信息
-            f.write("## 📋 基本信息\n\n")
-            f.write(f"- **OCR数据源:** {batch_results['ocr_source']}\n")
-            f.write(f"- **验证数据源:** {batch_results['verify_source']}\n")
-            f.write(f"- **表格模式:** {batch_results['table_mode']}\n")
-            f.write(f"- **相似度算法:** {batch_results['similarity_algorithm']}\n")
-            f.write(f"- **验证时间:** {batch_results['timestamp']}\n\n")
-            
-            # 汇总统计
-            summary = batch_results['summary']
-            f.write("## 📊 汇总统计\n\n")
-            f.write(f"- **总页数:** {summary['total_pages']}\n")
-            f.write(f"- **成功页数:** {summary['successful_pages']}\n")
-            f.write(f"- **失败页数:** {summary['failed_pages']}\n")
-            f.write(f"- **总差异数:** {summary['total_differences']}\n")
-            f.write(f"- **表格差异:** {summary['total_table_differences']}\n")
-            f.write(f"  - 金额差异: {summary.get('total_amount_differences', 0)}\n")
-            f.write(f"  - 日期差异: {summary.get('total_datetime_differences', 0)}\n")
-            f.write(f"  - 文本差异: {summary.get('total_text_differences', 0)}\n")
-            f.write(f"  - 表头前差异: {summary.get('total_table_pre_header', 0)}\n")
-            f.write(f"  - 表头位置差异: {summary.get('total_table_header_position', 0)}\n")
-            f.write(f"  - 表头严重错误: {summary.get('total_table_header_critical', 0)}\n")
-            f.write(f"  - 行缺失: {summary.get('total_table_row_missing', 0)}\n")
-            f.write(f"- **段落差异:** {summary['total_paragraph_differences']}\n")
-            f.write(f"- **严重程度统计:**\n")
-            f.write(f"  - 高严重度: {summary.get('total_high_severity', 0)}\n")
-            f.write(f"  - 中严重度: {summary.get('total_medium_severity', 0)}\n")
-            f.write(f"  - 低严重度: {summary.get('total_low_severity', 0)}\n\n")
-            
-            # 详细结果表格
-            f.write("## 📄 各页差异统计\n\n")
-            f.write("| 页码 | 状态 | 总差异 | 表格差异 | 金额 | 日期 | 文本 | 段落 | 表头前 | 表头位置 | 表头错误 | 行缺失 | 高 | 中 | 低 |\n")
-            f.write("|------|------|--------|----------|------|------|------|------|--------|----------|----------|--------|----|----|----|\n")
-            
-            for page in batch_results['pages']:
-                if page['status'] == 'success':
-                    status_icon = "✅" if page['total_differences'] == 0 else "⚠️"
-                    f.write(f"| {page['page_num']} | {status_icon} | ")
-                    f.write(f"{page['total_differences']} | ")
-                    f.write(f"{page['table_differences']} | ")
-                    f.write(f"{page.get('amount_differences', 0)} | ")
-                    f.write(f"{page.get('datetime_differences', 0)} | ")
-                    f.write(f"{page.get('text_differences', 0)} | ")
-                    f.write(f"{page['paragraph_differences']} | ")
-                    f.write(f"{page.get('table_pre_header', 0)} | ")
-                    f.write(f"{page.get('table_header_position', 0)} | ")
-                    f.write(f"{page.get('table_header_critical', 0)} | ")
-                    f.write(f"{page.get('table_row_missing', 0)} | ")
-                    f.write(f"{page.get('high_severity', 0)} | ")
-                    f.write(f"{page.get('medium_severity', 0)} | ")
-                    f.write(f"{page.get('low_severity', 0)} |\n")
-                else:
-                    f.write(f"| {page['page_num']} | ❌ | - | - | - | - | - | - | - | - | - | - | - | - | - |\n")
-            
-            f.write("\n")
-            
-            # 问题汇总
-            f.write("## 🔍 问题汇总\n\n")
-            
-            high_diff_pages = [p for p in batch_results['pages'] 
-                             if p['status'] == 'success' and p['total_differences'] > 10]
-            if high_diff_pages:
-                f.write("### ⚠️ 高差异页面(差异>10)\n\n")
-                for page in high_diff_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page['total_differences']} 个差异\n")
-                f.write("\n")
-            
-            amount_error_pages = [p for p in batch_results['pages'] 
-                                if p['status'] == 'success' and p.get('amount_differences', 0) > 0]
-            if amount_error_pages:
-                f.write("### 💰 金额差异页面\n\n")
-                for page in amount_error_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page.get('amount_differences', 0)} 个金额差异\n")
-                f.write("\n")
-            
-            header_error_pages = [p for p in batch_results['pages'] 
-                                if p['status'] == 'success' and p.get('table_header_critical', 0) > 0]
-            if header_error_pages:
-                f.write("### ❌ 表头严重错误页面\n\n")
-                for page in header_error_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page['table_header_critical']} 个表头错误\n")
-                f.write("\n")
-            
-            failed_pages = [p for p in batch_results['pages'] if p['status'] == 'failed']
-            if failed_pages:
-                f.write("### 💥 验证失败页面\n\n")
-                for page in failed_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page.get('error', '未知错误')}\n")
-                f.write("\n")
-
-    def _run_batch_cross_validation(self, table_mode: str, similarity_algorithm: str):
-        """执行批量交叉验证"""
-        
-        # 准备输出目录
-        pre_validation_dir = Path(self.config['pre_validation'].get('out_dir', './output/pre_validation/')).resolve()
-        pre_validation_dir.mkdir(parents=True, exist_ok=True)
-        
-        # ✅ 批量结果存储 - 更新统计字段
-        batch_results = {
-            'ocr_source': get_data_source_display_name(self.current_source_config),
-            'verify_source': get_data_source_display_name(self.verify_source_config),
-            'table_mode': table_mode,
-            'similarity_algorithm': similarity_algorithm,
-            'timestamp': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
-            'pages': [],
-            'summary': {
-                'total_pages': 0,
-                'successful_pages': 0,
-                'failed_pages': 0,
-                'total_differences': 0,
-                'total_table_differences': 0,
-                'total_amount_differences': 0,
-                'total_datetime_differences': 0,
-                'total_text_differences': 0,
-                'total_paragraph_differences': 0,
-                'total_table_pre_header': 0,
-                'total_table_header_position': 0,
-                'total_table_header_critical': 0,
-                'total_table_row_missing': 0,
-                'total_high_severity': 0,
-                'total_medium_severity': 0,
-                'total_low_severity': 0
-            }
-        }
-        
-        # 创建进度条
-        progress_bar = st.progress(0)
-        status_text = st.empty()
-        
-        # 建立页码映射
-        ocr_page_map = {info['page']: i for i, info in enumerate(self.file_info)}
-        verify_page_map = {info['page']: i for i, info in enumerate(self.verify_file_info)}
-        
-        # 找出两个数据源共同的页码
-        common_pages = sorted(set(ocr_page_map.keys()) & set(verify_page_map.keys()))
-        
-        if not common_pages:
-            st.error("❌ 两个数据源没有共同的页码,无法进行对比")
-            return
-        
-        batch_results['summary']['total_pages'] = len(common_pages)
-        
-        # 创建详细日志区域
-        with st.expander("📋 详细对比日志", expanded=True):
-            log_container = st.container()
-        
-        # 逐页对比
-        for idx, page_num in enumerate(common_pages):
-            try:
-                # 更新进度
-                progress = (idx + 1) / len(common_pages)
-                progress_bar.progress(progress)
-                status_text.text(f"正在对比第 {page_num} 页... ({idx + 1}/{len(common_pages)})")
-                
-                # 获取文件路径
-                ocr_file_index = ocr_page_map[page_num]
-                verify_file_index = verify_page_map[page_num]
-                
-                ocr_md_path = Path(self.file_paths[ocr_file_index]).with_suffix('.md')
-                verify_md_path = Path(self.verify_file_paths[verify_file_index]).with_suffix('.md')
-                
-                if not ocr_md_path.exists() or not verify_md_path.exists():
-                    with log_container:
-                        st.warning(f"⚠️ 第 {page_num} 页:文件不存在,跳过")
-                    batch_results['summary']['failed_pages'] += 1
-                    continue
-                
-                # 执行对比
-                comparison_result_path = pre_validation_dir / f"{ocr_md_path.stem}_cross_validation"
-                
-                # 捕获对比输出
-                import io
-                import contextlib
-                
-                output_buffer = io.StringIO()
-                
-                with contextlib.redirect_stdout(output_buffer):
-                    comparison_result = compare_ocr_results(
-                        file1_path=str(ocr_md_path),
-                        file2_path=str(verify_md_path),
-                        output_file=str(comparison_result_path),
-                        output_format='both',
-                        ignore_images=True,
-                        table_mode=table_mode,
-                        similarity_algorithm=similarity_algorithm
-                    )
-                
-                # ✅ 提取统计信息 - 更新字段
-                stats = comparison_result['statistics']
-                
-                page_result = {
-                    'page_num': page_num,
-                    'ocr_file': str(ocr_md_path.name),
-                    'verify_file': str(verify_md_path.name),
-                    'total_differences': stats['total_differences'],
-                    'table_differences': stats['table_differences'],
-                    'amount_differences': stats.get('amount_differences', 0),
-                    'datetime_differences': stats.get('datetime_differences', 0),
-                    'text_differences': stats.get('text_differences', 0),
-                    'paragraph_differences': stats['paragraph_differences'],
-                    'table_pre_header': stats.get('table_pre_header', 0),
-                    'table_header_position': stats.get('table_header_position', 0),
-                    'table_header_critical': stats.get('table_header_critical', 0),
-                    'table_row_missing': stats.get('table_row_missing', 0),
-                    'high_severity': stats.get('high_severity', 0),
-                    'medium_severity': stats.get('medium_severity', 0),
-                    'low_severity': stats.get('low_severity', 0),
-                    'status': 'success',
-                    'comparison_json': f"{comparison_result_path}.json",
-                    'comparison_md': f"{comparison_result_path}.md"
-                }
-                
-                batch_results['pages'].append(page_result)
-                batch_results['summary']['successful_pages'] += 1
-                batch_results['summary']['total_differences'] += stats['total_differences']
-                batch_results['summary']['total_table_differences'] += stats['table_differences']
-                batch_results['summary']['total_amount_differences'] += stats.get('amount_differences', 0)
-                batch_results['summary']['total_datetime_differences'] += stats.get('datetime_differences', 0)
-                batch_results['summary']['total_text_differences'] += stats.get('text_differences', 0)
-                batch_results['summary']['total_paragraph_differences'] += stats['paragraph_differences']
-                batch_results['summary']['total_table_pre_header'] += stats.get('table_pre_header', 0)
-                batch_results['summary']['total_table_header_position'] += stats.get('table_header_position', 0)
-                batch_results['summary']['total_table_header_critical'] += stats.get('table_header_critical', 0)
-                batch_results['summary']['total_table_row_missing'] += stats.get('table_row_missing', 0)
-                batch_results['summary']['total_high_severity'] += stats.get('high_severity', 0)
-                batch_results['summary']['total_medium_severity'] += stats.get('medium_severity', 0)
-                batch_results['summary']['total_low_severity'] += stats.get('low_severity', 0)
-                
-                # 显示当前页对比结果
-                with log_container:
-                    if stats['total_differences'] == 0:
-                        st.success(f"✅ 第 {page_num} 页:完全匹配")
-                    else:
-                        st.warning(f"⚠️ 第 {page_num} 页:发现 {stats['total_differences']} 个差异")
-                
-            except Exception as e:
-                with log_container:
-                    st.error(f"❌ 第 {page_num} 页:对比失败 - {str(e)}")
-                
-                page_result = {
-                    'page_num': page_num,
-                    'status': 'failed',
-                    'error': str(e)
-                }
-                batch_results['pages'].append(page_result)
-                batch_results['summary']['failed_pages'] += 1
-        
-        # 保存批量结果
-        batch_result_path = pre_validation_dir / f"{self.current_source_config['name']}_{self.current_source_config['ocr_tool']}_vs_{self.verify_source_config['ocr_tool']}_batch_cross_validation"
-        
-        # 保存JSON
-        with open(f"{batch_result_path}.json", "w", encoding="utf-8") as f:
-            json.dump(batch_results, f, ensure_ascii=False, indent=2)
-        
-        # 生成Markdown报告
-        self._generate_batch_validation_markdown(batch_results, f"{batch_result_path}.md")
-        
-        # 保存到session state
-        st.session_state.cross_validation_batch_result = batch_results
-        
-        # 完成提示
-        progress_bar.progress(1.0)
-        status_text.text("✅ 批量验证完成!")
-        
-        st.success(f"🎉 批量验证完成!成功: {batch_results['summary']['successful_pages']}, 失败: {batch_results['summary']['failed_pages']}")
-
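批量验证中用 `contextlib.redirect_stdout` 截获 `compare_ocr_results` 的打印输出,避免刷进页面日志。该标准库模式的独立示意(`noisy_compare` 为假设的占位函数):

```python
import contextlib
import io

def noisy_compare() -> int:
    """假设的对比函数,会向 stdout 打印过程信息"""
    print("比对中...")
    print("发现 2 处差异")
    return 2

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diff_count = noisy_compare()  # print 输出被写入 buffer 而非终端

print(f"差异数: {diff_count}")
print(f"捕获的日志:\n{buffer.getvalue()}")
```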
-    def _display_batch_validation_results(self, batch_results: dict):
-        """显示批量验证结果"""
-        
-        st.header("📊 批量验证结果")
-        
-        # 汇总统计
-        summary = batch_results['summary']
-        
-        col1, col2, col3, col4 = st.columns(4)
-        with col1:
-            st.metric("总页数", summary['total_pages'])
-        with col2:
-            st.metric("成功页数", summary['successful_pages'], 
-                     delta=f"{summary['successful_pages']/summary['total_pages']*100:.1f}%")
-        with col3:
-            st.metric("失败页数", summary['failed_pages'],
-                     delta=f"-{summary['failed_pages']}" if summary['failed_pages'] > 0 else "0")
-        with col4:
-            st.metric("总差异数", summary['total_differences'])
-        
-        # ✅ 详细差异类型统计 - 更新展示
-        st.subheader("📈 差异类型统计")
-        
-        col1, col2, col3 = st.columns(3)
-        with col1:
-            st.metric("表格差异", summary['total_table_differences'])
-            st.caption(f"金额: {summary.get('total_amount_differences', 0)} | 日期: {summary.get('total_datetime_differences', 0)} | 文本: {summary.get('total_text_differences', 0)}")
-        with col2:
-            st.metric("段落差异", summary['total_paragraph_differences'])
-        with col3:
-            st.metric("严重度", f"高:{summary.get('total_high_severity', 0)} 中:{summary.get('total_medium_severity', 0)} 低:{summary.get('total_low_severity', 0)}")
-        
-        # 表格结构差异统计
-        with st.expander("📋 表格结构差异详情", expanded=False):
-            col1, col2, col3, col4 = st.columns(4)
-            with col1:
-                st.metric("表头前", summary.get('total_table_pre_header', 0))
-            with col2:
-                st.metric("表头位置", summary.get('total_table_header_position', 0))
-            with col3:
-                st.metric("表头错误", summary.get('total_table_header_critical', 0))
-            with col4:
-                st.metric("行缺失", summary.get('total_table_row_missing', 0))
-        
-        # ✅ 各页详细结果表格 - 更新列
-        st.subheader("📄 各页详细结果")
-        
-        # 准备DataFrame
-        page_data = []
-        for page in batch_results['pages']:
-            if page['status'] == 'success':
-                page_data.append({
-                    '页码': page['page_num'],
-                    '状态': '✅ 成功' if page['total_differences'] == 0 else '⚠️ 有差异',
-                    '总差异': page['total_differences'],
-                    '表格差异': page['table_differences'],
-                    '金额': page.get('amount_differences', 0),
-                    '日期': page.get('datetime_differences', 0),
-                    '文本': page.get('text_differences', 0),
-                    '段落': page['paragraph_differences'],
-                    '表头前': page.get('table_pre_header', 0),
-                    '表头位置': page.get('table_header_position', 0),
-                    '表头错误': page.get('table_header_critical', 0),
-                    '行缺失': page.get('table_row_missing', 0),
-                    '高': page.get('high_severity', 0),
-                    '中': page.get('medium_severity', 0),
-                    '低': page.get('low_severity', 0)
-                })
-            else:
-                page_data.append({
-                    '页码': page['page_num'],
-                    '状态': '❌ 失败',
-                    '总差异': '-', '表格差异': '-', '金额': '-', '日期': '-', 
-                    '文本': '-', '段落': '-', '表头前': '-', '表头位置': '-',
-                    '表头错误': '-', '行缺失': '-', '高': '-', '中': '-', '低': '-'
-                })
-        
-        df_pages = pd.DataFrame(page_data)
-        
-        # 显示表格
-        st.dataframe(
-            df_pages,
-            width='stretch',
-            hide_index=True,
-            column_config={
-                "页码": st.column_config.NumberColumn("页码", width="small"),
-                "状态": st.column_config.TextColumn("状态", width="small"),
-                "总差异": st.column_config.NumberColumn("总差异", width="small"),
-                "表格差异": st.column_config.NumberColumn("表格", width="small"),
-                "金额": st.column_config.NumberColumn("金额", width="small"),
-                "日期": st.column_config.NumberColumn("日期", width="small"),
-                "文本": st.column_config.NumberColumn("文本", width="small"),
-                "段落": st.column_config.NumberColumn("段落", width="small"),
-            }
-        )
-        
-        # 下载选项
-        st.subheader("📥 导出报告")
-        
-        col1, col2 = st.columns(2)
-        
-        with col1:
-            # 导出Excel
-            excel_buffer = BytesIO()
-            df_pages.to_excel(excel_buffer, index=False, sheet_name='验证结果')
-            
-            st.download_button(
-                label="📊 下载Excel报告",
-                data=excel_buffer.getvalue(),
-                file_name=f"batch_validation_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx",
-                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
-            )
-        
-        with col2:
-            # 导出JSON
-            json_data = json.dumps(batch_results, ensure_ascii=False, indent=2)
-            
-            st.download_button(
-                label="📄 下载JSON报告",
-                data=json_data,
-                file_name=f"batch_validation_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.json",
-                mime="application/json"
-            )
-
-    @st.dialog("查看交叉验证结果", width="large", dismissible=True, on_dismiss="rerun")
-    def show_batch_cross_validation_results_dialog(self):
-        if 'cross_validation_batch_result' in st.session_state and st.session_state.cross_validation_batch_result:
-            self._display_batch_validation_results(st.session_state.cross_validation_batch_result)
-            
-        else:
-            st.info("暂无交叉验证结果,请先运行交叉验证")
-
-    def display_comparison_results(self, comparison_result: dict, detailed: bool = True):
-        """显示对比结果摘要 - 使用DataFrame展示"""
-        
-        st.header("📊 VLM预校验结果")
-        
-        # 统计信息
-        stats = comparison_result['statistics']
-        
-        # 统计信息概览
-        col1, col2, col3, col4 = st.columns(4)
-        with col1:
-            st.metric("总差异数", stats['total_differences'])
-        with col2:
-            st.metric("表格差异", stats['table_differences'])
-        with col3:
-            st.metric("其中表格金额差异", stats['amount_differences'])
-        with col4:
-            st.metric("段落差异", stats['paragraph_differences'])
-        
-        # 结果判断
-        if stats['total_differences'] == 0:
-            st.success("🎉 完美匹配!VLM识别结果与原OCR结果完全一致")
-        else:
-            st.warning(f"⚠️ 发现 {stats['total_differences']} 个差异,建议人工检查")
-            
-            # 使用DataFrame显示差异详情
-            if comparison_result['differences']:
-                st.subheader("🔍 差异详情对比")
-                
-                # 准备DataFrame数据
-                diff_data = []
-                for i, diff in enumerate(comparison_result['differences'], 1):
-                    diff_data.append({
-                        '序号': i,
-                        '位置': diff['position'],
-                        '类型': diff['type'],
-                        '原OCR结果': diff['file1_value'][:100] + ('...' if len(diff['file1_value']) > 100 else ''),
-                        'VLM识别结果': diff['file2_value'][:100] + ('...' if len(diff['file2_value']) > 100 else ''),
-                        '描述': diff['description'][:80] + ('...' if len(diff['description']) > 80 else ''),
-                        '严重程度': self._get_severity_level(diff)
-                    })
-                
-                # 创建DataFrame
-                df_differences = pd.DataFrame(diff_data)
-                
-                # 添加样式
-                def highlight_severity(val):
-                    """根据严重程度添加颜色"""
-                    if val == '高':
-                        return 'background-color: #ffebee; color: #c62828'
-                    elif val == '中':
-                        return 'background-color: #fff3e0; color: #ef6c00'
-                    elif val == '低':
-                        return 'background-color: #e8f5e8; color: #2e7d32'
-                    return ''
-                
-                # 显示DataFrame
-                styled_df = df_differences.style.applymap(
-                    highlight_severity, 
-                    subset=['严重程度']
-                ).format({
-                    '序号': '{:d}',
-                })
-                
-                st.dataframe(
-                    styled_df, 
-                    width='stretch',
-                    height=400,
-                    hide_index=True,
-                    column_config={
-                        "序号": st.column_config.NumberColumn(
-                            "序号", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异项序号"
-                        ),
-                        "位置": st.column_config.TextColumn(
-                            "位置", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异在文档中的位置"
-                        ),
-                        "类型": st.column_config.TextColumn(
-                            "类型", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异类型"
-                        ),
-                        "原OCR结果": st.column_config.TextColumn(
-                            "原OCR结果", 
-                            width="large",  # 自动调整宽度
-                            pinned=True,
-                            help="原始OCR识别结果"
-                        ),
-                        "VLM识别结果": st.column_config.TextColumn(
-                            "VLM识别结果", 
-                            width="large",  # 自动调整宽度
-                            help="VLM重新识别的结果"
-                        ),
-                        "描述": st.column_config.TextColumn(
-                            "描述", 
-                            width="medium",  # 自动调整宽度
-                            help="差异详细描述"
-                        ),
-                        "严重程度": st.column_config.TextColumn(
-                            "严重程度", 
-                            width=None,  # 自动调整宽度
-                            help="差异严重程度评级"
-                        )
-                    }
-                )
-                
-                # 详细差异查看
-                st.subheader("🔍 详细差异查看")
-                
-                if detailed:
-                    # 选择要查看的差异
-                    selected_diff_index = st.selectbox(
-                        "选择要查看的差异:",
-                        options=range(len(comparison_result['differences'])),
-                        format_func=lambda x: f"差异 {x+1}: {comparison_result['differences'][x]['position']} - {comparison_result['differences'][x]['type']}",
-                        key="selected_diff"
-                    )
-                    
-                    if selected_diff_index is not None:
-                        diff = comparison_result['differences'][selected_diff_index]
-                        
-                        # 并排显示完整内容
-                        col1, col2 = st.columns(2)
-                        
-                        with col1:
-                            st.write("**原OCR结果:**")
-                            st.text_area(
-                                "原OCR结果详情",
-                                value=diff['file1_value'],
-                                height=200,
-                                key=f"original_{selected_diff_index}",
-                                label_visibility="collapsed"
-                            )
-                        
-                        with col2:
-                            st.write("**验证数据源识别结果:**")
-                            st.text_area(
-                                "验证数据源识别结果详情",
-                                value=diff['file2_value'],
-                                height=200,
-                                key=f"vlm_{selected_diff_index}",
-                                label_visibility="collapsed"
-                            )
-                        
-                        # 差异详细信息
-                        st.info(f"**位置:** {diff['position']}")
-                        st.info(f"**类型:** {diff['type']}")
-                        st.info(f"**描述:** {diff['description']}")
-                        st.info(f"**严重程度:** {self._get_severity_level(diff)}")
-                
-                # 差异统计图表
-                st.subheader("📈 差异类型分布")
-                
-                # 按类型统计差异
-                type_counts = {}
-                severity_counts = {'高': 0, '中': 0, '低': 0}
-                
-                for diff in comparison_result['differences']:
-                    diff_type = diff['type']
-                    type_counts[diff_type] = type_counts.get(diff_type, 0) + 1
-                    
-                    severity = self._get_severity_level(diff)
-                    severity_counts[severity] += 1
-                
-                col1, col2 = st.columns(2)
-                
-                with col1:
-                    # 类型分布饼图
-                    if type_counts:
-                        fig_type = px.pie(
-                            values=list(type_counts.values()),
-                            names=list(type_counts.keys()),
-                            title="差异类型分布"
-                        )
-                        st.plotly_chart(fig_type, width='stretch')
-                
-                with col2:
-                    # Bar chart of severity levels
-                    fig_severity = px.bar(
-                        x=list(severity_counts.keys()),
-                        y=list(severity_counts.values()),
-                        title="差异严重程度分布",
-                        color=list(severity_counts.keys()),
-                        color_discrete_map={'高': '#f44336', '中': '#ff9800', '低': '#4caf50'}
-                    )
-                    st.plotly_chart(fig_severity, width='stretch')
-        
-        # Download options
-        if detailed:
-            self._provide_download_options_in_results(comparison_result)
-
-    def _get_severity_level(self, diff: dict) -> str:
-        """根据差异类型和内容判断严重程度"""
-        # 如果差异中已经包含严重程度,直接使用
-        if 'severity' in diff:
-            severity_map = {'high': '高', 'medium': '中', 'low': '低'}
-            return severity_map.get(diff['severity'], '中')
-        
-        # Fall back to the heuristic rules below
-        diff_type = diff['type'].lower()
-        
-        # Amount/number differences are high severity
-        if 'amount' in diff_type or 'number' in diff_type:
-            return '高'
-        
-        # Table structure differences are medium severity
-        if 'table' in diff_type or 'structure' in diff_type:
-            return '中'
-        
-        # Otherwise rate by text similarity when available
-        if 'similarity' in diff:
-            similarity = diff['similarity']
-            if similarity < 50:
-                return '高'
-            elif similarity < 85:
-                return '中'
-            else:
-                return '低'
-        
-        # Finally, rate by how much the content length differs
-        len_diff = abs(len(diff['file1_value']) - len(diff['file2_value']))
-        if len_diff > 50:
-            return '高'
-        elif len_diff > 10:
-            return '中'
-        else:
-            return '低'
-
-    def _provide_download_options_in_results(self, comparison_result: dict):
-        """在结果页面提供下载选项"""
-        
-        st.subheader("📥 导出预校验结果")
-        
-        col1, col2, col3 = st.columns(3)
-        
-        with col1:
-            # Export the diff details as Excel
-            if comparison_result['differences']:
-                diff_data = []
-                for i, diff in enumerate(comparison_result['differences'], 1):
-                    diff_data.append({
-                        '序号': i,
-                        '位置': diff['position'],
-                        '类型': diff['type'],
-                        '原OCR结果': diff['file1_value'],
-                        'VLM识别结果': diff['file2_value'],
-                        '描述': diff['description'],
-                        '严重程度': self._get_severity_level(diff)
-                    })
-                
-                df_export = pd.DataFrame(diff_data)
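-                # Write the sheet into an in-memory buffer; pandas needs an Excel
-                # engine such as openpyxl installed for .xlsx output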
-                excel_buffer = BytesIO()
-                df_export.to_excel(excel_buffer, index=False, sheet_name='差异详情')
-                
-                st.download_button(
-                    label="📊 下载差异详情(Excel)",
-                    data=excel_buffer.getvalue(),
-                    file_name=f"vlm_comparison_differences_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx",
-                    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-                    key="download_differences_excel"
-                )
-        
-        with col2:
-            # Export the statistics report as CSV
-            stats_data = {
-                '统计项目': ['总差异数', '表格差异', '其中表格金额差异', '段落差异'],
-                '数量': [
-                    comparison_result['statistics']['total_differences'],
-                    comparison_result['statistics']['table_differences'],
-                    comparison_result['statistics']['amount_differences'],
-                    comparison_result['statistics']['paragraph_differences']
-                ]
-            }
-            
-            df_stats = pd.DataFrame(stats_data)
-            csv_stats = df_stats.to_csv(index=False)
-            
-            st.download_button(
-                label="📈 下载统计报告(CSV)",
-                data=csv_stats,
-                file_name=f"vlm_comparison_stats_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.csv",
-                mime="text/csv",
-                key="download_stats_csv"
-            )
-        
-        with col3:
-            # Export the full report as JSON (json is already imported at module level)
-            
-            report_json = json.dumps(comparison_result, ensure_ascii=False, indent=2)
-            
-            st.download_button(
-                label="📄 下载完整报告(JSON)",
-                data=report_json,
-                file_name=f"vlm_comparison_full_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.json",
-                mime="application/json",
-                key="download_full_json"
-            )
-        
-        # Follow-up suggestions
-        st.subheader("🚀 后续操作建议")
-        
-        total_diffs = comparison_result['statistics']['total_differences']
-        if total_diffs == 0:
-            st.success("✅ VLM识别结果与原OCR完全一致,可信度很高,无需人工校验")
-        elif total_diffs <= 5:
-            st.warning("⚠️ 发现少量差异,建议重点检查高严重程度的差异项")
-        elif total_diffs <= 20:
-            st.warning("🔍 发现中等数量差异,建议详细检查差异表格中标红的项目")
-        else:
-            st.error("❌ 发现大量差异,建议重新进行OCR识别或检查原始图片质量")
-    
-    def create_compact_layout(self, config):
-        """创建滚动凑布局"""
-        return self.layout_manager.create_compact_layout(config)
-
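-# Modal helper: st.dialog turns message_box into a dialog that opens whenever
-# the function is called and triggers a rerun when dismissed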
-@st.dialog("message", width="small", dismissible=True, on_dismiss="rerun")
-def message_box(msg: str, msg_type: str = "info"):
-    if msg_type == "info":
-        st.info(msg)
-    elif msg_type == "warning":
-        st.warning(msg)
-    elif msg_type == "error":
-        st.error(msg)
-
-def main():
-    """主应用"""
-    # 初始化应用
-    if 'validator' not in st.session_state:
-        validator = StreamlitOCRValidator()
-        st.session_state.validator = validator
-        st.session_state.validator.setup_page_config()
-        
-        # 页面标题
-        config = st.session_state.validator.config
-        st.title(config['ui']['page_title'])
-    else:
-        validator = st.session_state.validator
-        config = st.session_state.validator.config
-    
-    if 'selected_text' not in st.session_state:
-        st.session_state.selected_text = None
-    
-    if 'marked_errors' not in st.session_state:
-        st.session_state.marked_errors = set()
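-    # selected_text and marked_errors live in session_state so they survive
-    # the script re-runs Streamlit performs on every interaction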
-    
-    # Data source selector
-    validator.create_data_source_selector()
-    
-    # Stop this run early when no data source is available
-    if not validator.all_sources:
-        st.stop()
-    
-    # File selection area
-    with st.container(height=75, horizontal=True, horizontal_alignment='left', gap="medium"):
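-        # A horizontal container lays its children out in a row; the horizontal /
-        # horizontal_alignment keywords assume a recent Streamlit release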
-        # Initialize the selection index in session_state
-        if 'selected_file_index' not in st.session_state:
-            st.session_state.selected_file_index = 0
-            
-        if validator.display_options:
-            # File selection dropdown
-            selected_index = st.selectbox(
-                "选择OCR结果文件", 
-                range(len(validator.display_options)),
-                format_func=lambda i: validator.display_options[i],
-                index=st.session_state.selected_file_index,
-                key="selected_selectbox",
-                label_visibility="collapsed"
-            )
-            
-            # Keep session_state in sync with the widget value
-            if selected_index != st.session_state.selected_file_index:
-                st.session_state.selected_file_index = selected_index
-
-            selected_file = validator.file_paths[selected_index]
-
-            # Page number input
-            current_page = validator.file_info[selected_index]['page']
-            page_input = st.number_input(
-                "输入页码", 
-                placeholder="输入页码", 
-                label_visibility="collapsed",
-                min_value=1, 
-                max_value=len(validator.display_options), 
-                value=current_page, 
-                step=1,
-                key="page_input"
-            )
-            
-            # When the page number changes, switch to the matching file
-            if page_input != current_page:
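-                # Find the file whose page matches, then rerun so the
-                # selectbox above picks up the new index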
-                for i, info in enumerate(validator.file_info):
-                    if info['page'] == page_input:
-                        st.session_state.selected_file_index = i
-                        selected_file = validator.file_paths[i]
-                        st.rerun()
-                        break
-
-            # Auto-load the file only when the selection actually changed
-            if (st.session_state.selected_file_index >= 0
-                and validator.selected_file_index != st.session_state.selected_file_index
-                and selected_file):
-                validator.selected_file_index = st.session_state.selected_file_index
-                st.session_state.validator.load_ocr_data(selected_file)
-                
-                # Report which source and page were loaded
-                current_source_name = get_data_source_display_name(validator.current_source_config)
-                st.success(f"✅ 已加载 {current_source_name} - 第{validator.file_info[st.session_state.selected_file_index]['page']}页")
-                st.rerun()
-        else:
-            st.warning("当前数据源中未找到OCR结果文件")
-
-        # Cross-validation button (needs a loaded image and markdown content)
-        if st.button("交叉验证", type="primary", icon=":material/compare_arrows:"):
-            if validator.image_path and validator.md_content:
-                validator.cross_validation()
-            else:
-                message_box("❌ 请先选择OCR数据文件", "error")
-
-        # Button to view the stored batch pre-validation results
-        if st.button("查看验证结果", type="secondary", icon=":material/quick_reference_all:"):
-            validator.show_batch_cross_validation_results_dialog()
-
-    # Statistics for the current data source
-    with st.expander("🔧 OCR工具统计信息", expanded=False):
-        stats = validator.get_statistics()
-        col1, col2, col3, col4, col5 = st.columns(5)
-        
-        with col1:
-            st.metric("📊 总文本块", stats['total_texts'])
-        with col2:
-            st.metric("🔗 可点击文本", stats['clickable_texts'])
-        with col3:
-            st.metric("❌ 标记错误", stats['marked_errors'])
-        with col4:
-            st.metric("✅ 准确率", f"{stats['accuracy_rate']:.1f}%")
-        with col5:
-            # Current data source info
-            if validator.current_source_config:
-                tool_display = validator.current_source_config['ocr_tool'].upper()
-                st.metric("🔧 OCR工具", tool_display)
-        
-        # Detailed tool info
-        if stats['tool_info']:
-            st.write("**详细信息:**", stats['tool_info'])
-    
-    # Main tab layout
-    tab1, tab2, tab3 = st.tabs(["📄 内容人工检查", "🔍 交叉验证结果", "📊 表格分析"])
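-    # tab1: manual content review · tab2: cross-validation diff view · tab3: table analysis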
-    
-    with tab1:
-        validator.create_compact_layout(config)
-
-    with tab2:
-        current_md_path = Path(validator.file_paths[validator.selected_file_index]).with_suffix('.md')
-        pre_validation_dir = Path(validator.config['pre_validation'].get('out_dir', './output/pre_validation/')).resolve()
-        comparison_result_path = pre_validation_dir / f"{current_md_path.stem}_cross_validation.json"
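-        # The cross-validation step stores its output as
-        # <page-stem>_cross_validation.json in the pre-validation directory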
-        verify_md_path = validator.find_verify_md_path(validator.selected_file_index)
-        
-        if comparison_result_path.exists():
-            # Load and display the stored comparison result
-            with open(comparison_result_path, "r", encoding="utf-8") as f:
-                comparison_result = json.load(f)
-
-            # OCR result on the left, verification (VLM) result on the right;
-            # the render settings are shared, so read them once
-            font_size = config['styles'].get('font_size', 10)
-            height = config['styles']['layout'].get('default_height', 800)
-            layout_type = "compact"
-            col1, col2 = st.columns([1, 1])
-            with col1:
-                st.subheader("🤖 原OCR识别结果")
-                with open(current_md_path, "r", encoding="utf-8") as f:
-                    original_md_content = f.read()
-                validator.layout_manager.render_content_by_mode(original_md_content, "HTML渲染", font_size, height, layout_type)
-            with col2:
-                st.subheader("🤖 验证识别结果")
-                with open(str(verify_md_path), "r", encoding="utf-8") as f:
-                    verify_md_content = f.read()
-                validator.layout_manager.render_content_by_mode(verify_md_content, "HTML渲染", font_size, height, layout_type)
-
-            # Show the diff statistics
-            st.markdown("---")
-            validator.display_comparison_results(comparison_result, detailed=True)
-        else:
-            st.info("暂无预校验结果,请先运行VLM预校验")
-
-    with tab3:
-        # Table analysis page
-        st.header("📊 表格数据分析")
-        
-        if validator.md_content and '<table' in validator.md_content.lower():
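-            # A cheap substring check is enough to detect tables; the
-            # DataFrame view below does the real HTML parsing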
-            st.subheader("🔍 表格数据预览")
-            validator.display_html_table_as_dataframe(validator.md_content)
-            
-        else:
-            st.info("当前OCR结果中没有检测到表格数据")
-    
-if __name__ == "__main__":
-    main()

+ 9 - 3
streamlit_validator_core.py

@@ -7,7 +7,7 @@ from typing import Dict, List, Optional
 import json
 
 from ocr_validator_utils import (
-    load_config, load_ocr_data_file, process_ocr_data,
+    load_ocr_data_file, process_ocr_data,
     get_ocr_statistics, find_available_ocr_files_multi_source, 
     get_data_source_display_name
 )
@@ -17,8 +17,14 @@ from ocr_validator_layout import OCRLayoutManager
 class StreamlitOCRValidator:
     """核心验证器类"""
     
-    def __init__(self):
-        self.config = load_config()
+    def __init__(self, config_dict: Optional[Dict] = None):
+        """
+        Initialize the validator.
+        
+        Args:
+            config_dict: configuration dict, typically produced by ConfigManager.to_validator_config()
+        """
+        self.config = config_dict or {}  # injected config replaces the old load_config() call
         self.ocr_data = []
         self.md_content = ""
         self.image_path = ""