
Implement code changes to enhance functionality and improve performance

zhch158_admin 1 week ago
parent
commit
206d52f443
2 changed files with 0 additions and 2083 deletions
  1. 0 587
      STREAMLIT_GUIDE.md
  2. 0 1496
      streamlit_ocr_validator_v1.py

+ 0 - 587
STREAMLIT_GUIDE.md

@@ -1,587 +0,0 @@
-# 🚀 Streamlit OCR可视化校验工具使用指南
-
-## 🎯 工具介绍
-
-基于Streamlit开发的OCR可视化校验工具,提供现代化的Web界面和丰富的交互体验,让OCR结果校验变得直观高效。特别新增了强大的表格数据分析功能。
-
-### 🔧 核心功能
-
-- ✅ **实时交互**: 点击文本即时高亮图片位置
-- ✅ **动态过滤**: 搜索、类别筛选、条件过滤
-- ✅ **数据表格**: 可排序的详细数据视图
-- ✅ **统计信息**: 实时统计和进度跟踪
-- ✅ **错误标记**: 一键标记和管理识别错误
-- ✅ **报告导出**: 生成详细的校验报告
-- ⭐ **表格分析**: HTML表格转DataFrame,支持过滤、排序、导出
-- ⭐ **多种渲染**: HTML/Markdown/DataFrame/原始文本四种显示模式
-
-## 🚀 快速启动
-
-### 1. 安装依赖
-
-```bash
-# 安装Streamlit和相关依赖
-pip install streamlit plotly pandas pillow numpy opencv-python openpyxl lxml  # lxml 供 pandas.read_html 解析HTML表格
-```
-
-### 2. 启动应用
-
-```bash
-# 方法1: 完整功能版本
-python -m streamlit run streamlit_ocr_validator.py
-
-# 方法2: 使用启动脚本
-python run_streamlit_validator.py
-
-# 方法3: 开发模式(自动重载)
-streamlit run streamlit_ocr_validator.py --server.runOnSave true
-```
-
-### 3. 访问界面
-
-浏览器会自动打开 http://localhost:8501,如果没有自动打开,请手动访问该地址。
-
-## 🖥️ 界面使用指南
-
-### 主界面布局
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│ 🔍 OCR可视化校验工具                                                 │
-├─────────────────────────────────────────────────────────────────────┤
-│ 📊 总文本块: 13  🔗 可点击: 9  ❌ 标记错误: 2  ✅ 准确率: 85.7%      │
-├─────────────────────────┬───────────────────────────────────────────┤
-│ 📄 OCR识别内容            │ 🖼️ 原图标注                               │
-│                        │                                           │
-│ 🔍 搜索框               │ [显示选中位置的红框标注]                   │
-│ 📍 选择文本下拉框        │ [图片缩放和详细信息]                       │
-│                        │                                           │
-│ 📝 MD内容预览           │ 📍 选中文本详情                           │
-│ [4种渲染模式选择]        │ - 文本内容: xxx                           │
-│ ○ HTML渲染             │ - 位置: [x1,y1,x2,y2]                    │
-│ ● Markdown渲染         │ - 宽度: xxx px                            │
-│ ○ DataFrame表格 ⭐      │ - 高度: xxx px                            │
-│ ○ 原始文本             │                                           │
-│                        │                                           │
-│ 🎯 可点击文本列表        │                                           │
-│ [📍 文本1] [❌] [✅]     │                                           │
-│ [📍 文本2] [❌] [✅]     │                                           │
-└─────────────────────────┴───────────────────────────────────────────┘
-```
-
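-上图的左右两栏布局大致可以用 `st.columns` 搭出来。以下是一个极简骨架示意(仅演示布局结构,统计数字沿用上图示例,组件内容从略):
-
-```python
-import streamlit as st
-
-st.title("🔍 OCR可视化校验工具")
-
-# 顶部统计信息
-m1, m2, m3, m4 = st.columns(4)
-m1.metric("总文本块", 13)
-m2.metric("可点击", 9)
-m3.metric("标记错误", 2)
-m4.metric("准确率", "85.7%")
-
-# 左栏:识别内容;右栏:原图标注
-col_text, col_image = st.columns(2)
-with col_text:
-    st.subheader("📄 OCR识别内容")
-    st.text_input("🔍 搜索")
-with col_image:
-    st.subheader("🖼️ 原图标注")
-    st.caption("选中文本后在此显示红框标注")
-```
-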
-### 侧边栏功能
-
-```
-┌─────────────────────┐
-│ 📁 文件选择          │
-│ [选择OCR结果文件]    │
-│ [🔄 加载文件]       │
-│                    │
-│ 🎛️ 控制面板         │
-│ [🧹 清除选择]       │
-│ [❌ 清除错误标记]   │
-│                    │
-│ 📊 表格快捷操作 ⭐   │
-│ [🔍 快速预览表格]   │
-│ [📥 一键导出所有表格] │
-│                    │
-│ 🔧 调试信息         │
-│ [调试信息开关]      │
-└─────────────────────┘
-```
-
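-侧边栏部分对应的组件写法大致如下(文件列表为假设,按钮回调从略,仅示意结构):
-
-```python
-import streamlit as st
-
-with st.sidebar:
-    st.header("📁 文件选择")
-    files = ["page_1.json", "page_2.json"]  # 假设的OCR结果文件列表
-    st.selectbox("选择OCR结果文件", files)
-    st.button("🔄 加载文件")
-
-    st.header("🎛️ 控制面板")
-    st.button("🧹 清除选择")
-    st.button("❌ 清除错误标记")
-
-    st.header("📊 表格快捷操作")
-    st.button("🔍 快速预览表格")
-    st.button("📥 一键导出所有表格")
-
-    st.header("🔧 调试信息")
-    st.checkbox("调试信息开关")
-```
-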
-### 使用步骤
-
-1. **选择文件**
-   - 从侧边栏下拉框中选择OCR结果JSON文件
-   - 点击"🔄 加载文件"按钮加载数据
-   - 查看顶部统计信息确认加载成功
-
-2. **浏览统计信息**
-   - 查看总文本块数量
-   - 了解可点击文本数量
-   - 确认图片是否正确加载
-   - 查看当前准确率
-
-3. **交互校验**
-   - 使用下拉框选择要校验的文本
-   - 点击左侧的"📍 文本内容"按钮
-   - 观察右侧图片上的红色框标注(红框绘制的示意代码见本节步骤之后)
-   - 查看右下角显示的详细位置信息
-
-4. **搜索过滤**
-   - 使用搜索框快速定位特定文本
-   - 在MD内容预览中查看完整识别结果
-   - 通过搜索结果快速定位问题文本
-
-5. **表格数据分析** ⭐ 新增
-   - 选择"DataFrame表格"渲染模式
-   - 查看自动解析的HTML表格
-   - 使用过滤、排序功能分析数据
-   - 查看表格统计信息
-   - 导出CSV或Excel文件
-
-6. **错误标记管理**
-   - 点击文本旁边的"❌"按钮标记错误
-   - 点击"✅"按钮取消错误标记
-   - 观察准确率的实时变化
-   - 使用侧边栏批量清除标记
-
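-上面步骤 3 中"点击文本、在原图上高亮位置"的核心是按 bbox 坐标在图片上画框。以下是一个最小示意(假设 bbox 为 `[x1, y1, x2, y2]` 像素坐标,图片路径与坐标仅作演示):
-
-```python
-import streamlit as st
-from PIL import Image, ImageDraw
-
-def highlight_bbox(image_path: str, bbox: list) -> Image.Image:
-    """在原图上绘制红色矩形框,返回标注后的图片"""
-    img = Image.open(image_path).convert("RGB")
-    draw = ImageDraw.Draw(img)
-    draw.rectangle(bbox, outline="red", width=3)
-    return img
-
-# 示例用法(路径与坐标均为假设)
-# st.image(highlight_bbox("page_1.png", [120, 340, 560, 380]), caption="选中文本位置")
-```
-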
-## 🎨 高级功能
-
-### 完整版本独有功能 (`streamlit_ocr_validator.py`)
-
-#### 📊 错误标记系统
-- **标记错误**: 点击文本旁边的"❌"按钮标记识别错误
-- **取消标记**: 点击"✅"按钮取消错误标记
-- **统计准确率**: 自动计算识别准确率
-- **错误过滤**: 只显示标记为错误的文本
-- **批量操作**: 侧边栏提供批量清除功能(状态管理示意见本列表之后)
-
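-错误标记本质上是用 `st.session_state` 维护一个集合,并据此实时计算准确率。以下是一个简化示意(准确率按"(总数-错误数)/总数"计算,仅为假设,实际以 `get_ocr_statistics` 的实现为准):
-
-```python
-import streamlit as st
-
-# 初始化错误标记集合
-if "marked_errors" not in st.session_state:
-    st.session_state.marked_errors = set()
-
-def toggle_error(text_id: str):
-    """标记/取消标记某个文本块为识别错误"""
-    errors = st.session_state.marked_errors
-    if text_id in errors:
-        errors.discard(text_id)
-    else:
-        errors.add(text_id)
-
-total_blocks = 13  # 示例:总文本块数
-accuracy = (total_blocks - len(st.session_state.marked_errors)) / total_blocks * 100
-st.metric("准确率", f"{accuracy:.1f}%")
-```
-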
-#### 📈 表格数据分析 ⭐ 核心新功能
-- **智能表格检测**: 自动识别HTML表格内容
-- **DataFrame转换**: 将HTML表格转换为可操作的pandas DataFrame
-- **多维度操作**: 支持过滤、排序、搜索等操作
-- **统计分析**: 自动生成表格行列数、数值列统计等信息
-- **数据导出**: 支持CSV、Excel格式导出
-- **可视化图表**: 基于表格数据生成统计图表
-
-#### 🎛️ 多种渲染模式
-- **HTML渲染**: 原生HTML表格显示,保持格式
-- **Markdown渲染**: 转换为Markdown表格格式
-- **DataFrame表格**: 转换为可交互的数据表格 ⭐
-- **原始文本**: 纯文本格式显示(四种模式的分发示意见本列表之后)
-
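-渲染模式切换可以用一个 `st.radio` 做分发,以下是一个最小示意(`md_content` 为已加载的识别结果字符串;DataFrame 分支直接用 `pandas.read_html` 演示,实际项目中由 `display_html_table_as_dataframe` 处理):
-
-```python
-import pandas as pd
-import streamlit as st
-from io import StringIO
-
-def render_md_content(md_content: str):
-    """按选择的渲染模式显示OCR识别内容"""
-    mode = st.radio("渲染模式", ["HTML渲染", "Markdown渲染", "DataFrame表格", "原始文本"],
-                    horizontal=True)
-    if mode == "HTML渲染":
-        st.markdown(md_content, unsafe_allow_html=True)
-    elif mode == "Markdown渲染":
-        st.markdown(md_content)
-    elif mode == "DataFrame表格":
-        if "<table" in md_content.lower():
-            for df in pd.read_html(StringIO(md_content)):
-                st.dataframe(df)
-        else:
-            st.info("当前内容中没有检测到HTML表格")
-    else:
-        st.text(md_content)
-```
-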
-#### 🔧 侧边栏控制
-- **文件管理**: 侧边栏选择和管理OCR文件
-- **控制面板**: 清除选择、清除错误标记等操作
-- **表格快捷操作**: 快速预览和导出表格功能 ⭐
-- **调试信息**: 详细的系统状态和数据信息
-
-#### 📊 过滤和筛选
-- **多条件过滤**: 按类别、错误状态、尺寸等多重筛选
-- **实时搜索**: 动态搜索文本内容
-- **数据表格**: 可排序、可筛选的完整数据视图(过滤与排序的示意见本列表之后)
-
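-过滤与排序最终落在 pandas 的 `str.contains` 与 `sort_values` 上,以下是一个脱离界面的最小示意(列名与数据均为假设):
-
-```python
-import pandas as pd
-
-df = pd.DataFrame({
-    "交易日期": ["2024-01-01", "2024-01-02", "2024-02-01"],
-    "金额": [100.0, 250.5, 80.0],
-})
-
-# 按文本包含关系过滤,再按金额降序排序
-filtered = df[df["交易日期"].astype(str).str.contains("2024-01", na=False)]
-result = filtered.sort_values("金额", ascending=False)
-print(result)
-```
-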
-### 表格分析功能详解 ⭐
-
-#### 功能特性
-
-```python
-# 表格检测和转换
-if '<table' in display_content.lower():
-    st.session_state.validator.display_html_table_as_dataframe(display_content)
-else:
-    st.info("当前内容中没有检测到HTML表格")
-```
-
-#### 支持的表格操作
-
-1. **基础操作**
-   - 自动检测HTML表格
-   - 转换为pandas DataFrame
-   - 表格信息统计(行数、列数、列名等)
-
-2. **数据过滤**
-   - 按列内容过滤
-   - 支持文本搜索
-   - 条件筛选
-
-3. **数据排序**
-   - 按任意列排序
-   - 升序/降序选择
-   - 多列排序
-
-4. **统计分析**
-   - 数值列描述性统计
-   - 数据类型分析
-   - 缺失值统计
-
-5. **数据导出**
-   - CSV格式导出
-   - Excel格式导出
-   - 支持过滤后数据导出
-
-#### 使用示例
-
-```python
-# 在streamlit界面中
-def display_html_table_as_dataframe(self, html_content: str, enable_editing: bool = False):
-    """将HTML表格解析为DataFrame显示"""
-    import pandas as pd
-    from io import StringIO, BytesIO
-    
-    try:
-        # 使用pandas直接读取HTML表格
-        tables = pd.read_html(StringIO(html_content))
-        if tables:
-            for i, table in enumerate(tables):
-                st.subheader(f"📊 表格 {i+1}")
-                
-                # 创建表格操作按钮
-                col1, col2, col3, col4 = st.columns(4)
-                with col1:
-                    show_info = st.checkbox(f"显示表格信息", key=f"info_{i}")
-                with col2:
-                    show_stats = st.checkbox(f"显示统计信息", key=f"stats_{i}")
-                with col3:
-                    enable_filter = st.checkbox(f"启用过滤", key=f"filter_{i}")
-                with col4:
-                    enable_sort = st.checkbox(f"启用排序", key=f"sort_{i}")
-                
-                # 显示表格
-                st.dataframe(table, width='stretch')
-    except ValueError as e:
-        st.error(f"表格解析失败: {e}")
-```
-
-## 🔧 自定义开发
-
-### 扩展功能开发
-
-#### 1. 添加新的表格处理功能 ⭐
-
-```python
-import numpy as np
-import plotly.express as px
-import streamlit as st
-
-def create_table_visualization(df):
-    """创建表格数据可视化"""
-    if not df.empty:
-        numeric_cols = df.select_dtypes(include=[np.number]).columns
-        
-        if len(numeric_cols) > 0:
-            # 创建统计图表
-            fig = px.bar(
-                x=df.index,
-                y=df[numeric_cols[0]],
-                title=f"{numeric_cols[0]} 分布"
-            )
-            st.plotly_chart(fig, width='stretch')
-            
-            # 创建散点图
-            if len(numeric_cols) > 1:
-                fig_scatter = px.scatter(
-                    df, 
-                    x=numeric_cols[0], 
-                    y=numeric_cols[1],
-                    title=f"{numeric_cols[0]} vs {numeric_cols[1]}"
-                )
-                st.plotly_chart(fig_scatter, width='stretch')
-
-# 在主应用中使用
-if st.checkbox("显示数据可视化"):
-    create_table_visualization(filtered_table)
-```
-
-#### 2. 高级表格编辑功能
-
-```python
-def advanced_table_editor(df):
-    """高级表格编辑器"""
-    st.subheader("🔧 高级编辑")
-    
-    # 数据编辑
-    edited_df = st.data_editor(
-        df,
-        width='stretch',
-        num_rows="dynamic",  # 允许添加删除行
-        key="advanced_editor"
-    )
-    
-    # 数据验证
-    if not edited_df.equals(df):
-        st.success("✏️ 数据已修改")
-        
-        # 显示变更统计
-        changes = len(edited_df) - len(df)
-        st.info(f"行数变化: {changes:+d}")
-        
-        # 导出修改后的数据
-        if st.button("💾 保存修改"):
-            csv_data = edited_df.to_csv(index=False)
-            st.download_button(
-                "下载修改后的数据",
-                csv_data,
-                "modified_table.csv",
-                "text/csv"
-            )
-    
-    return edited_df
-```
-
-#### 3. 批量表格处理
-
-```python
-import pandas as pd
-import streamlit as st
-from io import StringIO
-
-def batch_table_processing():
-    """批量表格处理功能"""
-    st.subheader("📦 批量表格处理")
-    
-    uploaded_files = st.file_uploader(
-        "上传多个包含表格的文件", 
-        type=['md', 'html'], 
-        accept_multiple_files=True
-    )
-    
-    if uploaded_files and st.button("开始批量处理"):
-        all_tables = []
-        progress_bar = st.progress(0)
-        
-        for i, file in enumerate(uploaded_files):
-            content = file.read().decode('utf-8')
-            
-            if '<table' in content.lower():
-                tables = pd.read_html(StringIO(content))
-                for j, table in enumerate(tables):
-                    table['source_file'] = file.name
-                    table['table_index'] = j
-                    all_tables.append(table)
-            
-            progress_bar.progress((i + 1) / len(uploaded_files))
-        
-        if all_tables:
-            st.success(f"✅ 共处理 {len(all_tables)} 个表格")
-            
-            # 合并所有表格
-            if st.checkbox("合并所有表格"):
-                try:
-                    merged_df = pd.concat(all_tables, ignore_index=True)
-                    st.dataframe(merged_df)
-                    
-                    # 导出合并结果
-                    csv_data = merged_df.to_csv(index=False)
-                    st.download_button(
-                        "下载合并表格",
-                        csv_data,
-                        "merged_tables.csv",
-                        "text/csv"
-                    )
-                except Exception as e:
-                    st.error(f"合并失败: {e}")
-```
-
-#### 4. 表格数据质量检查 ⭐
-
-```python
-import pandas as pd
-import streamlit as st
-
-def table_quality_check(df):
-    """表格数据质量检查"""
-    st.subheader("🔍 数据质量检查")
-    
-    # 基础统计
-    col1, col2, col3 = st.columns(3)
-    with col1:
-        st.metric("总行数", len(df))
-    with col2:
-        st.metric("总列数", len(df.columns))
-    with col3:
-        null_percent = (df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100
-        st.metric("缺失值比例", f"{null_percent:.1f}%")
-    
-    # 详细质量报告
-    quality_issues = []
-    
-    # 检查空值
-    null_cols = df.columns[df.isnull().any()].tolist()
-    if null_cols:
-        quality_issues.append(f"发现 {len(null_cols)} 列存在空值: {', '.join(null_cols)}")
-    
-    # 检查重复行
-    duplicate_rows = df.duplicated().sum()
-    if duplicate_rows > 0:
-        quality_issues.append(f"发现 {duplicate_rows} 行重复数据")
-    
-    # 检查数据类型一致性
-    for col in df.columns:
-        if df[col].dtype == 'object':
-            # 检查是否应该是数值类型
-            numeric_like = df[col].str.replace(',', '').str.replace('$', '')
-            try:
-                pd.to_numeric(numeric_like, errors='raise')
-                quality_issues.append(f"列 '{col}' 可能应该是数值类型")
-            except (ValueError, TypeError):
-                pass
-    
-    if quality_issues:
-        st.warning("⚠️ 发现数据质量问题:")
-        for issue in quality_issues:
-            st.write(f"- {issue}")
-    else:
-        st.success("✅ 数据质量良好")
-```
-
-## 📊 性能优化
-
-### 1. 缓存优化
-
-```python
-import json
-import pandas as pd
-import streamlit as st
-from io import StringIO
-from PIL import Image
-
-@st.cache_data
-def load_and_process_ocr_data(file_path: str):
-    """缓存OCR数据加载和处理"""
-    with open(file_path, 'r') as f:
-        ocr_data = json.load(f)
-    
-    # 处理数据
-    processed_data = process_ocr_data(ocr_data)
-    return processed_data
-
-@st.cache_resource
-def load_image(image_path: str):
-    """缓存图片加载"""
-    return Image.open(image_path)
-
-@st.cache_data
-def parse_html_tables(html_content: str):
-    """缓存表格解析结果"""
-    try:
-        tables = pd.read_html(StringIO(html_content))
-        return tables
-    except ValueError:
-        return []
-```
-
-### 2. 大文件处理
-
-```python
-import streamlit as st
-
-def handle_large_tables(df):
-    """处理大型表格"""
-    if 'page_size' not in st.session_state:
-        st.session_state.page_size = 100
-    
-    # 分页显示表格
-    if not df.empty:
-        total_rows = len(df)
-        pages = (total_rows - 1) // st.session_state.page_size + 1
-        
-        col1, col2, col3 = st.columns([1, 2, 1])
-        with col2:
-            current_page = st.slider("页数", 1, pages, 1)
-        
-        # 显示当前页数据
-        start_idx = (current_page - 1) * st.session_state.page_size
-        end_idx = min(start_idx + st.session_state.page_size, total_rows)
-        current_df = df.iloc[start_idx:end_idx]
-        
-        st.dataframe(current_df, width='stretch')
-        st.info(f"显示第 {start_idx+1}-{end_idx} 行,共 {total_rows} 行")
-```
-
-### 3. 内存优化
-
-```python
-import pandas as pd
-import streamlit as st
-
-def optimize_dataframe_memory(df):
-    """优化DataFrame内存使用"""
-    initial_memory = df.memory_usage(deep=True).sum()
-    
-    # 优化数值类型
-    for col in df.select_dtypes(include=['int']).columns:
-        df[col] = pd.to_numeric(df[col], downcast='integer')
-    
-    for col in df.select_dtypes(include=['float']).columns:
-        df[col] = pd.to_numeric(df[col], downcast='float')
-    
-    # 优化字符串类型
-    for col in df.select_dtypes(include=['object']).columns:
-        if df[col].nunique() < len(df) * 0.5:  # 如果唯一值少于50%,转换为category
-            df[col] = df[col].astype('category')
-    
-    final_memory = df.memory_usage(deep=True).sum()
-    reduction = (initial_memory - final_memory) / initial_memory * 100
-    
-    st.info(f"内存优化:减少 {reduction:.1f}% ({initial_memory/1024/1024:.1f}MB → {final_memory/1024/1024:.1f}MB)")
-    
-    return df
-```
-
-## 🚀 部署指南
-
-### 本地开发部署
-
-```bash
-# 开发模式运行(自动重载)
-streamlit run streamlit_ocr_validator.py --server.runOnSave true
-
-# 指定端口运行
-streamlit run streamlit_ocr_validator.py --server.port 8502
-
-# 指定主机运行(局域网访问)
-streamlit run streamlit_ocr_validator.py --server.address 0.0.0.0
-```
-
-### Docker部署
-
-```dockerfile
-FROM python:3.9-slim
-
-WORKDIR /app
-
-# 安装系统依赖
-RUN apt-get update && apt-get install -y \
-    libgl1-mesa-glx \
-    libglib2.0-0 \
-    libsm6 \
-    libxext6 \
-    libxrender-dev \
-    libgomp1 \
-    && rm -rf /var/lib/apt/lists/*
-
-# 安装Python依赖
-COPY requirements.txt .
-RUN pip install -r requirements.txt
-
-COPY . .
-
-EXPOSE 8501
-CMD ["streamlit", "run", "streamlit_ocr_validator.py", "--server.address=0.0.0.0"]
-```
-
-### Streamlit Cloud部署
-
-1. 将代码推送到GitHub仓库(仓库中需包含 requirements.txt,示例见下方)
-2. 访问 https://share.streamlit.io/
-3. 连接GitHub仓库并部署
-4. 设置环境变量(如API密钥)
-
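-Streamlit Cloud 会根据仓库根目录的 requirements.txt 安装依赖,以下是一个参考示例(未固定版本号,请按实际环境锁定;`lxml` 供 `pandas.read_html` 解析HTML表格,服务器环境常用 `opencv-python-headless`,可按需换回 `opencv-python`):
-
-```text
-streamlit
-pandas
-plotly
-pillow
-numpy
-opencv-python-headless
-openpyxl
-lxml
-```
-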
-## 💡 最佳实践
-
-### 1. 用户体验优化
-
-- **加载状态**: 使用`st.spinner()`显示加载状态(与进度、错误处理的组合用法见本列表之后)
-- **错误处理**: 使用`st.error()`友好地显示错误信息
-- **进度提示**: 使用`st.progress()`显示处理进度
-- **数据缓存**: 合理使用`@st.cache_data`提升性能
-
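-这几项通常组合使用,以下是一个简化示意(用 `time.sleep` 模拟耗时处理):
-
-```python
-import time
-import streamlit as st
-
-with st.spinner("正在加载OCR数据..."):
-    progress = st.progress(0)
-    try:
-        for i in range(100):
-            time.sleep(0.01)      # 模拟耗时处理
-            progress.progress(i + 1)
-        st.success("✅ 加载完成")
-    except Exception as e:
-        st.error(f"❌ 加载失败: {e}")
-```
-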
-### 2. 界面设计
-
-- **布局清晰**: 使用`st.columns()`合理分布内容
-- **视觉层次**: 使用不同级别的标题和分隔符
-- **交互反馈**: 及时响应用户操作
-- **移动友好**: 考虑不同屏幕尺寸的适配
-
-### 3. 表格处理最佳实践 ⭐
-
-- **大表格处理**: 对超过1000行的表格启用分页显示
-- **内存管理**: 使用数据类型优化减少内存使用
-- **导出优化**: 大表格导出时显示进度条
-- **错误处理**: 优雅处理表格解析失败的情况
-
-### 4. 数据安全
-
-- **输入验证**: 验证上传文件的格式和内容(校验示意见本列表之后)
-- **错误处理**: 妥善处理异常情况
-- **资源清理**: 及时清理临时文件和内存
-
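-对上传的OCR结果JSON做一次基础校验即可拦截大部分问题,以下是一个简化示意(顶层结构按"文本块数组"假设,实际以所用OCR工具的输出格式为准):
-
-```python
-import json
-import streamlit as st
-
-uploaded = st.file_uploader("上传OCR结果", type=["json"])
-if uploaded is not None:
-    try:
-        data = json.loads(uploaded.read().decode("utf-8"))
-        if not isinstance(data, list):
-            st.error("格式错误:期望JSON数组(文本块列表)")
-        else:
-            st.success(f"校验通过,共 {len(data)} 个文本块")
-    except (UnicodeDecodeError, json.JSONDecodeError) as e:
-        st.error(f"文件解析失败: {e}")
-```
-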
-## 🎉 总结
-
-Streamlit版本的OCR校验工具经过升级后提供了更加强大的功能:
-
-✅ **基础功能**:实时交互、动态更新、错误管理  
-✅ **表格分析**:HTML表格转DataFrame、多种操作、导出功能 ⭐  
-✅ **数据处理**:过滤、排序、统计分析、可视化 ⭐  
-✅ **批量操作**:多文件处理、批量导出、合并功能 ⭐  
-✅ **质量检查**:数据质量分析、问题检测、优化建议 ⭐  
-✅ **扩展性**:易于添加新功能和自定义组件  
-✅ **用户体验**:现代化界面、响应式设计、直观操作  
-
-新增的表格分析功能使其不仅能够校验OCR结果,更能深入分析表格数据,成为一个完整的OCR数据处理工作台!
-
----
-
-> 🌟 **特别推荐**:使用DataFrame表格模式分析财务报表等结构化数据,体验完整的数据处理工作流程。

+ 0 - 1496
streamlit_ocr_validator_v1.py

@@ -1,1496 +0,0 @@
-#!/usr/bin/env python3
-"""
-基于Streamlit的OCR可视化校验工具(重构版)
-提供丰富的交互组件和更好的用户体验
-"""
-
-import streamlit as st
-from pathlib import Path
-from PIL import Image
-from typing import Dict, List, Optional
-import plotly.graph_objects as go
-from io import BytesIO
-import pandas as pd
-import numpy as np
-import plotly.express as px
-import json
-
-# 导入工具模块
-from ocr_validator_utils import (
-    load_config, load_ocr_data_file, process_ocr_data,
-    get_ocr_statistics,
-    find_available_ocr_files,
-    group_texts_by_category,
-    find_available_ocr_files_multi_source, get_data_source_display_name
-)
-from ocr_validator_file_utils import (
-    load_css_styles,
-    draw_bbox_on_image,
-    convert_html_table_to_markdown,
-    parse_html_tables, 
-    create_dynamic_css,
-    export_tables_to_excel, 
-    get_table_statistics,
-)
-from ocr_validator_layout import OCRLayoutManager
-from ocr_by_vlm import ocr_with_vlm
-from compare_ocr_results import compare_ocr_results
-
-
-class StreamlitOCRValidator:
-    def __init__(self):
-        self.config = load_config()
-        self.ocr_data = []
-        self.md_content = ""
-        self.image_path = ""
-        self.text_bbox_mapping = {}
-        self.selected_text = None
-        self.marked_errors = set()
-        
-        # 多数据源相关
-        self.all_sources = {}
-        self.current_source_key = None
-        self.current_source_config = None
-        self.file_info = []
-        self.selected_file_index = -1
-        self.display_options = []
-        self.file_paths = []
-        
-        # ✅ 新增:交叉验证数据源
-        self.verify_source_key = None
-        self.verify_source_config = None
-        self.verify_file_info = []
-        self.verify_display_options = []
-        self.verify_file_paths = []
-
-        # 初始化布局管理器
-        self.layout_manager = OCRLayoutManager(self)
-
-        # 加载多数据源文件信息
-        self.load_multi_source_info()
-        
-    def load_multi_source_info(self):
-        """加载多数据源文件信息"""
-        self.all_sources = find_available_ocr_files_multi_source(self.config)
-        
-        # 如果有数据源,默认选择第一个作为OCR源
-        if self.all_sources:
-            source_keys = list(self.all_sources.keys())
-            first_source_key = source_keys[0]
-            self.switch_to_source(first_source_key)
-            
-            # 如果有第二个数据源,默认作为验证源
-            if len(source_keys) > 1:
-                self.switch_to_verify_source(source_keys[1])
-    
-    def switch_to_source(self, source_key: str):
-        """切换到指定OCR数据源"""
-        if source_key in self.all_sources:
-            self.current_source_key = source_key
-            source_data = self.all_sources[source_key]
-            self.current_source_config = source_data['config']
-            self.file_info = source_data['files']
-            
-            if self.file_info:
-                # 创建显示选项列表
-                self.display_options = [f"{info['display_name']}" for info in self.file_info]
-                self.file_paths = [info['path'] for info in self.file_info]
-                
-                # 重置文件选择
-                self.selected_file_index = -1
-                print(f"✅ 切换到OCR数据源: {source_key}")
-            else:
-                print(f"⚠️ 数据源 {source_key} 没有可用文件")
-    
-    def switch_to_verify_source(self, source_key: str):
-        """切换到指定验证数据源"""
-        if source_key in self.all_sources:
-            self.verify_source_key = source_key
-            source_data = self.all_sources[source_key]
-            self.verify_source_config = source_data['config']
-            self.verify_file_info = source_data['files']
-            
-            if self.verify_file_info:
-                self.verify_display_options = [f"{info['display_name']}" for info in self.verify_file_info]
-                self.verify_file_paths = [info['path'] for info in self.verify_file_info]
-                print(f"✅ 切换到验证数据源: {source_key}")
-            else:
-                print(f"⚠️ 验证数据源 {source_key} 没有可用文件")
-
-    def setup_page_config(self):
-        """设置页面配置"""
-        ui_config = self.config['ui']
-        st.set_page_config(
-            page_title=ui_config['page_title'],
-            page_icon=ui_config['page_icon'],
-            layout=ui_config['layout'],
-            initial_sidebar_state=ui_config['sidebar_state']
-        )
-        
-        # 加载CSS样式
-        css_content = load_css_styles()
-        st.markdown(f"<style>{css_content}</style>", unsafe_allow_html=True)
-
-    def create_data_source_selector(self):
-        """创建双数据源选择器 - 支持交叉验证"""
-        if not self.all_sources:
-            st.warning("❌ 未找到任何数据源,请检查配置文件")
-            return
-        
-        # 准备数据源选项
-        source_options = {}
-        for source_key, source_data in self.all_sources.items():
-            display_name = get_data_source_display_name(source_data['config'])
-            source_options[display_name] = source_key
-        
-        # 创建两列布局
-        col1, col2 = st.columns(2)
-        
-        with col1:
-            st.markdown("#### 📊 OCR数据源")
-            # OCR数据源选择
-            current_display_name = None
-            if self.current_source_key:
-                for display_name, key in source_options.items():
-                    if key == self.current_source_key:
-                        current_display_name = display_name
-                        break
-            
-            selected_ocr_display = st.selectbox(
-                "选择OCR数据源",
-                options=list(source_options.keys()),
-                index=list(source_options.keys()).index(current_display_name) if current_display_name else 0,
-                key="ocr_source_selector",
-                label_visibility="collapsed",
-                help="选择要分析的OCR数据源"
-            )
-            
-            selected_ocr_key = source_options[selected_ocr_display]
-            
-            # 如果OCR数据源发生变化,切换数据源
-            if selected_ocr_key != self.current_source_key:
-                self.switch_to_source(selected_ocr_key)
-                if 'selected_file_index' in st.session_state:
-                    st.session_state.selected_file_index = 0
-                st.rerun()
-            
-            # 显示OCR数据源信息
-            if self.current_source_config:
-                with st.expander("📋 OCR数据源详情", expanded=False):
-                    st.write(f"**工具:** {self.current_source_config['ocr_tool']}")
-                    st.write(f"**文件数:** {len(self.file_info)}")
-        
-        with col2:
-            st.markdown("#### 🔍 验证数据源")
-            # 验证数据源选择
-            verify_display_name = None
-            if self.verify_source_key:
-                for display_name, key in source_options.items():
-                    if key == self.verify_source_key:
-                        verify_display_name = display_name
-                        break
-            
-            selected_verify_display = st.selectbox(
-                "选择验证数据源",
-                options=list(source_options.keys()),
-                index=list(source_options.keys()).index(verify_display_name) if verify_display_name else (1 if len(source_options) > 1 else 0),
-                key="verify_source_selector",
-                label_visibility="collapsed",
-                help="选择用于交叉验证的数据源"
-            )
-            
-            selected_verify_key = source_options[selected_verify_display]
-            
-            # 如果验证数据源发生变化,切换数据源
-            if selected_verify_key != self.verify_source_key:
-                self.switch_to_verify_source(selected_verify_key)
-                st.rerun()
-            
-            # 显示验证数据源信息
-            if self.verify_source_config:
-                with st.expander("📋 验证数据源详情", expanded=False):
-                    st.write(f"**工具:** {self.verify_source_config['ocr_tool']}")
-                    st.write(f"**文件数:** {len(self.verify_file_info)}")
-        
-        # 数据源对比提示
-        if self.current_source_key == self.verify_source_key:
-            st.warning("⚠️ OCR数据源和验证数据源相同,建议选择不同的数据源进行交叉验证")
-        else:
-            st.success(f"✅ 已选择 {selected_ocr_display} 与 {selected_verify_display} 进行交叉验证")    
-    
-    def load_ocr_data(self, json_path: str, md_path: Optional[str] = None, image_path: Optional[str] = None):
-        """加载OCR相关数据 - 支持多数据源配置"""
-        try:
-            # 使用当前数据源的配置加载数据
-            if self.current_source_config:
-                # 临时修改config以使用当前数据源的配置
-                temp_config = self.config.copy()
-                temp_config['paths'] = {
-                    'ocr_out_dir': self.current_source_config['ocr_out_dir'],
-                    'src_img_dir': self.current_source_config.get('src_img_dir', ''),
-                    'pre_validation_dir': self.config['pre_validation']['out_dir']
-                }
-                
-                # 设置OCR工具类型
-                temp_config['current_ocr_tool'] = self.current_source_config['ocr_tool']
-                
-                self.ocr_data, self.md_content, self.image_path = load_ocr_data_file(json_path, temp_config)
-            else:
-                self.ocr_data, self.md_content, self.image_path = load_ocr_data_file(json_path, self.config)
-                
-            self.process_data()
-        except Exception as e:
-            st.error(f"❌ 加载失败: {e}")
-            st.exception(e)
-    
-    def process_data(self):
-        """处理OCR数据"""
-        self.text_bbox_mapping = process_ocr_data(self.ocr_data, self.config)
-    
-    def get_statistics(self) -> Dict:
-        """获取统计信息"""
-        return get_ocr_statistics(self.ocr_data, self.text_bbox_mapping, self.marked_errors)
-    
-    def display_html_table_as_dataframe(self, html_content: str, enable_editing: bool = False):
-        """将HTML表格解析为DataFrame显示 - 增强版本支持横向滚动"""
-        tables = parse_html_tables(html_content)
-        wide_table_threshold = 15  # 超宽表格列数阈值
-        
-        if not tables:
-            st.warning("未找到可解析的表格")
-            # 对于无法解析的HTML表格,使用自定义CSS显示
-            st.markdown("""
-            <style>
-            .scrollable-table {
-                overflow-x: auto;
-                white-space: nowrap;
-                border: 1px solid #ddd;
-                border-radius: 5px;
-                margin: 10px 0;
-            }
-            .scrollable-table table {
-                width: 100%;
-                border-collapse: collapse;
-            }
-            .scrollable-table th, .scrollable-table td {
-                border: 1px solid #ddd;
-                padding: 8px;
-                text-align: left;
-                min-width: 100px;
-            }
-            .scrollable-table th {
-                background-color: #f5f5f5;
-                font-weight: bold;
-            }
-            </style>
-            """, unsafe_allow_html=True)
-            
-            st.markdown(f'<div class="scrollable-table">{html_content}</div>', unsafe_allow_html=True)
-            return
-            
-        for i, table in enumerate(tables):
-            st.subheader(f"📊 表格 {i+1}")
-            
-            # 表格信息显示
-            col_info1, col_info2, col_info3, col_info4 = st.columns(4)
-            with col_info1:
-                st.metric("行数", len(table))
-            with col_info2:
-                st.metric("列数", len(table.columns))
-            with col_info3:
-                # 检查是否有超宽表格
-                is_wide_table = len(table.columns) > wide_table_threshold
-                st.metric("表格类型", "超宽表格" if is_wide_table else "普通表格")
-            with col_info4:
-                # 表格操作模式选择
-                display_mode = st.selectbox(
-                    f"显示模式 (表格{i+1})",
-                    ["完整显示", "分页显示", "筛选列显示"],
-                    key=f"display_mode_{i}"
-                )
-            
-            # 创建表格操作按钮
-            col1, col2, col3, col4 = st.columns(4)
-            with col1:
-                show_info = st.checkbox(f"显示详细信息", key=f"info_{i}")
-            with col2:
-                show_stats = st.checkbox(f"显示统计信息", key=f"stats_{i}")
-            with col3:
-                enable_filter = st.checkbox(f"启用过滤", key=f"filter_{i}")
-            with col4:
-                enable_sort = st.checkbox(f"启用排序", key=f"sort_{i}")
-            
-            # 根据显示模式处理表格
-            display_table = self._process_table_display_mode(table, i, display_mode)
-            
-            # 数据过滤和排序逻辑
-            filtered_table = self._apply_table_filters_and_sorts(display_table, i, enable_filter, enable_sort)
-            
-            # 显示表格 - 使用自定义CSS支持横向滚动
-            st.markdown("""
-            <style>
-            .dataframe-container {
-                overflow-x: auto;
-                border: 1px solid #ddd;
-                border-radius: 5px;
-                margin: 10px 0;
-            }
-            
-            /* 为超宽表格特殊样式 */
-            .wide-table-container {
-                overflow-x: auto;
-                max-height: 500px;
-                overflow-y: auto;
-                border: 2px solid #0288d1;
-                border-radius: 8px;
-                background: linear-gradient(90deg, #f8f9fa 0%, #ffffff 100%);
-            }
-            
-            .dataframe thead th {
-                position: sticky;
-                top: 0;
-                background-color: #f5f5f5 !important;
-                z-index: 10;
-                border-bottom: 2px solid #0288d1;
-            }
-            
-            .dataframe tbody td {
-                white-space: nowrap;
-                min-width: 100px;
-                max-width: 300px;
-                overflow: hidden;
-                text-overflow: ellipsis;
-            }
-            </style>
-            """, unsafe_allow_html=True)
-            
-            # 根据表格宽度选择显示容器
-            container_class = "wide-table-container" if len(table.columns) > wide_table_threshold else "dataframe-container"
-            
-            if enable_editing:
-                st.markdown(f'<div class="{container_class}">', unsafe_allow_html=True)
-                edited_table = st.data_editor(
-                    filtered_table, 
-                    width='stretch', 
-                    key=f"editor_{i}",
-                    height=400 if len(table.columns) > 8 else None
-                )
-                st.markdown('</div>', unsafe_allow_html=True)
-                
-                if not edited_table.equals(filtered_table):
-                    st.success("✏️ 表格已编辑,可以导出修改后的数据")
-            else:
-                st.markdown(f'<div class="{container_class}">', unsafe_allow_html=True)
-                st.dataframe(
-                    filtered_table, 
-                    width=400 if len(table.columns) > wide_table_threshold else 'stretch'
-                )
-                st.markdown('</div>', unsafe_allow_html=True)
-            
-            # 显示表格信息和统计
-            self._display_table_info_and_stats(table, filtered_table, show_info, show_stats, i)
-            
-            st.markdown("---")
-    
-    def _apply_table_filters_and_sorts(self, table: pd.DataFrame, table_index: int, enable_filter: bool, enable_sort: bool) -> pd.DataFrame:
-        """应用表格过滤和排序"""
-        filtered_table = table.copy()
-        
-        # 数据过滤
-        if enable_filter and not table.empty:
-            filter_col = st.selectbox(
-                f"选择过滤列 (表格 {table_index+1})", 
-                options=['无'] + list(table.columns),
-                key=f"filter_col_{table_index}"
-            )
-            
-            if filter_col != '无':
-                filter_value = st.text_input(f"过滤值 (表格 {table_index+1})", key=f"filter_value_{table_index}")
-                if filter_value:
-                    filtered_table = table[table[filter_col].astype(str).str.contains(filter_value, na=False)]
-        
-        # 数据排序
-        if enable_sort and not filtered_table.empty:
-            sort_col = st.selectbox(
-                f"选择排序列 (表格 {table_index+1})", 
-                options=['无'] + list(filtered_table.columns),
-                key=f"sort_col_{table_index}"
-            )
-            
-            if sort_col != '无':
-                sort_order = st.radio(
-                    f"排序方式 (表格 {table_index+1})",
-                    options=['升序', '降序'],
-                    horizontal=True,
-                    key=f"sort_order_{table_index}"
-                )
-                ascending = (sort_order == '升序')
-                filtered_table = filtered_table.sort_values(sort_col, ascending=ascending)
-        
-        return filtered_table
-    
-    def _display_table_info_and_stats(self, original_table: pd.DataFrame, filtered_table: pd.DataFrame, 
-                                     show_info: bool, show_stats: bool, table_index: int):
-        """显示表格信息和统计数据"""
-        if show_info:
-            st.write("**表格信息:**")
-            st.write(f"- 原始行数: {len(original_table)}")
-            st.write(f"- 过滤后行数: {len(filtered_table)}")
-            st.write(f"- 列数: {len(original_table.columns)}")
-            st.write(f"- 列名: {', '.join(original_table.columns)}")
-        
-        if show_stats:
-            st.write("**统计信息:**")
-            numeric_cols = filtered_table.select_dtypes(include=[np.number]).columns
-            if len(numeric_cols) > 0:
-                st.dataframe(filtered_table[numeric_cols].describe())
-            else:
-                st.info("表格中没有数值列")
-        
-        # 导出功能
-        if st.button(f"📥 导出表格 {table_index+1}", key=f"export_{table_index}"):
-            self._create_export_buttons(filtered_table, table_index)
-    
-    def _create_export_buttons(self, table: pd.DataFrame, table_index: int):
-        """创建导出按钮"""
-        # CSV导出
-        csv_data = table.to_csv(index=False)
-        st.download_button(
-            label=f"下载CSV (表格 {table_index+1})",
-            data=csv_data,
-            file_name=f"table_{table_index+1}.csv",
-            mime="text/csv",
-            key=f"download_csv_{table_index}"
-        )
-        
-        # Excel导出
-        excel_buffer = BytesIO()
-        table.to_excel(excel_buffer, index=False)
-        st.download_button(
-            label=f"下载Excel (表格 {table_index+1})",
-            data=excel_buffer.getvalue(),
-            file_name=f"table_{table_index+1}.xlsx",
-            mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-            key=f"download_excel_{table_index}"
-        )
-    
-    def _process_table_display_mode(self, table: pd.DataFrame, table_index: int, display_mode: str) -> pd.DataFrame:
-        """根据显示模式处理表格"""
-        if display_mode == "分页显示":
-            # 分页显示
-            page_size = st.selectbox(
-                f"每页显示行数 (表格 {table_index+1})",
-                [10, 20, 50, 100],
-                key=f"page_size_{table_index}"
-            )
-            
-            total_pages = (len(table) - 1) // page_size + 1
-            
-            if total_pages > 1:
-                page_number = st.selectbox(
-                    f"页码 (表格 {table_index+1})",
-                    range(1, total_pages + 1),
-                    key=f"page_number_{table_index}"
-                )
-                
-                start_idx = (page_number - 1) * page_size
-                end_idx = start_idx + page_size
-                return table.iloc[start_idx:end_idx]
-            
-            return table
-            
-        elif display_mode == "筛选列显示":
-            # 列筛选显示
-            if len(table.columns) > 5:
-                selected_columns = st.multiselect(
-                    f"选择要显示的列 (表格 {table_index+1})",
-                    table.columns.tolist(),
-                    default=table.columns.tolist()[:5],  # 默认显示前5列
-                    key=f"selected_columns_{table_index}"
-                )
-                
-                if selected_columns:
-                    return table[selected_columns]
-            
-            return table
-            
-        else:  # 完整显示
-            return table
-
-    def find_verify_md_path(self, selected_file_index: int) -> Optional[Path]:
-        """查找当前OCR文件对应的验证文件路径"""
-        current_page = self.file_info[selected_file_index]['page']
-        verify_md_path = None
-
-        for i, info in enumerate(self.verify_file_info):
-            if info['page'] == current_page:
-                verify_md_path = Path(self.verify_file_paths[i]).with_suffix('.md')
-                break
-
-        return verify_md_path
-
-    @st.dialog("交叉验证", width="large", dismissible=True, on_dismiss="rerun")
-    def cross_validation(self):
-        """交叉验证功能 - 批量比对两个数据源的所有OCR结果"""
-        
-        if self.current_source_key == self.verify_source_key:
-            st.error("❌ OCR数据源和验证数据源不能相同")
-            return
-        
-        # 初始化对比结果存储
-        if 'cross_validation_batch_result' not in st.session_state:
-            st.session_state.cross_validation_batch_result = None
-        
-        st.header("🔄 批量交叉验证")
-        
-        # 显示数据源信息
-        col1, col2 = st.columns(2)
-        with col1:
-            st.info(f"**OCR数据源:** {get_data_source_display_name(self.current_source_config)}")
-            st.write(f"📁 文件数量: {len(self.file_info)}")
-        with col2:
-            st.info(f"**验证数据源:** {get_data_source_display_name(self.verify_source_config)}")
-            st.write(f"📁 文件数量: {len(self.verify_file_info)}")
-        
-        # 批量验证选项
-        with st.expander("⚙️ 验证选项", expanded=True):
-            col1, col2 = st.columns(2)
-            with col1:
-                table_mode = st.selectbox(
-                    "表格比对模式",
-                    options=['standard', 'flow_list'],
-                    index=1,  # 默认使用flow_list
-                    format_func=lambda x: '流水表格模式' if x == 'flow_list' else '标准模式',
-                    help="选择表格比对算法"
-                )
-            with col2:
-                similarity_algorithm = st.selectbox(
-                    "相似度算法",
-                    options=['ratio', 'partial_ratio', 'token_sort_ratio', 'token_set_ratio'],
-                    index=0,
-                    help="选择文本相似度计算算法"
-                )
-        
-        # 开始批量验证按钮
-        if st.button("🚀 开始批量验证", type="primary", width='stretch'):
-            self._run_batch_cross_validation(table_mode, similarity_algorithm)
-        
-        # 显示历史批量验证结果
-        if 'cross_validation_batch_result' in st.session_state and st.session_state.cross_validation_batch_result:
-            st.markdown("---")
-            self._display_batch_validation_results(st.session_state.cross_validation_batch_result)
-    
-    def _generate_batch_validation_markdown(self, batch_results: dict, output_path: str):
-        """生成批量验证的Markdown报告"""
-        
-        with open(output_path, "w", encoding="utf-8") as f:
-            f.write("# 批量交叉验证报告\n\n")
-            
-            # 基本信息
-            f.write("## 📋 基本信息\n\n")
-            f.write(f"- **OCR数据源:** {batch_results['ocr_source']}\n")
-            f.write(f"- **验证数据源:** {batch_results['verify_source']}\n")
-            f.write(f"- **表格模式:** {batch_results['table_mode']}\n")
-            f.write(f"- **相似度算法:** {batch_results['similarity_algorithm']}\n")
-            f.write(f"- **验证时间:** {batch_results['timestamp']}\n\n")
-            
-            # 汇总统计
-            summary = batch_results['summary']
-            f.write("## 📊 汇总统计\n\n")
-            f.write(f"- **总页数:** {summary['total_pages']}\n")
-            f.write(f"- **成功页数:** {summary['successful_pages']}\n")
-            f.write(f"- **失败页数:** {summary['failed_pages']}\n")
-            f.write(f"- **总差异数:** {summary['total_differences']}\n")
-            f.write(f"- **表格差异:** {summary['total_table_differences']}\n")
-            f.write(f"  - 金额差异: {summary.get('total_amount_differences', 0)}\n")
-            f.write(f"  - 日期差异: {summary.get('total_datetime_differences', 0)}\n")
-            f.write(f"  - 文本差异: {summary.get('total_text_differences', 0)}\n")
-            f.write(f"  - 表头前差异: {summary.get('total_table_pre_header', 0)}\n")
-            f.write(f"  - 表头位置差异: {summary.get('total_table_header_position', 0)}\n")
-            f.write(f"  - 表头严重错误: {summary.get('total_table_header_critical', 0)}\n")
-            f.write(f"  - 行缺失: {summary.get('total_table_row_missing', 0)}\n")
-            f.write(f"- **段落差异:** {summary['total_paragraph_differences']}\n")
-            f.write(f"- **严重程度统计:**\n")
-            f.write(f"  - 高严重度: {summary.get('total_high_severity', 0)}\n")
-            f.write(f"  - 中严重度: {summary.get('total_medium_severity', 0)}\n")
-            f.write(f"  - 低严重度: {summary.get('total_low_severity', 0)}\n\n")
-            
-            # 详细结果表格
-            f.write("## 📄 各页差异统计\n\n")
-            f.write("| 页码 | 状态 | 总差异 | 表格差异 | 金额 | 日期 | 文本 | 段落 | 表头前 | 表头位置 | 表头错误 | 行缺失 | 高 | 中 | 低 |\n")
-            f.write("|------|------|--------|----------|------|------|------|------|--------|----------|----------|--------|----|----|----|\n")
-            
-            for page in batch_results['pages']:
-                if page['status'] == 'success':
-                    status_icon = "✅" if page['total_differences'] == 0 else "⚠️"
-                    f.write(f"| {page['page_num']} | {status_icon} | ")
-                    f.write(f"{page['total_differences']} | ")
-                    f.write(f"{page['table_differences']} | ")
-                    f.write(f"{page.get('amount_differences', 0)} | ")
-                    f.write(f"{page.get('datetime_differences', 0)} | ")
-                    f.write(f"{page.get('text_differences', 0)} | ")
-                    f.write(f"{page['paragraph_differences']} | ")
-                    f.write(f"{page.get('table_pre_header', 0)} | ")
-                    f.write(f"{page.get('table_header_position', 0)} | ")
-                    f.write(f"{page.get('table_header_critical', 0)} | ")
-                    f.write(f"{page.get('table_row_missing', 0)} | ")
-                    f.write(f"{page.get('high_severity', 0)} | ")
-                    f.write(f"{page.get('medium_severity', 0)} | ")
-                    f.write(f"{page.get('low_severity', 0)} |\n")
-                else:
-                    f.write(f"| {page['page_num']} | ❌ | - | - | - | - | - | - | - | - | - | - | - | - | - |\n")
-            
-            f.write("\n")
-            
-            # 问题汇总
-            f.write("## 🔍 问题汇总\n\n")
-            
-            high_diff_pages = [p for p in batch_results['pages'] 
-                             if p['status'] == 'success' and p['total_differences'] > 10]
-            if high_diff_pages:
-                f.write("### ⚠️ 高差异页面(差异>10)\n\n")
-                for page in high_diff_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page['total_differences']} 个差异\n")
-                f.write("\n")
-            
-            amount_error_pages = [p for p in batch_results['pages'] 
-                                if p['status'] == 'success' and p.get('amount_differences', 0) > 0]
-            if amount_error_pages:
-                f.write("### 💰 金额差异页面\n\n")
-                for page in amount_error_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page.get('amount_differences', 0)} 个金额差异\n")
-                f.write("\n")
-            
-            header_error_pages = [p for p in batch_results['pages'] 
-                                if p['status'] == 'success' and p.get('table_header_critical', 0) > 0]
-            if header_error_pages:
-                f.write("### ❌ 表头严重错误页面\n\n")
-                for page in header_error_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page['table_header_critical']} 个表头错误\n")
-                f.write("\n")
-            
-            failed_pages = [p for p in batch_results['pages'] if p['status'] == 'failed']
-            if failed_pages:
-                f.write("### 💥 验证失败页面\n\n")
-                for page in failed_pages:
-                    f.write(f"- 第 {page['page_num']} 页:{page.get('error', '未知错误')}\n")
-                f.write("\n")
-
-    def _run_batch_cross_validation(self, table_mode: str, similarity_algorithm: str):
-        """执行批量交叉验证"""
-        
-        # 准备输出目录
-        pre_validation_dir = Path(self.config['pre_validation'].get('out_dir', './output/pre_validation/')).resolve()
-        pre_validation_dir.mkdir(parents=True, exist_ok=True)
-        
-        # ✅ 批量结果存储 - 更新统计字段
-        batch_results = {
-            'ocr_source': get_data_source_display_name(self.current_source_config),
-            'verify_source': get_data_source_display_name(self.verify_source_config),
-            'table_mode': table_mode,
-            'similarity_algorithm': similarity_algorithm,
-            'timestamp': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
-            'pages': [],
-            'summary': {
-                'total_pages': 0,
-                'successful_pages': 0,
-                'failed_pages': 0,
-                'total_differences': 0,
-                'total_table_differences': 0,
-                'total_amount_differences': 0,
-                'total_datetime_differences': 0,
-                'total_text_differences': 0,
-                'total_paragraph_differences': 0,
-                'total_table_pre_header': 0,
-                'total_table_header_position': 0,
-                'total_table_header_critical': 0,
-                'total_table_row_missing': 0,
-                'total_high_severity': 0,
-                'total_medium_severity': 0,
-                'total_low_severity': 0
-            }
-        }
-        
-        # 创建进度条
-        progress_bar = st.progress(0)
-        status_text = st.empty()
-        
-        # 建立页码映射
-        ocr_page_map = {info['page']: i for i, info in enumerate(self.file_info)}
-        verify_page_map = {info['page']: i for i, info in enumerate(self.verify_file_info)}
-        
-        # 找出两个数据源共同的页码
-        common_pages = sorted(set(ocr_page_map.keys()) & set(verify_page_map.keys()))
-        
-        if not common_pages:
-            st.error("❌ 两个数据源没有共同的页码,无法进行对比")
-            return
-        
-        batch_results['summary']['total_pages'] = len(common_pages)
-        
-        # 创建详细日志区域
-        with st.expander("📋 详细对比日志", expanded=True):
-            log_container = st.container()
-        
-        # 逐页对比
-        for idx, page_num in enumerate(common_pages):
-            try:
-                # 更新进度
-                progress = (idx + 1) / len(common_pages)
-                progress_bar.progress(progress)
-                status_text.text(f"正在对比第 {page_num} 页... ({idx + 1}/{len(common_pages)})")
-                
-                # 获取文件路径
-                ocr_file_index = ocr_page_map[page_num]
-                verify_file_index = verify_page_map[page_num]
-                
-                ocr_md_path = Path(self.file_paths[ocr_file_index]).with_suffix('.md')
-                verify_md_path = Path(self.verify_file_paths[verify_file_index]).with_suffix('.md')
-                
-                if not ocr_md_path.exists() or not verify_md_path.exists():
-                    with log_container:
-                        st.warning(f"⚠️ 第 {page_num} 页:文件不存在,跳过")
-                    batch_results['summary']['failed_pages'] += 1
-                    continue
-                
-                # 执行对比
-                comparison_result_path = pre_validation_dir / f"{ocr_md_path.stem}_cross_validation"
-                
-                # 捕获对比输出
-                import io
-                import contextlib
-                
-                output_buffer = io.StringIO()
-                
-                with contextlib.redirect_stdout(output_buffer):
-                    comparison_result = compare_ocr_results(
-                        file1_path=str(ocr_md_path),
-                        file2_path=str(verify_md_path),
-                        output_file=str(comparison_result_path),
-                        output_format='both',
-                        ignore_images=True,
-                        table_mode=table_mode,
-                        similarity_algorithm=similarity_algorithm
-                    )
-                
-                # ✅ 提取统计信息 - 更新字段
-                stats = comparison_result['statistics']
-                
-                page_result = {
-                    'page_num': page_num,
-                    'ocr_file': str(ocr_md_path.name),
-                    'verify_file': str(verify_md_path.name),
-                    'total_differences': stats['total_differences'],
-                    'table_differences': stats['table_differences'],
-                    'amount_differences': stats.get('amount_differences', 0),
-                    'datetime_differences': stats.get('datetime_differences', 0),
-                    'text_differences': stats.get('text_differences', 0),
-                    'paragraph_differences': stats['paragraph_differences'],
-                    'table_pre_header': stats.get('table_pre_header', 0),
-                    'table_header_position': stats.get('table_header_position', 0),
-                    'table_header_critical': stats.get('table_header_critical', 0),
-                    'table_row_missing': stats.get('table_row_missing', 0),
-                    'high_severity': stats.get('high_severity', 0),
-                    'medium_severity': stats.get('medium_severity', 0),
-                    'low_severity': stats.get('low_severity', 0),
-                    'status': 'success',
-                    'comparison_json': f"{comparison_result_path}.json",
-                    'comparison_md': f"{comparison_result_path}.md"
-                }
-                
-                batch_results['pages'].append(page_result)
-                batch_results['summary']['successful_pages'] += 1
-                batch_results['summary']['total_differences'] += stats['total_differences']
-                batch_results['summary']['total_table_differences'] += stats['table_differences']
-                batch_results['summary']['total_amount_differences'] += stats.get('amount_differences', 0)
-                batch_results['summary']['total_datetime_differences'] += stats.get('datetime_differences', 0)
-                batch_results['summary']['total_text_differences'] += stats.get('text_differences', 0)
-                batch_results['summary']['total_paragraph_differences'] += stats['paragraph_differences']
-                batch_results['summary']['total_table_pre_header'] += stats.get('table_pre_header', 0)
-                batch_results['summary']['total_table_header_position'] += stats.get('table_header_position', 0)
-                batch_results['summary']['total_table_header_critical'] += stats.get('table_header_critical', 0)
-                batch_results['summary']['total_table_row_missing'] += stats.get('table_row_missing', 0)
-                batch_results['summary']['total_high_severity'] += stats.get('high_severity', 0)
-                batch_results['summary']['total_medium_severity'] += stats.get('medium_severity', 0)
-                batch_results['summary']['total_low_severity'] += stats.get('low_severity', 0)
-                
-                # 显示当前页对比结果
-                with log_container:
-                    if stats['total_differences'] == 0:
-                        st.success(f"✅ 第 {page_num} 页:完全匹配")
-                    else:
-                        st.warning(f"⚠️ 第 {page_num} 页:发现 {stats['total_differences']} 个差异")
-                
-            except Exception as e:
-                with log_container:
-                    st.error(f"❌ 第 {page_num} 页:对比失败 - {str(e)}")
-                
-                page_result = {
-                    'page_num': page_num,
-                    'status': 'failed',
-                    'error': str(e)
-                }
-                batch_results['pages'].append(page_result)
-                batch_results['summary']['failed_pages'] += 1
-        
-        # 保存批量结果
-        batch_result_path = pre_validation_dir / f"{self.current_source_config['name']}_{self.current_source_config['ocr_tool']}_vs_{self.verify_source_config['ocr_tool']}_batch_cross_validation"
-        
-        # 保存JSON
-        with open(f"{batch_result_path}.json", "w", encoding="utf-8") as f:
-            json.dump(batch_results, f, ensure_ascii=False, indent=2)
-        
-        # 生成Markdown报告
-        self._generate_batch_validation_markdown(batch_results, f"{batch_result_path}.md")
-        
-        # 保存到session state
-        st.session_state.cross_validation_batch_result = batch_results
-        
-        # 完成提示
-        progress_bar.progress(1.0)
-        status_text.text("✅ 批量验证完成!")
-        
-        st.success(f"🎉 批量验证完成!成功: {batch_results['summary']['successful_pages']}, 失败: {batch_results['summary']['failed_pages']}")
-
-    def _display_batch_validation_results(self, batch_results: dict):
-        """显示批量验证结果"""
-        
-        st.header("📊 批量验证结果")
-        
-        # 汇总统计
-        summary = batch_results['summary']
-        
-        col1, col2, col3, col4 = st.columns(4)
-        with col1:
-            st.metric("总页数", summary['total_pages'])
-        with col2:
-            st.metric("成功页数", summary['successful_pages'], 
-                     delta=f"{summary['successful_pages']/summary['total_pages']*100:.1f}%")
-        with col3:
-            st.metric("失败页数", summary['failed_pages'],
-                     delta=f"-{summary['failed_pages']}" if summary['failed_pages'] > 0 else "0")
-        with col4:
-            st.metric("总差异数", summary['total_differences'])
-        
-        # ✅ 详细差异类型统计 - 更新展示
-        st.subheader("📈 差异类型统计")
-        
-        col1, col2, col3 = st.columns(3)
-        with col1:
-            st.metric("表格差异", summary['total_table_differences'])
-            st.caption(f"金额: {summary.get('total_amount_differences', 0)} | 日期: {summary.get('total_datetime_differences', 0)} | 文本: {summary.get('total_text_differences', 0)}")
-        with col2:
-            st.metric("段落差异", summary['total_paragraph_differences'])
-        with col3:
-            st.metric("严重度", f"高:{summary.get('total_high_severity', 0)} 中:{summary.get('total_medium_severity', 0)} 低:{summary.get('total_low_severity', 0)}")
-        
-        # 表格结构差异统计
-        with st.expander("📋 表格结构差异详情", expanded=False):
-            col1, col2, col3, col4 = st.columns(4)
-            with col1:
-                st.metric("表头前", summary.get('total_table_pre_header', 0))
-            with col2:
-                st.metric("表头位置", summary.get('total_table_header_position', 0))
-            with col3:
-                st.metric("表头错误", summary.get('total_table_header_critical', 0))
-            with col4:
-                st.metric("行缺失", summary.get('total_table_row_missing', 0))
-        
-        # ✅ 各页详细结果表格 - 更新列
-        st.subheader("📄 各页详细结果")
-        
-        # 准备DataFrame
-        page_data = []
-        for page in batch_results['pages']:
-            if page['status'] == 'success':
-                page_data.append({
-                    '页码': page['page_num'],
-                    '状态': '✅ 成功' if page['total_differences'] == 0 else '⚠️ 有差异',
-                    '总差异': page['total_differences'],
-                    '表格差异': page['table_differences'],
-                    '金额': page.get('amount_differences', 0),
-                    '日期': page.get('datetime_differences', 0),
-                    '文本': page.get('text_differences', 0),
-                    '段落': page['paragraph_differences'],
-                    '表头前': page.get('table_pre_header', 0),
-                    '表头位置': page.get('table_header_position', 0),
-                    '表头错误': page.get('table_header_critical', 0),
-                    '行缺失': page.get('table_row_missing', 0),
-                    '高': page.get('high_severity', 0),
-                    '中': page.get('medium_severity', 0),
-                    '低': page.get('low_severity', 0)
-                })
-            else:
-                page_data.append({
-                    '页码': page['page_num'],
-                    '状态': '❌ 失败',
-                    '总差异': '-', '表格差异': '-', '金额': '-', '日期': '-', 
-                    '文本': '-', '段落': '-', '表头前': '-', '表头位置': '-',
-                    '表头错误': '-', '行缺失': '-', '高': '-', '中': '-', '低': '-'
-                })
-        
-        df_pages = pd.DataFrame(page_data)
-        
-        # 显示表格
-        st.dataframe(
-            df_pages,
-            width='stretch',
-            hide_index=True,
-            column_config={
-                "页码": st.column_config.NumberColumn("页码", width="small"),
-                "状态": st.column_config.TextColumn("状态", width="small"),
-                "总差异": st.column_config.NumberColumn("总差异", width="small"),
-                "表格差异": st.column_config.NumberColumn("表格", width="small"),
-                "金额": st.column_config.NumberColumn("金额", width="small"),
-                "日期": st.column_config.NumberColumn("日期", width="small"),
-                "文本": st.column_config.NumberColumn("文本", width="small"),
-                "段落": st.column_config.NumberColumn("段落", width="small"),
-            }
-        )
-        
-        # 下载选项
-        st.subheader("📥 导出报告")
-        
-        col1, col2 = st.columns(2)
-        
-        with col1:
-            # 导出Excel
-            excel_buffer = BytesIO()
-            df_pages.to_excel(excel_buffer, index=False, sheet_name='验证结果')
-            
-            st.download_button(
-                label="📊 下载Excel报告",
-                data=excel_buffer.getvalue(),
-                file_name=f"batch_validation_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx",
-                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
-            )
-        
-        with col2:
-            # 导出JSON
-            json_data = json.dumps(batch_results, ensure_ascii=False, indent=2)
-            
-            st.download_button(
-                label="📄 下载JSON报告",
-                data=json_data,
-                file_name=f"batch_validation_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.json",
-                mime="application/json"
-            )
-
-    @st.dialog("查看交叉验证结果", width="large", dismissible=True, on_dismiss="rerun")
-    def show_batch_cross_validation_results_dialog(self):
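-        # 被 @st.dialog 装饰的函数在被调用时会以模态对话框形式打开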
-        if 'cross_validation_batch_result' in st.session_state and st.session_state.cross_validation_batch_result:
-            self._display_batch_validation_results(st.session_state.cross_validation_batch_result)
-            
-        else:
-            st.info("暂无交叉验证结果,请先运行交叉验证")
-
-    def display_comparison_results(self, comparison_result: dict, detailed: bool = True):
-        """显示对比结果摘要 - 使用DataFrame展示"""
-        
-        st.header("📊 VLM预校验结果")
-        
-        # 统计信息
-        stats = comparison_result['statistics']
-        
-        # 统计信息概览
-        col1, col2, col3, col4 = st.columns(4)
-        with col1:
-            st.metric("总差异数", stats['total_differences'])
-        with col2:
-            st.metric("表格差异", stats['table_differences'])
-        with col3:
-            st.metric("其中表格金额差异", stats['amount_differences'])
-        with col4:
-            st.metric("段落差异", stats['paragraph_differences'])
-        
-        # 结果判断
-        if stats['total_differences'] == 0:
-            st.success("🎉 完美匹配!VLM识别结果与原OCR结果完全一致")
-        else:
-            st.warning(f"⚠️ 发现 {stats['total_differences']} 个差异,建议人工检查")
-            
-            # 使用DataFrame显示差异详情
-            if comparison_result['differences']:
-                st.subheader("🔍 差异详情对比")
-                
-                # 准备DataFrame数据
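-                # 超长内容仅在表格预览中截断到 100 字符,完整文本可在下方“详细差异查看”中查看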
-                diff_data = []
-                for i, diff in enumerate(comparison_result['differences'], 1):
-                    diff_data.append({
-                        '序号': i,
-                        '位置': diff['position'],
-                        '类型': diff['type'],
-                        '原OCR结果': diff['file1_value'][:100] + ('...' if len(diff['file1_value']) > 100 else ''),
-                        'VLM识别结果': diff['file2_value'][:100] + ('...' if len(diff['file2_value']) > 100 else ''),
-                        '描述': diff['description'][:80] + ('...' if len(diff['description']) > 80 else ''),
-                        '严重程度': self._get_severity_level(diff)
-                    })
-                
-                # 创建DataFrame
-                df_differences = pd.DataFrame(diff_data)
-                
-                # 添加样式
-                def highlight_severity(val):
-                    """根据严重程度添加颜色"""
-                    if val == '高':
-                        return 'background-color: #ffebee; color: #c62828'
-                    elif val == '中':
-                        return 'background-color: #fff3e0; color: #ef6c00'
-                    elif val == '低':
-                        return 'background-color: #e8f5e8; color: #2e7d32'
-                    return ''
-                
-                # 显示DataFrame
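-                # 注:较新版本 pandas 中 Styler.applymap 已更名为 Styler.map,旧写法仍可用但会产生弃用警告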
-                styled_df = df_differences.style.applymap(
-                    highlight_severity, 
-                    subset=['严重程度']
-                ).format({
-                    '序号': '{:d}',
-                })
-                
-                st.dataframe(
-                    styled_df, 
-                    width='stretch',
-                    height=400,
-                    hide_index=True,
-                    column_config={
-                        "序号": st.column_config.NumberColumn(
-                            "序号", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异项序号"
-                        ),
-                        "位置": st.column_config.TextColumn(
-                            "位置", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异在文档中的位置"
-                        ),
-                        "类型": st.column_config.TextColumn(
-                            "类型", 
-                            width=None,  # 自动调整宽度
-                            pinned=True,
-                            help="差异类型"
-                        ),
-                        "原OCR结果": st.column_config.TextColumn(
-                            "原OCR结果", 
-                            width="large",  # 固定为较宽列宽
-                            pinned=True,
-                            help="原始OCR识别结果"
-                        ),
-                        "VLM识别结果": st.column_config.TextColumn(
-                            "VLM识别结果", 
-                            width="large",  # 固定为较宽列宽
-                            help="VLM重新识别的结果"
-                        ),
-                        "描述": st.column_config.TextColumn(
-                            "描述", 
-                            width="medium",  # 固定为中等列宽
-                            help="差异详细描述"
-                        ),
-                        "严重程度": st.column_config.TextColumn(
-                            "严重程度", 
-                            width=None,  # 自动调整宽度
-                            help="差异严重程度评级"
-                        )
-                    }
-                )
-                
-                # 详细差异查看
-                st.subheader("🔍 详细差异查看")
-                
-                if detailed:
-                    # 选择要查看的差异
-                    selected_diff_index = st.selectbox(
-                        "选择要查看的差异:",
-                        options=range(len(comparison_result['differences'])),
-                        format_func=lambda x: f"差异 {x+1}: {comparison_result['differences'][x]['position']} - {comparison_result['differences'][x]['type']}",
-                        key="selected_diff"
-                    )
-                    
-                    if selected_diff_index is not None:
-                        diff = comparison_result['differences'][selected_diff_index]
-                        
-                        # 并排显示完整内容
-                        col1, col2 = st.columns(2)
-                        
-                        with col1:
-                            st.write("**原OCR结果:**")
-                            st.text_area(
-                                "原OCR结果详情",
-                                value=diff['file1_value'],
-                                height=200,
-                                key=f"original_{selected_diff_index}",
-                                label_visibility="collapsed"
-                            )
-                        
-                        with col2:
-                            st.write("**验证数据源识别结果:**")
-                            st.text_area(
-                                "验证数据源识别结果详情",
-                                value=diff['file2_value'],
-                                height=200,
-                                key=f"vlm_{selected_diff_index}",
-                                label_visibility="collapsed"
-                            )
-                        
-                        # 差异详细信息
-                        st.info(f"**位置:** {diff['position']}")
-                        st.info(f"**类型:** {diff['type']}")
-                        st.info(f"**描述:** {diff['description']}")
-                        st.info(f"**严重程度:** {self._get_severity_level(diff)}")
-                
-                # 差异统计图表
-                st.subheader("📈 差异类型分布")
-                
-                # 按类型统计差异
-                type_counts = {}
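-                # 键名需与 _get_severity_level 的返回值('高'/'中'/'低')保持一致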
-                severity_counts = {'高': 0, '中': 0, '低': 0}
-                
-                for diff in comparison_result['differences']:
-                    diff_type = diff['type']
-                    type_counts[diff_type] = type_counts.get(diff_type, 0) + 1
-                    
-                    severity = self._get_severity_level(diff)
-                    severity_counts[severity] += 1
-                
-                col1, col2 = st.columns(2)
-                
-                with col1:
-                    # 类型分布饼图
-                    if type_counts:
-                        fig_type = px.pie(
-                            values=list(type_counts.values()),
-                            names=list(type_counts.keys()),
-                            title="差异类型分布"
-                        )
-                        st.plotly_chart(fig_type, width='stretch')
-                
-                with col2:
-                    # 严重程度分布条形图
-                    fig_severity = px.bar(
-                        x=list(severity_counts.keys()),
-                        y=list(severity_counts.values()),
-                        title="差异严重程度分布",
-                        color=list(severity_counts.keys()),
-                        color_discrete_map={'高': '#f44336', '中': '#ff9800', '低': '#4caf50'}
-                    )
-                    st.plotly_chart(fig_severity, width='stretch')
-        
-        # 下载选项
-        if detailed:
-            self._provide_download_options_in_results(comparison_result)
-
-    def _get_severity_level(self, diff: dict) -> str:
-        """根据差异类型和内容判断严重程度"""
-        # 如果差异中已经包含严重程度,直接使用
-        if 'severity' in diff:
-            severity_map = {'high': '高', 'medium': '中', 'low': '低'}
-            return severity_map.get(diff['severity'], '中')
-        
-        # 原有的逻辑作为后备
-        diff_type = diff['type'].lower()
-        
-        # 金额相关差异为高严重程度
-        if 'amount' in diff_type or 'number' in diff_type:
-            return '高'
-        
-        # 表格结构差异为中等严重程度
-        if 'table' in diff_type or 'structure' in diff_type:
-            return '中'
-        
-        # 检查相似度
-        if 'similarity' in diff:
-            similarity = diff['similarity']
-            if similarity < 50:
-                return '高'
-            elif similarity < 85:
-                return '中'
-            else:
-                return '低'
-        
-        # 检查内容长度差异
-        len_diff = abs(len(diff['file1_value']) - len(diff['file2_value']))
-        if len_diff > 50:
-            return '高'
-        elif len_diff > 10:
-            return '中'
-        else:
-            return '低'
-
-    def _provide_download_options_in_results(self, comparison_result: dict):
-        """在结果页面提供下载选项"""
-        
-        st.subheader("📥 导出预校验结果")
-        
-        col1, col2, col3 = st.columns(3)
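-        # 每个 download_button 显式指定不同的 key,避免页面上多个下载按钮出现组件 ID 冲突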
-        
-        with col1:
-            # 导出差异详情为Excel
-            if comparison_result['differences']:
-                diff_data = []
-                for i, diff in enumerate(comparison_result['differences'], 1):
-                    diff_data.append({
-                        '序号': i,
-                        '位置': diff['position'],
-                        '类型': diff['type'],
-                        '原OCR结果': diff['file1_value'],
-                        'VLM识别结果': diff['file2_value'],
-                        '描述': diff['description'],
-                        '严重程度': self._get_severity_level(diff)
-                    })
-                
-                df_export = pd.DataFrame(diff_data)
-                excel_buffer = BytesIO()
-                df_export.to_excel(excel_buffer, index=False, sheet_name='差异详情')
-                
-                st.download_button(
-                    label="📊 下载差异详情(Excel)",
-                    data=excel_buffer.getvalue(),
-                    file_name=f"vlm_comparison_differences_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx",
-                    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-                    key="download_differences_excel"
-                )
-        
-        with col2:
-            # 导出统计报告
-            stats_data = {
-                '统计项目': ['总差异数', '表格差异', '其中表格金额差异', '段落差异'],
-                '数量': [
-                    comparison_result['statistics']['total_differences'],
-                    comparison_result['statistics']['table_differences'],
-                    comparison_result['statistics']['amount_differences'],
-                    comparison_result['statistics']['paragraph_differences']
-                ]
-            }
-            
-            df_stats = pd.DataFrame(stats_data)
-            csv_stats = df_stats.to_csv(index=False)
-            
-            st.download_button(
-                label="📈 下载统计报告(CSV)",
-                data=csv_stats,
-                file_name=f"vlm_comparison_stats_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.csv",
-                mime="text/csv",
-                key="download_stats_csv"
-            )
-        
-        with col3:
-            # 导出完整报告为JSON
-            
-            report_json = json.dumps(comparison_result, ensure_ascii=False, indent=2)
-            
-            st.download_button(
-                label="📄 下载完整报告(JSON)",
-                data=report_json,
-                file_name=f"vlm_comparison_full_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.json",
-                mime="application/json",
-                key="download_full_json"
-            )
-        
-        # 操作建议
-        st.subheader("🚀 后续操作建议")
-        
-        total_diffs = comparison_result['statistics']['total_differences']
-        if total_diffs == 0:
-            st.success("✅ VLM识别结果与原OCR完全一致,可信度很高,无需人工校验")
-        elif total_diffs <= 5:
-            st.warning("⚠️ 发现少量差异,建议重点检查高严重程度的差异项")
-        elif total_diffs <= 20:
-            st.warning("🔍 发现中等数量差异,建议详细检查差异表格中标红的项目")
-        else:
-            st.error("❌ 发现大量差异,建议重新进行OCR识别或检查原始图片质量")
-    
-    def create_compact_layout(self, config):
-        """创建紧凑布局"""
-        return self.layout_manager.create_compact_layout(config)
-
-@st.dialog("message", width="small", dismissible=True, on_dismiss="rerun")
-def message_box(msg: str, msg_type: str = "info"):
-    if msg_type == "info":
-        st.info(msg)
-    elif msg_type == "warning":
-        st.warning(msg)
-    elif msg_type == "error":
-        st.error(msg)
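-# 用法示例:message_box("❌ 请先选择OCR数据文件", "error") 会弹出模态提示对话框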
-
-def main():
-    """主应用"""
-    # 初始化应用:Streamlit 每次交互都会重跑整个脚本,validator 实例缓存在 session_state 中复用
-    if 'validator' not in st.session_state:
-        validator = StreamlitOCRValidator()
-        st.session_state.validator = validator
-        st.session_state.validator.setup_page_config()
-    else:
-        validator = st.session_state.validator
-    config = validator.config
-    
-    # 页面标题:放在条件分支之外,确保每次重跑都会渲染
-    st.title(config['ui']['page_title'])
-    
-    if 'selected_text' not in st.session_state:
-        st.session_state.selected_text = None
-    
-    if 'marked_errors' not in st.session_state:
-        st.session_state.marked_errors = set()
-    
-    # 数据源选择器
-    validator.create_data_source_selector()
-    
-    # 如果没有可用的数据源,提前返回
-    if not validator.all_sources:
-        st.stop()
-    
-    # 文件选择区域
-    with st.container(height=75, horizontal=True, horizontal_alignment='left', gap="medium"):
-        # 初始化session_state中的选择索引
-        if 'selected_file_index' not in st.session_state:
-            st.session_state.selected_file_index = 0
-            
-        if validator.display_options:
-            # 文件选择下拉框
-            selected_index = st.selectbox(
-                "选择OCR结果文件", 
-                range(len(validator.display_options)),
-                format_func=lambda i: validator.display_options[i],
-                index=st.session_state.selected_file_index,
-                key="selected_selectbox",
-                label_visibility="collapsed"
-            )
-            
-            # 更新session_state
-            if selected_index != st.session_state.selected_file_index:
-                st.session_state.selected_file_index = selected_index
-
-            selected_file = validator.file_paths[selected_index]
-
-            # 页码输入器
-            current_page = validator.file_info[selected_index]['page']
-            page_input = st.number_input(
-                "输入页码", 
-                placeholder="输入页码", 
-                label_visibility="collapsed",
-                min_value=1, 
-                max_value=len(validator.display_options), 
-                value=current_page, 
-                step=1,
-                key="page_input"
-            )
-            
-            # 当页码输入改变时,更新文件选择
-            if page_input != current_page:
-                for i, info in enumerate(validator.file_info):
-                    if info['page'] == page_input:
-                        st.session_state.selected_file_index = i
-                        selected_file = validator.file_paths[i]
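-                        # st.rerun() 会立即中断本次执行并重新运行脚本,以按新选中的页码刷新界面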
-                        st.rerun()
-                        break
-
-            # 自动加载文件
-            if (st.session_state.selected_file_index >= 0
-                and validator.selected_file_index != st.session_state.selected_file_index
-                and selected_file):
-                validator.selected_file_index = st.session_state.selected_file_index
-                st.session_state.validator.load_ocr_data(selected_file)
-                
-                # 显示加载成功信息
-                current_source_name = get_data_source_display_name(validator.current_source_config)
-                st.success(f"✅ 已加载 {current_source_name} - 第{validator.file_info[st.session_state.selected_file_index]['page']}页")
-                st.rerun()
-        else:
-            st.warning("当前数据源中未找到OCR结果文件")
-
-        # 交叉验证按钮
-        if st.button("交叉验证", type="primary", icon=":material/compare_arrows:"):
-            if validator.image_path and validator.md_content:
-                validator.cross_validation()
-            else:
-                message_box("❌ 请先选择OCR数据文件", "error")
-
-        # 查看预校验结果按钮
-        if st.button("查看验证结果", type="secondary", icon=":material/quick_reference_all:"):
-            validator.show_batch_cross_validation_results_dialog()
-
-    # 显示当前数据源统计信息
-    with st.expander("🔧 OCR工具统计信息", expanded=False):
-        stats = validator.get_statistics()
-        col1, col2, col3, col4, col5 = st.columns(5)
-        
-        with col1:
-            st.metric("📊 总文本块", stats['total_texts'])
-        with col2:
-            st.metric("🔗 可点击文本", stats['clickable_texts'])
-        with col3:
-            st.metric("❌ 标记错误", stats['marked_errors'])
-        with col4:
-            st.metric("✅ 准确率", f"{stats['accuracy_rate']:.1f}%")
-        with col5:
-            # 显示当前数据源信息
-            if validator.current_source_config:
-                tool_display = validator.current_source_config['ocr_tool'].upper()
-                st.metric("🔧 OCR工具", tool_display)
-        
-        # 详细工具信息
-        if stats['tool_info']:
-            st.write("**详细信息:**", stats['tool_info'])
-    
-    # 功能标签页:内容人工检查 / 交叉验证结果 / 表格分析
-    tab1, tab2, tab3 = st.tabs(["📄 内容人工检查", "🔍 交叉验证结果", "📊 表格分析"])
-    
-    with tab1:
-        validator.create_compact_layout(config)
-
-    with tab2:
-        # st.header("📄 VLM预校验识别结果")
-        current_md_path = Path(validator.file_paths[validator.selected_file_index]).with_suffix('.md')
-        pre_validation_dir = Path(validator.config['pre_validation'].get('out_dir', './output/pre_validation/')).resolve()
-        comparison_result_path = pre_validation_dir / f"{current_md_path.stem}_cross_validation.json"
-        # pre_validation_path = pre_validation_dir / f"{current_md_path.stem}.md"
-        verify_md_path = validator.find_verify_md_path(validator.selected_file_index)
-        
-        if comparison_result_path.exists():
-            # 加载并显示验证结果
-            with open(comparison_result_path, "r", encoding="utf-8") as f:
-                comparison_result = json.load(f)
-
-            # 左边显示OCR结果,右边显示VLM结果
-            col1, col2 = st.columns([1,1])
-            with col1:
-                st.subheader("🤖 原OCR识别结果")
-                with open(current_md_path, "r", encoding="utf-8") as f:
-                    original_md_content = f.read()
-                font_size = config['styles'].get('font_size', 10)
-                height = config['styles']['layout'].get('default_height', 800)
-                layout_type = "compact"
-                validator.layout_manager.render_content_by_mode(original_md_content, "HTML渲染", font_size, height, layout_type)
-            with col2:
-                st.subheader("🤖 验证识别结果")
-                with open(str(verify_md_path), "r", encoding="utf-8") as f:
-                    verify_md_content = f.read()
-                font_size = config['styles'].get('font_size', 10)
-                height = config['styles']['layout'].get('default_height', 800)
-                layout_type = "compact"
-                validator.layout_manager.render_content_by_mode(verify_md_content, "HTML渲染", font_size, height, layout_type)
-
-            # 显示差异统计
-            st.markdown("---")
-            validator.display_comparison_results(comparison_result, detailed=True)
-        else:
-            st.info("暂无预校验结果,请先运行VLM预校验")
-
-    with tab3:
-        # 表格分析页面
-        st.header("📊 表格数据分析")
-        
-        if validator.md_content and '<table' in validator.md_content.lower():
-            st.subheader("🔍 表格数据预览")
-            validator.display_html_table_as_dataframe(validator.md_content)
-            
-        else:
-            st.info("当前OCR结果中没有检测到表格数据")
-    
-if __name__ == "__main__":
-    main()