
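📊 OCR Result Comparison Module (Comparator)

The OCR result comparison module compares two OCR outputs of the same document and reports fine-grained differences in tables and paragraphs. It is tuned for complex documents such as financial statements and transaction-ledger tables.

📁 Module Structure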

comparator/
├── __init__.py                      # Module initialization
├── compare_ocr_results.py           # Command-line comparison tool (entry point)
├── ocr_comparator.py                # Core comparator
├── table_comparator.py              # Table comparator ✨
├── paragraph_comparator.py          # Paragraph comparator
├── similarity_calculator.py         # Similarity calculator
├── data_type_detector.py            # Data type detector
├── content_extractor.py             # Content extractor
├── text_processor.py                # Text processor
├── report_generator.py              # Report generator
└── README.md                        # This document

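✨ Core Features

🎯 Smart Table Comparison

1. Two comparison modes

Standard mode (standard)

For tables with a fixed structure
Exact row-by-row, column-by-column comparison
No header detection
Best for: static reports, statistical tables

Flow-list mode (flow_list) ✨

For tables whose structure can vary
Header detection by keyword matching
Multi-level header recognition (e.g. balance sheets)
Automatic column type detection
Severity grading of differences
Best for: transaction ledgers, financial statements, transaction records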
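
Keyword-based header detection can be pictured as scoring each row by how many of its cells contain a known header keyword. The sketch below is only an illustration under that assumption: `HEADER_KEYWORDS`, `keyword_score`, and `guess_header_row` are hypothetical names, and the module's real scoring (keyword weights, bonuses for category rows) is not reproduced here.

```python
# Hypothetical keyword list; the module's actual keywords and weights may differ.
HEADER_KEYWORDS = ("日期", "金额", "余额", "摘要", "资产", "负债")


def keyword_score(row):
    """Return the fraction of cells in `row` that contain a header keyword."""
    if not row:
        return 0.0
    hits = sum(1 for cell in row if any(kw in str(cell) for kw in HEADER_KEYWORDS))
    return hits / len(row)


def guess_header_row(table, max_rows=5):
    """Return the index of the best-scoring row among the first `max_rows` rows.

    Returns -1 when no row contains any keyword.
    """
    best_idx, best_score = -1, 0.0
    for idx, row in enumerate(table[:max_rows]):
        score = keyword_score(row)
        if score > best_score:  # strict '>' keeps the earliest row on ties
            best_idx, best_score = idx, score
    return best_idx


table = [
    ["XX银行交易明细", "", ""],             # title row, no keywords
    ["日期", "金额", "余额"],               # header row
    ["2023-01-01", "1000.00", "5000.00"],  # data row
]
print(guess_header_row(table))
```

Limiting the search to the first few rows reflects the idea that a header sits near the top of a table; the cut-off of five rows is an arbitrary choice for this sketch.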

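2. Automatic column type detection

Supported data types:

Type	Identifier	Characteristics	Example
Numeric amount	`numeric`	Digits, decimal point, commas	`28,239,305.48`
Date/time	`datetime`	Matches a date format	`2023-12-31` / `2023年12月31日`
Text-type number	`text_number`	Digits only, but treated as text (e.g. bill numbers)	`20231231001`
Plain text	`text`	Any other text	`货币资金`

Detection algorithm: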

def detect_column_type(column_values):
    """Detect the data type of a column."""
    total = len(column_values)
    if total == 0:
        # No values to inspect: fall back to plain text and avoid dividing by zero
        return 'text'

    numeric_count = 0
    datetime_count = 0
    text_number_count = 0

    for value in column_values:
        if is_numeric(value):
            numeric_count += 1
        elif is_datetime(value):
            datetime_count += 1
        elif is_text_number(value):
            text_number_count += 1

    # A type is assigned when more than 60% of the values match it
    if numeric_count / total > 0.6:
        return 'numeric'
    elif datetime_count / total > 0.6:
        return 'datetime'
    elif text_number_count / total > 0.6:
        return 'text_number'
    else:
        return 'text'
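
The helper predicates used by `detect_column_type` are not shown in this document. The following is a minimal sketch of what they might look like; every regular expression, the list of date formats, and the 8-digit cut-off for text-type numbers are assumptions made for illustration, not the module's actual rules. Because `is_numeric` is checked first, this sketch makes it reject long digit-only strings so that bill numbers can still reach `is_text_number`.

```python
import re
from datetime import datetime

# All patterns and the 8-digit cut-off below are assumptions for illustration.
_AMOUNT_RE = re.compile(r"-?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?")
_TEXT_NUMBER_RE = re.compile(r"[0-9]{8,}")
_DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y年%m月%d日")


def is_text_number(value):
    """Long digit-only strings such as bill numbers."""
    return _TEXT_NUMBER_RE.fullmatch(str(value).strip()) is not None


def is_datetime(value):
    """True if the value parses with one of the assumed date formats."""
    text = str(value).strip()
    for fmt in _DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def is_numeric(value):
    """Amounts with optional thousands separators and decimals, excluding text numbers."""
    text = str(value).strip()
    if is_text_number(text):
        return False
    return _AMOUNT_RE.fullmatch(text) is not None


for sample in ("28,239,305.48", "2023-12-31", "20231231001", "货币资金"):
    print(sample, is_numeric(sample), is_datetime(sample), is_text_number(sample))
```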

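4. Severity grading of differences ✨

Base severity (determined by the cell content):

Difference type	Base severity	Description
`table_amount`	high	Amounts differ
`table_datetime`	medium	Dates/times differ
`table_text`	low/medium	Text differs (graded by similarity)
`table_header_position`	high	Header position differs
`table_header_content`	high	Header content differs
`table_row_missing`	high	Row counts differ
`table_column_missing`	high	Column counts differ

Automatic escalation on column type conflict: ✨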

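# If the column's detected type differs between the two files, raise the severity to high
final_severity = base_severity  # default: keep the base severity
if col_idx in mismatched_columns and base_severity != 'high':
    final_severity = 'high'
    description += " [column type conflict]"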

Example:

{
  "type": "table_text",
  "severity": "high",  // raised from low to high
  "column_type_mismatch": true,
  "description": "Text mismatch: 流动资产 vs 流动 资产 [column type conflict]"
}
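
The grading rule can be sketched end to end as a small function. This is an illustration only: `BASE_SEVERITY` and `grade_difference` are hypothetical names, and using `low` as the base for `table_text` is an assumption (the table lists it as low/medium depending on similarity).

```python
# Base severities taken from the table in this section. 'low' for table_text is
# an assumption; the module grades text differences as low or medium by similarity.
BASE_SEVERITY = {
    "table_amount": "high",
    "table_datetime": "medium",
    "table_text": "low",
    "table_header_position": "high",
    "table_header_content": "high",
    "table_row_missing": "high",
    "table_column_missing": "high",
}


def grade_difference(diff_type, col_idx, mismatched_columns):
    """Return (severity, column_type_mismatch) for one cell difference."""
    severity = BASE_SEVERITY.get(diff_type, "low")
    mismatch = col_idx in mismatched_columns
    if mismatch:
        severity = "high"  # a column type conflict always escalates to high
    return severity, mismatch


print(grade_difference("table_text", 2, {2}))
print(grade_difference("table_datetime", 1, {2}))
```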

5. Table matching algorithm

Tables from the two files are paired by similarity:

def find_matching_tables(tables1, tables2):
    """Find matching table pairs."""
    matches = []
    
    for idx1, table1 in enumerate(tables1):
        best_match_idx = -1
        best_similarity = 0
        
        for idx2, table2 in enumerate(tables2):
            # Compute the combined similarity
            similarity = calculate_table_similarity(table1, table2)
            
            # The 0.5 threshold (50%) applies to a similarity on a 0-1 scale
            if similarity > best_similarity and similarity > 0.5:
                best_similarity = similarity
                best_match_idx = idx2
        
        if best_match_idx >= 0:
            matches.append((idx1, best_match_idx, best_similarity))
    
    return matches

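Similarity calculation (100% in total):

Row/column count similarity (30%)
- Row count similarity (15%)
- Column count similarity (15%)
- ✨ Improvement: tolerates a difference of 1-2 columns (e.g. caused by merged columns)
Header similarity (50%), the most important factor
- Exact match (40%): number of headers that are identical
- Fuzzy match (40%): headers with similarity > 80%
- Semantic match (20%): recognition of common header keywords
Content feature similarity (20%)
- Data type distribution
- Value ranges
- Text features

Example output (similarity shown as a percentage):

# Match results
matches = [
    (0, 0, 95.2),  # table 0 of file 1 ↔ table 0 of file 2, similarity 95.2%
    (1, 1, 87.3),  # table 1 of file 1 ↔ table 1 of file 2, similarity 87.3%
]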
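
As a worked illustration of the weighting, the sketch below combines component scores into one percentage. It assumes the 40/40/20 figures are sub-weights inside the header score and that every component is already a value between 0 and 1; `header_similarity` and `table_similarity` are hypothetical names, not the module's functions.

```python
def header_similarity(exact, fuzzy, semantic):
    """Combine header sub-scores (each in [0, 1]) with weights 40/40/20."""
    return 0.4 * exact + 0.4 * fuzzy + 0.2 * semantic


def table_similarity(row_sim, col_sim, header_sim, content_sim):
    """Combine component scores (each in [0, 1]) into a percentage.

    Weights: rows 15%, columns 15%, header 50%, content features 20%.
    """
    score = 0.15 * row_sim + 0.15 * col_sim + 0.50 * header_sim + 0.20 * content_sim
    return round(100 * score, 1)


# Example inputs: identical shape, 5 of 6 headers exactly equal, good content overlap.
header = header_similarity(exact=5 / 6, fuzzy=1.0, semantic=1.0)
print(table_similarity(row_sim=1.0, col_sim=1.0, header_sim=header, content_sim=0.9))
```

Because the header carries half of the total weight, two tables with the same shape but unrelated headers cannot exceed 50% under this weighting, which sits exactly at the matching threshold.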

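📝 Paragraph Comparison

Comparison strategy:

Paragraph matching
- Similarity-based matching
- Tolerates reordered paragraphs
- Identifies added/removed paragraphs
Difference detection
- Text content differences
- Formatting differences (e.g. lists, quotes)
- Image content differences (optional)
Similarity algorithms
- ratio: standard Levenshtein-based similarity
- partial_ratio: partial (substring) matching
- token_sort_ratio: matching after sorting tokens
- token_set_ratio: matching on token sets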
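
The matching step can be sketched with the standard library alone. This is not the module's implementation: `match_paragraphs` is a hypothetical helper, `difflib` stands in for the configured similarity algorithm, and the threshold of 80 is an arbitrary choice for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def match_paragraphs(paras1, paras2, threshold=80):
    """Greedily pair each paragraph of paras1 with its best unused match in paras2.

    Returns (pairs, removed, added): matched index pairs, indices found only in
    paras1, and indices found only in paras2. Position is ignored, so paragraphs
    that moved still pair up.
    """
    pairs, used = [], set()
    for i, p1 in enumerate(paras1):
        best_j, best_score = None, threshold
        for j, p2 in enumerate(paras2):
            if j in used:
                continue
            score = similarity(p1, p2)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            pairs.append((i, best_j))
            used.add(best_j)
    matched = {i for i, _ in pairs}
    removed = [i for i in range(len(paras1)) if i not in matched]
    added = [j for j in range(len(paras2)) if j not in used]
    return pairs, removed, added


paras1 = ["alpha beta gamma", "delta epsilon", "only in one"]
paras2 = ["delta epsilon", "alpha beta gamma!", "brand new text"]
print(match_paragraphs(paras1, paras2))
```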

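🔍 Text Similarity Calculation

Supported similarity algorithms:

# `fuzz` is assumed to come from a fuzzy-matching library such as thefuzz
# (e.g. `from thefuzz import fuzz`); this document does not name the library.

# 1. Standard similarity (ratio)
similarity = fuzz.ratio("文本1", "文本2")
# Output: 67 (scale 0-100; 2 of 3 characters match)

# 2. Partial matching (partial_ratio)
similarity = fuzz.partial_ratio("这是一段很长的文本", "很长的文本")
# Output: 100

# 3. Matching after sorting tokens (token_sort_ratio)
similarity = fuzz.token_sort_ratio("apple banana", "banana apple")
# Output: 100

# 4. Token set matching (token_set_ratio)
similarity = fuzz.token_set_ratio("the quick brown fox", "brown quick fox")
# Output: 100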
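
To make the scores less of a black box, the plain ratio can be approximated with the standard library: it is twice the number of matching characters divided by the combined length of both strings. The sketch below uses `difflib` as a stand-in, so results may differ slightly from a Levenshtein-based library on some inputs; `ratio` and `token_sort_ratio` here are local illustrative functions, not the library's.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """2 * matching characters / total length, scaled to 0-100 and rounded."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Sort whitespace-separated tokens before comparing, so word order is ignored."""
    return ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


print(ratio("文本1", "文本2"))          # 2 matching characters, 6 in total
print(ratio("流动资产", "流动 资产"))    # 4 matching characters, 9 in total
print(token_sort_ratio("apple banana", "banana apple"))
```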

🚀 Quick Start

1. Basic comparison

# Compare two Markdown files
python comparator/compare_ocr_results.py file1.md file2.md

# Output in JSON format
python comparator/compare_ocr_results.py file1.md file2.md -f json

# Output in Markdown format
python comparator/compare_ocr_results.py file1.md file2.md -f markdown

# Output in both formats
python comparator/compare_ocr_results.py file1.md file2.md -f both

2. Transaction-ledger table comparison ✨

# Use flow_list mode (header detection + multi-level header recognition)
python comparator/compare_ocr_results.py file1.md file2.md \
    --table-mode flow_list \
    -o output/comparison_result \
    -f both

# Balance sheet comparison (multi-level headers are recognized automatically)
python comparator/compare_ocr_results.py balance_sheet1.md balance_sheet2.md \
    --table-mode flow_list \
    --similarity-algorithm ratio \
    -o balance_sheet_comparison

3. Advanced comparison

# Use the token_set_ratio algorithm (token set matching)
python comparator/compare_ocr_results.py file1.md file2.md \
    --similarity-algorithm token_set_ratio

# Skip image content comparison
python comparator/compare_ocr_results.py file1.md file2.md \
    --ignore-images

# Set the output file name
python comparator/compare_ocr_results.py file1.md file2.md \
    -o my_comparison_report

# Print detailed debug information
python comparator/compare_ocr_results.py file1.md file2.md \
    --table-mode flow_list \
    -v

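📖 Command-Line Arguments

Required arguments

Argument	Type	Description
`file1`	string	Path to the first file (original OCR result)
`file2`	string	Path to the second file (verification result)

Optional arguments

Argument	Default	Description
`-o, --output`	`comparison_result`	Output file name (without extension)
`-f, --format`	`both`	Output format: `json` / `markdown` / `both`
`--table-mode`	`standard`	Table comparison mode: `standard` / `flow_list`
`--similarity-algorithm`	`ratio`	Similarity algorithm: `ratio` / `partial_ratio` / `token_sort_ratio` / `token_set_ratio`
`--ignore-images`	`False`	Skip image content comparison
`-v, --verbose`	`False`	Print detailed debug information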
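
For readers who want to see how these options fit together, here is a hypothetical `argparse` reconstruction of the interface described in the two tables. It mirrors the documented names and defaults only; the real `compare_ocr_results.py` may define its parser differently, and `build_parser` is an invented name.

```python
import argparse

ALGORITHMS = ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"]


def build_parser():
    """Build a parser mirroring the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Compare two OCR result files")
    parser.add_argument("file1", help="first file (original OCR result)")
    parser.add_argument("file2", help="second file (verification result)")
    parser.add_argument("-o", "--output", default="comparison_result",
                        help="output file name without extension")
    parser.add_argument("-f", "--format", choices=["json", "markdown", "both"],
                        default="both", help="output format")
    parser.add_argument("--table-mode", choices=["standard", "flow_list"],
                        default="standard", help="table comparison mode")
    parser.add_argument("--similarity-algorithm", choices=ALGORITHMS,
                        default="ratio", help="similarity algorithm")
    parser.add_argument("--ignore-images", action="store_true",
                        help="skip image content comparison")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print detailed debug information")
    return parser


args = build_parser().parse_args(["a.md", "b.md", "--table-mode", "flow_list", "-v"])
print(args.table_mode, args.format, args.output, args.verbose)
```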

📊 Output Formats

JSON format

{
  "file1": "/path/to/file1.md",
  "file2": "/path/to/file2.md",
  "comparison_time": "2025-11-07 14:30:25",
  "table_mode": "flow_list",
  "similarity_algorithm": "ratio",
  
  "differences": [
    {
      "type": "table_amount",
      "position": "Row 15, column 5",
      "file1_value": "15.00",
      "file2_value": "15,00",
      "description": "Amount mismatch: 15.00 vs 15,00",
      "severity": "high",
      "column_name": "金额",
      "column_type": "numeric",
      "row_index": 15,
      "col_index": 4
    },
    {
      "type": "table_header_position",
      "position": "Header position",
      "file1_value": "Row 1",
      "file2_value": "Row 2",
      "description": "Header position mismatch: row 1 in file 1, row 2 in file 2",
      "severity": "high"
    },
    {
      "type": "table_text",
      "position": "Row 20, column 3",
      "file1_value": "流动资产",
      "file2_value": "流动 资产",
      "description": "Text mismatch: 流动资产 vs 流动 资产 [column type conflict]",
      "severity": "high",
      "column_type_mismatch": true
    }
  ],
  
  "statistics": {
    "total_differences": 42,
    "table_differences": 35,
    "amount_differences": 8,
    "datetime_differences": 3,
    "text_differences": 24,
    "paragraph_differences": 7,
    "critical_severity": 2,
    "high_severity": 11,
    "medium_severity": 17,
    "low_severity": 12
  },
  
  "table_matches": [
    {
      "file1_table_index": 0,
      "file2_table_index": 0,
      "similarity": 95.2,
      "header_position_file1": 1,
      "header_position_file2": 1,
      "row_count_file1": 10,
      "row_count_file2": 10,
      "column_count_file1": 6,
      "column_count_file2": 6
    }
  ]
}
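
When consuming this JSON, a quick sanity check on the `statistics` block can catch truncated or hand-edited reports. The sketch below assumes, based only on the sample above, that table and paragraph counts add up to the total and that the four severity counters do as well; `check_statistics` is a hypothetical helper.

```python
SEVERITY_LEVELS = ("critical", "high", "medium", "low")


def check_statistics(stats):
    """Return a list of inconsistencies found in a `statistics` block."""
    problems = []
    total = stats["total_differences"]
    if stats["table_differences"] + stats["paragraph_differences"] != total:
        problems.append("table + paragraph counts do not add up to the total")
    by_severity = sum(stats[f"{level}_severity"] for level in SEVERITY_LEVELS)
    if by_severity != total:
        problems.append("severity counts do not add up to the total")
    return problems


# Figures copied from the sample report in this section.
sample = {
    "total_differences": 42,
    "table_differences": 35,
    "paragraph_differences": 7,
    "critical_severity": 2,
    "high_severity": 11,
    "medium_severity": 17,
    "low_severity": 12,
}
print(check_statistics(sample))
```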

Markdown format

# OCR Result Comparison Report

## 📋 Basic Information
- **Original OCR result**: `/path/to/file1.md`
- **Verification result**: `/path/to/file2.md`
- **Comparison time**: `2025-11-07 14:30:25`
- **Table comparison mode**: `flow_list`
- **Similarity algorithm**: `ratio`

---

## 📊 Statistics
- **Total differences**: 42
- **Table differences**: 35
  - Amount differences: 8 (severity: high)
  - Date differences: 3 (severity: medium)
  - Text differences: 24 (severity: low/medium)
  - Column type conflicts: 3 (severity raised to: high)
- **Paragraph differences**: 7

---

## 📈 Severity Distribution
- ❌ **Critical**: 2
- ⚠️ **High priority**: 11
- ℹ️ **Medium priority**: 17
- 💡 **Low priority**: 12

---

## 🔍 Table Matches

### Table #1 ↔ Table #1 (similarity: 95.2%)
- **Row count**: 10 vs 10, similarity: 100.0%
- **Column count**: 6 vs 6, similarity: 100.0%
- **Header position**: row 1 in file 1, row 1 in file 2
- **Header similarity**: 92.5%
  - Exact match: 83%
  - Fuzzy match: 100%
  - Semantic match: 100%

---

## 📝 Difference Details (by severity)

### ❌ Critical

| No. | Position | Type | Original OCR result | Verification result | Description |
|------|------|------|-----------|----------|------|
| 1 | Table column types | table_header_critical | 5 columns differ in type | 10 columns in total | Column types differ too much (50%) |

---

### ⚠️ High Priority

| No. | Position | Type | Original OCR result | Verification result | Description |
|------|------|------|-----------|----------|------|
| 1 | Row 15, column 5 | table_amount | 15.00 | 15,00 | Amount mismatch |
| 2 | Row 20, column 3 | table_text | 流动资产 | 流动 资产 | Text mismatch [column type conflict] |
| 3 | Header position | table_header_position | Row 1 | Row 2 | Header position mismatch |

---

### ℹ️ Medium Priority

| No. | Position | Type | Original OCR result | Verification result | Description |
|------|------|------|-----------|----------|------|
| 1 | Row 8, column 2 | table_datetime | 2023-12-31 | 2023年12月31日 | Date format mismatch |

---

### 💡 Low Priority

| No. | Position | Type | Original OCR result | Verification result | Description |
|------|------|------|-----------|----------|------|
| 1 | Row 3, column 1 | table_text | 现金及现金等价物 | 现金及 现金等价物 | Text similarity: 92.3% |

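🎯 Usage Scenarios

Scenario 1: Bank statement comparison

# Command
python comparator/compare_ocr_results.py \
    /data/银行流水/dotsocr/page_001.md \
    /data/银行流水/paddleocr_vl/page_001.md \
    --table-mode flow_list \
    --similarity-algorithm ratio \
    -o output/bank_flow_comparison \
    -f both

# Features
# ✅ Automatic header detection (keywords such as 日期 "date", 金额 "amount", 余额 "balance")
# ✅ Column type detection (numeric amounts, dates, text-type numbers)
# ✅ Amount differences are high priority
# ✅ Column type conflicts raise the severity automatically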

Scenario 2: Balance sheet comparison ✨

# Command
python comparator/compare_ocr_results.py \
    /data/年报/mineru/balance_sheet.md \
    /data/年报/ppstructv3/balance_sheet.md \
    --table-mode flow_list \
    --similarity-algorithm ratio \
    -o output/balance_sheet_comparison \
    -v

# Features
# ✅ Recognizes multi-level headers (main header + category titles)
# ✅ Detects category rows such as "流动资产:" (current assets)
# ✅ Score bonuses (category row +0.1, data row +0.2)
# ✅ Detailed debug information (-v flag)

# Sample debug output
📍 Header detected in row 1 (score: 0.87)
   - Keywords: "资产"(0.25) + "余额"(0.50) + "负债"(0.25)
   - Next row: category row (+0.1)
   - Total score: 1.0 + 0.1 = 1.1

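Scenario 3: Income statement comparison

# Command
python comparator/compare_ocr_results.py \
    /data/财报/paddleocr_vl/income_statement.md \
    /data/财报/dots_ocr/income_statement.md \
    --table-mode flow_list \
    --similarity-algorithm token_set_ratio \
    -o output/income_statement_comparison

# Features
# ✅ Token set matching (tolerates word-order differences)
# ✅ Automatic detection of keywords such as 收入 "revenue" and 成本 "cost"
# ✅ Exact comparison of numeric columns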

Scenario 4: Batch comparison

# Batch comparison script
for file1 in /data/dotsocr/*.md; do
    # Quote the path so file names with spaces are handled correctly
    file2="/data/paddleocr_vl/$(basename "$file1")"
    if [ -f "$file2" ]; then
        python comparator/compare_ocr_results.py \
            "$file1" "$file2" \
            --table-mode flow_list \
            -o "output/$(basename "$file1" .md)_comparison" \
            -f json
    fi
done

🔧 Programming Interface

Calling OCRResultComparator directly

from comparator.ocr_comparator import OCRResultComparator

# Initialize the comparator
comparator = OCRResultComparator(
    table_mode='flow_list',
    similarity_algorithm='ratio',
    ignore_images=False
)

# Load content from the files
with open('file1.md', 'r', encoding='utf-8') as f:
    content1 = f.read()
with open('file2.md', 'r', encoding='utf-8') as f:
    content2 = f.read()

# Run the comparison
result = comparator.compare(content1, content2)

# Inspect the result
print(f"Total differences: {result['statistics']['total_differences']}")
print(f"Table differences: {result['statistics']['table_differences']}")
print(f"Paragraph differences: {result['statistics']['paragraph_differences']}")

# Collect high-priority differences
high_diffs = [d for d in result['differences'] if d['severity'] == 'high']
print(f"High-priority differences: {len(high_diffs)}")

Using the table comparator on its own

from comparator.table_comparator import TableComparator

# Initialize the table comparator
table_comparator = TableComparator(
    mode='flow_list',
    similarity_algorithm='ratio'
)

# Prepare the table data
table1 = [
    ['日期', '金额', '余额'],
    ['2023-01-01', '1000.00', '5000.00'],
    ['2023-01-02', '500.00', '5500.00']
]

table2 = [
    ['日期', '金额', '余额'],
    ['2023-01-01', '1,000.00', '5,000.00'],
    ['2023-01-02', '500.00', '5500.00']
]

# Run the comparison
differences = table_comparator.compare_tables(table1, table2)

# Inspect the differences
for diff in differences:
    print(f"{diff['position']}: {diff['description']} (severity: {diff['severity']})")

Using the similarity calculator on its own

from comparator.similarity_calculator import SimilarityCalculator

# Initialize the calculator
calculator = SimilarityCalculator(algorithm='ratio')

# Compute text similarity
similarity = calculator.calculate("流动资产", "流动 资产")
print(f"Similarity: {similarity}%")  # roughly 89% with a plain ratio (4 matching characters, 9 in total)

# Switch algorithms
calculator.set_algorithm('token_set_ratio')
similarity = calculator.calculate("apple banana", "banana apple")
print(f"Similarity: {similarity}%")  # Output: 100%

Using the data type detector on its own

from comparator.data_type_detector import DataTypeDetector

# Initialize the detector
detector = DataTypeDetector()

# Detect the type of a single value
print(detector.detect_type("28,239,305.48"))  # Output: numeric
print(detector.detect_type("2023-12-31"))     # Output: datetime
print(detector.detect_type("20231231001"))    # Output: text_number
print(detector.detect_type("货币资金"))        # Output: text

# Detect the type of a column
column_values = ["1000.00", "2000.50", "3000.75", "文本"]
column_type = detector.detect_column_type(column_values)
print(f"Column type: {column_type}")  # Output: numeric (75% of the values are numbers)

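🐛 Debugging Tips

1. Enable verbose logging

# Use the -v flag
python comparator/compare_ocr_results.py file1.md file2.md \
    --table-mode flow_list \
    -v

Sample output:

🔍 Starting comparison...
📄 File 1: /path/to/file1.md
📄 File 2: /path/to/file2.md
⚙️ Table mode: flow_list
⚙️ Similarity algorithm: ratio

📊 Extracting tables...
   File 1: found 2 tables
   File 2: found 2 tables

🔗 Matching tables...
   Table #1 ↔ Table #1: similarity 95.2%

📍 Detecting header positions...
   File 1, table 1: header detected in row 1 (score: 0.87)
      Keywords: "资产"(0.25) + "余额"(0.50) + "负债"(0.25)
      Next row: category row (+0.1)
   File 2, table 1: header detected in row 1 (score: 0.85)

🔍 Comparing cells...
   Row 15, column 5: amount difference (15.00 vs 15,00) [severity: high]
   Row 20, column 3: text difference (流动资产 vs 流动 资产) [column type conflict] [severity: high]

✅ Comparison finished
   Total differences: 42
   Table differences: 35
   Paragraph differences: 7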

2. Inspect table matches

# Print the table matching results
result = comparator.compare(content1, content2)
for match in result['table_matches']:
    print(f"Table #{match['file1_table_index']} ↔ #{match['file2_table_index']}")
    print(f"  Similarity: {match['similarity']}%")
    print(f"  Rows: {match['row_count_file1']} vs {match['row_count_file2']}")
    print(f"  Columns: {match['column_count_file1']} vs {match['column_count_file2']}")

3. Analyze column type conflicts

# Keep only the differences flagged as column type conflicts
type_conflicts = [
    d for d in result['differences'] 
    if d.get('column_type_mismatch', False)
]

for diff in type_conflicts:
    print(f"Position: {diff['position']}")
    print(f"File 1: {diff['file1_value']}")
    print(f"File 2: {diff['file2_value']}")
    print(f"Base severity: {diff.get('base_severity', 'N/A')}")
    print(f"Final severity: {diff['severity']}")
    print()

4. Inspect header detection results

# Run header detection manually (`table` is a list of rows, as in the table comparator example)
from comparator.table_comparator import TableComparator

comparator = TableComparator(mode='flow_list')
header_row_idx = comparator._detect_table_header_row(table)

print(f"Header detected in row {header_row_idx + 1}")

# Print the score of every row
for idx, row in enumerate(table):
    score = comparator._score_header_row(row, table, idx)
    print(f"Row {idx + 1}: score {score:.2f}")

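📚 FAQ

Q1: Header detection is inaccurate?

Check that the table contains header keywords (日期 "date", 金额 "amount", 余额 "balance", etc.)
Use the -v flag to see the detailed scores
Set the header position manually (set header_row_idx in code)

Q2: Column type detection is wrong?

Check whether the column's data is consistent (or mixes several types)
Adjust the detection threshold (60% by default; change it in code)
Read the detection log to see how the decision was made

Q3: Too many differences, all with high severity?

Check for column type conflicts (they raise the severity automatically)
Try a different similarity algorithm (e.g. token_set_ratio)
Confirm that the table structures agree (row and column counts)

Q4: Multi-level header recognition fails? ✨

Confirm that the table structure is as expected:
- Row 1: main header
- Row 2: category title (e.g. "流动资产:")
- Row 3 onward: data rows
Check the category row format:
- The first cell contains a keyword followed by a colon
- All other cells are empty
Use the -v flag to see the detection details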
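
The category row format described in Q4 can be checked with a few lines of code before running a full comparison. This is a simplified sketch: `is_category_row` is a hypothetical helper, it accepts both the ASCII and the full-width colon, and it does not check the text against any keyword list as the module presumably does.

```python
def is_category_row(row):
    """True if the first cell is text ending in a colon and every other cell is empty."""
    if not row:
        return False
    first = str(row[0]).strip()
    has_label = len(first) > 1 and first.endswith((":", "："))
    rest_empty = all(not str(cell).strip() for cell in row[1:])
    return has_label and rest_empty


rows = [
    ["流动资产:", "", ""],          # category row
    ["流动资产：", " ", ""],        # full-width colon, whitespace-only cell
    ["货币资金", "1000.00", ""],    # data row
    ["流动资产:", "100", ""],       # colon present, but another cell is filled
]
for row in rows:
    print(row, is_category_row(row))
```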

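Q5: Table matching fails?

Check the table similarity threshold (50% by default)
Look at the header similarity (the most important matching factor)
Confirm that the row and column counts do not differ too much

Q6: Amount formatting differences cause false positives?

Preprocess with the number normalization tool:

from normalize_financial_numbers import normalize_financial_numbers
normalized = normalize_financial_numbers(text)

Or unify the formats manually before comparing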
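
If the normalization tool is not available, manual unification can be as simple as parsing both values into decimals before comparing. The sketch below is independent of `normalize_financial_numbers`; `parse_amount` and `same_amount` are hypothetical helpers that accept only plain digits or correctly grouped thousands separators, so a value like `15,00` stays flagged as a genuine difference.

```python
import re
from decimal import Decimal

# Plain digits or correctly grouped thousands, with an optional fractional part.
_AMOUNT_RE = re.compile(r"(-?)(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?")


def parse_amount(text):
    """Return the amount as a Decimal, or None if the text is not a well-formed amount."""
    match = _AMOUNT_RE.fullmatch(str(text).strip())
    if match is None:
        return None
    sign, integer, fraction = match.groups()
    return Decimal(sign + integer.replace(",", "") + (fraction or ""))


def same_amount(a, b):
    """True only when both values parse and are numerically equal."""
    pa, pb = parse_amount(a), parse_amount(b)
    return pa is not None and pa == pb


print(same_amount("1000.00", "1,000.00"))
print(same_amount("15.00", "15,00"))
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are compared for exact equality.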

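Q7: Similarity results look wrong?

Try a different similarity algorithm
Check whether the text contains special characters
Confirm that the encoding is correct (UTF-8)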

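🎓 Best Practices

1. Choose the right comparison mode

Document type	Recommended mode	Reason
Fixed-format reports	`standard`	Stable structure; row-by-row comparison is more precise
Bank statements	`flow_list`	Header position may vary and needs detection
Balance sheets	`flow_list`	Supports multi-level header recognition ✨
Income statements	`flow_list`	Detects keywords such as 收入 "revenue" and 成本 "cost"
Transaction records	`flow_list`	Column types vary and need type detection

2. Choose the right similarity algorithm

Algorithm	Use case	Characteristics
`ratio`	Exact comparison	Strict matching; suits uniformly formatted text
`partial_ratio`	Partial matching	Suits comparing a long text with a short one
`token_sort_ratio`	Word-order differences	Tolerates different word order
`token_set_ratio`	Token set matching	Tolerates repeated words and word order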

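3. Handle common difference types

Amount differences:

# Preprocess: normalize amount formats
python normalize_financial_numbers.py input.json output.json

# Compare
python comparator/compare_ocr_results.py file1.md file2.md \
    --table-mode flow_list

Date format differences:

# Normalize date formats before comparing
import re

def normalize_date(date_str):
    """Rewrite 2023-12-31 or 2023/12/31 as 2023年12月31日; leave other input unchanged."""
    match = re.fullmatch(r'(\d{4})[-/](\d{1,2})[-/](\d{1,2})', date_str.strip())
    if match is None:
        return date_str
    year, month, day = match.groups()
    return f"{year}年{month}月{day}日"

Column type conflicts:

# Find out what caused the conflict
type_conflicts = [d for d in result['differences'] if d.get('column_type_mismatch')]
for diff in type_conflicts:
    print(f"{diff['position']}: {diff['file1_value']} vs {diff['file2_value']}")
    # Decide whether it is an OCR error or only a formatting difference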

4. Batch comparison strategy

#!/bin/bash
# batch_compare.sh

# Configuration
SOURCE_DIR="/data/dotsocr"
TARGET_DIR="/data/paddleocr_vl"
OUTPUT_DIR="/output/comparisons"

# Batch comparison
for file1 in "$SOURCE_DIR"/*.md; do
    filename=$(basename "$file1")
    file2="$TARGET_DIR/$filename"
    
    if [ -f "$file2" ]; then
        echo "Comparing: $filename"
        python comparator/compare_ocr_results.py \
            "$file1" "$file2" \
            --table-mode flow_list \
            --similarity-algorithm ratio \
            -o "$OUTPUT_DIR/${filename%.md}_comparison" \
            -f both
    else
        echo "Skipping: $filename (target file does not exist)"
    fi
done

echo "✅ Batch comparison finished"
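
The same pairing logic can be written in Python, which makes it easier to collect the skipped files or to run comparisons in parallel later. This is a sketch under the assumptions of the shell script above (same file name in both directories, same CLI flags); `pair_markdown_files` and `build_command` are hypothetical helpers, and the commands are only built here, not executed.

```python
from pathlib import Path


def pair_markdown_files(source_dir, target_dir):
    """Return (pairs, missing): files present in both directories, and names missing from target."""
    pairs, missing = [], []
    for file1 in sorted(Path(source_dir).glob("*.md")):
        file2 = Path(target_dir) / file1.name
        if file2.is_file():
            pairs.append((file1, file2))
        else:
            missing.append(file1.name)
    return pairs, missing


def build_command(file1, file2, output_dir):
    """Build the documented CLI invocation as an argument list for subprocess.run."""
    output = Path(output_dir) / f"{Path(file1).stem}_comparison"
    return [
        "python", "comparator/compare_ocr_results.py", str(file1), str(file2),
        "--table-mode", "flow_list",
        "--similarity-algorithm", "ratio",
        "-o", str(output),
        "-f", "both",
    ]


pairs, missing = pair_markdown_files("/data/dotsocr", "/data/paddleocr_vl")
for file1, file2 in pairs:
    print(" ".join(build_command(file1, file2, "/output/comparisons")))
print("skipped:", missing)
```

Passing each list to `subprocess.run(cmd, check=True)` would execute the comparisons one by one.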

5. Result analysis workflow

import json

# Load the comparison result (UTF-8, since values may contain Chinese text)
with open('comparison_result.json', 'r', encoding='utf-8') as f:
    result = json.load(f)

# 1. Summary statistics
stats = result['statistics']
print(f"Total differences: {stats['total_differences']}")
print(f"Table differences: {stats['table_differences']}")

# 2. Severity distribution
print(f"Critical: {stats['critical_severity']}")
print(f"High: {stats['high_severity']}")
print(f"Medium: {stats['medium_severity']}")
print(f"Low: {stats['low_severity']}")

# 3. Distribution by difference type
type_counts = {}
for diff in result['differences']:
    type_counts[diff['type']] = type_counts.get(diff['type'], 0) + 1

for diff_type, count in sorted(type_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{diff_type}: {count}")

# 4. Focus on high-priority differences
high_diffs = [d for d in result['differences'] if d['severity'] == 'high']
for diff in high_diffs:
    print(f"⚠️ {diff['position']}: {diff['description']}")

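📝 Development Guide

Adding a new data type detector

# Add to data_type_detector.py
import re

def is_currency(value):
    """Check whether the value is formatted as a currency amount."""
    patterns = [
        r'¥\s*[\d,]+\.?\d*',
        r'\$\s*[\d,]+\.?\d*',
        r'[\d,]+\.?\d*\s*元',
    ]
    # fullmatch rejects trailing text that re.match would let through
    return any(re.fullmatch(pattern, str(value).strip()) for pattern in patterns)

Custom similarity algorithm

# Add to similarity_calculator.py
def custom_similarity(text1, text2):
    """Custom similarity calculation."""
    # Implement the custom logic here
    pass

# Register the algorithm
SimilarityCalculator.register_algorithm('custom', custom_similarity)

Extending the table comparison logic

# Subclass TableComparator
class CustomTableComparator(TableComparator):
    def _detect_table_header_row(self, table):
        """Custom header detection logic."""
        # Implement the custom detection here
        pass
    
    def _compare_cells(self, cell1, cell2, column_type):
        """Custom cell comparison."""
        # Implement the custom comparison here
        pass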

📄 License

This module is released under the MIT License.

Last updated: November 7, 2025. Maintainer: zhch158_admin