Unified PyTorch Models

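📚 Detailed documentation: for in-depth technical notes on ONNX conversion, inference algorithms, and related topics, see docs/ocr_tools/pytorch_models/

A unified PyTorch model inference interface that supports layout detection, OCR, and document orientation classification.

Location: ocr_platform/ocr_tools/pytorch_models/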

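📂 Directory structure

pytorch_models/
├── Layout/                          # Layout detection and orientation classification models
│   ├── PP-LCNet_x1_0_doc_ori.onnx  # Document orientation classification model
│   └── RT-DETR-H_layout_17cls.onnx # Layout detection model
├── OCR/                             # OCR model directory (downloaded from ModelScope)
│   ├── Cls/                         # Orientation classifier models
│   └── Rec/                         # Text recognition models
├── vendor/                          # ✨ Core dependency modules
│   ├── __init__.py                  # After refactoring: imports shared utilities from ocr_utils
│   ├── ocr_utils.py                 # OCR-specific utility functions
│   ├── infer/                       # Inference modules
│   │   ├── predict_det.py           # Text detection
│   │   ├── predict_rec.py           # Text recognition
│   │   ├── predict_cls.py           # Orientation classification
│   │   ├── predict_system.py        # OCR system
│   │   └── pytorchocr_utility.py    # Utility functions
│   └── pytorchocr/                  # PytorchOCR core
│       ├── modeling/                # Model architectures
│       ├── postprocess/             # Post-processing
│       └── utils/                   # Utilities and resources
│           └── resources/           # Configuration and dictionaries
│               ├── models_config.yml
│               └── dict/            # Multilingual dictionaries
├── pytorch_paddle.py                # ✨ PytorchPaddleOCR main module
├── layout_detect_onnx.py            # Layout detector
├── orientation_classifier_v2.py     # Enhanced orientation classifier
├── doc_preprocessor_v2.py           # Document preprocessing pipeline
├── doc_preprocess_result.py         # Preprocessing result data class
├── paddle_to_pytorch_universal.py   # Paddle model conversion tool
├── unified_model_loader.py          # Unified model loader
├── onnx转换、推理算法.md              # Technical documentation
└── README.md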

🚀 Quick start

1. Install dependencies

pip install torch torchvision opencv-python onnxruntime numpy pyyaml shapely pyclipper loguru python-dotenv

2. Prepare model files

# Models are downloaded automatically from ModelScope to:
# ~/.cache/modelscope/models/OpenDataLab/PDF-Extract-Kit-1.0/

# Or set a custom cache directory:
export MODELSCOPE_CACHE_DIR="/path/to/your/cache"

3. Configure environment variables

Create a .env file (optional):

# ModelScope cache directory
MODELSCOPE_CACHE_DIR=/Users/zhch158/models/modelscope_cache

# Device configuration
DEVICE=cpu  # or cuda, mps
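
A minimal sketch of how these two settings could be read at startup. It works on any environment-like mapping, so it can be exercised without touching the real environment. The helper name `resolve_settings` and the fallback to `cpu` for unrecognized device values are assumptions made for illustration and are not taken from the module's code; the default cache location mirrors the path given in step 2. With python-dotenv installed, calling `load_dotenv()` beforehand would copy the .env values into `os.environ`.

```python
import os
from pathlib import Path

VALID_DEVICES = ("cpu", "cuda", "mps")  # values listed in the .env example


def resolve_settings(env=None):
    """Return (cache_dir, device) from an environment-like mapping.

    Hypothetical helper: the real module may resolve these values differently.
    """
    env = os.environ if env is None else env
    default_cache = Path.home() / ".cache" / "modelscope"
    cache_dir = Path(env.get("MODELSCOPE_CACHE_DIR") or default_cache).expanduser()
    device = (env.get("DEVICE") or "cpu").strip().lower()
    if device not in VALID_DEVICES:
        device = "cpu"  # assumption: fall back to CPU on unknown values
    return cache_dir, device


if __name__ == "__main__":
    cache_dir, device = resolve_settings()
    print(f"cache_dir={cache_dir} device={device}")
```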

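4. Run the tests

# Adjust this path to the location of your checkout
cd /Users/zhch158/workspace/repository.git/ocr_platform/ocr_tools/pytorch_models

# Test OCR (with visualization)
python pytorch_paddle.py

# Test document preprocessing
python doc_preprocessor_v2.py

# Test layout detection
python layout_detect_onnx.py

# Test orientation classification
python orientation_classifier_v2.py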

📖 Usage examples

1. OCR recognition (recommended)

# Add the ocr_platform root directory to the Python path
import sys
from pathlib import Path
ocr_platform_root = Path(__file__).parents[3]  # Adjust to your actual directory layout
if str(ocr_platform_root) not in sys.path:
    sys.path.insert(0, str(ocr_platform_root))

from ocr_tools.pytorch_models import PytorchPaddleOCR
import cv2

# Initialize the OCR engine
ocr = PytorchPaddleOCR(
    lang='ch',                       # Language: ch, en, ch_lite, korean, japan, etc.
    device='cpu',                    # Device: cpu, cuda, mps
    use_orientation_cls=True,        # ✨ Enable orientation classification
    orientation_model_path='./Layout/PP-LCNet_x1_0_doc_ori.onnx',
    rec_batch_num=6,                 # Recognition batch size
    enable_merge_det_boxes=True      # Merge detection boxes
)

# Read the image
img = cv2.imread("test.jpg")

# Run OCR
results = ocr.ocr(img, det=True, rec=True)

# Print the results
if results and results[0]:
    for box, (text, conf) in results[0]:
        print(f"{text} (confidence={conf:.3f})")

# Visualize the results
img_vis = ocr.visualize(
    img,
    results,
    output_path="output_ocr.jpg",
    show_text=True,
    show_confidence=True
)
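
The loop above unpacks each entry of `results[0]` as `box, (text, conf)`. As a rough sketch of post-processing that output, the helper below flattens it into plain dictionaries, drops low-confidence lines, and sorts the rest top to bottom. It assumes each `box` is a sequence of `[x, y]` corner points, as in PaddleOCR-style output; that shape, the helper name `flatten_ocr_results`, and the sample data are illustrative assumptions, not part of the module.

```python
def flatten_ocr_results(results, min_conf=0.0):
    """Convert the first page of OCR output into a list of plain dicts.

    Assumes each entry is (box, (text, conf)) and box is a list of [x, y] points.
    """
    records = []
    if not results or not results[0]:
        return records
    for box, (text, conf) in results[0]:
        if conf < min_conf:
            continue
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        records.append({
            "text": text,
            "confidence": float(conf),
            # Axis-aligned rectangle enclosing the box: (x1, y1, x2, y2)
            "rect": (min(xs), min(ys), max(xs), max(ys)),
        })
    # Rough reading order: top to bottom, then left to right
    records.sort(key=lambda r: (r["rect"][1], r["rect"][0]))
    return records


if __name__ == "__main__":
    # Made-up sample in the assumed output shape
    sample = [[
        [[[10, 50], [110, 50], [110, 70], [10, 70]], ("second line", 0.95)],
        [[[10, 10], [90, 10], [90, 30], [10, 30]], ("first line", 0.42)],
    ]]
    for rec in flatten_ocr_results(sample):
        print(rec)
```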

2. Layout detection

from ocr_tools.pytorch_models import LayoutDetectorONNX
import cv2

# Initialize the detector
detector = LayoutDetectorONNX(
    onnx_path="./Layout/RT-DETR-H_layout_17cls.onnx",
    use_gpu=False
)

# Run detection
img = cv2.imread("test.jpg")
results = detector.predict(
    img,
    conf_threshold=0.5,
    return_debug=True
)

# Print the results
for box in results:
    print(f"{box['category_name']}: {box['bbox']}, score={box['score']:.3f}")

# Visualize
img_vis = detector.visualize(
    img,
    results,
    output_path="output_layout.jpg"
)
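
Each detection above is read through the keys `category_name`, `bbox`, and `score`. The sketch below groups such dictionaries by category and orders each group by score, which can be handy when only one kind of region is needed. The helper name `group_by_category`, the sample category names, and the `[x1, y1, x2, y2]` layout of `bbox` are assumptions for illustration only.

```python
from collections import defaultdict


def group_by_category(boxes, min_score=0.5):
    """Group detection dicts by 'category_name', highest score first."""
    grouped = defaultdict(list)
    for box in boxes:
        if box["score"] >= min_score:
            grouped[box["category_name"]].append(box)
    for items in grouped.values():
        items.sort(key=lambda b: b["score"], reverse=True)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up detections; category names and bbox layout are assumed
    detections = [
        {"category_name": "text", "bbox": [10, 10, 200, 40], "score": 0.91},
        {"category_name": "table", "bbox": [10, 60, 400, 300], "score": 0.88},
        {"category_name": "text", "bbox": [10, 320, 200, 350], "score": 0.97},
        {"category_name": "text", "bbox": [10, 360, 200, 390], "score": 0.31},
    ]
    for name, items in group_by_category(detections).items():
        print(name, [round(b["score"], 2) for b in items])
```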

3. Document orientation classification

from ocr_tools.pytorch_models import PytorchPaddleOCR, OrientationClassifierV2
import cv2

# Initialize OCR (used to assist the decision)
ocr = PytorchPaddleOCR(lang='ch', device='cpu')

# Initialize the orientation classifier
classifier = OrientationClassifierV2(
    model_path="./Layout/PP-LCNet_x1_0_doc_ori.onnx",
    text_detector=ocr,               # ✨ Pass in the text detector
    aspect_ratio_threshold=1.2,      # Aspect ratio threshold
    vertical_text_ratio=0.28,        # Vertical text ratio threshold
    vertical_text_min_count=3,       # Minimum number of vertical text boxes
    use_gpu=False
)

# Predict the orientation
img = cv2.imread("test.jpg")
result = classifier.predict(img, return_debug=True)

print(f"Rotation angle: {result.rotation_angle}°")
print(f"Confidence: {result.confidence:.3f}")
print(f"Needs rotation: {result.needs_rotation}")

# Rotate if needed
if result.needs_rotation:
    img_rotated = classifier.rotate_image(img, result.rotation_angle)
    cv2.imwrite("rotated.jpg", img_rotated)

4. Full document preprocessing pipeline

from ocr_tools.pytorch_models import PytorchPaddleOCR, DocPreprocessorV2
import cv2

# Initialize OCR
ocr = PytorchPaddleOCR(lang='ch', device='cpu')

# Initialize the preprocessing pipeline
pipeline = DocPreprocessorV2(
    orientation_model="./Layout/PP-LCNet_x1_0_doc_ori.onnx",
    text_detector=ocr,
    use_orientation_classify=True
)

# Predict
img = cv2.imread("test.jpg")
results = pipeline.predict(img, return_debug=True)

print(results[0])  # DocPreprocessResult object

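🔧 Configuration

OCR engine parameters

PytorchPaddleOCR(
    lang='ch',                       # Language
    device='cpu',                    # Device
    use_orientation_cls=True,        # Enable orientation classification
    orientation_model_path='...',    # Path to the orientation classification model
    rec_batch_num=6,                 # Recognition batch size
    det_db_thresh=0.3,               # Detection binarization threshold
    det_db_box_thresh=0.6,           # Detection box filtering threshold
    enable_merge_det_boxes=True,     # Merge detection boxes
    drop_score=0.5                   # Minimum confidence
)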

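Supported languages

Language code	Description	Recommended use
`ch`	Chinese (standard)	General-purpose Chinese recognition
`ch_lite`	Chinese (lightweight)	CPU environments
`ch_server`	Chinese (server)	High-accuracy scenarios
`en`	English	English recognition
`korean`	Korean	Korean recognition
`japan`	Japanese	Japanese recognition
`chinese_cht`	Traditional Chinese	Traditional Chinese recognition
`latin`	Latin script	Multilingual Latin-script text
`arabic`	Arabic	Arabic recognition
`cyrillic`	Cyrillic script	Russian, etc.
`devanagari`	Devanagari script	Hindi, etc.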

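Orientation classifier parameters

OrientationClassifierV2(
    model_path="...",
    text_detector=ocr,              # ✨ Text detector (assists the decision)
    aspect_ratio_threshold=1.2,     # Aspect ratio threshold (h/w > 1.2 triggers the check)
    vertical_text_ratio=0.28,       # Vertical text ratio threshold (>28% is treated as a sideways scan)
    vertical_text_min_count=3,      # Minimum number of vertical text boxes
    use_gpu=False
)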
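
The three thresholds above suggest a simple voting rule: when the image is clearly portrait, count the detected text boxes that are taller than they are wide, and flag the page as a sideways scan if there are enough of them. The sketch below is one plausible reading of those comments, not the classifier's actual code. In particular, the rule for calling a box "vertical" (`w / h < box_vertical_ratio`, default 0.8), the `(x1, y1, x2, y2)` box format, and the function name are assumptions.

```python
def looks_like_sideways_scan(image_hw, text_boxes,
                             aspect_ratio_threshold=1.2,
                             vertical_text_ratio=0.28,
                             vertical_text_min_count=3,
                             box_vertical_ratio=0.8):
    """Heuristic check for a page scanned sideways.

    image_hw   -- (height, width) of the image
    text_boxes -- iterable of (x1, y1, x2, y2) text boxes (assumed format)
    """
    img_h, img_w = image_hw
    # The check only runs for clearly portrait images (h/w above the threshold)
    if img_w <= 0 or img_h / img_w <= aspect_ratio_threshold:
        return False
    boxes = list(text_boxes)
    if not boxes:
        return False
    vertical = 0
    for x1, y1, x2, y2 in boxes:
        w, h = x2 - x1, y2 - y1
        # Assumption: a box counts as vertical text when it is much taller than wide
        if h > 0 and w / h < box_vertical_ratio:
            vertical += 1
    return (vertical >= vertical_text_min_count
            and vertical / len(boxes) > vertical_text_ratio)
```

Requiring both a minimum count and a minimum share keeps a page with only one or two narrow boxes (page numbers, margin marks) from being flagged.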

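🎨 Visualization

OCR visualization

img_vis = ocr.visualize(
    img,
    results,
    output_path="output.jpg",
    show_text=True,              # Show recognized text
    show_confidence=True,        # Show confidence scores
    font_scale=0.5,              # Font size
    thickness=2                  # Box border thickness
)

Color coding:

🟢 Green: high confidence (≥0.9)
🟡 Yellow: medium confidence (0.7-0.9)
🟠 Orange: low confidence (<0.7)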
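
For custom drawing code, the same three bands can be expressed as a small lookup. This is a sketch of the thresholds listed above; the exact BGR values are assumptions (OpenCV uses BGR channel order), and `confidence_color` is a hypothetical helper, not a function exported by the module.

```python
def confidence_color(conf):
    """Map a confidence score to a BGR color tuple for OpenCV drawing."""
    if conf >= 0.9:
        return (0, 255, 0)      # green: high confidence
    if conf >= 0.7:
        return (0, 255, 255)    # yellow: medium confidence
    return (0, 165, 255)        # orange: low confidence


if __name__ == "__main__":
    for score in (0.95, 0.8, 0.4):
        print(score, confidence_color(score))
```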

Layout detection visualization

img_vis = detector.visualize(
    img,
    results,
    output_path="layout.jpg",
    show_labels=True,
    show_scores=True
)

📝 Notes

1. Dependencies

Shared utilities: device detection and image processing helpers have moved to ocr_utils
- ocr_utils.device_utils - device detection
- ocr_utils.image_utils - image processing
OCR-specific utilities: remain in the vendor/ directory
- vendor/infer/ - OCR inference modules
- vendor/pytorchocr/ - PytorchOCR core

2. Model paths

Automatic download (recommended):

# Set the environment variable
export MODELSCOPE_CACHE_DIR="/path/to/cache"

# Models are downloaded automatically to:
# $MODELSCOPE_CACHE_DIR/models/OpenDataLab/PDF-Extract-Kit-1.0/

Manual paths:

ocr = PytorchPaddleOCR(
    det_model_path="/path/to/det_model.pth",
    rec_model_path="/path/to/rec_model.pth",
    rec_char_dict_path="/path/to/dict.txt"
)

3. GPU support

# CUDA (NVIDIA GPU)
ocr = PytorchPaddleOCR(device='cuda')

# MPS (Apple Silicon M1/M2/M3)
ocr = PytorchPaddleOCR(device='mps')

# CPU
ocr = PytorchPaddleOCR(device='cpu')
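
When the same script runs on machines with different hardware, it can help to fall back to CPU if the requested backend is missing. The sketch below separates the decision (`pick_device`) from the hardware query (`detect_backends`) so the decision can be tested without PyTorch. Both function names are hypothetical, and this does not reproduce whatever `ocr_utils.device_utils` does; it only uses `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(preferred, cuda_available, mps_available):
    """Return the preferred device if usable, otherwise 'cpu'."""
    preferred = preferred.lower()
    if preferred == "cuda" and cuda_available:
        return "cuda"
    if preferred == "mps" and mps_available:
        return "mps"
    return "cpu"


def detect_backends():
    """Ask PyTorch which backends exist; (False, False) if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False, False
    cuda_ok = torch.cuda.is_available()
    # Older PyTorch builds have no torch.backends.mps attribute
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return cuda_ok, mps_ok


if __name__ == "__main__":
    cuda_ok, mps_ok = detect_backends()
    print(pick_device("cuda", cuda_ok, mps_ok))
```

The returned string can then be passed as `device=` to `PytorchPaddleOCR`.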

4. Memory optimization

# Use the lightweight model in CPU environments
ocr = PytorchPaddleOCR(lang='ch_lite', device='cpu')

# Adjust the batch size
ocr = PytorchPaddleOCR(rec_batch_num=4)  # Default is 6
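
Lowering `rec_batch_num` presumably reduces peak memory because fewer text crops go through the recognizer at once, at the cost of more forward passes. The generic chunking sketch below illustrates that trade-off; it is not the module's batching code, and `iter_batches` is a hypothetical name.

```python
def iter_batches(items, batch_size=6):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    crops = list(range(14))  # stand-ins for 14 cropped text regions
    for size in (6, 4):
        sizes = [len(batch) for batch in iter_batches(crops, size)]
        print(f"batch_size={size}: {len(sizes)} passes, sizes={sizes}")
```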

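🐛 Troubleshooting

1. Recognition results are all empty strings

Cause: the character set was not loaded correctly

Solution:

# Verify the character set after initialization
if hasattr(ocr.text_recognizer, 'postprocess_op'):
    char_count = len(ocr.text_recognizer.postprocess_op.character)
    print(f"Character set size: {char_count}")  # Should be > 0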
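
If the character set size is 0, a next step is to check the dictionary file itself. The sketch below counts the entries in a dictionary file, assuming the common one-character-per-line layout; that layout and the helper name `count_dict_entries` are assumptions, so compare against the files under `vendor/pytorchocr/utils/resources/dict/` before relying on it.

```python
from pathlib import Path


def count_dict_entries(dict_path):
    """Count non-empty lines in a character dictionary file (assumed one entry per line)."""
    path = Path(dict_path)
    if not path.is_file():
        raise FileNotFoundError(f"dictionary file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        # Strip only line endings so an entry that is a single space still counts
        return sum(1 for line in fh if line.rstrip("\r\n"))
```

A count of 0, or a `FileNotFoundError`, would point at a wrong `rec_char_dict_path` rather than at the model weights.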

2. Sideways-scanned images are not recognized

Cause: the image orientation has not been corrected

Solution:

# Enable orientation classification
ocr = PytorchPaddleOCR(
    use_orientation_cls=True,
    orientation_model_path='./Layout/PP-LCNet_x1_0_doc_ori.onnx'
)

3. ImportError: No module named 'ocr_utils'

Solution:

import sys
from pathlib import Path

# Add the ocr_platform root directory to the Python path
ocr_platform_root = Path(__file__).parents[3]  # Adjust to your actual directory layout
sys.path.insert(0, str(ocr_platform_root))
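
A fixed `parents[3]` index breaks when the calling script moves to a different depth. As an alternative sketch, the helper below walks upward from a starting path until it finds a directory that contains an `ocr_tools/` subdirectory, which matches the `ocr_platform/ocr_tools/pytorch_models/` layout described earlier. The function names and the choice of `ocr_tools` as the marker are assumptions for illustration.

```python
import sys
from pathlib import Path


def find_platform_root(start, marker="ocr_tools"):
    """Walk up from start and return the first directory containing `marker`, or None."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


def ensure_on_sys_path(root):
    """Insert root at the front of sys.path unless it is already present."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


if __name__ == "__main__":
    root = find_platform_root(Path.cwd())
    if root is None:
        print("ocr_platform root not found; set the path manually")
    else:
        ensure_on_sys_path(root)
        print(f"using root: {root}")
```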

4. Model fails to load

MODELSCOPE_CACHE_DIR="/Users/zhch158/models/modelscope_cache"
# Check the model files
ls "$MODELSCOPE_CACHE_DIR/models/OpenDataLab/PDF-Extract-Kit-1.0/models/OCR/"

# Clear the cache and download again
rm -rf "$MODELSCOPE_CACHE_DIR/models/OpenDataLab/PDF-Extract-Kit-1.0/"

🔄 Model conversion

Convert a Paddle model to PyTorch:

from ocr_tools.pytorch_models import UniversalPaddleToPyTorchConverter

converter = UniversalPaddleToPyTorchConverter(
    paddle_model_dir="path/to/paddle_model",
    output_dir="./output"
)

onnx_path = converter.convert("model_name")

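📚 References

MinerU - PDF document parsing tool
PaddleOCR - Baidu's OCR toolkit
PaddleX - PaddlePaddle low-code development tool
PDF-Extract-Kit - ModelScope models

📄 License

This project is for learning and research use only. Copyright in the models belongs to their original authors.

Last updated: 2024-12-22
Migrated to: ocr_platform/ocr_tools/pytorch_models/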