
fix(pdf-extract): adjust box threshold for OCR detection (#447)

Set the `det_db_box_thresh` parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was lowered from 0.6 to
0.3, so candidate detection boxes scoring between 0.3 and 0.6 are now kept instead of
discarded. This is expected to recover text regions that were previously missed, at the
cost of possibly admitting more low-confidence boxes.
Xiaomeng Zhao, 1 year ago
parent
commit
041b9465b9
1 file changed, with 1 addition and 1 deletion

+ 1 - 1
magic_pdf/model/pdf_extract_kit.py

@@ -139,7 +139,7 @@ class CustomPEKModel:
         )
         # initialize OCR
         if self.apply_ocr:
-            self.ocr_model = ModifiedPaddleOCR(show_log=show_log)
+            self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)
 
         # init structeqtable
         if self.apply_table:
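
The effect of the threshold change described in the commit message can be sketched in plain Python. This is a minimal illustration, not PaddleOCR's post-processing code: the `Candidate` tuple, the `filter_by_box_thresh` helper, and the sample scores are all hypothetical, and the only assumption is that each candidate box carries one score that is compared against the threshold.

```python
from typing import List, Tuple

# Hypothetical candidate box: (description, score in [0, 1]).
# This is an illustrative structure, not PaddleOCR's internal representation.
Candidate = Tuple[str, float]


def filter_by_box_thresh(candidates: List[Candidate], box_thresh: float) -> List[Candidate]:
    """Keep only candidates whose score is at least box_thresh."""
    return [c for c in candidates if c[1] >= box_thresh]


# Made-up scores, chosen only to show the behaviour on both sides of each threshold.
candidates: List[Candidate] = [
    ("title", 0.92),
    ("body paragraph", 0.71),
    ("faint caption", 0.45),
    ("background texture", 0.22),
]

kept_at_06 = filter_by_box_thresh(candidates, 0.6)  # threshold before the change
kept_at_03 = filter_by_box_thresh(candidates, 0.3)  # threshold after the change

print("kept at 0.6:", [name for name, _ in kept_at_06])
print("kept at 0.3:", [name for name, _ in kept_at_03])
```

With these sample values, the lower threshold additionally keeps the "faint caption" box while still rejecting the lowest-scoring candidate, which is the trade-off the change is aiming for: more recovered text, with some risk of extra low-confidence detections.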