
fix(pdf-extract): adjust box threshold for OCR detection (#447)

Tuned the det_db_box_thresh parameter passed to the OCR model at initialization to improve
text extraction from images. The threshold was lowered from 0.6 to 0.3, so detection boxes
with lower confidence scores are now kept instead of being discarded. This is expected to
improve the quality of the extracted text by recovering text regions the detector previously dropped.
Xiaomeng Zhao, 1 year ago
parent
commit 041b9465b9
1 changed file with 1 addition and 1 deletion
  1. magic_pdf/model/pdf_extract_kit.py (+1 -1)

+ 1 - 1
magic_pdf/model/pdf_extract_kit.py

@@ -139,7 +139,7 @@ class CustomPEKModel:
         )
         # initialize OCR
         if self.apply_ocr:
-            self.ocr_model = ModifiedPaddleOCR(show_log=show_log)
+            self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)
 
         # init structeqtable
         if self.apply_table:
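
Note on the parameter: in PaddleOCR's DB text detector, det_db_box_thresh is the minimum score a candidate box must reach to be kept, so a lower value lets more boxes through. The sketch below only illustrates that filtering rule. It is not PaddleOCR's code: filter_boxes, the Box type, and the sample scores are hypothetical and exist purely for illustration.

```python
from typing import List, Tuple

# Hypothetical stand-in for a detected text box: (description, score).
Box = Tuple[str, float]


def filter_boxes(candidates: List[Box], box_thresh: float) -> List[Box]:
    """Keep only candidates whose score is at least box_thresh."""
    return [box for box in candidates if box[1] >= box_thresh]


# Made-up candidates with made-up scores, for illustration only.
candidates: List[Box] = [
    ("title", 0.92),
    ("body paragraph", 0.71),
    ("faint footnote", 0.45),
    ("speckle", 0.12),
]

kept_old = filter_boxes(candidates, box_thresh=0.6)  # value before this commit
kept_new = filter_boxes(candidates, box_thresh=0.3)  # value after this commit

print([name for name, _ in kept_old])
print([name for name, _ in kept_new])
```

With these sample scores, the 0.3 setting additionally keeps the faint footnote while still rejecting the lowest-scoring candidate. The trade-off is that a lower threshold can also admit more spurious boxes, so the value is worth checking against representative pages.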