fix(pre_proc): add Discarded block type to span block type compatibility

- Include BlockType.Discarded in the list of compatible block types for ContentType.Text and ContentType.InlineEquation
- With this change, text and inline-equation spans that fall inside discarded blocks are treated as compatible with them during OCR dict merging
myhloli 8 months ago
parent
commit
7a8568045d
1 changed file with 9 additions and 1 deletion

+ 9 - 1
magic_pdf/pre_proc/ocr_dict_merge.py

@@ -62,7 +62,15 @@ def merge_spans_to_line(spans, threshold=0.6):
 
 def span_block_type_compatible(span_type, block_type):
     if span_type in [ContentType.Text, ContentType.InlineEquation]:
-        return block_type in [BlockType.Text, BlockType.Title, BlockType.ImageCaption, BlockType.ImageFootnote, BlockType.TableCaption, BlockType.TableFootnote]
+        return block_type in [
+            BlockType.Text,
+            BlockType.Title,
+            BlockType.ImageCaption,
+            BlockType.ImageFootnote,
+            BlockType.TableCaption,
+            BlockType.TableFootnote,
+            BlockType.Discarded
+        ]
     elif span_type == ContentType.InterlineEquation:
         return block_type in [BlockType.InterlineEquation, BlockType.Text]
     elif span_type == ContentType.Image:
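
The effect of the hunk above can be illustrated with a small, self-contained sketch. It is not the project's code: the `ContentType` and `BlockType` enums below are stand-ins (the real definitions live in magic_pdf and their values are not shown in this diff), the `text_like_blocks` parameter is a hypothetical addition so the old and new behaviour can be compared side by side, and the branches cut off by the hunk (from `ContentType.Image` onward) are deliberately left out.

```python
from enum import Enum


# Stand-in enums for illustration only; the member values are assumptions,
# not the definitions used by magic_pdf.
class ContentType(Enum):
    Text = "text"
    InlineEquation = "inline_equation"
    InterlineEquation = "interline_equation"


class BlockType(Enum):
    Text = "text"
    Title = "title"
    ImageCaption = "image_caption"
    ImageFootnote = "image_footnote"
    TableCaption = "table_caption"
    TableFootnote = "table_footnote"
    InterlineEquation = "interline_equation"
    Discarded = "discarded"


# Block types accepted for text-like spans before the commit.
TEXT_LIKE_BLOCKS_BEFORE = {
    BlockType.Text,
    BlockType.Title,
    BlockType.ImageCaption,
    BlockType.ImageFootnote,
    BlockType.TableCaption,
    BlockType.TableFootnote,
}

# After the commit, Discarded is accepted as well.
TEXT_LIKE_BLOCKS_AFTER = TEXT_LIKE_BLOCKS_BEFORE | {BlockType.Discarded}


def span_block_type_compatible(span_type, block_type, text_like_blocks):
    """Mirror the two branches visible in the hunk.

    `text_like_blocks` is a hypothetical parameter added here so that the
    pre- and post-commit lists can be swapped in for comparison.
    """
    if span_type in (ContentType.Text, ContentType.InlineEquation):
        return block_type in text_like_blocks
    elif span_type == ContentType.InterlineEquation:
        return block_type in (BlockType.InterlineEquation, BlockType.Text)
    # Branches not shown in the hunk are omitted; this sketch rejects them.
    return False


if __name__ == "__main__":
    for label, allowed in (("before", TEXT_LIKE_BLOCKS_BEFORE),
                           ("after", TEXT_LIKE_BLOCKS_AFTER)):
        result = span_block_type_compatible(
            ContentType.Text, BlockType.Discarded, allowed
        )
        print(f"{label}: Text span in Discarded block compatible = {result}")
```

Under these assumptions, only the text-like branch changes: interline-equation spans are still rejected for discarded blocks, because that branch's list is untouched by the commit.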