
fix(pdf_parse): improve span removal logic for all content types

- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy
myhloli, 1 year ago
Commit ad0d06b6a0
1 file changed, with 1 insertion and 3 deletions

magic_pdf/pdf_parse_union_core_v2.py (+1 -3)

@@ -410,13 +410,11 @@ def remove_outside_spans(spans, all_bboxes):
                 if calculate_overlap_area_in_bbox1_area_ratio(span['bbox'], block_bbox) > 0.5:
                     new_spans.append(span)
                     break
-        elif span['type'] in [ContentType.Text, ContentType.InlineEquation, ContentType.InterlineEquation]:
+        else:
             for block_bbox in other_block_bboxes:
                 if calculate_overlap_area_in_bbox1_area_ratio(span['bbox'], block_bbox) > 0.5:
                     new_spans.append(span)
                     break
-        else:
-            new_spans.append(span)
 
     return new_spans
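
The behavioral effect of this hunk can be illustrated with a small standalone sketch. It is not the project's code: `overlap_ratio_in_bbox1` is a hypothetical stand-in for `calculate_overlap_area_in_bbox1_area_ratio`, assumed here to return the intersection area divided by the area of the first box, and `keep_span_old` / `keep_span_new` are hypothetical helpers that model only the branch touched by the diff (the preceding `if` branch is outside the hunk and is not modeled).

```python
def overlap_ratio_in_bbox1(bbox1, bbox2):
    """Assumed semantics: intersection area / area of bbox1, boxes as (x0, y0, x1, y1)."""
    x0 = max(bbox1[0], bbox2[0])
    y0 = max(bbox1[1], bbox2[1])
    x1 = min(bbox1[2], bbox2[2])
    y1 = min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # no intersection
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if area1 <= 0:
        return 0.0  # degenerate span box
    return (x1 - x0) * (y1 - y0) / area1


def keep_span_old(span, other_block_bboxes, checked_types):
    """Old branch: only the listed types were checked; any other type was kept as-is."""
    if span["type"] in checked_types:
        return any(
            overlap_ratio_in_bbox1(span["bbox"], block_bbox) > 0.5
            for block_bbox in other_block_bboxes
        )
    return True  # old fallback `else: new_spans.append(span)`


def keep_span_new(span, other_block_bboxes):
    """New branch: every span reaching it must overlap some block by more than 0.5."""
    return any(
        overlap_ratio_in_bbox1(span["bbox"], block_bbox) > 0.5
        for block_bbox in other_block_bboxes
    )


# Illustrative data only: one block and a span of an unlisted type far outside it.
blocks = [(0, 0, 100, 100)]
stray = {"type": "unlisted", "bbox": (200, 200, 210, 210)}
print(keep_span_old(stray, blocks, {"text", "inline_equation", "interline_equation"}))
print(keep_span_new(stray, blocks))
```

Under these assumptions, a span whose type is not in the old list and that lies outside every block survived the old code path but is dropped by the new one, which matches the removal of the unconditional `else: new_spans.append(span)` fallback.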