
refactor(para): adjust right margin threshold based on block width

- Introduce a variable threshold for right margin based on block width
- Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5)
- Use 0.36 * block_weight for narrower blocks (block_weight_radio < 0.5)
- This change aims to improve paragraph splitting accuracy for different block widths
myhloli 1 year ago
parent
commit
69805f4ba9
1 file changed, 6 additions and 2 deletions

+ 6 - 2
magic_pdf/para/para_split_v3.py

@@ -121,8 +121,12 @@ def __is_list_or_index_block(block):
                 right_close_num += 1
             else:
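                 # When a line does not reach the right edge, check whether the gap is large enough; 0.3 * block width was picked as a rule-of-thumb threshold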
-                # 0.26
-                closed_area = 0.35 * block_weight
+                # Wide blocks can use a smaller threshold; narrow blocks need a larger one
+
+                if block_weight_radio >= 0.5:
+                    closed_area = 0.26 * block_weight
+                else:
+                    closed_area = 0.36 * block_weight
                 if block['bbox_fs'][2] - line['bbox'][2] > closed_area:
                     right_not_close_num += 1
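
The threshold selection in this hunk can be read in isolation as the minimal sketch below. The function names `right_margin_threshold` and `line_is_right_not_close` are hypothetical and do not appear in `para_split_v3.py`. The parameter `block_width` stands in for `block_weight`, and `width_ratio` stands in for `block_weight_radio`, whose computation is not shown in the diff and is therefore passed in as an assumption.

```python
def right_margin_threshold(block_width: float, width_ratio: float) -> float:
    """Return the minimum right-side gap that counts as 'not flush'.

    block_width mirrors block_weight in the diff; width_ratio mirrors
    block_weight_radio. How the ratio is derived is not shown in the hunk,
    so it is taken here as a plain input (assumption).
    """
    # Wide blocks (ratio >= 0.5) use the smaller factor, narrow blocks the larger one.
    factor = 0.26 if width_ratio >= 0.5 else 0.36
    return factor * block_width


def line_is_right_not_close(block_right: float, line_right: float,
                            block_width: float, width_ratio: float) -> bool:
    """Hypothetical helper: True when the line ends well short of the block's right edge."""
    gap = block_right - line_right
    return gap > right_margin_threshold(block_width, width_ratio)


if __name__ == "__main__":
    # The same 80-unit gap is judged differently depending on block width.
    print(line_is_right_not_close(500, 420, 400, 0.6))  # wide block
    print(line_is_right_not_close(300, 220, 200, 0.3))  # narrow block
```

Under these assumptions, an 80-unit gap falls below the threshold for a 400-unit-wide block with ratio 0.6 but exceeds it for a 200-unit-wide block with ratio 0.3, which is the asymmetry the commit message describes.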