refactor(pdf_parse): adjust block splitting logic for wide blocks

- Modify the logic for splitting wide blocks exceeding 0.4 page width
- Replace the fixed two-line split for blocks exceeding 0.25 page width with line-height-based splitting
- Add comments to explain the reasoning behind different splitting strategies
myhloli 1 year ago
parent
commit
4cf7e9a224
1 changed file with 3 additions and 4 deletions

+ 3 - 4
magic_pdf/pdf_parse_union_core_v2.py

@@ -208,13 +208,12 @@ def insert_lines_into_block(block_bbox, line_height, page_w, page_h):
         ):  # possibly a two-column layout, so it can be split more finely
             lines = int(block_height / line_height) + 1
         else:
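-            # If the block is wider than 0.4 of the page width, split it into 3 lines
+            # If the block is wider than 0.4 of the page width, split it into 3 lines (a complex layout; figures must not be split too finely)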
             if block_weight > page_w * 0.4:
                 line_height = (y1 - y0) / 3
                 lines = 3
-            elif block_weight > page_w * 0.25:  # otherwise split the block into two lines
-                line_height = (y1 - y0) / 2
-                lines = 2
+            elif block_weight > page_w * 0.25:  # (possibly a three-column layout, so split finely as well)
+                lines = int(block_height / line_height) + 1
             else:  # check the aspect ratio
                 if block_height / block_weight > 1.2:  # tall, narrow blocks are not split
                     return [[x0, y0, x1, y1]]
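
The post-commit behaviour of the inner `else` branch can be sketched roughly as below. This is an illustration only, not the project's code: `sketch_line_split` is a hypothetical helper, the outer condition that closes on the `):` line is not visible in the hunk and is left out, and the case after the aspect-ratio check is also not shown, so the sketch returns `None` there. The unsplit case, which the real function returns as a single bbox, is represented here as one line spanning the full block height.

```python
from typing import Optional, Tuple


def sketch_line_split(
    block_bbox: Tuple[float, float, float, float],
    line_height: float,
    page_w: float,
) -> Optional[Tuple[int, float]]:
    """Return (lines, line_height) for the branches visible in the hunk.

    Hypothetical helper: it mirrors only the inner else branch after this
    commit and returns None for the case the hunk does not show.
    """
    x0, y0, x1, y1 = block_bbox
    block_height = y1 - y0
    block_width = x1 - x0  # named block_weight in the original source

    if block_width > page_w * 0.4:
        # Wide block (complex layout): always exactly 3 lines.
        return 3, block_height / 3
    if block_width > page_w * 0.25:
        # Possibly a three-column layout: split by the given line height.
        return int(block_height / line_height) + 1, line_height
    if block_height / block_width > 1.2:
        # Tall, narrow block: left unsplit (assumed representation).
        return 1, block_height
    return None  # remaining case is not visible in the hunk


print(sketch_line_split((0, 0, 300, 90), 10, 600))
print(sketch_line_split((0, 0, 200, 95), 10, 600))
```

With a 600-unit page width, the second call falls in the 0.25 to 0.4 band. Before this commit, that band always produced 2 lines of height `(y1 - y0) / 2`; after it, the line count follows from the supplied `line_height`.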