feat(model_utils): adjust table detection threshold and add features

- Raise the threshold for treating a table as containing nested tables from 2 to 3
- Add support for custom formula delimiters through user configuration
- Pin pdfminer.six to version 20250324 to prevent parsing failures
myhloli 6 months ago
Parent
Commit
49a8f8be0a
3 changed files with 8 additions and 2 deletions
  1. README.md (+3 -0)
  2. README_zh-CN.md (+3 -0)
  3. magic_pdf/model/sub_modules/model_utils.py (+2 -2)

+ 3 - 0
README.md

@@ -48,6 +48,9 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 </div>
 
 # Changelog
+- 2025/04/29 1.3.10 Released
+  - Support for custom formula delimiters can be achieved by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
+  - Pinned `pdfminer.six` to version `20250324` to prevent parsing failures caused by new versions.
 - 2025/04/27 1.3.9 Released  
   - Optimized the formula parsing function to improve the success rate of formula rendering  
   - Updated `pdfminer.six` to the latest version, fixing some abnormal PDF parsing issues
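
The changelog entry above names the `latex-delimiter-config` item but does not show its shape. A minimal sketch of writing such an entry follows. The nested keys (`display`, `inline`, `left`, `right`) and the helper name `set_latex_delimiters` are assumptions for illustration, not confirmed by this commit, so check them against the project documentation before relying on them.

```python
import json
from pathlib import Path


def set_latex_delimiters(config_path, display=("$$", "$$"), inline=("$", "$")):
    """Write a `latex-delimiter-config` item into a magic-pdf.json file.

    Hypothetical helper: the nested key names below are assumed, only the
    top-level `latex-delimiter-config` name comes from the changelog.
    """
    path = Path(config_path)
    # Start from the existing config if the file exists, else from an empty one.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["latex-delimiter-config"] = {
        "display": {"left": display[0], "right": display[1]},
        "inline": {"left": inline[0], "right": inline[1]},
    }
    # Other items already in the file are preserved and written back unchanged.
    path.write_text(json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8")
    return config
```

Assuming the file sits directly in the home directory, a call might look like `set_latex_delimiters(Path.home() / "magic-pdf.json", display=(r"\[", r"\]"), inline=(r"\(", r"\)"))`.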

+ 3 - 0
README_zh-CN.md

@@ -47,6 +47,9 @@
 </div>
 
 # 更新记录
+- 2025/04/29 1.3.10 发布
+  - 支持使用自定义公式标识符,可通过修改用户目录下的magic-pdf.json文件中的`latex-delimiter-config`项实现。
+  - 锁定`pdfminer.six`至`20250324`版本,以避免新版本导致的解析失败问题。
 - 2025/04/27 1.3.9 发布
   - 优化公式解析功能,提升公式渲染的成功率
   - 更新`pdfminer.six`到最新版本,修复了部分pdf解析异常问题
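
Both changelogs note that `pdfminer.six` is pinned to `20250324`. One way to confirm that an environment actually honours the pin is a small runtime check. This is a sketch only: `check_pin` and its `get_version` parameter are hypothetical names introduced here, and the commit itself does not add such a check.

```python
from importlib.metadata import PackageNotFoundError, version

PINNED_PDFMINER = "20250324"  # version named in the changelog entry


def check_pin(package="pdfminer.six", expected=PINNED_PDFMINER, get_version=version):
    """Return (ok, installed); installed is None when the package is missing.

    `get_version` is injectable so the check can be exercised without
    the package being installed.
    """
    try:
        installed = get_version(package)
    except PackageNotFoundError:
        return False, None
    return installed == expected, installed
```

Calling `check_pin()` with no arguments queries the installed distribution metadata. A `False` result means the environment has drifted from the pinned version or the package is absent.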

+ 2 - 2
magic_pdf/model/sub_modules/model_utils.py

@@ -172,8 +172,8 @@ def filter_nested_tables(table_res_list, overlap_threshold=0.8, area_threshold=0
         tables_inside = [j for j in range(len(table_res_list))
                          if i != j and is_inside(table_info[j], table_info[i], overlap_threshold)]
 
-        # Continue if there are at least 2 tables inside
-        if len(tables_inside) >= 2:
+        # Continue if there are at least 3 tables inside
+        if len(tables_inside) >= 3:
             # Check if inside tables overlap with each other
             tables_overlap = any(do_overlap(table_info[tables_inside[idx1]], table_info[tables_inside[idx2]])
                                  for idx1 in range(len(tables_inside))
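
To illustrate the effect of moving the threshold from 2 to 3, here is a simplified, self-contained sketch of the counting rule. The helpers `is_inside` and `do_overlap` are re-implemented on plain `(x0, y0, x1, y1)` boxes as an assumption, since their real definitions are not part of this hunk. The sketch also assumes a table is flagged when its inner tables do not overlap, and it omits the `area_threshold` check from the function signature.

```python
def area(box):
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def is_inside(inner, outer, overlap_threshold=0.8):
    """True if at least `overlap_threshold` of `inner`'s area lies within `outer`."""
    inner_area = area(inner)
    return inner_area > 0 and area(intersection(inner, outer)) / inner_area >= overlap_threshold


def do_overlap(a, b):
    """True if the two boxes share a region of non-zero area."""
    return area(intersection(a, b)) > 0


def find_container_tables(boxes, min_inside=3, overlap_threshold=0.8):
    """Indices of boxes holding at least `min_inside` mutually non-overlapping boxes."""
    found = []
    for i, outer in enumerate(boxes):
        inside = [j for j, box in enumerate(boxes)
                  if i != j and is_inside(box, outer, overlap_threshold)]
        if len(inside) < min_inside:
            continue
        overlap = any(do_overlap(boxes[a], boxes[b])
                      for k, a in enumerate(inside) for b in inside[k + 1:])
        if not overlap:
            found.append(i)
    return found
```

With a page-sized box holding two small boxes, `min_inside=2` (the old value) flags the outer box and `min_inside=3` (the new value) does not. Adding a third inner box makes the outer box qualify again under the new threshold.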