1 year ago · 17be54972b
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@
 
 # MinerU 
 
+
 ## Introduction
 
 MinerU is a one-stop, open-source data extraction tool, primarily includes the following features:
@@ -24,8 +25,10 @@ MinerU is a one-stop, open-source data extraction tool, primarily includes the f
 - [Magic-PDF](#Magic-PDF)  PDF Document Extraction  
 - [Magic-Doc](#Magic-Doc)  Webpage & E-book Extraction
 
+
 # Magic-PDF
 
+
 ## Introduction
 
 Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
@@ -51,6 +54,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 
 ![Project Panorama](docs/images/project_panorama_en.png)
 
+
 ## Flowchart
 
 ![Flowchart](docs/images/flowchart_en.png)
@@ -62,6 +66,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
   - An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios
 
+
 ## Getting Started
 
 ### Requirements
@@ -119,18 +124,21 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 
 Demo can be referred to [demo.py](demo/demo.py)
 
+
 ## All Thanks To Our Contributors
 
 <a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
-  <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
+  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
 </a>
 
+
 ## License Information
 
 [LICENSE.md](LICENSE.md)
 
 The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
 
+
 ## Acknowledgments
 
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
@@ -139,6 +147,7 @@ The project currently leverages PyMuPDF to deliver advanced functionalities; how
 
 # Magic-Doc
 
+
 ## Introduction
 
 Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
@@ -166,6 +175,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
 
 
 
+
 ## Project Repository
 
 - [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -17,6 +17,7 @@
 
 # MinerU 
 
+
 ## Introduction
 
 MinerU is a one-stop, open-source data extraction tool that mainly includes the following features:
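@@ -26,6 +27,7 @@ MinerU is a one-stop, open-source data extraction tool that mainly includes the following features:
 
 # Magic-PDF
 
+
 ## Introduction
 
 Magic-PDF is a tool that converts PDF documents to Markdown format. It can convert local files as well as files stored on object storage that supports the S3 protocol.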
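@@ -121,12 +123,20 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 For implementation details, see [demo.py](demo/demo.py)
 
 
+## Thanks to Our Contributors
+
+<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
+  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
+</a>
+
+
 ## License Information
 
 [LICENSE.md](LICENSE.md)
 
 This project currently uses PyMuPDF to implement advanced features; however, because PyMuPDF is licensed under the AGPL, this may restrict certain use cases. In future releases, we plan to explore and switch to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.
 
+
 ## Acknowledgments
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)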
@@ -134,6 +144,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 
 # Magic-Doc
 
+
 ## Introduction
 
 Magic-Doc is a tool that converts web pages or multi-format e-books to Markdown format.
@@ -161,6 +172,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
 
 
 
+
 ## Project Repository
 
 - [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
--- a/magic_pdf/libs/language.py
+++ b/magic_pdf/libs/language.py
@@ -1,13 +1,6 @@
-import regex
 import unicodedata
 from fast_langdetect import detect_langs
 
-RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")
-
-
-def remove_bad_chars(text):
-    return RE_BAD_CHARS.sub("", text)
-
 
 def detect_lang(text: str) -> str:
     if len(text) == 0: