@@ -17,6 +17,7 @@

# MinerU

+
## Introduction

MinerU is a one-stop, open-source data extraction tool that primarily includes the following features:
@@ -24,8 +25,10 @@ MinerU is a one-stop, open-source data extraction tool, primarily includes the f
- [Magic-PDF](#Magic-PDF) PDF Document Extraction
- [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction

+
# Magic-PDF

+
## Introduction

Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage that supports the S3 protocol.
@@ -51,6 +54,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3



+
## Flowchart


@@ -62,6 +66,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
- An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios

+
## Getting Started

### Requirements
@@ -119,18 +124,21 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")

For a demo, see [demo.py](demo/demo.py)

+
## All Thanks To Our Contributors

<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
- <img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" />
+ <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>

+
## License Information

[LICENSE.md](LICENSE.md)

The project currently leverages PyMuPDF to deliver advanced functionalities; however, PyMuPDF is licensed under the AGPL, which may restrict certain use cases. In future releases, we intend to evaluate and transition to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.

+
## Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
@@ -139,6 +147,7 @@ The project currently leverages PyMuPDF to deliver advanced functionalities; how

# Magic-Doc

+
## Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into Markdown format.
@@ -166,6 +175,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7



+
## Project Repository

- [Magic-Doc](https://github.com/magicpdf/Magic-Doc)