https://github.com/opendatalab/MinerU.git


[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf) [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf) [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf) [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

MinerU

Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool. It includes the following primary features:

Magic-PDF

Introduction

Magic-PDF is a tool designed to convert PDF documents into Markdown format. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

  • Support for multiple front-end model inputs
  • Removal of headers, footers, footnotes, and page numbers
  • Human-readable layout formatting
  • Preservation of the original document's structure and formatting, including headings, paragraphs, and lists
  • Extraction and display of images and tables within markdown
  • Conversion of equations into LaTeX format
  • Automatic detection and conversion of garbled PDFs
  • Compatibility with CPU and GPU environments
  • Available for Windows, Linux, and macOS platforms

https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

Project Panorama

[Image: Project Panorama]

Flowchart

[Image: Flowchart]

Dependency repositories

Getting Started

Requirements

  • Python >= 3.9

Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable. For example:

conda create -n MinerU python=3.10
conda activate MinerU
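The interpreter requirement above can be checked from inside the new environment before installing anything. This is a minimal sketch; the helper name `meets_minimum` is ours and is not part of MinerU.

```python
import sys

# Minimum interpreter version stated in the requirements above.
MINIMUM = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```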

Installation and Configuration

1. Install Magic-PDF

1. Install dependencies

The full-feature package depends on detectron2, which must be compiled during installation.
To compile it yourself, refer to https://github.com/facebookresearch/detectron2/issues/5114
Alternatively, you can use our precompiled wheel package (Python 3.10 only):

pip install detectron2 --extra-index-url https://wheels.myhloli.com

2. Install the full-feature package with pip

Note: The pip-installed package supports CPU only and is best suited to quick tests.

For CUDA/MPS acceleration in production, see Acceleration Using CUDA or MPS.

pip install magic-pdf[full]==0.6.2b1

> ❗️❗️❗️ We have pre-released the 0.6.2 beta version, addressing numerous issues mentioned in our logs. However, this build has not undergone full QA testing and does not represent the final release quality. Should you encounter any problems, please promptly report them to us via issues or revert to using version 0.6.1.
> ```bash
> pip install magic-pdf[full-cpu]==0.6.1
> ```



#### 2. Download the model weight files

For details, see [how_to_download_models](docs/how_to_download_models_en.md).

After downloading the model weights, move the `models` directory to a disk with ample free space, preferably an SSD.


#### 3. Copy the Configuration File and Make Configurations
You can get the [magic-pdf.template.json](magic-pdf.template.json) file in the repository root directory.

```bash
cp magic-pdf.template.json ~/magic-pdf.json
```

In magic-pdf.json, configure "models-dir" to point to the directory where the model weights files are located.

```json
{
  "models-dir": "/tmp/models"
}
```
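The copy-and-edit step can also be scripted. The sketch below assumes the template is plain JSON and that only the "models-dir" key needs to change; the function name `write_config` and the example paths are illustrative, not part of Magic-PDF.

```python
import json
from pathlib import Path


def write_config(template_path, dest_path, models_dir):
    """Copy the template config to dest_path with "models-dir" set."""
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    # Store an absolute path so the setting does not depend on the working directory.
    config["models-dir"] = str(Path(models_dir).expanduser().resolve())
    Path(dest_path).write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

For example, `write_config("magic-pdf.template.json", Path.home() / "magic-pdf.json", "/tmp/models")` would produce the configuration shown above while leaving every other key from the template untouched.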



#### 4. Acceleration Using CUDA or MPS
If you have an NVIDIA GPU or a Mac with Apple Silicon, you can accelerate inference with CUDA or MPS, respectively.
##### CUDA

You need to install the PyTorch build that matches your CUDA version.  
This example installs the CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/

```bash
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
```
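The command above pins torch==2.3.1 and torchvision==0.18.1. A small check like the following can confirm that an environment has not drifted past those versions. It is a sketch only: the helper names are ours, and the comparison looks at the first three numeric components of the version string and ignores local suffixes such as `+cu118`.

```python
import re

# Versions pinned in the install command; newer releases are not supported.
MAX_SUPPORTED = {"torch": "2.3.1", "torchvision": "0.18.1"}


def release_tuple(version):
    """Turn '2.3.1+cu118' into (2, 3, 1), ignoring local build suffixes."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def within_supported(package, installed_version):
    """Return True if installed_version does not exceed the pinned maximum."""
    return release_tuple(installed_version) <= release_tuple(MAX_SUPPORTED[package])
```

To check a live environment, pass the result of `importlib.metadata.version("torch")` as `installed_version`.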

> ❗️ Make sure to specify the versions
> ```bash
> torch==2.3.1 torchvision==0.18.1
> ```
> in the command, as these are the highest versions we support. If you omit them, pip may install newer versions, which can cause the program to fail.

You also need to set "device-mode" in the configuration file magic-pdf.json:

```json
{
  "device-mode": "cuda"
}
```


##### MPS

For macOS users with M-series chip devices, you can use MPS for inference acceleration.  
You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.  

```json
{
  "device-mode": "mps"
}
```
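If you are unsure which value applies to a given machine, PyTorch can report what is available. The sketch below is an illustration, not part of Magic-PDF: `pick_device_mode` is a hypothetical helper, and the "cpu" fallback value is an assumption about what the configuration expects when no accelerator is present.

```python
def pick_device_mode():
    """Return "cuda", "mps", or "cpu" based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate with.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing in older PyTorch releases, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"  # assumed value for running without acceleration


if __name__ == "__main__":
    import json

    # Print a fragment that can be merged into magic-pdf.json by hand.
    print(json.dumps({"device-mode": pick_device_mode()}))
```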



### Usage

#### 1. Usage via Command Line

###### simple

```bash
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
```

After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".  
You can find the corresponding xxx_model.json file in the markdown directory.   
If you intend to do secondary development on the post-processing pipeline, you can use the command:  

```bash
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```

This way you don't need to re-run the model, which makes debugging more convenient.
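Both invocations can be driven from a script, for example when processing many files. The sketch below only assembles the flags shown in this section; the function names are ours, and it assumes the `magic-pdf` executable is on the PATH.

```python
import subprocess


def build_command(pdf_path, model_json_path=None):
    """Build the magic-pdf argument list for one PDF."""
    command = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        # No saved model output: run the built-in models.
        command += ["--inside_model", "true"]
    else:
        # Reuse a previously generated xxx_model.json.
        command += ["--model", str(model_json_path)]
    return command


def run(pdf_path, model_json_path=None):
    """Run magic-pdf and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(pdf_path, model_json_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.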


###### more 

```bash
magic-pdf --help
```



#### 2. Usage via API

###### Local

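```python
import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes (the PDF contents) and local_image_dir are supplied by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```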
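The local snippet assumes that `pdf_bytes` and `local_image_dir` already exist and leaves `md_content` in memory. One way to supply the inputs and persist the result is sketched below using only the standard library. The helper names and the `images` subdirectory are our own choices, and the join on a list is a defensive assumption in case `md_content` is returned as a list of strings rather than a single string.

```python
from pathlib import Path


def prepare_inputs(pdf_path, output_dir):
    """Read the PDF and create an image directory under output_dir."""
    pdf_bytes = Path(pdf_path).read_bytes()
    local_image_dir = Path(output_dir) / "images"
    local_image_dir.mkdir(parents=True, exist_ok=True)
    return pdf_bytes, str(local_image_dir)


def save_markdown(md_content, output_dir, name):
    """Write the Markdown to <output_dir>/<name>.md and return the path."""
    if isinstance(md_content, (list, tuple)):
        # Assumption: a list result holds one Markdown chunk per element.
        md_content = "\n".join(md_content)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    md_path = Path(output_dir) / f"{name}.md"
    md_path.write_text(md_content, encoding="utf-8")
    return md_path
```

Keeping the image directory inside the output directory means the relative image links produced from `os.path.basename(local_image_dir)` resolve when the Markdown file is opened from that directory.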


###### Object Storage

```python
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Separate clients for reading the PDF and for writing extracted images.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```


For a complete example, see [demo.py](demo/demo.py).


# Magic-Doc


## Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key Features Include:

- Web Page Extraction
  - Cross-modal precise parsing of text, images, tables, and formula information.

- E-Book Document Extraction
  - Supports various document formats, including epub and mobi, with full adaptation for text and images.

- Language Type Identification
  - Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca



https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d



https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2




## Project Repository

- [Magic-Doc](https://github.com/InternLM/magic-doc)
  Outstanding Webpage and E-book Extraction Tool


# All Thanks To Our Contributors

<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>


# License Information

[LICENSE.md](LICENSE.md)

The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.


# Acknowledgments

- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)


# Citation

```bibtex
@article{he2024opendatalab,
  title={Opendatalab: Empowering general artificial intelligence with open datasets},
  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
  journal={arXiv preprint arXiv:2407.13773},
  year={2024}
}

@misc{2024mineru,
  title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
  author={MinerU Contributors},
  howpublished = {\url{https://github.com/opendatalab/MinerU}},
  year={2024}
}
```

Star History

[Image: Star History Chart]

Links