https://github.com/opendatalab/MinerU.git

MinerU

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf) [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf) [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)

[English](README.md) | [简体中文](README_zh-CN.md)

MinerU: An end-to-end PDF parsing tool based on PDF-Extract-Kit, supporting conversion from PDF to Markdown.🚀🚀🚀
PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction🔥🔥🔥

👋 Join us on Discord and WeChat

Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool with the following primary features:

Magic-PDF: PDF Document Extraction
Magic-Doc: Webpage & E-book Extraction

Magic-PDF

Introduction

Magic-PDF is a tool that converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

Support for multiple front-end model inputs
Removal of headers, footers, footnotes, and page numbers
Human-readable layout formatting
Retention of the original document's structure and formatting, including headings, paragraphs, lists, and more
Extraction and display of images and tables within markdown
Conversion of equations into LaTeX format
Automatic detection and conversion of garbled PDFs
Compatibility with CPU and GPU environments
Available for Windows, Linux, and macOS platforms

https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

Project Panorama

Flowchart

Dependency repositories

PDF-Extract-Kit: A Comprehensive Toolkit for High-Quality PDF Content Extraction 🚀🚀🚀

Getting Started

Requirements

Python >= 3.9

Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable. For example:

conda create -n MinerU python=3.10
conda activate MinerU
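Once the environment is active, the interpreter version can be checked against the stated minimum. The following is a minimal sketch; the helper name `python_is_supported` is made up for this example and is not part of Magic-PDF.

```python
import sys

# Minimum version stated in the requirements above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "older than the stated minimum"
    print(f"Python {current}: {status}")
```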

Installation and Configuration

1. Install Magic-PDF

Install the full-feature package with pip:

Note: The pip-installed package supports CPU only and is ideal for quick tests.

For CUDA/MPS acceleration in production, see Acceleration Using CUDA or MPS.

pip install magic-pdf[full-cpu]

The full-feature package depends on detectron2, which must be compiled during installation.
If you need to compile it yourself, see https://github.com/facebookresearch/detectron2/issues/5114
Alternatively, you can install our precompiled wheel (Python 3.10 only):

pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
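After running the pip commands, it can be useful to confirm which distributions are actually installed. This sketch uses only the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration, and the distribution names are taken from the pip commands in this section.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names as they appear in the pip commands in this section.
    for name in ("magic-pdf", "detectron2"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```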

2. Download the Model Weight Files

For details, see how_to_download_models.

After downloading the model weights, move the 'models' directory to a disk with plenty of free space, preferably an SSD.

3. Copy and Edit the Configuration File

The magic-pdf.template.json file is located in the repository root directory.

cp magic-pdf.template.json ~/magic-pdf.json

In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.

{
  "models-dir": "/tmp/models"
}
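The copy-and-edit step can also be scripted. The sketch below reads the template, sets "models-dir", and writes the result to a target path. The function name `write_config` is hypothetical, and the code assumes the template is plain JSON, as the example above suggests.

```python
import json
from pathlib import Path


def write_config(template_path, target_path, models_dir):
    """Copy the template config and set "models-dir" to `models_dir`."""
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    config["models-dir"] = str(models_dir)  # key name taken from the example above
    Path(target_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example call (paths are placeholders; adjust them to your setup):
# write_config("magic-pdf.template.json", Path.home() / "magic-pdf.json", "/tmp/models")
```

All other keys in the template are preserved, so the script can be re-run whenever the models directory moves.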

4. Acceleration Using CUDA or MPS

If you have an Nvidia GPU or a Mac with Apple Silicon, you can accelerate inference with CUDA or MPS, respectively.

CUDA

Install the PyTorch build that matches your CUDA version.
This example installs the CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/

pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118

You also need to set "device-mode" in the configuration file magic-pdf.json.

{
  "device-mode":"cuda"
}

MPS

On a Mac with an M-series chip, you can use MPS for inference acceleration.
You also need to set "device-mode" in the configuration file magic-pdf.json.

{
  "device-mode":"mps"
}
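To choose a "device-mode" value programmatically, one option is to ask PyTorch which backends are available. This is a sketch under stated assumptions: the helper names are invented for this example, and "cpu" is assumed to be the value to use when no accelerator is available. `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are existing PyTorch calls.

```python
import json
from pathlib import Path


def detect_device_mode():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumed fallback when PyTorch is not installed
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"  # assumed value for CPU-only inference


def set_device_mode(config_path, mode):
    """Set "device-mode" in an existing JSON config file and return the config."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example call (the path is a placeholder):
# set_device_mode(Path.home() / "magic-pdf.json", detect_device_mode())
```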

Usage

1. Usage via Command Line

simple

magic-pdf pdf-command --pdf "pdf_path" --inside_model true

After the program finishes, the generated Markdown files are in the "/tmp/magic-pdf" directory.
The corresponding xxx_model.json file is in the same directory as the Markdown output.
If you plan to do secondary development on the post-processing pipeline, use this command:

magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"

This way, the model does not have to run again, which makes debugging more convenient.

more

magic-pdf --help
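The command-line calls in this section can be wrapped in a script for batch use. The sketch below builds the argument list from the flags shown here and collects the Markdown files from the output directory. The function names are hypothetical, and the default output root "/tmp/magic-pdf" is taken from the text above.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, model_json_path=None):
    """Build the magic-pdf argument list for the two invocations in this section."""
    command = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        command += ["--inside_model", "true"]  # let the tool run its own model
    else:
        command += ["--model", str(model_json_path)]  # reuse saved model output
    return command


def find_markdown(output_root="/tmp/magic-pdf"):
    """Return all Markdown files found under the output directory, sorted."""
    return sorted(Path(output_root).rglob("*.md"))


def convert(pdf_path, model_json_path=None):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero code."""
    subprocess.run(build_command(pdf_path, model_json_path), check=True)
    return find_markdown()
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.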

2. Usage via API

Local

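import os

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes: raw bytes of the PDF; local_image_dir: directory for extracted images.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))  # image path prefix used in the Markdown
jso_useful_key = {"_pdf_type": "", "model_list": []}  # no precomputed model output
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")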
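The local example ends with `md_content` in memory. A small helper can write it to disk. This is a sketch, not part of the Magic-PDF API: the name `save_markdown` is hypothetical, and the code assumes `md_content` is either a string or a list of strings.

```python
from pathlib import Path


def save_markdown(md_content, out_path):
    """Write Markdown output to `out_path` as UTF-8 and return the path."""
    if isinstance(md_content, (list, tuple)):
        # Assumption: a list result holds Markdown fragments, one per entry.
        md_content = "\n".join(md_content)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(md_content, encoding="utf-8")
    return out_path


# Example call (the output path is a placeholder):
# save_markdown(md_content, "output/demo.md")
```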

Object Storage

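from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Access keys (ak/sk), endpoints, and s3_pdf_path are placeholders to fill in.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
# Read the PDF from object storage as raw bytes.
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}  # no precomputed model output
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
# image_dir is the image path prefix used in the generated Markdown.
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")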
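The object storage example relies on `s3://` paths such as `s3://img_bucket/`. When those paths come from user input, validating them first can catch mistakes early. The helper below is an illustrative sketch built on the standard library only; `split_s3_path` is not a Magic-PDF function.

```python
from urllib.parse import urlparse


def split_s3_path(s3_path):
    """Split "s3://bucket/key" into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(s3_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// path: {s3_path!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

A bucket-only path yields an empty key, which is a convenient way to tell a directory-style prefix from a single object.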

For a complete example, see demo.py.

Magic-Doc

Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key features include:

Web Page Extraction
- Cross-modal precise parsing of text, images, tables, and formula information.
E-Book Document Extraction
- Supports various document formats, including epub and mobi, with full adaptation for text and images.
Language Type Identification
- Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca

https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d

https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2

Project Repository

Magic-Doc: Outstanding Webpage and E-book Extraction Tool

All Thanks To Our Contributors

License Information

LICENSE.md

The project currently uses PyMuPDF for advanced functionality. PyMuPDF is licensed under the AGPL, which may restrict certain use cases. In future iterations, we intend to explore and move to a more permissively licensed PDF processing library to improve flexibility for users.

Acknowledgments

Citation

@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}
