[](https://github.com/opendatalab/MinerU)
[](https://github.com/opendatalab/MinerU)
[](https://github.com/opendatalab/MinerU/issues)
[](https://github.com/opendatalab/MinerU/issues)
[](https://badge.fury.io/py/magic-pdf)
[](https://pepy.tech/project/magic-pdf)
[](https://pepy.tech/project/magic-pdf)
[](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[](https://huggingface.co/spaces/opendatalab/MinerU)
[](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[](https://arxiv.org/pdf/2409.18839?)

[English](README.md) | [็ฎไฝไธญๆ](README_zh-CN.md)
PDF-Extract-Kit: High-Quality PDF Extraction Toolkit๐ฅ๐ฅ๐ฅ
๐ join us on Discord and WeChat
# Changelog
- 2024/09/27 Version 0.8.1 released, Fixed some bugs, and providing a [localized deployment version](projects/web_demo/README.md) of the [online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) and the [front-end interface](projects/web/README.md).
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, add paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release
| Operating System |
| Ubuntu 22.04 LTS |
Windows 10 / 11 |
macOS 11+ |
| CPU |
x86_64 |
x86_64 |
x86_64 / arm64 |
| Memory |
16GB or more, recommended 32GB+ |
| Python Version |
3.10 |
| Nvidia Driver Version |
latest (Proprietary Driver) |
latest |
None |
| CUDA Environment |
Automatic installation [12.1 (pytorch) + 11.8 (paddle)] |
11.8 (manual installation) + cuDNN v8.7.0 (manual installation) |
None |
| GPU Hardware Support List |
Minimum Requirement 8G+ VRAM |
3060ti/3070/3080/3080ti/4060/4070/4070ti
8G VRAM enables layout, formula recognition acceleration and OCR acceleration |
None |
| Recommended Configuration 16G+ VRAM |
3090/3090ti/4070ti super/4080/4090
16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
|
### Online Demo
[](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[](https://huggingface.co/spaces/opendatalab/MinerU)
[](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
### Quick CPU Demo
#### 1. Install magic-pdf
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
```
#### 2. Download model weight files
Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.
#### 3. Modify the Configuration File for Additional Configuration
After completing the [2. Download model weight files](#2-download-model-weight-files) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path.
You can find the `magic-pdf.json` file in your ใuser directoryใ.
> The user directory for Windows is "C:\\Users\\username", for Linux it is "/home/username", and for macOS it is "/Users/username".
You can modify certain configurations in this file to enable or disable features, such as table recognition:
> If the following items are not present in the JSON, please manually add the required items and remove the comment content (standard JSON does not support comments).
```json
{
// other config
"table-config": {
"model": "TableMaster", // Another option of this value is 'struct_eqtable'
"is_table_recog_enable": false, // Table recognition is disabled by default, modify this value to enable it
"max_time": 400
}
}
```
### Using GPU
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system:
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help
```
## Usage
### Command Line
```bash
magic-pdf --help
Usage: magic-pdf [OPTIONS]
Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir PATH output local directory [required]
-m, --method [ocr|txt|auto] the method for parsing pdf. ocr: using ocr
technique to extract information from pdf. txt:
suitable for the text-based pdf only and
outperform ocr. auto: automatically choose the
best method for parsing pdf from ocr and txt.
without method specified, auto will be used by
default.
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url: ht
tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
/blog/multi_languages.html#5-support-languages-
and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
-s, --start INTEGER The starting page for PDF parsing, beginning
from 0.
-e, --end INTEGER The ending page for PDF parsing, beginning from
0.
--help Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
```
`{some_pdf}` can be a single PDF file or a directory containing multiple PDFs.
The results will be saved in the `{some_output_dir}` directory. The output file list is as follows:
```text
โโโ some_pdf.md # markdown file
โโโ images # directory for storing images
โโโ some_pdf_layout.pdf # layout diagram
โโโ some_pdf_middle.json # MinerU intermediate processing result
โโโ some_pdf_model.json # model inference result
โโโ some_pdf_origin.pdf # original PDF file
โโโ some_pdf_spans.pdf # smallest granularity bbox position information diagram
โโโ some_pdf_content_list.json # Rich text JSON arranged in reading order
```
For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
### API
Processing files from local disk
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Processing files from object storage
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
For detailed implementation, refer to:
- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)
### Development Guide
TODO
# TODO
- [x] Semantic-based reading order
- [ ] List recognition within the text
- [ ] Code block recognition within the text
- [ ] Table of contents recognition
- [x] Table recognition
- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition
# Known Issues
- Reading order is segmented based on rules, which can cause disordered sequences in some cases
- Vertical text is not supported
- Lists, code blocks, and table of contents are not yet supported in the layout model
- Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet
- Enabling OCR may produce better results in PDFs with a high density of formulas
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors