[](https://github.com/opendatalab/MinerU)
[](https://github.com/opendatalab/MinerU)
[](https://github.com/opendatalab/MinerU/issues)
[](https://github.com/opendatalab/MinerU/issues)
[](https://badge.fury.io/py/magic-pdf)
[](https://pepy.tech/project/magic-pdf)
[](https://pepy.tech/project/magic-pdf)
[](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[](https://huggingface.co/spaces/opendatalab/MinerU)
[](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[](#)

[English](README.md) | [简体中文](README_zh-CN.md)
PDF-Extract-Kit: High-Quality PDF Extraction Toolkit🔥🔥🔥
👋 join us on Discord and WeChat
# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, add paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release
| Operating System |
| Ubuntu 22.04 LTS |
Windows 10 / 11 |
macOS 11+ |
| CPU |
x86_64 |
x86_64 |
x86_64 / arm64 |
| Memory |
16GB or more, recommended 32GB+ |
| Python Version |
3.10 |
| Nvidia Driver Version |
latest (Proprietary Driver) |
latest |
None |
| CUDA Environment |
Automatic installation [12.1 (pytorch) + 11.8 (paddle)] |
11.8 (manual installation) + cuDNN v8.7.0 (manual installation) |
None |
| GPU Hardware Support List |
Minimum Requirement 8G+ VRAM |
3060ti/3070/3080/3080ti/4060/4070/4070ti
8G VRAM only enables layout and formula recognition acceleration |
None |
| Recommended Configuration 16G+ VRAM |
3090/3090ti/4070ti super/4080/4090
16G or more can enable layout, formula recognition, and OCR acceleration simultaneously
24G or more can enable layout, formula recognition, OCR acceleration and table recognition simultaneously
|
### Online Demo
[](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[](https://huggingface.co/spaces/opendatalab/MinerU)
[](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
### Quick CPU Demo
#### 1. Install magic-pdf
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
```
#### 2. Download model weight files
Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.
> ❗️After downloading the models, please make sure to verify the completeness of the model files.
>
> Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files.
#### 3. Copy and configure the template file
You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository.
> ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run.
>
> The user directory for Windows is `C:\Users\YourUsername`, for Linux it is `/home/YourUsername`, and for macOS it is `/Users/YourUsername`.
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files).
> ❗️Make sure to correctly configure the **absolute path** to the model weight files directory, otherwise the program will not run because it can't find the model files.
>
> On Windows, this path should include the drive letter and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences.
>
> For example: If the models are stored in the "models" directory at the root of the D drive, the "model-dir" value should be `D:/models`.
```json
{
// other config
"models-dir": "D:/models",
"table-config": {
"model": "TableMaster", // Another option of this value is 'struct_eqtable'
"is_table_recog_enable": false, // Table recognition is disabled by default, modify this value to enable it
"max_time": 400
}
}
```
### Using GPU
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system:
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
>
> ```bash
> docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
> ```
```bash
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help
```
## Usage
### Command Line
```bash
magic-pdf --help
Usage: magic-pdf [OPTIONS]
Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir TEXT output local directory
-m, --method [ocr|txt|auto] the method for parsing pdf.
ocr: using ocr technique to extract information from pdf,
txt: suitable for the text-based pdf only and outperform ocr,
auto: automatically choose the best method for parsing pdf
from ocr and txt.
without method specified, auto will be used by default.
--help Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
```
`{some_pdf}` can be a single PDF file or a directory containing multiple PDFs.
The results will be saved in the `{some_output_dir}` directory. The output file list is as follows:
```text
├── some_pdf.md # markdown file
├── images # directory for storing images
├── some_pdf_layout.pdf # layout diagram
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original PDF file
└── some_pdf_spans.pdf # smallest granularity bbox position information diagram
```
For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
### API
Processing files from local disk
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Processing files from object storage
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
For detailed implementation, refer to:
- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)
### Development Guide
TODO
# TODO
- [ ] Semantic-based reading order
- [ ] List recognition within the text
- [ ] Code block recognition within the text
- [ ] Table of contents recognition
- [x] Table recognition
- [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition
# Known Issues
- Reading order is segmented based on rules, which can cause disordered sequences in some cases
- Vertical text is not supported
- Lists, code blocks, and table of contents are not yet supported in the layout model
- Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet
- Enabling OCR may produce better results in PDFs with a high density of formulas
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors