https://github.com/opendatalab/MinerU.git

quyuan 36aee5d60c add pyopenssl 1 year ago
.github cb36408dfa update requeirements 1 year ago
assets 4d8d7ee444 add detectron2 wheel 1 year ago
demo 1e73b9fca0 fix: fasttext not support numpy>=2.0.0 1 year ago
docs 695b357994 feat(config-reader): add models-dir and device-mode configurations 1 year ago
magic_pdf b6df9b1824 feat(cli): set "full" as default model_mode for better accuracy 1 year ago
others c9c14beab3 update readme 1 year ago
tests 46bcddf4f2 disable s3 test 1 year ago
tools 4a823359f6 Merge branch 'master' of https://github.com/opendatalab/MinerU 1 year ago
.gitignore 016cde3ece fix init error 1 year ago
LICENSE.md 9fe81795bc Create LICENSE.md 1 year ago
README.md e30f6d0e56 Update README.md 1 year ago
README_zh-CN.md a2b56d4710 Update README_zh-CN.md 1 year ago
magic-pdf.template.json 695b357994 feat(config-reader): add models-dir and device-mode configurations 1 year ago
requirements-qa.txt 36aee5d60c add pyopenssl 1 year ago
requirements.txt b668386ade update: fix requirements.txt gbk error 1 year ago
setup.py d458b705aa feat(setup.py): include package data for magic_pdf.resources 1 year ago
update_version.py 7fd8d97edb fix error: version is 0.0.0 1 year ago

README.md

[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf) [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf) [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf) [English](README.md) | [简体中文](README_zh-CN.md)

MinerU

Introduction

MinerU is a one-stop, open-source, high-quality data extraction tool. It includes the following primary features:

Magic-PDF

Introduction

Magic-PDF is a tool that converts PDF documents into Markdown. It can process files stored locally or in object storage that supports the S3 protocol.

Key features include:

  • Support for multiple front-end model inputs
  • Removal of headers, footers, footnotes, and page numbers
  • Human-readable layout formatting
  • Retention of the original document's structure and formatting, including headings, paragraphs, lists, and more
  • Extraction and display of images and tables within markdown
  • Conversion of equations into LaTeX format
  • Automatic detection and conversion of garbled PDFs
  • Compatibility with CPU and GPU environments
  • Available for Windows, Linux, and macOS platforms

https://github.com/opendatalab/MinerU/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070

Project Panorama

Flowchart

Submodule Repositories

  • PDF-Extract-Kit
    • A Comprehensive Toolkit for High-Quality PDF Content Extraction

Getting Started

Requirements

  • Python >= 3.9

Usage Instructions

1. Install Magic-PDF

pip install magic-pdf

2. Usage via Command Line

simple
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"

After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
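The shell steps above can also be scripted. The following is a minimal Python sketch, not part of Magic-PDF: `install_config`, `build_command`, `find_markdown`, and `convert` are hypothetical helper names, and the `.md` suffix for the generated files is an assumption. It relies only on the command, flags, config filename, and output directory shown above.

```python
import shutil
import subprocess
from pathlib import Path


def install_config(template, home=None):
    """Copy the template config to <home>/magic-pdf.json (mirrors the `cp` step)."""
    home = Path(home) if home is not None else Path.home()
    target = home / "magic-pdf.json"
    shutil.copyfile(template, target)
    return target


def build_command(pdf_path, model_json_path):
    """Build the argument list for the CLI call shown above."""
    return ["magic-pdf", "pdf-command",
            "--pdf", str(pdf_path), "--model", str(model_json_path)]


def find_markdown(output_dir="/tmp/magic-pdf"):
    """List generated Markdown files (assumes they use the .md suffix)."""
    return sorted(Path(output_dir).rglob("*.md"))


def convert(pdf_path, model_json_path):
    """Run the CLI, raise if it fails, and return the Markdown files found."""
    subprocess.run(build_command(pdf_path, model_json_path), check=True)
    return find_markdown()
```

Passing the arguments as a list avoids shell quoting problems when a PDF path contains spaces.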

more
magic-pdf --help

3. Usage via API

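Local

import os

# Import paths follow the magic_pdf package layout; verify them against your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# pdf_bytes, model_json and local_image_dir are supplied by the caller.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")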
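
The local example assumes that `pdf_bytes`, `model_json`, and `local_image_dir` already exist. The sketch below shows one way they might be prepared. `load_inputs` and `pdf_to_markdown` are hypothetical names, the import paths are assumed from the `magic_pdf` package layout, and reading the model file as JSON is an assumption based on the `model_json_path` argument name.

```python
import json
import os
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF bytes and build the dict passed to UNIPipe."""
    pdf_bytes = Path(pdf_path).read_bytes()
    model_json = json.loads(Path(model_json_path).read_text(encoding="utf-8"))
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    return pdf_bytes, jso_useful_key


def pdf_to_markdown(pdf_path, model_json_path, local_image_dir):
    """Run the same pipeline steps as the local example."""
    # Imported lazily so load_inputs works without magic-pdf installed.
    # Assumed import paths; verify against your magic-pdf version.
    from magic_pdf.pipe.UNIPipe import UNIPipe
    from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

    pdf_bytes, jso_useful_key = load_inputs(pdf_path, model_json_path)
    image_writer = DiskReaderWriter(local_image_dir)
    image_dir = str(os.path.basename(local_image_dir))
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    pipe.pipe_parse()
    return pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

Keeping the file loading separate from the pipeline call makes the input handling easy to check on its own.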
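
Object Storage

# Import paths follow the magic_pdf package layout; verify them against your installed version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Access keys, endpoints, s3_pdf_path and model_json are supplied by the caller.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")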
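
The object-storage example refers to locations such as `s3://img_bucket/`. If such a URI needs to be split into a bucket name and an object key, for instance before passing it to another S3 client, a small standard-library helper is enough. This is an illustrative sketch: `split_s3_uri` is a hypothetical name and not part of Magic-PDF, and `pdf_bucket` is a made-up bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); the key may be empty."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3://img_bucket/"))            # ('img_bucket', '')
print(split_s3_uri("s3://pdf_bucket/docs/a.pdf"))  # ('pdf_bucket', 'docs/a.pdf')
```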

For a complete example, see demo.py.

Magic-Doc

Introduction

Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.

Key features include:

  • Web Page Extraction
    • Cross-modal precise parsing of text, images, tables, and formula information.
  • E-Book Document Extraction
    • Supports various document formats, including EPUB and MOBI, with full adaptation for text and images.
  • Language Type Identification
    • Accurate recognition of 176 languages.

https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca

https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d

https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2

Project Repository

  • Magic-Doc
    • Outstanding Webpage and E-book Extraction Tool

All Thanks To Our Contributors

License Information

LICENSE.md

The project currently uses PyMuPDF to deliver advanced functionality. However, PyMuPDF is licensed under the AGPL, which may restrict certain use cases. In future releases, we intend to explore and switch to a more permissively licensed PDF processing library to improve user-friendliness and flexibility.

Acknowledgments

Citation

@misc{2024mineru,
    title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
    author={MinerU Contributors},
    howpublished = {\url{https://github.com/opendatalab/MinerU}},
    year={2024}
}

Star History
