https://github.com/opendatalab/MinerU.git

[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF) [![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF) [![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE) [![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues) [![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues) [English](README.md) | [简体中文](README_zh-CN.md)

Magic-PDF

Introduction

Magic-PDF is a tool that converts PDF documents into Markdown. It can process files stored locally or on object storage that supports the S3 protocol.

Key features include:

  • Support for multiple front-end model inputs
  • Removal of headers, footers, footnotes, and page numbers
  • Human-readable layout formatting
  • Preservation of the original document's structure and formatting, including headings, paragraphs, and lists
  • Extraction and display of images and tables within markdown
  • Conversion of equations into LaTeX format
  • Automatic detection and conversion of garbled PDFs
  • Compatibility with CPU and GPU environments
  • Available for Windows, Linux, and macOS platforms

Project Panorama

Getting Started

Requirements

  • Python 3.9 or newer

Usage Instructions

1. Install Magic-PDF

pip install magic-pdf

2. Usage via Command Line

Simple
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
More
magic-pdf --help
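
The copy step above can also be done from Python, which may help on systems without `cp`. This is a minimal sketch and not part of Magic-PDF: the function name `install_config` and its parameters are hypothetical, and only the two file names come from the commands above. Unlike `cp`, it deliberately leaves an existing config untouched.

```python
import shutil
from pathlib import Path
from typing import Optional


def install_config(template: Path, home: Optional[Path] = None) -> Path:
    """Copy the template config to <home>/magic-pdf.json and return that path.

    Hypothetical helper mirroring `cp magic-pdf.template.json ~/magic-pdf.json`.
    An existing config is kept as-is so local edits are not overwritten.
    """
    home = Path.home() if home is None else home
    target = home / "magic-pdf.json"
    if not target.exists():
        shutil.copyfile(template, target)
    return target
```

Called as `install_config(Path("magic-pdf.template.json"))` from the repository root, it would place the config in the current user's home directory.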

3. Usage via API

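Local
import os

# UNIPipe and DiskReaderWriter are provided by the magic_pdf package.
# pdf_bytes holds the raw PDF bytes; model_json holds the front-end model output.
image_writer = DiskReaderWriter(local_image_dir)  # stores extracted images locally
image_dir = str(os.path.basename(local_image_dir))  # image directory name passed to the Markdown step
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")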
Object Storage
# S3ReaderWriter and UNIPipe are provided by the magic_pdf package.
# One client reads the PDF; a second client writes images under image_dir.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)  # read the PDF as bytes
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
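
The Local snippet above assumes `pdf_bytes` and `model_json` are already in memory. For files on disk, they might be prepared as in the sketch below. The helper names `load_inputs` and `build_jso_useful_key` are hypothetical, and the sketch assumes the model output is stored as a JSON file, as the `--model "model_json_path"` option suggests. The dictionary layout is copied from the snippets above.

```python
import json
from pathlib import Path
from typing import Any, Tuple


def load_inputs(pdf_path: str, model_json_path: str) -> Tuple[bytes, Any]:
    """Read the raw PDF bytes and the parsed model output from disk."""
    pdf_bytes = Path(pdf_path).read_bytes()
    with open(model_json_path, encoding="utf-8") as f:
        model_json = json.load(f)
    return pdf_bytes, model_json


def build_jso_useful_key(model_json: Any) -> dict:
    """Build the dict that the snippets above pass to UNIPipe."""
    return {"_pdf_type": "", "model_list": model_json}
```

The two return values of `load_inputs` then take the place of `pdf_bytes` and `model_json` in the Local example.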

For a complete example, see demo.py.

All Thanks To Our Contributors

License Information

See LICENSE.md for details.

Acknowledgments