https://github.com/opendatalab/MinerU.git
Magic-PDF is a tool that converts PDF documents to Markdown. It can process files stored locally or on object storage that supports the S3 protocol.
Key features include:
pip install magic-pdf
cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
magic-pdf --help
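The command above handles one PDF at a time. As a rough sketch of how it could be scripted for a whole directory, the helper below builds the same `magic-pdf pdf-command --pdf ... --model ...` invocation for every PDF that has a matching model JSON. The function names `build_command` and `convert_all`, and the convention that the model file for `name.pdf` is `name.json`, are assumptions made for illustration and are not part of Magic-PDF.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, model_json_path):
    """Build the argument list for one `magic-pdf pdf-command` call."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]


def convert_all(pdf_dir, model_dir, dry_run=True):
    """Pair each PDF with <stem>.json in model_dir (assumed naming convention).

    With dry_run=True the commands are only collected and returned;
    with dry_run=False each one is also executed.
    """
    commands = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        model_path = Path(model_dir) / (pdf_path.stem + ".json")
        if not model_path.exists():
            continue  # skip PDFs that have no model output
        cmd = build_command(pdf_path, model_path)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if magic-pdf exits non-zero
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_all("pdfs", "models")` first with the default `dry_run=True` lets you inspect the commands before anything is run.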
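import os

# Import paths assumed from the magic_pdf package layout; check demo.py for your version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

# Local files: pdf_bytes (raw PDF content), model_json (the parsed model JSON)
# and local_image_dir (where extracted images are written) must already be defined.
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")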
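The local snippet takes `pdf_bytes` and `model_json` as given and stops once `md_content` is produced. A minimal sketch of the surrounding file handling, using only the standard library, might look like the following. The helper names `load_inputs` and `save_markdown` are hypothetical, and the handling of a list-valued `md_content` is a defensive assumption rather than documented behaviour.

```python
import json
from pathlib import Path


def load_inputs(pdf_path, model_json_path):
    """Read the PDF as raw bytes and the model output as parsed JSON."""
    pdf_bytes = Path(pdf_path).read_bytes()
    with open(model_json_path, "r", encoding="utf-8") as f:
        model_json = json.load(f)
    return pdf_bytes, model_json


def save_markdown(md_content, out_path):
    """Write the generated Markdown to out_path, creating parent directories."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: md_content is normally a string; join it if it is a list of strings.
    if isinstance(md_content, list):
        md_content = "\n".join(md_content)
    out_path.write_text(md_content, encoding="utf-8")
    return out_path
```

With these helpers, `pdf_bytes, model_json = load_inputs("doc.pdf", "doc.json")` supplies the inputs for `UNIPipe`, and `save_markdown(md_content, "out/doc.md")` stores the result.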
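# Import paths assumed from the magic_pdf package layout; check demo.py for your version.
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter

# Object storage (S3): the access keys, secret keys and endpoints for the PDF and
# image buckets, plus s3_pdf_path and model_json, must already be defined.
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")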
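The object-storage snippet works with `s3://` paths such as `s3://img_bucket/`. When preparing or validating such paths before handing them to the reader/writer, a small helper like the one below can split them into bucket and key. This is a standard-library sketch; `split_s3_path` is a hypothetical name and is not part of Magic-PDF.

```python
from urllib.parse import urlparse


def split_s3_path(s3_path):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// path: {s3_path!r}")
    # The key is everything after the bucket, without the leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_path("s3://img_bucket/"))            # ('img_bucket', '')
print(split_s3_path("s3://pdf_bucket/docs/a.pdf"))  # ('pdf_bucket', 'docs/a.pdf')
```

Rejecting anything that is not an `s3://` path early gives a clearer error than a failed read later in the pipeline.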
For a complete example, see demo.py.
See LICENSE.md for details.