Command Line
===================

.. code:: bash

   magic-pdf --help
   Usage: magic-pdf [OPTIONS]

   Options:
     -v, --version                display the version and exit
     -p, --path PATH              local filepath or directory. support PDF, PPT,
                                  PPTX, DOC, DOCX, PNG, JPG files  [required]
     -o, --output-dir PATH        output local directory  [required]
     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
                                  technique to extract information from pdf. txt:
                                  suitable for the text-based pdf only and
                                  outperform ocr. auto: automatically choose the
                                  best method for parsing pdf from ocr and txt.
                                  without method specified, auto will be used by
                                  default.
     -l, --lang TEXT              Input the languages in the pdf (if known) to
                                  improve OCR accuracy.  Optional. You should
                                  input "Abbreviation" with language form url: ht
                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
                                  /blog/multi_languages.html#5-support-languages-
                                  and-abbreviations
     -d, --debug BOOLEAN          Enables detailed debugging information during
                                  the execution of the CLI commands.
     -s, --start INTEGER          The starting page for PDF parsing, beginning
                                  from 0.
     -e, --end INTEGER            The ending page for PDF parsing, beginning from
                                  0.
     --help                       Show this message and exit.


   ## show version
   magic-pdf -v

   ## command line example
   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

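
The options in the help output can be combined. The sketch below forces OCR, passes a language hint, and restricts parsing to a page range. The file name ``paper.pdf`` and the directory ``./output`` are hypothetical, ``en`` is assumed to be the abbreviation for English in the PaddleOCR list linked in the help text, and the help text does not say whether the end page is inclusive.

.. code:: bash

   #!/bin/sh
   # Hypothetical input and output locations; replace with your own.
   input="paper.pdf"
   output_dir="./output"

   # Build the argument list from flags shown in the help output:
   # -m ocr forces OCR, -l en hints the language (assumed abbreviation),
   # -s and -e give the 0-based start and end pages.
   set -- magic-pdf -p "$input" -o "$output_dir" -m ocr -l en -s 0 -e 4

   if command -v magic-pdf >/dev/null 2>&1; then
       "$@"    # run the assembled command
   else
       echo "magic-pdf not found; would run: $*"
   fi

Keeping the command in the positional parameters makes it easy to print it for inspection before running it.
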
.. admonition:: Important
   :class: tip

   The file name must end with one of the following suffixes:

   - ``.pdf``
   - ``.png``
   - ``.jpg``
   - ``.ppt``
   - ``.pptx``
   - ``.doc``
   - ``.docx``

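
The suffix rule above can be checked before invoking ``magic-pdf``. This is a minimal sketch, not part of the tool: ``is_supported`` is a hypothetical helper, and the match is case-sensitive because the document does not say whether upper-case suffixes such as ``.PDF`` are accepted.

.. code:: bash

   #!/bin/sh
   # Return 0 if the file name ends with one of the listed suffixes.
   # Case-sensitive by assumption; adjust if upper-case suffixes work for you.
   is_supported() {
       case "$1" in
           *.pdf|*.png|*.jpg|*.ppt|*.pptx|*.doc|*.docx) return 0 ;;
           *) return 1 ;;
       esac
   }

   # Hypothetical input directory (defaults to the current directory):
   # report every regular file whose suffix is not in the list.
   input_dir="${1:-.}"
   for f in "$input_dir"/*; do
       [ -f "$f" ] || continue
       is_supported "$f" || echo "unsupported suffix: $f"
   done

Running the script against a directory before passing it to ``-p`` shows which files would need to be renamed or converted first.
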
``{some_pdf}`` can be a single PDF file or a directory containing
multiple PDFs. The results will be saved in the ``{some_output_dir}``
directory. The output file list is as follows:

.. code:: text

   ├── some_pdf.md                          # markdown file
   ├── images                               # directory for storing images
   ├── some_pdf_layout.pdf                  # layout diagram
   ├── some_pdf_middle.json                 # MinerU intermediate processing result
   ├── some_pdf_model.json                  # model inference result
   ├── some_pdf_origin.pdf                  # original PDF file
   ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
   └── some_pdf_content_list.json           # Rich text JSON arranged in reading order

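
After a run, the listing above can be used to confirm that every expected file was written. The helpers ``expected_outputs`` and ``check_outputs`` below are hypothetical, and the sketch assumes you pass the directory that directly contains the files; how that directory is nested under ``{some_output_dir}`` is not specified here. Input names containing spaces are not handled.

.. code:: bash

   #!/bin/sh
   # Print the entries from the listing above for a given input stem.
   expected_outputs() {
       stem="$1"
       for suffix in .md _layout.pdf _middle.json _model.json \
                     _origin.pdf _spans.pdf _content_list.json; do
           echo "${stem}${suffix}"
       done
       echo "images"
   }

   # Report expected entries missing from result_dir.
   # Returns 0 if everything is present, 1 otherwise.
   check_outputs() {
       result_dir="$1"
       stem="$2"
       missing=0
       for name in $(expected_outputs "$stem"); do
           if [ ! -e "$result_dir/$name" ]; then
               echo "missing: $result_dir/$name"
               missing=1
           fi
       done
       return "$missing"
   }

For example, ``check_outputs ./output some_pdf`` prints one line per missing entry and returns a non-zero status if any are absent.
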
.. admonition:: Tip
   :class: tip

   For more information about the output files, please refer to :doc:`../inference_result` or :doc:`../pipe_result`.