Command Line
===================

.. code:: bash

   magic-pdf --help
   Usage: magic-pdf [OPTIONS]

   Options:
     -v, --version                display the version and exit
     -p, --path PATH              local pdf filepath or directory  [required]
     -o, --output-dir PATH        output local directory  [required]
     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
                                  technique to extract information from pdf. txt:
                                  suitable for the text-based pdf only and
                                  outperform ocr. auto: automatically choose the
                                  best method for parsing pdf from ocr and txt.
                                  without method specified, auto will be used by
                                  default.
     -l, --lang TEXT              Input the languages in the pdf (if known) to
                                  improve OCR accuracy. Optional. You should
                                  input "Abbreviation" with language form url: ht
                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
                                  /blog/multi_languages.html#5-support-languages-
                                  and-abbreviations
     -d, --debug BOOLEAN          Enables detailed debugging information during
                                  the execution of the CLI commands.
     -s, --start INTEGER          The starting page for PDF parsing, beginning
                                  from 0.
     -e, --end INTEGER            The ending page for PDF parsing, beginning from
                                  0.
     --help                       Show this message and exit.

   ## show version
   magic-pdf -v

   ## command line example
   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

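
The options in the help text can be combined in a single call. As a minimal sketch, the shell function below runs OCR parsing on a page range and passes ``-l`` only when a language is given. The function name ``parse_page_range`` is hypothetical and not part of ``magic-pdf``; it uses only the flags listed in the help output above. The help text does not say whether the ``-e`` page is included in the range, so check the result on a small document first.

.. code:: bash

   # Hypothetical helper, not shipped with magic-pdf.
   # $1 = PDF path, $2 = output directory,
   # $3 = start page (0-based), $4 = end page (0-based),
   # $5 = optional language abbreviation for OCR
   parse_page_range() {
       if [ -n "${5:-}" ]; then
           # Language known: forward it with -l to help OCR accuracy.
           magic-pdf -p "$1" -o "$2" -m ocr -s "$3" -e "$4" -l "$5"
       else
           # No language given: omit -l entirely.
           magic-pdf -p "$1" -o "$2" -m ocr -s "$3" -e "$4"
       fi
   }

For example, ``parse_page_range paper.pdf out 0 4 en`` would run ``magic-pdf -p paper.pdf -o out -m ocr -s 0 -e 4 -l en``. Here ``paper.pdf`` and ``out`` are placeholder names, and ``en`` is assumed to be a valid abbreviation from the PaddleOCR language list linked in the help text.
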

``{some_pdf}`` can be a single PDF file or a directory containing
multiple PDFs. The results will be saved in the ``{some_output_dir}``
directory. The output file list is as follows:

.. code:: text

   ├── some_pdf.md                          # markdown file
   ├── images                               # directory for storing images
   ├── some_pdf_layout.pdf                  # layout diagram
   ├── some_pdf_middle.json                 # MinerU intermediate processing result
   ├── some_pdf_model.json                  # model inference result
   ├── some_pdf_origin.pdf                  # original PDF file
   ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
   └── some_pdf_content_list.json           # Rich text JSON arranged in reading order

For more information about the output files, please refer to the `Output
File Description <docs/output_file_en_us.md>`__.
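
As a quick check after a run, the Markdown and content-list files from the list above can be located with ``find``. This is only a sketch: the function name ``list_outputs`` is hypothetical, and because this page does not state whether the files are written directly into ``{some_output_dir}`` or into subdirectories of it, the search is recursive.

.. code:: bash

   # Hypothetical helper, not shipped with magic-pdf.
   # $1 = output directory that was passed to magic-pdf via -o
   list_outputs() {
       # Match the Markdown result and the reading-order JSON at any depth;
       # LC_ALL=C keeps the sort order independent of the user's locale.
       find "$1" -type f \( -name '*.md' -o -name '*_content_list.json' \) | LC_ALL=C sort
   }

Calling ``list_outputs {some_output_dir}`` prints one matching path per line, which is convenient for piping into other tools.
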