.. xtuner documentation master file, created by
   sphinx-quickstart on Tue Jan 9 16:33:06 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to the MinerU Documentation
==============================================

.. figure:: ./_static/image/logo.png
   :align: center
   :alt: mineru
   :class: no-scaled-link

.. raw:: html

   <p style="text-align:center">
   <strong>A one-stop, open-source, high-quality data extraction tool
   </strong>
   </p>

   <p style="text-align:center">
   <script async defer src="https://buttons.github.io/buttons.js"></script>
   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
   </p>

Project Introduction
--------------------

MinerU is a tool that converts PDFs into machine-readable formats (e.g.,
markdown, JSON), allowing for easy extraction into any format. MinerU
was born during the pre-training process of
`InternLM <https://github.com/InternLM/InternLM>`__. We focus on solving
symbol conversion issues in scientific literature and hope to contribute
to technological development in the era of large models. Compared to
well-known commercial products, MinerU is still young. If you encounter
any problems or the results are not as expected, please submit an
`issue <https://github.com/opendatalab/MinerU/issues>`__ and **attach
the relevant PDF**.

.. video:: https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

Key Features
------------

- Remove headers, footers, footnotes, page numbers, etc., to ensure
  semantic coherence.
- Output text in human-readable order, suitable for single-column,
  multi-column, and complex layouts.
- Preserve the structure of the original document, including headings,
  paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles, and
  footnotes.
- Automatically recognize and convert formulas in the document to LaTeX
  format.
- Automatically recognize and convert tables in the document to LaTeX
  or HTML format.
- Automatically detect scanned PDFs and garbled PDFs and enable OCR
  functionality.
- OCR supports detection and recognition of 84 languages.
- Supports multiple output formats, such as multimodal and NLP
  Markdown, JSON sorted by reading order, and rich intermediate
  formats.
- Supports various visualization results, including layout
  visualization and span visualization, for efficient confirmation of
  output quality.
- Supports both CPU and GPU environments.
- Compatible with Windows, Linux, and Mac platforms.

.. tip::

   Get started with MinerU by trying the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ or :doc:`installing it locally <user_guide/install/install>`.

User Guide
-------------

.. toctree::
   :maxdepth: 2
   :caption: User Guide

   user_guide

API Reference
-------------

If you are looking for information on a specific function, class or
method, this part of the documentation is for you.

.. toctree::
   :maxdepth: 2
   :caption: API

   api

Additional Notes
------------------

.. toctree::
   :maxdepth: 1
   :caption: Additional Notes

   additional_notes/known_issues
   additional_notes/faq
   additional_notes/glossary