Dataset
===========

Import Classes
-----------------

Dataset
^^^^^^^^

Each PDF or image forms one ``Dataset``. PDFs fall into two categories, :ref:`digital_method_section` or :ref:`ocr_method_section`.
Images produce an ``ImageDataset``, which is a subclass of ``Dataset``, and PDF files produce a ``PymuDocDataset``.
The difference between ``ImageDataset`` and ``PymuDocDataset`` is that ``ImageDataset`` supports only the ``OCR`` parse method,
while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
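
The relationship described above can be pictured with a small toy model. This is only a sketch: the ``Sketch*`` classes and the ``ParseMethod`` enum below are hypothetical names made up for illustration and are not the library's classes or API. See the API reference for the real interfaces.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ParseMethod(Enum):
    # Hypothetical enum, named after the two parse methods in this document.
    OCR = "ocr"
    TXT = "txt"


class SketchDataset(ABC):
    """Toy stand-in for the base class: one input file forms one dataset."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    @abstractmethod
    def supported_methods(self) -> frozenset:
        """Return the parse methods this kind of dataset can be parsed with."""

    def supports(self, method: ParseMethod) -> bool:
        return method in self.supported_methods()


class SketchImageDataset(SketchDataset):
    """Built from images: there is no text layer, so only OCR applies."""

    def supported_methods(self) -> frozenset:
        return frozenset({ParseMethod.OCR})


class SketchPymuDocDataset(SketchDataset):
    """Built from a PDF file: both OCR and TXT may apply."""

    def supported_methods(self) -> frozenset:
        return frozenset({ParseMethod.OCR, ParseMethod.TXT})


if __name__ == "__main__":
    image_ds = SketchImageDataset(b"")
    pdf_ds = SketchPymuDocDataset(b"")
    for ds in (image_ds, pdf_ds):
        names = sorted(m.name for m in ds.supported_methods())
        print(type(ds).__name__, names)
```

Both toy subclasses share one base type, so calling code can ask any dataset which methods it supports without checking its concrete class.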

.. note::

    Some PDFs are generated from images, which means they cannot support the ``TXT`` method. Currently, it is up to the user to ensure that ``TXT`` is not used on such files.
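
One way a user might guard against this is to check whether the extracted page text is essentially empty before choosing ``TXT``. The sketch below is a hypothetical heuristic, not something the library provides: the function names and the ``min_chars_per_page`` threshold are assumptions. It takes page texts that were already extracted, for example by calling ``page.get_text()`` on each page with ``pymupdf``.

```python
from typing import Iterable


def has_text_layer(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: True if pages average enough non-whitespace characters.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    pages = list(page_texts)
    if not pages:
        return False
    # Count characters after removing all whitespace from each page.
    total_chars = sum(len("".join(text.split())) for text in pages)
    return total_chars / len(pages) >= min_chars_per_page


def pick_parse_method(page_texts: Iterable[str], min_chars_per_page: int = 20) -> str:
    """Return "txt" when a text layer seems present, otherwise "ocr"."""
    if has_text_layer(page_texts, min_chars_per_page):
        return "txt"
    return "ocr"


if __name__ == "__main__":
    scanned_like = ["", "  \n"]            # no extractable text
    digital_like = ["x" * 200, "y" * 150]  # plenty of extractable text
    print(pick_parse_method(scanned_like))
    print(pick_parse_method(digital_like))
```

A check like this only catches files with little or no text layer. A scanned PDF that carries an embedded, low-quality text layer would still pass, so the result is a hint, not a guarantee.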

PDF Parse Methods
------------------

.. _ocr_method_section:

OCR
^^^^

Extract characters via ``Optical Character Recognition``.

.. _digital_method_section:

TXT
^^^^^^^^

Extract characters via a third-party library; currently ``pymupdf`` is used.

Check :doc:`../../api/dataset` for more details.