- Dataset
- ===========
- Import Classes
- -----------------
- Dataset
- ^^^^^^^^
- Each PDF or image forms one ``Dataset``. A PDF falls into one of two categories: :ref:`digital_method_section` or :ref:`ocr_method_section`.
- Images produce an ``ImageDataset``, which is a subclass of ``Dataset``, and PDF files produce a ``PymuDocDataset``.
- The difference between ``ImageDataset`` and ``PymuDocDataset`` is that ``ImageDataset`` supports only the ``OCR`` parse method,
- while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
- .. note::
- Some PDFs are generated from images and therefore cannot be parsed with the ``TXT`` method. Currently, it is the user's responsibility to ensure that ``TXT`` is not used on such files.
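
The relationship between the dataset classes and the parse methods they support can be sketched as follows. This is a minimal, self-contained illustration and not the library's implementation: the classes below are stand-ins that only reuse the names ``Dataset``, ``ImageDataset`` and ``PymuDocDataset`` from this page, while ``ParseMethod``, ``supported_methods``, ``supports`` and ``check_method`` are hypothetical names introduced for the example.

```python
from enum import Enum


class ParseMethod(Enum):
    """Hypothetical enum for the two parse methods described on this page."""
    OCR = "ocr"
    TXT = "txt"


class Dataset:
    """Illustrative stand-in for the base class (not the real implementation)."""

    # Assumption: each subclass declares which parse methods it supports.
    supported_methods = frozenset()

    def supports(self, method: ParseMethod) -> bool:
        return method in self.supported_methods


class ImageDataset(Dataset):
    # Images carry no text layer, so only OCR applies.
    supported_methods = frozenset({ParseMethod.OCR})


class PymuDocDataset(Dataset):
    # PDF files may be parsed with either method.
    supported_methods = frozenset({ParseMethod.OCR, ParseMethod.TXT})


def check_method(dataset: Dataset, method: ParseMethod) -> ParseMethod:
    """Return the method if the dataset supports it, otherwise raise ValueError."""
    if not dataset.supports(method):
        raise ValueError(
            f"{type(dataset).__name__} does not support {method.name}"
        )
    return method


if __name__ == "__main__":
    print(check_method(PymuDocDataset(), ParseMethod.TXT).name)
    print(ImageDataset().supports(ParseMethod.TXT))
```

Note that a check of this kind only captures what a dataset type allows. It cannot detect a PDF that was generated from images, which is why that case remains the user's responsibility.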
- Pdf Parse Methods
- ------------------
- .. _ocr_method_section:
- OCR
- ^^^^
- Extract characters via ``Optical Character Recognition`` technology.
- .. _digital_method_section:
- TXT
- ^^^^^^^^
- Extract characters via a third-party library; currently we use ``pymupdf``.
- See :doc:`../../api/dataset` for more details.