Dataset 
===========


Import Classes 
-----------------

Dataset 
^^^^^^^^

Each PDF or image forms one ``Dataset``. PDFs fall into two categories, according to the parse method they need: :ref:`digital_method_section` or :ref:`ocr_method_section`.
Images yield an ``ImageDataset``, which is a subclass of ``Dataset``, while PDF files yield a ``PymuDocDataset``.
The difference between the two is that ``ImageDataset`` supports only the ``OCR`` parse method,
while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
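
The relationship between the dataset classes and the parse methods can be sketched as follows. This is only an illustrative model, not the library's implementation: ``ParseMethod``, ``SketchDataset``, ``SketchImageDataset``, ``SketchPdfDataset``, ``supported_methods`` and ``check_method`` are hypothetical names introduced here as stand-ins for ``Dataset``, ``ImageDataset`` and ``PymuDocDataset``.

.. code-block:: python

    from abc import ABC, abstractmethod
    from enum import Enum


    class ParseMethod(Enum):
        """Hypothetical enum naming the two parse methods described above."""
        OCR = "ocr"
        TXT = "txt"


    class SketchDataset(ABC):
        """Hypothetical stand-in for ``Dataset``: one instance per PDF or image."""

        def __init__(self, data: bytes) -> None:
            self.data = data

        @abstractmethod
        def supported_methods(self) -> frozenset:
            """Return the parse methods this kind of dataset can use."""

        def check_method(self, method: ParseMethod) -> None:
            """Raise ValueError if ``method`` is not supported."""
            if method not in self.supported_methods():
                raise ValueError(
                    f"{type(self).__name__} does not support {method.name}"
                )


    class SketchImageDataset(SketchDataset):
        """Stand-in for ``ImageDataset``: OCR only."""

        def supported_methods(self) -> frozenset:
            return frozenset({ParseMethod.OCR})


    class SketchPdfDataset(SketchDataset):
        """Stand-in for ``PymuDocDataset``: OCR and TXT."""

        def supported_methods(self) -> frozenset:
            return frozenset({ParseMethod.OCR, ParseMethod.TXT})

In this sketch, calling ``check_method(ParseMethod.TXT)`` on a ``SketchImageDataset`` raises ``ValueError``, which mirrors the restriction that images can only be parsed with ``OCR``.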

.. note::

    Some PDFs are generated from images, which means they cannot support the ``TXT`` method. Currently, it is the user's responsibility to make sure such files are not parsed with ``TXT``.


Pdf Parse Methods
------------------

.. _ocr_method_section:

OCR
^^^^

Extracts characters using ``Optical Character Recognition`` technology.

.. _digital_method_section:

TXT
^^^^^^^^

Extracts characters using a third-party library; currently ``pymupdf`` is used.
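
Because ``TXT`` only works when a PDF carries an embedded text layer, it can help to check for one before choosing a parse method. The sketch below is one possible heuristic and not part of the library: ``extract_page_texts``, ``has_text_layer`` and the ``min_chars_per_page`` threshold are hypothetical, and it assumes ``pymupdf`` is installed and importable as ``fitz``.

.. code-block:: python

    from typing import List


    def extract_page_texts(pdf_path: str) -> List[str]:
        """Return the embedded text of each page, read with pymupdf."""
        import fitz  # pymupdf; imported lazily so has_text_layer works without it

        with fitz.open(pdf_path) as doc:
            return [page.get_text() for page in doc]


    def has_text_layer(page_texts: List[str], min_chars_per_page: int = 20) -> bool:
        """Guess whether a PDF has a usable text layer.

        Image-only (scanned) PDFs usually yield little or no embedded text,
        so a low average character count per page suggests OCR is needed.
        The threshold is an arbitrary assumption, not a tuned value.
        """
        if not page_texts:
            return False
        total_chars = sum(len(text.strip()) for text in page_texts)
        return total_chars / len(page_texts) >= min_chars_per_page

A file for which ``has_text_layer(extract_page_texts(path))`` returns ``False`` would be a candidate for the ``OCR`` method rather than ``TXT``.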


See :doc:`../../api/dataset` for more details.