Data Reader Writer
==================

Reads or writes bytes from different media. You can implement new classes to meet the needs of your own scenarios
if MinerU does not provide a suitable class. Implementing a new class is easy; the only requirement is to inherit from
``DataReader`` or ``DataWriter``.
.. code:: python

    class SomeReader(DataReader):
        def read(self, path: str) -> bytes:
            pass

        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
            pass


    class SomeWriter(DataWriter):
        def write(self, path: str, data: bytes) -> None:
            pass

        def write_string(self, path: str, data: str) -> None:
            pass
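For instance, a minimal in-memory reader/writer pair can be built against this interface. The sketch below defines stand-in ``DataReader``/``DataWriter`` base classes so it is self-contained; in real code you would import them from ``magic_pdf.data.data_reader_writer`` instead.

```python
# Stand-ins for magic_pdf's DataReader / DataWriter base classes, reduced
# to the interface shown above so this sketch runs on its own.
class DataReader:
    def read(self, path: str) -> bytes:
        raise NotImplementedError

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        raise NotImplementedError


class DataWriter:
    def write(self, path: str, data: bytes) -> None:
        raise NotImplementedError


_store: dict = {}  # shared in-memory "medium"


class DictReader(DataReader):
    def read(self, path: str) -> bytes:
        return self.read_at(path)

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        # start at `offset`; limit == -1 means "read to the end"
        data = _store[path][offset:]
        return data if limit == -1 else data[:limit]


class DictWriter(DataWriter):
    def write(self, path: str, data: bytes) -> None:
        _store[path] = data


DictWriter().write("abc", b"hello world")
print(DictReader().read("abc"))                        # b'hello world'
print(DictReader().read_at("abc", offset=6, limit=5))  # b'world'
```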
Readers may wonder about the difference between :doc:`io` and this section, since the two look very similar at first glance.
:doc:`io` provides fundamental functions, while this section works at the application level. Users can build their own classes
to meet their applications' needs, and those classes may share the same IO functions. That is why we have :doc:`io`.
Important Classes
-----------------
.. code:: python

    class FileBasedDataReader(DataReader):
        def __init__(self, parent_dir: str = ''):
            pass


    class FileBasedDataWriter(DataWriter):
        def __init__(self, parent_dir: str = '') -> None:
            pass
Class ``FileBasedDataReader`` is initialized with a single parameter, ``parent_dir``, which means that every method
``FileBasedDataReader`` provides behaves as follows:

#. reading a file via an absolute path reads that file directly; ``parent_dir`` is ignored.
#. reading a file via a relative path first joins the path with ``parent_dir``, then reads the content from the merged path.
.. note::

    ``FileBasedDataWriter`` shares the same behavior as ``FileBasedDataReader``.
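The resolution rule can be illustrated with ``os.path.join``, whose POSIX semantics match the two features above: joining with an absolute path discards ``parent_dir``. This is an illustration of the rule, not MinerU's actual implementation.

```python
import os.path

# Illustrates the path-resolution rule only; not MinerU's internal code.
parent_dir = "/tmp"

# relative path: joined with parent_dir
print(os.path.join(parent_dir, "abc"))                   # /tmp/abc

# absolute path: parent_dir is dropped entirely
print(os.path.join(parent_dir, "/var/log/message.txt"))  # /var/log/message.txt
```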
.. code:: python

    class MultiS3Mixin:
        def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
            pass


    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
        pass
Every read-related method that class ``MultiBucketS3DataReader`` provides behaves as follows:

#. reading an object via a full s3-format path, for example ``s3://test_bucket/test_object``, ignores ``default_prefix``.
#. reading an object via a relative path first joins it with ``default_prefix`` minus its leading ``bucket_name``, then reads the content. ``bucket_name`` is the first element of the result of splitting ``default_prefix`` with the delimiter ``/``.
.. note::

    ``MultiBucketS3DataWriter`` shares the same behavior as ``MultiBucketS3DataReader``.
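The splitting rule can be sketched in plain Python. The names below are illustrative and mirror the description above, not magic_pdf's internals.

```python
# Sketch of the default_prefix splitting rule: the first "/"-separated
# element is the bucket name, the rest is the object-key prefix.
default_prefix = "test_bucket/test/unittest"

bucket_name, _, remainder = default_prefix.partition("/")
print(bucket_name)  # test_bucket

# a relative path such as "abc" then resolves to:
print(f"s3://{bucket_name}/{remainder}/abc")  # s3://test_bucket/test/unittest/abc
```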
.. code:: python

    class S3DataReader(MultiBucketS3DataReader):
        pass

``S3DataReader`` is built on top of ``MultiBucketS3DataReader`` but supports only a single bucket. The same holds for ``S3DataWriter``.
Read Examples
-------------
.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
    from magic_pdf.data.schemas import S3Config

    # file-based related
    file_based_reader1 = FileBasedDataReader('')

    ## will read file abc
    file_based_reader1.read('abc')

    file_based_reader2 = FileBasedDataReader('/tmp')

    ## will read /tmp/abc
    file_based_reader2.read('abc')

    ## will read /tmp/logs/message.txt
    file_based_reader2.read('/tmp/logs/message.txt')

    # multi-bucket s3 related
    bucket = "bucket"                  # replace with real bucket name
    ak = "ak"                          # replace with real access key
    sk = "sk"                          # replace with real secret key
    endpoint_url = "endpoint_url"      # replace with real endpoint URL

    bucket_2 = "bucket_2"              # replace with real bucket name
    ak_2 = "ak_2"                      # replace with real access key
    sk_2 = "sk_2"                      # replace with real secret key
    endpoint_url_2 = "endpoint_url_2"  # replace with real endpoint URL

    test_prefix = 'test/unittest'
    multi_bucket_s3_reader1 = MultiBucketS3DataReader(
        f"{bucket}/{test_prefix}",
        [
            S3Config(
                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
            ),
            S3Config(
                bucket_name=bucket_2,
                access_key=ak_2,
                secret_key=sk_2,
                endpoint_url=endpoint_url_2,
            ),
        ],
    )

    ## will read s3://{bucket}/{test_prefix}/abc
    multi_bucket_s3_reader1.read('abc')

    ## will read s3://{bucket}/{test_prefix}/efg
    multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')

    ## will read s3://{bucket_2}/{test_prefix}/abc
    multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')

    # s3 related
    s3_reader1 = S3DataReader(
        test_prefix,
        bucket,
        ak,
        sk,
        endpoint_url,
    )

    ## will read s3://{bucket}/{test_prefix}/abc
    s3_reader1.read('abc')

    ## will read s3://{bucket}/efg
    s3_reader1.read(f's3://{bucket}/efg')
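The ``offset``/``limit`` semantics of ``read_at``, shown in the interface earlier, can be sketched with plain file IO. The ``read_at`` helper below is illustrative only, not MinerU's implementation.

```python
import os
import tempfile

def read_at(path: str, offset: int = 0, limit: int = -1) -> bytes:
    # Illustrative version of the read_at contract: start at `offset`,
    # read `limit` bytes, or the rest of the file when limit == -1.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read() if limit == -1 else f.read(limit)

# write a small temp file, then read slices of it
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello world")

print(read_at(path, offset=6))           # b'world'
print(read_at(path, offset=0, limit=5))  # b'hello'
os.remove(path)
```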
Write Examples
--------------
.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
    from magic_pdf.data.schemas import S3Config

    # file-based related
    file_based_writer1 = FileBasedDataWriter("")

    ## will write 123 to abc
    file_based_writer1.write("abc", "123".encode())

    ## will write 123 to abc
    file_based_writer1.write_string("abc", "123")

    file_based_writer2 = FileBasedDataWriter("/tmp")

    ## will write 123 to /tmp/abc
    file_based_writer2.write_string("abc", "123")

    ## will write 123 to /tmp/logs/message.txt
    file_based_writer2.write_string("/tmp/logs/message.txt", "123")

    # multi-bucket s3 related
    bucket = "bucket"                  # replace with real bucket name
    ak = "ak"                          # replace with real access key
    sk = "sk"                          # replace with real secret key
    endpoint_url = "endpoint_url"      # replace with real endpoint URL

    bucket_2 = "bucket_2"              # replace with real bucket name
    ak_2 = "ak_2"                      # replace with real access key
    sk_2 = "sk_2"                      # replace with real secret key
    endpoint_url_2 = "endpoint_url_2"  # replace with real endpoint URL

    test_prefix = "test/unittest"
    multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
        f"{bucket}/{test_prefix}",
        [
            S3Config(
                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
            ),
            S3Config(
                bucket_name=bucket_2,
                access_key=ak_2,
                secret_key=sk_2,
                endpoint_url=endpoint_url_2,
            ),
        ],
    )

    ## will write 123 to s3://{bucket}/{test_prefix}/abc
    multi_bucket_s3_writer1.write_string("abc", "123")

    ## will write 123 to s3://{bucket}/{test_prefix}/abc
    multi_bucket_s3_writer1.write("abc", "123".encode())

    ## will write 123 to s3://{bucket}/{test_prefix}/efg
    multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())

    ## will write 123 to s3://{bucket_2}/{test_prefix}/abc
    multi_bucket_s3_writer1.write(f"s3://{bucket_2}/{test_prefix}/abc", "123".encode())

    # s3 related
    s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)

    ## will write 123 to s3://{bucket}/{test_prefix}/abc
    s3_writer1.write("abc", "123".encode())

    ## will write 123 to s3://{bucket}/{test_prefix}/abc
    s3_writer1.write_string("abc", "123")

    ## will write 123 to s3://{bucket}/efg
    s3_writer1.write(f"s3://{bucket}/efg", "123".encode())
Check :doc:`../../api/data_reader_writer` for more details.