Data Reader and Writer Classes
==============================

These classes are designed to read bytes from, or write bytes to, different media. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Implementing a new class is easy; the only requirement is to inherit from DataReader or DataWriter.
.. code:: python

    class SomeReader(DataReader):
        def read(self, path: str) -> bytes:
            pass

        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
            pass


    class SomeWriter(DataWriter):
        def write(self, path: str, data: bytes) -> None:
            pass

        def write_string(self, path: str, data: str) -> None:
            pass
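As a concrete illustration of the interface above, here is a minimal in-memory reader. It is a hypothetical sketch, not part of magic_pdf, and it deliberately skips the DataReader base class so it runs standalone; it only shows one plausible way to fill in read and read_at, with read delegating to read_at.

```python
class DictDataReader:
    """Hypothetical reader over an in-memory dict, mimicking the
    DataReader interface sketched above."""

    def __init__(self, store: dict):
        self._store = store

    def read(self, path: str) -> bytes:
        # read the whole object
        return self.read_at(path)

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        # limit == -1 means "read to the end", mirroring the signature above
        data = self._store[path]
        if limit == -1:
            return data[offset:]
        return data[offset:offset + limit]


reader = DictDataReader({'abc': b'hello world'})
print(reader.read('abc'))           # b'hello world'
print(reader.read_at('abc', 6, 5))  # b'world'
```

A real implementation would inherit from DataReader and fetch bytes from its actual medium, but the offset/limit slicing logic typically looks like this.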
You may wonder how io differs from this section; at first glance the two look very similar. io provides the basic low-level functionality, while this section is application-oriented: users can build their own classes for specific application needs, and those classes may share the same underlying IO functionality. That is why io exists as a separate layer.
Important Classes
-----------------
.. code:: python

    class FileBasedDataReader(DataReader):
        def __init__(self, parent_dir: str = ''):
            pass


    class FileBasedDataWriter(DataWriter):
        def __init__(self, parent_dir: str = '') -> None:
            pass
The class FileBasedDataReader is initialized with a single parameter, parent_dir, so every method it provides behaves as follows:

#. Reading from an absolute path ignores parent_dir and reads the file directly.
#. Reading from a relative path first joins the path with parent_dir, then reads the content from the combined path.
.. note::

    `FileBasedDataWriter` behaves the same way as `FileBasedDataReader`.
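The two rules above can be sketched as a small path-resolution helper. resolve_path is a hypothetical name, not a magic_pdf API; posixpath is used here so the illustration is deterministic and POSIX-style on any platform.

```python
import posixpath


def resolve_path(parent_dir: str, path: str) -> str:
    """Resolve a path the way FileBasedDataReader is described to:
    absolute paths ignore parent_dir; relative paths are joined with it."""
    if posixpath.isabs(path):
        return path
    return posixpath.join(parent_dir, path)


print(resolve_path('/tmp', 'abc'))                    # /tmp/abc
print(resolve_path('/tmp', '/tmp/logs/message.txt'))  # /tmp/logs/message.txt
```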
.. code:: python

    class MultiS3Mixin:
        def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
            pass


    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
        pass
All read-related methods provided by MultiBucketS3DataReader behave as follows:

#. Reading an object via a full S3-style path, such as s3://test_bucket/test_object, ignores default_prefix.
#. Reading an object via a relative path first strips bucket_name from default_prefix, joins the remainder with the path, and then reads the content from that bucket. bucket_name is the first element of default_prefix split by the delimiter /.

.. note::

    MultiBucketS3DataWriter behaves similarly to MultiBucketS3DataReader.
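Under that reading of the rules, the resolution can be sketched as follows; resolve_s3_path is a hypothetical helper for illustration, not part of magic_pdf.

```python
def resolve_s3_path(default_prefix: str, path: str) -> str:
    """Sketch of the documented rule: full s3:// paths are used as-is;
    relative paths are joined with default_prefix minus its leading
    bucket name, and read from that bucket."""
    if path.startswith('s3://'):
        return path
    # bucket_name is the first element of default_prefix split by '/'
    bucket_name, _, prefix = default_prefix.partition('/')
    key = f'{prefix}/{path}' if prefix else path
    return f's3://{bucket_name}/{key}'


print(resolve_s3_path('bucket/test/unittest', 'abc'))
# s3://bucket/test/unittest/abc
```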
.. code:: python

    class S3DataReader(MultiBucketS3DataReader):
        pass
S3DataReader builds on MultiBucketS3DataReader but supports a single bucket only. The same applies to S3DataWriter.
Read Examples
-------------
.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
    from magic_pdf.data.schemas import S3Config

    # initialize the readers
    file_based_reader1 = FileBasedDataReader('')

    ## read the local file abc
    file_based_reader1.read('abc')

    file_based_reader2 = FileBasedDataReader('/tmp')

    ## read the local file /tmp/abc
    file_based_reader2.read('abc')

    ## read the local file /tmp/logs/message.txt
    file_based_reader2.read('/tmp/logs/message.txt')

    # initialize a multi-bucket s3 reader
    bucket = "bucket"                  # replace with a valid bucket
    ak = "ak"                          # replace with a valid access key
    sk = "sk"                          # replace with a valid secret key
    endpoint_url = "endpoint_url"      # replace with a valid endpoint_url

    bucket_2 = "bucket_2"              # replace with a valid bucket
    ak_2 = "ak_2"                      # replace with a valid access key
    sk_2 = "sk_2"                      # replace with a valid secret key
    endpoint_url_2 = "endpoint_url_2"  # replace with a valid endpoint_url

    test_prefix = 'test/unittest'
    multi_bucket_s3_reader1 = MultiBucketS3DataReader(
        f"{bucket}/{test_prefix}",
        [
            S3Config(
                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
            ),
            S3Config(
                bucket_name=bucket_2,
                access_key=ak_2,
                secret_key=sk_2,
                endpoint_url=endpoint_url_2,
            ),
        ],
    )

    ## read the object s3://{bucket}/{test_prefix}/abc
    multi_bucket_s3_reader1.read('abc')

    ## read the object s3://{bucket}/{test_prefix}/efg
    multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')

    ## read the object s3://{bucket_2}/{test_prefix}/abc
    multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')

    # initialize an s3 reader
    s3_reader1 = S3DataReader(
        test_prefix,
        bucket,
        ak,
        sk,
        endpoint_url,
    )

    ## read the object s3://{bucket}/{test_prefix}/abc
    s3_reader1.read('abc')

    ## read the object s3://{bucket}/efg
    s3_reader1.read(f's3://{bucket}/efg')
Write Examples
--------------
.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
    from magic_pdf.data.schemas import S3Config

    # initialize the writers
    file_based_writer1 = FileBasedDataWriter("")

    ## write the data 123 to abc
    file_based_writer1.write("abc", "123".encode())

    ## write the data 123 to abc
    file_based_writer1.write_string("abc", "123")

    file_based_writer2 = FileBasedDataWriter("/tmp")

    ## write the data 123 to /tmp/abc
    file_based_writer2.write_string("abc", "123")

    ## write the data 123 to /tmp/logs/message.txt
    file_based_writer2.write_string("/tmp/logs/message.txt", "123")

    # initialize a multi-bucket s3 writer
    bucket = "bucket"                  # replace with a valid bucket
    ak = "ak"                          # replace with a valid access key
    sk = "sk"                          # replace with a valid secret key
    endpoint_url = "endpoint_url"      # replace with a valid endpoint_url

    bucket_2 = "bucket_2"              # replace with a valid bucket
    ak_2 = "ak_2"                      # replace with a valid access key
    sk_2 = "sk_2"                      # replace with a valid secret key
    endpoint_url_2 = "endpoint_url_2"  # replace with a valid endpoint_url

    test_prefix = "test/unittest"
    multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
        f"{bucket}/{test_prefix}",
        [
            S3Config(
                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
            ),
            S3Config(
                bucket_name=bucket_2,
                access_key=ak_2,
                secret_key=sk_2,
                endpoint_url=endpoint_url_2,
            ),
        ],
    )

    ## write the data 123 to s3://{bucket}/{test_prefix}/abc
    multi_bucket_s3_writer1.write_string("abc", "123")

    ## write the data 123 to s3://{bucket}/{test_prefix}/abc
    multi_bucket_s3_writer1.write("abc", "123".encode())

    ## write the data 123 to s3://{bucket}/{test_prefix}/efg
    multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())

    ## write the data 123 to s3://{bucket_2}/{test_prefix}/abc
    multi_bucket_s3_writer1.write(f"s3://{bucket_2}/{test_prefix}/abc", "123".encode())

    # initialize an s3 writer
    s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)

    ## write the data 123 to s3://{bucket}/{test_prefix}/abc
    s3_writer1.write("abc", "123".encode())

    ## write the data 123 to s3://{bucket}/{test_prefix}/abc
    s3_writer1.write_string("abc", "123")

    ## write the data 123 to s3://{bucket}/efg
    s3_writer1.write(f"s3://{bucket}/efg", "123".encode())