- read_api
- ==========
- Read content from a file or directory to create a ``Dataset``. Currently, we provide several functions that cover common scenarios.
- If you have a new scenario that is common to most users, you can post it on the official GitHub issues page with a detailed description.
- It is also easy to implement your own read-related functions.
- Important Functions
- -------------------
- read_jsonl
- ^^^^^^^^^^^^^^^^
- Read content from a jsonl file, which may be located on the local machine or on remote s3. To learn more about jsonl, see :doc:`../../additional_notes/glossary`.
- .. code:: python
- from magic_pdf.data.read_api import *
- from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
- from magic_pdf.data.schemas import S3Config
- # read jsonl from local machine
- datasets = read_jsonl("tt.jsonl", None) # replace with real jsonl file
- # read jsonl from remote s3
- bucket = "bucket_1" # replace with real s3 bucket
- ak = "access_key_1" # replace with real s3 access key
- sk = "secret_key_1" # replace with real s3 secret key
- endpoint_url = "endpoint_url_1" # replace with real s3 endpoint url
- bucket_2 = "bucket_2" # replace with real s3 bucket
- ak_2 = "access_key_2" # replace with real s3 access key
- sk_2 = "secret_key_2" # replace with real s3 secret key
- endpoint_url_2 = "endpoint_url_2" # replace with real s3 endpoint url
- s3configs = [
- S3Config(
- bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
- ),
- S3Config(
- bucket_name=bucket_2,
- access_key=ak_2,
- secret_key=sk_2,
- endpoint_url=endpoint_url_2,
- ),
- ]
- s3_reader = MultiBucketS3DataReader(bucket, s3configs)
- datasets = read_jsonl(f"s3://{bucket}/tt.jsonl", s3_reader) # replace with real s3 jsonl file
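For readers unfamiliar with the format, a jsonl file holds one JSON value per line. The sketch below parses such text with the standard library only. It does not use or reproduce ``read_jsonl``; the helper name ``iter_jsonl_records`` and the record fields are hypothetical and chosen purely for illustration.

```python
import json
from typing import Any, Iterator


def iter_jsonl_records(text: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of jsonl text."""
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical records; the schema that read_jsonl expects may differ.
sample = '{"id": 1, "path": "a.pdf"}\n\n{"id": 2, "path": "b.pdf"}\n'
records = list(iter_jsonl_records(sample))
print(len(records))
```

Parsing line by line means a large file never has to be loaded as a single JSON document, which is the main reason the format is used for datasets.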
- read_local_pdfs
- ^^^^^^^^^^^^^^^^^
- Read PDFs from a file path or a directory.
- .. code:: python
- from magic_pdf.data.read_api import *
- # read pdf path
- datasets = read_local_pdfs("tt.pdf")
- # read pdfs under directory
- datasets = read_local_pdfs("pdfs/")
- read_local_images
- ^^^^^^^^^^^^^^^^^^^
- Read images from a file path or a directory.
- .. code:: python
- from magic_pdf.data.read_api import *
- # read from image path
- datasets = read_local_images("tt.png") # replace with real file path
- # read files in the directory whose names end with one of the given suffixes
- datasets = read_local_images("images/", suffixes=["png", "jpg"]) # replace with real directory
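To make the suffix filtering concrete, here is a minimal standard-library sketch of selecting files in a directory by extension. It is not the implementation of ``read_local_images``: the helper name ``list_files_with_suffixes`` is hypothetical, and the case-insensitive, non-recursive matching shown here is an assumption that the real function may not share.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def list_files_with_suffixes(directory: str, suffixes: Sequence[str]) -> List[Path]:
    """Return files directly under `directory` whose extension is in `suffixes`."""
    # Normalize so that "png", ".png" and "PNG" are treated alike (an assumption).
    wanted = {s.lower().lstrip(".") for s in suffixes}
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower().lstrip(".") in wanted
    )


# Demonstrate on a throwaway directory with two images and one text file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.png", "b.jpg", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    matched = [p.name for p in list_files_with_suffixes(tmp, ["png", "jpg"])]
print(matched)
```

A helper like this could serve as the starting point for a custom read function that collects paths first and then loads each file.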
- See :doc:`../../api/read_api` for more details.