@@ -54,7 +54,7 @@ After executing the above code, you can obtain the following result:

The result shows that PP-ChatOCRv3 can extract the text information from the image and pass it to the DeepSeek-V3 large language model, which understands the question, extracts the requested information, and returns the required result.

-## 2. New Model Can Quickly Adapt to Multi-page PDF Files for Efficient Information Extraction
+## 3. New Model Can Quickly Adapt to Multi-page PDF Files for Efficient Information Extraction

In practical application scenarios, besides large numbers of image files, many document information extraction tasks involve multi-page PDF files. Since multi-page PDF files often contain a vast amount of text, passing all of it to a large language model at once not only increases the invocation cost but also reduces extraction accuracy. To address this, the PP-ChatOCRv3 pipeline integrates vector retrieval technology: it stores the text extracted from a multi-page PDF in a vector database and uses vector retrieval to select the most relevant fragments to pass to the large language model, significantly reducing the invocation cost and improving extraction accuracy. The Baidu Cloud Qianfan platform provides four vector models for building vector databases of text; for the supported models and their characteristics, refer to the vector model section of the [API List](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/Nlks5zkzu_en). Next, we will use the `embedding-v1` model to build a vector database of the text and, through vector retrieval, pass the most relevant fragments to the `DeepSeek-V3` large language model, thereby efficiently extracting key information from multi-page PDF files.
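
To make the workflow concrete, below is a minimal sketch based on the pipeline's Python API. The input file name, the key to extract, and the API keys are placeholders, and parameter names such as `retriever_config` and `chat_bot_config` should be verified against the PP-ChatOCRv3 documentation for the PaddleX version in use.

```python
from paddlex import create_pipeline

# Retriever settings: the embedding-v1 model on the Qianfan platform
# builds the vector database from the extracted text.
retriever_config = {
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "YOUR_API_KEY",  # placeholder, use your own key
}

# Chat settings: DeepSeek-V3 answers the question from the retrieved fragments.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "deepseek-v3",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "YOUR_API_KEY",  # placeholder, use your own key
}

pipeline = create_pipeline(pipeline="PP-ChatOCRv3-doc")

# Run the visual models page by page to pull the text out of the PDF.
visual_info_list = []
for res in pipeline.visual_predict(input="report.pdf"):  # hypothetical input file
    visual_info_list.append(res["visual_info"])

# Store the extracted text in a vector database.
vector_info = pipeline.build_vector(
    visual_info_list,
    flag_save_bytes_vector=True,
    retriever_config=retriever_config,
)

# Retrieve only the most relevant fragments and hand them to DeepSeek-V3.
chat_result = pipeline.chat(
    key_list=["total revenue"],  # hypothetical key to extract
    visual_info=visual_info_list,
    vector_info=vector_info,
    retriever_config=retriever_config,
    chat_bot_config=chat_bot_config,
)
print(chat_result)
```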

@@ -156,7 +156,7 @@ Total Time: 6.9693s

Comparing the two runs shows that on the first run, the PP-ChatOCRv3 pipeline extracts all text from the multi-page PDF and builds the vector database, which takes longer. On subsequent runs, the pipeline only needs to load and query the vector database, significantly reducing the overall time. Combined with vector retrieval, the PP-ChatOCRv3 pipeline effectively reduces the number of large language model calls when extracting from very long texts, achieving faster extraction and more accurate localization of key information, and providing a more efficient solution for real-world multi-page PDF information extraction scenarios.
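
Continuing the sketch above, one way to realize this reuse is to persist `vector_info` after the first run and load it on later runs. This assumes that `vector_info` built with `flag_save_bytes_vector=True` is JSON-serializable; if your PaddleX version provides dedicated save/load helpers for the vector database, prefer those.

```python
import json
import os

VECTOR_CACHE = "vector_info.json"  # hypothetical cache path

if os.path.exists(VECTOR_CACHE):
    # Subsequent runs: load the vector database built earlier and skip
    # the costly extraction and embedding step.
    with open(VECTOR_CACHE, "r", encoding="utf-8") as f:
        vector_info = json.load(f)
else:
    # First run: build the vector database from the PDF text and cache it.
    vector_info = pipeline.build_vector(
        visual_info_list,
        flag_save_bytes_vector=True,
        retriever_config=retriever_config,
    )
    with open(VECTOR_CACHE, "w", encoding="utf-8") as f:
        json.dump(vector_info, f)
```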

-## 3. Exploring the Thinking Mode of Large Models in Text and Image Information Extraction
+## 4. Exploring the Thinking Mode of Large Models in Text and Image Information Extraction

DeepSeek-R1 impresses with its exceptional text dialogue capabilities and its in-depth, step-by-step reasoning. When executing complex tasks or processing user instructions, in addition to completing the dialogue task itself, the model can also expose its thinking process while solving the problem. The PP-ChatOCRv3 pipeline adaptively supports returning the output of such thinking models: for models that can return their reasoning, PP-ChatOCRv3 exposes it through an additional `reasoning_content` output field. This field is a list containing the thinking output from each of the multiple calls PP-ChatOCRv3 makes to the large language model. By reading these traces, we can see how the model gradually derives the answer from the given text, and they can suggest concrete directions for prompt optimization. Next, we take a specific legal document information extraction task as an example, use the `DeepSeek-R1` model as the large language model called by PP-ChatOCRv3 for key information extraction, and briefly explore its thinking process.
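
As a sketch of how this might look in code, continuing the earlier setup with the model name switched to DeepSeek-R1; the key to extract and the assumption that the chat result behaves like a dict are illustrative:

```python
# Switch the chat model to DeepSeek-R1; the rest of the pipeline stays unchanged.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "deepseek-r1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "YOUR_API_KEY",  # placeholder, use your own key
}

chat_result = pipeline.chat(
    key_list=["announcement date"],  # hypothetical key for the legal document
    visual_info=visual_info_list,
    chat_bot_config=chat_bot_config,
)

# Final extraction result.
print(chat_result["chat_res"])

# reasoning_content holds one thinking trace per LLM call made by the pipeline.
for i, reasoning in enumerate(chat_result.get("reasoning_content", [])):
    print(f"--- thinking process of call {i + 1} ---")
    print(reasoning)
```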

@@ -200,7 +200,7 @@ print(chat_result)

The result shows that the `DeepSeek-R1` model not only completes the information extraction task for the question 'When was this regulation announced?' but also returns the thinking process behind its answer in the `reasoning_content` field. For example, while reasoning, the model carefully distinguishes between the regulation's publication date and its implementation date, and double-checks the returned result.

-## 4. Supporting Custom Prompt Engineering to Expand the Functional Boundaries of Large Language Models
+## 5. Supporting Custom Prompt Engineering to Expand the Functional Boundaries of Large Language Models

In document information extraction tasks, besides directly extracting key information from the text, we can also expand the functional boundaries of large language models through custom prompt engineering. For example, we can design new prompt rules that have the large language model summarize the text, helping us quickly locate the key information we need in a large body of text, or have it reason about the user's question based on the document content and offer suggestions. The PP-ChatOCRv3 pipeline already supports custom prompts, and the default prompts used by the pipeline can be found in its [configuration file](../../paddlex/configs/pipelines/PP-ChatOCRv3-doc.yaml). Following the prompt logic in the default configuration, we can customize the prompts passed to the chat interface. Below is a brief introduction to the prompt parameters related to text content: