🚀 mychen76/mistral7b_ocr_to_json_v1
The mychen76/mistral7b_ocr_to_json_v1 is a fine-tuned Large Language Model (LLM) designed to convert OCR text into JSON objects. This experimental model is based on Mistral-7B-v0.1, which outperforms Llama 2 13B on all tested benchmarks. It leverages the outputs from OCR engines to save LLM training time for image-to-text use cases, such as converting invoice or receipt images to JSON objects.
✨ Features
- OCR Text to JSON Conversion: Specialized in converting OCR text from invoice or receipt images into well-formed JSON objects.
- High-Performance Base Model: Built on Mistral-7B-v0.1, which shows better performance than Llama 2 13B on benchmarks.
- Multiple Language Support: Demonstrated usage on both English and German receipts.
📦 Installation
To load the model directly, you can use the following Python code:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mychen76/mistral7b_ocr_to_json_v1")
model = AutoModelForCausalLM.from_pretrained("mychen76/mistral7b_ocr_to_json_v1")
```
To load the model in 4-bit quantization for lower memory usage:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    llm_int8_enable_fp32_cpu_offload=True,
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Place all model components on GPU 0
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": 0,
    "transformer.h": 0,
    "transformer.ln_f": 0,
    "model.embed_tokens": 0,
    "model.layers": 0,
    "model.norm": 0,
}

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "mychen76/mistral7b_ocr_to_json_v1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    device_map=device_map,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```
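As an optional sanity check, the standard transformers helper `get_memory_footprint()` reports how much memory the quantized weights occupy; with NF4 a 7B model should come in at roughly 4-5 GB, though the exact figure depends on your hardware and library versions.

```python
# Optional: confirm the 4-bit model loaded with a reduced memory footprint.
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
```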
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("mychen76/mistral7b_ocr_to_json_v1")
model = AutoModelForCausalLM.from_pretrained("mychen76/mistral7b_ocr_to_json_v1").to(device)
receipt_boxes = """[[[[184.0, 42.0], [278.0, 45.0], [278.0, 62.0], [183.0, 59.0]], ('BAJA FRESH', 0.9551795721054077)], [[[242.0, 113.0], [379.0, 118.0], [378.0, 136.0], [242.0, 131.0]], ('GENERAL MANAGER:', 0.9462024569511414)], [[[240.0, 133.0], [300.0, 135.0], [300.0, 153.0], [240.0, 151.0]], ('NORMAN', 0.9913229942321777)], [[[143.0, 166.0], [234.0, 171.0], [233.0, 192.0], [142.0, 187.0]], ('176 Rosa C', 0.9229503870010376)], [[[130.0, 207.0], [206.0, 210.0], [205.0, 231.0], [129.0, 228.0]], ('Chk 7545', 0.9349349141120911)], [[[283.0, 215.0], [431.0, 221.0], [431.0, 239.0], [282.0, 233.0]], ("Dec26'0707:26PM", 0.9290117025375366)], [[[440.0, 221.0], [489.0, 221.0], [489.0, 239.0], [440.0, 239.0]], ('Gst0', 0.9164432883262634)], [[[164.0, 252.0], [308.0, 256.0], [308.0, 276.0], [164.0, 272.0]], ('TAKE OUT', 0.9367803335189819)], [[[145.0, 274.0], [256.0, 278.0], [255.0, 296.0], [144.0, 292.0]], ('1 BAJA STEAK', 0.9167789816856384)], [[[423.0, 282.0], [465.0, 282.0], [465.0, 304.0], [423.0, 304.0]], ('6.95', 0.9965073466300964)], [[[180.0, 296.0], [292.0, 299.0], [292.0, 319.0], [179.0, 316.0]], ('NO GUACAMOLE', 0.9631438255310059)], [[[179.0, 317.0], [319.0, 322.0], [318.0, 343.0], [178.0, 338.0]], ('ENCHILADO STYLE', 0.9704310894012451)], [[[423.0, 325.0], [467.0, 325.0], [467.0, 347.0], [423.0, 347.0]], ('1.49', 0.988395631313324)], [[[159.0, 339.0], [201.0, 341.0], [200.0, 360.0], [158.0, 358.0]], ('CASH', 0.9982023239135742)], [[[417.0, 348.0], [466.0, 348.0], [466.0, 367.0], [417.0, 367.0]], ('20.00', 0.9921982884407043)], [[[156.0, 380.0], [200.0, 382.0], [198.0, 404.0], [155.0, 402.0]], ('FOOD', 0.9906187057495117)], [[[426.0, 390.0], [468.0, 390.0], [468.0, 409.0], [426.0, 409.0]], ('8.44', 0.9963030219078064)], [[[154.0, 402.0], [190.0, 405.0], [188.0, 427.0], [152.0, 424.0]], ('TAX', 0.9963871836662292)], [[[427.0, 413.0], [468.0, 413.0], [468.0, 432.0], [427.0, 432.0]], ('0.61', 0.9934712648391724)], [[[153.0, 427.0], [224.0, 429.0], [224.0, 450.0], [153.0, 448.0]], ('PAYMENT', 0.9948703646659851)], [[[428.0, 436.0], [470.0, 436.0], [470.0, 455.0], [428.0, 455.0]], ('9.05', 0.9961490631103516)], [[[152.0, 450.0], [251.0, 453.0], [250.0, 475.0], [152.0, 472.0]], ('Change Due', 0.9556287527084351)], [[[420.0, 458.0], [471.0, 458.0], [471.0, 480.0], [420.0, 480.0]], ('10.95', 0.997236430644989)], [[[209.0, 498.0], [382.0, 503.0], [381.0, 524.0], [208.0, 519.0]], ('$2.000FF', 0.9757758378982544)], [[[169.0, 522.0], [422.0, 528.0], [421.0, 548.0], [169.0, 542.0]], ('NEXT PURCHASE', 0.962527871131897)], [[[167.0, 546.0], [365.0, 552.0], [365.0, 570.0], [167.0, 564.0]], ('CALL800 705 5754or', 0.926964521408081)], [[[146.0, 570.0], [416.0, 577.0], [415.0, 597.0], [146.0, 590.0]], ('Go www.mshare.net/bajafresh', 0.9759786128997803)], [[[147.0, 594.0], [356.0, 601.0], [356.0, 621.0], [146.0, 614.0]], ('Take our brief survey', 0.9390400648117065)], [[[143.0, 620.0], [410.0, 626.0], [409.0, 647.0], [143.0, 641.0]], ('When Prompted, Enter Store', 0.9385656118392944)], [[[142.0, 646.0], [408.0, 653.0], [407.0, 673.0], [142.0, 666.0]], ('Write down redemption code', 0.9536812901496887)], [[[141.0, 672.0], [409.0, 679.0], [408.0, 699.0], [141.0, 692.0]], ('Use this receipt as coupon', 0.9658807516098022)], [[[138.0, 697.0], [448.0, 701.0], [448.0, 725.0], [138.0, 721.0]], ('Discount on purchases of $5.00', 0.9624248743057251)], [[[139.0, 726.0], [466.0, 729.0], [466.0, 750.0], [139.0, 747.0]], ('or more,Offer expires in 30 day', 0.9263916611671448)], [[[137.0, 750.0], [459.0, 755.0], 
[459.0, 778.0], [137.0, 773.0]], ('Good at participating locations', 0.963909924030304)]]"""
prompt=f"""### Instruction:
You are POS receipt data expert, parse, detect, recognize and convert following receipt OCR image result into structure receipt data object.
Don't make up value not in the Input. Output must be a well - formed JSON object.```json
### Input:
{receipt_boxes}
### Output:
"""
with torch.inference_mode():
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    outputs = model.generate(**inputs, max_new_tokens=512)

result_text = tokenizer.batch_decode(outputs)[0]
print(result_text)
```
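Because `generate` returns the prompt followed by the completion, the JSON appears after the `### Output:` marker in the decoded text. One possible way to pull it out and parse it (the marker handling and cleanup below are assumptions, not part of the model card):

```python
import json

def extract_json(generated_text: str):
    """Best-effort extraction of the JSON object that follows '### Output:'.

    Slices the decoded text at the output marker, strips code-fence and
    end-of-sequence artifacts, and attempts to parse the remainder.
    Returns None if the text does not parse as JSON.
    """
    payload = generated_text.split("### Output:")[-1]
    payload = payload.replace("```json", "").replace("```", "").replace("</s>", "").strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None

receipt_json = extract_json(result_text)
print(json.dumps(receipt_json, indent=2) if receipt_json else "No valid JSON found")
```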
Advanced Usage
First, get the OCR image boxes:
```python
from paddleocr import PaddleOCR, draw_ocr
from ast import literal_eval
import json

paddleocr = PaddleOCR(lang="en", ocr_version="PP-OCRv4", show_log=False, use_gpu=True)

def paddle_scan(paddleocr, img_path_or_nparray):
    result = paddleocr.ocr(img_path_or_nparray, cls=True)
    result = result[0]
    boxes = [line[0] for line in result]      # bounding boxes
    txts = [line[1][0] for line in result]    # raw text
    scores = [line[1][1] for line in result]  # confidence scores
    return txts, result

# perform ocr scan (receipt_image_array holds the receipt image as a numpy array)
receipt_texts, receipt_boxes = paddle_scan(paddleocr, receipt_image_array)
print(50*"--", "\ntext only:\n", receipt_texts)
print(50*"--", "\nocr boxes:\n", receipt_boxes)
```
📚 Documentation
Model Architecture
The mychen76/mistral7b_ocr_to_json_v1 is a fine-tuned LLM based on Mistral-7B-v0.1 and optimized for converting OCR text into JSON objects.
Motivation
OCR engines are good at image detection and text recognition, while LLMs are well trained for text processing and generation. By leveraging the outputs from OCR engines, this model saves LLM training time for image-to-text use cases, such as converting invoice or receipt images to JSON objects.
Model Usage
1. Take an invoice or receipt image.
2. Perform OCR on the image to get the text boxes.
3. Feed the OCR output into the LLM to generate a well-formed receipt JSON object (see the sketch below).
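Putting these steps together, here is a compact, non-authoritative sketch of a single helper that goes from an image file to a parsed JSON object. It assumes the `paddleocr` instance and `paddle_scan` helper from the advanced example, the `model`, `tokenizer`, and `device` from the installation section, and the `extract_json` helper sketched earlier; the file path is a placeholder.

```python
import numpy as np
import torch
from PIL import Image

def image_to_receipt_json(image_path: str):
    """End-to-end sketch: receipt image file -> OCR boxes -> LLM -> parsed JSON."""
    # 1. Load the invoice or receipt image as a numpy array.
    image_array = np.array(Image.open(image_path).convert("RGB"))
    # 2. Run OCR to get the text boxes.
    _, ocr_boxes = paddle_scan(paddleocr, image_array)
    # 3. Build the same instruction prompt used in the basic example and generate.
    prompt = f"""### Instruction:
You are POS receipt data expert, parse, detect, recognize and convert following receipt OCR image result into structure receipt data object.
Don't make up value not in the Input. Output must be a well-formed JSON object.```json
### Input:
{ocr_boxes}
### Output:
"""
    with torch.inference_mode():
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
        outputs = model.generate(**inputs, max_new_tokens=512)
    return extract_json(tokenizer.batch_decode(outputs)[0])

# Example usage (placeholder path):
# receipt_json = image_to_receipt_json("receipt.jpg")
```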
Dataset
The dataset used for fine-tuning is mychen76/invoices-and-receipts_ocr_v1.
Usage Notebooks
- English Receipts:
  - model_id="mychen76/mistral7b_ocr_to_json_v1": [Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-English.ipynb](https://github.com/minyang-chen/LLM_convert_receipt_image-to-json_or_xml/blob/main/Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-English.ipynb)
  - model_id="mychen76/mistral_ocr2json_v3_chatml": [Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v2_ChatML.ipynb](https://github.com/minyang-chen/LLM_convert_receipt_image-to-json_or_xml/blob/main/Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v2_ChatML.ipynb)
- German Receipts:
  - model_id="mychen76/mistral7b_ocr_to_json_v1":
    - [Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-German-Test1-passed.ipynb](https://github.com/minyang-chen/LLM_convert_receipt_image-to-json_or_xml/blob/main/Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-German-Test1-passed.ipynb)
    - [Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-German-Test2-failed.ipynb](https://github.com/minyang-chen/LLM_convert_receipt_image-to-json_or_xml/blob/main/Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-German-Test2-failed.ipynb)
    - [Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-German-Test3-okay.ipynb](https://github.com/minyang-chen/LLM_convert_receipt_image-to-json_or_xml/blob/main/Convert_Receipt_Image-to-Json_using_OCR_to_JSON_v1-German-Test3-okay.ipynb)
Other Model Links
- Model with ChatML format: mychen76/mistral_ocr2json_v3_chatml
- Receipt-image-to-JSON vision model: [mychen76/paligemma-receipt-json-3b-mix-448-v2b](https://huggingface.co/mychen76/paligemma-receipt-json-3b-mix-448-v2b)
📄 License
This model is released under the Apache 2.0 license.

