llama-7b-v1-Receipt-Key-Extraction open-source model - free key information extraction from English and Arabic receipts

Llama 7b V1 Receipt Key Extraction

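Developed by abdoelsayed

A 7-billion-parameter model based on LLaMA v1 that extracts key information from English and Arabic receipt entries

Large language model

Transformers

Multilingual #Receipt key information extraction #Multilingual support #Retail data analysis

Downloads: 41

Release date: 9/21/2023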

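Model Overview

This 7-billion-parameter model is based on the LLaMA v1 architecture and is designed specifically to extract key information from receipt text. It supports English and Arabic.

Model Features

Multilingual support

Extracts key information from both English and Arabic receipts.

High-accuracy extraction

Extracts a range of key fields from receipts, such as category, brand, weight, and unit price.

Based on the AMuRD dataset

Trained on an annotated multilingual receipts dataset, which makes it suited to cross-lingual key information extraction.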

Model Capabilities

Text information extraction

Multilingual processing

Structured data generation
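
To make "structured data generation" concrete, the sketch below turns a free-text answer into a fixed-schema record. The field names come from the instruction string in the quick-start example; the `key: value` one-per-line response layout is an assumption, since this page does not document the model's output format, and `parse_fields` is a hypothetical helper name.

```python
from typing import Dict, Optional

# Field names taken from the instruction string in the quick-start example.
FIELDS = ["class", "Brand", "Weight", "Number of units", "Size of units",
          "Price", "T.Price", "Pack", "Unit"]


def parse_fields(response: str) -> Dict[str, Optional[str]]:
    """Parse 'key: value' lines into a dict (ASSUMED response layout)."""
    # Case-insensitive lookup from a lowercased key to its canonical name.
    lookup = {name.lower(): name for name in FIELDS}
    # Every known field is present in the result; missing ones stay None.
    record: Dict[str, Optional[str]] = {name: None for name in FIELDS}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        name = lookup.get(key.strip().lower())
        # Skip lines without a colon, unknown keys, and empty values.
        if sep and name is not None and value.strip():
            record[name] = value.strip()
    return record
```

If the real responses use a different layout, only the line-splitting logic needs to change; the fixed set of keys keeps downstream code independent of which fields the model happened to return.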

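Use Cases

Retail

Automated receipt processing

Automatically extracts product information from retail receipts.

Improves data-processing efficiency and reduces manual-entry errors.

Finance systems

Expense reimbursement automation

Automatically identifies and classifies expense items in reimbursement documents.

Simplifies the reimbursement process and improves the efficiency of financial processing.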

🚀 llama-7b-v1-Receipt-Key-Extraction

llama-7b-v1-Receipt-Key-Extraction is a 7-billion-parameter model based on LLaMA v1. It is a research-only model for extracting key information from receipt items, and it supports English and Arabic.

AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification

🚀 Quick Start

Use the following code to get started with the model.

# pip install -q transformers torch sentencepiece

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

checkpoint = "abdoelsayed/llama-7b-v1-Receipt-Key-Extraction"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(
    checkpoint,
    model_max_length=512,
    padding_side="right",
    use_fast=False,
)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

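def generate_response(instruction, input_text, max_new_tokens=100, temperature=0.1, num_beams=4, top_p=0.75, top_k=40):
    # NOTE: the original snippet used top_p without defining it; the 0.75 default is an assumption.
    # Alpaca-style prompt: instruction, input, then an empty response slot for the model to fill.
    prompt = f"Below is an instruction that describes a task, paired with an input that provides further context.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:"
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    # temperature, top_p and top_k only take effect when sampling (do_sample=True) is enabled.
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
    )
    with torch.no_grad():
        # generate() returns a tensor of token ids by default, so index it directly.
        outputs = model.generate(input_ids=input_ids, generation_config=generation_config, max_new_tokens=max_new_tokens)
    output = tokenizer.decode(outputs[0])
    # Keep only the text after the response marker and drop the end-of-sequence token.
    return output.split("### Response:")[-1].replace("</s>", "").strip()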

instruction = "Extract the class, Brand, Weight, Number of units, Size of units, Price, T.Price, Pack, Unit from the following sentence"
input_text = "Americana Okra zero 400 gm"

response = generate_response(instruction, input_text)
print(response)
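
The prompt layout and the post-processing step in the snippet above can be isolated into two small pure functions, which makes them easy to check without loading the 7B model. This is a minimal sketch: `PROMPT_TEMPLATE`, `build_prompt` and `extract_response` are hypothetical names introduced here, not part of the model's repository.

```python
# Same prompt text as the quick-start snippet, split across lines for readability.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input_text}\n\n"
    "### Response:"
)


def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the instruction/input slots of the prompt template."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input_text=input_text)


def extract_response(decoded: str) -> str:
    """Return the text after the last response marker, without BOS/EOS tokens."""
    answer = decoded.split("### Response:")[-1]
    return answer.replace("<s>", "").replace("</s>", "").strip()
```

Because the decoded sequence contains the prompt followed by the generated tokens, splitting on the last `### Response:` marker discards the echoed prompt and keeps only the model's answer.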

📄 License

This model is released under the llama2 license.

Attribute	Details
Model type	llama-7b-v1-Receipt-Key-Extraction
Evaluation metrics	Accuracy, F1
Library	transformers
Supported languages	English, Arabic
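
The table lists accuracy and F1 as evaluation metrics but does not say how they are computed. As one possible reading, the sketch below scores predicted records against gold records using exact-match accuracy per receipt item and micro-averaged F1 over (field, value) pairs. Both definitions and the name `field_scores` are assumptions made for illustration; the AMuRD paper may define the metrics differently.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def field_scores(preds: List[Record], golds: List[Record]) -> Tuple[float, float]:
    """Return (exact-match accuracy, micro-averaged F1) over paired records."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    tp = fp = fn = exact = 0
    for pred, gold in zip(preds, golds):
        # Treat each non-empty (field, value) pair as one extracted item.
        p = {(k, v) for k, v in pred.items() if v}
        g = {(k, v) for k, v in gold.items() if v}
        tp += len(p & g)   # pairs predicted correctly
        fp += len(p - g)   # pairs predicted but not in the gold record
        fn += len(g - p)   # gold pairs the prediction missed
        exact += int(p == g)
    accuracy = exact / len(golds) if golds else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

Values are compared as exact strings here, so any normalization (case, whitespace, unit spelling) would need to happen before scoring.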

🔗 Citation

To cite this model, use the following BibTeX entry.

@misc{abdallah2023amurd,
    title={AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification},
    author={Abdelrahman Abdallah and Mahmoud Abdalla and Mohamed Elkasaby and Yasser Elbendary and Adam Jatowt},
    year={2023},
    eprint={2309.09800},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}