tamil-llama-7b-instruct-v0.2 open-source model: an instruction-following model that supports both English and Tamil

Tamil Llama 7b Instruct V0.2

Developed by abhinand

A 7B-parameter Tamil instruct model based on LLaMA-2 that supports bilingual processing in English and Tamil.

Large Language Model

Transformers

Multilingual #Tamil support #Bilingual instruction following #Agricultural and cultural Q&A

Downloads: 197

Release date: 1/23/2024

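Model overview

This model marks an important step in advancing large language models for Tamil. It is ready for inference and can be fine-tuned further for specific natural language processing tasks.

Model features

Bilingual support

Handles both English and Tamil

Tamil vocabulary extension

Adds about 16,000 Tamil tokens to the original LLaMA-2 vocabulary

Instruction following

Optimized specifically for instruction-following tasks

Model capabilities

Tamil text generation

English text generation

Instruction understanding and execution

Multi-turn conversation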

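Use cases

Education

Explaining Tamil culture

Explains Tamil festivals and traditions

In the example, the model successfully explains the significance of the Pongal festival

Customer support

Bilingual customer support assistant

Provides English-Tamil bilingual support for Tamil-speaking users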

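🚀 Tamil LLaMA 7B Instruct v0.2

Welcome to the inaugural release of the Tamil LLaMA 7B instruct model, an important step in advancing large language models (LLMs) for the Tamil language. The model is ready for immediate inference and can also be fine-tuned further for specific natural language processing (NLP) tasks.

To learn more about how the model was developed and what it can do, see the research paper and the introductory blog post (in progress), which describe the development process and the model's potential impact.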

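⚠️ Important note

This model is based on the Tamil LLaMA series of models, and the GitHub repository remains the same (https://github.com/abhinand5/tamil-llama). The base models and the updated code for Tamil LLaMA v0.2, on which this model is built, will be released soon.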

💡 Tip

If you appreciate this work and would like to support its continued development, consider buying me a coffee. Your support is invaluable and much appreciated.

🚀 Quick start

To try an easy-to-use, no-code demo, open the provided Google Colab notebook. Complete usage instructions are included in the notebook.

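✨ Key features

The Tamil LLaMA models build on the original LLaMA-2 and are enhanced and specialized with an extensive Tamil vocabulary of about 16,000 tokens.

Model type: a 7B-parameter GPT-like model fine-tuned on about 500,000 samples, split equally between English and Tamil. (The dataset will be released soon.)
Language: bilingual, English and Tamil
License: GNU General Public License v3.0
Fine-tuned from model: to be released soon
Training precision: bfloat16
Code: GitHub (to be updated soon)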
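
To make the vocabulary extension and the bfloat16 precision more concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the model card: the base vocabulary size (32,000), the hidden size (4,096), the untied input and output embeddings, and the nominal 7 billion parameter count are all assumptions based on the public LLaMA-2 7B configuration, and the helper names are hypothetical.

```python
# Back-of-the-envelope estimate. Apart from ADDED_TAMIL_TOKENS, every
# constant below is an ASSUMPTION based on the public LLaMA-2 7B config,
# not a value stated in this model card.
BASE_VOCAB_SIZE = 32_000       # assumed size of the LLaMA-2 tokenizer
ADDED_TAMIL_TOKENS = 16_000    # "about 16,000 tokens" from the card
HIDDEN_SIZE = 4_096            # assumed LLaMA-2 7B hidden dimension
NOMINAL_PARAMS = 7_000_000_000 # nominal "7B" parameter count
BYTES_PER_BF16 = 2             # bfloat16 stores each weight in 2 bytes


def extra_embedding_params(added_tokens: int, hidden_size: int) -> int:
    """Parameters added by new vocabulary rows.

    Assumes two untied matrices grow: the input embedding and the
    output projection (LM head).
    """
    return 2 * added_tokens * hidden_size


def bf16_gib(num_params: int) -> float:
    """Approximate weight memory in GiB when stored in bfloat16."""
    return num_params * BYTES_PER_BF16 / 2**30


new_vocab_size = BASE_VOCAB_SIZE + ADDED_TAMIL_TOKENS
extra_params = extra_embedding_params(ADDED_TAMIL_TOKENS, HIDDEN_SIZE)

print(f"extended vocabulary size: {new_vocab_size}")
print(f"extra embedding parameters: {extra_params}")
print(f"bf16 memory for the extra rows: {bf16_gib(extra_params):.2f} GiB")
print(f"bf16 memory for ~7B weights: {bf16_gib(NOMINAL_PARAMS):.1f} GiB")
```

Under these assumptions, the added tokens account for roughly 131 million parameters, a small fraction of the model, while the full set of weights in bfloat16 needs on the order of 13 GiB before any activation or cache memory is counted.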

💻 Usage examples

Basic usage

import torch
from transformers import LlamaForCausalLM, AutoTokenizer, pipeline

model = LlamaForCausalLM.from_pretrained(
    "abhinand/tamil-llama-instruct-v0.2",
    #load_in_8bit=True, # Set this depending on the GPU you have
    torch_dtype=torch.bfloat16,
    device_map={"": 0}, # Set this depending on the number of GPUs you have
    local_files_only=False # Optional
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("abhinand/tamil-llama-instruct-v0.2")

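# Note: the "conversational" pipeline task exists in older transformers releases; newer releases may no longer provide it
inf_pipeline = pipeline("conversational", model=model, tokenizer=tokenizer)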


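def format_instruction(system_prompt, question, return_dict=False):
    """Build the chat messages, or the rendered prompt string, for one question."""
    if system_prompt is None:
        messages = [
            {'content': question, 'role': 'user'},
        ]
    else:
        messages = [
            {'content': system_prompt, 'role': 'system'},
            {'content': question, 'role': 'user'},
        ]

    # Return the raw message list (the form the pipeline call below expects)
    if return_dict:
        return messages

    # Otherwise render the messages into a single prompt string using the tokenizer's chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    return prompt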


# Set the generation configuration according to your needs
temperature = 0.6
repetition_penalty = 1.1
max_new_tokens = 256

SYSTEM_PROMPT = "You are an AI assistant who follows instructions extremely well. Do your best your best to help."
INPUT = "Can you explain the significance of Tamil festival Pongal?"

instruction = format_instruction(
    system_prompt=SYSTEM_PROMPT,
    question=INPUT,
    return_dict=True,
)

output = inf_pipeline(
    instruction,
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    repetition_penalty=repetition_penalty
)
print(output)

Example output:

Conversation id: d57cdf33-01ff-4328-8efe-5c4fefdd6e77
system: You are an AI assistant who follows instructions extremely well. Do your best your best to help.
user: Can you explain the significance of Tamil festival Pongal?
assistant: Pongal is a significant harvest festival celebrated in Tamil Nadu and other parts of southern India. It marks the end of the rainy season and beginning of the agricultural year. The festival primarily revolves around giving gratitude to nature, particularly the Sun God Surya for his bountiful gifts like agriculture and health. People offer prayers to cattle, which play a significant role in agriculture, as well as their families for their continued support during the harvest season. The festival is marked by various colorful events, including preparing traditional Pongal dishes like rice cooked with milk, sugarcane, and banana, followed by exchanging gifts and celebrating among family members and friends. It also serves as a time for unity and strengthens the bond between people in their communities.
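
The example above sets `temperature` and `repetition_penalty` without showing what they do. The following is a simplified, pure-Python sketch of how such settings are commonly applied to next-token scores. It is an illustration under stated assumptions, not the Transformers library's actual implementation: the function names and the four-token toy vocabulary are hypothetical, and in practice temperature only has an effect when sampling is enabled.

```python
import math


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Lower the scores of tokens that were already generated.

    Uses the commonly described rule: divide positive scores by the
    penalty and multiply negative scores by it.
    """
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a toy 4-token vocabulary; token 0 was already generated.
toy_logits = [2.0, 1.0, 0.5, -1.0]
seen = [0]

penalized = apply_repetition_penalty(toy_logits, seen, penalty=1.1)
probs_before = softmax_with_temperature(toy_logits, temperature=0.6)
probs_after = softmax_with_temperature(penalized, temperature=0.6)

print("without penalty:", [round(p, 3) for p in probs_before])
print("with penalty:   ", [round(p, 3) for p in probs_after])
```

With a penalty slightly above 1, a token that has already appeared becomes somewhat less likely, and a temperature below 1 concentrates probability on the highest-scoring tokens.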

📚 Documentation

Prompt template: ChatML

<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
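
For cases where the tokenizer's chat template is not at hand, the template above can be assembled by hand. The helper below is a minimal sketch: `build_chatml_prompt` is a hypothetical name, and the exact newline placement (including the trailing newline after the assistant tag) is an assumption, so treat the tokenizer's own chat template as the authoritative format.

```python
def build_chatml_prompt(question, system_message=None):
    """Render a single-turn ChatML prompt that ends where the assistant should start."""
    parts = []
    if system_message is not None:
        parts.append(f"<|im_start|>system\n{system_message}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{question}<|im_end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    "Can you explain the significance of Tamil festival Pongal?",
    system_message="You are a helpful assistant.",  # hypothetical system message
)
print(prompt)
```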

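Benchmark results

The benchmarks were run with LLM-Autoeval on an RTX 3090 on RunPod.

⚠️ Important note

Note that there are discrepancies between the Open LLM Leaderboard scores and the scores from local runs that use the LM Eval Harness with the same configuration. The results reported here are based on our own benchmarks. To reproduce them, either use LLM-Autoeval or run lm-evaluation-harness locally with the configuration described on the About page of the Open LLM Leaderboard.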

Benchmark	Llama 2 Chat	Tamil Llama v0.2 Instruct	Telugu Llama Instruct	Malayalam Llama Instruct
ARC Challenge (25-shot)	52.9	53.75	52.47	52.82
TruthfulQA (0-shot)	45.57	47.23	48.47	47.46
Hellaswag (10-shot)	78.55	76.11	76.13	76.91
Winogrande (5-shot)	71.74	73.95	71.74	73.16
AGI Eval (0-shot)	29.3	30.95	28.44	29.6
BigBench (0-shot)	32.6	33.08	32.99	33.26
Average	51.78	52.51	51.71	52.2
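
The last row of the table appears to be the unweighted mean of the six benchmark scores in each column. The short Python check below recomputes it from the values in the table; the dictionary layout and variable names are only illustrative.

```python
from statistics import mean

# Scores copied from the table, in row order:
# ARC Challenge, TruthfulQA, Hellaswag, Winogrande, AGI Eval, BigBench
scores = {
    "Llama 2 Chat": [52.9, 45.57, 78.55, 71.74, 29.3, 32.6],
    "Tamil Llama v0.2 Instruct": [53.75, 47.23, 76.11, 73.95, 30.95, 33.08],
    "Telugu Llama Instruct": [52.47, 48.47, 76.13, 71.74, 28.44, 32.99],
    "Malayalam Llama Instruct": [52.82, 47.46, 76.91, 73.16, 29.6, 33.26],
}

# Unweighted mean per model, rounded to two decimals like the table.
averages = {name: round(mean(values), 2) for name, values in scores.items()}

for name, average in averages.items():
    print(f"{name}: {average}")
```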

Model	Type	Data	Base model	Parameters	Download link
Tamil LLaMA 7B v0.1 Base	Base model	12GB	LLaMA 7B	7B	HF Hub
Tamil LLaMA 13B v0.1 Base	Base model	4GB	LLaMA 13B	13B	HF Hub
Tamil LLaMA 7B v0.1 Instruct	Instruction-following model	145k instructions	Tamil LLaMA 7B Base	7B	HF Hub
Tamil LLaMA 13B v0.1 Instruct	Instruction-following model	145k instructions	Tamil LLaMA 13B Base	13B	HF Hub
Telugu LLaMA 7B v0.1 Instruct	Instruction/chat model	420k instructions	Telugu LLaMA 7B Base v0.1	7B	HF Hub
Malayalam LLaMA 7B v0.2 Instruct	Instruction/chat model	420k instructions	Malayalam LLaMA 7B Base v0.1	7B	HF Hub

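Usage note

Note that these models have not undergone detoxification or censorship. Although they have impressive linguistic capabilities, they may generate harmful or offensive content. Users are strongly advised to exercise discretion and to monitor the model's outputs closely, especially in public-facing or sensitive applications.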

Meet the developers

Get to know the creators behind this model and follow their contributions to the field.

Abhinand Balachandran

Citation

If you use this model or any Tamil-Llama-related work in your research, please cite:

@misc{balachandran2023tamilllama,
      title={Tamil-Llama: A New Tamil Language Model Based on Llama 2}, 
      author={Abhinand Balachandran},
      year={2023},
      eprint={2311.05845},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

We hope this model becomes a valuable tool in your NLP toolkit, and we look forward to seeing the advances it enables in Tamil language understanding and generation.