Llama-3.1-8B-Instruct-Uz open-source large language model: a multitask assistant optimized for Uzbek

Llama 3.1 8B Instruct Uz

Developed by behbudiy

Llama-3.1-8B-Instruct-Uz is an instruction-tuned large language model optimized for Uzbek that supports a range of natural language processing tasks.

Large language model

Transformers

Multilingual support #Uzbek optimization #Bilingual instruction fine-tuning #Low-resource language NLP

Downloads: 130

Published: 7/31/2024

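Model overview

The model was instruction-tuned on a mix of publicly available and synthetically constructed Uzbek and English data. It is intended to support Uzbek natural language processing tasks such as machine translation, summarization, and dialogue systems.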

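Model features

Uzbek optimization

Instruction-tuned specifically for Uzbek, which improves performance on Uzbek tasks over the base model.

Multitask support

Supports multiple natural language processing tasks, including machine translation, summarization, question answering, sentiment analysis, and news classification.

High performance

Outperforms its base model on Uzbek translation, sentiment analysis, and news classification, and scores highest among the compared models on sentiment analysis and news classification.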

Model capabilities

Text generation

Machine translation

Summarization

Question answering

Sentiment analysis

News classification

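Use cases

Machine translation

Uzbek-to-English translation

Translates Uzbek text into English.

BLEU Uz-En (one-shot): 27.42; COMET (Uz-En): 85.63

English-to-Uzbek translation

Translates English text into Uzbek.

BLEU En-Uz (one-shot): 11.58; COMET (En-Uz): 86.53

Sentiment analysis

Uzbek sentiment analysis

Classifies the sentiment of Uzbek text as positive or negative.

Accuracy: 82.42

News classification

Uzbek news classification

Assigns Uzbek news articles to one of a fixed set of categories.

Accuracy: 60.84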

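🚀 LLaMA-3.1-8B-Instruct-Uz model

LLaMA-3.1-8B-Instruct-Uz is a model designed for Uzbek natural language processing tasks. It was instruction-tuned on publicly available and synthetically constructed Uzbek and English data, which extends its capabilities while preserving the knowledge of the original model. It can be applied to tasks such as machine translation, summarization, and dialogue systems.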

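🚀 Quick start

The LLaMA-3.1-8B-Instruct-Uz model can be used with the transformers library or with the original llama codebase.

Using the transformers library

Starting with transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.

Make sure to update your transformers installation with the following command: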

pip install --upgrade transformers

import transformers
import torch

model_id = "behbudiy/Llama-3.1-8B-Instruct-Uz"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Berilgan gap bo'yicha hissiyot tahlilini bajaring."},
    {"role": "user", "content": "Men bu filmni yaxshi ko'raman!"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
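The Auto classes route mentioned above can be sketched as follows. This is a minimal illustration rather than code from the model card: the helper names build_messages and generate_reply are hypothetical, and the heavy imports are placed inside the function so the message helper can be used without loading the model.

```python
def build_messages(system_prompt, user_text):
    """Return a chat in the same role/content format used by the pipeline example."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]


def generate_reply(messages, model_id="behbudiy/Llama-3.1-8B-Instruct-Uz", max_new_tokens=256):
    """Generate one assistant reply with AutoTokenizer / AutoModelForCausalLM."""
    # Imported lazily: loading torch and transformers is only needed for inference.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Format the chat with the model's own chat template and tokenize it.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling generate_reply(build_messages("Berilgan gap bo'yicha hissiyot tahlilini bajaring.", "Men bu filmni yaxshi ko'raman!")) would download the model weights on first use and return the reply as a plain string.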

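💡 Usage tip

You can find detailed recipes for running the model locally, using torch.compile(), assisted generation, quantization, and more in huggingface-llama-recipes.

Using the original `llama` codebase

Follow the instructions in the repository.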

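✨ Key features

Multilingual support: handles Uzbek and English across a range of natural language processing tasks.
Instruction fine-tuning: instruction-tuned on a mix of Uzbek and English data to strengthen the model's capabilities.
Broad applicability: suited to machine translation, summarization, dialogue systems, and other natural language processing tasks.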

📦 Model information

Property	Details
License	llama3.1
Supported languages	uz, en
Base model	models/Meta-Llama-3.1-8B-Instruct
Library	transformers
Tags	llama, text-generation-inference, summarization, translation, question-answering
Datasets	yahma/alpaca-cleaned, behbudiy/alpaca-cleaned-uz, behbudiy/translation-instruction
Metrics	bleu, comet, accuracy
Task type	text-generation

📊 Performance comparison

Model	BLEU Uz-En (one-shot)	BLEU En-Uz (one-shot)	COMET (Uz-En)	COMET (En-Uz)	Uzbek sentiment analysis	Uzbek news classification	MMLU (English, 5-shot)
Llama-3.1 8B Instruct	23.74	6.72	84.30	82.70	68.96	55.41	65.77
Llama-3.1 8B Instruct Uz	27.42	11.58	85.63	86.53	82.42	60.84	62.78
Mistral 7B Instruct	7.47	0.67	68.14	45.58	62.02	47.52	61.07
Mistral 7B Instruct Uz	29.39	16.77	86.91	88.75	79.13	59.38	55.72
Mistral Nemo Instruct	25.68	9.79	85.56	85.04	72.47	49.24	67.62
Mistral Nemo Instruct Uz	30.49	15.52	87.04	88.01	82.05	58.2	67.36
Google Translate	41.18	22.98	89.16	90.67	—	—	—

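The results show that the Uzbek-optimized models consistently outperform their base versions on the translation benchmarks (BLEU and COMET), Uzbek sentiment analysis, and Uzbek news classification. On MMLU, which measures general multitask language understanding in English, the fine-tuned models stay close to their base versions, with drops of 0.26 points (Mistral Nemo), 2.99 points (Llama-3.1 8B), and 5.35 points (Mistral 7B).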
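The per-metric gains and losses can be recomputed directly from the table. The snippet below is a small sketch that copies the table values into two dictionaries (the short metric names and the deltas helper are illustrative, not part of the model card) and subtracts each base score from the corresponding Uz score.

```python
METRICS = ["BLEU uz-en", "BLEU en-uz", "COMET uz-en", "COMET en-uz",
           "Sentiment", "News", "MMLU"]

# Scores of the base models, copied from the table.
BASE = {
    "Llama-3.1 8B Instruct": [23.74, 6.72, 84.30, 82.70, 68.96, 55.41, 65.77],
    "Mistral 7B Instruct": [7.47, 0.67, 68.14, 45.58, 62.02, 47.52, 61.07],
    "Mistral Nemo Instruct": [25.68, 9.79, 85.56, 85.04, 72.47, 49.24, 67.62],
}

# Scores of the corresponding Uz fine-tuned models, copied from the table.
UZ = {
    "Llama-3.1 8B Instruct": [27.42, 11.58, 85.63, 86.53, 82.42, 60.84, 62.78],
    "Mistral 7B Instruct": [29.39, 16.77, 86.91, 88.75, 79.13, 59.38, 55.72],
    "Mistral Nemo Instruct": [30.49, 15.52, 87.04, 88.01, 82.05, 58.2, 67.36],
}


def deltas(model):
    """Return Uz score minus base score for every metric, rounded to 2 decimals."""
    return {m: round(u - b, 2) for m, b, u in zip(METRICS, BASE[model], UZ[model])}


for model in BASE:
    print(model, deltas(model))
```

Every delta except MMLU is positive for all three model families, which is the pattern the paragraph above describes.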

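🔧 Technical details

Development team

Evaluation method

Translation task

Evaluated on the FLORES+ Uz-En / En-Uz datasets, with the dev and test sets merged to create a larger evaluation set. The following prompt was used for one-shot Uz-En evaluation (for En-Uz evaluation, swap the positions of "English" and "Uzbek"):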

prompt = f'''You are a professional Uzbek-English translator. Your task is to accurately translate the given Uzbek text into English.

Instructions:
1. Translate the text from Uzbek to English.
2. Maintain the original meaning and tone.
3. Use appropriate English grammar and vocabulary.
4. If you encounter an ambiguous or unfamiliar word, provide the most likely translation based on context.
5. Output only the English translation, without any additional comments.

Example:
Uzbek: "Bugun ob-havo juda yaxshi, quyosh charaqlab turibdi."
English: "The weather is very nice today, the sun is shining brightly."

Now, please translate the following Uzbek text into English:
"{sentence}"
    '''
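One way to cover both translation directions with a single template is to parameterize the language names. The sketch below is an interpretation of "swap the positions of English and Uzbek", not the authors' code: the function name build_translation_prompt and the EXAMPLE dictionary are hypothetical, and it assumes the one-shot example pair is swapped along with the language names. For the default direction it reproduces the prompt above apart from trailing whitespace.

```python
# The one-shot example pair from the prompt above, keyed by language name.
EXAMPLE = {
    "Uzbek": "Bugun ob-havo juda yaxshi, quyosh charaqlab turibdi.",
    "English": "The weather is very nice today, the sun is shining brightly.",
}


def build_translation_prompt(sentence, source="Uzbek", target="English"):
    """Fill the one-shot translation prompt for the given direction."""
    return f'''You are a professional {source}-{target} translator. Your task is to accurately translate the given {source} text into {target}.

Instructions:
1. Translate the text from {source} to {target}.
2. Maintain the original meaning and tone.
3. Use appropriate {target} grammar and vocabulary.
4. If you encounter an ambiguous or unfamiliar word, provide the most likely translation based on context.
5. Output only the {target} translation, without any additional comments.

Example:
{source}: "{EXAMPLE[source]}"
{target}: "{EXAMPLE[target]}"

Now, please translate the following {source} text into {target}:
"{sentence}"
'''


# Uzbek-to-English is the default; pass the names the other way round for English-to-Uzbek.
print(build_translation_prompt("Hello, how are you?", source="English", target="Uzbek"))
```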

Uzbek sentiment analysis

Evaluated on the risqaliyevds/uzbek-sentiment-analysis dataset, with binary labels (0: negative, 1: positive) created using the GPT-4o API. The following prompt was used:

prompt = f'''Given the following text, determine the sentiment as either 'Positive' or 'Negative.' Respond with only the word 'Positive' or 'Negative' without any additional text or explanation.

Text: {text}"
'''
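The model card does not say how replies were mapped back to the 0/1 labels. A plausible scoring step, shown here purely as an assumption, normalizes the reply, maps 'Negative' to 0 and 'Positive' to 1, and counts anything else as wrong. The names parse_sentiment and accuracy are hypothetical, and the replies in the usage lines are made up for illustration.

```python
LABELS = {"negative": 0, "positive": 1}


def parse_sentiment(output):
    """Map a model reply to 0 or 1, or None if it starts with neither label word."""
    cleaned = output.strip().strip("'\".").lower()
    for word, label in LABELS.items():
        if cleaned.startswith(word):
            return label
    return None


def accuracy(predictions, gold):
    """Percentage of predictions equal to the gold label (None never matches)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Made-up replies, only to show the flow from raw text to a score.
replies = ["Positive", " 'Negative.' ", "I am not sure", "Positive"]
gold = [1, 0, 1, 0]
print(accuracy([parse_sentiment(r) for r in replies], gold))
```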

Uzbek news classification

Using the risqaliyevds/uzbek-zero-shot-classification dataset, the model was asked to predict the news category with the following prompt:

prompt = f'''Classify the given Uzbek news article into one of the following categories. Provide only the category number as the answer.

Categories:
0 - Politics (Siyosat)
1 - Economy (Iqtisodiyot)
2 - Technology (Texnologiya)
3 - Sports (Sport)
4 - Culture (Madaniyat)
5 - Health (Salomatlik)
6 - Family and Society (Oila va Jamiyat)
7 - Education (Ta'lim)
8 - Ecology (Ekologiya)
9 - Foreign News (Xorijiy Yangiliklar)

Now classify this article:
"{text}"

Answer (number only):"
'''
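Because the prompt asks for a category number only, a reply can be turned into a prediction by pulling out the first digit. The sketch below assumes that parsing rule (the model card does not describe it); parse_category is a hypothetical helper, and CATEGORIES lists the English names from the prompt in index order.

```python
import re

# English category names from the prompt above, in index order 0-9.
CATEGORIES = [
    "Politics", "Economy", "Technology", "Sports", "Culture",
    "Health", "Family and Society", "Education", "Ecology", "Foreign News",
]


def parse_category(output):
    """Return the first digit in the reply as an int, or None if there is none."""
    match = re.search(r"\d", output)
    return int(match.group()) if match else None


# Made-up replies, only to show how tolerant the parser is.
for reply in ["3", "Answer: 7 - Education", "Sports"]:
    index = parse_category(reply)
    print(repr(reply), "->", index, CATEGORIES[index] if index is not None else None)
```

A reply with no digit at all yields None, which an evaluation loop would typically count as an incorrect prediction.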

MMLU evaluation

5-shot evaluation on MMLU used the following template, and accuracy was measured on the first token the model generated:

template = """The following are multiple choice questions (with answers) about [subject area].

[Example question 1]
A. text
B. text
C. text
D. text
Answer: [Correct answer letter]

.
.
.

[Example question 5]
A. text
B. text
C. text
D. text
Answer: [Correct answer letter]

Now, let's think step by step and then provide only the letter corresponding to the correct answer for the below question, without any additional explanation or comments.

[Actual MMLU test question]
A. text
B. text
C. text
D. text
Answer:"""
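The template can be filled programmatically. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, each few-shot example is a (question, choices, answer letter) tuple, and "first token" is approximated by the first non-whitespace character of the reply, which may differ from how the authors extracted it.

```python
INSTRUCTION = (
    "Now, let's think step by step and then provide only the letter "
    "corresponding to the correct answer for the below question, "
    "without any additional explanation or comments."
)


def format_question(question, choices, answer=None):
    """Render one question block; leave the answer open when answer is None."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_mmlu_prompt(subject, examples, question, choices):
    """Assemble header, few-shot examples, instruction, and the test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [format_question(q, c, a) for q, c, a in examples]
    return "\n\n".join([header, *shots, INSTRUCTION, format_question(question, choices)])


def first_letter(output):
    """Return the leading A/B/C/D of a reply, or None if it starts with anything else."""
    token = output.strip()[:1].upper()
    return token if token in ("A", "B", "C", "D") else None
```

With five example tuples, build_mmlu_prompt yields a 5-shot prompt that ends in an open "Answer:" line, and first_letter turns the reply into a prediction that can be compared with the gold letter.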