Mistral-7B-Instruct-Uz open-source model: supports a range of NLP tasks in Uzbek and English

Mistral 7B Instruct Uz

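Developed by behbudiy

A Mistral-7B instruction-tuned model optimized for Uzbek, supporting a range of NLP tasks in Uzbek and English

Large language model

Transformers

Multilingual

License: Apache-2.0

Tags: #Uzbek-optimized #multi-task-instruction-tuning #low-resource-language-support

Downloads: 49

Released: 8/30/2024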

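Model overview

The model was continually pre-trained and instruction-tuned on a mix of Uzbek and English data, which improves its performance on Uzbek tasks while retaining the base model's knowledge. It is suited to applications such as machine translation, summarization, and dialogue systems.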

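Model features

Uzbek optimization

Continually pre-trained and instruction-tuned specifically for Uzbek, which substantially improves performance on Uzbek tasks

Multi-task support

Supports text generation, summarization, translation, question answering, and other NLP tasks

Bilingual capability

Supports both Uzbek and English, making it suitable for bilingual applications

Performance gains

Outperforms the base model on Uzbek translation, sentiment analysis, and news classification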

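Model capabilities

Uzbek text generation

English text generation

Uzbek-to-English machine translation

English-to-Uzbek machine translation

Text summarization

Question answering

Sentiment analysis

Text classification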

Use cases

Machine translation

Uzbek-to-English translation

Translates Uzbek text into English

BLEU Uz-En: 29.39, COMET: 86.91

English-to-Uzbek translation

Translates English text into Uzbek

BLEU En-Uz: 16.77, COMET: 88.75

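Text analysis

Uzbek sentiment analysis

Determines the sentiment of Uzbek text

Accuracy: 79.13%

Uzbek news classification

Assigns Uzbek news articles to one of a fixed set of categories

Accuracy: 59.38%

Dialogue systems

Uzbek chatbot

Building Uzbek-language conversational systems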

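🚀 Mistral-7B-Instruct-Uz

The Mistral-7B-Instruct-Uz model was continually pre-trained and instruction-tuned on publicly available and synthetically constructed Uzbek and English data, extending its capabilities while preserving its original knowledge. It is designed to support a range of Uzbek NLP tasks, such as machine translation, summarization, and dialogue systems, with robust performance across these applications.

🚀 Quick start

Installation

We recommend using behbudiy/Mistral-7B-Instruct-Uz with mistral-inference. Install it with: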

pip install mistral_inference

Downloading the model

from huggingface_hub import snapshot_download
from pathlib import Path

mistral_models_path = Path.home().joinpath('mistral_models', '7B-Instruct-Uz')
mistral_models_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="behbudiy/Mistral-7B-Instruct-Uz", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)
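
Before loading the model it can help to confirm that the download actually produced the three files named in `allow_patterns`. The following is a minimal sketch, not part of the original model card; the helper name `missing_model_files` is hypothetical, and the directory is assumed to be the one created in the download step.

```python
from pathlib import Path

# File names taken from the allow_patterns list in the download call above.
EXPECTED_FILES = ("params.json", "consolidated.safetensors", "tokenizer.model.v3")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir.

    Hypothetical helper for illustration; it only checks that files exist,
    not that they are complete or uncorrupted.
    """
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Same location as mistral_models_path in the download step.
    model_dir = Path.home().joinpath("mistral_models", "7B-Instruct-Uz")
    missing = missing_model_files(model_dir)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```

An empty list means every expected file is in place; anything else suggests the download should be repeated.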

Chat

After installing mistral_inference, a mistral-chat CLI command is available in your environment. You can chat with the model using:

mistral-chat $HOME/mistral_models/7B-Instruct-Uz --instruct --max_tokens 256

Instruction following

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# mistral_models_path is the directory created in the download step above.
tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tokenizer.model.v3")
model = Transformer.from_folder(mistral_models_path)

# A single-turn request in Uzbek ("Tell me about Uzbekistan.").
completion_request = ChatCompletionRequest(messages=[UserMessage(content="O'zbekiston haqida ma'lumot ber.")])

# Encode the chat request into token ids using the instruct template.
tokens = tokenizer.encode_chat_completion(completion_request).tokens

# Greedy decoding (temperature=0.0) of at most 64 new tokens.
out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)

Generating text with Hugging Face `transformers`

If you want to generate text with Hugging Face transformers, you can do something like this:

from transformers import pipeline

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
chatbot = pipeline("text-generation", model="behbudiy/Mistral-7B-Instruct-Uz", device='cuda')
chatbot(messages)
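
The pipeline call returns a nested structure rather than a plain string. The sketch below shows one way to pull out the assistant's reply. It assumes the chat-style result shape used by recent transformers releases (a list whose first item has a `generated_text` key holding the full message list); that shape, the helper name `last_assistant_reply`, and the hand-written sample result are assumptions for illustration and are not taken from the model card.

```python
def last_assistant_reply(outputs):
    """Return the assistant's reply text from a chat-style pipeline result.

    Assumes `outputs` looks like
    [{"generated_text": [..., {"role": "assistant", "content": "..."}]}].
    """
    conversation = outputs[0]["generated_text"]
    # Walk backwards so the most recent assistant turn is returned.
    for message in reversed(conversation):
        if message.get("role") == "assistant":
            return message["content"]
    raise ValueError("no assistant message found in pipeline output")


# Hand-written stand-in for a real pipeline result, used only for illustration.
example_outputs = [{
    "generated_text": [
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "Arr, I be a pirate chatbot!"},
    ]
}]

print(last_assistant_reply(example_outputs))
```

With a real model, `last_assistant_reply(chatbot(messages))` would be the corresponding call, provided the installed transformers version returns the shape assumed here.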

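✨ Key features

The model was continually pre-trained and instruction-tuned on publicly available and synthetically constructed Uzbek and English data, extending its capabilities while preserving its original knowledge.
It supports a range of Uzbek NLP tasks, such as machine translation, summarization, and dialogue systems.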

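📚 Documentation

Model description

The Mistral-7B-Instruct-Uz model was continually pre-trained and instruction-tuned on publicly available and synthetically constructed Uzbek and English data to preserve its original knowledge while extending its capabilities. It is designed to support a range of Uzbek NLP tasks, such as machine translation, summarization, and dialogue systems, with robust performance across these applications. For details on performance metrics relative to the base model, see this post.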

Performance comparison

Model name	BLEU Uz-En (one-shot)	BLEU En-Uz (one-shot)	COMET (Uz-En)	COMET (En-Uz)	Uzbek sentiment analysis	Uzbek news classification	MMLU (English) (5-shot)
Llama-3.1 8B Instruct	23.74	6.72	84.30	82.70	68.96	55.41	65.77
Llama-3.1 8B Instruct Uz	27.42	11.58	85.63	86.53	82.42	60.84	62.78
Mistral 7B Instruct	7.47	0.67	68.14	45.58	62.02	47.52	61.07
Mistral 7B Instruct Uz	29.39	16.77	86.91	88.75	79.13	59.38	55.72
Mistral Nemo Instruct	25.68	9.79	85.56	85.04	72.47	49.24	67.62
Mistral Nemo Instruct Uz	30.49	15.52	87.04	88.01	82.05	58.2	67.36
Google Translate	41.18	22.98	89.16	90.67	—	—	—

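The results show that the Uzbek-optimized models consistently outperform their base counterparts on the translation benchmarks (BLEU and COMET) on the FLORES+ Uz-En / En-Uz evaluation datasets, as well as on Uzbek sentiment analysis and news classification. On MMLU, which measures general multi-task language understanding in English, the authors report no significant decline for the fine-tuned models; in the table the drop ranges from 0.26 points (Mistral Nemo) to 5.35 points (Mistral 7B). (The MMLU score of the base Llama model differs from the official score because of the evaluation method. See the evaluation details below.)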
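
To make the comparison concrete, the per-metric differences between each Uzbek-optimized model and its base model can be computed directly from the table. This is a small sketch using only the numbers listed above; the short metric labels and the `deltas` helper are names chosen here for illustration.

```python
# Short labels for the seven table columns, in table order.
METRICS = ["BLEU Uz-En", "BLEU En-Uz", "COMET Uz-En", "COMET En-Uz",
           "Sentiment", "News", "MMLU"]

# Scores copied from the comparison table: (base model, Uzbek-optimized model).
SCORES = {
    "Llama-3.1 8B Instruct": (
        [23.74, 6.72, 84.30, 82.70, 68.96, 55.41, 65.77],
        [27.42, 11.58, 85.63, 86.53, 82.42, 60.84, 62.78],
    ),
    "Mistral 7B Instruct": (
        [7.47, 0.67, 68.14, 45.58, 62.02, 47.52, 61.07],
        [29.39, 16.77, 86.91, 88.75, 79.13, 59.38, 55.72],
    ),
    "Mistral Nemo Instruct": (
        [25.68, 9.79, 85.56, 85.04, 72.47, 49.24, 67.62],
        [30.49, 15.52, 87.04, 88.01, 82.05, 58.2, 67.36],
    ),
}


def deltas(base, tuned):
    """Return tuned-minus-base differences per metric, rounded to 2 decimals."""
    return {m: round(t - b, 2) for m, b, t in zip(METRICS, base, tuned)}


for name, (base, tuned) in SCORES.items():
    print(name, deltas(base, tuned))
```

Positive values mean the Uzbek-optimized model scores higher; for all three model pairs every metric except MMLU is positive.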

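Evaluation methodology

Translation evaluation

For the translation evaluation, we used the FLORES+ Uz-En / En-Uz datasets, merging the dev and test sets to create a larger evaluation set for each of the Uz-En and En-Uz subsets. We evaluated both the base models and the Uzbek-optimized models one-shot, using the following prompt for Uz-En (for En-Uz, the words "English" and "Uzbek" were swapped).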

  prompt = f'''You are a professional Uzbek-English translator. Your task is to accurately translate the given Uzbek text into English.

  Instructions:
  1. Translate the text from Uzbek to English.
  2. Maintain the original meaning and tone.
  3. Use appropriate English grammar and vocabulary.
  4. If you encounter an ambiguous or unfamiliar word, provide the most likely translation based on context.
  5. Output only the English translation, without any additional comments.

  Example:
  Uzbek: "Bugun ob-havo juda yaxshi, quyosh charaqlab turibdi."
  English: "The weather is very nice today, the sun is shining brightly."

  Now, please translate the following Uzbek text into English:
  "{sentence}"
    '''
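
One way to cover both translation directions with a single template is sketched below. The source states only that the words "English" and "Uzbek" were swapped for En-Uz; swapping the two example sentences along with their labels is an assumption made here so that the one-shot example stays consistent, and `build_translation_prompt` is a hypothetical helper name.

```python
# Template mirroring the prompt above, with the language names parameterized.
PROMPT_TEMPLATE = '''You are a professional {src}-{tgt} translator. Your task is to accurately translate the given {src} text into {tgt}.

Instructions:
1. Translate the text from {src} to {tgt}.
2. Maintain the original meaning and tone.
3. Use appropriate {tgt} grammar and vocabulary.
4. If you encounter an ambiguous or unfamiliar word, provide the most likely translation based on context.
5. Output only the {tgt} translation, without any additional comments.

Example:
{src}: "{example_src}"
{tgt}: "{example_tgt}"

Now, please translate the following {src} text into {tgt}:
"{sentence}"
'''

# The one-shot example pair from the original prompt, keyed by language.
EXAMPLE = {
    "Uzbek": "Bugun ob-havo juda yaxshi, quyosh charaqlab turibdi.",
    "English": "The weather is very nice today, the sun is shining brightly.",
}


def build_translation_prompt(sentence, src="Uzbek", tgt="English"):
    """Build the one-shot prompt for Uz-En (default) or En-Uz.

    Assumption: for En-Uz the example sentences are swapped together with
    the language names; the source does not spell this out.
    """
    if {src, tgt} != {"Uzbek", "English"}:
        raise ValueError("src and tgt must be 'Uzbek' and 'English' in some order")
    return PROMPT_TEMPLATE.format(
        src=src, tgt=tgt,
        example_src=EXAMPLE[src], example_tgt=EXAMPLE[tgt],
        sentence=sentence,
    )


print(build_translation_prompt("Hello, world!", src="English", tgt="Uzbek"))
```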

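Uzbek sentiment analysis evaluation

To evaluate Uzbek sentiment analysis, we used the risqaliyevds/uzbek-sentiment-analysis dataset, for which binary labels (0: negative, 1: positive) were created with the GPT-4o API (see the behbudiy/uzbek-sentiment-analysis dataset). We used the following prompt for the evaluation: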

prompt = f'''Given the following text, determine the sentiment as either 'Positive' or 'Negative.' Respond with only the word 'Positive' or 'Negative' without any additional text or explanation.

Text: {text}"
'''
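
The source does not describe how model responses were scored against the binary labels. As a rough sketch under that caveat, responses could be mapped to 0/1 and compared with the gold labels as follows; the parsing rules, the choice to count unparseable responses as wrong, and the helper names are all assumptions.

```python
def parse_sentiment(response):
    """Map a model response to 1 (positive), 0 (negative), or None.

    Tolerates surrounding whitespace, quotes, and a trailing period;
    anything else is treated as unparseable.
    """
    word = response.strip().strip(".'\"").lower()
    if word == "positive":
        return 1
    if word == "negative":
        return 0
    return None


def accuracy(responses, gold_labels):
    """Fraction of responses whose parsed label equals the gold label.

    Unparseable responses (None) never match, so they count as errors.
    """
    if len(responses) != len(gold_labels):
        raise ValueError("responses and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(parse_sentiment(r) == g for r, g in zip(responses, gold_labels))
    return correct / len(gold_labels)


# Made-up responses and labels, for illustration only.
print(accuracy(["Positive", "negative.", "I think it is good"], [1, 0, 1]))
```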

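Uzbek news classification evaluation

For the Uzbek news classification task, we used the risqaliyevds/uzbek-zero-shot-classification dataset and asked the model to predict the news category using the following prompt: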

prompt = f'''Classify the given Uzbek news article into one of the following categories. Provide only the category number as the answer.

Categories:
0 - Politics (Siyosat)
1 - Economy (Iqtisodiyot)
2 - Technology (Texnologiya)
3 - Sports (Sport)
4 - Culture (Madaniyat)
5 - Health (Salomatlik)
6 - Family and Society (Oila va Jamiyat)
7 - Education (Ta'lim)
8 - Ecology (Ekologiya)
9 - Foreign News (Xorijiy Yangiliklar)

Now classify this article:
"{text}"

Answer (number only):"
'''
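
Because the prompt asks for a category number only, the response has to be turned back into one of the ten category ids before it can be scored. The sketch below shows one possible way to do that; the extraction rule (first integer found in the response) and the helper name `parse_category` are assumptions, since the source does not say how responses were parsed.

```python
import re

# Category ids and English names as listed in the prompt above.
CATEGORIES = {
    0: "Politics", 1: "Economy", 2: "Technology", 3: "Sports", 4: "Culture",
    5: "Health", 6: "Family and Society", 7: "Education", 8: "Ecology",
    9: "Foreign News",
}


def parse_category(response):
    """Return the first integer in the response if it is a valid category id.

    Returns None when no integer is found or the number is outside 0-9.
    """
    match = re.search(r"\d+", response)
    if match is None:
        return None
    number = int(match.group())
    return number if number in CATEGORIES else None


# Made-up responses, for illustration only.
for response in ["3", "Answer: 7 - Education", "12", "Sports"]:
    category_id = parse_category(response)
    print(repr(response), "->", category_id, CATEGORIES.get(category_id))
```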