Swallow-70b-hf: an open-source large language model with enhanced Japanese capability, available in several sizes and tuned variants

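Swallow-70b-hf

Developed by tokyotech-llm

An open-source large language model that builds on the Llama 2 family with enhanced Japanese capability, available in 7B, 13B, and 70B sizes with instruction-tuned variants

Large language model

Transformers

Multilingual #Japanese-optimized #Llama2-enhanced #multilingual-generation

Downloads: 2,088

Released: 11/25/2023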

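Model overview

A Japanese-optimized large language model developed at Tokyo Institute of Technology. Continual pre-training and instruction tuning improve its performance on Japanese tasks, and it generates text in both Japanese and English.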

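Model features

Japanese-optimized vocabulary

A tokenizer with a vocabulary extended for Japanese represents Japanese text in fewer tokens

Multiple sizes

Available in 7B, 13B, and 70B parameter sizes to fit different compute budgets

Instruction-tuned variants

Supervised fine-tuning (SFT) improves instruction following

Ongoing updates

Several enhanced versions were released in 2024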
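
To make the compute trade-off between the three sizes concrete, the sketch below estimates the memory needed for the weights alone. It assumes nominal parameter counts of 7, 13, and 70 billion (the real counts differ slightly, in part because of the extended vocabulary) and ignores activations and the KV cache, so treat the numbers as a lower bound. `NOMINAL_PARAMS` and `weight_memory_gib` are hypothetical names used only for illustration.

```python
# Bytes used per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: nominal parameter counts, not the exact counts of the checkpoints.
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9, "70B": 70e9}


def weight_memory_gib(size: str, dtype: str = "bfloat16") -> float:
    """Return the approximate memory needed for the weights alone, in GiB."""
    return NOMINAL_PARAMS[size] * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for size in NOMINAL_PARAMS:
        print(f"{size}: about {weight_memory_gib(size):.0f} GiB of weights in bfloat16")
```

Under these assumptions the 70B model needs roughly ten times the weight memory of the 7B model, which is why the usage examples rely on `device_map="auto"` to spread the weights over the available devices.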

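Model capabilities

Japanese text generation

English text generation

Instruction understanding and execution

Open-ended question answering

Machine reading comprehension

Automatic summarization

Mathematical reasoning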

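Use cases

Education

Japanese learning support

Generate Japanese study materials and practice exercises

Make Japanese study more efficient

Content creation

Japanese content generation

Automatically generate text that follows natural Japanese usage

Speed up Japanese content production

Research

Japanese NLP research

Serve as a baseline model for Japanese natural language processing research

Advance Japanese AI technology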

🚀 Swallow

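The Swallow models were continually pre-trained from the Llama 2 family, mainly with added Japanese data. The tuned versions use supervised fine-tuning (SFT). Links to the other models are in the index.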

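🚀 Quick start

This repository provides large language models developed by TokyoTech-LLM. See our blog post or paper for details.

First, install the additional dependencies listed in requirements.txt: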

pip install -r requirements.txt

🔧 Using the instruction model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-7b-instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto")


PROMPT_DICT = {
    "prompt_input": (
        "以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n{instruction}\n\n### 入力:\n{input}\n\n### 応答:"

    ),
    "prompt_no_input": (
        "以下に、あるタスクを説明する指示があります。"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n{instruction}\n\n### 応答:"
    ),
}

def create_prompt(instruction, input=None):
    """
    Generates a prompt based on the given instruction and an optional input.
    If input is provided, it uses the 'prompt_input' template from PROMPT_DICT.
    If no input is provided, it uses the 'prompt_no_input' template.

    Args:
        instruction (str): The instruction describing the task.
        input (str, optional): Additional input providing context for the task. Default is None.

    Returns:
        str: The generated prompt.
    """
    if input:
        # Use the 'prompt_input' template when additional input is provided
        return PROMPT_DICT["prompt_input"].format(instruction=instruction, input=input)
    else:
        # Use the 'prompt_no_input' template when no additional input is provided
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)

# Example usage
instruction_example = "以下のトピックに関する詳細な情報を提供してください。"
input_example = "東京工業大学の主なキャンパスについて教えてください"
prompt = create_prompt(instruction_example, input_example)

input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)

tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)
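
The decoded string `out` above contains the prompt followed by the model's answer, because `generate` returns the input tokens together with the new ones. A minimal sketch for keeping only the answer, assuming the prompt ends with the `### 応答:` marker from `PROMPT_DICT` and that the marker does not appear inside the instruction or input (`extract_response` is a hypothetical helper, not part of the repository):

```python
RESPONSE_MARKER = "### 応答:"  # marker that closes both templates in PROMPT_DICT


def extract_response(generated_text: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the first response marker.

    If the marker is missing, the whole text is returned (stripped).
    """
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()
```

Calling `extract_response(out)` after the snippet above would drop the instruction and input sections. Splitting on the first marker keeps any later text the model produces, including extra turns it may invent, so further trimming may still be needed.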

🔧 Using the base model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "東京工業大学の主なキャンパスは、"
input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)
tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)
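
A base model simply continues the prompt until `max_new_tokens` is reached, so the output often stops in the middle of a sentence. One possible post-processing step, shown here as a sketch, cuts the text after the last Japanese sentence terminator. The helper name and the default terminator set are assumptions made for illustration.

```python
def trim_to_last_sentence(text: str, terminators: str = "。！？") -> str:
    """Cut text after the last sentence terminator.

    The text is returned unchanged when no terminator is present.
    """
    last = max(text.rfind(ch) for ch in terminators)
    return text if last == -1 else text[: last + 1]
```

Applied as `trim_to_last_sentence(out)`, this keeps the prompt and every complete sentence that follows it. For English output the terminator set would need ASCII punctuation as well, at the cost of occasionally cutting at a decimal point or abbreviation.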

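✨ Key features

Multilingual support: Japanese and English.
Continual pre-training: continually pre-trained from the Llama 2 family with added Japanese data.
Fine-tuned versions: supervised fine-tuning (SFT) versions are provided for better instruction following.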

📚 Detailed documentation

Model release updates

April 26, 2024: Released version 0.1 of the enhanced instruction-tuned models as a preview: Swallow-7b-instruct-v0.1, Swallow-13b-instruct-v0.1, and Swallow-70b-instruct-v0.1.
March 2, 2024: Released Swallow-7b-plus-hf, a model trained with approximately twice as many Japanese tokens as Swallow-7b-hf.
February 4, 2024: Released Swallow-13b-NVE-hf.
January 26, 2024: Released Swallow-7b-NVE-hf, Swallow-7b-NVE-instruct-hf, Swallow-70b-NVE-hf, and Swallow-70b-NVE-instruct-hf.
December 19, 2023: Released Swallow-7b-hf, Swallow-7b-instruct-hf, Swallow-13b-hf, Swallow-13b-instruct-hf, Swallow-70b-hf, and Swallow-70b-instruct-hf.

Swallow model index

Model	Swallow-hf	Swallow-instruct-hf	Swallow-instruct-v0.1
7B	Link	Link	Link
7B-Plus	Link	N/A	N/A
13B	Link	Link	Link
70B	Link	Link	Link

Swallow model index, NVE (no vocabulary expansion)

Model	Swallow-NVE-hf	Swallow-NVE-instruct-hf
7B	Link	Link
13B	Link	N/A
70B	Link	Link
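
The model names in the release notes follow a regular pattern, so a repository ID can be assembled from a size and a variant. The sketch below encodes the availability shown in the two index tables and rejects the combinations marked N/A. It assumes every model lives under the `tokyotech-llm/` organization prefix seen in the usage examples; `swallow_repo_id` and `AVAILABLE` are hypothetical names.

```python
ORG = "tokyotech-llm"  # assumption: same organization prefix as in the usage examples

# Availability transcribed from the two index tables (N/A entries are left out).
AVAILABLE = {
    "hf": {"7b", "7b-plus", "13b", "70b"},
    "instruct-hf": {"7b", "13b", "70b"},
    "instruct-v0.1": {"7b", "13b", "70b"},
    "NVE-hf": {"7b", "13b", "70b"},
    "NVE-instruct-hf": {"7b", "70b"},
}


def swallow_repo_id(size: str, variant: str = "hf") -> str:
    """Build a repository ID such as 'tokyotech-llm/Swallow-7b-instruct-hf'."""
    size = size.lower()
    if variant not in AVAILABLE:
        raise ValueError(f"unknown variant: {variant!r}")
    if size not in AVAILABLE[variant]:
        raise ValueError(f"{variant!r} is not listed for size {size!r}")
    return f"{ORG}/Swallow-{size}-{variant}"
```

For example, `swallow_repo_id("70B", "NVE-instruct-hf")` yields an ID that can be passed as `model_name` in the usage examples, while `swallow_repo_id("13B", "NVE-instruct-hf")` raises `ValueError` because that cell is N/A in the table.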

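Model details

Attribute	Details
Model type	See the Llama 2 technical report for details of the model architecture.
Languages	Japanese, English
Library	Megatron-LM
Tokenizer	The model uses a tokenizer whose vocabulary was extended with Japanese data. It represents text in fewer tokens, which speeds up inference.
Contact	swallow[at]nlp.c.titech.ac.jp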

Base model performance

Japanese tasks

Model	Size	JCommonsenseQA (4-shot)	JEMHopQA (4-shot)	NIILC (4-shot)	JSQuAD (4-shot)	XL-Sum (1-shot)	MGSM (4-shot)	WMT20-en-ja (4-shot)	WMT20-ja-en (4-shot)
Llama 2	7B	0.3852	0.4240	0.3410	0.7917	0.1905	0.0760	0.1783	0.1738
Swallow	7B	0.4808	0.5078	0.5968	0.8573	0.1830	0.1240	0.2510	0.1511
Swallow-Plus	7B	0.5478	0.5493	0.6030	0.8544	0.1806	0.1360	0.2568	0.1441
Swallow-NVE	7B	0.5433	0.5425	0.5729	0.8684	0.2117	0.1200	0.2405	0.1512
Llama 2	13B	0.6997	0.4415	0.4170	0.8533	0.2139	0.1320	0.2146	0.1982
Swallow	13B	0.7837	0.5063	0.6398	0.9005	0.2168	0.2040	0.2720	0.1771
Swallow-NVE	13B	0.7712	0.5438	0.6351	0.9030	0.2294	0.2120	0.2735	0.1817
Llama 2	70B	0.8686	0.4656	0.5256	0.9080	0.2361	0.3560	0.2643	0.2398
Swallow	70B	0.9348	0.6290	0.6960	0.9176	0.2266	0.4840	0.3043	0.2298
Swallow-NVE	70B	0.9410	0.5759	0.7024	0.9254	0.2758	0.4720	0.3042	0.2322

English tasks

Model	Size	OpenBookQA (8-shot)	TriviaQA (8-shot)	HellaSwag (8-shot)	SQuAD2.0 (8-shot)	XWINO (8-shot)	GSM8K (8-shot)
Llama 2	7B	0.3580	0.6265	0.5860	0.3207	0.9049	0.1410
Swallow	7B	0.3180	0.4836	0.5308	0.3125	0.8817	0.1130
Swallow-Plus	7B	0.3280	0.4558	0.5259	0.3134	0.8929	0.1061
Swallow-NVE	7B	0.3180	0.5079	0.5329	0.2919	0.8817	0.0986
Llama 2	13B	0.3760	0.7255	0.6148	0.3681	0.9140	0.2403
Swallow	13B	0.3500	0.5852	0.5660	0.3406	0.9075	0.2039
Swallow-NVE	13B	0.3460	0.6025	0.5700	0.3478	0.9006	0.1751
Llama 2	70B	0.4280	0.8239	0.6742	0.3770	0.9290	0.5284
Swallow	70B	0.4220	0.7756	0.6458	0.3745	0.9204	0.4867
Swallow-NVE	70B	0.4240	0.7817	0.6439	0.3451	0.9256	0.4943
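
The two tables are easier to read as differences against the Llama 2 baseline. The sketch below copies the 70B rows for Llama 2 and Swallow from the tables above and computes the per-task change. The helper `deltas` is a hypothetical name, and because the tasks use different metrics the differences should only be compared within a task, not averaged across tasks.

```python
JA_TASKS = ["JCommonsenseQA", "JEMHopQA", "NIILC", "JSQuAD",
            "XL-Sum", "MGSM", "WMT20-en-ja", "WMT20-ja-en"]
EN_TASKS = ["OpenBookQA", "TriviaQA", "HellaSwag", "SQuAD2.0", "XWINO", "GSM8K"]

# 70B rows transcribed from the Japanese and English tables.
LLAMA2_70B = {
    "ja": [0.8686, 0.4656, 0.5256, 0.9080, 0.2361, 0.3560, 0.2643, 0.2398],
    "en": [0.4280, 0.8239, 0.6742, 0.3770, 0.9290, 0.5284],
}
SWALLOW_70B = {
    "ja": [0.9348, 0.6290, 0.6960, 0.9176, 0.2266, 0.4840, 0.3043, 0.2298],
    "en": [0.4220, 0.7756, 0.6458, 0.3745, 0.9204, 0.4867],
}


def deltas(tasks, baseline, adapted):
    """Return {task: adapted score minus baseline score}, rounded to 4 places."""
    return {t: round(a - b, 4) for t, b, a in zip(tasks, baseline, adapted)}


if __name__ == "__main__":
    for lang, tasks in (("ja", JA_TASKS), ("en", EN_TASKS)):
        for task, diff in deltas(tasks, LLAMA2_70B[lang], SWALLOW_70B[lang]).items():
            print(f"{lang} {task}: {diff:+.4f}")
```

At 70B, Swallow scores higher than Llama 2 on six of the eight Japanese tasks (XL-Sum and WMT20-ja-en are lower) and lower on all six English tasks, which is the trade-off the tables show for continual pre-training on Japanese data.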

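Evaluation benchmarks

Japanese evaluation benchmarks

We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The tasks are:

Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
Open-ended question answering (JEMHopQA [Ishii+, 2023])
Open-ended question answering (NIILC [Sekine, 2003])
Machine reading comprehension (JSQuAD [Kurihara+, 2022])
Automatic summarization (XL-Sum [Hasan+, 2021])
Machine translation (WMT2020 ja-en [Barrault+, 2020])
Machine translation (WMT2020 en-ja [Barrault+, 2020])
Mathematical reasoning (MGSM [Shi+, 2023])

English evaluation benchmarks

We used the Language Model Evaluation Harness (v.0.3.0). The tasks are:

Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
Open-ended question answering (TriviaQA [Joshi+, 2017])
Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
Commonsense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
Natural language inference (HellaSwag [Zellers+, 2019])
Mathematical reasoning (GSM8k [Cobbe+, 2021])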
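
The "n-shot" labels in the result tables mean that n solved examples are placed in the prompt before the question being scored. The sketch below shows the general idea with a plain `Q:`/`A:` layout. That layout and the helper name are assumptions for illustration only and do not reproduce the templates used by llm-jp-eval or the evaluation harnesses.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    examples: Sequence[Tuple[str, str]], question: str, n_shots: int
) -> str:
    """Concatenate n_shots solved examples, then the unanswered question."""
    if n_shots > len(examples):
        raise ValueError("not enough examples for the requested number of shots")
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples[:n_shots]]
    blocks.append(f"Q: {question}\nA:")  # the model is expected to complete this line
    return "\n\n".join(blocks)
```

A 4-shot evaluation would call this with `n_shots=4`, and the model's continuation after the final `A:` is what gets scored.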

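Training datasets

Continual pre-training

The following datasets were used for continual pre-training:

Instruction tuning

The following datasets were used for instruction tuning: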

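🔧 Technical details

Swallow is continually pre-trained from the Llama 2 family with added Japanese data, and the tuned versions are trained with supervised fine-tuning (SFT). The model uses a tokenizer whose vocabulary was extended with Japanese data, so the same text is represented in fewer tokens, which speeds up inference.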
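
One way to check the "fewer tokens" claim on your own text is to compare token counts from two tokenizers. The sketch below is tokenizer-agnostic: it takes two encoding callables and returns the ratio of their total token counts. `token_ratio` and the `Encoder` alias are hypothetical names, and no measured ratio for Swallow is implied here.

```python
from typing import Callable, Iterable, Sequence

# Any callable that maps a string to a sequence of token IDs.
Encoder = Callable[[str], Sequence[int]]


def token_ratio(texts: Iterable[str], baseline: Encoder, candidate: Encoder) -> float:
    """Return candidate tokens divided by baseline tokens over all texts.

    A value below 1.0 means the candidate tokenizer needs fewer tokens.
    """
    texts = list(texts)
    base_total = sum(len(baseline(t)) for t in texts)
    cand_total = sum(len(candidate(t)) for t in texts)
    if base_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return cand_total / base_total
```

With tokenizers loaded through `AutoTokenizer.from_pretrained`, each encoder could be written as `lambda t: tok.encode(t, add_special_tokens=False)`, using a Llama 2 tokenizer as the baseline and a Swallow tokenizer as the candidate. Since generation cost grows with the number of tokens produced, a lower ratio on Japanese text translates into fewer decoding steps for the same output.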

📄 License

👥 Authors

Okazaki Laboratory

Yokota Laboratory

📝 How to cite

If you find our work helpful, please feel free to cite:

@inproceedings{Fujii:COLM2024,
   title={Continual Pre-Training for Cross-Lingual LLM Adaptation:
Enhancing Japanese Language Capabilities},
   author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki
Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae
Mizuki and Rio Yokota and Naoaki Okazaki},
   booktitle="Proceedings of the First Conference on Language Modeling",
   series={COLM},
   pages="(to appear)",
   year="2024",
   month=oct,
   address={University of Pennsylvania, USA},
}

@inproceedings{Okazaki:COLM2024,
   title={Building a Large Japanese Web Corpus for Large Language Models},
   author={Naoaki Okazaki and Kakeru Hattori and Hirai Shota and Hiroki
Iida and Masanari Ohi and Kazuki Fujii and Taishi Nakamura and Mengsay
Loem and Rio Yokota and Sakae Mizuki},
   booktitle="Proceedings of the First Conference on Language Modeling",
   series={COLM},
   pages="(to appear)",
   year="2024",
   month=oct,
   address={University of Pennsylvania, USA},
}