Swallow-70b-hf open-source large language model: enhanced Japanese capability, available in multiple sizes and tuned versions

Swallow 70b Hf

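Developed by tokyotech-llm

An open-source large language model built on the Llama 2 family with enhanced Japanese capability, available in three sizes (7B, 13B, 70B) and in instruction-tuned versions

Large language model

Transformers

Supports multiple languages #JapaneseOptimized #Llama2Enhanced #MultilingualGeneration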

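Released: 11/25/2023

Model overview

A Japanese-optimized large language model developed at Tokyo Institute of Technology. Continual pre-training and instruction fine-tuning improve its performance on Japanese tasks, and it supports text generation in both Japanese and English.

Model features

Japanese-optimized vocabulary

A tokenizer extended with Japanese vocabulary represents Japanese text with fewer tokens

Multiple sizes

Available in 7B, 13B, and 70B parameter sizes to suit different compute budgets

Instruction-tuned versions

Supervised fine-tuning (SFT) improves instruction following

Ongoing updates

The team released several enhanced versions during 2024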

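Model capabilities

Japanese text generation

English text generation

Instruction understanding and execution

Open-ended question answering

Machine reading comprehension

Automatic summarization

Mathematical reasoning

Use cases

Education

Japanese learning assistance

Generate Japanese study materials and practice exercises

Make Japanese study more efficient

Content creation

Japanese content generation

Automatically generate text that follows natural Japanese usage

Speed up Japanese content production

Research

Japanese NLP research

Serve as a baseline model for Japanese natural language processing research

Advance Japanese AI technology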

🚀 Swallow

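Swallow models are continually pre-trained from the Llama 2 family, mainly with added Japanese language data. The tuned versions use supervised fine-tuning (SFT). Links to the other models can be found in the index.

🚀 Quick start

This repository provides large language models developed by TokyoTech-LLM. You can read our blog post or our paper.

First, install the additional dependencies listed in requirements.txt: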

pip install -r requirements.txt

🔧 Using the instruction model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-7b-instruct-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto")


PROMPT_DICT = {
    "prompt_input": (
        "以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n{instruction}\n\n### 入力:\n{input}\n\n### 応答:"

    ),
    "prompt_no_input": (
        "以下に、あるタスクを説明する指示があります。"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n{instruction}\n\n### 応答:"
    ),
}

def create_prompt(instruction, input=None):
    """
    Generates a prompt based on the given instruction and an optional input.
    If input is provided, it uses the 'prompt_input' template from PROMPT_DICT.
    If no input is provided, it uses the 'prompt_no_input' template.

    Args:
        instruction (str): The instruction describing the task.
        input (str, optional): Additional input providing context for the task. Default is None.

    Returns:
        str: The generated prompt.
    """
    if input:
        # Use the 'prompt_input' template when additional input is provided
        return PROMPT_DICT["prompt_input"].format(instruction=instruction, input=input)
    else:
        # Use the 'prompt_no_input' template when no additional input is provided
        return PROMPT_DICT["prompt_no_input"].format(instruction=instruction)

# Example usage
instruction_example = "以下のトピックに関する詳細な情報を提供してください。"
input_example = "東京工業大学の主なキャンパスについて教えてください"
prompt = create_prompt(instruction_example, input_example)

input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)

tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)
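
For a decoder-only model, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded `out` above still begins with the prompt. The following is a minimal sketch of one way to keep only the answer. It assumes the `### 応答:` marker from the template appears in the prompt and that the answer follows its first occurrence; `extract_response` is a hypothetical helper, not part of the Swallow repository.

```python
RESPONSE_MARKER = "### 応答:"  # marker that ends both templates in PROMPT_DICT


def extract_response(generated_text: str) -> str:
    """Return the text after the first response marker.

    If the marker is missing, the whole text is returned (stripped),
    so the helper is also safe to call on base-model output.
    """
    _, marker, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if marker else generated_text.strip()


# Stand-in for a decoded generation (hypothetical text, not real model output).
decoded = "### 指示:\n挨拶してください。\n\n### 応答:\n こんにちは。 "
print(extract_response(decoded))
```

An alternative that avoids string matching is to slice the generated tensor by the prompt length before decoding; the string-based version is shown here because it runs without loading a model.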

🔧 Using the base model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tokyotech-llm/Swallow-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "東京工業大学の主なキャンパスは、"
input_ids = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt"
)
tokens = model.generate(
    input_ids.to(device=model.device),
    max_new_tokens=128,
    temperature=0.99,
    top_p=0.95,
    do_sample=True,
)

out = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(out)
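
Both examples load a 7B checkpoint in bfloat16. As a rough sizing aid for the larger models, the sketch below estimates weight memory at 2 bytes per parameter, which is the storage size of bfloat16. The parameter counts are nominal values read from the model names (an assumption; the exact counts differ slightly), and activations, the KV cache, and framework overhead are ignored, so treat the numbers as a lower bound.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits

# Nominal parameter counts taken from the model names (assumption).
NOMINAL_PARAMS = {"7b": 7e9, "13b": 13e9, "70b": 70e9}


def weight_memory_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


for size, n_params in NOMINAL_PARAMS.items():
    print(f"Swallow-{size}: about {weight_memory_gb(n_params):.0f} GB of weights in bfloat16")
```

With `device_map="auto"`, as used in the examples, the weights can be spread over several devices, which is how a 70B checkpoint is typically loaded when no single GPU can hold it.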

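✨ Key features

Multilingual support: Japanese and English.
Continual pre-training: continually pre-trained from the Llama 2 family with added Japanese data.
Fine-tuned versions: supervised fine-tuning (SFT) versions are provided to better adapt to specific tasks.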

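📚 Detailed documentation

Model release updates

April 26, 2024: Released version 0.1 of the enhanced instruction-tuned models, Swallow-7b-instruct-v0.1, Swallow-13b-instruct-v0.1, and Swallow-70b-instruct-v0.1, as preview versions.
March 2, 2024: Released Swallow-7b-plus-hf, a model trained with approximately twice as many Japanese tokens as Swallow-7b-hf.
February 4, 2024: Released Swallow-13b-NVE-hf.
January 26, 2024: Released Swallow-7b-NVE-hf, Swallow-7b-NVE-instruct-hf, Swallow-70b-NVE-hf, and Swallow-70b-NVE-instruct-hf.
December 19, 2023: Released Swallow-7b-hf, Swallow-7b-instruct-hf, Swallow-13b-hf, Swallow-13b-instruct-hf, Swallow-70b-hf, and Swallow-70b-instruct-hf.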

Swallow model index

Model	Swallow-hf	Swallow-instruct-hf	Swallow-instruct-v0.1
7B	Link	Link	Link
7B-Plus	Link	N/A	N/A
13B	Link	Link	Link
70B	Link	Link	Link

Swallow model index, NVE (no vocabulary expansion)

Model	Swallow-NVE-hf	Swallow-NVE-instruct-hf
7B	Link	Link
13B	Link	N/A
70B	Link	Link
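
The two index tables can be turned into a small lookup that builds a model identifier and rejects the combinations marked N/A. This is a sketch under assumptions: `repo_id` and `AVAILABLE` are hypothetical names, and the `tokyotech-llm/` prefix is taken from the usage examples and assumed to apply to every variant.

```python
ORG = "tokyotech-llm"  # prefix used in the usage examples (assumed for all variants)

# Sizes available per variant, transcribed from the two index tables.
AVAILABLE = {
    "hf": {"7b", "7b-plus", "13b", "70b"},
    "instruct-hf": {"7b", "13b", "70b"},
    "instruct-v0.1": {"7b", "13b", "70b"},
    "NVE-hf": {"7b", "13b", "70b"},
    "NVE-instruct-hf": {"7b", "70b"},
}


def repo_id(size: str, variant: str = "hf") -> str:
    """Build an identifier such as 'tokyotech-llm/Swallow-70b-hf'."""
    size = size.lower()
    if variant not in AVAILABLE:
        raise ValueError(f"unknown variant: {variant!r}")
    if size not in AVAILABLE[variant]:
        raise ValueError(f"{variant!r} is listed as N/A for size {size!r}")
    return f"{ORG}/Swallow-{size}-{variant}"


print(repo_id("70B"))                   # base 70B model
print(repo_id("7b", "instruct-v0.1"))   # preview instruction-tuned model
```

The returned string can be passed as `model_name` to `from_pretrained` in the examples above.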

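Model details

Attribute	Details
Model type	See the Llama 2 technical report for details on the model architecture.
Language	Japanese, English
Library	Megatron-LM
Tokenizer	The model uses a tokenizer whose vocabulary was expanded with Japanese data. It represents text with fewer tokens, which makes inference significantly faster.
Contact	swallow[at]nlp.c.titech.ac.jp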

Base model performance

Japanese tasks

Model	Size	JCommonsenseQA (4-shot)	JEMHopQA (4-shot)	NIILC (4-shot)	JSQuAD (4-shot)	XL-Sum (1-shot)	MGSM (4-shot)	WMT20-en-ja (4-shot)	WMT20-ja-en (4-shot)
Llama 2	7B	0.3852	0.4240	0.3410	0.7917	0.1905	0.0760	0.1783	0.1738
Swallow	7B	0.4808	0.5078	0.5968	0.8573	0.1830	0.1240	0.2510	0.1511
Swallow-Plus	7B	0.5478	0.5493	0.6030	0.8544	0.1806	0.1360	0.2568	0.1441
Swallow-NVE	7B	0.5433	0.5425	0.5729	0.8684	0.2117	0.1200	0.2405	0.1512
Llama 2	13B	0.6997	0.4415	0.4170	0.8533	0.2139	0.1320	0.2146	0.1982
Swallow	13B	0.7837	0.5063	0.6398	0.9005	0.2168	0.2040	0.2720	0.1771
Swallow-NVE	13B	0.7712	0.5438	0.6351	0.9030	0.2294	0.2120	0.2735	0.1817
Llama 2	70B	0.8686	0.4656	0.5256	0.9080	0.2361	0.3560	0.2643	0.2398
Swallow	70B	0.9348	0.6290	0.6960	0.9176	0.2266	0.4840	0.3043	0.2298
Swallow-NVE	70B	0.9410	0.5759	0.7024	0.9254	0.2758	0.4720	0.3042	0.2322
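
To read the 70B rows at a glance, the sketch below averages the eight Japanese scores per model and counts the tasks on which Swallow 70B scores above Llama 2 70B. The values are copied from the table. The unweighted mean is only a rough summary, since the columns mix different metrics, and it is an illustration rather than a figure reported by the authors.

```python
from statistics import mean

JA_TASKS = ["JCommonsenseQA", "JEMHopQA", "NIILC", "JSQuAD",
            "XL-Sum", "MGSM", "WMT20-en-ja", "WMT20-ja-en"]

# 70B rows of the Japanese task table, in the column order above.
JA_70B = {
    "Llama 2":     [0.8686, 0.4656, 0.5256, 0.9080, 0.2361, 0.3560, 0.2643, 0.2398],
    "Swallow":     [0.9348, 0.6290, 0.6960, 0.9176, 0.2266, 0.4840, 0.3043, 0.2298],
    "Swallow-NVE": [0.9410, 0.5759, 0.7024, 0.9254, 0.2758, 0.4720, 0.3042, 0.2322],
}

# Unweighted mean over the eight tasks for each model.
averages = {model: mean(scores) for model, scores in JA_70B.items()}

# Tasks where Swallow 70B is strictly above Llama 2 70B.
swallow_wins = [task for task, base, ours
                in zip(JA_TASKS, JA_70B["Llama 2"], JA_70B["Swallow"])
                if ours > base]

for model, avg in averages.items():
    print(f"{model}: mean score {avg:.4f}")
print(f"Swallow 70B is ahead on {len(swallow_wins)} of {len(JA_TASKS)} tasks")
```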

English tasks

Model	Size	OpenBookQA (8-shot)	TriviaQA (8-shot)	HellaSwag (8-shot)	SQuAD2.0 (8-shot)	XWINO (8-shot)	GSM8K (8-shot)
Llama 2	7B	0.3580	0.6265	0.5860	0.3207	0.9049	0.1410
Swallow	7B	0.3180	0.4836	0.5308	0.3125	0.8817	0.1130
Swallow-Plus	7B	0.3280	0.4558	0.5259	0.3134	0.8929	0.1061
Swallow-NVE	7B	0.3180	0.5079	0.5329	0.2919	0.8817	0.0986
Llama 2	13B	0.3760	0.7255	0.6148	0.3681	0.9140	0.2403
Swallow	13B	0.3500	0.5852	0.5660	0.3406	0.9075	0.2039
Swallow-NVE	13B	0.3460	0.6025	0.5700	0.3478	0.9006	0.1751
Llama 2	70B	0.4280	0.8239	0.6742	0.3770	0.9290	0.5284
Swallow	70B	0.4220	0.7756	0.6458	0.3745	0.9204	0.4867
Swallow-NVE	70B	0.4240	0.7817	0.6439	0.3451	0.9256	0.4943
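
The English table can be summarized as per-task differences between Swallow and Llama 2 at the same size. The sketch below copies the Llama 2 and Swallow rows from the table and computes Swallow minus Llama 2; `deltas` is a hypothetical helper name. In this table every difference is negative, and TriviaQA shows the largest drop at each size.

```python
EN_TASKS = ["OpenBookQA", "TriviaQA", "HellaSwag", "SQuAD2.0", "XWINO", "GSM8K"]

# Llama 2 and Swallow rows of the English task table, in the column order above.
EN_SCORES = {
    ("Llama 2", "7B"):  [0.3580, 0.6265, 0.5860, 0.3207, 0.9049, 0.1410],
    ("Swallow", "7B"):  [0.3180, 0.4836, 0.5308, 0.3125, 0.8817, 0.1130],
    ("Llama 2", "13B"): [0.3760, 0.7255, 0.6148, 0.3681, 0.9140, 0.2403],
    ("Swallow", "13B"): [0.3500, 0.5852, 0.5660, 0.3406, 0.9075, 0.2039],
    ("Llama 2", "70B"): [0.4280, 0.8239, 0.6742, 0.3770, 0.9290, 0.5284],
    ("Swallow", "70B"): [0.4220, 0.7756, 0.6458, 0.3745, 0.9204, 0.4867],
}


def deltas(size: str) -> dict:
    """Per-task score difference (Swallow minus Llama 2) for one model size."""
    base = EN_SCORES[("Llama 2", size)]
    ours = EN_SCORES[("Swallow", size)]
    return {task: round(o - b, 4) for task, b, o in zip(EN_TASKS, base, ours)}


for size in ("7B", "13B", "70B"):
    diff = deltas(size)
    worst = min(diff, key=diff.get)  # task with the largest drop
    print(f"{size}: largest drop on {worst} ({diff[worst]:+.4f})")
```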

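Evaluation benchmarks

Japanese evaluation benchmarks

We used llm-jp-eval (v1.0.0) and the JP Language Model Evaluation Harness (commit #9b42d41). The details are as follows:

Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
Open-ended question answering (JEMHopQA [Ishii+, 2023])
Open-ended question answering (NIILC [Sekine, 2003])
Machine reading comprehension (JSQuAD [Kurihara+, 2022])
Automatic summarization (XL-Sum [Hasan+, 2021])
Machine translation (WMT2020 ja-en [Barrault+, 2020])
Machine translation (WMT2020 en-ja [Barrault+, 2020])
Mathematical reasoning (MGSM [Shi+, 2023])

English evaluation benchmarks

We used the Language Model Evaluation Harness (v0.3.0). The details are as follows:

Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
Open-ended question answering (TriviaQA [Joshi+, 2017])
Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
Commonsense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
Natural language inference (HellaSwag [Zellers+, 2019])
Mathematical reasoning (GSM8k [Cobbe+, 2021])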

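Training datasets

Continual pre-training

The following datasets were used for continual pre-training:

Instruction tuning

The following datasets were used for instruction tuning: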

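🔧 Technical details

This project continually pre-trains the Llama 2 family with added Japanese data and then applies supervised fine-tuning (SFT). The model uses a tokenizer whose vocabulary was expanded with Japanese data. It represents text with fewer tokens, which makes inference significantly faster.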
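
Why a larger vocabulary means fewer tokens can be shown with a toy example. The sketch below is not the Swallow tokenizer: it is a greedy longest-match splitter over a hand-made vocabulary, with single characters as a fallback, and the vocabulary entries are assumptions chosen for illustration. Adding multi-character Japanese entries halves the token count for the sample string. Because a causal language model produces one token per decoding step, fewer tokens for the same text means fewer steps.

```python
def greedy_tokenize(text: str, vocab: set) -> list:
    """Split text by greedy longest match; unknown characters become single tokens."""
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


text = "東京工業大学"
base_vocab = set()                         # toy stand-in: no Japanese entries
expanded_vocab = {"東京", "工業", "大学"}   # toy stand-in: added Japanese entries

print(len(greedy_tokenize(text, base_vocab)), "tokens without expansion")
print(len(greedy_tokenize(text, expanded_vocab)), "tokens with expansion")
```

The NVE (no vocabulary expansion) models in the index keep the original vocabulary, so the benchmark tables can be used to compare scores for the two tokenizer choices.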

📄 License

👥 Authors

Okazaki Laboratory

Yokota Laboratory

📝 How to cite

If you find our work helpful, please feel free to cite us:

@inproceedings{Fujii:COLM2024,
   title={Continual Pre-Training for Cross-Lingual LLM Adaptation:
Enhancing Japanese Language Capabilities},
   author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki
Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae
Mizuki and Rio Yokota and Naoaki Okazaki},
   booktitle="Proceedings of the First Conference on Language Modeling",
   series={COLM},
   pages="(to appear)",
   year="2024",
   month=oct,
   address={University of Pennsylvania, USA},
}

@inproceedings{Okazaki:COLM2024,
   title={Building a Large Japanese Web Corpus for Large Language Models},
   author={Naoaki Okazaki and Kakeru Hattori and Hirai Shota and Hiroki
Iida and Masanari Ohi and Kazuki Fujii and Taishi Nakamura and Mengsay
Loem and Rio Yokota and Sakae Mizuki},
   booktitle="Proceedings of the First Conference on Language Modeling",
   series={COLM},
   pages="(to appear)",
   year="2024",
   month=oct,
   address={University of Pennsylvania, USA},
}