japanese-gpt-neox-3.6b-instruction-sft: An open-source Japanese dialogue model that follows instructions and is free to deploy

Developed by rinna

This is a Japanese GPT-NeoX model with 3.6 billion parameters, which has been fine-tuned with instructions and can serve as a dialogue agent that follows instructions.

Open source license: MIT
Tags: Japanese dialogue agent, instruction fine-tuning, 3.6 billion parameters

Release time: 5/17/2023

Model Overview

A Japanese language model built on rinna/japanese-gpt-neox-3.6b, fine-tuned with instructions and specifically designed for dialogue and instruction-following tasks.

Model Features

Instruction fine-tuning

The model has been specifically fine-tuned to better understand and follow user instructions

Special input format

Uses a special dialogue format in which the <NL> tag stands in for line breaks.

Improved tokenizer

The tokenizer has been optimized to better handle Japanese text and spaces

Multi-turn dialogue support

Supports multi-turn dialogue interaction between users and the system

Model Capabilities

Japanese text generation

Multi-turn dialogue processing

Instruction understanding and execution

Travel information recommendation

Question answering system

Use Cases

Dialogue system

Tourist attraction recommendation

Recommend Japanese tourist attractions based on user requests

Can provide attraction details and the reasons for each recommendation.

Information Q&A

Answer users' questions about Japanese culture, geography, etc.

Can generate accurate and detailed answers

Customer service system

Customer consultation handling

Handle Japanese customers' consultation requests

Can understand customer needs and provide relevant assistance

🚀 japanese-gpt-neox-3.6b-instruction-sft

This repository offers a Japanese GPT-NeoX model with 3.6 billion parameters, fine-tuned for instruction-following conversations.

🚀 Quick Start

This repository provides a Japanese GPT-NeoX model of 3.6 billion parameters. The model is based on rinna/japanese-gpt-neox-3.6b and has been finetuned to serve as an instruction-following conversational agent.

✨ Features

Model architecture

A 36-layer, 2816-hidden-size transformer-based language model.
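
As a rough sanity check, the layer count and hidden size above can be related to the stated 3.6 billion parameters. The sketch below is a back-of-the-envelope estimate, not the exact count: it assumes a standard transformer layer with a 4x MLP expansion, ignores biases and layer norms, assumes separate (untied) input and output embedding matrices, and uses the 32,000 vocabulary size given in the Tokenization section. The function name `estimate_gpt_neox_params` is hypothetical.

```python
def estimate_gpt_neox_params(num_layers, hidden_size, vocab_size):
    """Approximate parameter count of a GPT-NeoX-style decoder."""
    # Per layer: 4*d^2 for attention (QKV + output projection) and
    # 8*d^2 for an MLP with 4x expansion. Biases and layer norms are ignored.
    per_layer = 12 * hidden_size ** 2
    # Assumption: input and output embeddings are separate (untied) matrices.
    embeddings = 2 * vocab_size * hidden_size
    return num_layers * per_layer + embeddings


total = estimate_gpt_neox_params(num_layers=36, hidden_size=2816, vocab_size=32_000)
print(f"{total / 1e9:.2f}B parameters")  # prints 3.61B parameters
```

Under these assumptions the estimate lands close to the advertised 3.6 billion, which suggests the listed dimensions are consistent with the model size.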

Finetuning

The finetuning data is a subset of several existing datasets that has been translated into Japanese.

The data will not be released.

Model Series

Variant	Link
3.6B PPO	https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo
3.6B SFT-v2	https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2
3.6B SFT	https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft
3.6B pretrained	https://huggingface.co/rinna/japanese-gpt-neox-3.6b

Contributors

Tianyu Zhao and Kei Sawada

Release date

March 17, 2023

💻 Usage Examples

I/O Format

A special format has been adopted to construct inputs:

An input prompt is formatted as a conversation between ユーザー and システム.
Each input utterance consists of (1) its speaker ("ユーザー" or "システム"), (2) a colon (":"), (3) a whitespace (" "), and (4) utterance text (e.g. "世界で一番高い山は？").
The input prompt should end with "システム: " to prompt the model to generate a response.
Since the model's tokenizer does not recognize "\n", a special newline symbol "<NL>" is used instead.
All the newlines in input and output utterances should be replaced with "<NL>".
All the utterances in the input prompt should be separated by "<NL>".

Basic Usage

prompt = [
    {
        "speaker": "ユーザー",
        "text": "日本のおすすめの観光地を教えてください。"
    },
    {
        "speaker": "システム",
        "text": "どの地域の観光地が知りたいですか？"
    },
    {
        "speaker": "ユーザー",
        "text": "渋谷の観光地を教えてください。"
    }
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "<NL>".join(prompt)
prompt = (
    prompt
    + "<NL>"
    + "システム: "
)
print(prompt)
# "ユーザー: 日本のおすすめの観光地を教えてください。<NL>システム: どの地域の観光地が知りたいですか？<NL>ユーザー: 渋谷の観光地を教えてください。<NL>システム: "
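
The example above joins utterances with "<NL>" but does not cover the case where an utterance itself contains line breaks. A minimal sketch of a reusable helper that handles both is shown below; the name `build_prompt` and the (speaker, text) tuple layout are assumptions made for illustration and are not part of the original repository.

```python
NEWLINE_TOKEN = "<NL>"


def build_prompt(turns):
    """Format (speaker, text) pairs into the prompt layout described above."""
    lines = []
    for speaker, text in turns:
        # Newlines inside an utterance must also be replaced with <NL>.
        text = text.replace("\r\n", "\n").replace("\n", NEWLINE_TOKEN)
        lines.append(f"{speaker}: {text}")
    # Utterances are separated by <NL>; the prompt ends with "システム: "
    # so that the model continues with the system's reply.
    return NEWLINE_TOKEN.join(lines) + NEWLINE_TOKEN + "システム: "


example = build_prompt([
    ("ユーザー", "次の二つを比べてください。\n1. 富士山\n2. 北岳"),
])
print(example)
# "ユーザー: 次の二つを比べてください。<NL>1. 富士山<NL>2. 北岳<NL>システム: "
```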

How to use the model

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b-instruction-sft")

if torch.cuda.is_available():
    model = model.to("cuda")

token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        do_sample=True,
        max_new_tokens=128,
        temperature=0.7,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0][token_ids.size(1):])
output = output.replace("<NL>", "\n")
print(output)
"""分かりました。いくつかのおすすめを紹介します。
1. ハチ公像です。ハチ公像は、日本の観光スポットの1つとして人気があります。
2. スクランブル交差点です。多くの人々が行き交う大きな交差点で、観光客に人気のスポットです。
3. 109です。109は、ショッピングやエンターテイメント施設です。
4. 道玄坂です。道玄坂は、日本の商業地区である坂道です。</s>"""
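
The sample output above still ends with the end-of-sequence marker "</s>". For a multi-turn conversation, the reply usually needs to be cleaned and appended to the history before the next prompt is built. The following is a small sketch of that post-processing step; `clean_reply` and `append_system_turn` are hypothetical helper names, and the "</s>" default is taken from the sample output rather than read from the tokenizer.

```python
def clean_reply(decoded, eos_token="</s>", newline_token="<NL>"):
    """Strip the end-of-sequence marker and restore real line breaks."""
    # Keep only the text before the first end-of-sequence marker, if present.
    reply = decoded.split(eos_token, 1)[0]
    # Convert the special newline symbol back into "\n".
    return reply.replace(newline_token, "\n").strip()


def append_system_turn(history, reply):
    """Return a new history list with the system's reply added."""
    return history + [{"speaker": "システム", "text": reply}]


history = [{"speaker": "ユーザー", "text": "渋谷の観光地を教えてください。"}]
reply = clean_reply("分かりました。<NL>1. ハチ公像です。</s>")
history = append_system_turn(history, reply)
print(history[-1]["text"])
```

The updated history can then be formatted into the next prompt in the same way as in Basic Usage, with line breaks in the stored reply converted back to "<NL>".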

📚 Documentation

Tokenization

The model uses a sentencepiece-based tokenizer:

The tokenizer has a vocabulary size of 32,000.
It uses sentencepiece's byte fallback feature to decompose unknown text pieces into UTF-8 byte pieces and to avoid producing <UNK> tokens.
sentencepiece's --add_dummy_prefix option was turned off so that a leading whitespace will not be prepended automatically.

print(tokenizer.tokenize("吾輩は猫である"))
# ['吾', '輩', 'は', '猫', 'である']
# instead of ['▁', '吾', '輩', 'は', '猫', 'である'] as in rinna/japanese-gpt-1b

sentencepiece's --remove_extra_whitespaces option was turned off so that leading, trailing, and duplicate whitespaces are preserved.

print(tokenizer.tokenize("  吾輩は  猫である   "))
# ['▁', '▁', '吾', '輩', 'は', '▁', '▁', '猫', 'である', '▁', '▁', '▁']
# instead of ['▁', '吾', '輩', 'は', '▁猫', 'である'] as in rinna/japanese-gpt-1b

Don't forget to set use_fast=False to make the above features function correctly.

good_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
bad_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")

print(good_tokenizer.decode(good_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
# 'გამარჯობა  吾輩は  猫である   </s>'
print(bad_tokenizer.decode(bad_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
# 'გამარ[UNK]ობა 吾輩は 猫である </s>'
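
To illustrate the idea behind byte fallback, the sketch below decomposes a character into UTF-8 byte pieces in plain Python, without loading the tokenizer. It is an illustration of the concept only: the `<0xNN>` piece spelling follows sentencepiece's convention for byte pieces, while the helper names `byte_fallback_pieces` and `pieces_to_text` are hypothetical and the real tokenizer only falls back to bytes for text it cannot cover with its regular vocabulary.

```python
def byte_fallback_pieces(text):
    """Decompose text into one piece per UTF-8 byte, e.g. '<0xE1>'."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def pieces_to_text(pieces):
    """Reassemble byte pieces into the original text."""
    # Each piece looks like '<0xE1>'; the hex digits sit between '<0x' and '>'.
    return bytes(int(p[3:-1], 16) for p in pieces).decode("utf-8")


pieces = byte_fallback_pieces("ჯ")
print(pieces)
# ['<0xE1>', '<0x83>', '<0xAF>']
print(pieces_to_text(pieces))
# ჯ
```

Because every byte value has its own piece, any input can be encoded and decoded losslessly, which is why the correctly loaded tokenizer round-trips the Georgian text instead of emitting an unknown token.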

📄 License

The MIT license

📚 How to cite

@misc{rinna-japanese-gpt-neox-3.6b-instruction-sft,
    title = {rinna/japanese-gpt-neox-3.6b-instruction-sft},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
