Japanese-gpt-neox-3.6b: An Open-Source Japanese Model - Freely Enjoy the Results of Training with Massive Japanese Corpora

Japanese Gpt Neox 3.6b

Developed by rinna

A Japanese GPT-NeoX model with 3.6 billion parameters, based on the Transformer architecture, trained on 312.5 billion tokens of Japanese corpus.

Large Language Model

Transformers

Supports Multiple Languages

Open Source License: MIT

#Japanese Text Generation #3.6B Parameters #Transformer Architecture

Downloads 34.74k

Release Time : 5/17/2023

Model Overview

A Japanese language model based on the GPT-NeoX architecture, intended primarily for Japanese text generation.

Model Features

Large-scale Japanese Pretraining

Trained on approximately 312.5 billion tokens of Japanese corpus, including CC-100, C4, and Japanese Wikipedia.

Optimized Tokenizer

Uses a sentencepiece-based tokenizer with UTF-8 byte fallback support, preserving whitespace information.

High Performance

Reaches a final validation perplexity of 8.68 on the pre-training language modelling objective.

Model Capabilities

Japanese Text Generation

Language Modeling

Natural Language Processing

Use Cases

Text Generation

Philosophical Text Continuation

Given the beginning of a philosophical topic, the model can generate coherent follow-up content.

In the Quick Start example, the model continues a prompt about Nishida Kitaro's philosophy with coherent text.

Educational Research

Japanese Language Research

Can be used to study the performance and characteristics of Japanese language models.

🚀 Japanese GPT-NeoX-3.6B

This repository offers a Japanese GPT-NeoX model with 3.6 billion parameters, trained on around 312.5B tokens of Japanese text with a traditional language modelling objective.

🚀 Quick Start

To start using the japanese-gpt-neox-3.6b model, you can follow the code example below:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# use_fast=False is required for the tokenizer features described under Tokenization.
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/japanese-gpt-neox-3.6b")

# Move the model to the GPU when one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Prompt: "Nishida Kitaro is, ..."
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

# Sample exactly 100 new tokens (min_new_tokens == max_new_tokens).
with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=0.8,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode the first (and only) sequence in the batch, prompt included.
output = tokenizer.decode(output_ids.tolist()[0])
print(output)
"""西田幾多郎は、この「絶対矛盾的自己同一」を「世界の自己同一」と置きかえ、さらに西田哲学を出発点として「絶対無」を「世界の成立」に変え、世界と自己を一つの統一物とみなす哲学として展開する。この世界と自己は絶対矛盾的自己同一として同一の性質を有し、同じ働きをする。西田哲学においては、この世界と自己は矛盾しあうのではなく、同一の性質をもっている。この世界と自己は同一である。絶対"""

✨ Features

Library

The model was trained using code based on EleutherAI/gpt-neox.

Model architecture

A 36-layer, 2816-hidden-size transformer-based language model.
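As a rough sanity check, the layer count and hidden size above can be related to the advertised 3.6 billion parameters. The sketch below is a back-of-the-envelope estimate, not the model's actual configuration: it assumes a feed-forward inner width of four times the hidden size and untied input and output embeddings, ignores biases and layer norms, and takes the 32,000 vocabulary size from the Tokenization section. The helper name `estimate_parameters` is made up for illustration.

```python
def estimate_parameters(num_layers: int, hidden_size: int, vocab_size: int) -> int:
    """Rough transformer parameter count, ignoring biases and layer norms."""
    # Query, key, value and output projections: four hidden x hidden matrices.
    attention = 4 * hidden_size * hidden_size
    # Two feed-forward matrices, ASSUMING an inner width of 4 * hidden_size.
    feed_forward = 8 * hidden_size * hidden_size
    # ASSUMING separate (untied) input and output embedding matrices.
    embeddings = 2 * vocab_size * hidden_size
    return num_layers * (attention + feed_forward) + embeddings


total = estimate_parameters(num_layers=36, hidden_size=2816, vocab_size=32_000)
print(f"{total / 1e9:.2f}B parameters")
# 3.61B parameters
```

Under these assumptions the estimate lands close to the 3.6B figure in the model name.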

Pre-training

The model was trained on around 312.5B tokens from Japanese CC-100, Japanese C4, and Japanese Wikipedia to optimize a traditional language modelling objective. It reached a final validation perplexity of 8.68.
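For readers unfamiliar with the metric, perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below illustrates that relationship with made-up token probabilities; it is not the evaluation code used for this model, and the `perplexity` helper is a hypothetical name.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities the model assigned to four target tokens.
log_probs = [math.log(0.25), math.log(0.10), math.log(0.05), math.log(0.20)]
print(round(perplexity(log_probs), 2))

# Conversely, a perplexity of 8.68 corresponds to an average loss of
# ln(8.68) nats per token.
print(round(math.log(8.68), 3))
```

Read this way, a perplexity of 8.68 means the model is on average about as uncertain as if it were choosing uniformly among roughly nine tokens at each step.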

Model Series

Variant	Link
3.6B PPO	https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo
3.6B SFT-v2	https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2
3.6B SFT	https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft
3.6B pretrained	https://huggingface.co/rinna/japanese-gpt-neox-3.6b

Contributors

Tianyu Zhao and Kei Sawada

Release date

May 17, 2023

📚 Documentation

Tokenization

The model uses a sentencepiece-based tokenizer.

The tokenizer has a vocabulary size of 32,000.
It uses sentencepiece's byte fallback feature to decompose unknown text pieces into UTF-8 byte pieces and to avoid producing <UNK> tokens.
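To make byte fallback concrete, the sketch below shows how a character outside the vocabulary can be spelled out as UTF-8 byte pieces in sentencepiece's `<0xNN>` notation. It is a pure-Python illustration that does not call sentencepiece; the helper name `byte_fallback_pieces` is hypothetical, and the assumption that the Georgian letter "ჯ" is out of vocabulary is inferred from the [UNK] in the `use_fast` comparison later in this section.

```python
def byte_fallback_pieces(text: str) -> list[str]:
    """Render each UTF-8 byte of `text` as a sentencepiece-style byte piece."""
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


# "ჯ" (U+10EF) takes three bytes in UTF-8, so it becomes three byte pieces.
print(byte_fallback_pieces("ჯ"))
# ['<0xE1>', '<0x83>', '<0xAF>']
```

Because every string can be written as UTF-8 bytes, a vocabulary that contains the 256 byte pieces never has to fall back to an unknown token.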

sentencepiece's --add_dummy_prefix option was turned off so that a leading whitespace will not be prepended automatically.

print(tokenizer.tokenize("吾輩は猫である"))
# ['吾', '輩', 'は', '猫', 'である']
# instead of ['▁', '吾', '輩', 'は', '猫', 'である'] as in rinna/japanese-gpt-1b

sentencepiece's --remove_extra_whitespaces option was turned off so that leading, trailing, and duplicate whitespaces are preserved.

print(tokenizer.tokenize("  吾輩は  猫である   "))
# ['▁', '▁', '吾', '輩', 'は', '▁', '▁', '猫', 'である', '▁', '▁', '▁']
# instead of ['▁', '吾', '輩', 'は', '▁猫', 'である'] as in rinna/japanese-gpt-1b
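The combined effect of the two options can be approximated in a few lines. The sketch below is a simplified model of sentencepiece's whitespace handling before segmentation, not its actual implementation; the function name `to_sentencepiece_surface` is hypothetical, and only the ASCII space is treated as whitespace.

```python
import re

META_SPACE = "\u2581"  # the "▁" symbol sentencepiece uses for whitespace


def to_sentencepiece_surface(text, add_dummy_prefix, remove_extra_whitespaces):
    """Approximate the string sentencepiece segments, given the two options."""
    if remove_extra_whitespaces:
        # Drop leading/trailing spaces and collapse runs of spaces.
        text = re.sub(r" +", " ", text.strip(" "))
    if add_dummy_prefix:
        # Prepend a single space so the text starts with a whitespace marker.
        text = " " + text
    return text.replace(" ", META_SPACE)


sample = "  吾輩は  猫である   "
# Both options off, as described for this model: every space survives.
print(to_sentencepiece_surface(sample, False, False))
# ▁▁吾輩は▁▁猫である▁▁▁
# Both options on: a single prefix marker and collapsed inner spaces.
print(to_sentencepiece_surface(sample, True, True))
# ▁吾輩は▁猫である
```

The two printed strings line up with the token sequences in the examples above once the pieces are joined back together.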

Don't forget to set use_fast=False to make the above features function correctly.

good_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b", use_fast=False)
bad_tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt-neox-3.6b")

print(good_tokenizer.decode(good_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
# 'გამარჯობა  吾輩は  猫である   </s>'
print(bad_tokenizer.decode(bad_tokenizer.encode("გამარჯობა  吾輩は  猫である   ")))
# 'გამარ[UNK]ობა 吾輩は 猫である </s>'

📄 License

The MIT license

How to cite

@misc{rinna-japanese-gpt-neox-3.6b,
    title = {rinna/japanese-gpt-neox-3.6b},
    author = {Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-gpt-neox-3.6b}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
