rinna/nekomata-14b
This project performs continual pre-training of qwen-14b on a mixture of Japanese and English datasets, significantly improving the model's performance on Japanese tasks while retaining the strengths of the original Qwen model.
Quick Start
The rinna/nekomata-14b model is the result of continual pre-training of qwen-14b on a blend of Japanese and English datasets. Here is a basic example of how to use it:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is required: the Qwen architecture and tokenizer are
# provided as custom code in the model repository.
tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-14b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-14b", device_map="auto", trust_remote_code=True)

text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
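If GPU memory is tight, the weights can also be loaded in half precision. This is a generic transformers loading option rather than anything stated in this model card, so treat the following as an optional, hedged variant of the loading step above (bfloat16 assumes suitable hardware support; float16 is an alternative):

import torch
from transformers import AutoModelForCausalLM

# Optional: load the weights in bfloat16 to roughly halve GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "rinna/nekomata-14b",
    device_map="auto",
    torch_dtype=torch.bfloat16,   # assumption: the deployment GPU supports bfloat16
    trust_remote_code=True,
)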
✨ Features
- Efficient Japanese Text Processing: The inclusive Qwen vocabulary (vocab size > 150k) enables the model to process Japanese texts much more efficiently than the previously released youri series.
- Long Sequence Support: The model supports a maximum sequence length of 8192 tokens (a sketch for keeping prompts within this limit follows this list).
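As a rough sketch of staying within that limit (not part of the original card), the snippet below trims an over-long prompt to fit the 8192-token context before generation. It assumes the tokenizer and model objects (and the torch import) from the Quick Start example, and `long_text` is a hypothetical placeholder:

# Assumes tokenizer and model are loaded as in the Quick Start example.
MAX_CONTEXT = 8192        # maximum sequence length supported by the model
MAX_NEW_TOKENS = 200      # generation budget

long_text = "..."  # hypothetical placeholder for a document that may exceed the context window
token_ids = tokenizer.encode(long_text, add_special_tokens=False, return_tensors="pt")

# Keep only the most recent tokens so that prompt + generated tokens fit in the context window.
budget = MAX_CONTEXT - MAX_NEW_TOKENS
if token_ids.shape[1] > budget:
    token_ids = token_ids[:, -budget:]

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids.tolist()[0]))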
📦 Installation
The model is used through the transformers library, which can be installed with the following command:
pip install transformers
Note that loading with device_map="auto" additionally requires the accelerate package, and the Qwen tokenizer code loaded via trust_remote_code depends on tiktoken, so install those as well if they are not already available.
Documentation
Overview
We conduct continual pre-training of qwen-14b on 66B tokens from a mixture of Japanese and English datasets. The continual pre-training significantly improves the model's performance on Japanese tasks. The model also retains the features inherited from the original Qwen model that are listed in the Features section above.
The name nekomata comes from the Japanese word 猫又/ねこまた/Nekomata, which is a kind of Japanese mythical creature (妖怪/ようかい/Youkai).
Model Details
| Property | Details |
|---|---|
| Model Type | A 40-layer, 5120-hidden-size transformer-based language model |
| Training Data | A mixture of Japanese and English datasets including [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz), Japanese C4, [Japanese OSCAR](https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0), The Pile, Wikipedia, and rinna curated Japanese dataset |
| Library | The model was trained using code based on [aws-neuron/neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron/) |
| Training Infrastructure | nekomata-14b was trained on 16 nodes of Amazon EC2 trn1.32xlarge instances powered by AWS Trainium purpose-built ML accelerator chips. The pre-training job was completed within approximately 7 days. |
| Contributors | Tianyu Zhao, Akio Kaga, Kei Sawada |
| Release date | December 21, 2023 |
Benchmarking
Please refer to rinna's LM benchmark page (Sheet 20231221).
Tokenization
The model uses the original Qwen tokenizer. It augments the cl100k tiktoken tokenizer and has a vocabulary size of 151,936. The inclusive vocabulary helps the model achieve better tokenization efficiency, especially for Japanese texts.
We compared the Qwen tokenizer (as used in nekomata) and the llama-2 tokenizer (as used in youri) on different text collections and found that the Qwen tokenizer achieves a much better byte2token rate (i.e. the average number of tokens produced from 1 byte of text), as shown in the table below. A lower byte2token rate indicates better tokenization efficiency.
| Tokenizer | Japanese | English | Multilingual |
|---|---|---|---|
| Qwen | 0.24 | 0.27 | 0.27 |
| llama-2 | 0.40 | 0.29 | 0.36 |
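The byte2token rate can be approximated for any text sample with a few lines of code. The sketch below uses only the nekomata tokenizer (comparing against llama-2 would additionally require access to that tokenizer), and `sample_text` is a hypothetical stand-in for whichever corpus slice you want to measure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-14b", trust_remote_code=True)

# Hypothetical sample; replace with text drawn from the collection you want to measure.
sample_text = "吾輩は猫である。名前はまだ無い。"

num_tokens = len(tokenizer.encode(sample_text, add_special_tokens=False))
num_bytes = len(sample_text.encode("utf-8"))

# byte2token rate: average number of tokens produced per byte of UTF-8 text (lower is better).
print(f"byte2token rate: {num_tokens / num_bytes:.2f}")

Lower values mean more text fits into the 8192-token context for the same byte budget.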
How to cite
@misc{rinna-nekomata-14b,
title = {rinna/nekomata-14b},
author = {Zhao, Tianyu and Kaga, Akio and Sawada, Kei},
url = {https://huggingface.co/rinna/nekomata-14b}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
License
Tongyi Qianwen LICENSE AGREEMENT