rinna/nekomata-14b
This project performs continual pre-training of qwen-14b on a mixture of Japanese and English datasets, significantly improving the model's performance on Japanese tasks while retaining the strengths of the original Qwen model.
Quick Start
The rinna/nekomata-14b model is the result of continual pre-training of qwen-14b on a blend of Japanese and English datasets. Here is a basic example of how to use it:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is required: the Qwen architecture and tokenizer are
# provided as custom code in the model repository.
tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-14b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("rinna/nekomata-14b", device_map="auto", trust_remote_code=True)

text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
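If GPU memory is tight, the weights can also be loaded in half precision. This is a generic transformers loading option rather than anything stated in this model card, so treat the following as an optional, hedged variant of the loading step above (bfloat16 assumes suitable hardware support; float16 is an alternative):

import torch
from transformers import AutoModelForCausalLM

# Optional: load the weights in bfloat16 to roughly halve GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "rinna/nekomata-14b",
    device_map="auto",
    torch_dtype=torch.bfloat16,   # assumption: the deployment GPU supports bfloat16
    trust_remote_code=True,
)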
✨ Features
- Efficient Japanese Text Processing: The inclusive Qwen vocabulary (vocab size > 150k) enables the model to process Japanese texts much more efficiently than the previously released youri series.
- Long Sequence Support: The model supports a maximum sequence length of 8192 tokens (a sketch for keeping prompts within this limit follows this list).
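As a rough sketch of staying within that limit (not part of the original card), the snippet below trims an over-long prompt to fit the 8192-token context before generation. It assumes the tokenizer and model objects (and the torch import) from the Quick Start example, and `long_text` is a hypothetical placeholder:

# Assumes tokenizer and model are loaded as in the Quick Start example.
MAX_CONTEXT = 8192        # maximum sequence length supported by the model
MAX_NEW_TOKENS = 200      # generation budget

long_text = "..."  # hypothetical placeholder for a document that may exceed the context window
token_ids = tokenizer.encode(long_text, add_special_tokens=False, return_tensors="pt")

# Keep only the most recent tokens so that prompt + generated tokens fit in the context window.
budget = MAX_CONTEXT - MAX_NEW_TOKENS
if token_ids.shape[1] > budget:
    token_ids = token_ids[:, -budget:]

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids.tolist()[0]))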
📦 Installation
The model is used through the transformers library, which can be installed with the following command:
pip install transformers
Note that loading with device_map="auto" additionally requires the accelerate package, and the Qwen tokenizer code loaded via trust_remote_code depends on tiktoken, so install those as well if they are not already available.
Documentation
Overview
We conduct continual pre-training of qwen-14b on 66B tokens from a mixture of Japanese and English datasets. The continual pre-training significantly improves the model's performance on Japanese tasks. The model also retains the features inherited from the original Qwen model that are listed in the Features section above.
The name nekomata comes from the Japanese word 猫又/ねこまた/Nekomata, which is a kind of Japanese mythical creature (妖怪/ようかい/Youkai).
Model Details
| Property | Details |
|---|---|
| Model Type | A 40-layer, 5120-hidden-size transformer-based language model |
| Training Data | A mixture of Japanese and English datasets including [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz), Japanese C4, [Japanese OSCAR](https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0), The Pile, Wikipedia, and rinna curated Japanese dataset |
| Library | The model was trained using code based on [aws-neuron/neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron/) |
| Training Infrastructure | nekomata-14b was trained on 16 nodes of Amazon EC2 trn1.32xlarge instances powered by AWS Trainium purpose-built ML accelerator chips. The pre-training job was completed within approximately 7 days. |
| Contributors | Tianyu Zhao, Akio Kaga, Kei Sawada |
| Release date | December 21, 2023 |
Benchmarking
Please refer to rinna's LM benchmark page (Sheet 20231221).
Tokenization
The model uses the original Qwen tokenizer. It augments the cl100k tiktoken tokenizer and has a vocabulary size of 151,936. The inclusive vocabulary helps the model achieve better tokenization efficiency, especially for Japanese texts.
We compared the Qwen tokenizer (as used in nekomata) and the llama-2 tokenizer (as used in youri) on different text collections and found that the Qwen tokenizer achieves a much better byte2token rate (i.e. the average number of tokens produced from 1 byte of text), as shown in the table below. A lower byte2token rate indicates better tokenization efficiency.
| Tokenizer | Japanese | English | Multilingual |
|---|---|---|---|
| Qwen | 0.24 | 0.27 | 0.27 |
| llama-2 | 0.40 | 0.29 | 0.36 |
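The byte2token rate can be approximated for any text sample with a few lines of code. The sketch below uses only the nekomata tokenizer (comparing against llama-2 would additionally require access to that tokenizer), and `sample_text` is a hypothetical stand-in for whichever corpus slice you want to measure:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rinna/nekomata-14b", trust_remote_code=True)

# Hypothetical sample; replace with text drawn from the collection you want to measure.
sample_text = "吾輩は猫である。名前はまだ無い。"

num_tokens = len(tokenizer.encode(sample_text, add_special_tokens=False))
num_bytes = len(sample_text.encode("utf-8"))

# byte2token rate: average number of tokens produced per byte of UTF-8 text (lower is better).
print(f"byte2token rate: {num_tokens / num_bytes:.2f}")

Lower values mean more text fits into the 8192-token context for the same byte budget.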
How to cite
@misc{rinna-nekomata-14b,
title = {rinna/nekomata-14b},
author = {Zhao, Tianyu and Kaga, Akio and Sawada, Kei},
url = {https://huggingface.co/rinna/nekomata-14b}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}
License
Tongyi Qianwen LICENSE AGREEMENT