Sarashina2-13B
A large language model trained by SB Intuitions, supporting Japanese and English, based on the Llama2 architecture
Downloads 1,167
Release Date: 6/7/2024
Model Overview
Sarashina2-13B is a large language model based on the Llama2 architecture that supports Japanese and English text generation. It was trained on 2.1 trillion tokens and exhibits strong language understanding and generation capabilities.
Model Features
Multilingual Support
Processes both Japanese and English; raw Japanese text can be tokenized directly, without pre-tokenization such as morphological analysis
Large-scale Training
Trained on 2.1 trillion tokens, demonstrating robust language understanding and generation capabilities
Efficient Tokenization
Uses a SentencePiece tokenizer based on a unigram language model, with a byte-fallback mechanism for characters outside the vocabulary
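The byte-fallback idea can be illustrated with a small sketch: any character missing from the vocabulary is re-encoded as its UTF-8 bytes, so no input ever maps to an unknown token. The toy vocabulary and function below are illustrative only, not the model's actual tokenizer.

```python
def tokenize_with_byte_fallback(text, vocab):
    """Emit a vocab token per character, or <0xNN> byte tokens as fallback.

    Toy illustration of SentencePiece-style byte fallback; the real
    tokenizer operates on subword pieces, not single characters.
    """
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Fall back to one token per UTF-8 byte, e.g. <0xE3>.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

vocab = {"日", "本", "語"}
print(tokenize_with_byte_fallback("日本語😀", vocab))
# → ['日', '本', '語', '<0xF0>', '<0x9F>', '<0x98>', '<0x80>']
```

The emoji is absent from the toy vocabulary, so it is represented losslessly as its four UTF-8 byte tokens instead of an `<unk>` token.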
Model Capabilities
Japanese Text Generation
English Text Generation
Multi-turn Dialogue
Text Continuation
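A minimal text-generation sketch using the Hugging Face `transformers` library, assuming the model is published on the Hub under the repo id `sbintuitions/sarashina2-13b` (adjust if the actual id differs). Loading the 13B weights requires substantial GPU memory, so the heavy work is kept inside a function rather than run at import time.

```python
# Assumed Hub repo id for this model; verify against the actual listing.
MODEL_ID = "sbintuitions/sarashina2-13b"

def generate(prompt, max_new_tokens=64):
    """Continue `prompt` with the model.

    Calling this downloads and loads the ~13B-parameter weights,
    so it is only practical on a machine with a large GPU.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example call (Japanese or English prompts both work):
#   generate("日本の首都は")
```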
Use Cases
Content Creation
Article Continuation
Automatically generates coherent follow-on content from an opening paragraph
Dialogue System
Building multi-turn chatbots
Capable of basic conversational interactions
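As a base (not instruction-tuned) model, Sarashina2-13B would typically be driven through a plain-text prompt template for dialogue. The speaker labels and format below are a hypothetical sketch, not an official chat template.

```python
def build_dialogue_prompt(turns, user_label="User:", bot_label="Assistant:"):
    """Flatten (speaker, text) turns into a plain-text dialogue prompt.

    The labels are illustrative: a base model has no official chat
    template, so any consistent format the prompt establishes works.
    """
    lines = []
    for speaker, text in turns:
        label = user_label if speaker == "user" else bot_label
        lines.append(f"{label} {text}")
    lines.append(bot_label)  # cue the model to produce the next reply
    return "\n".join(lines)

prompt = build_dialogue_prompt([
    ("user", "こんにちは!"),
    ("bot", "こんにちは。ご用件は何でしょうか?"),
    ("user", "東京の天気を教えて。"),
])
print(prompt)
```

The resulting string ends with the assistant label, so feeding it to the model's text-continuation interface yields the next reply in the conversation.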
Education
Language Learning Assistance
Helps Japanese or English learners practice writing
Provides language demonstrations and feedback