Stockmark 13b
Stockmark-13b is a 13-billion-parameter large language model pre-trained from scratch on a corpus of approximately 220 billion Japanese tokens, developed by Stockmark Inc.
Release Date: 10/21/2023
Model Overview
This is a large language model specialized in Japanese language processing, suitable for natural language processing tasks such as text generation.
Model Features
Large-scale Japanese Pre-training
Pre-trained from scratch on a corpus of approximately 220 billion Japanese tokens, with a focus on Japanese language processing capabilities
AWS Trainium Support
Developed with support from AWS's Large Language Model Development Support Program and trained on AWS Trainium accelerators
Quantization Support
Supports 8-bit quantization, allowing inference on GPUs such as the NVIDIA T4 or V100
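The 8-bit quantization mentioned above typically stores each weight as an int8 value plus a floating-point scale. As a minimal illustration (not the model's actual implementation, which relies on the bitsandbytes library), here is absmax int8 quantization of a small weight list in plain Python:

```python
def quantize_absmax_int8(weights):
    """Map floats into the signed 8-bit range [-127, 127] using a per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the stored scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27]
q, scale = quantize_absmax_int8(weights)
# The largest-magnitude weight (1.27) maps to 127; others scale proportionally.
```

Storing weights this way roughly quarters memory use versus float32, which is what lets a 13B-parameter model fit on a single T4 or V100.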
Model Capabilities
Japanese text generation
Natural language understanding
In-context learning
Use Cases
Natural Language Processing
Japanese Text Generation
Generate coherent Japanese text, for example by sampling up to 128 new tokens from a prompt
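A generation setup along these lines can be sketched with the Hugging Face transformers library. This is a hedged example: the model id `stockmark/stockmark-13b`, the dtype, and the sampling parameters are assumptions, not values confirmed by this page.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model id for Stockmark-13b.
MODEL_ID = "stockmark/stockmark-13b"
MAX_NEW_TOKENS = 128  # matches the 128-new-token generation described above

def generate(prompt: str) -> str:
    """Load the model and generate a Japanese continuation for the prompt.

    Note: loading a 13B-parameter model requires a GPU with sufficient
    memory (or 8-bit quantization via bitsandbytes, as noted earlier).
    """
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,  # assumed dtype; adjust for your hardware
        device_map="auto",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            temperature=0.7,  # assumed sampling settings
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Calling `generate("自然言語処理とは、")` would return the prompt followed by up to 128 newly generated tokens of Japanese text.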
Technical Document Processing
Process technical documents such as patent literature