🚀 Ziya-LLaMA-13B-Pretrain-v1
Ziya-LLaMA-13B-Pretrain-v1 is a large-scale pre-trained model based on LLaMA. Its tokenizer is optimized for Chinese, further enhancing the model's Chinese generation and understanding capabilities.
🚀 Quick Start
Due to the license restrictions on the LLaMA weights, this model cannot be used for commercial purposes. Please strictly adhere to the LLaMA usage policy. Because we cannot directly release the complete model weights, we used the open-source FastChat tool as a basis, further optimized it, and computed and released the difference (delta) between the Ziya-LLaMA-13B-v1 weights and the original LLaMA weights. Users can follow the steps below to obtain the complete weights of Ziya-LLaMA-13B-v1:
Step 1: Obtain the LLaMA weights and convert them into the Hugging Face Transformers format using the conversion script below (skip this step if you already have the Hugging Face-format weights).
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir /path/to/downloaded/llama/weights --model_size 13B --output_dir /output/path
Step 2: Download the delta weights of Ziya-LLaMA-13B-v1 and apply them to the original LLaMA weights converted in step 1, using the conversion script https://github.com/IDEA-CCNL/Fengshenbang-LM/blob/main/fengshen/utils/apply_delta.py:
python3 -m apply_delta --base ~/model_weights/llama-13b --target ~/model_weights/Ziya-LLaMA-13B --delta ~/model_weights/Ziya-LLaMA-13B-v1
Step 3: Load the model obtained in step 2 for inference.
import torch
from transformers import AutoTokenizer
from transformers import LlamaForCausalLM

device = torch.device("cuda")
ckpt = "/path/to/Ziya-LLaMA-13B"  # path to the complete weights obtained in step 2
query = "帮我写一份去西安的旅游计划"  # "Help me write a travel plan for a trip to Xi'an"

model = LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt)

inputs = query.strip()
input_ids = tokenizer(inputs, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.85,
    temperature=1.0,
    repetition_penalty=1.0,
    eos_token_id=2,
    bos_token_id=1,
    pad_token_id=0)
output = tokenizer.batch_decode(generate_ids)[0]
print(output)
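If you only want the model's continuation without the echoed prompt, one optional refinement (a small sketch that is not part of the original example) is to decode only the newly generated tokens:
# Optional: decode only the tokens generated after the prompt
new_tokens = generate_ids[0, input_ids.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))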
✨ Features
- Optimized for Chinese: Optimized the LLaMA tokenizer for Chinese by adding 7k+ common Chinese characters to the LLaMA vocabulary, significantly improving the efficiency of encoding and decoding Chinese text (illustrated by the sketch after this list).
- Large-scale Incremental Training: Incrementally trained the native LLaMA-13B model on 110 billion tokens of data, the largest-scale incremental training on LLaMA-13B so far.
- Enhanced Performance: The Ziya-LLaMA-13B-v1 model, built on this pre-trained model, can perform tasks such as translation, programming, text classification, information extraction, summarization, copywriting, common sense Q&A, and mathematical calculation.
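To illustrate the tokenizer improvement, the sketch below compares how many tokens the original LLaMA tokenizer and the extended Ziya tokenizer need for the same Chinese sentence. The local paths are assumptions pointing at the weights produced in steps 1 and 2 of the Quick Start, and the exact counts depend on your copies of the tokenizers.
from transformers import AutoTokenizer

# Assumed local paths: step 1 output (original LLaMA) and step 2 output (Ziya)
llama_tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-13b-hf")
ziya_tokenizer = AutoTokenizer.from_pretrained("/path/to/Ziya-LLaMA-13B")

text = "帮我写一份去西安的旅游计划"  # "Help me write a travel plan for a trip to Xi'an"

# The extended Chinese vocabulary should encode the same text into fewer tokens
print("LLaMA tokens:", len(llama_tokenizer(text).input_ids))
print("Ziya tokens: ", len(ziya_tokenizer(text).input_ids))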
💻 Usage Examples
Basic Usage
The steps to obtain the complete model weights and perform inference are described in the "Quick Start" section.
📚 Documentation
Brief Introduction
Ziya-LLaMA-13B-Pretrain-v1 is a large-scale pre-trained model based on LLaMA with 13 billion parameters. We optimized the LLaMA tokenizer for Chinese and incrementally trained the LLaMA-13B model on 110 billion tokens of data, which significantly improved its Chinese understanding and generation ability. Based on Ziya-LLaMA-13B-Pretrain-v1, Ziya-LLaMA-13B-v1 is further trained in two stages: multi-task supervised fine-tuning (SFT) and human feedback learning (RM, PPO). Ziya-LLaMA-13B-v1 can perform tasks such as translation, programming, text classification, information extraction, summarization, copywriting, common sense Q&A, and mathematical calculation.
Note: To comply with the LLaMA license released by Meta, we only release the incremental (delta) weights obtained after continual pretraining. The complete Ziya-LLaMA-13B-Pretrain-v1 model can easily be obtained via the script above (refer to the Quick Start section).
Model Taxonomy
| Property | Details |
| --- | --- |
| Demand | General |
| Task | AGI Model |
| Series | Ziya |
| Model | LLaMA |
| Parameter | 13B |
| Extra | English & Chinese |
Model Information
Continual Pretraining
The original data contains both English and Chinese, with English data from openwebtext, Books, Wikipedia, and Code, and Chinese data from the cleaned Wudao dataset and a self-built Chinese dataset. After deduplication, model scoring, data bucketing, rule filtering, sensitive-topic filtering, and data evaluation, we finally obtained 125 billion tokens of data.
To address the low efficiency of Chinese encoding and decoding caused by the original LLaMA tokenizer, we added 8,000 commonly used Chinese characters to the LLaMA SentencePiece vocabulary. After deduplicating against the original LLaMA vocabulary, we obtained a final vocabulary of size 39,410. We achieved this by reusing the LlamaTokenizer in Transformers.
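A quick sanity check (a hedged sketch; the local path is an assumption pointing at the complete weights from step 2 of the Quick Start) is to load the tokenizer and confirm the extended vocabulary size:
from transformers import AutoTokenizer

# Assumed local path to the complete Ziya weights obtained in step 2
tokenizer = AutoTokenizer.from_pretrained("/path/to/Ziya-LLaMA-13B")

# Per the description above, the extended vocabulary should contain 39,410 entries
print(len(tokenizer))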
During the incremental training process, we used 160 A100 GPUs (40GB each) with a batch size of 2.6 million tokens and FP16 mixed precision, reaching a throughput of 118 TFLOPS per GPU. As a result, we were able to incrementally train the native LLaMA-13B model on 110 billion tokens of data in just 8 days. As far as we know, this is also the largest-scale incremental training on LLaMA-13B to date.
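As a rough plausibility check (a back-of-the-envelope sketch using the common 6 x parameters x tokens approximation for training FLOPs, not a figure from the original report), the reported throughput and token count are consistent with a training run on the order of a week:
# Back-of-the-envelope estimate using the common 6 * params * tokens approximation
params = 13e9            # 13B parameters
tokens = 110e9           # 110B training tokens
gpus = 160               # A100 GPUs
flops_per_gpu = 118e12   # reported sustained throughput per GPU (118 TFLOPS)

total_flops = 6 * params * tokens                  # ~8.6e21 FLOPs
seconds = total_flops / (gpus * flops_per_gpu)
print(f"~{seconds / 86400:.1f} days of pure compute")  # ~5.3 days, before overheads and restarts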
Throughout the training process, we encountered various issues such as machine crashes, underlying framework bugs, and loss spikes. However, we ensured the stability of the incremental training by making rapid adjustments. We also released the loss curve during the training process to help everyone understand the potential issues that may arise.
Performance
Below are comparisons between the Ziya-LLaMA-13B-Pretrain-v1 model and the LLaMA model before continual pre-training, evaluated on the English HELM benchmark and on our Chinese multiple-choice evaluation datasets.
| Model | Mean win rate | MMLU | BoolQ | NarrativeQA | NaturalQuestions (closed-book) | NaturalQuestions (open-book) | QuAC | TruthfulQA | IMDB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-13B | 0.500 | 0.424 | 0.718 | 0.440 | 0.349 | 0.591 | 0.318 | 0.326 | 0.487 |
| Ziya-LLaMA-13B-Pretrain-v1 | 0.650 | 0.433 | 0.753 | 0.445 | 0.348 | 0.528 | 0.335 | 0.249 | 0.497 |
| Model | In-context | C3 | Common Sense | Chinese | Math | English | Physics | Chemistry | Biology | History | Politics | Geography |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-13B | 0-shot | 0.4817 | 0.3088 | 0.2674 | 0.2882 | 0.3399 | 0.2581 | 0.2478 | 0.2271 | 0.3380 | 0.3275 | 0.296 |
| Ziya-LLaMA-13B-Pretrain-v1 | 0-shot | 0.5354 | 0.3373 | 0.2925 | 0.3059 | 0.3428 | 0.2903 | 0.2655 | 0.3215 | 0.4190 | 0.4123 | 0.4425 |
| LLaMA-13B | 5-shot | 0.5314 | 0.3586 | 0.2813 | 0.2912 | 0.4476 | 0.2939 | 0.2301 | 0.2330 | 0.3268 | 0.3187 | 0.3103 |
| Ziya-LLaMA-13B-Pretrain-v1 | 5-shot | 0.6037 | 0.4330 | 0.2802 | 0.2912 | 0.4363 | 0.2975 | 0.2802 | 0.3422 | 0.4358 | 0.4357 | 0.4540 |
Finetune Example
Refer to ziya_finetune
Inference & Quantization Example
Refer to ziya_inference
Citation
If you are using the resource for your work, please cite our paper:
@article{fengshenbang,
author = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
title = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
journal = {CoRR},
volume = {abs/2209.02970},
year = {2022}
}
You can also cite our website:
@misc{Fengshenbang-LM,
title={Fengshenbang-LM},
author={IDEA-CCNL},
year={2021},
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
📄 License
The model is licensed under GPL-3.0.