
🚀 CausalLM 14B - Fully Compatible with Meta LLaMA 2
CausalLM 14B is a powerful language model fully compatible with Meta LLaMA 2. It offers seamless integration with various quantization methods and can be loaded using common transformers libraries, providing high performance in text generation tasks.
Image drawn by GPT-4 DALL·E 3. TL;DR: Perhaps better than all existing models < 70B, in most quantitative evaluations...
🚀 Quick Start
Use the `transformers` library (no remote/external code required) to load the model. You can use `AutoModelForCausalLM` and `AutoTokenizer`, or manually specify `LlamaForCausalLM` to load the language model and `GPT2Tokenizer` to load the tokenizer. Model quantization is fully compatible with GGUF (llama.cpp), GPTQ, and AWQ.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer ("your_model_path" is a placeholder)
model = AutoModelForCausalLM.from_pretrained("your_model_path")
tokenizer = AutoTokenizer.from_pretrained("your_model_path")
```
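Once loaded, generation follows the standard `transformers` API. A minimal sketch (the prompt and sampling settings are illustrative choices, not recommendations from the model card):

```python
# Continues from the loading snippet above.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```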
✨ Features
- Full Compatibility: Fully compatible with Meta LLaMA 2, and model quantization is compatible with GGUF, GPTQ, and AWQ.
- High Performance: In most quantitative evaluations, it may outperform all existing models < 70B.
- DPO Version Leader: The DPO version ranks #1 among ~13B models on the 🤗 Open LLM Leaderboard.
📦 Installation
No model-specific installation is required; a standard `transformers` setup (e.g. `pip install transformers`) is sufficient.
📚 Documentation
Recent Updates
The DPO-α version outperforms Zephyr-β on MT-Bench.
Friendly reminder
If your VRAM is insufficient, use the 7B model instead of a quantized 14B model; compared to the quantized versions, the 7B and 14B versions demonstrate a high level of consistency with each other.
llama.cpp GGUF models
The `GPT2Tokenizer` was fixed by Kerfuffle in https://github.com/ggerganov/llama.cpp/pull/3743, and new models have been re-uploaded. Thanks to TheBloke for the GGUF quants: https://huggingface.co/TheBloke/CausalLM-14B-GGUF
Caution
Unofficial GPTQ and AWQ models may have issues, as they use Wikitext for calibration, while this model has undergone considerable training on a synthesized Wikipedia conversation dataset. It is not recommended to use any form of quantization; instead, use the smaller models, as the 7B and 14B versions have high consistency. However, if you do use model quantization, please use GGUF.
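For reference, loading a GGUF quant with llama-cpp-python looks roughly like the sketch below (the file name is a hypothetical placeholder; substitute whichever quant you actually downloaded):

```python
from llama_cpp import Llama

# "causallm-14b.Q5_K_M.gguf" is a hypothetical file name.
llm = Llama(model_path="causallm-14b.Q5_K_M.gguf", n_ctx=4096)
out = llm("Q: What is the capital of France?\nA:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```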
Read Me
- Model Training: This model was trained based on the model weights of Qwen (and LLaMA2 was used for calculating some initial weights). You may also need to comply with the commercial use restrictions of these two models depending on the situation. The training process utilized a model architecture identical to LLaMA2, using the same attention calculation method as the original MHA LLaMA2 models, and no additional scaling was applied to the Rotary Positional Encoding (RoPE).
- Dataset: We manually curated an SFT dataset of 1.3B tokens for training, utilizing open-source datasets from Hugging Face. For most sentences, we performed manual or synthetic rewrites and generated alternate language versions using larger language models. Additionally, we conducted augmented text training using carefully selected entries from Wikipedia, featured entries from Fandom, and filtered entries from Moegirlpedia. 100% of the data used for training was synthetic; no text from the internet or original texts from publicly available datasets was used directly for fine-tuning.
- 7B Version: The 7B version of the model is a distilled version of the 14B model, specifically designed for speculative sampling (see the assisted-generation sketch after this list). Exercise caution when using the 7B model directly, as it may produce hallucinations or unreliable outputs.
- Safety: The model was trained on unfiltered internet data. Since we do not have the capacity to vet all of it, there may be a substantial amount of objectionable content, pornography, violence, and offensive language present that we are unable to remove. You will therefore still need to perform your own safety checks on the model and filter keywords in the output. Due to computational resource constraints, we are presently unable to implement RLHF for the model's ethics and safety, or to train on SFT samples that refuse to answer certain questions for restrictive fine-tuning.
- Bonus: The model underwent some fine-tuning on the prompt format introduced in LLaVA1.5 that is unrelated to image attention calculation. Therefore, aligning the ViT projection module with the frozen LM under visual instructions would enable rapid implementation of effective multimodal capabilities (a minimal projection sketch follows this list).
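Since the 7B model is distilled for speculative sampling, it can act as a draft model for the 14B target. A minimal sketch using the `transformers` assisted-generation API (the checkpoint paths are hypothetical placeholders; requires a reasonably recent transformers release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths; point these at the 14B target and 7B draft checkpoints.
tokenizer = AutoTokenizer.from_pretrained("path/to/causallm-14b")
target = AutoModelForCausalLM.from_pretrained("path/to/causallm-14b")
draft = AutoModelForCausalLM.from_pretrained("path/to/causallm-7b")

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")
# The draft model proposes tokens; the target model verifies them in parallel.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```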
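The multimodal bonus above amounts to training a projection from a frozen vision encoder into the frozen LM's embedding space, in the spirit of LLaVA. A minimal PyTorch sketch of such a module (the dimensions are illustrative assumptions, not values from the model card):

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Projects frozen ViT patch features into the LM's token-embedding space."""

    # 1024 (ViT-L) and 5120 (a 14B LM hidden size) are illustrative assumptions.
    def __init__(self, vit_dim: int = 1024, lm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Linear(vit_dim, lm_dim)

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vit_dim) -> (batch, num_patches, lm_dim)
        return self.proj(vit_features)
```

Only this module would be trained; the vision encoder and the language model stay frozen.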
PROMPT FORMAT
chatml. The System Prompt must not be empty!
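A minimal sketch of a hand-built ChatML prompt (the system and user texts are illustrative placeholders):

```python
# The system message below is a placeholder, but it must not be left empty.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```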
Evaluation Results
MMLU
| Subject | Accuracy |
|---|---|
| STEM | 64.19 |
| Humanities | 61.40 |
| Other | 71.64 |
| Social | 75.37 |
| Average | 67.36 (outperforms ALL models under 70B; very close to the best 70B fine-tunes) |
CEval (Val)
| Subject | Accuracy |
|---|---|
| STEM | 66.71 |
| Social Science | 85.10 |
| Humanities | 76.68 |
| Other | 70.23 |
| Hard | 54.71 |
| Average | 73.10 (outperforms Qwen-14B and GPT-4) |
GSM8K
Zero-shot ACC: 0.7013 (outperforms MetaMath-13B and Qwen-14B)
AlpacaEval Leaderboard
| Model | win_rate | standard_error | n_wins | n_wins_base | n_draws | n_total | mode | avg_length |
|---|---|---|---|---|---|---|---|---|
| causallm-14b | 88.26087 | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |

Win rate: 88.26% on the AlpacaEval Leaderboard.
MT-Bench on DPO Version
| Model | MT-Bench |
|---|---|
| GPT-4 | 8.99 |
| GPT-3.5-Turbo | 7.94 |
| Zephyr-7b-β (overfitting) | 7.34 |
| Zephyr-7b-α | 6.88 |
| CausalLM/14B-DPO-α | 7.618868 |
| CausalLM/7B-DPO-α | 7.038125 |
Other languages
We are currently unable to produce accurate benchmark templates for non-QA tasks in languages other than English and Chinese. However, we will be working on other language versions of the QA-task challenge in the near future.
Japanese Benchmark
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.8213 | ± 0.0115 |
The JCommonsenseQA benchmark result is very close to that of [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect the cross-language transferability of metalinguistic ability.
🤗 Open LLM Leaderboard
On Dec 3, 2023, the DPO version ranked #1 among non-base models of its size on the 🤗 Open LLM Leaderboard, outperforming ALL ~13B chat models.

📄 License
The model is licensed under the WTFPL license.
🔧 Technical Details
Datasets
| Property | Details |
|---|---|
| Datasets | JosephusCheung/GuanacoDataset, Open-Orca/OpenOrca, stingning/ultrachat, meta-math/MetaMathQA, liuhaotian/LLaVA-Instruct-150K, jondurbin/airoboros-3.1, WizardLM/WizardLM_evol_instruct_V2_196k, RyokoAI/ShareGPT52K, RyokoAI/Fandom23K, milashkaarshif/MoeGirlPedia_wikitext_raw_archive, wikipedia, wiki_lingua, fnlp/moss-003-sft-data, garage-bAInd/Open-Platypus, LDJnr/Puffin, openbmb/llava_zh, BAAI/COIG, TigerResearch/tigerbot-zhihu-zh-10k, liwu/MNBVC, teknium/openhermes |
| Language | en, zh |
| Pipeline Tag | text-generation |
| Tags | llama, llama2, qwen, causallm |
Model Architecture
The training process used a model architecture identical to LLaMA2, with the same attention calculation method as the original MHA LLaMA2 models and no additional scaling applied to the Rotary Positional Encoding (RoPE).
Data Processing
We manually curated an SFT dataset of 1.3B tokens for training. For most sentences, we performed manual or synthetic rewrites and generated alternate language versions using larger language models. 100% of the data used for training was synthetic.
⚠️ Important Note
- The model was trained on unfiltered internet data, which may contain a large amount of objectionable content. You need to conduct your own safety checks and keyword filtering.
- Due to computational resource constraints, RLHF for ethics and safety and training on SFT samples for restrictive fine-tuning are not currently implemented.
💡 Usage Tip
- If your VRAM is insufficient, use the 7B model instead of the quantized version.
- When using the 7B model, be cautious as it may produce hallucinations or unreliable outputs.

