
🚀 CausalLM 14B - Fully Compatible with Meta LLaMA 2
CausalLM 14B is a powerful language model fully compatible with Meta LLaMA 2. It offers seamless integration with various quantization methods and can be loaded using common transformers libraries, providing high performance in text generation tasks.
Image drawn by GPT-4 DALL·E 3. TL;DR: Perhaps better than all existing models < 70B, in most quantitative evaluations...
🚀 Quick Start
Use the `transformers` library (no remote/external code required) to load the model. You can use `AutoModelForCausalLM` and `AutoTokenizer`, or manually specify `LlamaForCausalLM` to load the language model and `GPT2Tokenizer` to load the tokenizer. Model quantization is fully compatible with GGUF (llama.cpp), GPTQ, and AWQ.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer ("your_model_path" is a placeholder)
model = AutoModelForCausalLM.from_pretrained("your_model_path")
tokenizer = AutoTokenizer.from_pretrained("your_model_path")
```
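Once loaded, generation follows the standard `transformers` API. A minimal sketch (the prompt and sampling settings are illustrative choices, not recommendations from the model card):

```python
# Continues from the loading snippet above.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```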
✨ Features
- Full Compatibility: Fully compatible with Meta LLaMA 2, and model quantization is compatible with GGUF, GPTQ, and AWQ.
- High Performance: In most quantitative evaluations, it may outperform all existing models < 70B.
- DPO Version Leader: The DPO version ranks #1 among ~13B models on the 🤗 Open LLM Leaderboard.
📦 Installation
No model-specific installation is required; a standard `transformers` setup (e.g. `pip install transformers`) is sufficient.
📚 Documentation
Recent Updates
The DPO-α version outperforms Zephyr-β on MT-Bench.
Friendly reminder
If your VRAM is insufficient, use the 7B model instead of a quantized 14B model; compared to the quantized versions, the 7B and 14B versions demonstrate a high level of consistency with each other.
llama.cpp GGUF models
The `GPT2Tokenizer` was fixed by Kerfuffle in https://github.com/ggerganov/llama.cpp/pull/3743, and new models have been re-uploaded. Thanks to TheBloke for the GGUF quants: https://huggingface.co/TheBloke/CausalLM-14B-GGUF
Caution
Unofficial GPTQ and AWQ models may have issues, as they use Wikitext for calibration, while this model has undergone considerable training on a synthesized Wikipedia conversation dataset. It is not recommended to use any form of quantization; instead, use the smaller models, as the 7B and 14B versions have high consistency. However, if you do use model quantization, please use GGUF.
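For reference, loading a GGUF quant with llama-cpp-python looks roughly like the sketch below (the file name is a hypothetical placeholder; substitute whichever quant you actually downloaded):

```python
from llama_cpp import Llama

# "causallm-14b.Q5_K_M.gguf" is a hypothetical file name.
llm = Llama(model_path="causallm-14b.Q5_K_M.gguf", n_ctx=4096)
out = llm("Q: What is the capital of France?\nA:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```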
Read Me
- Model Training: This model was trained based on the model weights of Qwen (and LLaMA2 was used for calculating some initial weights). You may also need to comply with the commercial use restrictions of these two models depending on the situation. The training process utilized a model architecture identical to LLaMA2, using the same attention calculation method as the original MHA LLaMA2 models, and no additional scaling was applied to the Rotary Positional Encoding (RoPE).
- Dataset: We manually curated an SFT dataset of 1.3B tokens for training, utilizing open-source datasets from Hugging Face. For most sentences, we performed manual or synthetic rewrites and generated alternate language versions using larger language models. Additionally, we conducted augmented text training using carefully selected entries from Wikipedia, featured entries from Fandom, and filtered entries from Moegirlpedia. 100% of the data used for training was synthetic; no text from the internet or original texts from publicly available datasets was used directly for fine-tuning.
- 7B Version: The 7B version of the model is a distilled version of the 14B model, specifically designed for speculative sampling (see the assisted-generation sketch after this list). Exercise caution when using the 7B model directly, as it may produce hallucinations or unreliable outputs.
- Safety: The model was trained on unfiltered internet data. Since we do not have the capacity to vet all of it, there may be a substantial amount of objectionable content, pornography, violence, and offensive language present that we are unable to remove. You will therefore still need to perform your own safety checks on the model and filter keywords in the output. Due to computational resource constraints, we are presently unable to implement RLHF for the model's ethics and safety, or to train on SFT samples that refuse to answer certain questions for restrictive fine-tuning.
- Bonus: The model underwent some fine-tuning on the prompt format introduced in LLaVA1.5 that is unrelated to image attention calculation. Therefore, aligning the ViT projection module with the frozen LM under visual instructions would enable rapid implementation of effective multimodal capabilities (a minimal projection sketch follows this list).
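Since the 7B model is distilled for speculative sampling, it can act as a draft model for the 14B target. A minimal sketch using the `transformers` assisted-generation API (the checkpoint paths are hypothetical placeholders; requires a reasonably recent transformers release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths; point these at the 14B target and 7B draft checkpoints.
tokenizer = AutoTokenizer.from_pretrained("path/to/causallm-14b")
target = AutoModelForCausalLM.from_pretrained("path/to/causallm-14b")
draft = AutoModelForCausalLM.from_pretrained("path/to/causallm-7b")

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")
# The draft model proposes tokens; the target model verifies them in parallel.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```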
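The multimodal bonus above amounts to training a projection from a frozen vision encoder into the frozen LM's embedding space, in the spirit of LLaVA. A minimal PyTorch sketch of such a module (the dimensions are illustrative assumptions, not values from the model card):

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Projects frozen ViT patch features into the LM's token-embedding space."""

    # 1024 (ViT-L) and 5120 (a 14B LM hidden size) are illustrative assumptions.
    def __init__(self, vit_dim: int = 1024, lm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Linear(vit_dim, lm_dim)

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vit_dim) -> (batch, num_patches, lm_dim)
        return self.proj(vit_features)
```

Only this module would be trained; the vision encoder and the language model stay frozen.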
PROMPT FORMAT
chatml. The System Prompt must not be empty!
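A minimal sketch of a hand-built ChatML prompt (the system and user texts are illustrative placeholders):

```python
# The system message below is a placeholder, but it must not be left empty.
prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```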
Evaluation Results
MMLU
| Subject | Accuracy |
|---|---|
| STEM | 64.19 |
| Humanities | 61.40 |
| Other | 71.64 |
| Social | 75.37 |
| Average | 67.36 (outperforms ALL models under 70B; very close to the best 70B fine-tunes) |
CEval (Val)
| Subject | Accuracy |
|---|---|
| STEM | 66.71 |
| Social Science | 85.10 |
| Humanities | 76.68 |
| Other | 70.23 |
| Hard | 54.71 |
| Average | 73.10 (outperforms Qwen-14B and GPT-4) |
GSM8K
Zero-shot ACC: 0.7013 (outperforms MetaMath-13B and Qwen-14B)
AlpacaEval Leaderboard
| Model | win_rate | standard_error | n_wins | n_wins_base | n_draws | n_total | mode | avg_length |
|---|---|---|---|---|---|---|---|---|
| causallm-14b | 88.26087 | 1.116333 | 705 | 89 | 11 | 805 | community | 1391 |

Win rate: 88.26% on the AlpacaEval Leaderboard.
MT-Bench on DPO Version
| Model | MT-Bench |
|---|---|
| GPT-4 | 8.99 |
| GPT-3.5-Turbo | 7.94 |
| Zephyr-7b-β (overfitting) | 7.34 |
| Zephyr-7b-α | 6.88 |
| CausalLM/14B-DPO-α | 7.618868 |
| CausalLM/7B-DPO-α | 7.038125 |
Other languages
We are currently unable to produce accurate benchmark templates for non-QA tasks in languages other than English and Chinese. However, we will be working on other language versions of the QA-task challenge in the near future.
Japanese Benchmark
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| jcommonsenseqa-1.1-0.6 | 1.1 | acc | 0.8213 | ± 0.0115 |
The JCommonsenseQA benchmark result is very close to that of [Japanese Stable LM Gamma 7B (83.47)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable), the current SOTA Japanese LM, even though our model was not trained on a particularly large amount of Japanese text. This seems to reflect the cross-language transferability of metalinguistic ability.
🤗 Open LLM Leaderboard
On Dec 3, 2023, the DPO version ranked #1 among non-base models of its size on the 🤗 Open LLM Leaderboard, outperforming ALL ~13B chat models.

📄 License
The model is licensed under the WTFPL license.
🔧 Technical Details
Datasets
| Property | Details |
|---|---|
| Datasets | JosephusCheung/GuanacoDataset, Open-Orca/OpenOrca, stingning/ultrachat, meta-math/MetaMathQA, liuhaotian/LLaVA-Instruct-150K, jondurbin/airoboros-3.1, WizardLM/WizardLM_evol_instruct_V2_196k, RyokoAI/ShareGPT52K, RyokoAI/Fandom23K, milashkaarshif/MoeGirlPedia_wikitext_raw_archive, wikipedia, wiki_lingua, fnlp/moss-003-sft-data, garage-bAInd/Open-Platypus, LDJnr/Puffin, openbmb/llava_zh, BAAI/COIG, TigerResearch/tigerbot-zhihu-zh-10k, liwu/MNBVC, teknium/openhermes |
| Language | en, zh |
| Pipeline Tag | text-generation |
| Tags | llama, llama2, qwen, causallm |
Model Architecture
The training process used a model architecture identical to LLaMA2, with the same attention calculation method as the original MHA LLaMA2 models and no additional scaling applied to the Rotary Positional Encoding (RoPE).
Data Processing
We manually curated an SFT dataset of 1.3B tokens for training. For most sentences, we performed manual or synthetic rewrites and generated alternate language versions using larger language models. 100% of the data used for training was synthetic.
⚠️ Important Note
- The model was trained on unfiltered internet data, which may contain a large amount of objectionable content. You need to conduct your own safety checks and keyword filtering.
- Due to computational resource constraints, RLHF for ethics and safety and training on SFT samples for restrictive fine-tuning are not currently implemented.
💡 Usage Tip
- If your VRAM is insufficient, use the 7B model instead of the quantized version.
- When using the 7B model, be cautious as it may produce hallucinations or unreliable outputs.

