🚀 CausalLM 7B - Fully Compatible with Meta LLaMA 2
CausalLM 7B is a powerful model fully compatible with Meta LLaMA 2, offering high-performance text generation and strong results across multiple evaluations.
📄 License
The model is licensed under the WTFPL license.
📦 Datasets
The model was trained using the following datasets:
- JosephusCheung/GuanacoDataset
- Open-Orca/OpenOrca
- stingning/ultrachat
- meta-math/MetaMathQA
- liuhaotian/LLaVA-Instruct-150K
- jondurbin/airoboros-3.1
- WizardLM/WizardLM_evol_instruct_V2_196k
- RyokoAI/ShareGPT52K
- RyokoAI/Fandom23K
- milashkaarshif/MoeGirlPedia_wikitext_raw_archive
- wikipedia
- wiki_lingua
- fnlp/moss-003-sft-data
- garage-bAInd/Open-Platypus
- LDJnr/Puffin
- openbmb/llava_zh
- BAAI/COIG
- TigerResearch/tigerbot-zhihu-zh-10k
- liwu/MNBVC
- teknium/openhermes
🌐 Language
The model supports the following languages:
🚀 Quick Start
Use the `transformers` library to load the model; no remote/external code is required. You can use `AutoModelForCausalLM` and `AutoTokenizer`, or manually specify `LlamaForCausalLM` for the language model and `GPT2Tokenizer` for the tokenizer. Model quantization is fully compatible with GGUF (llama.cpp), GPTQ, and AWQ.
✨ Features
Recent Updates
The [DPO-α Version](https://huggingface.co/CausalLM/7B-DPO-alpha) outperforms Zephyr-β on MT-Bench.
llama.cpp GGUF models
The `GPT2Tokenizer` was fixed by Kerfuffle in https://github.com/ggerganov/llama.cpp/pull/3743, and new models have been re-uploaded. Thanks to TheBloke for the GGUF quantized models: [https://huggingface.co/TheBloke/CausalLM-7B-GGUF](https://huggingface.co/TheBloke/CausalLM-7B-GGUF).
Training Details
The model was trained based on the weights of Qwen (and LLaMA2 weights were also used for some initial weight calculations). You may need to comply with the commercial use restrictions of these two models depending on the situation. The training process used the same model architecture as LLaMA2, the same attention calculation method as the original MHA LLaMA2 models, and no additional scaling was applied to the Rotary Positional Encoding (RoPE).
Dataset Curation
A manually curated SFT dataset of 1.3B tokens was used for training, leveraging open-source datasets from Hugging Face. Most sentences were manually or synthetically rewritten, and alternate-language versions were generated with larger language models. Augmented text training was also conducted using carefully selected Wikipedia entries, featured Fandom entries, and filtered Moegirlpedia entries. To balance efficiency and quality, 100% of the training data was synthetic; no internet text or original text from publicly available datasets was used directly for fine-tuning.
Model Distillation
The 7B version of the model is a distilled version of the 14B model, specifically designed for speculative sampling. Caution should be exercised when directly using the model, as it may produce hallucinations or unreliable outputs.
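To illustrate the speculative-sampling role mentioned above, here is a simplified, greedy sketch of the idea: a cheap draft model proposes several tokens, and the target model verifies them, accepting the matching prefix. The toy next-token functions below are placeholders for real models, and real implementations accept or reject against probability ratios rather than greedy matches.

```python
def greedy_speculative_decode(target_next, draft_next, prompt, k=4, max_new=8):
    """Greedy speculative decoding sketch.

    target_next/draft_next: callables mapping a token sequence to the next token.
    The draft proposes k tokens; the target verifies them (in practice, in one
    parallel forward pass). Matching tokens are accepted; on the first mismatch,
    the target's own token is taken instead and verification stops.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        draft_seq = list(seq)
        for _ in range(k):
            draft_seq.append(draft_next(draft_seq))
        proposals = draft_seq[len(seq):]

        # Target model verifies each proposed position (expensive, but batched).
        accepted = []
        for tok in proposals:
            t = target_next(seq + accepted)
            if t == tok:
                accepted.append(tok)      # draft and target agree: accept
            else:
                accepted.append(t)        # disagree: take target's token, stop
                break
        seq.extend(accepted)
    return seq[:len(prompt) + max_new]
```

When the draft model agrees with the target often (as a distilled model should), most proposed tokens are accepted and the target model runs far fewer sequential steps.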
Safety Considerations
The model was trained on unfiltered internet data. Since we cannot vet all of it, there may be a substantial amount of objectionable content, pornography, violence, and offensive language that we are unable to remove. You will still need to check the model's output for safety and filter keywords. Due to computational resource constraints, we are currently unable to implement RLHF for the model's ethics and safety, nor to train on SFT samples that refuse to answer certain questions for restrictive fine-tuning.
Multimodal Potential
The model underwent some fine-tuning on the prompt format introduced in LLaVA-1.5, unrelated to image attention calculation. Aligning the ViT projection module with a frozen LM under visual instructions would enable rapid implementation of effective multimodal capabilities.
💡 Usage Tip
PROMPT FORMAT
Use the [ChatML](https://github.com/openai/openai-python/blob/main/chatml.md) format. The System Prompt must not be empty!
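A small helper sketch for assembling a ChatML prompt (the function name is illustrative; the `<|im_start|>`/`<|im_end|>` markers follow the ChatML format, and the check reflects the card's requirement that the system prompt not be empty):

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Wrap one system/user exchange in ChatML turn markers and
    leave the prompt open at the assistant turn for generation."""
    # Per this model card, the system prompt must not be empty.
    assert system.strip(), "System prompt must not be empty for this model"
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
```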
📊 Evaluation Results
MMLU
| Category | Accuracy |
|----------|----------|
| STEM | 56.83 |
| Humanities | 58.79 |
| Other | 70.04 |
| Social | 72.41 |
| **AVERAGE** | **63.82** |
The model outperforms or equals the best Mistral-7B chat-style fine-tunes, ChatGLM3-6B, and all other models under 33B.
CEval (Val)
| Category | Accuracy |
|----------|----------|
| STEM | 61.67 |
| Social Science | 81.94 |
| Humanities | 77.19 |
| Other | 68.35 |
| Hard | 48.03 |
| **AVERAGE** | **70.27** |
The model outperforms all current 7B models, including ChatGLM3-6B.
GSM8K
Zero-shot accuracy: 59.21% (outperforms WizardMath-7B and Qwen-7B)
MT-Bench on DPO Version
| Model | MT-Bench |
|-------|----------|
| GPT-4 | 8.99 |
| GPT-3.5-Turbo | 7.94 |
| Zephyr-7b-β (overfitting) | 7.34 |
| Zephyr-7b-α | 6.88 |
| [CausalLM/14B-DPO-α](https://huggingface.co/CausalLM/14B-DPO-alpha) | 7.618868 |
| [CausalLM/7B-DPO-α](https://huggingface.co/CausalLM/7B-DPO-alpha) | 7.038125 |
Image Reference
*Image drawn by GPT-4 DALL·E 3.*
TL;DR: Perhaps this 7B model is better than all existing models ≤ 33B in most quantitative evaluations...