🚀 🦙 Llama for Hugging Face Transformers
This project converts Llama-7B from the official release to a Hugging Face model, enabling it to work with the transformers library. It offers Hugging Face-style naming and coarser checkpoint sharding for faster loading.
🚀 Quick Start
Llama-7B has been converted from the official Llama-7B weights to a Hugging Face model using HF's conversion script. The model is distributed under a special license; refer to the LICENSE file for details.
This repository is an update of decapoda-research/llama-7b-hf. Since many pull requests in the decapoda repo remain unmerged, a new repo was created here. It includes the following improvements, illustrated by the loading sketch after the list:
- Naming Changes: The naming has been adjusted (LLaMA -> Llama) to align with the transformers naming convention, in both `LlamaForCausalLM` and `LlamaTokenizer`. This works well with `transformers>=4.28.0`.
- Checkpoint Sharding: The model checkpoints are saved in 2 shards, compared to 33 shards in decapoda-research/llama-7b-hf. Fewer shards speed up loading from disk.
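As a quick sanity check, the sketch below loads the model with the renamed classes and generates a few tokens. It is a minimal example, not code shipped in this repository: `<repo-id>` is a placeholder for this repo's actual Hugging Face Hub path, and the final `save_pretrained` call merely illustrates how a checkpoint can be re-sharded via the standard `max_shard_size` argument.

```python
# Minimal sketch, assuming transformers>=4.28.0, torch, and sentencepiece.
# "<repo-id>" is a placeholder: substitute this repository's actual Hub path.
from transformers import LlamaForCausalLM, LlamaTokenizer

repo_id = "<repo-id>"

tokenizer = LlamaTokenizer.from_pretrained(repo_id)
model = LlamaForCausalLM.from_pretrained(repo_id)

# Generate a short continuation to verify the weights loaded correctly.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Re-sharding illustration: a larger max_shard_size yields fewer shards.
model.save_pretrained("./llama-7b-resharded", max_shard_size="10GB")
```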
📚 Documentation
Llama Model Card
Model details
Intended use
- Primary intended uses: The main purpose of Llama is research on large language models, including exploring potential applications like question answering, natural language understanding, or reading comprehension; understanding the capabilities and limitations of current language models and developing improvement techniques; and evaluating and mitigating biases, risks, toxic and harmful content generations, and hallucinations.
- Primary intended users: The primary users of the model are researchers in natural language processing, machine learning, and artificial intelligence.
- Out-of-scope use cases: Llama is a base model. It should not be used in downstream applications without further risk evaluation and mitigation. In particular, the model has not been trained with human feedback and may generate toxic, offensive content, incorrect information, or unhelpful answers.
Factors
- Relevant factors: One significant factor affecting model performance is the language used. Although 20 languages are included in the training data, most of the dataset consists of English text. Therefore, the model is expected to perform better in English than in other languages. Additionally, previous studies have shown that performance may vary for different dialects, and this is also expected for this model.
- Evaluation factors: Since the model is trained on web data, it is expected to reflect biases from this source. The model was evaluated on RAI datasets to measure biases related to gender, religion, race, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status (a simplified bias-scoring sketch follows this list). The toxicity of model generations was also measured based on the toxicity of the prompting context.
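To make this kind of evaluation concrete, here is a simplified, hypothetical sketch of CrowS-Pairs-style bias scoring with a causal language model. It uses `gpt2` as a small stand-in model and one made-up sentence pair; the actual evaluation behind Table 3 was run on Llama with the full benchmark data and may differ in its scoring details.

```python
# Hypothetical sketch of CrowS-Pairs-style scoring: for each pair, the
# model "prefers" whichever sentence gets the higher log-likelihood; the
# bias score is the percentage of pairs where the stereotypical sentence
# wins (50 would indicate no preference).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small stand-in; the card's numbers come from Llama
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is the mean negative log-likelihood per predicted token
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# One illustrative (made-up) stereotypical / anti-stereotypical pair.
pairs = [("The nurse said she was tired.", "The nurse said he was tired.")]
stereo_wins = sum(log_likelihood(s) > log_likelihood(a) for s, a in pairs)
print(f"bias score: {100 * stereo_wins / len(pairs):.1f}")
```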
Metrics
- Model performance measures: The following metrics are used to evaluate the model:
- Accuracy for common sense reasoning, reading comprehension, natural language understanding (MMLU), BIG-bench hard, WinoGender, and CrowS-Pairs.
- Exact match for question answering (a generic sketch of this metric follows the list).
- The toxicity score from Perspective API on RealToxicityPrompts.
- Decision thresholds: Not applicable.
- Approaches to uncertainty and variability: Due to the high computational requirements of training LLMs, only one model of each size was trained, so the variability of pre-training could not be evaluated.
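For reference, a generic form of the exact-match metric is sketched below; the paper's evaluation harness may apply different normalization, so treat this as an illustration rather than the exact scoring code.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Generic exact match: normalize case and whitespace, then compare.
    (The normalization here is an assumption, not the paper's exact rules.)"""
    normalize = lambda s: " ".join(s.strip().lower().split())
    return normalize(prediction) == normalize(reference)

# Both strings normalize to "paris", so this pair counts as a match.
assert exact_match("  Paris ", "paris")
```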
Evaluation datasets
The model was evaluated on the following benchmarks: BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs.
Training dataset
The model was trained using the following data sources: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange [2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. See the paper for more details about the training set and corresponding preprocessing.
Quantitative analysis
Hyperparameters for the model architecture
| Llama | dimension | n heads | n layers | learning rate | batch size | n tokens |
|-------|-----------|---------|----------|---------------|------------|----------|
| 7B    | 4096      | 32      | 32       | 3.0E-04       | 4M         | 1T       |
| 13B   | 5120      | 40      | 40       | 3.0E-04       | 4M         | 1T       |
| 33B   | 6656      | 52      | 60       | 1.5E-04       | 4M         | 1.4T     |
| 65B   | 8192      | 64      | 80       | 1.5E-04       | 4M         | 1.4T     |
Table 1 - Summary of Llama Model Hyperparameters
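For readers who want to map Table 1 onto code, the sketch below expresses the 7B row as a transformers `LlamaConfig`. This is an illustration only: field names follow transformers' conventions, and values not listed in the table (such as the vocabulary size) are left at their library defaults.

```python
from transformers import LlamaConfig

# 7B row of Table 1 as a config sketch; unlisted fields keep their defaults.
config_7b = LlamaConfig(
    hidden_size=4096,        # "dimension" column
    num_attention_heads=32,  # "n heads" column
    num_hidden_layers=32,    # "n layers" column
)
print(config_7b)
```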
Model performance on reasoning tasks
| Llama | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | COPA |
|-------|-------|------|------|-----------|------------|-------|-------|------|------|
| 7B    | 76.5  | 79.8 | 48.9 | 76.1      | 70.1       | 76.7  | 47.6  | 57.2 | 93   |
| 13B   | 78.1  | 80.1 | 50.4 | 79.2      | 73         | 78.1  | 52.7  | 56.4 | 94   |
| 33B   | 83.1  | 82.3 | 50.4 | 82.8      | 76         | 81.4  | 57.8  | 58.6 | 92   |
| 65B   | 85.3  | 82.8 | 52.3 | 84.2      | 77         | 81.5  | 56    | 60.2 | 94   |
Table 2 - Summary of Llama Model Performance on Reasoning tasks
Model bias summary
| No | Category             | FAIR LLM |
|----|----------------------|----------|
| 1  | Gender               | 70.6     |
| 2  | Religion             | 79       |
| 3  | Race/Color           | 57       |
| 4  | Sexual orientation   | 81       |
| 5  | Age                  | 70.1     |
| 6  | Nationality          | 64.2     |
| 7  | Disability           | 66.7     |
| 8  | Physical appearance  | 77.8     |
| 9  | Socioeconomic status | 71.5     |
|    | Llama Average        | 66.6     |
Table 3 - Summary of bias in model output
Ethical considerations
- Data: The training data is collected from various sources, mainly the web. It contains offensive, harmful, and biased content, so the model is expected to exhibit these biases.
- Human life: The model is not intended to inform decisions regarding matters central to human life and should not be used in such a way.
- Mitigations: The web data was filtered based on its similarity to Wikipedia text and references using a Kneser-Ney language model and a fastText linear classifier.
- Risks and harms: Risks associated with large language models include the generation of harmful, offensive, or biased content. These models often generate incorrect information (hallucinations), and this model is no exception.
- Use cases: Llama is a foundational model. It should not be used in downstream applications without further investigation and risk mitigation. Potential risks and problematic use cases include, but are not limited to, the generation of misinformation and harmful, biased, or offensive content.
📄 License
This project operates under a non-commercial bespoke license. Refer to the LICENSE file for detailed information.