TinyMistral-248M: A Compact Language Model
This is a pre-trained language model derived from the Mistral 7B model and scaled down to approximately 248 million parameters. It was trained on 7,488,000 examples and is intended for fine-tuning on downstream tasks rather than direct use. The model is expected to support a context length of around 32,768 tokens. Due to issues with saving model weights, safe serialization has been removed.
Quick Start
This model is intended mainly for fine-tuning. The following inference parameters can be used as a starting point:
```json
{
  "parameters": {
    "do_sample": true,
    "temperature": 0.5,
    "top_p": 0.5,
    "top_k": 50,
    "max_new_tokens": 250,
    "repetition_penalty": 1.176
  }
}
```
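As a sketch of how these values map onto the `transformers` generation API (assuming that library is used), they correspond one-to-one to a `GenerationConfig`:

```python
# Minimal sketch: the suggested parameters expressed as a transformers GenerationConfig.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.5,
    top_p=0.5,
    top_k=50,
    max_new_tokens=250,
    repetition_penalty=1.176,
)
# This config can later be passed to model.generate(..., generation_config=generation_config).
```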
Features
- Compact Size: With approximately 248 million parameters, it is a scaled-down version of the Mistral 7B model.
- Small-scale Training: Trained on a relatively small number of examples (7,488,000).
- Fine-tuning Oriented: Intended for fine-tuning on downstream tasks rather than direct use.
- Long Context Length: Expected to support a context length of around 32,768 tokens.
Installation
No specific installation steps are provided in the original document.
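As an assumption, the model is used through the Hugging Face `transformers` library, which can be installed together with PyTorch:

```bash
pip install transformers torch
```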
Usage Examples
No code examples are provided in the original document.
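Since the original card contains no code, here is a minimal usage sketch, assuming a Hub id of `Locutusque/TinyMistral-248M` (an assumption) and the standard `transformers` causal-LM API. Because the model is intended for fine-tuning, raw generations may be of limited quality.

```python
# Minimal sketch: loading the model and generating text with the suggested parameters.
# The repository id is an assumption and may differ from the actual Hub id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Locutusque/TinyMistral-248M"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.5,
        top_p=0.5,
        top_k=50,
        max_new_tokens=250,
        repetition_penalty=1.176,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```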
Documentation
Evaluation Results
During evaluation on InstructMix, this model achieved an average perplexity score of 6.3. Further training epochs on different datasets are planned for this model.
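For reference, perplexity here is the exponential of the mean cross-entropy loss over held-out text. A minimal sketch of how such a score might be computed with `transformers` (the Hub id and example passage are assumptions):

```python
# Minimal sketch: computing perplexity as exp(mean cross-entropy loss).
# The repository id and the example passage are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Locutusque/TinyMistral-248M"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "An example passage drawn from the evaluation dataset."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Supplying labels makes the model return the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```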
The Open LLM Leaderboard evaluation results (now outdated) are shown below; detailed results are available here.
| Metric | Value |
|---|---|
| Avg. | 24.18 |
| ARC (25-shot) | 20.82 |
| HellaSwag (10-shot) | 26.98 |
| MMLU (5-shot) | 23.11 |
| TruthfulQA (0-shot) | 46.89 |
| Winogrande (5-shot) | 50.75 |
| GSM8K (5-shot) | 0.0 |
| DROP (3-shot) | 0.74 |
Purpose
The purpose of this model is to demonstrate that trillion-scale datasets are not needed to pretrain a language model. Because only a small dataset was required, the model could be pretrained on a single GPU (an NVIDIA Titan V).
Technical Details
- Training Datasets:
  - Skylion007/openwebtext
  - JeanKaddour/minipile
- Model Parameters: Scaled down to approximately 248 million parameters from the Mistral 7B model.
- Training Examples: 7,488,000 examples.
- Context Length: Around 32,768 tokens.
- Serialization: Safe serialization has been removed due to issues with saving model weights (see the sketch below).
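The serialization note likely refers to the `safe_serialization` option in `transformers`; under that assumption, a minimal sketch of saving the weights in the legacy PyTorch format looks like this:

```python
# Minimal sketch: saving weights with safetensors serialization disabled.
# The local path and the Hub id are illustrative assumptions.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Locutusque/TinyMistral-248M")  # assumed Hub id
# safe_serialization=False writes pytorch_model.bin instead of model.safetensors.
model.save_pretrained("./TinyMistral-248M-local", safe_serialization=False)
```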
License
This model is licensed under the Apache 2.0 license.