🚀 Latxa 7b
Latxa is a family of large language models for Basque. Ranging from 7 to 70 billion parameters, the models are based on Llama 2 and further pretrained on a new Basque corpus. Latxa outperforms previous open models and is competitive with GPT-4 Turbo in some aspects. All models, corpora, and datasets are publicly available under open licenses.
🚀 Quick Start
Use the code below to get started with the model.
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HiTZ/latxa-7b-v1.2")
text = "Euskara adimen artifizialera iritsi da!"
pipe(text, max_new_tokens=50, num_beams=5)

>> [
  {
    'generated_text': 'Euskara adimen artifizialera iritsi da!\nEuskararen eta adimen artifizialaren arteko harremana aspaldikoa da,'
                      ' baina azken urteotan aurrerapauso handiak eman dira arlo horretan'
  }
]
```
✨ Features
- Language-Specific: Designed specifically for the Basque language, addressing the limitations of existing models in low-resource languages.
- High-Quality Training: Based on Llama 2 and further trained on a new Basque corpus with 4.3M documents and 4.2B tokens.
- Open-Source: All models, pretraining corpora, and evaluation datasets are publicly available under open licenses, enabling reproducible research.
📦 Installation
The examples below require the Hugging Face `transformers` library and PyTorch, which can be installed with `pip install transformers torch`.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HiTZ/latxa-7b-v1.2")
text = "Euskara adimen artifizialera iritsi da!"
pipe(text, max_new_tokens=50, num_beams=5)
```
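Advanced Usage
For more control over precision, device placement, and decoding, the model can also be loaded directly with the standard transformers classes. The sketch below is a minimal example; the bfloat16 dtype and automatic device mapping (which needs the accelerate package) are assumptions about a typical single-GPU setup, not requirements stated in the original card.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model. bfloat16 and device_map="auto" (requires the
# accelerate package) are assumptions for a typical single-GPU setup.
tokenizer = AutoTokenizer.from_pretrained("HiTZ/latxa-7b-v1.2")
model = AutoModelForCausalLM.from_pretrained(
    "HiTZ/latxa-7b-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = "Euskara adimen artifizialera iritsi da!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Beam search with 5 beams, mirroring the pipeline example above.
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```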
📚 Documentation
Model Details
Model Description
Latxa is a family of Large Language Models (LLM) based on Meta’s [LLaMA models](https://huggingface.co/meta-llama). Current LLMs perform well for high-resource languages like English, but poorly for Basque and other low-resource languages. Latxa aims to overcome these limitations and to promote LLM-based technology and research for Basque. It follows the same architecture as its original counterparts and was further trained on [Latxa Corpus v1.1](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1), a high-quality Basque corpus.
The models are released in three sizes: 7B, 13B, and 70B.
| Property | Details |
|----------|---------|
| Developed by | HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU) |
| Model Type | Language model |
| Language(s) (NLP) | en, eu |
| License | llama2 |
| Parent Model | meta-llama/Llama-2-7b |
| Contact | hitz@ehu.eus |
Uses
Direct Use
Latxa family models are pre-trained LLMs without task-specific or instruction fine-tuning. They can be prompted for specific tasks or further fine-tuned for specific use cases.
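Because Latxa is a base model rather than an instruction-tuned one, the usual way to prompt it for a task is few-shot prompting: the prompt shows a handful of worked examples and the model continues the pattern. The sketch below illustrates the idea with a made-up Basque sentiment-labeling prompt; the examples and labels are illustrative only and are not taken from any Latxa evaluation.
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HiTZ/latxa-7b-v1.2")

# Hypothetical few-shot prompt: the sentences and labels below are
# illustrative only, not part of any official Latxa evaluation.
prompt = (
    "Esaldia: Film hau zoragarria da.\n"
    "Sentimendua: positiboa\n\n"
    "Esaldia: Zerbitzua oso txarra izan zen.\n"
    "Sentimendua: negatiboa\n\n"
    "Esaldia: Liburu hau asko gustatu zait.\n"
    "Sentimendua:"
)

# Greedy decoding; the model should continue with a label for the last sentence.
out = pipe(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```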
Out-of-Scope Use
The model was not fine-tuned to follow instructions or work as a chat assistant, so this kind of usage is not tested or recommended.
Bias, Risks, and Limitations
Latxa has been trained on carefully selected and processed data to reduce potentially disturbing or harmful content. However, as it is based on LLaMA models, it may carry the same biases, risks, and limitations. See LLaMA’s Ethical Considerations and Limitations for more information.
Training Details
Training Data
The training corpus combines existing and new datasets. Quality was prioritized: high-quality data sources were selected and a thorough deduplication and filtering process was applied. The resulting Basque corpus contains 4.17B tokens; in addition, 500K documents of English data from the Pile were included to avoid catastrophic forgetting.
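The Latxa Corpus v1.1 linked above is publicly available on the Hugging Face Hub and can be inspected with the datasets library. The sketch below streams a few records instead of downloading the full corpus; it assumes the default configuration and "train" split resolve directly, which may not hold, in which case an explicit configuration name must be passed to load_dataset.
```python
from datasets import load_dataset

# Stream the Basque pretraining corpus from the Hub to avoid a full download.
# Assumption: the default configuration and "train" split resolve directly;
# if not, pass an explicit configuration name to load_dataset.
corpus = load_dataset("HiTZ/latxa-corpus-v1.1", split="train", streaming=True)

for i, doc in enumerate(corpus):
    print(doc)  # each record holds one document and its metadata
    if i >= 2:
        break
```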
Training Procedure
The training was conducted using the [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) library on the CINECA HPC Leonardo computing cluster in Italy. The models were trained for 10k steps with a sequence length of 4,096 tokens and an effective batch size of 2M tokens, totaling 20B tokens (around 4 epochs). A cosine learning rate schedule was used, with a warm-up of 500 steps and decay to 3% of the peak learning rate (set to 1e-4). Other hyperparameters follow Touvron et al. (2023).
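For concreteness, the learning-rate schedule described above (linear warm-up for 500 steps, then cosine decay from a peak of 1e-4 down to 3% of the peak over 10k steps) can be written out as a small function. This is a generic re-implementation of that description, not code taken from the actual GPT-NeoX configuration used for training.
```python
import math

def latxa_lr(step, peak_lr=1e-4, warmup_steps=500, total_steps=10_000, min_ratio=0.03):
    """Cosine schedule with linear warm-up, as described in the training procedure.

    Generic re-implementation of the description above; not the actual
    GPT-NeoX configuration used to train Latxa.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    min_lr = peak_lr * min_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(latxa_lr(250))     # mid warm-up: 5e-05
print(latxa_lr(500))     # peak learning rate: 1e-04
print(latxa_lr(10_000))  # end of training: 3e-06 (3% of the peak)
```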
Evaluation
The models were evaluated in zero-shot and few-shot settings on generative, multiple-choice, and classification tasks, using the Basque partitions of each dataset.
Testing Data, Factors & Metrics
Testing Data
The evaluation covers the Basque benchmarks listed in the results table below: XStoryCloze, Belebele, BasqueGLUE, EusProficiency, EusReading, EusTrivia, and EusExams.
Metrics
Most tasks use accuracy. Some tasks in the BasqueGLUE benchmark are scored with F1 instead (see the sketch after this list):
- Micro F1: BEC2016-eu and BHTCv2
- Macro F1: VaxxStance (favor & against)
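Micro F1 pools true positives, false positives, and false negatives across all classes before computing a single F1 score, while macro F1 averages the per-class F1 scores so each class weighs equally. A minimal scikit-learn illustration on toy labels (not evaluation data):
```python
from sklearn.metrics import f1_score

# Toy predictions for a 3-class stance task; not real evaluation data.
y_true = ["favor", "against", "neutral", "favor", "against", "neutral"]
y_pred = ["favor", "against", "favor",   "favor", "neutral", "neutral"]

print(f1_score(y_true, y_pred, average="micro"))  # pooled over all classes
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1

# VaxxStance reports macro F1 over the "favor" and "against" classes only,
# which can be computed by restricting the label set:
print(f1_score(y_true, y_pred, labels=["favor", "against"], average="macro"))
```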
Results
The model was evaluated using EleutherAI's LM Evaluation Harness library. To reproduce the results, follow the instructions in Latxa's [GitHub repository](https://github.com/hitz-zentroa/latxa?tab=readme-ov-file#evaluation).
| Model | Size | XStory | Belebele | BasGLUE | EusProf | EusRead | EusTrivia | EusExams | Avg |
|-------|------|--------|----------|---------|---------|---------|-----------|----------|-----|
| Random | | 50.00 | 25.00 | 37.50 | 25.00 | 25.83 | 26.55 | 25.00 | 30.70 |
| | | | | | | | | | |
| GPT 3.5 Turbo | n/a | -- | 57.33 | 48.62 | 31.24 | 36.65 | 46.71 | 42.42 | -- |
| GPT 4 Turbo | n/a | -- | 90.67 | 62.90 | 56.70 | 75.85 | 73.12 | 70.22 | -- |
| | | | | | | | | | |
| XGLM | 7B | 57.71 | 23.88 | 41.47 | 22.96 | 24.43 | 26.53 | 24.59 | 32.51 |
| BLOOM | 7B | 57.18 | 27.00 | 40.17 | 25.34 | 28.41 | 27.17 | 25.07 | 33.86 |
| Mistral | 7B | 51.09 | 38.89 | 39.22 | 25.01 | 29.26 | 34.58 | 32.15 | 35.94 |
| Llama 2 | 7B | 50.43 | 26.22 | 38.20 | 24.09 | 27.27 | 29.50 | 28.84 | 32.51 |
| Latxa v1.1 | 7B | 65.45 | 37.33 | 52.56 | 30.26 | 25.00 | 42.16 | 33.82 | 40.94 |
| | | | | | | | | | |
| mGPT | 13B | 55.39 | 25.00 | 37.56 | 25.00 | 24.15 | 27.17 | 25.73 | 32.14 |
| Llama 2 | 13B | 50.63 | 32.00 | 38.98 | 25.90 | 28.98 | 33.53 | 29.66 | 34.36 |
| Latxa v1.1 | 13B | 66.51 | 53.89 | 53.36 | 44.11 | 32.67 | 56.38 | 43.66 | 50.08 |
| | | | | | | | | | |
| Mixtral | 8x7B | 52 | | | | | | | |
🔧 Technical Details
Latxa follows the Llama 2 architecture. Its gains on Basque tasks come from continued pretraining on the new Basque corpus, using the hyperparameters and procedure described under Training Procedure, carried out with the GPT-NeoX library on the CINECA Leonardo HPC cluster.
📄 License
The models inherit the LLaMA-2 license, which allows for commercial and research use.