🚀 Latxa 7b
Latxa is a family of large language models for Basque. Ranging from 7 to 70 billion parameters, the models are based on Llama 2 and further pretrained on a new Basque corpus. Latxa outperforms previous open models and is competitive with GPT-4 Turbo in some aspects. All models, corpora, and datasets are publicly available under open licenses.
🚀 Quick Start
Use the code below to get started with the model.
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HiTZ/latxa-7b-v1.2")
text = "Euskara adimen artifizialera iritsi da!"
pipe(text, max_new_tokens=50, num_beams=5)

>> [
  {
    'generated_text': 'Euskara adimen artifizialera iritsi da!\nEuskararen eta adimen artifizialaren arteko harremana aspaldikoa da,'
                      ' baina azken urteotan aurrerapauso handiak eman dira arlo horretan'
  }
]
```
✨ Features
- Language-Specific: Designed specifically for the Basque language, addressing the limitations of existing models in low-resource languages.
- High-Quality Training: Based on Llama 2 and further trained on a new Basque corpus with 4.3M documents and 4.2B tokens.
- Open-Source: All models, pretraining corpora, and evaluation datasets are publicly available under open licenses, enabling reproducible research.
📦 Installation
The examples below require the Hugging Face `transformers` library and PyTorch, which can be installed with `pip install transformers torch`.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HiTZ/latxa-7b-v1.2")
text = "Euskara adimen artifizialera iritsi da!"
pipe(text, max_new_tokens=50, num_beams=5)
```
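Advanced Usage
For more control over precision, device placement, and decoding, the model can also be loaded directly with the standard transformers classes. The sketch below is a minimal example; the bfloat16 dtype and automatic device mapping (which needs the accelerate package) are assumptions about a typical single-GPU setup, not requirements stated in the original card.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model. bfloat16 and device_map="auto" (requires the
# accelerate package) are assumptions for a typical single-GPU setup.
tokenizer = AutoTokenizer.from_pretrained("HiTZ/latxa-7b-v1.2")
model = AutoModelForCausalLM.from_pretrained(
    "HiTZ/latxa-7b-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = "Euskara adimen artifizialera iritsi da!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Beam search with 5 beams, mirroring the pipeline example above.
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```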
📚 Documentation
Model Details
Model Description
Latxa is a family of Large Language Models (LLM) based on Meta’s [LLaMA models](https://huggingface.co/meta-llama). Current LLMs perform well for high-resource languages like English, but poorly for Basque and other low-resource languages. Latxa aims to overcome these limitations and to promote LLM-based technology and research for Basque. It follows the same architecture as its original counterparts and was further trained on [Latxa Corpus v1.1](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1), a high-quality Basque corpus.
The models are released in three sizes: 7B, 13B, and 70B.
| Property | Details |
|----------|---------|
| Developed by | HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU) |
| Model Type | Language model |
| Language(s) (NLP) | en, eu |
| License | llama2 |
| Parent Model | meta-llama/Llama-2-7b |
| Contact | hitz@ehu.eus |
Uses
Direct Use
Latxa family models are pre-trained LLMs without task-specific or instruction fine-tuning. They can be prompted for specific tasks or further fine-tuned for specific use cases.
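Because Latxa is a base model rather than an instruction-tuned one, the usual way to prompt it for a task is few-shot prompting: the prompt shows a handful of worked examples and the model continues the pattern. The sketch below illustrates the idea with a made-up Basque sentiment-labeling prompt; the examples and labels are illustrative only and are not taken from any Latxa evaluation.
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="HiTZ/latxa-7b-v1.2")

# Hypothetical few-shot prompt: the sentences and labels below are
# illustrative only, not part of any official Latxa evaluation.
prompt = (
    "Esaldia: Film hau zoragarria da.\n"
    "Sentimendua: positiboa\n\n"
    "Esaldia: Zerbitzua oso txarra izan zen.\n"
    "Sentimendua: negatiboa\n\n"
    "Esaldia: Liburu hau asko gustatu zait.\n"
    "Sentimendua:"
)

# Greedy decoding; the model should continue with a label for the last sentence.
out = pipe(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```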
Out-of-Scope Use
The model was not fine-tuned to follow instructions or work as a chat assistant, so this kind of usage is not tested or recommended.
Bias, Risks, and Limitations
Latxa has been trained on carefully selected and processed data to reduce potentially disturbing or harmful content. However, as it is based on LLaMA models, it may carry the same biases, risks, and limitations. See LLaMA’s Ethical Considerations and Limitations for more information.
Training Details
Training Data
The training corpus combines existing and new datasets. Quality was prioritized: high-quality data sources were selected and a thorough deduplication and filtering process was applied. The resulting Basque corpus contains 4.17B tokens; in addition, 500K documents of English data from the Pile were included to avoid catastrophic forgetting.
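The Latxa Corpus v1.1 linked above is publicly available on the Hugging Face Hub and can be inspected with the datasets library. The sketch below streams a few records instead of downloading the full corpus; it assumes the default configuration and "train" split resolve directly, which may not hold, in which case an explicit configuration name must be passed to load_dataset.
```python
from datasets import load_dataset

# Stream the Basque pretraining corpus from the Hub to avoid a full download.
# Assumption: the default configuration and "train" split resolve directly;
# if not, pass an explicit configuration name to load_dataset.
corpus = load_dataset("HiTZ/latxa-corpus-v1.1", split="train", streaming=True)

for i, doc in enumerate(corpus):
    print(doc)  # each record holds one document and its metadata
    if i >= 2:
        break
```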
Training Procedure
The training was conducted using the [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) library on the CINECA HPC Leonardo computing cluster in Italy. The models were trained for 10k steps with a sequence length of 4,096 tokens and an effective batch size of 2M tokens, totaling 20B tokens (around 4 epochs). A cosine learning rate schedule was used, with a warm-up of 500 steps and decay to 3% of the peak learning rate (set to 1e-4). Other hyperparameters follow Touvron et al. (2023).
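For concreteness, the learning-rate schedule described above (linear warm-up for 500 steps, then cosine decay from a peak of 1e-4 down to 3% of the peak over 10k steps) can be written out as a small function. This is a generic re-implementation of that description, not code taken from the actual GPT-NeoX configuration used for training.
```python
import math

def latxa_lr(step, peak_lr=1e-4, warmup_steps=500, total_steps=10_000, min_ratio=0.03):
    """Cosine schedule with linear warm-up, as described in the training procedure.

    Generic re-implementation of the description above; not the actual
    GPT-NeoX configuration used to train Latxa.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    min_lr = peak_lr * min_ratio
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(latxa_lr(250))     # mid warm-up: 5e-05
print(latxa_lr(500))     # peak learning rate: 1e-04
print(latxa_lr(10_000))  # end of training: 3e-06 (3% of the peak)
```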
Evaluation
The models were evaluated in zero-shot and few-shot settings on generative, multiple-choice, and classification tasks, using the Basque partitions of each dataset.
Testing Data, Factors & Metrics
Testing Data
The evaluation covers the Basque benchmarks listed in the results table below: XStoryCloze, Belebele, BasqueGLUE, EusProficiency, EusReading, EusTrivia, and EusExams.
Metrics
Most tasks use accuracy. Some tasks in the BasqueGLUE benchmark are scored with F1 instead (see the sketch after this list):
- Micro F1: BEC2016-eu and BHTCv2
- Macro F1: VaxxStance (favor & against)
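Micro F1 pools true positives, false positives, and false negatives across all classes before computing a single F1 score, while macro F1 averages the per-class F1 scores so each class weighs equally. A minimal scikit-learn illustration on toy labels (not evaluation data):
```python
from sklearn.metrics import f1_score

# Toy predictions for a 3-class stance task; not real evaluation data.
y_true = ["favor", "against", "neutral", "favor", "against", "neutral"]
y_pred = ["favor", "against", "favor",   "favor", "neutral", "neutral"]

print(f1_score(y_true, y_pred, average="micro"))  # pooled over all classes
print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1

# VaxxStance reports macro F1 over the "favor" and "against" classes only,
# which can be computed by restricting the label set:
print(f1_score(y_true, y_pred, labels=["favor", "against"], average="macro"))
```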
Results
The model was evaluated using EleutherAI's LM Evaluation Harness library. To reproduce the results, follow the instructions in Latxa's [GitHub repository](https://github.com/hitz-zentroa/latxa?tab=readme-ov-file#evaluation).
| Model | Size | XStory | Belebele | BasGLUE | EusProf | EusRead | EusTrivia | EusExams | Avg |
|-------|------|--------|----------|---------|---------|---------|-----------|----------|-----|
| Random | | 50.00 | 25.00 | 37.50 | 25.00 | 25.83 | 26.55 | 25.00 | 30.70 |
| | | | | | | | | | |
| GPT 3.5 Turbo | n/a | -- | 57.33 | 48.62 | 31.24 | 36.65 | 46.71 | 42.42 | -- |
| GPT 4 Turbo | n/a | -- | 90.67 | 62.90 | 56.70 | 75.85 | 73.12 | 70.22 | -- |
| | | | | | | | | | |
| XGLM | 7B | 57.71 | 23.88 | 41.47 | 22.96 | 24.43 | 26.53 | 24.59 | 32.51 |
| BLOOM | 7B | 57.18 | 27.00 | 40.17 | 25.34 | 28.41 | 27.17 | 25.07 | 33.86 |
| Mistral | 7B | 51.09 | 38.89 | 39.22 | 25.01 | 29.26 | 34.58 | 32.15 | 35.94 |
| Llama 2 | 7B | 50.43 | 26.22 | 38.20 | 24.09 | 27.27 | 29.50 | 28.84 | 32.51 |
| Latxa v1.1 | 7B | 65.45 | 37.33 | 52.56 | 30.26 | 25.00 | 42.16 | 33.82 | 40.94 |
| | | | | | | | | | |
| mGPT | 13B | 55.39 | 25.00 | 37.56 | 25.00 | 24.15 | 27.17 | 25.73 | 32.14 |
| Llama 2 | 13B | 50.63 | 32.00 | 38.98 | 25.90 | 28.98 | 33.53 | 29.66 | 34.36 |
| Latxa v1.1 | 13B | 66.51 | 53.89 | 53.36 | 44.11 | 32.67 | 56.38 | 43.66 | 50.08 |
| | | | | | | | | | |
| Mixtral | 8x7B | 52 | | | | | | | |
🔧 Technical Details
Latxa follows the Llama 2 architecture. Its gains on Basque tasks come from continued pretraining on the new Basque corpus, using the hyperparameters and procedure described under Training Procedure, carried out with the GPT-NeoX library on the CINECA Leonardo HPC cluster.
📄 License
The models inherit the LLaMA-2 license, which allows for commercial and research use.