Latxa-Llama-3.1-70B-Instruct Open Source Model - Optimized for Basque, Excellent Performance in Multiple Tests

Latxa Llama 3.1 70B Instruct

Developed by HiTZ

Latxa 3.1 70B Instruct is an instruction-tuned version based on Llama-3.1 (Instruct), specifically optimized for Basque language, demonstrating excellent performance on multiple Basque benchmarks.

Large Language Model

Transformers

Supports Multiple Languages#Basque language optimization #Instruction fine-tuning #Low-resource language support

Downloads 59

Release Time : 3/28/2025

Model Overview

Latxa is a large language model (LLM) based on Meta's LLaMA model series, designed specifically for Basque language. It maintains the original architecture and continues training on high-quality Basque corpora to promote the development of Basque LLM technology and research.

Model Features

Basque language optimization

Designed specifically for Basque language, significantly outperforming the original model on multiple Basque benchmarks.

Instruction fine-tuning

Fine-tuned for instructions, making it suitable as a chat assistant or for tasks requiring instruction following.

High performance

Ranked third in public arena evaluations, behind only Claude and GPT-4, and outperforming competitors of similar scale.

Model Capabilities

Text generation

Chat assistant

Instruction following

Use Cases

Education

Basque exam preparation

Used for answering practice questions for Basque C1 level exams

Achieved 68.00% accuracy on the EusProficiency dataset

Reading comprehension

Basque language reading comprehension exercises

Achieved 78.98% accuracy on the EusReading dataset

Knowledge Q&A

Basque knowledge Q&A

Answering knowledge questions about Basque culture, history, etc.

Achieved 74.17% accuracy on the EusTrivia dataset

🚀 HiTZ/Latxa-Llama-3.1-70B-Instruct

We introduce Latxa 3.1 70B Instruct, an instructed version of Latxa, which outperforms Llama-3.1-Instruct on Basque benchmarks and shows great potential in chat conversations.

🚀 Quick Start

Use the code below to get started with the model.

from transformers import pipeline

pipe = pipeline('text-generation', model='HiTZ/Latxa-Llama-3.1-70B-Instruct')

messages = [
    {'role': 'user', 'content': 'Kaixo!'},
]

pipe(messages)

>>
[
  {
    'generated_text': [
      {'role': 'user', 'content': 'Kaixo!'},
      {'role': 'assistant', 'content': 'Kaixo! Zer moduz? Zer behar edo galdetu nahi duzu?'}
    ]
  }
]

✨ Features

High Performance on Basque: Our preliminary experimentation shows that Latxa 3.1 70B Instruct outperforms Llama-3.1-Instruct by a large margin on Basque standard benchmarks, especially in chat conversations.
Good Ranking in Public Evaluation: In a public arena-based evaluation, Latxa ranked 3rd, just behind Claude and GPT-4 and above all the other same-size competitors.

📦 Installation

The installation details are related to the transformers library. You can install it using the following command:

pip install transformers

💻 Usage Examples

Basic Usage

from transformers import pipeline

pipe = pipeline('text-generation', model='HiTZ/Latxa-Llama-3.1-70B-Instruct')

messages = [
    {'role': 'user', 'content': 'Kaixo!'},
]

result = pipe(messages)
print(result)

📚 Documentation

Model Details

Model Description

Latxa is a family of Large Language Models (LLM) based on Meta’s LLaMA models. Current LLMs perform incredibly well for high-resource languages like English. However, for Basque and other low-resource languages, their performance is close to a random guesser. These limitations widen the gap between high- and low-resource languages in digital development. We present Latxa to overcome these limitations and promote the development of LLM-based technology and research for the Basque language. Latxa models follow the same architecture as their original counterparts and were further trained in Latxa Corpus v1.1, a high-quality Basque corpora.

Property	Details
Developed by	HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
Model Type	Language model
Language(s) (NLP)	eu
License	llama3.1
Parent model	meta-llama/Llama-3.1-70B-Instruct
Contact	hitz@ehu.eus

Uses

Latxa models are intended to be used with Basque data; for any other language, the performance is not guaranteed. Similar to the original, Latxa inherits the Llama-3.1 License, which allows for commercial and research use.

Direct Use

Latxa Instruct models are trained to follow instructions or work as chat assistants.

Out-of-Scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

Bias, Risks, and Limitations

In an effort to alleviate potentially disturbing or harmful content, Latxa has been trained on carefully selected and processed data, mainly from local media, national/regional newspapers, encyclopedias, and blogs (see Latxa Corpus v1.1). However, the model is based on Llama 3.1 models and may potentially carry the same bias, risk, and limitations. Please refer to Llama’s Ethical Considerations and Limitations for further information.

Training Details

⚠️ Important Note

Further training details will be released with the corresponding research paper in the near future.

Evaluation

We evaluated the models in 5-shot settings on multiple-choice tasks, using the Basque partitions of each dataset. The arena results will be released in the future.

Testing Data, Factors & Metrics

Testing Data

Belebele (Bandarkar et al.): Belebele is a multiple-choice machine reading comprehension (MRC) dataset covering 122 language variants. We evaluated the model in a 5-shot manner.
- Data card: https://huggingface.co/datasets/facebook/belebele
X-StoryCloze (Lin et al.): XStoryCloze is a professionally translated version of the English StoryCloze dataset into 10 non-English languages. Story Cloze is a commonsense reasoning dataset that requires choosing the correct ending for a four-sentence story. We evaluated the model in a 5-shot manner.
- Data card: https://huggingface.co/datasets/juletxara/xstory_cloze
EusProficiency (Etxaniz et al., 2024): EusProficiency contains 5,169 exercises on different topics from past EGA exams, the official C1-level certificate of proficiency in Basque.
- Data card: https://huggingface.co/datasets/HiTZ/EusProficiency
EusReading (Etxaniz et al., 2024): EusReading consists of 352 reading comprehension exercises (irakurmena) from the same set of past EGA exams.
- Data card: https://huggingface.co/datasets/HiTZ/EusReading
EusTrivia (Etxaniz et al., 2024): EusTrivia includes 1,715 trivia questions from multiple online sources. 56.3% of the questions are at the elementary level (grades 3 - 6), while the rest are considered challenging.
- Data card: https://huggingface.co/datasets/HiTZ/EusTrivia
EusExams (Etxaniz et al., 2024): EusExams is a collection of tests designed to prepare individuals for Public Service examinations conducted by several Basque institutions, including the public health system Osakidetza, the Basque Government, the City Councils of Bilbao and Gasteiz, and the University of the Basque Country (UPV/EHU).
- Data card: https://huggingface.co/datasets/HiTZ/EusExams

Metrics

We use Accuracy since the tasks are framed as Multiple Choice questions.

Results

Task	Llama-3.1 8B Instruct	Latxa 3.1 8B Instruct	Llama-3.1 70B Instruct	Latxa 3.1 70B Instruct
Belebele	73.89	80.00	89.11	91.00
X-Story Cloze	61.22	71.34	69.69	77.83
EusProficiency	34.13	52.83	43.59	68.00
EusReading	49.72	62.78	72.16	78.98
EusTrivia	45.01	61.05	62.51	74.17
EusExams	46.21	56.00	63.28	71.56

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: HPC Cluster, 4 x A100 64Gb nodes x64
Hours used (total GPU hours): 16005.12h
Cloud Provider: CINECA HPC
Compute Region: Italy
Carbon Emitted: 1901.41kg CO2 eq

Acknowledgements

This work has been partially supported by the Basque Government (IKER-GAITU project). It has also been partially supported by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project with reference 2022/TL22/00215335. The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2023E01-013.

Citation

Coming soon.

Meanwhile, you can reference:

@misc{etxaniz2024latxa,
    title={{L}atxa: An Open Language Model and Evaluation Suite for {B}asque},
    author={Julen Etxaniz and Oscar Sainz and Naiara Perez and Itziar Aldabe and German Rigau and Eneko Agirre and Aitor Ormazabal and Mikel Artetxe and Aitor Soroa},
    year={2024},
    eprint={2403.20266},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご