Minerva-7B-instruct-v1.0 Open-source Bilingual Large Model - Supports Italian-English Instruction Following and Conversation Tasks

Minerva 7B Instruct V1.0

Developed by sapienzanlp

Minerva is an Italian-English bilingual large language model developed by Sapienza NLP, based on the Mistral architecture with 7 billion parameters, supporting instruction-following and dialogue tasks.

Large Language Model

Transformers

Supports Multiple LanguagesOpen Source License:Apache-2.0 #Italian language optimization #Bilingual instruction fine-tuning #Safety alignment

Downloads 4,331

Release Time : 12/2/2024

Model Overview

The first member of a large language model family fully pre-trained on Italian, optimized with SFT and DPO, supporting text generation tasks in both Italian and English.

Model Features

Bilingual support

Specifically optimized for Italian and English, with Italian training data accounting for nearly 50%

Instruction optimization

Trained in two phases with SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) to enhance instruction-following capabilities

Efficient tokenization

Italian tokenization efficiency reaches 1.32 (CX samples), outperforming similar models

Safety measures

Includes manually curated safety data and targeted protective measures

Model Capabilities

Multi-turn dialogue

Instruction understanding

Knowledge Q&A

Text generation

Use Cases

Education

Language learning assistant

Helps learners practice Italian conversation and grammar

Customer service

Bilingual customer service bot

Handles customer inquiries in Italian and English

🚀 Minerva-7B-instruct-v1.0

Minerva is the first family of large language models (LLMs) pretrained from scratch on Italian. Developed by Sapienza NLP in the Future Artificial Intelligence Research (FAIR) project, in collaboration with CINECA, with additional contributions from Babelscape and the CREATIVE PRIN Project. The Minerva models are truly-open (data and model) Italian-English LLMs, with about half of the pretraining data being Italian text.

🚀 Quick Start

This is the model card for Minerva-7B-instruct-v1.0, a 7 billion parameter model trained on nearly 2.5 trillion tokens (1.14 trillion in Italian, 1.14 trillion in English, and 200 billion in code).

✨ Features

Model Family

This model is part of the Minerva LLM family:

Model Architecture

Minerva-7B-base-v1.0 is a Transformer model based on the Mistral architecture. Check the configuration file for a detailed breakdown of the hyperparameters.

The Minerva LLM family details:

Model Name	Tokens	Layers	Hidden Size	Attention Heads	KV Heads	Sliding Window	Max Context Length
Minerva-350M-base-v1.0	70B (35B it + 35B en)	16	1152	16	4	2048	16384
Minerva-1B-base-v1.0	200B (100B it + 100B en)	16	2048	16	4	2048	16384
Minerva-3B-base-v1.0	660B (330B it + 330B en)	32	2560	32	8	2048	16384
Minerva-7B-base-v1.0	2.48T (1.14T it + 1.14T en + 200B code)	32	4096	32	8	None	4096

Model Training

Minerva-7B-base-v1.0 was trained using llm-foundry 0.8.0 from MosaicML. The hyperparameters:

Model Name	Optimizer	lr	betas	eps	weight decay	Scheduler	Warmup Steps	Batch Size (Tokens)	Total Steps
Minerva-350M-base-v1.0	Decoupled AdamW	2e-4	(0.9, 0.95)	1e-8	0.0	Cosine	2%	4M	16,690
Minerva-1B-base-v1.0	Decoupled AdamW	2e-4	(0.9, 0.95)	1e-8	0.0	Cosine	2%	4M	47,684
Minerva-3B-base-v1.0	Decoupled AdamW	2e-4	(0.9, 0.95)	1e-8	0.0	Cosine	2%	4M	157,357
Minerva-7B-base-v1.0	AdamW	3e-4	(0.9, 0.95)	1e-5	0.1	Cosine	2000	4M	591,558

SFT Training

The SFT model was trained using Llama-Factory. The data mix:

Dataset	Source	Code	English	Italian
Glaive-code-assistant	Link	100,000	0	0
Alpaca-python	Link	20,000	0	0
Alpaca-cleaned	Link	0	50,000	0
Databricks-dolly-15k	Link	0	15,011	0
No-robots	Link	0	9,499	0
OASST2	Link	0	29,000	528
WizardLM	Link	0	29,810	0
LIMA	Link	0	1,000	0
OPENORCA	Link	0	30,000	0
Ultrachat	Link	0	50,000	0
MagpieMT	Link	0	30,000	0
Tulu-V2-Science	Link	0	7,000	0
Aya_datasets	Link	0	3,944	738
Tower-blocks_it	Link	0	0	7,276
Bactrian-X	Link	0	0	67,000
Magpie (Translated by us)	Link	0	0	59,070
Everyday-conversations (Translated by us)	Link	0	0	2,260
alpaca-gpt4-it	Link	0	0	15,000
capybara-claude-15k-ita	Link	0	0	15,000
Wildchat	Link	0	0	5,000
GPT4_INST	Link	0	0	10,000
Italian Safety Instructions	-	0	0	21,426
Italian Conversations	-	0	0	4,843

For more details, check our tech page.

Online DPO Training

This model card is for our DPO model. Direct Preference Optimization (DPO) refines models based on user feedback, similar to Reinforcement Learning from Human Feedback (RLHF) but without reinforcement learning complexity. Online DPO allows real-time adaptation during training, continuously refining the model with new feedback. We used the Hugging Face TRL library and Online DPO, with the Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 model as the judge for optimization. For this stage, we used prompts from HuggingFaceH4/ultrafeedback_binarized (English), efederici/evol-dpo-ita (Italian), and Babelscape/ALERT translated to Italian, plus manually curated safety data.

For more details, check our tech page.

Model Evaluation

For Minerva's evaluation, we used ITA-Bench, a new suite for testing Italian-speaking models. ITA-Bench has 18 benchmarks for assessing language model performance on tasks like scientific knowledge, commonsense reasoning, and mathematical problem-solving.

💻 Usage Examples

Basic Usage

import transformers
import torch

model_id = "sapienzanlp/Minerva-7B-instruct-v1.0"

# Initialize the pipeline.
pipeline = transformers.pipeline(
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Input text for the model.
input_conv = [{"role": "user", "content": "Qual √® la capitale dell'Italia?"}]

# Compute the outputs.
output = pipeline(
  input_conv,
  max_new_tokens=128,
)

output

[{'generated_text': [{'role': 'user', 'content': "Qual √® la capitale dell'Italia?"}, {'role': 'assistant', 'content': "La capitale dell'Italia √® Roma."}]}]

🔧 Technical Details

Bias, Risks, and Limitations

This is a chat foundation model with alignment and safety risk mitigation strategies. However, the model may:

Overrepresent some viewpoints and underrepresent others
Contain stereotypes
Contain personal information
Generate:
- Racist and sexist content
- Hateful, abusive, or violent language
- Discriminatory or prejudicial language
- Content inappropriate for all settings, including sexual content
Make errors, including producing incorrect information or historical facts as if factual
Generate irrelevant or repetitive outputs

We're aware of biases and potential problematic/toxic content in current pretrained large language models. As probabilistic models of (Italian and English) languages, they reflect and amplify training data biases. For more information, refer to our survey:

Biases in Large Language Models: Origins, Inventory, and Discussion

📄 License

The model is licensed under the Apache-2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご