cerbero-7b Italian LLM
cerbero-7b is the first 100% Free and Open Source Italian Large Language Model (LLM) suitable for research or commercial applications. It is built on mistral-7b, which outperforms Llama2 13B across all benchmarks and surpasses Llama1 34B in numerous metrics.
- New release: cerbero-7b-openchat, our latest SOTA model based on openchat3.5, offers performance on par with or superior to ChatGPT 3.5!
- The research paper revealing the secrets behind cerbero-7b is now available on arXiv!
- Try an online demo here (quantized demo running on CPU, less powerful than the original cerbero-7b).
A Cambrian explosion of Italian Language Models is crucial for building advanced AI architectures that meet the diverse needs of the population. cerbero-7b, along with Camoscio and Fauno, aims to initiate this revolution in Italy, enabling sophisticated AI solutions to interact with and understand the Italian language, thus promoting innovation across industries and strengthening the connection between technology and people.
cerbero-7b is released under the permissive Apache 2.0 license, allowing unrestricted usage, even for commercial applications.
Quick Start
You can load cerbero-7b (or cerbero-7b-openchat) using 🤗 transformers:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model weights and the matching tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b")
tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")

# The prompt ends with the [|Assistente|] tag so that the model writes the reply
prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
with torch.no_grad():  # inference only, no gradients needed
    output_ids = model.generate(input_ids, max_new_tokens=128)

# The decoded text contains the prompt followed by the generated reply
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
GGUF and llama.cpp
cerbero-7b is fully compatible with llama.cpp. You can find the original and quantized versions of cerbero-7b in the GGUF format here. The following example loads the Q4_K quantization through the llama-cpp-python bindings:
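from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Download the Q4_K quantized weights and load them with llama.cpp
llm = Llama(
    model_path=hf_hub_download(
        repo_id="galatolo/cerbero-7b-gguf",
        filename="ggml-model-Q4_K.gguf",
    ),
    n_ctx=4086,
)

prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]"""

# Calling the Llama object runs text completion on a string prompt;
# Llama.generate() works on token ids and does not accept a string.
output = llm(prompt, max_tokens=128, stop=["[|Umano|]"])
print(output["choices"][0]["text"])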
Features
- Powerful Base: Built on mistral-7b, which outperforms Llama2 13B and surpasses Llama1 34B in many metrics.
- Versatile Models: Available in different versions like cerbero-7b and cerbero-7b-openchat, suitable for various applications.
- Free and Open Source: Released under the Apache 2.0 license, allowing unrestricted use for research and commercial purposes.
Documentation
Model Evaluation Results
The cerbero-7b model has been thoroughly evaluated across several benchmarks to demonstrate its ability to understand and generate Italian text. The following are the summarized results:
SQuAD-it Evaluation
The Stanford Question Answering Dataset (SQuAD) in Italian (SQuAD-it) evaluates the model's reading comprehension and question-answering capabilities. The table below shows the F1 score and Exact Match (EM) metrics:
Model | F1 Score | Exact Match (EM) |
---|---|---|
cerbero-7b-openchat | 74.09% | 56.0% |
cerbero-7b | 72.55% | 55.6% |
Fauno | 44.46% | 0.00% |
Camoscio | 37.42% | 0.00% |
mistral-7b | 15.55% | 8.50% |
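To make the two metrics in the table concrete, here is a minimal sketch of how SQuAD-style Exact Match and token-level F1 are commonly computed for a single prediction. The normalization steps (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the function names are assumptions for illustration; the evaluation script used for the numbers above may differ.

```python
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Roma.", "roma"))
    print(f1_score("la città di Roma", "Roma"))
```

Under these definitions, a verbose answer that contains the reference span still earns partial F1 but no Exact Match, which is one way a model can score well above 0% F1 while sitting at 0.00% EM.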
EVALITA Benchmark Results
EVALITA benchmarks assess the model's performance in tasks such as toxicity detection, irony detection, and sentiment analysis. The table below presents the F1 scores for these tasks:
Model | Toxicity Detection | Irony Detection | Sentiment Analysis |
---|---|---|---|
cerbero-7b-openchat | 63.33% | 69.16% | 66.89% |
cerbero-7b | 63.04% | 48.51% | 61.80% |
Fauno | 33.84% | 39.17% | 12.23% |
Camoscio | 38.18% | 39.65% | 13.33% |
mistral-7b | 34.16% | 34.16% | 12.14% |
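For classification tasks like these, F1 is computed from predicted and gold labels rather than from answer tokens. The sketch below shows a macro-averaged F1, assuming each task is framed as single-label classification; the source does not state which averaging was used for the table above, so treat the choice of macro averaging and the function names as assumptions.

```python
from typing import Hashable, Sequence


def f1_for_label(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable) -> float:
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    labels = set(y_true) | set(y_pred)
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    gold = ["ironico", "ironico", "non ironico", "non ironico"]
    predicted = ["ironico", "non ironico", "non ironico", "non ironico"]
    print(f"macro F1: {macro_f1(gold, predicted):.4f}")
```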
Why Cerbero?
The name "Cerbero," inspired by the three-headed dog guarding the gates of the Underworld in Greek mythology, represents the essence of our model, drawing strength from three key pillars:
- Base Model: mistral-7b. cerbero-7b is built on the powerful mistral-7b as its base model, ensuring a solid foundation and leveraging the capabilities of a cutting-edge language model.
- Datasets: Cerbero Dataset. The Cerbero Dataset is a revolutionary collection designed to improve cerbero-7b's proficiency in understanding and generating Italian text. It is created using an innovative method that combines dynamic self-chat mechanisms with advanced Large Language Model (LLM) technology. Refer to the paper for more details.
- Licensing: Apache 2.0. Released under the permissive Apache 2.0 license, cerbero-7b promotes openness and collaboration. This licensing allows developers unrestricted usage, fostering a community-driven approach to advancing AI in Italy and beyond.
Models
cerbero-7b is available in different versions, each tailored for specific applications. The following table lists these versions, along with their training datasets and base models:
Model Name | Training Dataset | Base Model | Huggingface Model | Llama.cpp and Quantized Model |
---|---|---|---|---|
cerbero-7b | Cerbero Dataset | mistral-7b | link | link |
cerbero-7b-openchat | Cerbero Dataset | openchat3.5 | link | link |
Prompt Format
cerbero-7b is trained on full conversations using the following prompt format:
[|Umano|] First human message
[|Assistente|] First AI reply
[|Umano|] Second human message
[|Assistente|] Second AI reply
When creating prompts, make sure to end with the [|Assistente|] tag to signal the AI to generate a response. Use [|Umano|] as the stop word.
For example:
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]
It's possible to include a brief system message at the start of your prompt, but note that the training data for cerbero-7b does not contain such system messages. Therefore, it's recommended to minimize or avoid including them for optimal model performance.
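The format described above can be assembled programmatically. The following is a minimal sketch, not part of the cerbero-7b codebase: `build_prompt` and `extract_reply` are hypothetical helper names, and the sketch assumes that the decoded output echoes the prompt before the reply, as a plain `tokenizer.decode` of `model.generate` output does.

```python
from typing import List, Optional, Tuple

HUMAN_TAG = "[|Umano|]"
ASSISTANT_TAG = "[|Assistente|]"


def build_prompt(
    turns: List[Tuple[str, str]],
    new_message: str,
    system: Optional[str] = None,
) -> str:
    """Build a prompt from past (human, assistant) turns plus a new human message."""
    lines = []
    if system:
        # Optional and best kept short: the training data has no system messages
        lines.append(system)
    for human, assistant in turns:
        lines.append(f"{HUMAN_TAG} {human}")
        lines.append(f"{ASSISTANT_TAG} {assistant}")
    lines.append(f"{HUMAN_TAG} {new_message}")
    # End with the assistant tag so the model generates the next reply
    lines.append(ASSISTANT_TAG)
    return "\n".join(lines)


def extract_reply(generated_text: str, prompt: str) -> str:
    """Strip the echoed prompt and cut the reply at the [|Umano|] stop word."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        # Fallback: keep what follows as many assistant tags as the prompt contains
        n_tags = prompt.count(ASSISTANT_TAG)
        reply = generated_text.split(ASSISTANT_TAG, n_tags)[-1]
    return reply.split(HUMAN_TAG, 1)[0].strip()


if __name__ == "__main__":
    prompt = build_prompt([], "Come posso distinguere un AI da un umano?")
    print(prompt)
```

Cutting at the stop word after generation is a simple alternative when the inference backend has no native stop-sequence option; backends that accept stop sequences can be given `[|Umano|]` directly.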
Technical Details
Training Details
cerbero-7b is a fully fine-tuned LLM: all of its weights are updated during training, unlike LoRA or QLoRA fine-tunes, which train only small adapter matrices. The model is trained on synthetic Italian conversation datasets, generated by a large LLM through dynamic self-chat, with a large context window of 8192 tokens.
Dataset Composition
Details on the Cerbero Dataset will be updated shortly!
Training Setup
cerbero-7b is trained on an NVIDIA DGX H100:
- Hardware: Using 8xH100 GPUs, each with 80 GB VRAM.
- Parallelism: DeepSpeed ZeRO stage 1 parallelism for optimal training efficiency.
The model has been trained for 1 epoch, ensuring knowledge convergence and proficiency in handling diverse linguistic tasks.
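As a rough illustration of the parallelism setting above, a DeepSpeed configuration that enables ZeRO stage 1 (which partitions optimizer states across the data-parallel GPUs) could look like the sketch below. Only the ZeRO stage and the GPU count come from this document; the bf16 flag, micro-batch size, and gradient accumulation steps are placeholder assumptions, not the values used to train cerbero-7b.

```python
import json

NUM_GPUS = 8  # from the hardware description above

# Placeholder assumptions, NOT the actual cerbero-7b training values
MICRO_BATCH_PER_GPU = 1
GRAD_ACCUM_STEPS = 8

ds_config = {
    # ZeRO stage 1: optimizer states are sharded across data-parallel ranks
    "zero_optimization": {"stage": 1},
    # Assumption: bf16 mixed precision, which H100 GPUs support
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    # DeepSpeed expects: train_batch_size = micro batch * accumulation steps * GPUs
    "train_batch_size": MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS * NUM_GPUS,
}

if __name__ == "__main__":
    # The printed JSON can be saved to a file and passed to DeepSpeed as its config
    print(json.dumps(ds_config, indent=2))
```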
Differences from the paper
Attention: The released versions of cerbero-7b slightly differ from those used in the paper. The training dataset for the released models was generated using garage-bAInd/Platypus2-70B-instruct instead of meta-llama/Llama-2-7b-chat-hf, due to the more permissive license of the Platypus2 model (CC-BY-NC 4.0). Our tests show that both models produce datasets of comparable quality, and the resulting fine-tuned models have nearly indistinguishable performance.
License
cerbero-7b is released under the Apache 2.0 license.
Citation
If you use cerbero-7b in your research, please cite our paper:
@article{galatolo2023cerbero,
title={Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation},
author={Galatolo, Federico A and Cimino, Mario GCA},
journal={arXiv preprint arXiv:2311.15698},
year={2023}
}

