Cerbero-7b: Free and Open-Source Large Language Model Optimized for Italian, Outperforming Llama2 13B

Cerbero 7b

Developed by galatolo

The first fully free and open-source Italian large language model, built on mistral-7b, optimized for Italian, outperforming Llama2 13B

Large Language Model

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Italian LLM #Multi-turn Dialogue Optimization #Apache 2.0 Commercial Use

Downloads 5,722

Release Time: 10/26/2023

Model Overview

cerbero-7b is the first fully free and open-source Italian large language model. It is built on mistral-7b and designed to fill a gap in Italy's AI ecosystem. It supports Italian and English and is licensed under Apache 2.0, so it can be used for both research and commercial applications.

Model Features

Italian Language Optimization

Specifically optimized for Italian, filling the gap in Italy's AI ecosystem

High Performance

Outperforms Llama2 13B on all benchmarks and surpasses Llama1 34B on several metrics

Open Source & Free

Licensed under Apache 2.0, allowing unlimited use, including commercial applications

Long Context Support

Supports long-context windows of up to 8192 tokens

Model Capabilities

Italian text generation

English text generation

Question answering

Dialogue systems

Text comprehension

Use Cases

Education

Italian Learning Assistant

Helps students learn and practice Italian

Provides accurate explanations of Italian grammar and usage

Customer Service

Italian Customer Service Bot

Provides localized customer support for Italian businesses

Understands complex Italian queries and provides accurate responses

Content Creation

Italian Content Generation

Generates high-quality Italian marketing copy and articles

Produces natural language content aligned with Italian cultural context

🚀 cerbero-7b Italian LLM

cerbero-7b is the first 100% Free and Open Source Italian Large Language Model (LLM) suitable for research or commercial applications. It's built on mistral-7b, outperforming Llama2 13B across all benchmarks and surpassing Llama1 34B in numerous metrics.

🚀 New Release: cerbero-7b-openchat, our latest SOTA model based on openchat3.5, offers performance on par with or superior to ChatGPT 3.5!

🔥 The research paper revealing the secrets behind cerbero-7b is now available on arXiv!

📢 An online demo is available (a quantized version running on CPU, less powerful than the original cerbero-7b).

A Cambrian explosion of Italian language models is crucial for building advanced AI architectures that meet the diverse needs of the population. cerbero-7b, along with Camoscio and Fauno, aims to start this revolution in Italy, enabling sophisticated AI solutions that understand and interact in Italian, promoting innovation across industries and strengthening the connection between technology and people.

cerbero-7b is released under the permissive Apache 2.0 license, allowing unrestricted usage, even for commercial applications.

🚀 Quick Start

You can load cerbero-7b (or cerbero-7b-openchat) using 🤗transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("galatolo/cerbero-7b")
tokenizer = AutoTokenizer.from_pretrained("galatolo/cerbero-7b")

prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]"""

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128)

generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)

GGUF and llama.cpp

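cerbero-7b is fully compatible with llama.cpp. The original and quantized versions of cerbero-7b in the GGUF format are published in the galatolo/cerbero-7b-gguf repository and can be loaded with llama-cpp-python:

from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Download the Q4_K quantized GGUF weights and load them with llama.cpp.
llm = Llama(
    model_path=hf_hub_download(
        repo_id="galatolo/cerbero-7b-gguf",
        filename="ggml-model-Q4_K.gguf",
    ),
    n_ctx=4096,
)

prompt = """Questa è una conversazione tra un umano ed un assistente AI.
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]"""

# Calling the Llama object runs a text completion (generate() expects token ids, not a string).
# Stop at the next human tag so the model does not continue the dialogue on its own.
output = llm(prompt, max_tokens=128, stop=["[|Umano|]"])
print(output["choices"][0]["text"])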

✨ Features

Powerful Base: Built on mistral-7b, outperforming Llama2 13B and surpassing Llama1 34B in many metrics.
Versatile Models: Available in different versions like cerbero-7b and cerbero-7b-openchat, suitable for various applications.
Free and Open Source: Released under the Apache 2.0 license, allowing unrestricted use for research and commercial purposes.

📚 Documentation

Model Evaluation Results 📈

The cerbero-7b model has been thoroughly evaluated across several benchmarks to demonstrate its ability to understand and generate Italian text. The following are the summarized results:

SQuAD-it Evaluation

The Stanford Question Answering Dataset (SQuAD) in Italian (SQuAD-it) evaluates the model's reading comprehension and question-answering capabilities. The table below shows the F1 score and Exact Match (EM) metrics:

Model	F1 Score	Exact Match (EM)
cerbero-7b-openchat	74.09%	56.0%
cerbero-7b	72.55%	55.6%
Fauno	44.46%	0.00%
Camoscio	37.42%	0.00%
mistral-7b	15.55%	8.50%
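
To make the two metrics concrete, here is a minimal sketch of SQuAD-style Exact Match and token-level F1. It is not the evaluation script used for the table above: the normalization (lowercasing and stripping ASCII punctuation) is a simplifying assumption, and the function names are illustrative.

```python
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop ASCII punctuation, and split on whitespace (simplified normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a verbose answer that contains the reference span earns partial F1 but scores 0 on Exact Match, which is why the two columns can diverge sharply.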

EVALITA Benchmark Results

EVALITA benchmarks assess the model's performance in tasks such as toxicity detection, irony detection, and sentiment analysis. The table below presents the F1 scores for these tasks:

Model	Toxicity Detection	Irony Detection	Sentiment Analysis
cerbero-7b-openchat	63.33%	69.16%	66.89%
cerbero-7b	63.04%	48.51%	61.80%
Fauno	33.84%	39.17%	12.23%
Camoscio	38.18%	39.65%	13.33%
mistral-7b	34.16%	34.16%	12.14%
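
For classification tasks such as these, F1 is computed from predicted and gold labels rather than from token overlap. The sketch below shows per-class F1 and its unweighted (macro) average. Which averaging scheme the table above uses is not stated here, so treat the macro average as an assumption; the function names are illustrative.

```python
def class_f1(y_true, y_pred, positive):
    """F1 for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy example: 1 = ironic, 0 = not ironic (made-up labels, not benchmark data).
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
score = macro_f1(gold, pred)
```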

Why Cerbero? 🤔

The name "Cerbero," inspired by the three-headed dog guarding the gates of the Underworld in Greek mythology, represents the essence of our model, drawing strength from three key pillars:

Base Model: mistral-7b 🏗️

cerbero-7b is built on the powerful mistral-7b as its base model, ensuring a solid foundation and leveraging the capabilities of a cutting-edge language model.

Datasets: Cerbero Dataset 📚

The Cerbero Dataset is a revolutionary collection designed to improve cerbero-7b's proficiency in understanding and generating Italian text. It is created using an innovative method that combines dynamic self-chat mechanisms with advanced Large Language Model (LLM) technology. Refer to the paper for more details.

Licensing: Apache 2.0 🕊️

Released under the permissive Apache 2.0 license, cerbero-7b promotes openness and collaboration. This licensing allows developers unrestricted usage, fostering a community-driven approach to advancing AI in Italy and beyond.

Models 🧬

cerbero-7b is available in different versions, each tailored for specific applications. The following table lists these versions, along with their training datasets and base models:

Model Name	Training Dataset	Base Model	Huggingface Model	Llama.cpp and Quantized Model
cerbero-7b	Cerbero Dataset	mistral-7b	link	link
cerbero-7b-openchat	Cerbero Dataset	openchat3.5	link	link

Prompt Format

cerbero-7b is trained on full conversations using the following prompt format:

[|Umano|] First human message
[|Assistente|] First AI reply
[|Umano|] Second human message
[|Assistente|] Second AI reply

When creating prompts, make sure to end with the [|Assistente|] tag to signal the AI to generate a response. Use [|Umano|] as the stop word.

For example:

[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]

It's possible to include a brief system message at the start of your prompt, but note that the training data for cerbero-7b does not contain such system messages. Therefore, it's recommended to minimize or avoid including them for optimal model performance.
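
The prompt format described above can be assembled programmatically. The following is a minimal sketch: `build_prompt` and `extract_reply` are hypothetical helper names, not part of any cerbero-7b tooling. The first formats a multi-turn history and ends with the bare [|Assistente|] tag; the second trims generated text at the [|Umano|] stop word.

```python
from typing import List, Tuple

HUMAN_TAG = "[|Umano|]"
ASSISTANT_TAG = "[|Assistente|]"


def build_prompt(history: List[Tuple[str, str]], user_message: str) -> str:
    """Format completed (human, assistant) turns followed by a new human message."""
    lines = []
    for human, assistant in history:
        lines.append(f"{HUMAN_TAG} {human}")
        lines.append(f"{ASSISTANT_TAG} {assistant}")
    lines.append(f"{HUMAN_TAG} {user_message}")
    # End with the bare assistant tag so the model writes the next reply.
    lines.append(ASSISTANT_TAG)
    return "\n".join(lines)


def extract_reply(generated_text: str, prompt: str) -> str:
    """Strip the echoed prompt (if present) and cut the completion at the stop word."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.split(HUMAN_TAG, 1)[0].strip()


prompt = build_prompt(
    history=[("Ciao", "Ciao! Come posso aiutarti?")],
    user_message="Come posso distinguere un AI da un umano?",
)
```

`extract_reply` is useful with APIs that return the prompt together with the completion, or that do not support stop sequences natively.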

🔧 Technical Details

Training Details 🚀

cerbero-7b is a fully fine-tuned LLM, unlike LoRA or QLoRA fine-tunes, which update only a small set of adapter weights. The model was trained on synthetic Italian datasets generated through dynamic self-chat, using a large context window of 8192 tokens.

Dataset Composition 📊

📢 Details on the Cerbero Dataset will be updated shortly!

Training Setup ⚙️

cerbero-7b was trained on an NVIDIA DGX H100:

Hardware: 8x H100 GPUs, each with 80 GB of VRAM. 🖥️
Parallelism: DeepSpeed ZeRO stage 1 parallelism for optimal training efficiency. ✨

The model has been trained for 1 epoch, ensuring knowledge convergence and proficiency in handling diverse linguistic tasks.
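
As an illustration of the parallelism setting above, a DeepSpeed configuration enabling ZeRO stage 1 might look like the sketch below. Only the ZeRO stage comes from this document; the precision, batch size, and gradient accumulation values are placeholder assumptions, not the settings actually used to train cerbero-7b.

```python
import json

# Hypothetical DeepSpeed configuration: only "stage": 1 is taken from the text above.
ds_config = {
    # ZeRO stage 1 shards optimizer states across the data-parallel GPUs.
    "zero_optimization": {"stage": 1},
    # Assumed values for illustration only.
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

# Serialize to the JSON form DeepSpeed reads from a config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```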

Differences from the paper

📢 Attention: The released versions of cerbero-7b slightly differ from those used in the paper. The training dataset for the released models was generated using garage-bAInd/Platypus2-70B-instruct instead of meta-llama/Llama-2-7b-chat-hf, due to the more permissive license of the Platypus2 model (CC-BY-NC 4.0). Our tests show that both models produce datasets of comparable quality, and the resulting fine-tuned models have nearly indistinguishable performance.

📄 License

cerbero-7b is released under the Apache 2.0 license.

📖 Citation

If you use cerbero-7b in your research, please cite our paper:

@article{galatolo2023cerbero,
  title={Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation},
  author={Galatolo, Federico A and Cimino, Mario GCA},
  journal={arXiv preprint arXiv:2311.15698},
  year={2023}
}
