🚀 CroissantLLMChat (190k steps + Chat)
This model is part of the CroissantLLM initiative. It corresponds to the checkpoint after 190k pretraining steps (2.99T tokens), followed by a final Chat finetuning phase.
Check out the paper on arXiv: https://arxiv.org/abs/2402.00786
For best performance, it should be used with a temperature of 0.3 or more, and with the exact template described below:
```python
chat = [
    {"role": "user", "content": "Que puis-je faire à Marseille en hiver?"},
]

chat_input = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```
This corresponds to:
chat_input = """<|im_start|>user
{USER QUERY}<|im_end|>
<|im_start|>assistant\n"""
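Concretely, with the example query above, the rendered prompt passed to the model should look like this (the trailing assistant header is what cues the model to answer):

```text
<|im_start|>user
Que puis-je faire à Marseille en hiver?<|im_end|>
<|im_start|>assistant
```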
✨ Features
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens. The goal is to provide the research and industrial community with a high-performance, fully open-sourced bilingual model that can run quickly on consumer-grade local hardware.
To achieve this, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, which notably contains a French split with manually curated, high-quality, and varied data sources.
To assess performance outside of English, we create a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French language. Additionally, for transparency and to promote further Large Language Model research, we release codebases, dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models and strong translation models. We evaluate our model through the FMTI framework and validate 81% of the transparency criteria, far exceeding the scores of most open initiatives.
📦 Installation
The usage examples below only require PyTorch and the Hugging Face transformers library, e.g. `pip install torch transformers`.
💻 Usage Examples
Basic Usage
This model is a Chat model, finetuned for conversational use, and works best with the template shown above.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "croissantllm/CroissantLLMChat-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Recommended sampling settings (temperature of 0.3 or more).
generation_args = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.3,
    "top_p": 0.90,
    "top_k": 40,
    "repetition_penalty": 1.05,
    "eos_token_id": [tokenizer.eos_token_id, 32000],  # 32000 corresponds to <|im_end|>
}

chat = [
    {"role": "user", "content": "Qui est le président français actuel ?"},
]

chat_input = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(chat_input, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, **generation_args)

print(tokenizer.decode(tokens[0]))
# Inspect individual tokens and their ids.
print([(tokenizer.decode([tok]), tok) for tok in tokens[0].tolist()])
```
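To print only the assistant's reply rather than the full sequence (prompt included), you can slice off the prompt tokens before decoding; a minimal sketch following on from the code above:

```python
# Decode only the newly generated tokens, dropping the prompt portion.
prompt_length = inputs["input_ids"].shape[1]
reply = tokenizer.decode(tokens[0, prompt_length:], skip_special_tokens=True)
print(reply)
```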
Advanced Usage
Depending on your generation setup, you may need an explicit stopping criterion on the <|im_end|> token (id 32000 in this tokenizer) so that generation halts at the end of the assistant turn.
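The snippet below is a minimal sketch of such a criterion using the transformers `StoppingCriteria` API. The <|im_end|> id is looked up from the tokenizer rather than hardcoded (it is expected to resolve to 32000 here), and the criterion assumes a batch size of 1:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

model_name = "croissantllm/CroissantLLMChat-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last emitted token is one of `stop_ids`."""

    def __init__(self, stop_ids):
        self.stop_ids = set(stop_ids)

    def __call__(self, input_ids, scores, **kwargs):
        # Assumes batch size 1: check the most recently generated token.
        return input_ids[0, -1].item() in self.stop_ids


# Look up the <|im_end|> id from the tokenizer (expected to be 32000 here).
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
stopping_criteria = StoppingCriteriaList([StopOnTokens([im_end_id])])

chat = [
    {"role": "user", "content": "Qui est le président français actuel ?"},
]
chat_input = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(chat_input, return_tensors="pt").to(model.device)

tokens = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.90,
    top_k=40,
    repetition_penalty=1.05,
    stopping_criteria=stopping_criteria,
)
print(tokenizer.decode(tokens[0]))
```

Passing the id through `eos_token_id`, as in the basic example, achieves the same effect when the generation config honors it; the explicit criterion is a fallback for setups where it does not.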
🔧 Technical Details
Model limitations
Evaluation results show that the model is strong in its size category. It offers decent performance on writing-based tasks and internal knowledge, and very strong performance on translation tasks. However, the small size of the CroissantLLM model limits its ability to perform more complex reasoning-based tasks, at least in a zero- or few-shot manner in its generalist base or chat-model versions. This is consistent with other models of the same size and emphasizes the importance of scale for more abstract tasks.
Knowledge Cutoff
The model training dataset has a data cutoff date corresponding to the November 2023 Wikipedia dump. This is the de facto knowledge cutoff date for our base model, although a lot of information dates back further. Updated versions can be trained through continued pre-training or subsequent fine-tuning.
Multilingual performance
CroissantLLM is mainly a French and English model. Code performance is relatively limited. Although some data from other languages is included in the SlimPajama training set, out-of-the-box performance in other languages is not expected, although some European languages do work quite well.
Hallucinations
CroissantLLM can hallucinate and output factually incorrect data, especially regarding complex topics. This is expected given the small model size, and hallucination rates seem lower than most models of the same size category, although no quantitative assessments have been conducted outside of MT-Bench experiments.
📄 License
The model is released under the MIT license.
📚 Documentation
Datasets
- croissantllm/croissant_dataset
- croissantllm/CroissantLLM-2201-sft
- cerebras/SlimPajama-627B
- uonlp/CulturaX
- pg19
- bigcode/starcoderdata
Languages
- French (fr)
- English (en)
Pipeline Tag
text-generation
Tags
- legal
- code
- text-generation-inference
- art
Citation
Our work can be cited as:
```bibtex
@misc{faysse2024croissantllm,
    title={CroissantLLM: A Truly Bilingual French-English Language Model},
    author={Manuel Faysse and Patrick Fernandes and Nuno M. Guerreiro and António Loison and Duarte M. Alves and Caio Corro and Nicolas Boizard and João Alves and Ricardo Rei and Pedro H. Martins and Antoni Bigata Casademunt and François Yvon and André F. T. Martins and Gautier Viaud and Céline Hudelot and Pierre Colombo},
    year={2024},
    eprint={2402.00786},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```