SambaLingo-Serbian-Chat Open-Source Chat Model - Free Support for Bilingual Conversations in Serbian and English

Sambalingo Serbian Chat

Developed by sambanovasystems

A human-aligned chat model supporting Serbian and English, adapted from Llama-2-7b and trained via Direct Preference Optimization

Large Language Model

Transformers

Supports Multiple Languages #Serbian chat #Bilingual conversation #Human alignment optimization

Downloads 35

Release Time : 2/15/2024

Model Overview

This model is a chat model supporting Serbian and English, trained through Direct Preference Optimization, suitable for dialogue generation tasks.

Model Features

Multilingual support

Supports dialogue generation in both Serbian and English

Human alignment

Trained via Direct Preference Optimization (DPO) to make model outputs more aligned with human preferences

Extended vocabulary

Added 25,000 new language tokens to the base Llama model, expanding the vocabulary to 57,000 tokens

Model Capabilities

Serbian dialogue generation

English dialogue generation

Multilingual mixed dialogue

Use Cases

Dialogue systems

Serbian chatbot

Used to build chatbots supporting Serbian

Capable of generating fluent and natural Serbian dialogues

Bilingual customer service assistant

Build customer service systems supporting both Serbian and English

Seamlessly switches between the two languages to answer user queries

🚀 SambaLingo-Serbian-Chat

SambaLingo-Serbian-Chat is a human-aligned chat model trained in Serbian and English. It addresses the need for multilingual chat capabilities, leveraging direct preference optimization on a base model adapted to Serbian. This enables more natural and accurate interactions in both Serbian and English.

🚀 Quick Start

Loading Model With Hugging Face

Make sure to set use_fast=False when loading the tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Serbian-Chat", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Serbian-Chat", device_map="auto", torch_dtype="auto")

Interacting With Model Pipeline

Again, set use_fast=False when loading the tokenizer.

from transformers import pipeline

# use_fast=False is forwarded to the tokenizer loader
pipe = pipeline("text-generation", model="sambanovasystems/SambaLingo-Serbian-Chat", device_map="auto", use_fast=False)
messages = [
    {"role": "user", "content": "YOUR_QUESTION"},  # replace with your question
]
# Render the chat template as a string that ends with the assistant turn marker
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt)[0]
outputs = outputs["generated_text"]
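By default, the text-generation pipeline returns the prompt followed by the completion in generated_text. The following is a minimal sketch of trimming the echoed prompt, assuming that default behavior. The helper name extract_reply is hypothetical and not part of transformers, and the strings are stand-ins for real pipeline output.

```python
def extract_reply(generated_text: str, prompt: str) -> str:
    """Return only the newly generated text, dropping the echoed prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


# Stand-in strings; in practice these come from the pipeline call above.
prompt = "<|user|>\nZdravo!</s>\n<|assistant|>\n"
generated_text = prompt + "Zdravo! Kako mogu da pomognem?"
reply = extract_reply(generated_text, prompt)
print(reply)
```

Alternatively, passing return_full_text=False to the pipeline call should make it return only the completion.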

Suggested Inference Parameters

Temperature: 0.8
Repetition penalty: 1.0
Top-p: 0.9
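The suggested parameters can be collected as keyword arguments for generation. This is a sketch, not an official configuration: do_sample=True is added because temperature and top-p only take effect when sampling, and max_new_tokens=256 is an assumed value that the model card does not specify.

```python
# Suggested sampling settings from the model card, as keyword arguments.
generation_kwargs = {
    "do_sample": True,          # sampling must be on for temperature/top_p to apply
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.0,  # 1.0 means no penalty
    "max_new_tokens": 256,      # assumption: not specified by the model card
}

# With a loaded pipeline or model, these could be passed as, e.g.:
#   outputs = pipe(prompt, **generation_kwargs)
#   output_ids = model.generate(**inputs, **generation_kwargs)
print(generation_kwargs)
```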

Prompting Guidelines

Use the following chat template to prompt the model:

<|user|>\n{question}</s>\n<|assistant|>\n
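When the tokenizer's apply_chat_template is not available, the same single-turn prompt can be built by hand. This is a minimal sketch of the template above, where each \n is a real newline character; the function name build_prompt is hypothetical, and multi-turn formatting is not covered here.

```python
def build_prompt(question: str) -> str:
    """Format a single user turn using the chat template shown above."""
    return f"<|user|>\n{question}</s>\n<|assistant|>\n"


prompt = build_prompt("Koji je glavni grad Srbije?")
print(repr(prompt))
```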

Example Prompt and Generations

<|user|>
Od kada datiraju prva naselja na prostoru danasnjeg Beograda, i kada je grad osnovan?</s>
<|assistant|>
Beograd su naseljavali ljudi još u praistoriji, a prva naselja datiraju iz neolitskih vremena, pre oko 7.000 godina. Međutim, grad Beograd nije osnovan upravo u to vreme, već u srednjem veku, kada su Srbi u 9. veku osnovali srednjovekovni grad Beograd, na mestu današnjeg Kalemegdana. Grad je kasnije pretrpeo razna oсвајања, razaranja i obnavljanja, da bi se na kraju razvio u značajnu evropsku metropolu.

✨ Features

Multilingual Support: Trained in both Serbian and English, enabling seamless communication in these languages.
Human Alignment: Uses direct preference optimization (DPO) so that responses better match human preferences.
Based on Strong Foundation: Built on the Llama-2-7b model, adapted to Serbian.

📚 Documentation

Model Description

Developed by: SambaNova Systems
Model type: Language Model
Language(s): Serbian, English
Finetuned from model: Llama-2-7b
Try this model: SambaLingo-chat-space
Paper: SambaLingo: Teaching Large Language Models New Languages
Blog Post: sambalingo-open-source-language-experts

Training Details

The alignment phase follows the recipe for Zephyr-7B, with two stages: supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).

The SFT phase was conducted on the ultrachat_200k dataset, mixed with the Google-translated version. It was trained for one epoch with a global batch size of 512 and a max sequence length of 2048 tokens. A linear decay learning rate of 2e-5 and 10% warmup were used.
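To illustrate what a linear decay schedule with 10% warmup looks like, here is a small sketch. It assumes the learning rate rises linearly from 0 to the peak during warmup and then decays linearly to 0, which is a common convention but is not confirmed by the text above. The function name and the total step count are hypothetical.

```python
def linear_warmup_decay_lr(step: int, total_steps: int,
                           peak_lr: float = 2e-5,
                           warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: ramp from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # assumed value, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step, total_steps))
```

For the DPO phase, the same shape would apply with peak_lr=5e-7.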

The DPO phase was carried out on the ultrafeedback dataset and cai-conversation-harmless dataset, mixed with 10% of the data Google-translated. It was trained with a global batch size of 32 for three epochs. A linear decay learning rate of 5e-7, 10% warmup, and β = 0.1 as the regularization factor for DPO were employed.
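The role of β can be seen in the standard DPO objective for a single preference pair. The sketch below is a plain-Python illustration of that loss, not the training code used for this model; the log-probability values are made up for the example.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

A larger β penalizes deviation from the reference model more strongly for the same log-probability gap.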

Tokenizer Details

The vocabulary of the base Llama model was extended from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
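The idea of adding only non-overlapping tokens can be shown with a toy example. This sketch is an illustration under simple assumptions (a vocabulary as a list of strings, candidates taken in order); it is not the procedure used to build the actual tokenizer, and extend_vocab is a hypothetical name.

```python
def extend_vocab(base_vocab, candidate_tokens, max_new=25_000):
    """Append candidates not already in base_vocab, up to max_new additions."""
    seen = set(base_vocab)
    extended = list(base_vocab)
    added = 0
    for token in candidate_tokens:
        if added >= max_new:
            break
        if token not in seen:  # skip tokens that overlap with the vocabulary
            seen.add(token)
            extended.append(token)
            added += 1
    return extended


base = ["a", "b", "c"]
candidates = ["b", "ž", "ć", "ž", "đ"]
print(extend_vocab(base, candidates, max_new=2))  # ['a', 'b', 'c', 'ž', 'ć']
print(32_000 + 25_000)  # 57000, the extended vocabulary size
```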

Evaluation

For evaluation results, refer to our paper: SambaLingo: Teaching Large Language Models New Languages

Uses

Direct Use

Use of this model is governed by Meta's Llama 2 Community License Agreement. Review and accept the license before downloading the model weights.

Out-of-Scope Use

SambaLingo should not be used for:

Mission-critical applications
Applications involving the safety of others
Making highly important decisions

Bias, Risks, and Limitations

Like all LLMs, SambaLingo has certain limitations:

Hallucination: The model may generate responses with plausible but factually incorrect or irrelevant information.
Code Switching: It might unintentionally switch between languages or dialects within a single response, affecting output coherence and understandability.
Repetition: The model may produce repetitive phrases or sentences, resulting in less engaging and informative responses.
Coding and Math: Its performance in generating accurate code or solving complex mathematical problems may be limited.
Toxicity: The model could inadvertently generate responses with inappropriate or harmful content.
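The repetition limitation can be monitored with a simple heuristic. The sketch below measures the fraction of repeated word n-grams in a response; the function name, the whitespace tokenization, and the choice of n=3 are assumptions for illustration, not part of the model's tooling.

```python
def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


print(repeated_ngram_ratio("a b c a b c"))            # 0.25
print(repeated_ngram_ratio("jedan dva tri četiri"))   # 0.0
```

A high ratio on a generated response may be a signal to regenerate it or to raise the repetition penalty.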

Acknowledgments

We express our sincere gratitude to the open-source AI community. This project would not have been possible without open source. SambaNova supports the open-source community and aims to contribute actively.

Special thanks to the following groups:

Meta for open-sourcing Llama 2 and the FLORES-200 dataset
Nguyen et al. for open-sourcing the CulturaX dataset
CohereAI for releasing AYA-101 and open-sourcing a multilingual instruction tuning dataset
EleutherAI for their open-source evaluation framework
Hugging Face-H4 team for open-sourcing the zephyr training recipe and alignment handbook repo

Cite SambaLingo

@misc{csaki2024sambalingo,
      title={SambaLingo: Teaching Large Language Models New Languages}, 
      author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
      year={2024},
      eprint={2404.05829},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

📄 License

The model uses the Llama 2 license.
