SambaLingo - Turkish - Open Source Chat Model - Humanized Chat Experience Supporting Both Turkish and English

Sambalingo Turkish Chat

Developed by sambanovasystems

The SambaLingo-Turkish-Chat Model is a human preference-aligned chat model supporting Turkish and English, adapted from Llama-2-7b with Turkish language adaptation and direct preference optimization training.

Large Language Model

Transformers

Release Time: 2/15/2024

Model Overview

This model is based on the SambaLingo-Turkish-Base model and trained through direct preference optimization, supporting chat interactions in Turkish and English.

Model Features

Multilingual Support

Supports chat interactions in Turkish and English, with special optimization for Turkish.

Direct Preference Optimization

Trained through two stages: supervised fine-tuning (SFT) and direct preference optimization (DPO), aligning with human preferences.

Extended Vocabulary

Expands the base Llama model's vocabulary from 32,000 to 57,000 tokens by adding up to 25,000 new non-overlapping tokens for the target language.

Model Capabilities

Turkish Text Generation

English Text Generation

Multilingual Chat Interaction

Use Cases

Chat Applications

Multilingual Chat Assistant

Can serve as a bilingual chat assistant for Turkish and English, providing natural and fluent conversation experiences.

Language Learning

Turkish Learning Aid

Assists users learning Turkish with language practice and conversation simulation.

🚀 SambaLingo-Turkish-Chat

SambaLingo-Turkish-Chat is a human-aligned chat model trained in Turkish and English, offering high-quality multilingual interaction.

🚀 Quick Start

Loading Model With Hugging Face

Make sure to set use_fast=False when loading the tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", torch_dtype="auto")

Interacting With Model Pipeline

Again, set use_fast=False when loading the tokenizer.

from transformers import pipeline

# use_fast=False selects the slow tokenizer, as required above
pipe = pipeline("text-generation", model="sambanovasystems/SambaLingo-Turkish-Chat", device_map="auto", use_fast=False)
messages = [
    {"role": "user", "content": "YOUR_QUESTION"},  # replace with your question text
]
# Render the chat template to a prompt string that ends with the assistant tag
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt)[0]
outputs = outputs["generated_text"]

Suggested Inference Parameters

Temperature: 0.8
Repetition penalty: 1.0
Top-p: 0.9
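
As a minimal sketch, the suggested parameters above can be collected into keyword arguments for the text-generation pipeline. The `do_sample=True` flag is needed for temperature and top-p to take effect; `max_new_tokens=256` is an assumption for illustration and is not specified in this model card.

```python
# Suggested sampling settings from this model card, expressed as keyword
# arguments for the transformers text-generation pipeline / model.generate().
GENERATION_KWARGS = {
    "do_sample": True,          # sampling must be enabled for temperature/top_p to apply
    "temperature": 0.8,
    "top_p": 0.9,
    "repetition_penalty": 1.0,  # 1.0 means no repetition penalty
    "max_new_tokens": 256,      # assumption: not specified in this model card
}

# Usage with the pipeline created earlier (this downloads the model weights):
# outputs = pipe(prompt, **GENERATION_KWARGS)[0]["generated_text"]
```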

Prompting Guidelines

To prompt this model, use the following chat template:

<|user|>\n{question}</s>\n<|assistant|>\n
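
If you are not using `apply_chat_template`, the single-turn template above can be filled in by hand. The helper below is a hypothetical illustration (the name `build_prompt` is not part of any library) and covers only the single-turn format shown here.

```python
def build_prompt(question: str) -> str:
    """Fill the single-turn chat template with a user question.

    Hypothetical helper for illustration; the tokenizer's own chat
    template remains the reference way to build prompts.
    """
    return f"<|user|>\n{question}</s>\n<|assistant|>\n"


prompt = build_prompt("Türkiye'nin başkenti neresidir?")
```

The resulting string can be passed to the pipeline in place of the `apply_chat_template` output.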

✨ Features

SambaLingo-Turkish-Chat is a human-aligned chat model trained in Turkish and English. It is built on the base model SambaLingo-Turkish-Base using direct preference optimization. The base model adapts [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) to Turkish by training on 42 billion tokens from the Turkish split of the Cultura-X dataset. Try it at [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).

📚 Documentation

Model Description

Developed by: SambaNova Systems
Model type: Language Model
Language(s): Turkish, English
Finetuned from model: [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf)
Try this model: [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space)
Paper: SambaLingo: Teaching Large Language Models New Languages
Blog Post: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)

Training Details

The alignment phase follows the recipe for [Zephyr-7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), and consists of two stages: supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).

The SFT phase was conducted on the ultrachat_200k dataset mixed with the Google-translated version of the ultrachat_200k dataset. It was trained for one epoch with a global batch size of 512 and a max sequence length of 2048 tokens. A linear decay learning rate of 2e-5 and 10% warmup were used.
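
To make the SFT learning-rate schedule concrete, here is a small sketch of a linear-decay schedule with 10% warmup and a peak of 2e-5. It assumes the warmup rises linearly from zero and the decay ends at zero, which is a common convention but is not stated in this model card; the function name and `total_steps` value are illustrative.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Learning rate at `step`, with linear warmup followed by linear decay.

    Assumptions (not stated in the model card): warmup starts at 0 and
    the decay reaches 0 at `total_steps`.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr down to 0
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Example: a hypothetical run of 1,000 optimizer steps
schedule = [linear_warmup_decay_lr(s, 1000) for s in (0, 50, 100, 550, 1000)]
```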

The DPO phase was carried out on the ultrafeedback dataset and [cai-conversation-harmless](https://huggingface.co/datasets/HuggingFaceH4/cai-conversation-harmless) dataset, mixed with 10% of the data Google-translated. It was trained with a global batch size of 32 for three epochs. A linear decay learning rate of 5e-7, 10% warmup, and β = 0.1 as the regularization factor for DPO were used.
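
To illustrate the role of β, the sketch below computes the standard DPO loss for a single preference pair from summed log-probabilities under the policy and the frozen reference model. It is an illustration of the published DPO objective, not the training code used for this model; the function and argument names are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under
    the policy or the reference model. beta scales how strongly the
    policy is pushed away from the reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a loss of log(2)
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
```

The loss falls below log(2) once the policy raises the chosen response's likelihood relative to the rejected one, compared with the reference model.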

Tokenizer Details

The vocabulary of the base Llama model was extended from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
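
The idea of adding "up to 25,000 non-overlapping tokens" can be sketched as follows. This is only an illustration of the bookkeeping: how the candidate Turkish tokens were actually chosen and ordered is not described in this model card, and `extend_vocab` is a hypothetical helper, not a tokenizer API.

```python
def extend_vocab(base_vocab, candidate_tokens, max_new_tokens=25_000):
    """Append candidates that are not already in the base vocabulary.

    Hypothetical helper: skips tokens that overlap with the base
    vocabulary (or repeat earlier candidates) and stops once
    `max_new_tokens` tokens have been added.
    """
    seen = set(base_vocab)
    extended = list(base_vocab)
    added = 0
    for token in candidate_tokens:
        if added >= max_new_tokens:
            break
        if token in seen:
            continue  # overlapping token, already covered
        seen.add(token)
        extended.append(token)
        added += 1
    return extended


# Synthetic stand-ins for a 32,000-token base vocabulary and new-language candidates
base = [f"base_{i}" for i in range(32_000)]
candidates = [f"tr_{i}" for i in range(30_000)]
vocab = extend_vocab(base, candidates)
```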

Evaluation

For evaluation results, refer to our paper: SambaLingo: Teaching Large Language Models New Languages

Uses

Direct Use

Use of this model is governed by Meta's Llama 2 Community License Agreement. Review and accept the license before downloading the model weights.

Out-of-Scope Use

SambaLingo should NOT be used for:

Mission-critical applications
Applications involving the safety of others
Making highly important decisions

Bias, Risks, and Limitations

Like all LLMs, SambaLingo has certain limitations:

Hallucination: The model may sometimes generate responses with plausible-sounding but factually incorrect or irrelevant information.
Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting output coherence and understandability.
Repetition: The model may produce repetitive phrases or sentences, resulting in less engaging and informative responses.
Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
Toxicity: The model could inadvertently generate responses with inappropriate or harmful content.

📄 License

This model uses the llama2 license.

Acknowledgments

We are deeply grateful to the open-source AI community; this project would not have been possible without open source. SambaNova supports the open-source community and aims to actively contribute to this initiative.

Special thanks to the following groups:

Meta for open-sourcing Llama 2 and the FLORES-200 dataset
Nguyen et al. for open-sourcing the CulturaX dataset
CohereAI for releasing AYA-101 and open-sourcing a multilingual instruction tuning dataset
EleutherAI for their open-source evaluation framework
Hugging Face H4 team for open-sourcing the zephyr training recipe and alignment handbook repo

Cite SambaLingo

@misc{csaki2024sambalingo,
      title={SambaLingo: Teaching Large Language Models New Languages}, 
      author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
      year={2024},
      eprint={2404.05829},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
