🚀 CERE-LLAMA-3-8b-TR
This model is a fine-tuned version of the Llama 3 8B Large Language Model (LLM) tailored for Turkish. It was trained on high-quality Turkish instruction sets sourced from various open-source and internal resources. The Turkish instruction dataset was carefully annotated so that the model follows Turkish instructions accurately and systematically.
🚀 Quick Start
This fine-tuned LLM offers strong potential for Turkish language processing. It can handle tasks such as question answering and text generation; a minimal loading sketch follows.
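As a quick way to try the model, the sketch below uses the transformers pipeline API; the sampling settings and example prompt are illustrative assumptions, not published defaults:

```python
# Minimal sketch: load the model via the transformers pipeline API.
# The generation settings below are illustrative, not the authors' defaults.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Cerebrum/cere-llama-3-8b-tr",
    torch_dtype="auto",
    device_map="auto",
)

# Prompt means: "What is the capital of Turkey?"
print(generator("Türkiye'nin başkenti neresidir?", max_new_tokens=64)[0]["generated_text"])
```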
✨ Features
- Base Model: Built on the Llama 3 8B LLM, providing a strong foundation for language understanding and generation.
- Tokenizer Extension: Specifically extended for Turkish, enabling better handling of the Turkish language's unique characteristics.
- Training Dataset: Utilized cleaned Turkish raw data totaling 5 billion tokens plus custom Turkish instruction sets, ensuring high-quality training.
- Training Method: Initially trained with DoRA, then fine-tuned with LoRA, optimizing the model's performance (see the illustrative sketch after this list).
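Since the card does not publish the training recipe, the sketch below only illustrates what a DoRA-then-LoRA setup looks like with the PEFT library; every hyperparameter shown is an assumption, not the authors' configuration:

```python
# Illustrative sketch only: DoRA and LoRA adapter configs via the PEFT library.
# All hyperparameters here are common defaults, NOT the published recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Stage 1 (assumed): DoRA decomposes each weight update into magnitude and
# direction; in PEFT it is enabled with use_dora=True on a LoRA config.
dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, dora_config)
model.print_trainable_parameters()

# Stage 2 (assumed): after merging the stage-1 weights, a plain LoRA pass
# would reuse the same config with use_dora=False (omitted for brevity).
```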
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Type | A fine-tuned Llama 3 8B LLM for Turkish |
| Tokenizer Extension | Specifically extended for Turkish |
| Training Data | Cleaned Turkish raw data with 5 billion tokens, custom Turkish instruction sets |
| Training Method | Initially with DoRA, followed by fine-tuning with LoRA |
Benchmark Results
| Benchmark | Score |
|-----------|-------|
| Winogrande_tr | 56.16 |
| TruthfulQA_tr_v0.2 | 47.46 |
| Mmlu_tr_v0.2 | 46.46 |
| HellaSwag_tr_v0.2 | 48.87 |
| GSM8k_tr_v0.2 | 25.43 |
| Arc_tr_v0.2 | 41.97 |
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "Cerebrum/cere-llama-3-8b-tr",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Cerebrum/cere-llama-3-8b-tr")

# Prompt means: "How do you print 'Merhaba Dünya' to the screen in Python?"
prompt = "Python'da ekrana 'Merhaba Dünya' nasıl yazılır?"
# System message means: "You are a helpful AI, produced by Cerebrum Tech,
# that tries to produce the best answer by following the given instructions."
messages = [
    {"role": "system", "content": "Sen, Cerebrum Tech tarafından üretilen ve verilen talimatları takip ederek en iyi cevabı üretmeye çalışan yardımcı bir yapay zekasın."},
    {"role": "user", "content": prompt}
]

# Render the chat template into a single prompt string, then tokenize it
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids and attention_mask together
    do_sample=True,  # required for temperature/top_k/top_p to take effect
    temperature=0.3,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
    repetition_penalty=1.0,
)

# Strip the prompt tokens so only the newly generated reply remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
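For interactive use, the same inputs can also be streamed token by token. A brief sketch with transformers' TextStreamer, added here for illustration rather than taken from the original example:

```python
# Optional: stream the reply token by token instead of decoding at the end.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, streamer=streamer, max_new_tokens=512)
```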
📄 License
This model is released under the Llama 3 license.