Bahasa-4b Model Report
Bahasa-4b is a model fine-tuned from Qwen-4b on high-quality Indonesian text. It shows strong performance across a range of Indonesian NLP tasks.
Quick Start
To use the Bahasa-4b model, refer to the following code example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # device the input tensors are moved to

# Load the chat model; device_map="auto" places the weights automatically
model = AutoModelForCausalLM.from_pretrained(
    "Bahasalab/Bahasa-4b-chat-v2",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Bahasalab/Bahasa-4b-chat-v2")

# System prompt: "You are a helpful assistant"; user asks: "who are you"
messages = [
    {"role": "system", "content": "Kamu adalah asisten yang membantu"},
    {"role": "user", "content": "kamu siapa"}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id
)

# Keep only the newly generated tokens, dropping the echoed prompt
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
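Generation behavior can be adjusted through the standard `generate` arguments from the `transformers` library; the sketch below enables sampling, with parameter values that are illustrative rather than recommendations from the model authors. It reuses `model`, `tokenizer`, and `model_inputs` from the example above.

```python
# Optional: sample instead of greedy decoding for more varied output
# (temperature/top_p values here are illustrative, not tuned for Bahasa-4b)
generated_ids = model.generate(
    input_ids=model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
```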
Features
- Bahasa-4b continues pretraining from Qwen-4b on 10 billion tokens of high-quality Indonesian text.
- The model outperforms some 4b models, and even some 7b models, on Indonesian tasks.
- It is suitable for NLP tasks that require understanding and generating Indonesian, such as question answering, sentiment analysis, and document summarization.
Documentation
Model Name: Bahasa-4b
Model Developers: Bahasa AI
Intended Use: This model is intended for NLP tasks that require understanding and generating Indonesian. It is suitable for applications such as question answering, sentiment analysis, and document summarization; a sketch of how to drive these tasks through the chat interface follows.
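As a minimal sketch of these use cases, the same chat pipeline from the Quick Start can be pointed at different tasks by changing the user message. The `chat` helper below is hypothetical (not part of the model's API), and it assumes `model` and `tokenizer` are loaded as shown above; the prompts are invented examples.

```python
def chat(model, tokenizer, user_message, device="cuda"):
    """Hypothetical helper: send one user message through the chat
    template and return the model's reply."""
    messages = [
        # "You are a helpful assistant"
        {"role": "system", "content": "Kamu adalah asisten yang membantu"},
        {"role": "user", "content": user_message},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(device)
    output_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=512,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return tokenizer.batch_decode(trimmed, skip_special_tokens=True)[0]

# Summarization: "Summarize the following text in one sentence: ..."
print(chat(model, tokenizer, "Ringkas teks berikut dalam satu kalimat: <teks dokumen>"))
# Sentiment analysis: "What is the sentiment of this review: ..."
print(chat(model, tokenizer, "Apa sentimen dari ulasan ini: <teks ulasan>"))
```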
Training Data: Bahasa-4b was trained on a 10-billion-token subset of Indonesian text selected from a collected pool of 100 billion tokens.
Benchmarks
The following table shows the performance of Bahasa-4b compared to the models Sailor_4b and Mistral-7B-v0.1 across several benchmarks:
| Dataset | Version | Metric | Mode | Sailor_4b | Bahasa-4b-hf | Mistral-7B-v0.1 |
|---|---|---|---|---|---|---|
| tydiqa-id | 0e9309 | EM | gen | 53.98 | 55.04 | 63.54 |
| tydiqa-id | 0e9309 | F1 | gen | 73.48 | 75.39 | 78.73 |
| xcopa-id | 36c11c | EM | ppl | 69.2 | 73.2 | 62.40 |
| xcopa-id | 36c11c | F1 | ppl | 69.2 | 73.2 | - |
| m3exam-id-ppl | ede415 | EM | ppl | 31.27 | 44.47 | 26.68 |
| belebele-id-ppl | 7fe030 | EM | ppl | 41.33 | 42.33 | 41.33 |
These results show that Bahasa-4b consistently outperforms Sailor_4b across Indonesian language tasks, with gains in both EM (Exact Match) and F1 scores, and that it is competitive with the larger Mistral-7B-v0.1 model.
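For reference, EM and F1 in the table follow the usual extractive-QA definitions. The sketch below uses simplified normalization and is not the exact evaluation harness behind these numbers:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 if the normalized strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, as used in extractive QA benchmarks like TyDiQA."""
    pred_tokens = prediction.strip().lower().split()
    ref_tokens = reference.strip().lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```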
License
This model is released under the "other" license category, named "tongyi-qianwen".