Cendol mT5-small Chat: an open-source large language model for single-turn conversations in Indonesian and regional languages

Cendol mT5-small Chat

Developed by indonlp

Cendol mT5-small Chat is a 300-million-parameter open-source generative large language model, fine-tuned for Indonesian, Sundanese, and Javanese instructions, suitable for single-turn dialogue scenarios.

Large Language Model

Transformers

Open Source License: Apache-2.0

Tags: Indonesian instruction tuning, Multi-task dialogue, Lightweight LLM

Downloads: 46

Release Date: 12/25/2023

Model Overview

This model is a chat version based on the mT5-small architecture, further optimized for general knowledge and human-centric prompts on top of Cendol-Instruct, supporting native Indonesian languages.

Model Features

Native language support

Specifically optimized for Indonesian and regional languages like Sundanese and Javanese

Efficient small model

At 300 million parameters, it balances performance and efficiency; Cendol models under 1B parameters are reported to be highly competitive with other 7B-parameter LLMs

Full parameter fine-tuning

Fully fine-tuned rather than LoRA-tuned; within the Cendol family only the 13B models use LoRA

Model Capabilities

Single-turn dialogue generation

General knowledge Q&A

Indonesian native language processing

Use Cases

Dialogue systems

Indonesian chatbot

Localized dialogue agent deployed in customer service systems or social applications

Outperforms similar open-source models in dialogue coherence in human evaluations

Educational applications

Native language learning assistant

Helps learners practice daily conversations in Sundanese/Javanese

🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages. It covers decoder-only and encoder-decoder transformer architectures, with parameter counts ranging from 300 million to 13 billion.

This is the repository for the 300M Cendol mT5-small Chat model. Links to other models can be found below.

📚 Documentation

Model Details

Note: Use of Cendol is licensed under the Apache 2.0 license

Overview

IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models with parameter counts ranging from 300 million to 13 billion.

Cendol models come in two instruction-tuned versions:

Cendol-Instruct: Instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
Cendol-Chat: Further instruction-tuned from Cendol-Instruct on general knowledge and human-centric prompts.

Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversations. Cendol outperforms open-source multilingual and region-specific LLMs by a large margin on most of the benchmarks we tested. The smaller Cendol models (<1B parameters) are highly competitive with other LLMs with 7B parameters.

Model Developers

IndoNLP

Variations

Cendol is based on two base models (mT5 and LLaMA-2), each with a range of parameter sizes. mT5-based Cendol includes 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) models. LLaMA-2-based Cendol includes 7B (LLaMA2-7B) and 13B (LLaMA2-13B) models. Both variants have Cendol-Instruct and Cendol-Chat versions. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
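The variant grid described above can be written down as a small data structure. This is only an illustrative sketch: the names `BASE_MODEL_SIZES`, `tuning_strategy`, and `list_variants` are hypothetical, and the sizes and the LoRA rule are copied from the paragraph above.

```python
# Sizes per base model, as listed in the Variations paragraph.
BASE_MODEL_SIZES = {
    "mT5": ["300M", "580M", "1.2B", "3.7B", "13B"],
    "LLaMA-2": ["7B", "13B"],
}
TUNING_VERSIONS = ["Instruct", "Chat"]


def tuning_strategy(size: str) -> str:
    """All 13B models are tuned with LoRA; every other size is fully fine-tuned."""
    return "LoRA" if size == "13B" else "full fine-tuning"


def list_variants():
    """Enumerate (base, size, version, strategy) for every combination."""
    return [
        (base, size, version, tuning_strategy(size))
        for base, sizes in BASE_MODEL_SIZES.items()
        for size in sizes
        for version in TUNING_VERSIONS
    ]


variants = list_variants()
lora_variants = [v for v in variants if v[3] == "LoRA"]
print(len(variants))       # 14 base/size/version combinations
print(len(lora_variants))  # 4 of them are LoRA-tuned
```

The vocabulary-adapted Indonesian-Vocab Instruct model is an additional release and is not part of this grid.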

In our paper, we show that adapting region-specific LLMs using LoRA is ineffective and inefficient. For example, the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models, with 3x slower training time and 4x slower inference time. As an alternative to LoRA, we demonstrate the benefits of vocabulary substitution as an effective and efficient strategy for region-specific adaptation. We improve the efficiency by 11.50% and 18.71% for training and inference times, respectively. In terms of evaluation performance, the model performs on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary-adapted model, denoted as Indonesian-Vocab Instruct.
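To make these ratios concrete, the sketch below applies them to a hypothetical baseline of 100 time units. It rests on two assumptions that the text does not spell out: that "improving efficiency by p%" means a p% reduction in wall-clock time, and that "3x slower" means three times the baseline time. The function names are illustrative.

```python
def time_after_gain(baseline: float, gain_pct: float) -> float:
    """Time left after an efficiency gain.

    ASSUMPTION: a gain of p% is read as a p% reduction in wall-clock time.
    """
    return baseline * (1.0 - gain_pct / 100.0)


def slowdown_time(baseline: float, factor: float) -> float:
    """Time for a run that is `factor` times slower than the baseline."""
    return baseline * factor


BASELINE = 100.0  # hypothetical time units, not a measured value

# Vocabulary substitution versus the original vocabulary.
print(f"vocab-adapted training:  {time_after_gain(BASELINE, 11.50):.2f}")
print(f"vocab-adapted inference: {time_after_gain(BASELINE, 18.71):.2f}")

# 13B (mT5-XXL, LoRA) versus 1.2B (mT5-large, fully fine-tuned).
print(f"13B LoRA training:  {slowdown_time(BASELINE, 3):.2f}")
print(f"13B LoRA inference: {slowdown_time(BASELINE, 4):.2f}")
```

Under these assumptions, a 100-unit training run drops to 88.50 units and a 100-unit inference run to 81.29 units with the adapted vocabulary, while the 13B LoRA model needs 300 and 400 units respectively.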

Input-Output

Model inputs and outputs are text only.
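Because the model takes text in and returns text, and is built on the encoder-decoder mT5 architecture, it can plausibly be loaded through the Hugging Face Transformers sequence-to-sequence classes. The following is a minimal sketch, not an official usage recipe. The repository ID `indonlp/cendol-mt5-small-chat`, the `generate_reply` helper, and the generation settings are assumptions and should be checked against the model's hosting page.

```python
# Minimal single-turn inference sketch (assumed setup, not an official recipe).
# Requires: pip install transformers torch sentencepiece

# ASSUMPTION: repository ID inferred from the developer and model names.
MODEL_ID = "indonlp/cendol-mt5-small-chat"


def generate_reply(prompt: str, max_new_tokens: int = 128) -> str:
    """Send one text prompt to the model and return one text reply."""
    # Imported lazily so this module loads even without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

    # Text in: tokenize the single-turn prompt into PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Default decoding settings; max_new_tokens is an illustrative value.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Text out: decode the first (and only) generated sequence.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

A call such as `generate_reply("Apa ibu kota Indonesia?")` ("What is the capital of Indonesia?") would download the weights on first use and return the model's answer as a plain string.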

Model Architecture

Property	Details
Model Type	Cendol models cover decoder-only and encoder-decoder transformer architectures, based on mT5 and LLaMA-2.
Training Data	Cendol Collection v1 for Instruct models; Cendol Collection v2 for Chat models.
Params	Ranging from 300M to 13B.
Tuning Strategy	13B-parameter models are tuned with LoRA; all others are fully fine-tuned.
Learning Rate	Varies from $1.0\times10^{-5}$ to $3.0\times10^{-4}$.

Model	Training Data	Params	Tuning Strategy	Learning Rate
Cendol mT5-small Instruct	Cendol Collection v1	300M	Fully fine-tuned	$3.0\times10^{-4}$
Cendol mT5-base Instruct	Cendol Collection v1	580M	Fully fine-tuned	$3.0\times10^{-4}$
Cendol mT5-large Instruct	Cendol Collection v1	1.2B	Fully fine-tuned	$3.0\times10^{-4}$
Cendol mT5-xl Instruct	Cendol Collection v1	3.7B	Fully fine-tuned	$3.0\times10^{-4}$
Cendol mT5-xxl Instruct	Cendol Collection v1	13B	LoRA	$2.0\times10^{-4}$
Cendol LLaMA-2 (7B) Instruct	Cendol Collection v1	7B	Fully fine-tuned	$2.0\times10^{-5}$
Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct	Cendol Collection v1	7B	Fully fine-tuned	$2.0\times10^{-5}$
Cendol LLaMA-2 (13B) Instruct	Cendol Collection v1	13B	LoRA	$2.0\times10^{-5}$
Cendol mT5-small Chat	Cendol Collection v2	300M	Fully fine-tuned	$3.0\times10^{-5}$
Cendol mT5-base Chat	Cendol Collection v2	580M	Fully fine-tuned	$3.0\times10^{-5}$
Cendol mT5-large Chat	Cendol Collection v2	1.2B	Fully fine-tuned	$3.0\times10^{-5}$
Cendol mT5-xl Chat	Cendol Collection v2	3.7B	Fully fine-tuned	$3.0\times10^{-5}$
Cendol mT5-xxl Chat	Cendol Collection v2	13B	LoRA	$2.0\times10^{-4}$
Cendol LLaMA-2 (7B) Chat	Cendol Collection v2	7B	Fully fine-tuned	$1.0\times10^{-5}$
Cendol LLaMA-2 (13B) Chat	Cendol Collection v2	13B	LoRA	$2.0\times10^{-4}$
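For scripting against these settings, the table can be encoded as records and queried. The sketch below copies the values from the rows above. The `Row` type, the shortened `"v1"`/`"v2"` and `"full"` labels, and the helper names are illustrative choices, not part of any Cendol tooling.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    data: str      # Cendol Collection version
    params: str
    strategy: str  # "full" = fully fine-tuned
    lr: float


TABLE = [
    Row("Cendol mT5-small Instruct", "v1", "300M", "full", 3.0e-4),
    Row("Cendol mT5-base Instruct", "v1", "580M", "full", 3.0e-4),
    Row("Cendol mT5-large Instruct", "v1", "1.2B", "full", 3.0e-4),
    Row("Cendol mT5-xl Instruct", "v1", "3.7B", "full", 3.0e-4),
    Row("Cendol mT5-xxl Instruct", "v1", "13B", "LoRA", 2.0e-4),
    Row("Cendol LLaMA-2 (7B) Instruct", "v1", "7B", "full", 2.0e-5),
    Row("Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct", "v1", "7B", "full", 2.0e-5),
    Row("Cendol LLaMA-2 (13B) Instruct", "v1", "13B", "LoRA", 2.0e-5),
    Row("Cendol mT5-small Chat", "v2", "300M", "full", 3.0e-5),
    Row("Cendol mT5-base Chat", "v2", "580M", "full", 3.0e-5),
    Row("Cendol mT5-large Chat", "v2", "1.2B", "full", 3.0e-5),
    Row("Cendol mT5-xl Chat", "v2", "3.7B", "full", 3.0e-5),
    Row("Cendol mT5-xxl Chat", "v2", "13B", "LoRA", 2.0e-4),
    Row("Cendol LLaMA-2 (7B) Chat", "v2", "7B", "full", 1.0e-5),
    Row("Cendol LLaMA-2 (13B) Chat", "v2", "13B", "LoRA", 2.0e-4),
]


def lookup(model: str) -> Row:
    """Return the table row for an exact model name."""
    for row in TABLE:
        if row.model == model:
            return row
    raise KeyError(model)


def lr_range():
    """Smallest and largest learning rate across all rows."""
    rates = [row.lr for row in TABLE]
    return min(rates), max(rates)


print(lookup("Cendol mT5-small Chat"))
print(lr_range())
```

For the model in this repository, the lookup returns Cendol Collection v2, 300M parameters, full fine-tuning, and a learning rate of 3.0e-5.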

Model Dates

Cendol was trained between October 2023 and January 2024.

License

Use of Cendol is licensed under the Apache 2.0 license

Research Paper

"Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"

Intended Use

Intended Use Cases

Cendol is intended for research use, especially on Indonesian languages. Cendol models are designed for single-turn instructions. Cendol-Instruct models can be used for task-specific instructions, while Cendol-Chat models can be used for general knowledge instructions.
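Since the models are designed for single-turn instructions, an application that keeps a chat history may want to pass only the latest user message to the model. The helper below is one possible way to do that. It is a hypothetical illustration, and the `(role, text)` history format is an assumption rather than anything defined by Cendol.

```python
def to_single_turn_prompt(history):
    """Return the most recent user message, dropping all earlier turns.

    `history` is a list of (role, text) pairs, oldest first.
    ASSUMPTION: roles are the strings "user" and "assistant".
    """
    for role, text in reversed(history):
        if role == "user":
            return text.strip()
    raise ValueError("history contains no user message")


history = [
    ("user", "Halo!"),
    ("assistant", "Halo, ada yang bisa saya bantu?"),
    ("user", "  Apa ibu kota Indonesia?  "),
]
print(to_single_turn_prompt(history))  # Apa ibu kota Indonesia?
```

Dropping earlier turns loses conversational context, which is the trade-off of staying within the single-turn design.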

Out-of-scope Uses

Use in any manner that violates applicable laws or regulations (including trade compliance laws).
Use in languages other than English and the Indonesian languages.
Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.

Evaluation Results

In this section, we report the results for the Cendol models on large-scale NLU and NLG benchmarks. For all the evaluations, we use our internal evaluations library.

NLU Performance

NLG Performance

Human Evaluation

Ethical Considerations and Limitations

Cendol is a new technology with risks. Testing to date has been in Indonesian and cannot cover all scenarios. As with all LLMs, Cendol’s potential outputs cannot be predicted in advance, and the model may produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Cendol, developers should perform safety testing and tuning tailored to their specific applications of the model.

Citation

If you are using any resources including Cendol models, code, or data, please cite the following articles:

@misc{cahyawijaya-etal-2024-cendol,
      title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages}, 
      author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
      year={2024},
      eprint={2404.06138},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{cahyawijaya-etal-2023-nusacrowd,
    title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Aji, Alham Fikri  and
      Winata, Genta  and
      Wilie, Bryan  and
      Koto, Fajri  and
      Mahendra, Rahmad  and
      Wibisono, Christian  and
      Romadhony, Ade  and
      Vincentio, Karissa  and
      Santoso, Jennifer  and
      Moeljadi, David  and
      Wirawan, Cahya  and
      Hudi, Frederikus  and
      Wicaksono, Muhammad Satrio  and
      Parmonangan, Ivan  and
      Alfina, Ika  and
      Putra, Ilham Firdausi  and
      Rahmadani, Samsul  and
      Oenang, Yulianti  and
      Septiandri, Ali  and
      Jaya, James  and
      Dhole, Kaustubh  and
      Suryani, Arie  and
      Putri, Rifki Afina  and
      Su, Dan  and
      Stevens, Keith  and
      Nityasya, Made Nindyatama  and
      Adilazuarda, Muhammad  and
      Hadiwijaya, Ryan  and
      Diandaru, Ryandito  and
      Yu, Tiezheng  and
      Ghifari, Vito  and
      Dai, Wenliang  and
      Xu, Yan  and
      Damapuspita, Dyah  and
      Wibowo, Haryo  and
      Tho, Cuk  and
      Karo Karo, Ichwanul  and
      Fatyanosa, Tirana  and
      Ji, Ziwei  and
      Neubig, Graham  and
      Baldwin, Timothy  and
      Ruder, Sebastian  and
      Fung, Pascale  and
      Sujaini, Herry  and
      Sakti, Sakriani  and
      Purwarianti, Ayu",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.868",
    doi = "10.18653/v1/2023.findings-acl.868",
    pages = "13745--13818"
}

Additionally, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following articles:

@inproceedings{cahyawijaya-etal-2023-nusawrites,
    title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Koto, Fajri  and
      Adhista, Dea  and
      Dave, Emmanuel  and
      Oktavianti, Sarah  and
      Akbar, Salsabil  and
      Lee, Jhonson  and
      Shadieq, Nuur  and
      Cenggoro, Tjeng Wawan  and
      Linuwih, Hanung  and
      Wilie, Bryan  and
      Muridan, Galih  and
      Winata, Genta  and
      Moeljadi, David  and
      Aji, Alham Fikri  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Park, Jong C.  and
      Arase, Yuki  and
      Hu, Baotian  and
      Lu, Wei  and
      Wijaya, Derry  and
      Purwarianti, Ayu  and
      Krisnadhi, Adila Alfa",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.60",
    doi = "10.18653/v1/2023.ijcnlp-main.60",
    pages = "921--945"
}

@inproceedings{winata-etal-2023-nusax,
    title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
    author = "Winata, Genta Indra  and
      Aji, Alham Fikri  and
      Cahyawijaya, Samuel  and
      Mahendra, Rahmad  and
      Koto, Fajri  and
      Romadhony, Ade  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Fung, Pascale  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Sennrich, Rico  and
      Ruder, Sebastian",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.57",
    doi = "10.18653/v1/2023.eacl-main.57",
    pages = "815--834"
}

📄 License

Use of Cendol is licensed under the Apache 2.0 license
