🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages. It covers decoder-only and encoder-decoder transformer architectures, with parameter scales ranging from 300 million to 13 billion.
This is the repository for the 300M Cendol mT5-small Chat model. Links to other models can be found below.
📚 Documentation
Model Details
Note: Use of Cendol is licensed under the Apache 2.0 license.
Overview
IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters.
Cendol models come in two instruction-tuned versions:
- Cendol-Instruct: Instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
- Cendol-Chat: Further instruction-tuned from Cendol-Instruct on general knowledge and human-centric prompts.
Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversations. Cendol outperforms open-source multilingual and region-specific LLMs by a large margin on most of the benchmarks we tested. The smaller Cendol versions (<1B parameters) are highly competitive with other LLMs with 7B parameters.
Model Developers
IndoNLP
Variations
Cendol is built on two base models (mT5 and LLaMA-2), each with a range of parameter sizes. mT5-based Cendol includes 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) models. LLaMA-2-based Cendol includes 7B (LLaMA2-7B) and 13B (LLaMA2-13B) models. Both variants come in Cendol-Instruct and Cendol-Chat versions. All 13B parameter models are tuned with LoRA, while the others are fully fine-tuned.
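The variant list above can be written down as a small lookup structure. The sketch below is only an illustration of the listing in this card: the class and function names (`CendolVariant`, `architecture`, `tuning_strategy`) are hypothetical and are not part of any Cendol release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CendolVariant:
    family: str    # "mT5" or "LLaMA-2"
    backbone: str  # base checkpoint name as listed in this card
    size: str      # parameter-count label as listed in this card


# The seven backbone/size pairs listed in the Variations paragraph.
VARIANTS = (
    CendolVariant("mT5", "mT5-small", "300M"),
    CendolVariant("mT5", "mT5-base", "580M"),
    CendolVariant("mT5", "mT5-large", "1.2B"),
    CendolVariant("mT5", "mT5-XL", "3.7B"),
    CendolVariant("mT5", "mT5-XXL", "13B"),
    CendolVariant("LLaMA-2", "LLaMA2-7B", "7B"),
    CendolVariant("LLaMA-2", "LLaMA2-13B", "13B"),
)


def architecture(variant: CendolVariant) -> str:
    """mT5 is an encoder-decoder model; LLaMA-2 is decoder-only."""
    return "encoder-decoder" if variant.family == "mT5" else "decoder-only"


def tuning_strategy(variant: CendolVariant) -> str:
    """Per this card: 13B models use LoRA, all others are fully fine-tuned."""
    return "LoRA" if variant.size == "13B" else "full fine-tuning"


for v in VARIANTS:
    print(f"{v.backbone:<11} {v.size:<5} {architecture(v):<16} {tuning_strategy(v)}")
```

Each entry exists in both a Cendol-Instruct and a Cendol-Chat version, so the table describes backbones rather than individual checkpoints.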
In our paper, we show that adapting region-specific LLMs using LoRA is ineffective and inefficient. For example, the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models, with 3x slower training and 4x slower inference. As an alternative to LoRA, we demonstrate that vocabulary substitution is an effective and efficient strategy for region-specific adaptation. It improves training and inference efficiency by 11.50% and 18.71%, respectively, while performing on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary-adapted model, denoted Indonesian-Vocab Instruct.
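For readers unfamiliar with LoRA, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus a low-rank update of the form W + BA. It follows the general LoRA formulation only; the layer width and rank are hypothetical values chosen for illustration and are not the settings used to train Cendol.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA freezes W and trains B (d_out x rank) and A (rank x d_in) only."""
    return rank * (d_out + d_in)


# Hypothetical layer width and LoRA rank, for illustration only.
D = 4096
RANK = 8

full = full_trainable_params(D, D)
lora = lora_trainable_params(D, D, RANK)

print(full)         # 16777216
print(lora)         # 65536
print(lora / full)  # 0.00390625
```

A much smaller trainable-parameter count reduces optimizer memory, but the frozen weights still take part in every forward and backward pass, which is consistent with the observation above that the LoRA-tuned 13B models were not cheaper to train or run than the fully fine-tuned 1.2B models.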
Input - Output
Model input and output are text only.
Model Architecture
Property | Details |
---|---|
Model Type | Cendol models cover decoder-only and encoder-decoder transformer architectures, based on mT5 and LLaMA-2. |
Training Data | Cendol Collection v1 for Instruct models; Cendol Collection v2 for Chat models. |
Params | Ranging from 300M to 13B. |
Tuning Strategy | 13B parameter models are tuned with LoRA; others are fully fine-tuned. |
Learning Rate | Varies from $3.0\times10^{-5}$ to $3.0\times10^{-4}$. |
Model Dates
Cendol was trained between October 2023 and January 2024.
License
Use of Cendol is licensed under the Apache 2.0 license.
Research Paper
"Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"
Intended Use
Intended Use Cases
Cendol is intended for research use, especially on Indonesian languages. Cendol models are designed for single-turn instructions. Cendol-Instruct models can be used for task-specific instructions, while Cendol-Chat models can be used for general knowledge instructions.
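A single-turn call to an mT5-based Cendol model might look like the sketch below. It assumes the Hugging Face `transformers` library with PyTorch installed, and it assumes the checkpoint is published under the Hub ID `indonlp/cendol-mt5-small-chat`; that ID and the helper names `load_cendol` and `generate_reply` are assumptions for illustration, not something this card specifies.

```python
def load_cendol(model_id: str = "indonlp/cendol-mt5-small-chat"):
    """Load tokenizer and model. The default model_id is an assumed Hub ID."""
    # Imported lazily so the helpers can be defined without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate_reply(tokenizer, model, prompt: str, max_new_tokens: int = 128) -> str:
    """Run one single-turn instruction and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the checkpoint on first call):
# tokenizer, model = load_cendol()
# print(generate_reply(tokenizer, model, "Apa ibu kota Indonesia?"))
```

Because the models are designed for single-turn use, each call passes one self-contained instruction rather than a running conversation history. The LLaMA-2-based variants are decoder-only and would need `AutoModelForCausalLM` instead of `AutoModelForSeq2SeqLM`.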
Out-of-scope Uses
- Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- Use in languages other than English and Indonesian languages.
- Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.
Evaluation Results
In this section, we report the results for the Cendol models on large-scale NLU and NLG benchmarks. For all evaluations, we use our internal evaluation library.
NLU Performance
NLG Performance
Human evaluation
Ethical Considerations and Limitations
Cendol is a new technology that carries risks. Testing to date has been conducted in Indonesian and cannot cover all scenarios. As with all LLMs, Cendol's potential outputs cannot be predicted in advance, and the model may produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any application of Cendol, developers should perform safety testing and tuning tailored to their specific use of the model.
Citation
If you are using any resources including Cendol models, code, or data, please cite the following articles:
@misc{cahyawijaya-etal-2024-cendol,
title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
year={2024},
eprint={2404.06138},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Aji, Alham Fikri and
Winata, Genta and
Wilie, Bryan and
Koto, Fajri and
Mahendra, Rahmad and
Wibisono, Christian and
Romadhony, Ade and
Vincentio, Karissa and
Santoso, Jennifer and
Moeljadi, David and
Wirawan, Cahya and
Hudi, Frederikus and
Wicaksono, Muhammad Satrio and
Parmonangan, Ivan and
Alfina, Ika and
Putra, Ilham Firdausi and
Rahmadani, Samsul and
Oenang, Yulianti and
Septiandri, Ali and
Jaya, James and
Dhole, Kaustubh and
Suryani, Arie and
Putri, Rifki Afina and
Su, Dan and
Stevens, Keith and
Nityasya, Made Nindyatama and
Adilazuarda, Muhammad and
Hadiwijaya, Ryan and
Diandaru, Ryandito and
Yu, Tiezheng and
Ghifari, Vito and
Dai, Wenliang and
Xu, Yan and
Damapuspita, Dyah and
Wibowo, Haryo and
Tho, Cuk and
Karo Karo, Ichwanul and
Fatyanosa, Tirana and
Ji, Ziwei and
Neubig, Graham and
Baldwin, Timothy and
Ruder, Sebastian and
Fung, Pascale and
Sujaini, Herry and
Sakti, Sakriani and
Purwarianti, Ayu",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.868",
doi = "10.18653/v1/2023.findings-acl.868",
pages = "13745--13818"
}
Additionally, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following articles:
@inproceedings{cahyawijaya-etal-2023-nusawrites,
title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Koto, Fajri and
Adhista, Dea and
Dave, Emmanuel and
Oktavianti, Sarah and
Akbar, Salsabil and
Lee, Jhonson and
Shadieq, Nuur and
Cenggoro, Tjeng Wawan and
Linuwih, Hanung and
Wilie, Bryan and
Muridan, Galih and
Winata, Genta and
Moeljadi, David and
Aji, Alham Fikri and
Purwarianti, Ayu and
Fung, Pascale",
editor = "Park, Jong C. and
Arase, Yuki and
Hu, Baotian and
Lu, Wei and
Wijaya, Derry and
Purwarianti, Ayu and
Krisnadhi, Adila Alfa",
booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = nov,
year = "2023",
address = "Nusa Dua, Bali",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.ijcnlp-main.60",
doi = "10.18653/v1/2023.ijcnlp-main.60",
pages = "921--945"
}
@inproceedings{winata-etal-2023-nusax,
title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
author = "Winata, Genta Indra and
Aji, Alham Fikri and
Cahyawijaya, Samuel and
Mahendra, Rahmad and
Koto, Fajri and
Romadhony, Ade and
Kurniawan, Kemal and
Moeljadi, David and
Prasojo, Radityo Eko and
Fung, Pascale and
Baldwin, Timothy and
Lau, Jey Han and
Sennrich, Rico and
Ruder, Sebastian",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.57",
doi = "10.18653/v1/2023.eacl-main.57",
pages = "815--834"
}

