🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Cendol is an open-source collection of fine-tuned generative large language models in Indonesian languages. It covers decoder-only and encoder-decoder transformer architectures, with parameter scales ranging from 300 million to 13 billion.
This is the repository for the 7B Cendol LLaMA-2 Chat model. Links to other models can be found below.
📚 Documentation
Model Details
Note: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
Overview
IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models with parameter scales ranging from 300 million to 13 billion.
Cendol models come in two instruction-tuned versions:
- Cendol-Instruct: instruction-tuned on task-specific NLP data, such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
- Cendol-Chat: continuously instruction-tuned from Cendol-Instruct on general knowledge and human-centric prompts.
Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversations. Cendol outperforms open-source multilingual and region-specific LLMs on most benchmarks we tested, by a large margin, and the smaller Cendol models (<1B parameters) are highly competitive with 7B-parameter LLMs.
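For example, a single-turn prompt can be sent to the 7B Chat model through the Hugging Face `transformers` library. The sketch below is illustrative only: the prompt wording and generation settings are our assumptions, not an official Cendol prompt template.

```python
# Minimal single-turn generation sketch for Cendol LLaMA-2 (7B) Chat.
# The prompt wording and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "indonlp/cendol-llama2-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumes a GPU with enough memory for 7B weights
    device_map="auto",
)

# A single-turn, human-centric prompt in Indonesian ("What is rendang?").
prompt = "Apa itu rendang?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```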
Model Developers: IndoNLP
Variations
Cendol is based on two base models (mT5 and LLaMA-2), each with a range of parameter sizes. mT5-based Cendol includes 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) models; LLaMA-2-based Cendol has 7B (LLaMA2-7B) and 13B (LLaMA2-13B) models. Both variants have Cendol-Instruct and Cendol-Chat variations. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
In our paper, we show that adapting region-specific LLMs using LoRA is ineffective and inefficient. For example, the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models, with 3x slower training and 4x slower inference. As an alternative to LoRA, we demonstrate vocabulary substitution as an effective and efficient strategy for region-specific adaptation: it improves training and inference efficiency by 11.50% and 18.71%, respectively, while performing on par with the Cendol model trained with the original vocabulary. We release this Indonesian vocabulary-adapted model as Indonesian-Vocab Instruct.
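A quick way to see the effect of vocabulary substitution is to compare how many tokens each tokenizer needs for the same Indonesian sentence, since shorter token sequences directly reduce training and inference cost. Below is a minimal sketch using Hugging Face `transformers`; it assumes access to the gated `meta-llama/Llama-2-7b-hf` repository as the source of the original LLaMA-2 vocabulary.

```python
# Rough sketch: compare token counts of the original LLaMA-2 vocabulary
# against the Indonesian-adapted Cendol vocabulary on the same sentence.
from transformers import AutoTokenizer

sentence = "Pemerintah mengumumkan kebijakan baru untuk meningkatkan kesejahteraan masyarakat."

# Assumption: you have access to the gated meta-llama repository.
original = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
adapted = AutoTokenizer.from_pretrained("indonlp/cendol-llama2-ind-vocab-inst")

print("Original LLaMA-2 tokens:", len(original(sentence)["input_ids"]))
print("Indonesian-vocab tokens:", len(adapted(sentence)["input_ids"]))
```

Fewer tokens per sentence is what drives the training- and inference-time gains reported above.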
Input-Output: Model inputs and outputs are text only.
Model Architecture
| Property | Details |
| --- | --- |
| Model Type | Decoder-only and encoder-decoder transformer models (mT5- and LLaMA-2-based) |
| Training Data | Cendol Collection v1 for Instruct models; Cendol Collection v2 for Chat models |
| Params | 300M to 13B |
| Tuning Strategy | Fully fine-tuned for most models; LoRA for the 13B models |
| Model | Training Data | Params | Tuning Strategy | LR |
| --- | --- | --- | --- | --- |
| [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mt5-small-inst) | Cendol Collection v1 | 300M | Fully fine-tuned | 3.0 × 10⁻⁴ |
| [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mt5-base-inst) | Cendol Collection v1 | 580M | Fully fine-tuned | 3.0 × 10⁻⁴ |
| [Cendol mT5-large Instruct](https://huggingface.co/indonlp/cendol-mt5-large-inst) | Cendol Collection v1 | 1.2B | Fully fine-tuned | 3.0 × 10⁻⁴ |
| [Cendol mT5-xl Instruct](https://huggingface.co/indonlp/cendol-mt5-xl-inst) | Cendol Collection v1 | 3.7B | Fully fine-tuned | 3.0 × 10⁻⁴ |
| [Cendol mT5-xxl Instruct](https://huggingface.co/indonlp/cendol-mt5-xxl-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 × 10⁻⁴ |
| [Cendol LLaMA-2 (7B) Instruct](https://huggingface.co/indonlp/cendol-llama2-7b-inst) | Cendol Collection v1 | 7B | Fully fine-tuned | 2.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct](https://huggingface.co/indonlp/cendol-llama2-ind-vocab-inst) | Cendol Collection v1 | 7B | Fully fine-tuned | 2.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (13B) Instruct](https://huggingface.co/indonlp/cendol-llama2-13b-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 × 10⁻⁵ |
| [Cendol mT5-small Chat](https://huggingface.co/indonlp/cendol-mt5-small-chat) | Cendol Collection v2 | 300M | Fully fine-tuned | 3.0 × 10⁻⁵ |
| [Cendol mT5-base Chat](https://huggingface.co/indonlp/cendol-mt5-base-chat) | Cendol Collection v2 | 580M | Fully fine-tuned | 3.0 × 10⁻⁵ |
| [Cendol mT5-large Chat](https://huggingface.co/indonlp/cendol-mt5-large-chat) | Cendol Collection v2 | 1.2B | Fully fine-tuned | 3.0 × 10⁻⁵ |
| [Cendol mT5-xl Chat](https://huggingface.co/indonlp/cendol-mt5-xl-chat) | Cendol Collection v2 | 3.7B | Fully fine-tuned | 3.0 × 10⁻⁵ |
| [Cendol mT5-xxl Chat](https://huggingface.co/indonlp/cendol-mt5-xxl-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 × 10⁻⁴ |
| [Cendol LLaMA-2 (7B) Chat](https://huggingface.co/indonlp/cendol-llama2-7b-chat) | Cendol Collection v2 | 7B | Fully fine-tuned | 1.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (13B) Chat](https://huggingface.co/indonlp/cendol-llama2-13b-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 × 10⁻⁴ |
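Note that the two model families are loaded through different `transformers` auto-classes: the mT5-based checkpoints are encoder-decoder models, while the LLaMA-2-based checkpoints are decoder-only. A minimal loading sketch, using repository ids from the table above:

```python
# Sketch: mT5-based Cendol models are encoder-decoder (seq2seq), while
# LLaMA-2-based Cendol models are decoder-only (causal LM).
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# Encoder-decoder variant (mT5-based Cendol).
t5_id = "indonlp/cendol-mt5-small-inst"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_id)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_id)

# Decoder-only variant (LLaMA-2-based Cendol).
llama_id = "indonlp/cendol-llama2-7b-inst"
llama_tokenizer = AutoTokenizer.from_pretrained(llama_id)
llama_model = AutoModelForCausalLM.from_pretrained(llama_id)
```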
Model Dates: Cendol was trained between October 2023 and January 2024.
License: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
Research Paper: ["Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"](https://arxiv.org/abs/2404.06138)
Intended Use
Intended Use Cases: Cendol is intended for research use, especially on Indonesian languages. Cendol models are designed for single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, while Cendol-Chat models can be used for general knowledge instructions.
Out-of-Scope Uses: Use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English and Indonesian languages; use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.
Evaluation Results
In this section, we report results for the Cendol models on large-scale NLU and NLG benchmarks. For all evaluations, we use our internal evaluation library.
NLU Performance
NLG Performance
Human evaluation
Ethical Considerations and Limitations
Cendol is a new technology that carries risks with use. Testing conducted to date has been in Indonesian, and it has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Cendol's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of Cendol, developers should perform safety testing and tuning tailored to their specific applications of the model.
Citation
If you use any of these resources, including the Cendol models, code, or data, please cite the following articles:
@misc{cahyawijaya-etal-2024-cendol,
  title = {Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
  author = {Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
  year = {2024},
  eprint = {2404.06138},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
  title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
  author = "Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung",
  editor = "Anna Rogers and Jordan Boyd-Graber and Naoaki Okazaki",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
  month = jul,
  year = "2023",
  address = "Toronto, Canada",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.findings-acl.868",
  doi = "10.18653/v1/2023.findings-acl.868",
  pages = "13745--13818"
}
Additionally, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following articles:
@inproceedings{cahyawijaya-etal-2023-nusawrites,
  title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
  author = "Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Dea Adhista and Emmanuel Dave and Sarah Oktavianti and Salsabil Akbar and Jhonson Lee and Nuur Shadieq and Tjeng Wawan Cenggoro and Hanung Linuwih and Bryan Wilie and Galih Muridan and Genta Indra Winata and David Moeljadi and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung",
  editor = "Jong C. Park and Yuki Arase and Baotian Hu and Wei Lu and Derry Wijaya and Ayu Purwarianti and Adila Alfa Krisnadhi",
  booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = nov,
  year = "2023",
  address = "Nusa Dua, Bali",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.ijcnlp-main.60",
  doi = "10.18653/v1/2023.ijcnlp-main.60",
  pages = "921--945"
}
@inproceedings{winata-etal-2023-nusax,
  title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
  author = "Genta Indra Winata and Alham Fikri Aji and Samuel Cahyawijaya and Rahmad Mahendra and Fajri Koto and Ade Romadhony and Kemal Kurniawan and David Moeljadi and Radityo Eko Prasojo and Pascale Fung and Timothy Baldwin and Jey Han Lau and Rico Sennrich and Sebastian Ruder",
  editor = "Andreas Vlachos and Isabelle Augenstein",
  booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.eacl-main.57",
  doi = "10.18653/v1/2023.eacl-main.57",
  pages = "815--834"
}
@inproceedings{aji-etal-2022-one,
  title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
  author = "Alham Fikri Aji and Genta Indra Winata and Fajri Koto and Samuel Cahyawijaya and Ade Romadhony and Rahmad Mahendra and Kemal Kurniawan and David Moeljadi and Radityo Eko Prasojo and Timothy Baldwin and Jey Han Lau and Sebastian Ruder",
  editor = "Smaranda Muresan and Preslav Nakov and Aline Villavicencio",
  booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = may,
  year = "2022",
  address = "Dublin, Ireland",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.acl-long.500",
  doi = "10.18653/v1/2022.acl-long.500",
  pages = "7226--7249"
}
@inproceedings{cahyawijaya-etal-2021-indonlg,
  title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
  author = "Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Khodra and Ayu Purwarianti and Pascale Fung",
  editor = "Marie-Francine Moens and Xuanjing Huang and Lucia Specia and Scott Wen-tau Yih",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2021",
  address = "Online and Punta Cana, Dominican Republic",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.emnlp-main.699",
  doi = "10.18653/v1/2021.emnlp-main.699",
  pages = "8875--8898"
}
@inproceedings{wilie-etal-2020-indonlu,
  title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
  author = "Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and Xiaohong Li and Zhi Yuan Lim and Sidik Soleman and Rahmad Mahendra and Pascale Fung and Syafri Bahar and Ayu Purwarianti",
  editor = "Kam-Fai Wong and Kevin Knight and Hua Wu",
  booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
  month = dec,
  year = "2020",
  address = "Suzhou, China",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.aacl-main.85",
  pages = "843--857"
}

