Cendol: open-source Llama2 7B chat model adapted for Indonesian-language generation

Cendol Llama2 7b Chat

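Developed by indonlp

Cendol is an open-source collection of generative large language models fine-tuned for Indonesian languages, covering multiple architectures and parameter scales.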

Large language model

Transformers

License: Apache-2.0 #Indonesian-optimized #multi-architecture #instruction-tuned

Released: 12/25/2023

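Model overview

Cendol is an open-source collection of generative large language models fine-tuned for Indonesian languages. It covers both decoder-only and encoder-decoder Transformer architectures, ranging from 300 million to 13 billion parameters. This model is the 7-billion-parameter Cendol LLaMA-2 chat model.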

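Model features

Multiple architectures and parameter scales

Built on two base model families, mT5 and LLaMA-2, and released at several parameter scales to suit different scenarios.

Two instruction-tuned variants

Cendol-Instruct is tuned on task-specific data, and Cendol-Chat is further tuned from it on general-knowledge and human-centric prompts. Both are designed for single-turn conversation.

Strong benchmark performance

On most of the benchmarks tested, Cendol outperforms open-source multilingual and region-specific large language models by a wide margin. The smaller versions (under 1 billion parameters) are competitive with other 7-billion-parameter models.

Efficiency strategy

The authors propose a vocabulary substitution strategy that, compared with LoRA tuning, improves training and inference time by 11.50% and 18.71% respectively, while keeping evaluation performance on par with a model trained on the original vocabulary.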

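Model capabilities

Indonesian text generation

Instruction following

Single-turn conversation

Natural language understanding

Natural language generation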

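Use cases

Research

Indonesian NLP research

Studying Indonesian natural language processing tasks such as text generation and instruction following.

Performs well on most of the benchmarks tested, outperforming other open-source models.

General-knowledge question answering

Indonesian general-knowledge question answering

Answering general-knowledge questions posed in Indonesian.

Performs well in human evaluation.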

🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

This repository hosts the 7-billion-parameter Cendol LLaMA-2 chat model. Links to the other models are listed below.

📚 Detailed documentation

Model details

Note: Use of Cendol is governed by the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
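Overview: Developed and publicly released by IndoNLP, Cendol is a family of pretrained and fine-tuned generative text models ranging from 300 million to 13 billion parameters.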
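Model developers: IndoNLP
Variants: The mT5-based models come in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) sizes; the LLaMA-2-based models come in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) sizes. Each is available as a Cendol-Instruct and a Cendol-Chat variant. The 13B models are tuned with LoRA; all others are fully fine-tuned.
Input and output: The models take text as input and produce text as output.
Model architectures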

| Model | Training data | Parameters | Tuning strategy | Learning rate |
|---|---|---|---|---|
| [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mT5-small-inst) | Cendol Collection v1 | 300M | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mT5-base-inst) | Cendol Collection v1 | 580M | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-large Instruct](https://huggingface.co/indonlp/cendol-mT5-large-inst) | Cendol Collection v1 | 1.2B | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-XL Instruct](https://huggingface.co/indonlp/cendol-mT5-XL-inst) | Cendol Collection v1 | 3.7B | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-XXL Instruct](https://huggingface.co/indonlp/cendol-mT5-XXL-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 × 10⁻⁴ |
| [Cendol LLaMA-2 (7B) Instruct](https://huggingface.co/indonlp/cendol-llama2-7B-inst) | Cendol Collection v1 | 7B | Full fine-tuning | 2.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct](https://huggingface.co/indonlp/cendol-llama2-ind-vocab-inst) | Cendol Collection v1 | 7B | Full fine-tuning | 2.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (13B) Instruct](https://huggingface.co/indonlp/cendol-llama2-13B-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 × 10⁻⁵ |
| [Cendol mT5-small Chat](https://huggingface.co/indonlp/cendol-mT5-small-chat) | Cendol Collection v2 | 300M | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-base Chat](https://huggingface.co/indonlp/cendol-mT5-base-chat) | Cendol Collection v2 | 580M | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-large Chat](https://huggingface.co/indonlp/cendol-mT5-large-chat) | Cendol Collection v2 | 1.2B | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-XL Chat](https://huggingface.co/indonlp/cendol-mT5-XL-chat) | Cendol Collection v2 | 3.7B | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-XXL Chat](https://huggingface.co/indonlp/cendol-mT5-XXL-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 × 10⁻⁴ |
| [Cendol LLaMA-2 (7B) Chat](https://huggingface.co/indonlp/cendol-llama2-7B-chat) | Cendol Collection v2 | 7B | Full fine-tuning | 1.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (13B) Chat](https://huggingface.co/indonlp/cendol-llama2-13B-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 × 10⁻⁴ |

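Training dates: Cendol was trained between October 2023 and January 2024.
License: Use of Cendol is governed by the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
Research paper: "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"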
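
The split between encoder-decoder (mT5) and decoder-only (LLaMA-2) variants matters when loading a checkpoint, because the two families use different model classes in Hugging Face Transformers. The following is a minimal sketch, not the authors' code: the `CendolVariant` record, the `VARIANTS` keys, and the helper names are hypothetical, only three rows of the model table are transcribed, and the repository ids are copied from the table links, so their exact casing should be checked on the Hub.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CendolVariant:
    """One row of the model table (hypothetical record type for illustration)."""
    repo_id: str          # Hub repository id, copied from the table link
    base: str             # "mt5" (encoder-decoder) or "llama2" (decoder-only)
    tuning: str           # "full" (full fine-tuning) or "lora"
    learning_rate: float  # learning rate listed in the table


# A few rows transcribed from the model table; this is not the full list.
VARIANTS = {
    "mt5-small-inst": CendolVariant(
        "indonlp/cendol-mT5-small-inst", "mt5", "full", 3.0e-4),
    "llama2-7b-chat": CendolVariant(
        "indonlp/cendol-llama2-7B-chat", "llama2", "full", 1.0e-5),
    "llama2-13b-chat": CendolVariant(
        "indonlp/cendol-llama2-13B-merged-chat", "llama2", "lora", 2.0e-4),
}


def auto_class_name(variant: CendolVariant) -> str:
    """Return the Transformers auto class that matches the architecture."""
    if variant.base == "mt5":
        return "AutoModelForSeq2SeqLM"   # encoder-decoder
    return "AutoModelForCausalLM"        # decoder-only


def load(variant: CendolVariant):
    """Download and instantiate the tokenizer and model for a variant."""
    import transformers  # imported lazily; requires `pip install transformers`

    model_cls = getattr(transformers, auto_class_name(variant))
    tokenizer = transformers.AutoTokenizer.from_pretrained(variant.repo_id)
    model = model_cls.from_pretrained(variant.repo_id)
    return tokenizer, model
```

Calling `load(VARIANTS["llama2-7b-chat"])` would fetch the checkpoint described on this page; the larger variants need correspondingly more memory.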

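Intended use

Intended use cases: Cendol is intended primarily for research, especially research on Indonesian languages. The Cendol-Instruct models can be used for task-specific instructions, and the Cendol-Chat models for general-knowledge instructions.
Out-of-scope uses: Use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English and Indonesian languages; use in any other way prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.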
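
A single-turn, general-knowledge request to the chat model could look roughly like the sketch below. It assumes a plain-text prompt with no special chat template (this page does not specify one), default greedy decoding, and a repository id copied from the model table link; `generate_reply`, `continuation_ids`, and `MODEL_ID` are hypothetical names introduced for illustration.

```python
# Repository id taken from the model table link; verify the casing on the Hub.
MODEL_ID = "indonlp/cendol-llama2-7B-chat"


def continuation_ids(output_ids, prompt_length):
    """Drop the echoed prompt tokens.

    A decoder-only model returns the prompt tokens followed by the newly
    generated tokens, so the reply starts at index `prompt_length`.
    """
    return output_ids[prompt_length:]


def generate_reply(prompt, model_id=MODEL_ID, max_new_tokens=128):
    """Generate one reply to one prompt (single-turn, no conversation history)."""
    # Imported lazily; requires `pip install transformers torch`.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)

    prompt_length = inputs["input_ids"].shape[1]
    reply_ids = continuation_ids(output[0], prompt_length)
    return tokenizer.decode(reply_ids, skip_special_tokens=True)
```

For example, `generate_reply("Apa ibu kota Indonesia?")` sends one Indonesian question and returns only the generated continuation. Because the models are designed for single-turn conversation, no earlier turns are passed in.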

Evaluation results

This section reports the results of the Cendol models on large-scale natural language understanding (NLU) and natural language generation (NLG) benchmarks. All evaluations use our internal evaluation library.

NLU performance

![NLU performance](https://github.com/IndoNLP/indo-t0/assets/2826602/7656f005-f261-4982-ad06-f18dc57d5e3b)

NLG performance

![NLG performance](https://github.com/IndoNLP/indo-t0/assets/2826602/4942caea-35df-44e1-a95b-53a027c6115f)

Human evaluation

![Human evaluation](https://github.com/IndoNLP/indo-t0/assets/2826602/6128257f-d36c-4dbb-8f6c-4b936bc2ea66)

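Ethical considerations and limitations

Cendol is a new technology that carries risks with use. Testing conducted to date has been in Indonesian and cannot cover all scenarios. For these reasons, as with all large language models, Cendol's potential outputs cannot be predicted in advance, and in some cases the model may produce inaccurate, biased, or otherwise objectionable responses to user prompts. Before deploying any application of Cendol, developers should therefore perform safety testing and tuning tailored to their specific application of the model.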

Citation

If you use any of these resources, including the Cendol models, code, or data, please cite the following articles:

@misc{cahyawijaya-etal-2024-cendol,
      title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
      author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
      year={2024},
      eprint={2404.06138},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{cahyawijaya-etal-2023-nusacrowd,
    title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Aji, Alham Fikri  and
      Winata, Genta  and
      Wilie, Bryan  and
      Koto, Fajri  and
      Mahendra, Rahmad  and
      Wibisono, Christian  and
      Romadhony, Ade  and
      Vincentio, Karissa  and
      Santoso, Jennifer  and
      Moeljadi, David  and
      Wirawan, Cahya  and
      Hudi, Frederikus  and
      Wicaksono, Muhammad Satrio  and
      Parmonangan, Ivan  and
      Alfina, Ika  and
      Putra, Ilham Firdausi  and
      Rahmadani, Samsul  and
      Oenang, Yulianti  and
      Septiandri, Ali  and
      Jaya, James  and
      Dhole, Kaustubh  and
      Suryani, Arie  and
      Putri, Rifki Afina  and
      Su, Dan  and
      Stevens, Keith  and
      Nityasya, Made Nindyatama  and
      Adilazuarda, Muhammad  and
      Hadiwijaya, Ryan  and
      Diandaru, Ryandito  and
      Yu, Tiezheng  and
      Ghifari, Vito  and
      Dai, Wenliang  and
      Xu, Yan  and
      Damapuspita, Dyah  and
      Wibowo, Haryo  and
      Tho, Cuk  and
      Karo Karo, Ichwanul  and
      Fatyanosa, Tirana  and
      Ji, Ziwei  and
      Neubig, Graham  and
      Baldwin, Timothy  and
      Ruder, Sebastian  and
      Fung, Pascale  and
      Sujaini, Herry  and
      Sakti, Sakriani  and
      Purwarianti, Ayu",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.868",
    doi = "10.18653/v1/2023.findings-acl.868",
    pages = "13745--13818"
}

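In addition, if you are inspired by our work on region-specific large language models for Indonesian and its local languages, please also consider citing the following articles:

@inproceedings{cahyawijaya-etal-2023-nusawrites,
    title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",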
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Koto, Fajri  and
      Adhista, Dea  and
      Dave, Emmanuel  and
      Oktavianti, Sarah  and
      Akbar, Salsabil  and
      Lee, Jhonson  and
      Shadieq, Nuur  and
      Cenggoro, Tjeng Wawan  and
      Linuwih, Hanung  and
      Wilie, Bryan  and
      Muridan, Galih  and
      Winata, Genta  and
      Moeljadi, David  and
      Aji, Alham Fikri  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Park, Jong C.  and
      Arase, Yuki  and
      Hu, Baotian  and
      Lu, Wei  and
      Wijaya, Derry  and
      Purwarianti, Ayu  and
      Krisnadhi, Adila Alfa",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.60",
    doi = "10.18653/v1/2023.ijcnlp-main.60",
    pages = "921--945"
}

@inproceedings{winata-etal-2023-nusax,
    title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
    author = "Winata, Genta Indra  and
      Aji, Alham Fikri  and
      Cahyawijaya, Samuel  and
      Mahendra, Rahmad  and
      Koto, Fajri  and
      Romadhony, Ade  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Fung, Pascale  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Sennrich, Rico  and
      Ruder, Sebastian",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.57",
    doi = "10.18653/v1/2023.eacl-main.57",
    pages = "815--834"
}

@inproceedings{aji-etal-2022-one,
    title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
    author = "Aji, Alham Fikri  and
      Winata, Genta Indra  and
      Koto, Fajri  and
      Cahyawijaya, Samuel  and
      Romadhony, Ade  and
      Mahendra, Rahmad  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Ruder, Sebastian",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.500",
    doi = "10.18653/v1/2022.acl-long.500",
    pages = "7226--7249"
}

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel  and
      Winata, Genta Indra  and
      Wilie, Bryan  and
      Vincentio, Karissa  and
      Li, Xiaohong  and
      Kuncoro, Adhiguna  and
      Ruder, Sebastian  and
      Lim, Zhi Yuan  and
      Bahar, Syafri  and
      Khodra, Masayu  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Moens, Marie-Francine  and
      Huang, Xuanjing  and
      Specia, Lucia  and
      Yih, Scott Wen-tau",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}

@inproceedings{wilie-etal-2020-indonlu,
    title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
    author = "Wilie, Bryan  and
      Vincentio, Karissa  and
      Winata, Genta Indra  and
      Cahyawijaya, Samuel  and
      Li, Xiaohong  and
      Lim, Zhi Yuan  and
      Soleman, Sidik  and
      Mahendra, Rahmad  and
      Fung, Pascale  and
      Bahar, Syafri  and
      Purwarianti, Ayu",
    editor = "Wong, Kam-Fai  and
      Knight, Kevin  and
      Wu, Hua",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.aacl-main.85",
    pages = "843--857"
}