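🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages. It covers both decoder-only and encoder-decoder Transformer architectures, ranging in scale from 300 million to 13 billion parameters.
This is the repository for the 300M Cendol mT5-small Chat model. Links to the other models are listed below.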
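✨ Key Features
- Multiple architectures: covers both decoder-only and encoder-decoder Transformer architectures.
- Multiple scales: from 300 million to 13 billion parameters, to suit different compute budgets.
- Two instruction-tuned versions: Cendol-Instruct for task-specific instructions and Cendol-Chat for general-knowledge instructions.
- Strong performance: on most of the benchmarks tested, Cendol outperforms open-source multilingual and region-specific LLMs by a significant margin, and the smaller versions (under 1 billion parameters) are competitive with other LLMs of 7 billion parameters.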
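📚 Documentation
Model Details
- Note: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
- Overview: IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters. Cendol models come in two instruction-tuned versions:
  - Cendol-Instruct: instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
  - Cendol-Chat: continuously instruction-tuned from Cendol-Instruct on general-knowledge and human-centric prompts.
Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversation. On most of the benchmarks we tested, Cendol outperforms open-source multilingual and region-specific LLMs by a significant margin, and the smaller versions of Cendol (under 1 billion parameters) are competitive with other LLMs of 7 billion parameters.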
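- Model developers: IndoNLP
- Model variants: Cendol is built on two base models (mT5 and LLaMA-2), each available in several sizes. The mT5-based Cendol models come in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) parameter versions; the LLaMA-2-based Cendol models come in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) parameter versions. Both variants are released as Cendol-Instruct and Cendol-Chat. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
In our paper, we show that adapting region-specific LLMs with LoRA is inefficient: the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models while being 3x slower to train and 4x slower at inference. As an alternative to LoRA, we show the benefits of vocabulary substitution as an effective and efficient strategy for region-specific adaptation, improving training and inference time by 11.50% and 18.71%, respectively. In terms of evaluation performance, we also show that the resulting model performs on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary-adapted model, denoted as Indonesian-Vocab Instruct.
- Input and output: the models take text as input and generate text only.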
- Model architecture

| Property | Details |
|---|---|
| Model type | [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mt5-small-inst), [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mt5-base-inst), and other model variants |
| Training data | Cendol Collection v1 or Cendol Collection v2 |
| Parameters | 300M, 580M, 1.2B, 3.7B, 7B, 13B |
| Tuning strategy | Fully fine-tuned or LoRA |
| Learning rate | 3.0 × 10⁻⁴, 2.0 × 10⁻⁴, 2.0 × 10⁻⁵, 3.0 × 10⁻⁵, 1.0 × 10⁻⁵, among others |
- Model dates: Cendol was trained between October 2023 and January 2024.
- License: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
- Research paper: "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"
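The variant list above can be captured in a small lookup table. The following is only a sketch: the parameter counts and the LoRA rule (all 13B models use LoRA, the rest are fully fine-tuned) come from this card, but the `CendolVariant` class, the `repo_id` helper, and the repository naming pattern are assumptions. Only the `indonlp/cendol-mt5-small-inst` and `indonlp/cendol-mt5-base-inst` IDs appear in this card, so any other ID produced by the helper should be verified on the Hugging Face Hub before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CendolVariant:
    base: str    # backbone family: "mt5" or "llama2"
    size: str    # size tag as assumed to appear in the repo name
    params: str  # parameter count as stated in the model card
    tuning: str  # "LoRA" for 13B models, "fully fine-tuned" otherwise


def _tuning_for(params: str) -> str:
    # The card states that all 13B models use LoRA and the rest are fully fine-tuned.
    return "LoRA" if params == "13B" else "fully fine-tuned"


_SIZES = [
    ("mt5", "small", "300M"),
    ("mt5", "base", "580M"),
    ("mt5", "large", "1.2B"),
    ("mt5", "xl", "3.7B"),
    ("mt5", "xxl", "13B"),
    ("llama2", "7b", "7B"),
    ("llama2", "13b", "13B"),
]

VARIANTS = [CendolVariant(b, s, p, _tuning_for(p)) for b, s, p in _SIZES]


def repo_id(variant: CendolVariant, flavor: str) -> str:
    """Build a Hub repo ID. ASSUMPTION: the naming pattern is extrapolated
    from the two '-inst' links in this card and may not hold for every model."""
    if flavor not in ("inst", "chat"):
        raise ValueError("flavor must be 'inst' or 'chat'")
    return f"indonlp/cendol-{variant.base}-{variant.size}-{flavor}"
```

For example, `repo_id(VARIANTS[0], "inst")` reproduces the mT5-small Instruct ID linked in the table above.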
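Intended Use
- Intended use cases: Cendol is intended primarily for research, especially research on Indonesian languages. Cendol models are designed for single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, and Cendol-Chat models for general-knowledge instructions.
- Out-of-scope uses: use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English and Indonesian languages; use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.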
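Single-turn use of the chat model might look like the sketch below. It assumes the Hugging Face `transformers` library (with `torch` and `sentencepiece` installed) and assumes the model is published as `indonlp/cendol-mt5-small-chat`; that ID is inferred from the Instruct links in this card rather than stated in it. The helper names `normalize_prompt` and `generate_reply` and the default `max_new_tokens` value are illustrative, not part of any official API.

```python
def normalize_prompt(prompt: str) -> str:
    """Collapse runs of whitespace and reject empty input (one prompt per call)."""
    text = " ".join(prompt.split())
    if not text:
        raise ValueError("prompt must not be empty")
    return text


def generate_reply(
    prompt: str,
    model_id: str = "indonlp/cendol-mt5-small-chat",  # ASSUMED repo ID, verify on the Hub
    max_new_tokens: int = 64,                         # illustrative default
) -> str:
    # Imported lazily so the helper above stays usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)  # mT5 is encoder-decoder

    inputs = tokenizer(normalize_prompt(prompt), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Downloads the model weights on first run.
    print(generate_reply("Apa ibu kota Indonesia?"))
```

Because the models are designed for single-turn instructions, the sketch sends one prompt per call and keeps no conversation history. The LLaMA-2-based variants are decoder-only and would need `AutoModelForCausalLM` instead.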
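Evaluation Results
In this section, we report results for the Cendol models on large-scale natural language understanding (NLU) and natural language generation (NLG) benchmarks. All evaluations use our internal evaluation library.
- NLU performance
- NLG performance
- Human evaluation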
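Ethical Considerations and Limitations
Cendol is a new technology that carries risks with use. Testing conducted to date has been in Indonesian and could not cover all scenarios. For these reasons, as with all LLMs, Cendol's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any application of Cendol, developers should perform safety testing and tuning tailored to their specific application.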
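As a starting point for the application-specific safety testing recommended above, a pre-deployment smoke test could be sketched as follows. This is not the authors' procedure: the `smoke_test` function is hypothetical, and substring matching against a blocklist is only a crude stand-in for a real safety evaluation with human review.

```python
from typing import Callable, Dict, Iterable, List


def smoke_test(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    blocked_terms: Iterable[str],
) -> List[Dict[str, object]]:
    """Run each prompt through `generate` and collect replies that contain
    any blocked term (case-insensitive substring match)."""
    blocked = [term.lower() for term in blocked_terms]
    findings: List[Dict[str, object]] = []
    for prompt in prompts:
        reply = generate(prompt)
        hits = [term for term in blocked if term in reply.lower()]
        if hits:
            findings.append({"prompt": prompt, "reply": reply, "hits": hits})
    return findings


if __name__ == "__main__":
    # Stub generator for illustration; replace with a call to the real model.
    def echo(prompt: str) -> str:
        return f"Jawaban untuk: {prompt}"

    print(smoke_test(echo, ["Halo", "Tes kata terlarang"], ["terlarang"]))
```

Passing the generator as a callable keeps the check independent of how the model is served, so the same prompt set can be rerun after any further tuning.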
Citation
If you use any resource from Cendol, including the models, code, or data, please cite the following articles:
@misc{cahyawijaya-etal-2024-cendol,
title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
year={2024},
eprint={2404.06138},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Aji, Alham Fikri and
Winata, Genta and
Wilie, Bryan and
Koto, Fajri and
Mahendra, Rahmad and
Wibisono, Christian and
Romadhony, Ade and
Vincentio, Karissa and
Santoso, Jennifer and
Moeljadi, David and
Wirawan, Cahya and
Hudi, Frederikus and
Wicaksono, Muhammad Satrio and
Parmonangan, Ivan and
Alfina, Ika and
Putra, Ilham Firdausi and
Rahmadani, Samsul and
Oenang, Yulianti and
Septiandri, Ali and
Jaya, James and
Dhole, Kaustubh and
Suryani, Arie and
Putri, Rifki Afina and
Su, Dan and
Stevens, Keith and
Nityasya, Made Nindyatama and
Adilazuarda, Muhammad and
Hadiwijaya, Ryan and
Diandaru, Ryandito and
Yu, Tiezheng and
Ghifari, Vito and
Dai, Wenliang and
Xu, Yan and
Damapuspita, Dyah and
Wibowo, Haryo and
Tho, Cuk and
Karo Karo, Ichwanul and
Fatyanosa, Tirana and
Ji, Ziwei and
Neubig, Graham and
Baldwin, Timothy and
Ruder, Sebastian and
Fung, Pascale and
Sujaini, Herry and
Sakti, Sakriani and
Purwarianti, Ayu",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.868",
doi = "10.18653/v1/2023.findings-acl.868",
pages = "13745--13818"
}
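Additionally, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following articles:
@inproceedings{cahyawijaya-etal-2023-nusawrites,
title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",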
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Koto, Fajri and
Adhista, Dea and
Dave, Emmanuel and
Oktavianti, Sarah and
Akbar, Salsabil and
Lee, Jhonson and
Shadieq, Nuur and
Cenggoro, Tjeng Wawan and
Linuwih, Hanung and
Wilie, Bryan and
Muridan, Galih and
Winata, Genta and
Moeljadi, David and
Aji, Alham Fikri and
Purwarianti, Ayu and
Fung, Pascale",
editor = "Park, Jong C. and
Arase, Yuki and
Hu, Baotian and
Lu, Wei and
Wijaya, Derry and
Purwarianti, Ayu and
Krisnadhi, Adila Alfa",
booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = nov,
year = "2023",
address = "Nusa Dua, Bali",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.ijcnlp-main.60",
doi = "10.18653/v1/2023.ijcnlp-main.60",
pages = "921--945"
}
@inproceedings{winata-etal-2023-nusax,
title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
author = "Winata, Genta Indra and
Aji, Alham Fikri and
Cahyawijaya, Samuel and
Mahendra, Rahmad and
Koto, Fajri and
Romadhony, Ade and
Kurniawan, Kemal and
Moeljadi, David and
Prasojo, Radityo Eko and
Fung, Pascale and
Baldwin, Timothy and
Lau, Jey Han and
Sennrich, Rico and
Ruder, Sebastian",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.57",
doi = "10.18653/v1/2023.eacl-main.57",
pages = "815--834"
}
@inproceedings{aji-etal-2022-one,
title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
author = "Aji, Alham Fikri and
Winata, Genta Indra and
Koto, Fajri and
Cahyawijaya, Samuel and
Romadhony, Ade and
Mahendra, Rahmad and
Kurniawan, Kemal and
Moeljadi, David and
Prasojo, Radityo Eko and
Baldwin, Timothy and
Lau, Jey Han and
Ruder, Sebastian",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.500",
doi = "10.18653/v1/2022.acl-long.500",
pages = "7226--7249"
}
@inproceedings{cahyawijaya-etal-2021-indonlg,
title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
author = "Cahyawijaya, Samuel and
Winata, Genta Indra and
Wilie, Bryan and
Vincentio, Karissa and
Li, Xiaohong and
Kuncoro, Adhiguna and
Ruder, Sebastian and
Lim, Zhi Yuan and
Bahar, Syafri and
Khodra, Masayu and
Purwarianti, Ayu and
Fung, Pascale",
editor = "Moens, Marie-Francine and
Huang, Xuanjing and
Specia, Lucia and
Yih, Scott Wen-tau",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.699",
doi = "10.18653/v1/2021.emnlp-main.699",
pages = "8875--8898"
}
@inproceedings{wilie-etal-2020-indonlu,
title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
author = "Wilie, Bryan and
Vincentio, Karissa and
Winata, Genta Indra and
Cahyawijaya, Samuel and
Li, Xiaohong and
Lim, Zhi Yuan and
Soleman, Sidik and
Mahendra, Rahmad and
Fung, Pascale and
Bahar, Syafri and
Purwarianti, Ayu",
editor = "Wong, Kam-Fai and
Knight, Kevin and
Wu, Hua",
booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.aacl-main.85",
pages = "843--857"
}
📄 License
This project is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).



