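🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages. It covers both decoder-only and encoder-decoder Transformer architectures, ranging in scale from 300 million to 13 billion parameters.
This repository is for the 300M-parameter Cendol mT5-small Chat model. Links to the other models are listed below.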
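✨ Key features
- Multiple architectures: covers both decoder-only and encoder-decoder Transformer models.
- Multiple sizes: parameter counts range from 300 million to 13 billion to suit different scenarios.
- Two instruction-tuned versions: Cendol-Instruct for task-specific instructions and Cendol-Chat for general-knowledge instructions.
- Strong performance: on most of the benchmarks tested, Cendol outperforms open-source multilingual and region-specific LLMs by a large margin, and the smaller versions (under 1 billion parameters) are competitive with other LLMs of 7 billion parameters.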
📚 Detailed documentation
Model details
- Note: use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
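- Overview: IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters. Cendol models come in two instruction-tuned versions:
  - Cendol-Instruct: instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
  - Cendol-Chat: continuously instruction-tuned from Cendol-Instruct on general-knowledge and human-centric prompts.
Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversation. On most of the benchmarks we tested, Cendol outperforms open-source multilingual and region-specific LLMs by a large margin, and the smaller versions (under 1 billion parameters) are competitive with other LLMs of 7 billion parameters.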
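- Model developers: IndoNLP
- Variations: Cendol is built on two base models (mT5 and LLaMA-2), each available in several sizes. The mT5-based Cendol models come in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) parameter versions; the LLaMA-2-based Cendol models come in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) parameter versions. Both variations are available as Cendol-Instruct and Cendol-Chat. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
In the paper, we show that adapting region-specific LLMs with LoRA is inefficient: the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models, while having 3x slower training time and 4x slower inference time. As an alternative to LoRA, we show the benefits of vocabulary substitution as an effective and efficient strategy for region-specific adaptation, improving training and inference efficiency by 11.50% and 18.71%, respectively. In terms of evaluation performance, the resulting model performs on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary-adapted model, denoted Indonesian-Vocab Instruct.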
- Input and output: the models take text as input and produce text as output only.
- Model architecture

| Attribute | Details |
|---|---|
| Model type | [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mt5-small-inst), [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mt5-base-inst), and other model variants |
| Training data | Cendol Collection v1 or Cendol Collection v2 |
| Parameters | 300M, 580M, 1.2B, 3.7B, 7B, 13B, among others |
| Tuning strategy | Full fine-tuning or LoRA |
| Learning rate | $3.0 \times 10^{-4}$, $2.0 \times 10^{-4}$, $2.0 \times 10^{-5}$, $3.0 \times 10^{-5}$, $1.0 \times 10^{-5}$, among others |

- Model dates: Cendol was trained between October 2023 and January 2024.
- License: use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
- Research paper: "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"
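As a rough illustration of the variant list above, the following Python sketch records each backbone with its approximate size and derives the tuning strategy from the rule stated in this card (13B models use LoRA, all others are fully fine-tuned). The `CendolVariant` class, the `VARIANTS` table, and the `hub_id` helper are hypothetical names introduced here for illustration. The repository naming pattern is inferred from the two `-inst` links in the architecture table, and the `-chat` suffix is an assumption that should be checked against the Hugging Face Hub.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CendolVariant:
    base: str          # backbone name as written in this model card
    params: str        # approximate parameter count
    architecture: str  # "encoder-decoder" (mT5) or "decoder-only" (LLaMA-2)

    @property
    def tuning(self) -> str:
        # Rule stated in the card: all 13B models use LoRA, the rest are fully fine-tuned.
        return "LoRA" if self.params == "13B" else "full fine-tuning"


# Variants listed in the card, keyed by a lowercase slug (the slugs are assumptions).
VARIANTS = {
    "mt5-small": CendolVariant("mT5-small", "300M", "encoder-decoder"),
    "mt5-base": CendolVariant("mT5-base", "580M", "encoder-decoder"),
    "mt5-large": CendolVariant("mT5-large", "1.2B", "encoder-decoder"),
    "mt5-xl": CendolVariant("mT5-XL", "3.7B", "encoder-decoder"),
    "mt5-xxl": CendolVariant("mT5-XXL", "13B", "encoder-decoder"),
    "llama2-7b": CendolVariant("LLaMA2-7B", "7B", "decoder-only"),
    "llama2-13b": CendolVariant("LLaMA2-13B", "13B", "decoder-only"),
}


def hub_id(variant: str, flavour: str = "inst") -> str:
    """Build a Hub repository id following the pattern of the links in the card.

    Only the mT5-small and mT5-base "-inst" ids appear in the card itself;
    every other id produced here is an unverified guess.
    """
    if variant not in VARIANTS:
        raise KeyError(f"unknown Cendol variant: {variant!r}")
    if flavour not in ("inst", "chat"):
        raise ValueError("flavour must be 'inst' or 'chat'")
    return f"indonlp/cendol-{variant}-{flavour}"
```

For example, `hub_id("mt5-small")` reproduces the Instruct link given in the table, while `VARIANTS["mt5-xxl"].tuning` reports LoRA in line with the statement about 13B models.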
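Intended use
- Intended use cases: Cendol is intended primarily for research, especially research on Indonesian languages. Cendol models are designed for single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, and Cendol-Chat models can be used for general-knowledge instructions.
- Out-of-scope uses: use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English and Indonesian languages; use in any other way prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.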
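A minimal sketch of single-turn use with the Hugging Face `transformers` library follows. It assumes the chat checkpoint is published as `indonlp/cendol-mt5-small-chat`, an identifier inferred from the naming of the Instruct links in this card and not confirmed by it. The helper names `auto_class_name` and `generate_reply` are hypothetical, and the generation settings are placeholders, not values recommended by the authors. PyTorch is assumed to be installed alongside `transformers`.

```python
# Assumed repository id; verify it on the Hugging Face Hub before use.
ASSUMED_MODEL_ID = "indonlp/cendol-mt5-small-chat"


def auto_class_name(model_id: str) -> str:
    """Pick the transformers auto class that matches the backbone architecture."""
    name = model_id.lower()
    if "mt5" in name:
        return "AutoModelForSeq2SeqLM"  # mT5 is an encoder-decoder model
    if "llama" in name:
        return "AutoModelForCausalLM"   # LLaMA-2 is a decoder-only model
    raise ValueError(f"unrecognised Cendol backbone in {model_id!r}")


def generate_reply(prompt: str,
                   model_id: str = ASSUMED_MODEL_ID,
                   max_new_tokens: int = 64) -> str:
    """Send one single-turn prompt to the model and return the decoded reply."""
    # Imported lazily so auto_class_name stays usable without transformers installed.
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    model_cls = getattr(transformers, auto_class_name(model_id))
    model = model_cls.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)[0]
    if model_cls is transformers.AutoModelForCausalLM:
        # Decoder-only models return prompt + continuation; keep only the continuation.
        output_ids = output_ids[inputs["input_ids"].shape[1]:]
    return tokenizer.decode(output_ids, skip_special_tokens=True)
```

Calling `generate_reply("Apa ibu kota Indonesia?")` would download the checkpoint on first use and return the model's text reply. Because the models are designed for single-turn instructions, the sketch sends one prompt per call and keeps no conversation history. It has not been tested against the larger variants.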
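Evaluation results
In this section, we report the results of the Cendol models on large-scale natural language understanding (NLU) and natural language generation (NLG) benchmarks. All evaluations use our internal evaluation library.
- Natural language understanding (NLU) performance
- Natural language generation (NLG) performance
- Human evaluation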
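Ethical considerations and limitations
Cendol is a new technology that carries risks with use. Testing conducted to date has been in Indonesian and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Cendol's potential outputs cannot be predicted in advance, and in some cases the model may produce inaccurate, biased, or otherwise objectionable responses to user prompts. Before deploying any application of Cendol, developers should therefore perform safety testing and tuning tailored to their specific application.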
Citation
If you use any resource from Cendol, including the models, code, or data, please cite the following articles:
@misc{cahyawijaya-etal-2024-cendol,
title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
year={2024},
eprint={2404.06138},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Aji, Alham Fikri and
Winata, Genta and
Wilie, Bryan and
Koto, Fajri and
Mahendra, Rahmad and
Wibisono, Christian and
Romadhony, Ade and
Vincentio, Karissa and
Santoso, Jennifer and
Moeljadi, David and
Wirawan, Cahya and
Hudi, Frederikus and
Wicaksono, Muhammad Satrio and
Parmonangan, Ivan and
Alfina, Ika and
Putra, Ilham Firdausi and
Rahmadani, Samsul and
Oenang, Yulianti and
Septiandri, Ali and
Jaya, James and
Dhole, Kaustubh and
Suryani, Arie and
Putri, Rifki Afina and
Su, Dan and
Stevens, Keith and
Nityasya, Made Nindyatama and
Adilazuarda, Muhammad and
Hadiwijaya, Ryan and
Diandaru, Ryandito and
Yu, Tiezheng and
Ghifari, Vito and
Dai, Wenliang and
Xu, Yan and
Damapuspita, Dyah and
Wibowo, Haryo and
Tho, Cuk and
Karo Karo, Ichwanul and
Fatyanosa, Tirana and
Ji, Ziwei and
Neubig, Graham and
Baldwin, Timothy and
Ruder, Sebastian and
Fung, Pascale and
Sujaini, Herry and
Sakti, Sakriani and
Purwarianti, Ayu",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.868",
doi = "10.18653/v1/2023.findings-acl.868",
pages = "13745--13818"
}
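In addition, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following articles: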
@inproceedings{cahyawijaya-etal-2023-nusawrites,
title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Koto, Fajri and
Adhista, Dea and
Dave, Emmanuel and
Oktavianti, Sarah and
Akbar, Salsabil and
Lee, Jhonson and
Shadieq, Nuur and
Cenggoro, Tjeng Wawan and
Linuwih, Hanung and
Wilie, Bryan and
Muridan, Galih and
Winata, Genta and
Moeljadi, David and
Aji, Alham Fikri and
Purwarianti, Ayu and
Fung, Pascale",
editor = "Park, Jong C. and
Arase, Yuki and
Hu, Baotian and
Lu, Wei and
Wijaya, Derry and
Purwarianti, Ayu and
Krisnadhi, Adila Alfa",
booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = nov,
year = "2023",
address = "Nusa Dua, Bali",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.ijcnlp-main.60",
doi = "10.18653/v1/2023.ijcnlp-main.60",
pages = "921--945"
}
@inproceedings{winata-etal-2023-nusax,
title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
author = "Winata, Genta Indra and
Aji, Alham Fikri and
Cahyawijaya, Samuel and
Mahendra, Rahmad and
Koto, Fajri and
Romadhony, Ade and
Kurniawan, Kemal and
Moeljadi, David and
Prasojo, Radityo Eko and
Fung, Pascale and
Baldwin, Timothy and
Lau, Jey Han and
Sennrich, Rico and
Ruder, Sebastian",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.57",
doi = "10.18653/v1/2023.eacl-main.57",
pages = "815--834"
}
@inproceedings{aji-etal-2022-one,
title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
author = "Aji, Alham Fikri and
Winata, Genta Indra and
Koto, Fajri and
Cahyawijaya, Samuel and
Romadhony, Ade and
Mahendra, Rahmad and
Kurniawan, Kemal and
Moeljadi, David and
Prasojo, Radityo Eko and
Baldwin, Timothy and
Lau, Jey Han and
Ruder, Sebastian",
editor = "Muresan, Smaranda and
Nakov, Preslav and
Villavicencio, Aline",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.500",
doi = "10.18653/v1/2022.acl-long.500",
pages = "7226--7249"
}
@inproceedings{cahyawijaya-etal-2021-indonlg,
title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
author = "Cahyawijaya, Samuel and
Winata, Genta Indra and
Wilie, Bryan and
Vincentio, Karissa and
Li, Xiaohong and
Kuncoro, Adhiguna and
Ruder, Sebastian and
Lim, Zhi Yuan and
Bahar, Syafri and
Khodra, Masayu and
Purwarianti, Ayu and
Fung, Pascale",
editor = "Moens, Marie-Francine and
Huang, Xuanjing and
Specia, Lucia and
Yih, Scott Wen-tau",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.699",
doi = "10.18653/v1/2021.emnlp-main.699",
pages = "8875--8898"
}
@inproceedings{wilie-etal-2020-indonlu,
title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
author = "Wilie, Bryan and
Vincentio, Karissa and
Winata, Genta Indra and
Cahyawijaya, Samuel and
Li, Xiaohong and
Lim, Zhi Yuan and
Soleman, Sidik and
Mahendra, Rahmad and
Fung, Pascale and
Bahar, Syafri and
Purwarianti, Ayu",
editor = "Wong, Kam-Fai and
Knight, Kevin and
Wu, Hua",
booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
month = dec,
year = "2020",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.aacl-main.85",
pages = "843--857"
}
📄 License
This project is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).



