Cendol mT5-small Chat: an open-source large language model for single-turn dialogue in Indonesian and related local languages

Cendol Mt5 Small Chat

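Developed by IndoNLP

Cendol mT5-small Chat is an open-source generative large language model with 300 million parameters, instruction-tuned for Indonesian, Sundanese, and Javanese and intended for single-turn dialogue.

Large language model

Transformers

License: Apache-2.0 #IndonesianInstructionTuning #MultiTaskDialogue #LightweightLLM

Downloads: 46

Released: 12/25/2023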

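Model Overview

This model is the chat version built on the mT5-small architecture. Starting from Cendol-Instruct, it is further tuned on general-knowledge and human-centric prompts, and it supports Indonesian local languages.

Model Features

Local language support

Tuned specifically for Indonesian and for local languages such as Sundanese and Javanese

Efficient small model

At 300 million parameters it balances performance and efficiency; the authors report that the Cendol models under 1 billion parameters are competitive with other 7-billion-parameter models

Full-parameter fine-tuning

Uses full-parameter fine-tuning rather than LoRA, and trains faster than the larger models in the family

Model Capabilities

Single-turn dialogue generation

General-knowledge question answering

Indonesian local language processing

Use Cases

Dialogue systems

Indonesian chatbot

A localized conversational agent deployed in customer-service systems or social apps

Rated above comparable open-source models for dialogue coherence in human evaluation

Educational applications

Local language learning assistant

Helps learners practice everyday conversation in Sundanese or Javanese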

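🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages. It covers both decoder-only and encoder-decoder Transformer architectures, with scales ranging from 300 million to 13 billion parameters.

This repository holds the 300-million-parameter Cendol mT5-small Chat model. Links to other models appear in the model details below.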
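The following is a minimal sketch of single-turn inference with the Hugging Face `transformers` library. It assumes the checkpoint is published as `indonlp/cendol-mt5-small-chat`, which is inferred from the `-inst` naming used by the Instruct links in this card and should be verified on the Hub. The generation settings are illustrative and are not values from the authors.

```python
# Sketch only: the model ID and max_new_tokens below are assumptions,
# not values confirmed by the model card.
ASSUMED_MODEL_ID = "indonlp/cendol-mt5-small-chat"


def load_cendol(model_id: str = ASSUMED_MODEL_ID):
    """Load the tokenizer and encoder-decoder model (downloads weights on first call)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate_reply(prompt: str, tokenizer, model, max_new_tokens: int = 128) -> str:
    """Answer one prompt; no conversation history is passed (single-turn)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Usage would look like `tokenizer, model = load_cendol()` followed by `generate_reply("Apa ibu kota Indonesia?", tokenizer, model)`. Running it requires `transformers` and `torch`, and typically `sentencepiece` for the mT5 tokenizer.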

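✨ Key Features

- Multiple architectures: covers both decoder-only and encoder-decoder Transformer architectures.
- Multiple scales: parameter counts range from 300 million to 13 billion to suit different scenarios.
- Two instruction-tuned versions: Cendol-Instruct for task-specific instructions and Cendol-Chat for general-knowledge instructions.
- Strong performance: on most benchmarks tested, Cendol outperforms open-source multilingual and region-specific LLMs by a large margin, and the smaller versions (under 1 billion parameters) are competitive with other 7-billion-parameter models.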

📚 Detailed Documentation

Model Details

Note: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
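Overview: IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters. Cendol models come in two instruction-tuned versions: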
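- Cendol-Instruct: instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
- Cendol-Chat: continuously instruction-tuned from Cendol-Instruct on general-knowledge and human-centric prompts.

Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversation. On most benchmarks we tested, Cendol outperforms open-source multilingual and region-specific LLMs by a large margin, and the smaller versions of Cendol (under 1 billion parameters) are competitive with other LLMs of 7 billion parameters.

Model developers: IndoNLP
Model variations: Cendol is built on two base models (mT5 and LLaMA-2), each in several sizes. The mT5-based Cendol models come with 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) parameters; the LLaMA-2-based Cendol models come with 7B (LLaMA2-7B) and 13B (LLaMA2-13B) parameters. Both base models come in Cendol-Instruct and Cendol-Chat versions. All 13B models are tuned with LoRA, while the others are fully fine-tuned.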
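The variant list above can be captured as a small lookup table. The sketch below is illustrative: the class and function names are hypothetical, and the mapping to `transformers` Auto classes is an assumption based on mT5 being an encoder-decoder model and LLaMA-2 a decoder-only model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CendolVariant:
    base_model: str    # backbone name as listed in the model card
    params: str        # parameter count as listed in the model card
    architecture: str  # "encoder-decoder" (mT5) or "decoder-only" (LLaMA-2)


VARIANTS = [
    CendolVariant("mT5-small", "300M", "encoder-decoder"),
    CendolVariant("mT5-base", "580M", "encoder-decoder"),
    CendolVariant("mT5-large", "1.2B", "encoder-decoder"),
    CendolVariant("mT5-XL", "3.7B", "encoder-decoder"),
    CendolVariant("mT5-XXL", "13B", "encoder-decoder"),
    CendolVariant("LLaMA2-7B", "7B", "decoder-only"),
    CendolVariant("LLaMA2-13B", "13B", "decoder-only"),
]


def tuning_strategy(variant: CendolVariant) -> str:
    """All 13B models use LoRA; every other size is fully fine-tuned."""
    return "LoRA" if variant.params == "13B" else "full fine-tuning"


def auto_class_name(variant: CendolVariant) -> str:
    """Name of the transformers Auto class that matches the architecture."""
    if variant.architecture == "encoder-decoder":
        return "AutoModelForSeq2SeqLM"
    return "AutoModelForCausalLM"
```

For the model in this repository (mT5-small), the table yields full fine-tuning and the sequence-to-sequence loader.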

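In the paper, we show that adapting region-specific LLMs with LoRA is inefficient: the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models while being 3x slower to train and 4x slower at inference. As an alternative to LoRA, we show the benefits of vocabulary substitution as an effective and efficient strategy for region-specific adaptation, which improves training and inference time by 11.50% and 18.71%, respectively. In terms of evaluation performance, the adapted model performs on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary adapted model, denoted as Indonesian-Vocab Instruct.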
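To make these figures concrete, the sketch below applies them to a hypothetical baseline of 100 time units. It assumes the 11.50% and 18.71% improvements mean a proportional reduction in wall-clock time, which is one reading of the text rather than something the card states. Note that the two comparisons have different baselines: vocabulary substitution is compared with the original-vocabulary model, and the LoRA-tuned mT5-XXL with the mT5-large model.

```python
def time_after_reduction(baseline: float, reduction_pct: float) -> float:
    """Time left after cutting `baseline` by `reduction_pct` percent."""
    return baseline * (1.0 - reduction_pct / 100.0)


def time_after_slowdown(baseline: float, factor: float) -> float:
    """Time when a run is `factor` times slower than `baseline`."""
    return baseline * factor


# Figures reported in the text above.
VOCAB_TRAIN_GAIN_PCT = 11.50
VOCAB_INFER_GAIN_PCT = 18.71
LORA_XXL_TRAIN_SLOWDOWN = 3.0  # 13B mT5-XXL (LoRA) relative to 1.2B mT5-large
LORA_XXL_INFER_SLOWDOWN = 4.0

# Hypothetical baseline; 100 is reused for both comparisons only for readability.
baseline = 100.0
vocab_train = time_after_reduction(baseline, VOCAB_TRAIN_GAIN_PCT)
vocab_infer = time_after_reduction(baseline, VOCAB_INFER_GAIN_PCT)
lora_train = time_after_slowdown(baseline, LORA_XXL_TRAIN_SLOWDOWN)
lora_infer = time_after_slowdown(baseline, LORA_XXL_INFER_SLOWDOWN)

print(f"vocabulary substitution: train {vocab_train:.2f}, infer {vocab_infer:.2f}")
print(f"LoRA mT5-XXL vs mT5-large: train {lora_train:.2f}, infer {lora_infer:.2f}")
```

Under this reading, a 100-unit training run would take about 88.5 units after vocabulary substitution, while the LoRA-tuned mT5-XXL would need 300 units for what mT5-large does in 100.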

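Input/Output: The models take text as input and produce text as output.
Model architecture

| Attribute | Details |
|---|---|
| Model type | [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mt5-small-inst), [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mt5-base-inst), and other model variants |
| Training data | Cendol Collection v1 or Cendol Collection v2 |
| Parameters | 300M, 580M, 1.2B, 3.7B, 7B, 13B |
| Tuning strategy | Fully fine-tuned or LoRA |
| Learning rate | $3.0 \times 10^{-4}$, $2.0 \times 10^{-4}$, $2.0 \times 10^{-5}$, $3.0 \times 10^{-5}$, $1.0 \times 10^{-5}$, among others |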

Model dates: Cendol was trained between October 2023 and January 2024.
License: Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
Research paper: "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"

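Intended Use

Intended use cases: Cendol is intended mainly for research, especially research on Indonesian languages. Cendol models are designed for single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, and Cendol-Chat models can be used for general-knowledge instructions.
Out-of-scope uses: any use that violates applicable laws or regulations (including trade compliance laws); use in languages other than English and Indonesian languages; and any other use prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.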
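Because the models are designed for single-turn instructions, an application that keeps a chat history needs to decide what to send. One option, sketched below, is to forward only the latest user message. The `role`/`content` message layout and the function name are assumptions borrowed from common chat conventions, not part of Cendol.

```python
from typing import Dict, List


def last_user_prompt(messages: List[Dict[str, str]]) -> str:
    """Return the most recent non-empty user message, discarding earlier turns.

    Cendol models are designed for single-turn instructions, so this sketch
    forwards only the latest user message instead of the whole history.
    """
    for message in reversed(messages):
        if message.get("role") == "user":
            content = message.get("content", "").strip()
            if content:
                return content
    raise ValueError("no non-empty user message found")
```

The trade-off is that follow-up questions which depend on earlier turns lose their context, so an application may need to ask users to restate the full question.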

Evaluation Results

In this section, we report results for the Cendol models on large-scale natural language understanding (NLU) and natural language generation (NLG) benchmarks. All evaluations use our internal evaluation library.

NLU performance ![NLU Performance](https://github.com/IndoNLP/indo-t0/assets/2826602/7656f005-f261-4982-ad06-f18dc57d5e3b)
NLG performance ![NLG Performance](https://github.com/IndoNLP/indo-t0/assets/2826602/4942caea-35df-44e1-a95b-53a027c6115f)
Human evaluation ![Human Evaluation](https://github.com/IndoNLP/indo-t0/assets/2826602/6128257f-d36c-4dbb-8f6c-4b936bc2ea66)

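Ethical Considerations and Limitations

Cendol is a new technology that carries risks with use. Testing conducted to date has been in Indonesian and cannot cover all scenarios. As with all LLMs, Cendol's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Before deploying any application of Cendol, developers should therefore perform safety testing and tuning tailored to their specific application.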

Citation

If you use any of these resources, including the Cendol models, code, or data, please cite the following articles:

@misc{cahyawijaya-etal-2024-cendol,
      title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
      author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
      year={2024},
      eprint={2404.06138},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{cahyawijaya-etal-2023-nusacrowd,
    title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Aji, Alham Fikri  and
      Winata, Genta  and
      Wilie, Bryan  and
      Koto, Fajri  and
      Mahendra, Rahmad  and
      Wibisono, Christian  and
      Romadhony, Ade  and
      Vincentio, Karissa  and
      Santoso, Jennifer  and
      Moeljadi, David  and
      Wirawan, Cahya  and
      Hudi, Frederikus  and
      Wicaksono, Muhammad Satrio  and
      Parmonangan, Ivan  and
      Alfina, Ika  and
      Putra, Ilham Firdausi  and
      Rahmadani, Samsul  and
      Oenang, Yulianti  and
      Septiandri, Ali  and
      Jaya, James  and
      Dhole, Kaustubh  and
      Suryani, Arie  and
      Putri, Rifki Afina  and
      Su, Dan  and
      Stevens, Keith  and
      Nityasya, Made Nindyatama  and
      Adilazuarda, Muhammad  and
      Hadiwijaya, Ryan  and
      Diandaru, Ryandito  and
      Yu, Tiezheng  and
      Ghifari, Vito  and
      Dai, Wenliang  and
      Xu, Yan  and
      Damapuspita, Dyah  and
      Wibowo, Haryo  and
      Tho, Cuk  and
      Karo Karo, Ichwanul  and
      Fatyanosa, Tirana  and
      Ji, Ziwei  and
      Neubig, Graham  and
      Baldwin, Timothy  and
      Ruder, Sebastian  and
      Fung, Pascale  and
      Sujaini, Herry  and
      Sakti, Sakriani  and
      Purwarianti, Ayu",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.868",
    doi = "10.18653/v1/2023.findings-acl.868",
    pages = "13745--13818"
}

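Additionally, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following articles: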

@inproceedings{cahyawijaya-etal-2023-nusawrites,
    title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Koto, Fajri  and
      Adhista, Dea  and
      Dave, Emmanuel  and
      Oktavianti, Sarah  and
      Akbar, Salsabil  and
      Lee, Jhonson  and
      Shadieq, Nuur  and
      Cenggoro, Tjeng Wawan  and
      Linuwih, Hanung  and
      Wilie, Bryan  and
      Muridan, Galih  and
      Winata, Genta  and
      Moeljadi, David  and
      Aji, Alham Fikri  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Park, Jong C.  and
      Arase, Yuki  and
      Hu, Baotian  and
      Lu, Wei  and
      Wijaya, Derry  and
      Purwarianti, Ayu  and
      Krisnadhi, Adila Alfa",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.60",
    doi = "10.18653/v1/2023.ijcnlp-main.60",
    pages = "921--945"
}

@inproceedings{winata-etal-2023-nusax,
    title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
    author = "Winata, Genta Indra  and
      Aji, Alham Fikri  and
      Cahyawijaya, Samuel  and
      Mahendra, Rahmad  and
      Koto, Fajri  and
      Romadhony, Ade  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Fung, Pascale  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Sennrich, Rico  and
      Ruder, Sebastian",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.57",
    doi = "10.18653/v1/2023.eacl-main.57",
    pages = "815--834"
}

@inproceedings{aji-etal-2022-one,
    title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
    author = "Aji, Alham Fikri  and
      Winata, Genta Indra  and
      Koto, Fajri  and
      Cahyawijaya, Samuel  and
      Romadhony, Ade  and
      Mahendra, Rahmad  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Ruder, Sebastian",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.500",
    doi = "10.18653/v1/2022.acl-long.500",
    pages = "7226--7249"
}

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel  and
      Winata, Genta Indra  and
      Wilie, Bryan  and
      Vincentio, Karissa  and
      Li, Xiaohong  and
      Kuncoro, Adhiguna  and
      Ruder, Sebastian  and
      Lim, Zhi Yuan  and
      Bahar, Syafri  and
      Khodra, Masayu  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Moens, Marie-Francine  and
      Huang, Xuanjing  and
      Specia, Lucia  and
      Yih, Scott Wen-tau",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}

@inproceedings{wilie-etal-2020-indonlu,
    title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
    author = "Wilie, Bryan  and
      Vincentio, Karissa  and
      Winata, Genta Indra  and
      Cahyawijaya, Samuel  and
      Li, Xiaohong  and
      Lim, Zhi Yuan  and
      Soleman, Sidik  and
      Mahendra, Rahmad  and
      Fung, Pascale  and
      Bahar, Syafri  and
      Purwarianti, Ayu",
    editor = "Wong, Kam-Fai  and
      Knight, Kevin  and
      Wu, Hua",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.aacl-main.85",
    pages = "843--857"
}