Cendol: open-source Llama2 7B chat model tuned for Indonesian text generation

Cendol Llama2 7b Chat

Developed by indonlp

Large Language Model

Transformers

License: Apache-2.0 #Indonesian-optimized #multi-architecture #instruction-tuned

Downloads: 1,749

Released: 12/25/2023

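Model Overview

Cendol is an open-source collection of generative large language models fine-tuned for Indonesian languages. It covers both decoder-only and encoder-decoder Transformer architectures, at scales from 300 million to 13 billion parameters. This model is the 7-billion-parameter Cendol LLaMA-2 chat model.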

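Model Capabilities

Indonesian text generation

Instruction following

Single-turn dialogue

Natural language understanding

Natural language generation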

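Use Cases

Research

Indonesian NLP research

Supports research on Indonesian natural language processing tasks such as text generation and instruction understanding.

Performs well on most benchmarks, outperforming other open-source models.

General Knowledge Q&A

Indonesian general-knowledge question answering

Answers general-knowledge questions posed in Indonesian.

Performs well in human evaluation.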

🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

This repository hosts the 7-billion-parameter Cendol LLaMA-2 chat model. Links to the other models in the collection are listed below.

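✨ Key Features

Multiple architectures and scales: Built on two base models, mT5 and LLaMA-2, and released at several parameter scales to suit different scenarios.
Two instruction-tuned variants: Cendol-Instruct targets task-specific instructions, while Cendol-Chat is further tuned on general-knowledge and human-centric prompts. Both are intended for single-turn conversations.
Strong performance: On most benchmarks, Cendol substantially outperforms open-source multilingual and region-specific LLMs, and the smaller versions (under 1 billion parameters) are competitive with other 7-billion-parameter models.
Efficient adaptation: Proposes a vocabulary substitution strategy which, compared with LoRA tuning, improves training and inference time efficiency by 11.50% and 18.71% respectively, while matching the evaluation performance of a model trained with the original vocabulary.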
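As a rough aid to reading the efficiency figures above, the sketch below turns the 11.50% and 18.71% gains into projected wall-clock times. The text does not say whether the percentages describe shorter run time or higher throughput, so both readings are shown. The 100-hour baseline and the function names are illustrative assumptions, not measurements from the Cendol work.

```python
def time_if_runtime_reduced(baseline_hours: float, gain_pct: float) -> float:
    """Reading 1: the run takes gain_pct percent less time."""
    return baseline_hours * (1 - gain_pct / 100)


def time_if_throughput_increased(baseline_hours: float, gain_pct: float) -> float:
    """Reading 2: throughput rises by gain_pct percent, so time scales by 1 / (1 + gain)."""
    return baseline_hours / (1 + gain_pct / 100)


# Percentages quoted in the text; the baseline is a hypothetical round number.
GAINS = {"training": 11.50, "inference": 18.71}
BASELINE_HOURS = 100.0

for phase, gain in GAINS.items():
    reduced = time_if_runtime_reduced(BASELINE_HOURS, gain)
    faster = time_if_throughput_increased(BASELINE_HOURS, gain)
    print(f"{phase}: {reduced:.2f} h (shorter run time) or {faster:.2f} h (higher throughput)")
```

The two readings diverge more as the percentage grows, so it is worth checking the paper for the exact definition before quoting a projected saving.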

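📚 Documentation

Model Details

Note: Use of Cendol is governed by the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
Overview: IndoNLP developed and publicly released Cendol, a family of pretrained and fine-tuned generative text models ranging in scale from 560 million to 13 billion parameters.
Model developers: IndoNLP
Variants: The mT5-based models come in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) sizes; the LLaMA-2-based models come in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) sizes. Each is available as a Cendol-Instruct and a Cendol-Chat variant. The 13B models are tuned with LoRA; all others are fully fine-tuned.
Input/output: The models take text as input and produce text as output.
Model architecture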

| Model | Training data | Parameters | Tuning strategy | Learning rate |
|---|---|---|---|---|
| [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mT5-small-inst) | Cendol Collection v1 | 300M | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mT5-base-inst) | Cendol Collection v1 | 580M | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-large Instruct](https://huggingface.co/indonlp/cendol-mT5-large-inst) | Cendol Collection v1 | 1.2B | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-XL Instruct](https://huggingface.co/indonlp/cendol-mT5-XL-inst) | Cendol Collection v1 | 3.7B | Full fine-tuning | 3.0 × 10⁻⁴ |
| [Cendol mT5-XXL Instruct](https://huggingface.co/indonlp/cendol-mT5-XXL-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 × 10⁻⁴ |
| [Cendol LLaMA-2 (7B) Instruct](https://huggingface.co/indonlp/cendol-llama2-7B-inst) | Cendol Collection v1 | 7B | Full fine-tuning | 2.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct](https://huggingface.co/indonlp/cendol-llama2-ind-vocab-inst) | Cendol Collection v1 | 7B | Full fine-tuning | 2.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (13B) Instruct](https://huggingface.co/indonlp/cendol-llama2-13B-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 × 10⁻⁵ |
| [Cendol mT5-small Chat](https://huggingface.co/indonlp/cendol-mT5-small-chat) | Cendol Collection v2 | 300M | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-base Chat](https://huggingface.co/indonlp/cendol-mT5-base-chat) | Cendol Collection v2 | 580M | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-large Chat](https://huggingface.co/indonlp/cendol-mT5-large-chat) | Cendol Collection v2 | 1.2B | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-XL Chat](https://huggingface.co/indonlp/cendol-mT5-XL-chat) | Cendol Collection v2 | 3.7B | Full fine-tuning | 3.0 × 10⁻⁵ |
| [Cendol mT5-XXL Chat](https://huggingface.co/indonlp/cendol-mT5-XXL-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 × 10⁻⁴ |
| [Cendol LLaMA-2 (7B) Chat](https://huggingface.co/indonlp/cendol-llama2-7B-chat) | Cendol Collection v2 | 7B | Full fine-tuning | 1.0 × 10⁻⁵ |
| [Cendol LLaMA-2 (13B) Chat](https://huggingface.co/indonlp/cendol-llama2-13B-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 × 10⁻⁴ |
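For scripted experiments it can help to have the table above as data. The sketch below transcribes the rows into a small Python registry and adds a filter helper. The repo ids are copied from the links in the table; the `Variant` type, the short field codes (`"inst"`, `"full"`, and so on), and the `find` helper are hypothetical names introduced only for this illustration.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    repo_id: str          # Hugging Face repo id, as linked in the table
    family: str           # "mt5" (encoder-decoder) or "llama2" (decoder-only)
    kind: str             # "inst" (Cendol-Instruct) or "chat" (Cendol-Chat)
    params: str           # parameter count as listed in the table
    tuning: str           # "full" (full fine-tuning) or "lora"
    learning_rate: float


VARIANTS = [
    Variant("indonlp/cendol-mT5-small-inst", "mt5", "inst", "300M", "full", 3e-4),
    Variant("indonlp/cendol-mT5-base-inst", "mt5", "inst", "580M", "full", 3e-4),
    Variant("indonlp/cendol-mT5-large-inst", "mt5", "inst", "1.2B", "full", 3e-4),
    Variant("indonlp/cendol-mT5-XL-inst", "mt5", "inst", "3.7B", "full", 3e-4),
    Variant("indonlp/cendol-mT5-XXL-merged-inst", "mt5", "inst", "13B", "lora", 2e-4),
    Variant("indonlp/cendol-llama2-7B-inst", "llama2", "inst", "7B", "full", 2e-5),
    Variant("indonlp/cendol-llama2-ind-vocab-inst", "llama2", "inst", "7B", "full", 2e-5),
    Variant("indonlp/cendol-llama2-13B-merged-inst", "llama2", "inst", "13B", "lora", 2e-5),
    Variant("indonlp/cendol-mT5-small-chat", "mt5", "chat", "300M", "full", 3e-5),
    Variant("indonlp/cendol-mT5-base-chat", "mt5", "chat", "580M", "full", 3e-5),
    Variant("indonlp/cendol-mT5-large-chat", "mt5", "chat", "1.2B", "full", 3e-5),
    Variant("indonlp/cendol-mT5-XL-chat", "mt5", "chat", "3.7B", "full", 3e-5),
    Variant("indonlp/cendol-mT5-XXL-merged-chat", "mt5", "chat", "13B", "lora", 2e-4),
    Variant("indonlp/cendol-llama2-7B-chat", "llama2", "chat", "7B", "full", 1e-5),
    Variant("indonlp/cendol-llama2-13B-merged-chat", "llama2", "chat", "13B", "lora", 2e-4),
]


def find(family: Optional[str] = None, kind: Optional[str] = None,
         params: Optional[str] = None, tuning: Optional[str] = None) -> List[Variant]:
    """Return the variants matching every filter that is not None."""
    wanted = {"family": family, "kind": kind, "params": params, "tuning": tuning}
    return [
        v for v in VARIANTS
        if all(value is None or getattr(v, key) == value for key, value in wanted.items())
    ]


# Example: list every LoRA-tuned checkpoint with its learning rate.
for v in find(tuning="lora"):
    print(v.repo_id, v.params, v.learning_rate)
```

Filtering on `tuning="lora"` returns only 13B checkpoints, which matches the statement that the 13B models use LoRA and all others are fully fine-tuned.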

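Model dates: Cendol was trained between October 2023 and January 2024.
License: Use of Cendol is governed by the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).
Research paper: "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"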

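Intended Use

Intended use cases: Cendol is intended primarily for research, especially research on Indonesian languages. The Cendol-Instruct models can be used for task-specific instructions, and the Cendol-Chat models for general-knowledge instructions.
Out-of-scope uses: Use in any manner that violates applicable laws or regulations (including trade compliance laws); use in languages other than English and Indonesian languages; use in any other way prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.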
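For research use, a single-turn query might look like the sketch below, which relies on the Hugging Face `transformers` library. It is a minimal illustration under assumptions, not an official usage recipe: the model card does not specify a prompt template, so the prompt is passed as plain text, and the helper names `pick_auto_class_name` and `generate_reply` are hypothetical. The mapping from mT5 to a sequence-to-sequence class and from LLaMA-2 to a causal class follows from the encoder-decoder and decoder-only architectures described above.

```python
def pick_auto_class_name(repo_id: str) -> str:
    """Map a Cendol repo id to the transformers Auto class for its architecture."""
    name = repo_id.lower()
    if "mt5" in name:
        return "AutoModelForSeq2SeqLM"   # encoder-decoder checkpoints
    if "llama2" in name:
        return "AutoModelForCausalLM"    # decoder-only checkpoints
    raise ValueError(f"Unrecognized Cendol repo id: {repo_id}")


def generate_reply(repo_id: str, prompt: str, max_new_tokens: int = 128) -> str:
    """Single-turn generation: text in, text out. Downloads the checkpoint on first call."""
    import transformers  # imported lazily so the helper above works without it installed

    class_name = pick_auto_class_name(repo_id)
    tokenizer = transformers.AutoTokenizer.from_pretrained(repo_id)
    model = getattr(transformers, class_name).from_pretrained(repo_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if class_name == "AutoModelForCausalLM":
        # Decoder-only models echo the prompt, so keep only the newly generated tokens.
        output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()


# Example call (commented out because it downloads a 7B-parameter checkpoint):
# print(generate_reply("indonlp/cendol-llama2-7B-chat", "Apa ibu kota Indonesia?"))
```

Because both variants are described as single-turn, the sketch sends one prompt and returns one reply without keeping any conversation history.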

Evaluation Results

This section reports the results of the Cendol models on large-scale natural language understanding (NLU) and natural language generation (NLG) benchmarks. All evaluations use our internal evaluation library.

NLU Performance

![NLU performance](https://github.com/IndoNLP/indo-t0/assets/2826602/7656f005-f261-4982-ad06-f18dc57d5e3b)

NLG Performance

![NLG performance](https://github.com/IndoNLP/indo-t0/assets/2826602/4942caea-35df-44e1-a95b-53a027c6115f)

Human Evaluation

![Human evaluation](https://github.com/IndoNLP/indo-t0/assets/2826602/6128257f-d36c-4dbb-8f6c-4b936bc2ea66)

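Ethical Considerations and Limitations

Cendol is a new technology that carries risks with use. Testing conducted to date has been in Indonesian and cannot cover all scenarios. As with all large language models, Cendol's potential outputs cannot be predicted in advance, and in some cases the model may produce inaccurate, biased, or otherwise objectionable responses to user prompts. Before deploying any application of Cendol, developers should therefore perform safety testing and tuning tailored to their specific application.

Citation

If you use any of these resources, including the Cendol models, code, or data, please cite the following articles: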

@misc{cahyawijaya-etal-2024-cendol,
      title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
      author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
      year={2024},
      eprint={2404.06138},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{cahyawijaya-etal-2023-nusacrowd,
    title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Aji, Alham Fikri  and
      Winata, Genta  and
      Wilie, Bryan  and
      Koto, Fajri  and
      Mahendra, Rahmad  and
      Wibisono, Christian  and
      Romadhony, Ade  and
      Vincentio, Karissa  and
      Santoso, Jennifer  and
      Moeljadi, David  and
      Wirawan, Cahya  and
      Hudi, Frederikus  and
      Wicaksono, Muhammad Satrio  and
      Parmonangan, Ivan  and
      Alfina, Ika  and
      Putra, Ilham Firdausi  and
      Rahmadani, Samsul  and
      Oenang, Yulianti  and
      Septiandri, Ali  and
      Jaya, James  and
      Dhole, Kaustubh  and
      Suryani, Arie  and
      Putri, Rifki Afina  and
      Su, Dan  and
      Stevens, Keith  and
      Nityasya, Made Nindyatama  and
      Adilazuarda, Muhammad  and
      Hadiwijaya, Ryan  and
      Diandaru, Ryandito  and
      Yu, Tiezheng  and
      Ghifari, Vito  and
      Dai, Wenliang  and
      Xu, Yan  and
      Damapuspita, Dyah  and
      Wibowo, Haryo  and
      Tho, Cuk  and
      Karo Karo, Ichwanul  and
      Fatyanosa, Tirana  and
      Ji, Ziwei  and
      Neubig, Graham  and
      Baldwin, Timothy  and
      Ruder, Sebastian  and
      Fung, Pascale  and
      Sujaini, Herry  and
      Sakti, Sakriani  and
      Purwarianti, Ayu",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.868",
    doi = "10.18653/v1/2023.findings-acl.868",
    pages = "13745--13818"
}

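In addition, if you are inspired by our work on region-specific large language models for Indonesian and its local languages, please consider citing the following articles: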

@inproceedings{cahyawijaya-etal-2023-nusawrites,
    title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Koto, Fajri  and
      Adhista, Dea  and
      Dave, Emmanuel  and
      Oktavianti, Sarah  and
      Akbar, Salsabil  and
      Lee, Jhonson  and
      Shadieq, Nuur  and
      Cenggoro, Tjeng Wawan  and
      Linuwih, Hanung  and
      Wilie, Bryan  and
      Muridan, Galih  and
      Winata, Genta  and
      Moeljadi, David  and
      Aji, Alham Fikri  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Park, Jong C.  and
      Arase, Yuki  and
      Hu, Baotian  and
      Lu, Wei  and
      Wijaya, Derry  and
      Purwarianti, Ayu  and
      Krisnadhi, Adila Alfa",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.60",
    doi = "10.18653/v1/2023.ijcnlp-main.60",
    pages = "921--945"
}

@inproceedings{winata-etal-2023-nusax,
    title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
    author = "Winata, Genta Indra  and
      Aji, Alham Fikri  and
      Cahyawijaya, Samuel  and
      Mahendra, Rahmad  and
      Koto, Fajri  and
      Romadhony, Ade  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Fung, Pascale  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Sennrich, Rico  and
      Ruder, Sebastian",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.57",
    doi = "10.18653/v1/2023.eacl-main.57",
    pages = "815--834"
}

@inproceedings{aji-etal-2022-one,
    title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
    author = "Aji, Alham Fikri  and
      Winata, Genta Indra  and
      Koto, Fajri  and
      Cahyawijaya, Samuel  and
      Romadhony, Ade  and
      Mahendra, Rahmad  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Ruder, Sebastian",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.500",
    doi = "10.18653/v1/2022.acl-long.500",
    pages = "7226--7249"
}

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel  and
      Winata, Genta Indra  and
      Wilie, Bryan  and
      Vincentio, Karissa  and
      Li, Xiaohong  and
      Kuncoro, Adhiguna  and
      Ruder, Sebastian  and
      Lim, Zhi Yuan  and
      Bahar, Syafri  and
      Khodra, Masayu  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Moens, Marie-Francine  and
      Huang, Xuanjing  and
      Specia, Lucia  and
      Yih, Scott Wen-tau",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}

@inproceedings{wilie-etal-2020-indonlu,
    title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
    author = "Wilie, Bryan  and
      Vincentio, Karissa  and
      Winata, Genta Indra  and
      Cahyawijaya, Samuel  and
      Li, Xiaohong  and
      Lim, Zhi Yuan  and
      Soleman, Sidik  and
      Mahendra, Rahmad  and
      Fung, Pascale  and
      Bahar, Syafri  and
      Purwarianti, Ayu",
    editor = "Wong, Kam-Fai  and
      Knight, Kevin  and
      Wu, Hua",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.aacl-main.85",
    pages = "843--857"
}