Cendol - Llama2 7B Chat Model Open-Sourced, Tailored to Indonesian Text Generation

Cendol Llama2 7b Chat

Developed by indonlp

Cendol is a collection of open-source generative large language models fine-tuned for Indonesian languages, covering a range of architectures and parameter sizes.

Large Language Model

Transformers

Open-source license: Apache-2.0 #Indonesian-optimized #Multi-architecture #Instruction-tuned

Downloads: 1,749

Release date: 12/25/2023

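Model Overview

Cendol is a collection of open-source generative large language models fine-tuned for Indonesian languages. It covers two Transformer architectures, decoder-only and encoder-decoder, with parameter counts ranging from 300 million to 13 billion. This model is the 7-billion-parameter Cendol LLaMA-2 chat model.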

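Model Features

Multiple architectures and parameter sizes

Built on two base models, mT5 and LLaMA-2, and offered in a range of parameter sizes to suit different scenarios.

Multiple instruction-tuned versions

Includes Cendol-Instruct, tuned on task-specific data, and Cendol-Chat, continually tuned on general knowledge and human-centric prompts.

Strong performance

On most benchmarks tested, Cendol significantly outperforms open-source multilingual and region-specific large language models, and the smaller versions (under 1 billion parameters) are competitive with other models at 7 billion parameters.

Efficient adaptation strategy

Proposes vocabulary substitution as an alternative to LoRA tuning, improving training and inference time efficiency by 11.50% and 18.71%, respectively, while evaluation performance stays on par with the model trained with the original vocabulary.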

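Model Capabilities

Indonesian text generation

Instruction tuning

Single-turn dialogue

Natural language understanding

Natural language generation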

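Use Cases

Research

Indonesian natural language processing research

Used for research on Indonesian NLP tasks such as text generation and instruction understanding.

Shows strong results on most benchmarks tested, outperforming other open-source models.

General-knowledge question answering

Indonesian general-knowledge question answering

Used to answer general-knowledge questions in Indonesian.

Shows good results in human evaluation.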

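🚀 Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages

Cendol is a collection of open-source, fine-tuned generative large language models for Indonesian languages. It covers decoder-only and encoder-decoder transformer architectures, ranging in scale from 300 million to 13 billion parameters.

This is the repository for the 7B Cendol LLaMA-2 chat model. Links to the other models are listed below.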

📚 Detailed Documentation

Overview

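IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters.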

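Cendol models come in two instruction-tuned versions:

Cendol-Instruct: instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, and paraphrasing.
Cendol-Chat: continually instruction-tuned from Cendol-Instruct on general knowledge and human-centric prompts.

Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversation. Cendol outperforms open-source multilingual and region-specific LLMs by a large margin on most of the benchmarks we tested, and the smaller versions of Cendol (<1B parameters) are competitive with other LLMs at 7B parameters.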

Model Developers

IndoNLP

Variations

Cendol is derived from two base models, mT5 and LLaMA-2, each available in a range of parameter sizes. mT5-based Cendol comes in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) models, while LLaMA-2-based Cendol comes in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) models. Both families are available as Cendol-Instruct and Cendol-Chat variations. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
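
The split between the two base families matters when loading a checkpoint, because mT5 is an encoder-decoder model and LLaMA-2 is decoder-only. The following is a minimal sketch, assuming the Hugging Face `transformers` Auto classes are used for loading; the helper names `auto_class_name` and `tuning_strategy` are hypothetical and exist only for illustration.

```python
# Hypothetical helpers that encode the family layout described above.


def auto_class_name(base_model: str) -> str:
    """Return the name of the transformers Auto class suited to a base family."""
    base = base_model.lower()
    if base.startswith("mt5"):
        # mT5 is an encoder-decoder (sequence-to-sequence) architecture.
        return "AutoModelForSeq2SeqLM"
    if base.startswith("llama"):
        # LLaMA-2 is a decoder-only (causal) architecture.
        return "AutoModelForCausalLM"
    raise ValueError(f"unknown base model family: {base_model}")


def tuning_strategy(params_in_billions: float) -> str:
    """All 13B models are LoRA-tuned; the smaller ones are fully fine-tuned."""
    return "LoRA" if params_in_billions >= 13 else "full fine-tuning"


if __name__ == "__main__":
    for name, size in [("mT5-small", 0.3), ("mT5-XXL", 13), ("LLaMA2-7B", 7)]:
        print(name, auto_class_name(name), tuning_strategy(size))
```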

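In our paper, we show that adapting region-specific LLMs with LoRA is inefficient: the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models, with 3x slower training time and 4x slower inference time. As an alternative to LoRA, we present vocabulary substitution as an effective and efficient strategy for region-specific adaptation, improving training and inference time efficiency by **11.50%** and **18.71%**, respectively. In terms of evaluation performance, the vocabulary-adapted model performs on par with the Cendol model trained with the original vocabulary. We also release the Indonesian vocabulary-adapted model, Indonesian-Vocab Instruct.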
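
To make these figures concrete, the sketch below converts them into wall-clock times. It assumes, which the text does not state, that an efficiency gain of p% means a run takes p% less time and that "3x slower" means three times the duration; the 100-hour and 10-second baselines are made-up numbers used only for illustration.

```python
def time_after_gain(baseline: float, gain_percent: float) -> float:
    """Time left after a gain, read as a percentage reduction (assumption)."""
    return baseline * (1.0 - gain_percent / 100.0)


def time_after_slowdown(baseline: float, factor: float) -> float:
    """Time after a slowdown by the given multiplicative factor."""
    return baseline * factor


if __name__ == "__main__":
    train_hours = 100.0   # hypothetical baseline, not from the paper
    infer_seconds = 10.0  # hypothetical baseline, not from the paper

    # Vocabulary substitution: 11.50% (training) and 18.71% (inference).
    print(round(time_after_gain(train_hours, 11.50), 2))
    print(round(time_after_gain(infer_seconds, 18.71), 3))

    # LoRA-tuned 13B mT5-XXL vs. 1.2B mT5-large: 3x training, 4x inference.
    print(time_after_slowdown(train_hours, 3))
    print(time_after_slowdown(infer_seconds, 4))
```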

Input and Output

Model input and output are text only.

Model Architecture

| Model | Training Data | Params | Tuning Strategy | LR |
|---|---|---|---|---|
| [Cendol mT5-small Instruct](https://huggingface.co/indonlp/cendol-mt5-small-inst) | Cendol Collection v1 | 300M | Full fine-tuning | 3.0 x 10^-4 |
| [Cendol mT5-base Instruct](https://huggingface.co/indonlp/cendol-mt5-base-inst) | Cendol Collection v1 | 580M | Full fine-tuning | 3.0 x 10^-4 |
| [Cendol mT5-large Instruct](https://huggingface.co/indonlp/cendol-mt5-large-inst) | Cendol Collection v1 | 1.2B | Full fine-tuning | 3.0 x 10^-4 |
| [Cendol mT5-xl Instruct](https://huggingface.co/indonlp/cendol-mt5-xl-inst) | Cendol Collection v1 | 3.7B | Full fine-tuning | 3.0 x 10^-4 |
| [Cendol mT5-xxl Instruct](https://huggingface.co/indonlp/cendol-mt5-xxl-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 x 10^-4 |
| [Cendol LLaMA-2 (7B) Instruct](https://huggingface.co/indonlp/cendol-llama2-7b-inst) | Cendol Collection v1 | 7B | Full fine-tuning | 2.0 x 10^-5 |
| [Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct](https://huggingface.co/indonlp/cendol-llama2-ind-vocab-inst) | Cendol Collection v1 | 7B | Full fine-tuning | 2.0 x 10^-5 |
| [Cendol LLaMA-2 (13B) Instruct](https://huggingface.co/indonlp/cendol-llama2-13b-merged-inst) | Cendol Collection v1 | 13B | LoRA | 2.0 x 10^-5 |
| [Cendol mT5-small Chat](https://huggingface.co/indonlp/cendol-mt5-small-chat) | Cendol Collection v2 | 300M | Full fine-tuning | 3.0 x 10^-5 |
| [Cendol mT5-base Chat](https://huggingface.co/indonlp/cendol-mt5-base-chat) | Cendol Collection v2 | 580M | Full fine-tuning | 3.0 x 10^-5 |
| [Cendol mT5-large Chat](https://huggingface.co/indonlp/cendol-mt5-large-chat) | Cendol Collection v2 | 1.2B | Full fine-tuning | 3.0 x 10^-5 |
| [Cendol mT5-xl Chat](https://huggingface.co/indonlp/cendol-mt5-xl-chat) | Cendol Collection v2 | 3.7B | Full fine-tuning | 3.0 x 10^-5 |
| [Cendol mT5-xxl Chat](https://huggingface.co/indonlp/cendol-mt5-xxl-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 x 10^-4 |
| [Cendol LLaMA-2 (7B) Chat](https://huggingface.co/indonlp/cendol-llama2-7b-chat) | Cendol Collection v2 | 7B | Full fine-tuning | 1.0 x 10^-5 |
| [Cendol LLaMA-2 (13B) Chat](https://huggingface.co/indonlp/cendol-llama2-13b-merged-chat) | Cendol Collection v2 | 13B | LoRA | 2.0 x 10^-4 |
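
The repository names in the table follow a regular pattern, so a checkpoint can be selected programmatically. The sketch below infers that pattern only from the links in the table; `repo_id` is a hypothetical helper, and the Indonesian-Vocab Instruct model does not fit the pattern, so it is listed as a separate constant.

```python
# Checkpoints whose names carry a "-merged" segment (the LoRA-tuned 13B models).
MERGED_SIZES = {("mt5", "xxl"), ("llama2", "13b")}

# The vocabulary-adapted model uses its own name and has no size segment.
IND_VOCAB_INSTRUCT = "indonlp/cendol-llama2-ind-vocab-inst"


def repo_id(base: str, size: str, variant: str) -> str:
    """Build a Hugging Face repo ID such as 'indonlp/cendol-mt5-small-inst'.

    base: "mt5" or "llama2"; size: e.g. "small", "xxl", "7b";
    variant: "inst" (Instruct) or "chat" (Chat).
    """
    if variant not in ("inst", "chat"):
        raise ValueError("variant must be 'inst' or 'chat'")
    parts = ["cendol", base, size]
    if (base, size) in MERGED_SIZES:
        parts.append("merged")
    parts.append(variant)
    return "indonlp/" + "-".join(parts)


if __name__ == "__main__":
    print(repo_id("llama2", "7b", "chat"))
    print(repo_id("mt5", "xxl", "inst"))
```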

Model Dates

Cendol was trained between October 2023 and January 2024.

License

Use of Cendol is licensed under the [Apache 2.0 license](https://choosealicense.com/licenses/apache-2.0/).

Research Paper

"Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"

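💡 Intended Use

Intended Use Cases

Cendol is intended for research use, especially on Indonesian languages. Cendol models handle single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, and Cendol-Chat models can be used for general-knowledge instructions.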
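
A single-turn request to the chat model might look like the sketch below. It assumes the Hugging Face `transformers` library (with PyTorch) and the repository ID given in the model table; passing the instruction as a plain string with no chat template and the generation settings are assumptions, not documented behavior of this model.

```python
MODEL_ID = "indonlp/cendol-llama2-7b-chat"  # repo ID taken from the model table


def generate_reply(instruction: str, max_new_tokens: int = 128) -> str:
    """Answer one instruction; no conversation history is kept (single turn)."""
    # Imported lazily so the module can be inspected without the heavy deps.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Assumption: the raw instruction is used as the prompt, without a template.
    inputs = tokenizer(instruction, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so that only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_reply("Apa ibu kota Indonesia?")` (a made-up general-knowledge question, "What is the capital of Indonesia?") downloads the weights on first use. Each call is independent, which matches the single-turn design: earlier questions and answers are not fed back into the prompt.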

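Out-of-Scope Uses

Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English and Indonesian languages. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.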

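📊 Evaluation Results

In this section, we report results for the Cendol models on large-scale natural language understanding (NLU) and natural language generation (NLG) benchmarks. All evaluations use our internal evaluation library.

NLU Performance

NLG Performance

Human Evaluation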

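⚠️ Ethical Considerations and Limitations

Cendol is a new technology that carries risks with its use. Testing conducted to date has been in Indonesian and cannot cover all scenarios. For these reasons, as with all LLMs, Cendol's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses. Therefore, before deploying any application of Cendol, developers should perform safety testing and tuning tailored to their specific application of the model.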

📖 Citation

If you use any resource, including the Cendol models, code, or data, please cite the following papers:

@misc{cahyawijaya-etal-2024-cendol,
      title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
      author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
      year={2024},
      eprint={2404.06138},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{cahyawijaya-etal-2023-nusacrowd,
    title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Aji, Alham Fikri  and
      Winata, Genta  and
      Wilie, Bryan  and
      Koto, Fajri  and
      Mahendra, Rahmad  and
      Wibisono, Christian  and
      Romadhony, Ade  and
      Vincentio, Karissa  and
      Santoso, Jennifer  and
      Moeljadi, David  and
      Wirawan, Cahya  and
      Hudi, Frederikus  and
      Wicaksono, Muhammad Satrio  and
      Parmonangan, Ivan  and
      Alfina, Ika  and
      Putra, Ilham Firdausi  and
      Rahmadani, Samsul  and
      Oenang, Yulianti  and
      Septiandri, Ali  and
      Jaya, James  and
      Dhole, Kaustubh  and
      Suryani, Arie  and
      Putri, Rifki Afina  and
      Su, Dan  and
      Stevens, Keith  and
      Nityasya, Made Nindyatama  and
      Adilazuarda, Muhammad  and
      Hadiwijaya, Ryan  and
      Diandaru, Ryandito  and
      Yu, Tiezheng  and
      Ghifari, Vito  and
      Dai, Wenliang  and
      Xu, Yan  and
      Damapuspita, Dyah  and
      Wibowo, Haryo  and
      Tho, Cuk  and
      Karo Karo, Ichwanul  and
      Fatyanosa, Tirana  and
      Ji, Ziwei  and
      Neubig, Graham  and
      Baldwin, Timothy  and
      Ruder, Sebastian  and
      Fung, Pascale  and
      Sujaini, Herry  and
      Sakti, Sakriani  and
      Purwarianti, Ayu",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.868",
    doi = "10.18653/v1/2023.findings-acl.868",
    pages = "13745--13818"
}

@inproceedings{winata-etal-2023-nusax,
    title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
    author = "Winata, Genta Indra  and
      Aji, Alham Fikri  and
      Cahyawijaya, Samuel  and
      Mahendra, Rahmad  and
      Koto, Fajri  and
      Romadhony, Ade  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Fung, Pascale  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Sennrich, Rico  and
      Ruder, Sebastian",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-main.57",
    doi = "10.18653/v1/2023.eacl-main.57",
    pages = "815--834"
}

@inproceedings{aji-etal-2022-one,
    title = "One Country, 700+ Languages: {NLP} Challenges for Underrepresented Languages and Dialects in {I}ndonesia",
    author = "Aji, Alham Fikri  and
      Winata, Genta Indra  and
      Koto, Fajri  and
      Cahyawijaya, Samuel  and
      Romadhony, Ade  and
      Mahendra, Rahmad  and
      Kurniawan, Kemal  and
      Moeljadi, David  and
      Prasojo, Radityo Eko  and
      Baldwin, Timothy  and
      Lau, Jey Han  and
      Ruder, Sebastian",
    editor = "Muresan, Smaranda  and
      Nakov, Preslav  and
      Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.500",
    doi = "10.18653/v1/2022.acl-long.500",
    pages = "7226--7249"
}

@inproceedings{cahyawijaya-etal-2021-indonlg,
    title = "{I}ndo{NLG}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Generation",
    author = "Cahyawijaya, Samuel  and
      Winata, Genta Indra  and
      Wilie, Bryan  and
      Vincentio, Karissa  and
      Li, Xiaohong  and
      Kuncoro, Adhiguna  and
      Ruder, Sebastian  and
      Lim, Zhi Yuan  and
      Bahar, Syafri  and
      Khodra, Masayu  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Moens, Marie-Francine  and
      Huang, Xuanjing  and
      Specia, Lucia  and
      Yih, Scott Wen-tau",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.699",
    doi = "10.18653/v1/2021.emnlp-main.699",
    pages = "8875--8898"
}

@inproceedings{wilie-etal-2020-indonlu,
    title = "{I}ndo{NLU}: Benchmark and Resources for Evaluating {I}ndonesian Natural Language Understanding",
    author = "Wilie, Bryan  and
      Vincentio, Karissa  and
      Winata, Genta Indra  and
      Cahyawijaya, Samuel  and
      Li, Xiaohong  and
      Lim, Zhi Yuan  and
      Soleman, Sidik  and
      Mahendra, Rahmad  and
      Fung, Pascale  and
      Bahar, Syafri  and
      Purwarianti, Ayu",
    editor = "Wong, Kam-Fai  and
      Knight, Kevin  and
      Wu, Hua",
    booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.aacl-main.85",
    pages = "843--857"
}

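Additionally, if you are inspired by our work on region-specific language models, especially for Indonesian and its local languages, please also consider citing the following paper: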

@inproceedings{cahyawijaya-etal-2023-nusawrites,
    title = "{N}usa{W}rites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages",
    author = "Cahyawijaya, Samuel  and
      Lovenia, Holy  and
      Koto, Fajri  and
      Adhista, Dea  and
      Dave, Emmanuel  and
      Oktavianti, Sarah  and
      Akbar, Salsabil  and
      Lee, Jhonson  and
      Shadieq, Nuur  and
      Cenggoro, Tjeng Wawan  and
      Linuwih, Hanung  and
      Wilie, Bryan  and
      Muridan, Galih  and
      Winata, Genta  and
      Moeljadi, David  and
      Aji, Alham Fikri  and
      Purwarianti, Ayu  and
      Fung, Pascale",
    editor = "Park, Jong C.  and
      Arase, Yuki  and
      Hu, Baotian  and
      Lu, Wei  and
      Wijaya, Derry  and
      Purwarianti, Ayu  and
      Krisnadhi, Adila Alfa",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.60",
    doi = "10.18653/v1/2023.ijcnlp-main.60",
    pages = "921--945"
}