SEA-LION-v1-7B
SEA-LION is a collection of Large Language Models (LLMs) pretrained and instruct-tuned for the Southeast Asia (SEA) region. The model sizes range from 3 billion to 7 billion parameters. This README is for the SEA-LION 7B base model. SEA-LION stands for Southeast Asian Languages In One Network.
Quick Start
SEA-LION is a significant advancement in Natural Language Processing, trained specifically for the SEA regional context. This 7B base model is built on the MPT architecture, has a vocabulary size of 256K, and uses the custom SEABPETokenizer for SEA languages.
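The snippet below is a minimal usage sketch for loading the base model with the Hugging Face transformers library. The repository id `aisingapore/sea-lion-7b` and the use of `trust_remote_code=True` are assumptions not stated in this document; adjust them to match the actual release.

```python
# Minimal sketch: load SEA-LION-v1-7B with Hugging Face transformers.
# The repository id and trust_remote_code=True are assumptions, not confirmed here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/sea-lion-7b"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Base model: plain text completion, no instruction-following alignment.
prompt = "Sebagai negara kepulauan terbesar di dunia, Indonesia"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```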
Features
- Regional Focus: Specially trained for the Southeast Asian context, supporting multiple SEA languages.
- Robust Architecture: Built on the MPT architecture, ensuring stable performance.
- Custom Tokenizer: Uses the SEABPETokenizer, optimized for SEA languages.
Installation
No installation steps are specified for this model; it is typically loaded directly through the Hugging Face transformers library, as shown in the Quick Start example above.
Documentation
Model Details
Model Description
The SEA-LION model represents a major step forward in Natural Language Processing, trained to understand the SEA regional context. SEA-LION-v1-7B is based on the MPT architecture and has a vocabulary size of 256K. It uses the custom SEABPETokenizer for SEA languages to ensure optimal performance. The training data for this model consists of 980B tokens.
| Property | Details |
|----------|---------|
| Developed by | Products Pillar, AI Singapore |
| Funded by | Singapore NRF |
| Model Type | Decoder |
| Languages | English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao |
| License | MIT License |
Performance Benchmarks
SEA-LION-v1-7B shows average performance on general English tasks (as measured by the Hugging Face LLM Leaderboard):
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Average |
|-------|-----|-----------|------|------------|---------|
| SEA-LION 7B | 39.93 | 68.51 | 26.87 | 35.09 | 42.60 |
Training Details
Data
SEA-LION-v1-7B was trained on 980B tokens from the following data sources:

| Data Source | Unique Tokens | Multiplier | Total Tokens | Percentage |
|-------------|---------------|------------|--------------|------------|
| RefinedWeb - English | 571.3B | 1 | 571.3B | 58.20% |
| mC4 - Chinese | 91.2B | 1 | 91.2B | 9.29% |
| mC4 - Indonesian | 3.68B | 4 | 14.7B | 1.50% |
| mC4 - Malay | 0.72B | 4 | 2.9B | 0.29% |
| mC4 - Filipino | 1.32B | 4 | 5.3B | 0.54% |
| mC4 - Burmese | 1.2B | 4 | 4.9B | 0.49% |
| mC4 - Vietnamese | 63.4B | 1 | 63.4B | 6.46% |
| mC4 - Thai | 5.8B | 2 | 11.6B | 1.18% |
| WangChanBERTa - Thai | 5B | 2 | 10B | 1.02% |
| mC4 - Lao | 0.27B | 4 | 1.1B | 0.12% |
| mC4 - Khmer | 0.97B | 4 | 3.9B | 0.40% |
| mC4 - Tamil | 2.55B | 4 | 10.2B | 1.04% |
| the Stack - Python | 20.9B | 2 | 41.8B | 4.26% |
| the Stack - Javascript | 55.6B | 1 | 55.6B | 5.66% |
| the Stack - Shell | 1.25B | 2 | 2.5B | 0.26% |
| the Stack - SQL | 6.4B | 2 | 12.8B | 1.31% |
| the Stack - Markdown | 26.6B | 1 | 26.6B | 2.71% |
| RedPajama - StackExchange | 21.2B | 1 | 21.2B | 2.16% |
| RedPajama - ArXiv | 30.6B | 1 | 30.6B | 3.12% |
Infrastructure
SEA-LION-v1-7B was trained using MosaicML Composer on the following hardware:

| Training Details | SEA-LION-v1-7B |
|------------------|----------------|
| AWS EC2 p4d.24xlarge | 32 instances |
| Nvidia A100 40GB GPU | 256 |
| Training Duration | 22 days |
Configuration
| HyperParameter | SEA-LION-v1-7B |
|----------------|----------------|
| Precision | bfloat16 |
| Optimizer | decoupled_adamw |
| Scheduler | cosine_with_warmup |
| Learning Rate | 6.0e-5 |
| Global Batch Size | 2048 |
| Micro Batch Size | 4 |
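As an illustration of how these settings fit together, the sketch below derives the gradient-accumulation steps implied by the batch sizes and the 256-GPU setup listed under Infrastructure. It is a back-of-the-envelope check, not part of the released training configuration.

```python
# Illustrative arithmetic only: relate the listed batch sizes to the 256-GPU setup.
global_batch_size = 2048   # sequences per optimizer step (from the table above)
micro_batch_size = 4       # sequences per GPU per forward/backward pass
num_gpus = 256             # Nvidia A100 40GB GPUs (from the Infrastructure table)

sequences_per_pass = micro_batch_size * num_gpus                   # 1024
grad_accumulation_steps = global_batch_size // sequences_per_pass  # 2

tokens_per_step = global_batch_size * 2048  # 2048-token sequences (see Technical Details)
print(grad_accumulation_steps)              # 2
print(f"{tokens_per_step:,}")               # 4,194,304 tokens per optimizer step
```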
Technical Details
Model Architecture and Objective
SEA-LION-v1-7B is a decoder model using the MPT architecture.

| Parameter | SEA-LION-v1-7B |
|-----------|----------------|
| Layers | 32 |
| d_model | 4096 |
| head_dim | 32 |
| Vocabulary | 256000 |
| Sequence Length | 2048 |
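For orientation, the rough estimate below shows how these hyperparameters translate into a parameter count near the 7B model size. It assumes a standard MPT-style decoder block (4x MLP expansion, no biases) with tied input/output embeddings; the exact figure depends on implementation details not listed here.

```python
# Back-of-the-envelope parameter estimate from the table above.
# Assumes an MPT-style block (attention 4*d^2 + MLP 8*d^2, no biases) and tied
# embedding/output weights. Illustrative only.
n_layers = 32
d_model = 4096
vocab_size = 256_000

attn_params = 4 * d_model ** 2           # Wq, Wk, Wv, Wo
mlp_params = 8 * d_model ** 2            # up- and down-projection with 4x expansion
per_layer = attn_params + mlp_params     # ~201M per layer

embedding_params = vocab_size * d_model  # ~1.05B (shared with the output head)
total = n_layers * per_layer + embedding_params

print(f"~{total / 1e9:.1f}B parameters")  # ~7.5B, in the region of the 7B model size
```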
Tokenizer Details
We sampled 20M lines from the training data to train the tokenizer. The training framework is SentencePiece, and the tokenizer type is Byte-Pair Encoding (BPE).
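The snippet below is an illustrative sketch of training a SentencePiece BPE tokenizer of this vocabulary size. The input filename is hypothetical, and the exact sampling, normalization, and training options used for the SEABPETokenizer are not specified in this document.

```python
# Illustrative sketch of SentencePiece BPE training; the actual SEABPETokenizer
# settings (normalization, character coverage, special tokens) are not documented here.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_training_lines.txt",  # hypothetical file holding the 20M sampled lines
    model_prefix="seabpe",               # writes seabpe.model / seabpe.vocab
    model_type="bpe",                    # Byte-Pair Encoding, as stated above
    vocab_size=256_000,                  # matches the model's 256K vocabulary
    character_coverage=1.0,              # broad coverage for multilingual SEA text
)

# Load the trained tokenizer and inspect a sample segmentation.
sp = spm.SentencePieceProcessor(model_file="seabpe.model")
print(sp.encode("Selamat pagi, dunia!", out_type=str))
```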
The Team
Lam Wen Zhi Clarence
Leong Wei Qi
Li Yier
Liu Bing Jie Darius
Lovenia Holy
Montalan Jann Railey
Ng Boon Cheong Raymond
Ngui Jian Gang
Nguyen Thanh Ngan
Ong Tat-Wee David
Rengarajan Hamsawardhini
Susanto Yosephine
Tai Ngee Chia
Tan Choon Meng
Teo Jin Howe
Teo Eng Sipp Leslie
Teo Wei Yi
Tjhi William
Yeo Yeow Tong
Yong Xianbin
Acknowledgements
AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
Contact
For more information, please contact us using the SEA-LION Inquiry Form.
Link to SEA-LION's GitHub repository
Disclaimer
This is the repository for the base model. The model has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and codes.
References
Thai Pre-Training Data Reference
@misc{lowphansirikul2021wangchanberta,
title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
year={2021},
eprint={2101.09635},
archivePrefix={arXiv},
primaryClass={cs.CL}
}