🚀 Bllossom
Bllossom is a Korean-English bilingual language model that strengthens the link between Korean and English knowledge and offers a range of features for Korean language processing.
🚀 Quick Start
The Bllossom project team has released Bllossom-70.8B, a Korean-English bilingual language model! Supported by the Supercomputing Center of Seoul National University of Science and Technology, it is a Korean-enhanced bilingual model fully fine-tuned on over 100GB of Korean data.
Are you looking for a model that excels in Korean?
- Korean Vocabulary Expansion: the first to expand the Korean vocabulary, adding over 30,000 Korean words.
- Longer Context Handling: processes Korean contexts approximately 25% longer than Llama 3.
- Knowledge Linking: connects Korean and English knowledge through pre-training on a Korean-English parallel corpus.
- Fine-Tuning: fine-tuned on data crafted by linguists with Korean culture and language in mind.
- Reinforcement Learning: incorporates reinforcement learning techniques (DPO).
All these features are integrated, and the model is available for commercial use. Build your own model with Bllossom! If you lack GPUs, try the quantized model (a 4-bit loading sketch follows below).
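If the full bf16 weights do not fit on your hardware, 4-bit loading is one option. The sketch below is only illustrative: it quantizes the full-precision checkpoint on the fly with bitsandbytes (`pip install bitsandbytes`) rather than pointing at the officially released quantized checkpoint, whose repository name is not listed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Bllossom/llama-3-Korean-Bllossom-70B"

# 4-bit NF4 quantization with bf16 compute; cuts memory use at some quality cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```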
- Bllossom-70.8B is a pragmatism-based language model developed in collaboration with linguists from Seoul National University of Science and Technology, Teddysum, and the Language Resources Laboratory of Yonsei University. We will keep maintaining and updating it, so please use it widely.
- We also have the very powerful Advanced-Bllossom 8B and 70B models, as well as a vision-language model. (If you are interested, contact us individually!)
- Bllossom has been accepted for presentation at NAACL 2024 and LREC-COLING 2024 (oral).
- We will keep releasing improved language models! We welcome anyone interested in joint research (especially co-authoring papers) on improving Korean language models. If your team can rent even a small number of GPUs, contact us anytime; we will help you achieve your goals.
✨ Features
The Bllossom language model is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:
- Knowledge Linking: links Korean and English knowledge through additional training.
- Vocabulary Expansion: expands the Korean vocabulary to improve Korean expressiveness (a quick tokenizer check is sketched below).
- Instruction Tuning: tuned with custom instruction-following data specialized for the Korean language and Korean culture.
- Human Feedback: DPO has been applied.
- Vision-Language Alignment: aligns a vision transformer with this language model.
This model was developed by the MLP Lab at Seoultech, Teddysum, and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim).
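Because of the vocabulary expansion, Korean text should tokenize into noticeably fewer tokens than with the base Llama 3 tokenizer, which is where the longer effective Korean context comes from. The minimal check below is a sketch, not an official benchmark: it assumes you have access to both tokenizers (the Meta Llama 3 repository is gated), and exact token counts vary with the input text.

```python
from transformers import AutoTokenizer

# Korean sample: "Introduce the MLP Lab at Seoul National University of Science and Technology."
korean_text = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

bllossom_tok = AutoTokenizer.from_pretrained("Bllossom/llama-3-Korean-Bllossom-70B")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")  # gated repo

print("vocab sizes:", len(bllossom_tok), "vs", len(llama3_tok))
print("Bllossom tokens:", len(bllossom_tok.tokenize(korean_text)))
print("Llama 3 tokens: ", len(llama3_tok.tokenize(korean_text)))
```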
📦 Installation
Install Dependencies
```bash
pip install torch transformers==4.40.0 accelerate
```
💻 Usage Examples
Basic Usage
```python
import transformers
import torch

model_id = "Bllossom/llama-3-Korean-Bllossom-70B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline.model.eval()

# Bilingual system prompt; the Korean sentence repeats the English instruction.
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
# Korean user instruction: "Introduce the MLP Lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Print only the newly generated text, without the prompt.
print(outputs[0]["generated_text"][len(prompt):])
```
Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'Bllossom/llama-3-Korean-Bllossom-70B'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Bilingual system prompt; the Korean sentence repeats the English instruction.
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
# Korean user instruction: "Introduce the MLP Lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens that follow the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
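For interactive use you may prefer to see tokens as they are produced rather than waiting for the full completion. The optional sketch below reuses `model`, `tokenizer`, `input_ids`, and `terminators` from the example above and attaches transformers' `TextStreamer`; it is a convenience on top of the same `generate` call, not part of the Bllossom-specific setup.

```python
from transformers import TextStreamer

# Prints decoded text to stdout as tokens are generated; the prompt itself is skipped.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    streamer=streamer,
)
```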
📚 Documentation
NEWS
- [2024.08.30] Updated to the Bllossom ELO model, with pre-training data increased to 250GB. However, no vocabulary expansion was performed. If you want the previous long-context model with vocabulary expansion, please contact us individually!
- [2024.05.08] Vocabulary-expansion model update.
- [2024.04.25] Released Bllossom v2.0, based on Llama 3.
- [2023.12] Released Bllossom-Vision v1.0, based on Bllossom.
- [2023.08] Released Bllossom v1.0, based on Llama 2.
- [2023.07] Released Bllossom v0.7, based on polyglot-ko.
Demo Video
Bllossom-V Demo
Bllossom Demo (Kakao)
📄 License
This model is distributed under the Llama 3 license.
📖 Citation
Language Model
```bibtex
@misc{bllossom,
  author = {ChangSu Choi, Yongbin Jeong, Seoyoon Park, InHo Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, HyeJin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}}
}
```
Vision-Language Model
```bibtex
@misc{bllossom-V,
  author = {Dongjae Shin, Hyunseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}}
}
```
📞 Contact
- KyungTae Lim, Professor at Seoultech.
ktlim@seoultech.ac.kr
- Younggyun Hahm, CEO of Teddysum.
hahmyg@teddysum.ai
- Hansaem Kim, Professor at Yonsei.
khss@yonsei.ac.kr
👥 Contributors
- Chansu Choi, choics2623@seoultech.ac.kr
- Sangmin Kim, sangmin9708@naver.com
- Inho Won, wih1226@seoultech.ac.kr
- Minjun Kim, mjkmain@seoultech.ac.kr
- Seungwoo Song, sswoo@seoultech.ac.kr
- Dongjae Shin, dylan1998@seoultech.ac.kr
- Hyeonseok Lim, gustjrantk@seoultech.ac.kr
- Jeonghun Yuk, usually670@gmail.com
- Hangyeol Yoo, 21102372@seoultech.ac.kr
- Seohyun Song, alexalex225225@gmail.com
🌟 Supported by
- AICA

Demo | Homepage | Github | Colab-tutorial