🚀 Bllossom
Bllossom is a Korean-English bilingual language model that strengthens the link between Korean and English knowledge and offers a range of features for Korean language processing.
🚀 Quick Start
The Bllossom project team has released Bllossom-70.8B, a Korean-English bilingual language model! Supported by the Supercomputing Center of Seoul National University of Science and Technology, it is a Korean-enhanced bilingual model fully fine-tuned on over 100GB of Korean data.
Are you looking for a model that excels in Korean?
- Korean Vocabulary Expansion: the first to expand the Korean vocabulary, adding over 30,000 Korean words.
- Longer Context Handling: processes Korean contexts approximately 25% longer than Llama 3.
- Knowledge Linking: connects Korean and English knowledge through pre-training on a Korean-English parallel corpus.
- Fine-Tuning: fine-tuned on data crafted by linguists with Korean culture and language in mind.
- Reinforcement Learning: incorporates reinforcement learning techniques (DPO).
All these features are integrated, and the model is available for commercial use. Build your own model with Bllossom! If you lack GPUs, try the quantized model (a 4-bit loading sketch follows below).
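If the full bf16 weights do not fit on your hardware, 4-bit loading is one option. The sketch below is only illustrative: it quantizes the full-precision checkpoint on the fly with bitsandbytes (`pip install bitsandbytes`) rather than pointing at the officially released quantized checkpoint, whose repository name is not listed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Bllossom/llama-3-Korean-Bllossom-70B"

# 4-bit NF4 quantization with bf16 compute; cuts memory use at some quality cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```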
- Bllossom-70.8B is a pragmatism-based language model developed in collaboration with linguists from Seoul National University of Science and Technology, Teddysum, and the Language Resources Laboratory of Yonsei University. We will keep maintaining and updating it, so please use it widely.
- We also have the very powerful Advanced-Bllossom 8B and 70B models, as well as a vision-language model. (If you are interested, contact us individually!)
- Bllossom has been accepted for presentation at NAACL 2024 and LREC-COLING 2024 (oral).
- We will keep releasing improved language models! We welcome anyone interested in joint research (especially co-authoring papers) on improving Korean language models. If your team can rent even a small number of GPUs, contact us anytime; we will help you achieve your goals.
✨ Features
The Bllossom language model is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:
- Knowledge Linking: links Korean and English knowledge through additional training.
- Vocabulary Expansion: expands the Korean vocabulary to improve Korean expressiveness (a quick tokenizer check is sketched below).
- Instruction Tuning: tuned with custom instruction-following data specialized for the Korean language and Korean culture.
- Human Feedback: DPO has been applied.
- Vision-Language Alignment: aligns a vision transformer with this language model.
This model was developed by the MLP Lab at Seoultech, Teddysum, and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim).
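Because of the vocabulary expansion, Korean text should tokenize into noticeably fewer tokens than with the base Llama 3 tokenizer, which is where the longer effective Korean context comes from. The minimal check below is a sketch, not an official benchmark: it assumes you have access to both tokenizers (the Meta Llama 3 repository is gated), and exact token counts vary with the input text.

```python
from transformers import AutoTokenizer

# Korean sample: "Introduce the MLP Lab at Seoul National University of Science and Technology."
korean_text = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

bllossom_tok = AutoTokenizer.from_pretrained("Bllossom/llama-3-Korean-Bllossom-70B")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")  # gated repo

print("vocab sizes:", len(bllossom_tok), "vs", len(llama3_tok))
print("Bllossom tokens:", len(bllossom_tok.tokenize(korean_text)))
print("Llama 3 tokens: ", len(llama3_tok.tokenize(korean_text)))
```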
📦 Installation
Install Dependencies
```bash
pip install torch transformers==4.40.0 accelerate
```
💻 Usage Examples
Basic Usage
```python
import transformers
import torch

model_id = "Bllossom/llama-3-Korean-Bllossom-70B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline.model.eval()

# Bilingual system prompt; the Korean sentence repeats the English instruction.
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
# Korean user instruction: "Introduce the MLP Lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
]

prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Print only the newly generated text, without the prompt.
print(outputs[0]["generated_text"][len(prompt):])
```
Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'Bllossom/llama-3-Korean-Bllossom-70B'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# Bilingual system prompt; the Korean sentence repeats the English instruction.
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
# Korean user instruction: "Introduce the MLP Lab at Seoul National University of Science and Technology."
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens that follow the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
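For interactive use you may prefer to see tokens as they are produced rather than waiting for the full completion. The optional sketch below reuses `model`, `tokenizer`, `input_ids`, and `terminators` from the example above and attaches transformers' `TextStreamer`; it is a convenience on top of the same `generate` call, not part of the Bllossom-specific setup.

```python
from transformers import TextStreamer

# Prints decoded text to stdout as tokens are generated; the prompt itself is skipped.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    streamer=streamer,
)
```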
📚 Documentation
NEWS
- [2024.08.30] Updated to the Bllossom ELO model, with pre-training data increased to 250GB. However, no vocabulary expansion was performed. If you want the previous long-context model with vocabulary expansion, please contact us individually!
- [2024.05.08] Vocabulary-expansion model update.
- [2024.04.25] Released Bllossom v2.0, based on Llama 3.
- [2023.12] Released Bllossom-Vision v1.0, based on Bllossom.
- [2023.08] Released Bllossom v1.0, based on Llama 2.
- [2023.07] Released Bllossom v0.7, based on polyglot-ko.
Demo Video
Bllossom-V Demo
Bllossom Demo (Kakao)
📄 License
This model is distributed under the Llama 3 license.
📖 Citation
Language Model
```bibtex
@misc{bllossom,
  author = {ChangSu Choi, Yongbin Jeong, Seoyoon Park, InHo Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, HyeJin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}}
}
```
Vision-Language Model
```bibtex
@misc{bllossom-V,
  author = {Dongjae Shin, Hyunseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}}
}
```
📞 Contact
- KyungTae Lim, Professor at Seoultech.
ktlim@seoultech.ac.kr
- Younggyun Hahm, CEO of Teddysum.
hahmyg@teddysum.ai
- Hansaem Kim, Professor at Yonsei.
khss@yonsei.ac.kr
👥 Contributors
- Chansu Choi, choics2623@seoultech.ac.kr
- Sangmin Kim, sangmin9708@naver.com
- Inho Won, wih1226@seoultech.ac.kr
- Minjun Kim, mjkmain@seoultech.ac.kr
- Seungwoo Song, sswoo@seoultech.ac.kr
- Dongjae Shin, dylan1998@seoultech.ac.kr
- Hyeonseok Lim, gustjrantk@seoultech.ac.kr
- Jeonghun Yuk, usually670@gmail.com
- Hangyeol Yoo, 21102372@seoultech.ac.kr
- Seohyun Song, alexalex225225@gmail.com
🌟 Supported by
- AICA

Demo | Homepage | Github | Colab-tutorial