llama-3-Korean-Bllossom-8B: an open-source language model with strengthened Korean and support for Korean-English bilingual use

Llama 3 Korean Bllossom 8B

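Developed by MLP-KTLim

Bllossom is a Korean-English bilingual language model based on Llama 3. Full fine-tuning strengthens its Korean, with an expanded Korean vocabulary and improved handling of Korean context.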

Released: 4/25/2024

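Model overview

Bllossom is a language model focused on Korean-English bilingual ability. Vocabulary expansion, instruction tuning, and optimization with human feedback substantially improve its Korean processing.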

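Model features

Korean vocabulary expansion

More than 30,000 Korean vocabulary items were added, improving Korean expressiveness

Long-context handling

Handles roughly 25% longer Korean context than Llama 3

Korean-English knowledge linking

Knowledge-linking pre-training on a Korean-English parallel corpus

Cultural adaptation

Fine-tuned on data produced by linguists with Korean culture and language in mind

Reinforcement learning optimization

DPO (Direct Preference Optimization) was applied to optimize the model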

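Model capabilities

Korean text generation

English text generation

Bilingual question answering

Travel itinerary planning

Culture-related content generation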

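Use cases

Travel assistant

Seoul itinerary planning

Draws up a route through well-known Seoul sights for the user

Generates a detailed travel plan covering attractions, transport, and timing

Education support

Korean-English bilingual learning

Helps learners of Korean and English with language practice

Provides accurate bilingual translation and language explanations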

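🚀 Bllossom

Bllossom is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the knowledge link between Korean and English, giving users richer language interaction.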

🚀 Quick start

The following is a quick guide to installing the dependencies and running the model.

Install dependencies

pip install torch transformers==4.40.0 accelerate

Python code example (using Pipeline)

import torch
import transformers

model_id = "MLP-KTLim/llama-3-Korean-Bllossom-8B"

# Load the model in bfloat16 and let accelerate place it on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline.model.eval()

# Bilingual system prompt (English and Korean).
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
# "Could you put together a famous sightseeing course in Seoul?"
instruction = "서울의 유명한 관광 코스를 만들어줄래?"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Render the messages into one prompt string using the model's chat template.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop on either the regular EOS token or the Llama 3 end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Strip the prompt so that only the newly generated text is printed.
print(outputs[0]["generated_text"][len(prompt):])
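To make the `apply_chat_template` step less opaque, here is a minimal sketch of the prompt string it is expected to produce. It assumes the model ships the standard Llama 3 instruct chat template; the authoritative template is the one stored with the tokenizer, and `render_llama3_prompt` is a hypothetical helper written for illustration only.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Hand-rolled approximation of the standard Llama 3 instruct template.

    Assumption: the tokenizer for this model uses the stock Llama 3 format.
    In real code, call tokenizer.apply_chat_template instead.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "서울의 유명한 관광 코스를 만들어줄래?"},
]
demo_prompt = render_llama3_prompt(demo_messages)
print(demo_prompt)
```

Because every turn ends with `<|eot_id|>`, the model emits that token when it finishes its own turn, which is why the example passes it as a terminator alongside the regular EOS token.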

Python code example (using AutoModel)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'MLP-KTLim/llama-3-Korean-Bllossom-8B'

# Load the tokenizer and the model (bfloat16, devices chosen automatically).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model.eval()

# Bilingual system prompt (English and Korean).
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
# "Could you put together a famous sightseeing course in Seoul?"
instruction = "서울의 유명한 관광 코스를 만들어줄래?"

messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Apply the chat template and tokenize in one step, then move to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stop on either the regular EOS token or the Llama 3 end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Decode only the tokens generated after the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
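Both examples sample with `temperature=0.6` and `top_p=0.9`. The sketch below shows, in plain Python, what those two settings do to a next-token distribution. It is an illustrative re-implementation of temperature scaling and nucleus (top-p) filtering, not the code that transformers runs internally, and the logits are made-up numbers.

```python
import math


def top_p_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
filtered = top_p_filter(toy_logits)
print(filtered)
```

Lowering `temperature` or `top_p` narrows the candidate set and makes answers more deterministic. Raising them increases variety at the cost of consistency.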

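✨ Main features

Knowledge linking: additional training links Korean and English knowledge.
Vocabulary expansion: the Korean vocabulary is expanded to improve Korean expressiveness.
Instruction tuning: fine-tuned on instruction-following data tailored to the Korean language and Korean culture.
Human feedback: DPO (Direct Preference Optimization) was applied.
Vision-language alignment: a vision transformer is aligned with this language model.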
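For readers unfamiliar with the DPO step named in the feature list, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the Bllossom training code. The log-probabilities are made-up numbers, and `beta=0.1` is only a commonly used default, not a value given in this document.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes a response than the reference does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the policy already prefers the chosen answer.
loss_good = dpo_loss(-12.0, -20.0, -15.0, -15.0)
# Policy identical to the reference: the margin is zero.
loss_neutral = dpo_loss(-15.0, -15.0, -15.0, -15.0)
print(loss_good, loss_neutral)
```

The loss falls as the policy raises the likelihood of the preferred response relative to the reference model and lowers that of the rejected one, which is how human preference data shapes the model without a separate reward model.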

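📚 Documentation

Changelog

~~[2024.08.09] Updated to a Bllossom-8B model based on Llama 3.1. Average performance improved by about 5% over the original Llama 3-based Bllossom.~~ (under revision)
[2024.06.18] Updated to the Bllossom ELO model, with pre-training data increased to 250GB but without vocabulary expansion. If you would like to use the original vocabulary-expanded long-context model, please contact us!
[2024.06.18] The Bllossom ELO model is a new model trained with our in-house ELO pre-training. On the LogicKor benchmark it achieved the SOTA score among existing Korean models under 10B parameters.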

LogicKor performance table

Model	Math	Reasoning	Writing	Coding	Understanding	Grammar	Single-turn total	Multi-turn total	Overall
gpt-3.5-turbo-0125	7.14	7.71	8.28	5.85	9.71	6.28	7.50	7.95	7.72
gemini-1.5-pro-preview-0215	8.00	7.85	8.14	7.71	8.42	7.28	7.90	6.26	7.08
llama-3-Korean-Bllossom-8B	5.43	8.29	9.0	4.43	7.57	6.86	6.93	6.93	6.93
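As a reading aid for the table, the short check below tests two relationships that the published numbers appear to satisfy: the single-turn total matches the mean of the six category scores, and the overall score matches the mean of the single-turn and multi-turn totals. This is an observation about the figures above, not a statement of how LogicKor officially defines its aggregates.

```python
# Scores copied from the table: six categories, single-turn total, multi-turn total, overall.
logickor_rows = {
    "gpt-3.5-turbo-0125": ([7.14, 7.71, 8.28, 5.85, 9.71, 6.28], 7.50, 7.95, 7.72),
    "gemini-1.5-pro-preview-0215": ([8.00, 7.85, 8.14, 7.71, 8.42, 7.28], 7.90, 6.26, 7.08),
    "llama-3-Korean-Bllossom-8B": ([5.43, 8.29, 9.0, 4.43, 7.57, 6.86], 6.93, 6.93, 6.93),
}


def is_consistent(categories, single, multi, overall, tol=0.011):
    """Check both averages, allowing for two-decimal rounding in the table."""
    category_mean = sum(categories) / len(categories)
    turn_mean = (single + multi) / 2
    return abs(category_mean - single) < tol and abs(turn_mean - overall) < tol


consistency = {name: is_consistent(*row) for name, row in logickor_rows.items()}
for name, ok in consistency.items():
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```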

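Model introduction

Our Bllossom team has released the Korean-English bilingual language model Bllossom! With support from the supercomputing center at Seoul National University of Science and Technology (SeoulTech), the entire model was fully fine-tuned on more than 100GB of Korean data to produce a Korean-strengthened bilingual model.

If you are looking for a model that is good at Korean, Bllossom is a strong choice:

Korean vocabulary expansion: a first for Korean, with more than 30,000 Korean vocabulary items added.
Long-context handling: handles roughly 25% longer Korean context than Llama 3.
Knowledge linking: a Korean-English parallel corpus strengthens the link between Korean and English knowledge (pre-training).
Tailored fine-tuning: fine-tuned on data produced by linguists with Korean culture and language in mind.
Reinforcement learning: preference optimization (DPO) is applied.

All of these features are combined in the Bllossom model, which is available for commercial use. You can use it to build your own model, and it can even be trained on a free Colab GPU. Alternatively, you can deploy a quantized version of the model on a CPU.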
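The long-context figure is tied to vocabulary expansion: when a tokenizer needs fewer tokens for the same Korean text, a fixed context window holds more text. The arithmetic below is a sketch of that relationship only. The 0.8 token ratio is a hypothetical value chosen because it reproduces a 25% gain; the document does not report measured token counts.

```python
def context_gain(token_ratio):
    """Extra text that fits in a fixed token window.

    token_ratio = (tokens with the expanded vocabulary) /
                  (tokens with the base vocabulary) for the same text.
    """
    return 1.0 / token_ratio - 1.0


# Hypothetical: the expanded tokenizer needs 80% as many tokens for Korean text.
assumed_ratio = 0.8
gain = context_gain(assumed_ratio)
print(f"{gain:.0%} more Korean text fits in the same window")
```

To measure the real ratio, one would tokenize the same Korean corpus with both tokenizers and divide the token counts.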
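A rough weight-memory estimate shows why quantization matters for CPU or free-tier GPU deployment. The sketch assumes about 8 billion parameters (taken from the "8B" in the model name) and counts weights only. Activations, the KV cache, and quantization metadata add to these figures.

```python
ASSUMED_PARAMS = 8.0e9  # assumption: roughly 8 billion parameters ("8B")


def weight_gib(num_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


bf16_gib = weight_gib(ASSUMED_PARAMS, 16)  # as loaded in the examples (bfloat16)
int8_gib = weight_gib(ASSUMED_PARAMS, 8)   # hypothetical 8-bit quantization
int4_gib = weight_gib(ASSUMED_PARAMS, 4)   # hypothetical 4-bit quantization

for label, size in [("bf16", bf16_gib), ("int8", int8_gib), ("int4", int4_gib)]:
    print(f"{label}: about {size:.1f} GiB of weights")
```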

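Other information

Bllossom-8B is a practical language model developed in collaboration with linguists from Seoul National University of Science and Technology, Teddysum, and the Yonsei University language resources lab. We will maintain the model through continuous updates, and we hope it is widely used.
We also have the much more powerful Advanced-Bllossom 8B and 70B models, as well as a vision-language model. (If you are interested, please contact us separately.)
Bllossom has been accepted at NAACL 2024 and LREC-COLING 2024 (oral).
We will keep releasing strong language models. Anyone who would like to do joint research on strengthening Korean (especially co-authoring papers) is welcome to contact us. Teams that can lend even a small number of GPUs are especially welcome to get in touch, and we will do our best to help.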

Demo videos

Bllossom-V demo

Bllossom demo (Kakao)

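News

[2024.06.18] We reverted to a model without vocabulary expansion, but significantly increased the pre-training data to 250GB.
[2024.05.08] Vocabulary-expanded model updated.
[2024.04.25] We released Bllossom v2.0, based on llama-3.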

Example code

Colab tutorial

Inference code link

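🔧 Technical details

This model was jointly developed by the MLPLab at Seoul National University of Science and Technology, Teddysum, and Yonsei University.

📄 License

This model is released under the llama3 license.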

📚 Citation

Language model

@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
}

Vision-language model

@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
}