llama-3-Korean-Bllossom-70B open-source model: enhanced Korean-English bilingual ability and improved Korean expression

Llama 3 Korean Bllossom 70B

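Developed by Bllossom

A Korean-English bilingual model built on Llama 3, with Korean ability improved through vocabulary expansion and knowledge linking

Large language model

Transformers

Multilingual #Korean-English bilingual enhancement #Culturally adapted instruction tuning #Long-context processing

Downloads: 300

Released: 5/8/2024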

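Model overview

Bllossom is a Korean-English bilingual large language model developed jointly by Seoul National University of Science and Technology (Seoultech) and partner institutions. Vocabulary expansion, knowledge-linking training, and instruction tuning on culturally adapted Korean data substantially improve its Korean processing ability.

Model features

Korean vocabulary expansion

Adds more than 30,000 Korean vocabulary entries, which markedly improves Korean expressiveness

Knowledge-linking training

Links Korean and English knowledge through training on a Korean-English parallel corpus

Culturally adapted instructions

Fine-tuned on culturally adapted Korean data custom-built by linguists

Context length optimization

Handles roughly 25% longer Korean context than the original Llama 3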
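
To make the link between vocabulary expansion and longer effective context concrete, here is a minimal sketch. It assumes the context gain comes from Korean text needing fewer tokens once whole words are in the vocabulary, which the source does not state explicitly. The greedy longest-match tokenizer and both vocabularies are toy stand-ins, not the model's actual byte-level BPE tokenizer.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy greedy longest-match tokenizer; unknown characters become single tokens."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def capacity_gain(tokens_before: int, tokens_after: int) -> float:
    """Relative increase in text that fits into a fixed token budget."""
    return tokens_before / tokens_after - 1.0


# Hypothetical vocabularies: syllables only vs. syllables plus whole words.
base_vocab = {"서", "울", "과", "학", "기", "술", "대", "교"}
expanded_vocab = base_vocab | {"서울", "과학", "기술", "대학교"}

text = "서울과학기술대학교"
before = greedy_tokenize(text, base_vocab)
after = greedy_tokenize(text, expanded_vocab)

print(len(before), before)
print(len(after), after)
print(f"toy capacity gain: {capacity_gain(len(before), len(after)):.0%}")

# A 20% drop in token count (5 tokens down to 4) corresponds to a 25% capacity gain.
print(f"5 -> 4 tokens: {capacity_gain(5, 4):.0%}")
```

The toy example exaggerates the effect. The last line shows the arithmetic behind a figure like 25%: the same text has to cost about 20% fewer tokens on average.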

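Model capabilities

Korean-English bilingual text generation

Korean question answering

Knowledge-linked reasoning

Culturally adapted responses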

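Use cases

Education

Korean learning assistant

Helps non-native speakers learn Korean

Provides culturally adapted language explanations

Business

Bilingual customer service chatbot

Handles customer inquiries in Korean and English

Accurately understands and responds to culture-related queries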

🚀 Bllossom

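Bllossom is a Korean-English bilingual language model built on the open-source Llama 3. It strengthens the knowledge links between Korean and English and performs well on Korean language tasks.

🚀 Quick start

Install dependencies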

pip install torch transformers==4.40.0 accelerate
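
Before loading the model, it can help to estimate how much accelerator memory the weights alone require. The sketch below is back-of-the-envelope arithmetic, assuming 70.8B parameters (taken from the "Bllossom-70.8B" name) stored in bfloat16 at 2 bytes each. The 80 GiB device size is a hypothetical example, and the estimate ignores activations and the KV cache, so real requirements are higher.

```python
import math


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


def min_devices(total_gib: float, per_device_gib: float) -> int:
    """Lower bound on the device count; ignores activations and the KV cache."""
    return math.ceil(total_gib / per_device_gib)


NUM_PARAMS = 70.8e9    # assumption: parameter count implied by the "Bllossom-70.8B" name
PER_DEVICE_GIB = 80.0  # hypothetical accelerator memory size

weights_gib = weight_memory_gib(NUM_PARAMS)
print(f"bfloat16 weights: about {weights_gib:.0f} GiB")
print(f"devices needed (lower bound): {min_devices(weights_gib, PER_DEVICE_GIB)}")
```

Because the weights come to roughly 132 GiB, they will not fit on a single 80 GiB device in bfloat16, so the model has to be split across devices.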

Python code (using Pipeline)

import transformers
import torch

model_id = "Bllossom/llama-3-Korean-Bllossom-70B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline.model.eval()
PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

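# Chat messages: a system prompt followed by the user's instruction.
messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Render the messages into a single prompt string using the model's chat template.
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop generation at either the regular EOS token or the end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]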

outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

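# Strip the prompt prefix so only the newly generated text is printed.
print(outputs[0]["generated_text"][len(prompt):])

# Example output (sampling is enabled, so the text varies between runs):
# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.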
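
To show what `apply_chat_template` produces and why `<|eot_id|>` is listed as a terminator, here is a minimal sketch that renders messages by hand. It assumes this model keeps the stock Llama 3 Instruct chat format. The function `render_llama3_prompt` is a hypothetical helper, and the tokenizer's own `chat_template` remains the source of truth.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the stock Llama 3 Instruct format (assumed here)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header cues the model to write the next turn.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "서울과학기술대학교 MLP연구실에 대해 소개해줘"},
]

rendered = render_llama3_prompt(example_messages)
print(rendered)
```

Every turn ends with `<|eot_id|>`, so the model emits that token when it finishes its answer. Passing it in `eos_token_id` stops generation at the end of the turn.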

Python code (using AutoModel)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'Bllossom/llama-3-Korean-Bllossom-70B'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model.eval()

PROMPT = '''You are a helpful AI assistant. Please answer the user's questions kindly. 당신은 유능한 AI 어시스턴트 입니다. 사용자의 질문에 대해 친절하게 답변해주세요.'''
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"

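# Chat messages: a system prompt followed by the user's instruction.
messages = [
    {"role": "system", "content": PROMPT},
    {"role": "user", "content": instruction},
]

# Apply the chat template and tokenize in one step, then move the IDs to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stop generation at either the regular EOS token or the end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]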

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

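# Decode only the tokens generated after the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))

# Example output (sampling is enabled, so the text varies between runs):
# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.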
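
The `temperature=0.6` and `top_p=0.9` settings used above control how each next token is sampled. The following standard-library sketch illustrates the idea of temperature scaling followed by nucleus (top-p) filtering. It is not the exact Transformers implementation, and the helper names and example logits are made up for illustration.

```python
import math
import random


def top_p_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} for the nucleus after temperature scaling."""
    # Temperatures below 1 sharpen the distribution; above 1 they flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving probabilities so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


example_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
nucleus = top_p_filter(example_logits)
print(nucleus)
print(sample_token(example_logits, rng=random.Random(0)))
```

With these example logits only the two most likely tokens survive the filter, so low-probability tokens are never sampled even though `do_sample=True`.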

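✨ Key features

Knowledge linking: additional training links Korean and English knowledge.
Vocabulary expansion: an expanded Korean vocabulary improves Korean expressiveness.
Instruction tuning: fine-tuned on instruction-following data custom-built for the Korean language and Korean culture.
Human feedback: DPO (Direct Preference Optimization) has been applied.
Vision-language alignment: a vision transformer is aligned with this language model.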
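
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It illustrates the published objective only and is not the authors' training code. The value `beta=0.1` and the example log-probabilities are hypothetical, since the source gives no hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as a numerically stable softplus(-logits).
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy shifted toward the chosen response: the loss drops below log(2).
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Minimizing this loss raises the likelihood of preferred responses relative to rejected ones, while the reference terms keep the policy from drifting arbitrarily far from the starting model.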

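📚 Detailed documentation

Model update log
- [2024.08.30] Updated to the Bllossom ELO model, with the pretraining data increased to 250GB. This version does not include vocabulary expansion. To use the earlier vocabulary-expanded long-context model, please contact the authors directly.
- [2024.05.08] Vocabulary expansion model update
- [2024.04.25] Released Bllossom v2.0, based on llama-3
- [2023/12] Released Bllossom-Vision v1.0, based on Bllossom
- [2023/08] Released Bllossom v1.0, based on llama-2
- [2023/07] Released Bllossom v0.7, based on polyglot-ko

Model introduction
- Bllossom-70.8B is a pragmatism-based language model developed in collaboration with linguists from Seoul National University of Science and Technology (Seoultech), Teddysum, and the Yonsei University Language Resources Lab. It will continue to be updated and maintained.
- The team also has the more powerful Advanced-Bllossom 8B and 70B models and a vision-language model (please contact the authors with any questions).
- Bllossom has been accepted at NAACL 2024 and LREC-COLING 2024 (oral).
- The team will keep releasing improved language models and welcomes joint research on strengthening Korean, paper collaborations in particular. Teams that can lend even a small number of GPUs are especially welcome to get in touch.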

📄 License

This model is released under the llama3 license.

📖 Citation

Language model

@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  url = {https://arxiv.org/pdf/2403.10882},
}

Vision-language model

@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  url = {https://arxiv.org/pdf/2403.11399},
}