FuseLLM-7B: An Open-Source Language Model That Fuses Knowledge from Multiple LLMs

Developed by Wanfq

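FuseLLM-7B is a unified model that fuses the knowledge of several open-source large language models. It uses knowledge fusion to combine the capabilities of LLMs with different architectures into a single model.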

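Large language model

Transformers

Supports multiple languages
License: Apache-2.0
Tags: multi-model knowledge fusion, open-source LLM, text generation optimization

Downloads: 45

Released: 1/21/2024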

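Model Overview

FuseLLM-7B fuses three models with different architectures, Llama-2-7B, OpenLLaMA-7B, and MPT-7B, to combine their knowledge and strengthen the resulting model's capabilities. It performs well across several benchmarks and is suited to tasks such as text generation and reasoning.

Model Features

Multi-model knowledge fusion

Combines the knowledge and capabilities of three models with different architectures: Llama-2-7B, OpenLLaMA-7B, and MPT-7B.

Cross-architecture support

Fuses models with different architectures, which removes a limitation of conventional model merging.

Performance gains

Outperforms each individual source model on several benchmarks.

Lightweight training

Transfers knowledge through lightweight continual training, which keeps training efficient.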

Model Capabilities

Text generation

Commonsense reasoning

Code generation

Question answering

Reading comprehension

Machine translation

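Use Cases

Natural language processing

Intelligent question answering

Build question-answering systems that can handle complex questions.

Scores 38.17 (mc2) on the TruthfulQA benchmark.

Code generation

Generates code in multiple programming languages.

Scores 15.56 on the MultiPL-E benchmark.

Educational assistance

Science question answering

Helps students work through science and math problems.

Reaches an accuracy of 14.33 on the GSM8k math benchmark.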

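🚀 FuseLLM-7B: Knowledge Fusion of Large Language Models

FuseLLM-7B explores knowledge fusion for large language models (LLMs). The goal is a single unified model that combines the capabilities and distinctive strengths of multiple structurally different LLMs. The FuseLLM method uses the generative distributions of the source LLMs to transfer their collective knowledge and individual strengths into a target LLM, which makes it possible to fuse LLMs with different architectures.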

🚀 Quick Start

Environment Setup

This project uses Python 3.9. Install all libraries listed in requirements.txt with the following command:

pip install -r requirements.txt

Usage Example

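from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the slow (Python) tokenizer and the model weights.
tokenizer = AutoTokenizer.from_pretrained("Wanfq/FuseLLM-7B", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("Wanfq/FuseLLM-7B", torch_dtype="auto")
model.cuda()  # move the model to the GPU

# Tokenize the prompt and place the tensors on the same device as the model.
inputs = tokenizer("<your text here>", return_tensors="pt").to(model.device)

# Sample up to 512 new tokens with temperature and nucleus (top-p) sampling.
tokens = model.generate(
  **inputs,
  max_new_tokens=512,
  temperature=0.6,
  top_p=0.9,
  do_sample=True,
)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))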

A quantized version is also available at FuseLLM-7B-exl2. It was quantized with ExLlamaV2 v0.0.11.
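The `temperature=0.6` and `top_p=0.9` arguments in the generation call control how each next token is sampled. The sketch below illustrates the idea in plain Python on a toy logit vector. It is not the Transformers implementation, and the helper names (`apply_temperature`, `top_p_filter`, `sample_token`) are hypothetical.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Divide logits by the temperature and return softmax probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one token id from the temperature-scaled, top-p-filtered distribution."""
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # illustrative values only
    print(sample_token(toy_logits))
```

A temperature below 1 sharpens the distribution, and a `top_p` below 1 drops the low-probability tail before sampling, so the two settings together make the output less random than unrestricted sampling.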

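✨ Key Features

Knowledge fusion: explores knowledge fusion for LLMs to build a unified model that combines the capabilities and strengths of multiple structurally different LLMs.
Cross-architecture compatibility: unlike model ensembling and weight merging, FuseLLM can fuse multiple LLMs with different architectures into a more capable LLM.
Benchmark performance: performs well on several benchmarks, including BBH, ARC-easy, and ARC-challenge.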

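📚 Detailed Documentation

Data Construction

MiniPile is used as the dataset for continual training. The scripts below obtain the representations of multiple LLMs for model fusion.

Split long text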

python ./src/utils/split_long_text.py \
  --base_model_name_or_path "<path_to_llama_2_7b>" \
  --blending_model_name_or_path "<path_to_open_llama_7b_v2>" \
  --another_blending_model_name_or_path "<path_to_mpt_7b>" \
  --dataset "<path_to_minipile>" \
  --dataset_save_dir "<path_to_minipile_split>" \
  --cache_dir "<path_to_cache_dir>" \
  --block_size 2048 \
  --preprocessing_num_workers 80
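The splitting command takes three model paths and a `--block_size` of 2048, but this page does not describe what the script does internally. One plausible reading is that long documents are cut so that every piece fits within the block size under each tokenizer involved. The sketch below illustrates that assumption with toy token counters in place of real tokenizers; `split_long_text` and `count_fns` are hypothetical names, not the script's API.

```python
def split_long_text(words, count_fns, block_size):
    """Greedily group words so each piece stays within block_size tokens
    under every token counter in count_fns (assumed behaviour, see lead-in)."""
    pieces, current = [], []
    totals = [0] * len(count_fns)
    for word in words:
        costs = [fn(word) for fn in count_fns]
        # Start a new piece if adding this word would overflow any counter.
        if current and any(t + c > block_size for t, c in zip(totals, costs)):
            pieces.append(current)
            current, totals = [], [0] * len(count_fns)
        # A single word larger than block_size still gets its own piece.
        current.append(word)
        totals = [t + c for t, c in zip(totals, costs)]
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    # Two toy "tokenizers": one token per word, and one token per character.
    counters = [lambda w: 1, lambda w: len(w)]
    print(split_long_text(["ab", "cd", "e", "fgh"], counters, block_size=4))
```

Checking the budget against every counter means the most fine-grained tokenizer decides where a piece ends.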

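Get the representations from each LLM

# The dataset is split into 8 shards, and each shard is processed on its own GPU.
# Run this script once for each of llama_2_7b, open_llama_7b_v2, and mpt_7b.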
for i in {0..7}; do
export CUDA_VISIBLE_DEVICES=${i}
python ./src/utils/forward_for_logits.py \
  --model_name_or_path "<path_to_each_model>" \
  --dataset "<path_to_minipile_split>" \
  --dataset_save_dir "${i}_8_<path_to_minipile_split_each_model_representation>" \
  --dataset_split_num 8 \
  --dataset_index ${i} \
  --cache_dir "<path_to_cache_dir>" \
  --model_max_length 2048 \
  --training_mode full \
  --load_in_half bf16 \
  --batch_size 8 \
  --preprocessing_num_workers 80 \
  --top_k_logits 10 \
  --save_per_token_metric > "${i}_8_<path_to_log_file>" 2>&1 &
unset CUDA_VISIBLE_DEVICES
sleep 30
done

wait
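Two flags in the loop above lend themselves to a small illustration: `--dataset_split_num 8` with `--dataset_index ${i}` selects one shard per GPU, and `--top_k_logits 10` suggests that only the 10 largest logits per token are stored instead of the full vocabulary distribution. The sketch below shows both ideas in plain Python. The contiguous sharding scheme and the function names are assumptions for illustration, not the behaviour of `forward_for_logits.py`.

```python
import heapq


def shard_range(num_examples, num_shards, shard_index):
    """Index range of one contiguous shard (assumed sharding scheme)."""
    base, extra = divmod(num_examples, num_shards)
    start = shard_index * base + min(shard_index, extra)
    end = start + base + (1 if shard_index < extra else 0)
    return range(start, end)


def top_k_logits(logits, k=10):
    """Return (token_ids, values) for the k largest logits, highest first."""
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    return [i for i, _ in top], [v for _, v in top]


if __name__ == "__main__":
    # Shard 3 of 8 over a toy dataset of 20 examples.
    print(list(shard_range(20, 8, 3)))
    # Keep the 2 largest entries of a toy logit vector.
    print(top_k_logits([0.1, 2.0, -1.0, 3.5], k=2))
```

Storing a handful of (id, value) pairs per token position is far smaller than storing one value per vocabulary entry, which matters when the outputs of three models are cached for a whole corpus.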

Align the representations from different LLMs

# Get the vocabulary mapping between different LLMs.

# llama_2_7b <-> open_llama_7b_v2
python ./src/utils/vocab_mapping.py \
  --base_model_name_or_path "<path_to_llama_2_7b>" \
  --blending_model_name_or_path "<path_to_open_llama_7b_v2>" \
  --dataset_dir "<path_to_minipile_split>" \
  --vocab_mapping_save_dir "<path_to_llama_2_7b_open_llama_7b_v2_vocab_mapping>" \
  --cache_dir "<path_to_cache_dir>" \
  --model_max_length 2048 \
  --vocab_mapping_type "default" \
  --num_process 1

# llama_2_7b <-> mpt_7b
python ./src/utils/vocab_mapping.py \
  --base_model_name_or_path "<path_to_llama_2_7b>" \
  --blending_model_name_or_path "<path_to_mpt_7b>" \
  --dataset_dir "<path_to_minipile_split>" \
  --vocab_mapping_save_dir "<path_to_llama_2_7b_mpt_7b_vocab_mapping>" \
  --cache_dir "<path_to_cache_dir>" \
  --model_max_length 2048 \
  --vocab_mapping_type "default" \
  --num_process 1

# Align the representations from different LLMs.

# llama_2_7b <-> open_llama_7b_v2
for i in {0..7}; do
python ./src/utils/token_alignment.py \
  --base_model_name_or_path "<path_to_llama_2_7b>" \
  --blending_model_name_or_path "<path_to_open_llama_7b_v2>" \
  --base_dataset_dir "${i}_8_<path_to_minipile_split_llama_2_7b_representation>" \
  --blending_dataset_dir "${i}_8_<path_to_minipile_split_open_llama_7b_v2_representation>" \
  --dataset_save_dir "${i}_8_<path_to_minipile_split_llama_2_7b_open_llama_7b_v2_aligned_representation>" \
  --cache_dir "<path_to_cache_dir>" \
  --model_max_length 2048 \
  --preprocessing_num_workers 80 \
  --batch_size 100 \
  --blending_model_index 0 \
  --vocab_align_type "soft" \
  --vocab_mapping_save_dir "<path_to_llama_2_7b_open_llama_7b_v2_vocab_mapping>" \
  --metric_level "sequence"
done 

# llama_2_7b <-> mpt_7b
for i in {0..7}; do
python ./src/utils/token_alignment.py \
  --base_model_name_or_path "<path_to_llama_2_7b>" \
  --blending_model_name_or_path "<path_to_mpt_7b>" \
  --base_dataset_dir "${i}_8_<path_to_minipile_split_llama_2_7b_open_llama_7b_v2_aligned_representation>" \
  --blending_dataset_dir "${i}_8_<path_to_minipile_split_mpt_7b_representation>" \
  --dataset_save_dir "${i}_8_<path_to_minipile_split_llama_2_7b_open_llama_7b_v2_mpt_7b_aligned_representation>" \
  --cache_dir "<path_to_cache_dir>" \
  --model_max_length 2048 \
  --preprocessing_num_workers 80 \
  --batch_size 100 \
  --blending_model_index 1 \
  --vocab_align_type "soft" \
  --vocab_mapping_save_dir "<path_to_llama_2_7b_mpt_7b_vocab_mapping>" \
  --metric_level "sequence"
done
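The source models use different tokenizers, so their vocabularies have to be mapped onto the base model's vocabulary before their outputs can be compared. This page does not describe what `--vocab_mapping_type "default"` does. The sketch below shows one simple strategy, assumed here purely for illustration: keep exact matches, and otherwise map a token to the base token with the smallest edit distance. `edit_distance` and `map_vocab` are hypothetical helpers, not functions from the repository.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def map_vocab(blending_vocab, base_vocab):
    """Map each blending-model token to a base-model token.

    Exact matches map to themselves; other tokens map to the closest
    base token by edit distance (assumed strategy, see lead-in).
    """
    base_set = set(base_vocab)
    mapping = {}
    for token in blending_vocab:
        if token in base_set:
            mapping[token] = token
        else:
            mapping[token] = min(base_vocab, key=lambda b: edit_distance(token, b))
    return mapping


if __name__ == "__main__":
    # Toy vocabularies, not real tokenizer entries.
    print(map_vocab(["hello", "wrld"], ["hello", "world", "word"]))
```

The linear scan over the base vocabulary is quadratic overall, so a real implementation would need caching or an index; the sketch only shows the mapping rule.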

Pack all features to speed up training

for i in {0..7}; do
python3 ./src/utils/packing.py \
  --dataset_dir "${i}_8_<path_to_minipile_split_llama_2_7b_open_llama_7b_v2_mpt_7b_aligned_representation>" \
  --dataset_save_dir "${i}_8_<path_to_miniplie_fusellm_processed>" \
  --cache_dir "<path_to_cache_dir>" \
  --model_max_length 2048 \
  --preprocessing_num_workers 80 \
  --batch_size 1000 \
  --metric_level "sequence"
done

The final processed data is stored in ${i}_8_<path_to_miniplie_fusellm_processed>, where i in {0..7}.
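Packing concatenates short examples so that each training sequence is close to the maximum length and fewer positions are spent on padding. The sketch below shows a greedy, order-preserving version for token-id lists with `max_length` set the same way as `--model_max_length 2048`. It is an illustration of the general technique; what `packing.py` does with the aligned per-token features is not described on this page, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(sequences, max_length=2048):
    """Greedily concatenate sequences into bins of at most max_length tokens."""
    packed, current = [], []
    for seq in sequences:
        if len(seq) > max_length:
            raise ValueError("sequence longer than max_length; split it first")
        # Close the current bin if this sequence would not fit.
        if current and len(current) + len(seq) > max_length:
            packed.append(current)
            current = []
        current.extend(seq)
    if current:
        packed.append(current)
    return packed


if __name__ == "__main__":
    toy = [[1, 2], [3], [4, 5, 6], [7]]
    print(pack_sequences(toy, max_length=4))
```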

Training

The FuseLLM training script is shown below:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

deepspeed --master_port=20001 ./src/train.py \
  --training_mode full \
  --deepspeed ./config/zero_stage2_config.json \
  --model_name_or_path "<path_to_llama_2_7b>" \
  --output_dir "<path_to_save_fusellm_7b>" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "0_8_<path_to_miniplie_fusellm_processed>,1_8_<path_to_miniplie_fusellm_processed>,2_8_<path_to_miniplie_fusellm_processed>,3_8_<path_to_miniplie_fusellm_processed>,4_8_<path_to_miniplie_fusellm_processed>,5_8_<path_to_miniplie_fusellm_processed>,6_8_<path_to_miniplie_fusellm_processed>,7_8_<path_to_miniplie_fusellm_processed>" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-5 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn True \
  --report_to tensorboard \
  --do_distill \
  --distill_with_ref_model True \
  --distill_with_aligned_model_0 True \
  --distill_with_aligned_model_1 True \
  --distill_loss_type "ce" \
  --distill_teacher_temperature 1.0 \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --remove_unused_columns False 2>&1 | tee "<path_to_log_file>"
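The command above uses 8 visible GPUs, a per-device batch size of 1, and 16 gradient-accumulation steps, together with a cosine schedule, a warmup ratio of 0.008, and a peak learning rate of 1e-5. The sketch below computes the sequences per optimizer step and a learning-rate curve with linear warmup and cosine decay. The schedule formula is a common form of this schedule and is assumed here; the exact behaviour depends on the trainer, and `total_steps=1000` is an arbitrary illustrative value.

```python
import math


def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Sequences contributing to one optimizer step."""
    return num_gpus * per_device_batch * grad_accum_steps


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_ratio=0.008):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    batch = effective_batch_size(num_gpus=8, per_device_batch=1, grad_accum_steps=16)
    print(batch)         # sequences per optimizer step
    print(batch * 2048)  # upper bound on tokens per step at model_max_length 2048
    for step in (0, 4, 8, 500, 1000):
        print(step, lr_at_step(step, total_steps=1000))
```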

Evaluation

The evaluation code used is as follows:

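🔧 Technical Details

This work explores knowledge fusion for large language models (LLMs), with the goal of building a unified model that combines the capabilities and distinctive strengths of multiple structurally different LLMs. To do this, it introduces the FuseLLM method. FuseLLM first uses the generative distributions of the source LLMs to externalize their collective knowledge and individual strengths, then transfers them into the target LLM through lightweight continual training.

Model ensembling requires deploying multiple LLMs in parallel, and weight merging is usually limited to LLMs that share the same architecture. FuseLLM, by contrast, is designed to fuse multiple LLMs with different architectures into a more capable LLM. Because it explicitly transfers their knowledge and capabilities into a single target LLM, FuseLLM offers a powerful and flexible approach to knowledge fusion for LLMs.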
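The description above can be made concrete for a single token position: the source models each produce a distribution over the vocabulary, these are fused into one distribution, and the target model is trained toward it alongside its usual language-modelling objective. The sketch below works through that in plain Python. The selection rule (take the source distribution with the lowest cross-entropy on the gold token) is an assumption chosen for illustration, `lm_loss_weight=0.9` mirrors the `--lm_loss_weight` flag in the training script, and how the actual code combines the terms is not shown on this page.

```python
import math


def cross_entropy(probs, gold_id):
    """Negative log-probability assigned to the gold token."""
    return -math.log(probs[gold_id])


def min_ce_fusion(source_probs, gold_id):
    """Pick the source distribution that best predicts the gold token
    (assumed fusion rule, see lead-in)."""
    return min(source_probs, key=lambda p: cross_entropy(p, gold_id))


def fused_loss(target_probs, source_probs, gold_id, lm_loss_weight=0.9):
    """Weighted sum of the LM loss and the cross-entropy to the fused distribution."""
    lm_loss = cross_entropy(target_probs, gold_id)
    fused = min_ce_fusion(source_probs, gold_id)
    fusion_loss = -sum(q * math.log(p) for q, p in zip(fused, target_probs) if q > 0)
    return lm_loss_weight * lm_loss + (1.0 - lm_loss_weight) * fusion_loss


if __name__ == "__main__":
    # Toy 3-token vocabulary; token 0 is the gold next token.
    sources = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
    target = [0.4, 0.3, 0.3]
    print(min_ce_fusion(sources, gold_id=0))
    print(fused_loss(target, sources, gold_id=0))
```

Because only the source models' output distributions are needed, they can be computed once offline and cached, and only the target model is updated during training.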

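📄 License

This project is released under the Apache-2.0 license.

Citation

If you find this work relevant to your research or applications, please feel free to cite it: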

@inproceedings{wan2024knowledge,
    title={Knowledge Fusion of Large Language Models},
    author={Fanqi Wan and Xinting Huang and Deng Cai and Xiaojun Quan and Wei Bi and Shuming Shi},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/pdf?id=jiDsk12qcz}
}

Open LLM Leaderboard Evaluation Results

Detailed results can be found here.

Metric	Value
Avg.	51.07
AI2 Reasoning Challenge (25-shot)	53.24
HellaSwag (10-shot)	78.72
MMLU (5-shot)	47.93
TruthfulQA (0-shot)	38.17
Winogrande (5-shot)	74.03
GSM8k (5-shot)	14.33
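The average in the table is consistent with an unweighted mean of the six benchmark scores. A quick check in Python, using the values from the table (the function name `leaderboard_average` is just for illustration):

```python
def leaderboard_average(scores):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


SCORES = {
    "AI2 Reasoning Challenge (25-shot)": 53.24,
    "HellaSwag (10-shot)": 78.72,
    "MMLU (5-shot)": 47.93,
    "TruthfulQA (0-shot)": 38.17,
    "Winogrande (5-shot)": 74.03,
    "GSM8k (5-shot)": 14.33,
}

if __name__ == "__main__":
    print(leaderboard_average(SCORES))  # 51.07
```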