laser - dolphin - mixtral - 2x7b - dpo開源模型，評估性能平均提升約1分的實用之選

首頁

Laser Dolphin Mixtral 2x7b Dpo

由macadeliccc開發

基於Dolphin-2.6-Mistral-7B-DPO-Laser的中等規模混合專家(MoE)實現，在評估性能上平均提升約1分

大型語言模型

Transformers

開源協議:Apache-2.0 #混合專家模型 #文本生成優化 #多任務評估

下載量 133

發布時間 : 1/8/2024

模型概述

這是一個基於混合專家架構的文本生成模型，通過激光處理優化，適用於多種自然語言處理任務。

模型特點

混合專家架構

採用2x7B參數的混合專家架構，平衡性能與效率

激光處理優化

通過激光處理技術優化模型性能

量化支持

提供多種量化版本，包括ExLlamav2、GGUF和AWQ格式

性能提升

相比前一版本在評估性能上平均提升約1分

模型能力

文本生成

代碼生成

問答系統

推理任務

使用案例

編程輔助

代碼生成

根據自然語言描述生成Python代碼

能夠生成如快速排序算法等常見代碼

教育

解題輔助

解答數學和邏輯問題

在GSM8k數學測試集上準確率為48.29%

通用問答

事實問答

回答基於事實的問題

在TruthfulQA測試集上mc2得分為60.76

🚀 Laser-Dolphin-Mixtral-2x7b-dpo

Laser-Dolphin-Mixtral-2x7b-dpo 是一個基於特定模型改進而來的中型 MoE 實現模型，在文本生成任務上有不錯的表現，新版本在評估性能上平均提升了約 1 分。

🚀 快速開始

Ollama 使用

ollama run macadeliccc/laser-dolphin-mixtral-2x7b-dpo

代碼示例

切換註釋的模型定義以使用 4 位量化。該模型在 9GB 顯存下仍能工作，並且在性能上比單個 7B 模型大約高出 5 - 6 分。

from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_response(prompt):
    """
    Generate a response from the model based on the input prompt.

    Args:
    prompt (str): Prompt for the model.

    Returns:
    str: The generated response from the model.
    """
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate output tokens
    outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)

    # Decode the generated tokens to a string
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

# Load the model and tokenizer
model_id = "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

prompt = "Write a quicksort algorithm in python"

# Generate and print responses for each language
print("Response:")
print(generate_response(prompt), "\n")

你可以點擊 colab 查看使用示例。

✨ 主要特性

性能提升：新版本在評估性能上平均提升了約 1 分。
多任務支持：在多種文本生成任務上有良好表現，如 AI2 Reasoning Challenge、HellaSwag、MMLU 等。
多種量化支持：支持 ExLlamav2、GGUF、AWQ 等多種量化方式。

📦 安裝指南

文檔未提及具體安裝步驟，可參考相關依賴庫的安裝方式。

📚 詳細文檔

模型概述

該模型基於 cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser 構建，是一箇中型 MoE 實現。

處理流程

具體流程可參考 notebook。
mergekit_config 在文件中。
配置中使用的模型未經過處理，但最終產品經過處理，這是與上一版本的更新之處。
該處理流程是實驗性的，實際效果可能有所不同。

未來目標

[ ] 實現函數調用功能。
[ ] 使用新的基礎模型發佈 v2 版本以提升性能。

量化方式

ExLlamav2

這是推薦給在 GPU 上運行模型的用戶的量化方式。感謝用戶 bartowski，現在有 3.5 到 8 bpw 的 exllamav2 量化版本，可在 bartowski/laser-dolphin-mixtral-2x7b-dpo-exl2 查看。

分支	比特數	lm_head 比特數	VRAM (4k)	VRAM (16k)	VRAM (32k)	描述
8_0	8.0	8.0	13.7 GB	15.1 GB	17.2 GB	ExLlamaV2 能產生的最高質量，接近未量化的性能。
6_5	6.5	8.0	11.5 GB	12.9 GB	15.0 GB	在大幅減小模型大小的情況下接近未量化的性能，推薦使用。
5_0	5.0	6.0	9.3 GB	10.7 GB	12.8 GB	與 6.5 相比質量略低，適合有 12GB 顯存且上下文長度為 16k 的顯卡。
4_25	4.25	6.0	8.2 GB	9.6 GB	11.7 GB	相當於 GPTQ 的比特數。
3_5	3.5	6.0	7.0 GB	8.4 GB	10.5 GB	質量較低，不推薦使用。

GGUF

當前 GGUF 量化版本。

AWQ

當前 AWQ 量化版本。

TheBloke

由 TheBloke 提供的量化版本可能會導致不可預測的行為。由於模型已更新，有新的量化版本可用。

HF Spaces

GGUF 聊天可在這裡使用。
4 位 bnb 聊天可在這裡使用。

🔧 技術細節

評估結果

EQ Bench

----Benchmark Complete----
2024-01-31 16:55:37
Time taken: 31.1 mins
Prompt Format: ChatML
Model: macadeliccc/laser-dolphin-mixtral-2x7b-dpo-GGUF
Score (v2): 72.76
Parseable: 171.0
---------------
Batch completed
Time taken: 31.2 mins
---------------

評估 colab

先前評估總結

模型	AGIEval	GPT4All	TruthfulQA	Bigbench	平均
laser-dolphin-mixtral-2x7b-dpo	41.31	73.67	61.69	42.79	54.87

當前詳細評估

模型	AGIEval	GPT4All	TruthfulQA	Bigbench	平均
laser-dolphin-mixtral-2x7b-dpo	42.25	73.45	63.44	43.96	55.77

AGIEval

任務	版本	指標	值		標準誤差
agieval_aqua_rat	0	acc	21.26	±	2.57
		acc_norm	21.65	±	2.59
agieval_logiqa_en	0	acc	34.72	±	1.87
		acc_norm	35.64	±	1.88
agieval_lsat_ar	0	acc	26.96	±	2.93
		acc_norm	26.96	±	2.93
agieval_lsat_lr	0	acc	45.88	±	2.21
		acc_norm	46.08	±	2.21
agieval_lsat_rc	0	acc	59.48	±	3.00
		acc_norm	59.48	±	3.00
agieval_sat_en	0	acc	73.79	±	3.07
		acc_norm	73.79	±	3.07
agieval_sat_en_without_passage	0	acc	42.23	±	3.45
		acc_norm	41.26	±	3.44
agieval_sat_math	0	acc	37.27	±	3.27
		acc_norm	33.18	±	3.18

平均：42.25%

GPT4All

任務	版本	指標	值		標準誤差
arc_challenge	0	acc	58.36	±	1.44
		acc_norm	58.02	±	1.44
arc_easy	0	acc	82.20	±	0.78
		acc_norm	77.40	±	0.86
boolq	1	acc	87.52	±	0.58
hellaswag	0	acc	67.50	±	0.47
		acc_norm	84.43	±	0.36
openbookqa	0	acc	34.40	±	2.13
		acc_norm	47.00	±	2.23
piqa	0	acc	81.61	±	0.90
		acc_norm	82.59	±	0.88
winogrande	0	acc	77.19	±	1.18

平均：73.45%

GSM8K

任務	版本	指標	值
gsm8k	2	exact_match,get-answer	0.75
		exact_match_stderr,get-answer	0.01
		alias	gsm8k

TruthfulQA

任務	版本	指標	值		標準誤差
truthfulqa_mc	1	mc1	45.90	±	1.74
		mc2	63.44	±	1.56

平均：63.44%

Bigbench

任務	版本	指標	值		標準誤差
bigbench_causal_judgement	0	multiple_choice_grade	58.42	±	3.59
bigbench_date_understanding	0	multiple_choice_grade	60.70	±	2.55
bigbench_disambiguation_qa	0	multiple_choice_grade	38.37	±	3.03
bigbench_geometric_shapes	0	multiple_choice_grade	21.73	±	2.18
		exact_str_match	0.00	±	0.00
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	35.00	±	2.14
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.57	±	1.61
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	50.33	±	2.89
bigbench_movie_recommendation	0	multiple_choice_grade	45.00	±	2.23
bigbench_navigate	0	multiple_choice_grade	50.00	±	1.58
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	60.35	±	1.09
bigbench_ruin_names	0	multiple_choice_grade	51.12	±	2.36
bigbench_salient_translation_error_detection	0	multiple_choice_grade	32.26	±	1.48
bigbench_snarks	0	multiple_choice_grade	67.96	±	3.48
bigbench_sports_understanding	0	multiple_choice_grade	70.59	±	1.45
bigbench_temporal_sequences	0	multiple_choice_grade	35.80	±	1.52
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.56	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.20	±	0.90
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	50.33	±	2.89

平均：43.96%

平均得分：55.77%

耗時：02:43:45

Open LLM Leaderboard 評估結果

詳細結果可查看這裡

指標	值
平均	67.16
AI2 Reasoning Challenge (25-Shot)	65.96
HellaSwag (10-Shot)	85.80
MMLU (5-Shot)	63.17
TruthfulQA (0-shot)	60.76
Winogrande (5-shot)	79.01
GSM8k (5-shot)	48.29

📄 許可證

本項目採用 Apache-2.0 許可證。

📖 引用

文獻引用

Fernando Fernandes Neto 和 Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.

@article{sharma2023truth,
title={The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction},
author={Sharma, Pratyusha and Ash, Jordan T and Misra, Dipendra},
journal={arXiv preprint arXiv:2312.13558},
year={2023} }

@article{gao2021framework,
  title={A framework for few-shot language model evaluation},
  author={Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and others},
  journal={Version v0. 0.1. Sept},
  year={2021}
}