laser - dolphin - mixtral - 2x7b - dpo开源模型，评估性能平均提升约1分的实用之选

首页

Laser Dolphin Mixtral 2x7b Dpo

由 macadeliccc 开发

基于Dolphin-2.6-Mistral-7B-DPO-Laser的中等规模混合专家(MoE)实现，在评估性能上平均提升约1分

大型语言模型

Transformers

开源协议:Apache-2.0 #混合专家模型 #文本生成优化 #多任务评估

下载量 133

发布时间 : 1/8/2024

模型简介

这是一个基于混合专家架构的文本生成模型，通过激光处理优化，适用于多种自然语言处理任务。

模型特点

混合专家架构

采用2x7B参数的混合专家架构，平衡性能与效率

激光处理优化

通过激光处理技术优化模型性能

量化支持

提供多种量化版本，包括ExLlamav2、GGUF和AWQ格式

性能提升

相比前一版本在评估性能上平均提升约1分

模型能力

文本生成

代码生成

问答系统

推理任务

使用案例

编程辅助

代码生成

根据自然语言描述生成Python代码

能够生成如快速排序算法等常见代码

教育

解题辅助

解答数学和逻辑问题

在GSM8k数学测试集上准确率为48.29%

通用问答

事实问答

回答基于事实的问题

在TruthfulQA测试集上mc2得分为60.76

🚀 Laser-Dolphin-Mixtral-2x7b-dpo

Laser-Dolphin-Mixtral-2x7b-dpo 是一个基于特定模型改进而来的中型 MoE 实现模型，在文本生成任务上有不错的表现，新版本在评估性能上平均提升了约 1 分。

🚀 快速开始

Ollama 使用

ollama run macadeliccc/laser-dolphin-mixtral-2x7b-dpo

代码示例

切换注释的模型定义以使用 4 位量化。该模型在 9GB 显存下仍能工作，并且在性能上比单个 7B 模型大约高出 5 - 6 分。

from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_response(prompt):
    """
    Generate a response from the model based on the input prompt.

    Args:
    prompt (str): Prompt for the model.

    Returns:
    str: The generated response from the model.
    """
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate output tokens
    outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)

    # Decode the generated tokens to a string
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

# Load the model and tokenizer
model_id = "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

prompt = "Write a quicksort algorithm in python"

# Generate and print responses for each language
print("Response:")
print(generate_response(prompt), "\n")

你可以点击 colab 查看使用示例。

✨ 主要特性

性能提升：新版本在评估性能上平均提升了约 1 分。
多任务支持：在多种文本生成任务上有良好表现，如 AI2 Reasoning Challenge、HellaSwag、MMLU 等。
多种量化支持：支持 ExLlamav2、GGUF、AWQ 等多种量化方式。

📦 安装指南

文档未提及具体安装步骤，可参考相关依赖库的安装方式。

📚 详细文档

模型概述

该模型基于 cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser 构建，是一个中型 MoE 实现。

处理流程

具体流程可参考 notebook。
mergekit_config 在文件中。
配置中使用的模型未经过处理，但最终产品经过处理，这是与上一版本的更新之处。
该处理流程是实验性的，实际效果可能有所不同。

未来目标

[ ] 实现函数调用功能。
[ ] 使用新的基础模型发布 v2 版本以提升性能。

量化方式

ExLlamav2

这是推荐给在 GPU 上运行模型的用户的量化方式。感谢用户 bartowski，现在有 3.5 到 8 bpw 的 exllamav2 量化版本，可在 bartowski/laser-dolphin-mixtral-2x7b-dpo-exl2 查看。

分支	比特数	lm_head 比特数	VRAM (4k)	VRAM (16k)	VRAM (32k)	描述
8_0	8.0	8.0	13.7 GB	15.1 GB	17.2 GB	ExLlamaV2 能产生的最高质量，接近未量化的性能。
6_5	6.5	8.0	11.5 GB	12.9 GB	15.0 GB	在大幅减小模型大小的情况下接近未量化的性能，推荐使用。
5_0	5.0	6.0	9.3 GB	10.7 GB	12.8 GB	与 6.5 相比质量略低，适合有 12GB 显存且上下文长度为 16k 的显卡。
4_25	4.25	6.0	8.2 GB	9.6 GB	11.7 GB	相当于 GPTQ 的比特数。
3_5	3.5	6.0	7.0 GB	8.4 GB	10.5 GB	质量较低，不推荐使用。

GGUF

当前 GGUF 量化版本。

AWQ

当前 AWQ 量化版本。

TheBloke

由 TheBloke 提供的量化版本可能会导致不可预测的行为。由于模型已更新，有新的量化版本可用。

HF Spaces

GGUF 聊天可在这里使用。
4 位 bnb 聊天可在这里使用。

🔧 技术细节

评估结果

EQ Bench

----Benchmark Complete----
2024-01-31 16:55:37
Time taken: 31.1 mins
Prompt Format: ChatML
Model: macadeliccc/laser-dolphin-mixtral-2x7b-dpo-GGUF
Score (v2): 72.76
Parseable: 171.0
---------------
Batch completed
Time taken: 31.2 mins
---------------

评估 colab

先前评估总结

模型	AGIEval	GPT4All	TruthfulQA	Bigbench	平均
laser-dolphin-mixtral-2x7b-dpo	41.31	73.67	61.69	42.79	54.87

当前详细评估

模型	AGIEval	GPT4All	TruthfulQA	Bigbench	平均
laser-dolphin-mixtral-2x7b-dpo	42.25	73.45	63.44	43.96	55.77

AGIEval

任务	版本	指标	值		标准误差
agieval_aqua_rat	0	acc	21.26	±	2.57
		acc_norm	21.65	±	2.59
agieval_logiqa_en	0	acc	34.72	±	1.87
		acc_norm	35.64	±	1.88
agieval_lsat_ar	0	acc	26.96	±	2.93
		acc_norm	26.96	±	2.93
agieval_lsat_lr	0	acc	45.88	±	2.21
		acc_norm	46.08	±	2.21
agieval_lsat_rc	0	acc	59.48	±	3.00
		acc_norm	59.48	±	3.00
agieval_sat_en	0	acc	73.79	±	3.07
		acc_norm	73.79	±	3.07
agieval_sat_en_without_passage	0	acc	42.23	±	3.45
		acc_norm	41.26	±	3.44
agieval_sat_math	0	acc	37.27	±	3.27
		acc_norm	33.18	±	3.18

平均：42.25%

GPT4All

任务	版本	指标	值		标准误差
arc_challenge	0	acc	58.36	±	1.44
		acc_norm	58.02	±	1.44
arc_easy	0	acc	82.20	±	0.78
		acc_norm	77.40	±	0.86
boolq	1	acc	87.52	±	0.58
hellaswag	0	acc	67.50	±	0.47
		acc_norm	84.43	±	0.36
openbookqa	0	acc	34.40	±	2.13
		acc_norm	47.00	±	2.23
piqa	0	acc	81.61	±	0.90
		acc_norm	82.59	±	0.88
winogrande	0	acc	77.19	±	1.18

平均：73.45%

GSM8K

任务	版本	指标	值
gsm8k	2	exact_match,get-answer	0.75
		exact_match_stderr,get-answer	0.01
		alias	gsm8k

TruthfulQA

任务	版本	指标	值		标准误差
truthfulqa_mc	1	mc1	45.90	±	1.74
		mc2	63.44	±	1.56

平均：63.44%

Bigbench

任务	版本	指标	值		标准误差
bigbench_causal_judgement	0	multiple_choice_grade	58.42	±	3.59
bigbench_date_understanding	0	multiple_choice_grade	60.70	±	2.55
bigbench_disambiguation_qa	0	multiple_choice_grade	38.37	±	3.03
bigbench_geometric_shapes	0	multiple_choice_grade	21.73	±	2.18
		exact_str_match	0.00	±	0.00
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	35.00	±	2.14
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.57	±	1.61
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	50.33	±	2.89
bigbench_movie_recommendation	0	multiple_choice_grade	45.00	±	2.23
bigbench_navigate	0	multiple_choice_grade	50.00	±	1.58
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	60.35	±	1.09
bigbench_ruin_names	0	multiple_choice_grade	51.12	±	2.36
bigbench_salient_translation_error_detection	0	multiple_choice_grade	32.26	±	1.48
bigbench_snarks	0	multiple_choice_grade	67.96	±	3.48
bigbench_sports_understanding	0	multiple_choice_grade	70.59	±	1.45
bigbench_temporal_sequences	0	multiple_choice_grade	35.80	±	1.52
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.56	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.20	±	0.90
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	50.33	±	2.89

平均：43.96%

平均得分：55.77%

耗时：02:43:45

Open LLM Leaderboard 评估结果

详细结果可查看这里

指标	值
平均	67.16
AI2 Reasoning Challenge (25-Shot)	65.96
HellaSwag (10-Shot)	85.80
MMLU (5-Shot)	63.17
TruthfulQA (0-shot)	60.76
Winogrande (5-shot)	79.01
GSM8k (5-shot)	48.29

📄 许可证

本项目采用 Apache-2.0 许可证。

📖 引用

文献引用

Fernando Fernandes Neto 和 Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.

@article{sharma2023truth,
title={The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction},
author={Sharma, Pratyusha and Ash, Jordan T and Misra, Dipendra},
journal={arXiv preprint arXiv:2312.13558},
year={2023} }

@article{gao2021framework,
  title={A framework for few-shot language model evaluation},
  author={Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and others},
  journal={Version v0. 0.1. Sept},
  year={2021}
}