Einstein-v7-Qwen2-7B: an open-source, freely available text generation model for multiple scientific domains

Einstein V7 Qwen2 7B

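Developed by Weyaxi

Einstein-v7-Qwen2-7B is a text generation model obtained by fully fine-tuning Qwen/Qwen2-7B on a range of scientific datasets. It targets science, physics, chemistry, biology, mathematics, and related fields.

Large language model

Transformers

Language: English · License: other · #ScienceDomainExpert #MultidisciplinaryKnowledgeBase #ChatMLDialogueOptimized

Downloads: 1,927

Released: 6/24/2024

Model overview

This model is a fully fine-tuned version of the Qwen2-7B architecture. It focuses on text generation for scientific domains and supports multi-domain question answering and content generation.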

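Model features

Multi-domain scientific knowledge

Trained specifically on science, physics, chemistry, biology, mathematics, and other fields, giving it domain-specific text generation ability

High-performance hardware

Fine-tuning ran on 8xMI300X hardware

ChatML template support

Supports the ChatML conversation template for dialogue-style text generation

Long-context handling

Supports a sequence length of 8192, so it can process long inputs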

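Model capabilities

Scientific text generation

Multi-domain question answering

Specialized content creation

Educational assistance

Research support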

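Use cases

Education

Explaining scientific knowledge

Explains complex scientific concepts and principles to students

Provides accurate, easy-to-follow explanations of scientific knowledge

Homework help

Helps students solve homework problems in science, mathematics, and other subjects

Provides step-by-step solutions with detailed explanations

Research

Literature summaries

Generates summaries and key points of scientific literature for researchers

Helps readers grasp the core content of a paper quickly

Research idea generation

Helps researchers generate new research ideas and experimental designs

Suggests new research directions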

🚀 🔬 Einstein-v7-Qwen2-7B

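Einstein-v7-Qwen2-7B is a model obtained by fully fine-tuning Qwen/Qwen2-7B on a variety of datasets. It is aimed at text generation in science, physics, chemistry, biology, mathematics, and related fields.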

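🚀 Quick start

Basic model information

Property	Details
Base model	Qwen/Qwen2-7B
Model type	Text generation model, fully fine-tuned from Qwen2-7B
Training datasets	allenai/ai2_arc, camel-ai/physics, camel-ai/chemistry, and many others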

Prompt template

You can use the ChatML prompt template with this model:

ChatML

<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{user}<|im_end|>
<|im_start|>assistant
{assistant}<|im_end|>
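To make the layout of the template concrete, here is a minimal sketch that renders a message list into a ChatML string by hand. The helper name `to_chatml` is hypothetical, and the newline placed after each `<|im_end|>` is assumed from the template as printed above; the tokenizer's own chat template remains the authoritative formatter.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        # One turn: start marker, role, newline, content, end marker.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

Building the string by hand is mainly useful for inspecting or debugging prompts; for real inference the tokenizer method is less error-prone.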

This prompt template is available as a chat template, which means you can format messages with the tokenizer.apply_chat_template() method:

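from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Weyaxi/Einstein-v7-Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained("Weyaxi/Einstein-v7-Qwen2-7B")
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"}
]
# return_dict=True yields input_ids and attention_mask, which can be unpacked into generate()
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
model.generate(**gen_input)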
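If the generated ids are decoded without stripping special tokens, the text still carries the ChatML markers. The following sketch splits such text back into turns so the final assistant reply can be picked out. The helper `parse_chatml` and the `decoded` string are illustrative assumptions, not output from the model.

```python
import re

# Matches one ChatML turn. The closing <|im_end|> is optional so that an
# unfinished final turn (for example, generation cut off by a token limit)
# is still captured.
TURN_PATTERN = re.compile(
    r"<\|im_start\|>(\w+)\n(.*?)(?:<\|im_end\|>|\Z)", re.DOTALL
)


def parse_chatml(text):
    """Split ChatML-formatted text into a list of {"role", "content"} dicts."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in TURN_PATTERN.findall(text)
    ]


# Hand-written example transcript, standing in for decoded model output.
decoded = (
    "<|im_start|>system\nYou are a helpful AI assistant.<|im_end|>\n"
    "<|im_start|>user\nHello!<|im_end|>\n"
    "<|im_start|>assistant\nHi! How can I help?<|im_end|>"
)
turns = parse_chatml(decoded)
print(turns[-1]["content"])
```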

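✨ Key features

Multi-domain training data: trained on datasets covering science, physics, chemistry, biology, mathematics, and other fields, targeting text generation in those domains.
Fine-tuning hardware: fine-tuned on 8xMI300X hardware.
ChatML template support: convenient for conversational text generation.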

💻 Usage examples

Basic usage

Text generation with the ChatML template:

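from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Weyaxi/Einstein-v7-Qwen2-7B")
model = AutoModelForCausalLM.from_pretrained("Weyaxi/Einstein-v7-Qwen2-7B")
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello!"}
]
# return_dict=True yields input_ids and attention_mask, which can be unpacked into generate()
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True)
model.generate(**gen_input)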

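📚 Detailed documentation

Dataset usage

The datasets used to train this model are listed in the metadata section of the model card. Note that some of the datasets named in the metadata may have been filtered according to various criteria. The results of the filtering process and related information are kept in a separate repository: Weyaxi/sci-datasets/main

Quantized versions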

GGUF @bartowski

https://huggingface.co/bartowski/Einstein-v7-Qwen2-7B-GGUF

ExLlamaV2 @bartowski

https://huggingface.co/bartowski/Einstein-v7-Qwen2-7B-exl2

Evaluation results

Open LLM Leaderboard v2 evaluation results (detailed results are available on the leaderboard):

Metric	Value
Average	24.01
IFEval (0-Shot)	41.00
BBH (3-Shot)	32.84
MATH Lvl 5 (4-Shot)	15.18
GPQA (0-shot)	6.60
MuSR (0-shot)	14.06
MMLU-PRO (5-shot)	34.40
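As a quick consistency check, the reported average can be recomputed from the six benchmark rows. This sketch assumes the average is the unweighted mean of the scores listed in the table.

```python
# Scores copied from the table above.
scores = {
    "IFEval (0-Shot)": 41.00,
    "BBH (3-Shot)": 32.84,
    "MATH Lvl 5 (4-Shot)": 15.18,
    "GPQA (0-shot)": 6.60,
    "MuSR (0-shot)": 14.06,
    "MMLU-PRO (5-shot)": 34.40,
}

# Unweighted mean over the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 24.01
```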

Training information

The model was fully fine-tuned for 2 epochs, for a total of 500 steps.
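The step count can be related to the batch settings in the axolotl config under the technical details. The sketch below is back-of-the-envelope arithmetic under two assumptions that the source does not state: that all 8 MI300X devices act as data-parallel ranks, and that the 500 steps are optimizer steps. Because the config sets pad_to_sequence_len, the result counts token slots including padding, so it is an upper bound on the number of real tokens seen.

```python
MICRO_BATCH_SIZE = 6             # micro_batch_size in the axolotl config
GRADIENT_ACCUMULATION_STEPS = 4  # gradient_accumulation_steps
SEQUENCE_LEN = 8192              # sequence_len, with pad_to_sequence_len: true
NUM_DEVICES = 8                  # assumption: 8 data-parallel ranks (8xMI300X)
TOTAL_STEPS = 500                # assumption: these are optimizer steps

# Packed sequences consumed per optimizer step across all devices.
sequences_per_step = MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
# Token slots per optimizer step (padding included).
tokens_per_step = sequences_per_step * SEQUENCE_LEN
# Token slots over the whole run (both epochs).
total_token_slots = tokens_per_step * TOTAL_STEPS

print(sequences_per_step)  # 192
print(tokens_per_step)     # 1572864
print(total_token_slots)   # 786432000
```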

Loss graph (image not reproduced here)

🔧 Technical details

axolotl config

axolotl version: 0.4.0

base_model: Qwen/Qwen2-7B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: chatml
datasets:
  - path: data/airoboros_3.2_without_contextual_slimorca_orca_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/allenai_wild_chat_gpt4_english_toxic_random_half_4k_sharegpt.json
    ds_type: json
    type: sharegpt
    strict: false
    conversation: chatml

  - path: data/buzz_unstacked_chosen_math_removed_filtered.json
    ds_type: json
    type: alpaca
    conversation: chatml

  - path: data/capybara_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/cot_alpaca_gpt4_extracted_openhermes_2.5_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/everythinglm-data-v3_sharegpt.json
    ds_type: json
    type: sharegpt
    strict: false
    conversation: chatml

  - path: data/gpt4_data_lmys_1m_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/gpteacher-instruct-special-alpaca.json
    ds_type: json
    type: gpteacher
    conversation: chatml

  - path: data/merged_all.json
    ds_type: json
    type: alpaca
    conversation: chatml

  - path: data/no_robots_sharegpt.json
    ds_type: json
    type: sharegpt
    strict: false
    conversation: chatml

  - path: data/oasst_top1_from_fusechatmixture_sharegpt.json
    ds_type: json
    type: sharegpt
    strict: false
    conversation: chatml

  - path: data/pippa_bagel_repo_3k_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/rpguild_quarter_alignment_lab_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/sharegpt_gpt4_english.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/slimorca_dedup_filtered_95k_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/soda_diaolog_longest_tenth_buzz_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/synthia-v1.3_sharegpt_12500.json
    ds_type: json
    type: sharegpt
    conversation: chatml

  - path: data/system_conversations_dolphin_sharegpt.json
    ds_type: json
    type: sharegpt
    conversation: chatml
  
dataset_prepared_path: last_run_prepared
val_set_size: 0.002

output_dir: ./Einstein-v7-Qwen2-7B-model

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true
eval_sample_packing: false

wandb_project: Einstein
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
hub_model_id: Weyaxi/Einstein-v7-Qwen2-7B

gradient_accumulation_steps: 4
micro_batch_size: 6
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00001 # look

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: unsloth
gradient_checkpointing_kwargs:
  use_reentrant: true # look
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 2
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:

deepspeed: deepspeed_configs/zero3_bf16.json
weight_decay: 0.05
fsdp:
fsdp_config:
special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|end_of_text|>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
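The config combines `lr_scheduler: cosine`, `learning_rate: 0.00001`, and `warmup_steps: 10`. The sketch below shows the shape such a schedule commonly takes: linear warmup followed by a half-cosine decay to zero. It is an illustration of that general form, not a copy of the scheduler axolotl builds, and it assumes the 500 total steps quoted in the training information.

```python
import math

LEARNING_RATE = 1e-5  # learning_rate in the config
WARMUP_STEPS = 10     # warmup_steps in the config
TOTAL_STEPS = 500     # assumption: total step count from the training section


def lr_at(step):
    """Learning rate at a given step: linear warmup, then half-cosine decay."""
    if step < WARMUP_STEPS:
        # Ramp linearly from 0 up to the peak learning rate.
        return LEARNING_RATE * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 10, 255, 500):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 10, falls to half the peak at step 255 (the midpoint of the decay phase), and reaches zero at step 500.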