h2o-danube3.1-4b-chat: open-source chat model with long-context conversation support, free to use


H2o Danube3.1 4b Chat

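Developed by h2oai

A 4-billion-parameter chat model fine-tuned by H2O.ai on an adjusted Llama 2 architecture, with support for an 8,192-token context length

Large language model

Transformers

English · License: Apache-2.0 · #offline-on-phone #long-context #multi-turn-dialogue

Downloads: 271

Released: 11/29/2024

Model overview

A 4-billion-parameter large language model optimized for chat that can run natively and fully offline on a mobile phone.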

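Model features

Mobile support

Runs natively and fully offline on a mobile phone

Long-context support

Supports a context length of 8,192 tokens

Efficient inference

Supports 4-bit and 8-bit quantization, so it can run on consumer hardware

Model capabilities

Text generation

Dialogue systems

Question answering

Use cases

Chat applications

Personal assistant

Serves as a personal chat assistant for everyday questions

Customer service bot

Automated question answering in customer service

Education

Tutoring

Helps students with their study questions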

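🚀 h2o-danube3.1-4b-chat model

h2o-danube3.1-4b-chat is a 4-billion-parameter chat model fine-tuned by H2O.ai. It was trained with H2O LLM Studio and can run natively and fully offline on a mobile phone.

🚀 Quick start

You can try running the model offline on your own phone with H2O AI Personal GPT.

✨ Key features

Parameter count: about 4 billion parameters, based on an adjusted Llama 2 architecture.
Offline use: runs natively and fully offline on a mobile phone.
Long context: uses the Mistral tokenizer with a 32,000-token vocabulary and supports a context length of up to 8,192 tokens.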
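Because the context window is capped at 8,192 tokens, a long multi-turn chat eventually has to be shortened before it is sent to the model. The sketch below shows one simple policy, assuming you want to keep the system message and drop the oldest turns first. `trim_history` and the word-count stand-in for the tokenizer are hypothetical helpers for illustration, not part of transformers or H2O's tooling.

```python
from typing import Callable

Message = dict[str, str]


def trim_history(
    messages: list[Message],
    count_tokens: Callable[[list[Message]], int],
    budget: int = 8192,
) -> list[Message]:
    """Drop the oldest non-system messages until the prompt fits in `budget` tokens.

    Hypothetical helper: `count_tokens` should return the token length of the
    fully templated prompt for a given message list.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Always keep at least the latest message (the question being asked).
    while len(turns) > 1 and count_tokens(system + turns) > budget:
        turns.pop(0)  # remove the oldest user/assistant message
        # Avoid starting the kept history with an assistant reply.
        while len(turns) > 1 and turns[0]["role"] == "assistant":
            turns.pop(0)
    return system + turns


def approx_tokens(msgs: list[Message]) -> int:
    # Crude stand-in for a tokenizer: one "token" per whitespace-separated word.
    return sum(len(m["content"].split()) for m in msgs)


history = [
    {"role": "system", "content": "You are a friendly and polite chatbot."},
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "I'm doing great, how about you?"},
    {"role": "user", "content": "Why is drinking water so healthy?"},
]
trimmed = trim_history(history, approx_tokens, budget=15)
print([m["role"] for m in trimmed])  # ['system', 'user']
```

With the real tokenizer, `count_tokens` could be `lambda msgs: len(tokenizer.apply_chat_template(msgs, add_generation_prompt=True))`, and `budget` should leave room for the reply, for example `8192 - max_new_tokens`.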

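📦 Installation

To run the model with the transformers library on a machine with a GPU, first make sure transformers is installed (the device_map="auto" argument used in the examples below additionally requires the accelerate package):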

pip install transformers==4.45.0

💻 Usage examples

Basic usage

import torch
from transformers import pipeline

generate_text = pipeline(
    model="h2oai/h2o-danube3.1-4b-chat",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# We use the HF Tokenizer chat template to format each message
# https://huggingface.co/docs/transformers/main/en/chat_templating

messages = [
    {
        "role": "system",
        "content": "You are a friendly and polite chatbot.",
    },
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "I'm doing great, how about you?"},
    {"role": "user", "content": "Why is drinking water so healthy?"},
]

res = generate_text(
    messages,
    renormalize_logits=True
)
print(res[0]["generated_text"][-1]['content'])
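The chat pipeline returns the whole conversation with the model's reply appended as the last message, which is why the example reads `res[0]["generated_text"][-1]['content']`. To keep chatting, one option is to feed that returned list back in with the next user message appended. The sketch below assumes exactly that output shape; `next_turn` is a hypothetical helper, and `fake_output` only imitates a pipeline result rather than being real model output.

```python
Message = dict[str, str]


def next_turn(generated_text: list[Message], user_message: str) -> list[Message]:
    """Return the returned conversation plus a new user turn (hypothetical helper).

    `generated_text` is assumed to be `res[0]["generated_text"]` from the chat
    pipeline, i.e. the input messages followed by the assistant's reply.
    """
    if not generated_text or generated_text[-1]["role"] != "assistant":
        raise ValueError("expected the conversation to end with an assistant reply")
    # Build a new list so the previous result is left untouched.
    return [*generated_text, {"role": "user", "content": user_message}]


# Stand-in for a pipeline result (placeholder content, not real model output).
fake_output = [
    {"role": "user", "content": "Why is drinking water so healthy?"},
    {"role": "assistant", "content": "(model reply)"},
]
messages = next_turn(fake_output, "How much water should I drink per day?")
print([m["role"] for m in messages])  # ['user', 'assistant', 'user']
```

In a real loop you would then call `generate_text(messages, renormalize_logits=True)` again with the extended list.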

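Advanced usage

To see what the prompt looks like after the chat template has been applied (the exact text that is passed to the tokenizer), print it: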

print(generate_text.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
))

You can also load the model and tokenizer yourself instead of using the pipeline, and handle the preprocessing steps explicitly:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "h2oai/h2o-danube3.1-4b-chat"

messages = [
    {
        "role": "system",
        "content": "You are a friendly and polite chatbot.",
    },
    {"role": "user", "content": "Hi, how are you?"},
    {"role": "assistant", "content": "I'm doing great, how about you?"},
    {"role": "user", "content": "Why is drinking water so healthy?"},
]

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()  # device_map="auto" already placed the weights, so no .cuda() call is needed

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)  # move input_ids and attention_mask to the model's device

tokens = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    renormalize_logits=True
)[0]

tokens = tokens[inputs["input_ids"].shape[1]:]  # drop the prompt tokens, keep only the reply
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
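The `generate` call above does not set `max_new_tokens`, so the reply length falls back to the defaults in the model's generation config. If you set it yourself, the prompt plus the reply must still fit in the 8,192-token window. A minimal sketch of that bookkeeping, with `cap_new_tokens` as a hypothetical helper:

```python
CONTEXT_LENGTH = 8192  # sequence length stated in the model card


def cap_new_tokens(prompt_len: int, requested: int, context_len: int = CONTEXT_LENGTH) -> int:
    """Largest number of new tokens that still fits in the context window (hypothetical helper)."""
    if prompt_len >= context_len:
        raise ValueError(f"prompt has {prompt_len} tokens; the window is {context_len}")
    return min(requested, context_len - prompt_len)


print(cap_new_tokens(prompt_len=100, requested=512))   # 512
print(cap_new_tokens(prompt_len=8000, requested=512))  # 192
```

With the example above, this could be passed as `max_new_tokens=cap_new_tokens(inputs["input_ids"].shape[1], 512)` in the `model.generate(...)` call.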

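🔧 Technical details

Model architecture

We adjusted the Llama 2 architecture to a total of roughly 4 billion parameters; see our technical report for details. The main architecture hyperparameters are: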

Hyperparameter	Value
n_layers	24
n_heads	32
n_query_groups	8
n_embd	3840
vocab size	32000
sequence length	8192
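These hyperparameters also fix the attention geometry. With 32 query heads over an embedding width of 3,840, each head is 120-dimensional, and the 8 query groups (grouped-query attention) mean 8 key/value heads, which matches the 960-wide k_proj/v_proj in the module structure printed below. The sketch estimates the key/value cache for one full 8,192-token sequence, assuming a bf16 cache (2 bytes per value), batch size 1, and no cache quantization; real memory use will differ with implementation overheads.

```python
n_layers, n_heads, n_query_groups, n_embd, seq_len = 24, 32, 8, 3840, 8192
BYTES_PER_VALUE = 2  # assumption: K/V cache stored in bf16

head_dim = n_embd // n_heads                  # size of each attention head
kv_width = n_query_groups * head_dim          # output width of k_proj / v_proj
heads_per_group = n_heads // n_query_groups   # query heads sharing one K/V head


def kv_cache_bytes(num_kv_heads: int) -> int:
    # 2 tensors (K and V) per layer, one head_dim vector per token per KV head.
    return 2 * n_layers * seq_len * num_kv_heads * head_dim * BYTES_PER_VALUE


gqa = kv_cache_bytes(n_query_groups)  # grouped-query attention, as in this model
mha = kv_cache_bytes(n_heads)         # what full multi-head attention would need
print(head_dim, kv_width, heads_per_group)  # 120 960 4
print(gqa / 2**20, mha / 2**20)             # 720.0 2880.0  (MiB)
```

Sharing each key/value head across 4 query heads cuts the cache to a quarter of its multi-head size, which matters most at long context lengths and on memory-constrained devices.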

Printing the loaded model shows the following module structure:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 3840, padding_idx=0)
    (layers): ModuleList(
      (0-23): 24 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=3840, out_features=3840, bias=False)
          (k_proj): Linear(in_features=3840, out_features=960, bias=False)
          (v_proj): Linear(in_features=3840, out_features=960, bias=False)
          (o_proj): Linear(in_features=3840, out_features=3840, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=3840, out_features=10240, bias=False)
          (up_proj): Linear(in_features=3840, out_features=10240, bias=False)
          (down_proj): Linear(in_features=10240, out_features=3840, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3840,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3840,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((3840,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=3840, out_features=32000, bias=False)
)
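The shapes in this printout are enough to check the "about 4 billion parameters" figure. The sketch below sums the weight matrices and RMSNorm weights (all linear layers have bias=False, and the rotary embedding holds no trainable weights). It assumes lm_head is stored separately from embed_tokens, since the printout lists both; if the two were tied, subtract one 32000 x 3840 matrix.

```python
vocab, d_model, d_kv, d_ff, n_layers = 32000, 3840, 960, 10240, 24

embed = vocab * d_model                             # embed_tokens
attn = 2 * d_model * d_model + 2 * d_model * d_kv   # q_proj + o_proj, k_proj + v_proj
mlp = 3 * d_model * d_ff                            # gate_proj, up_proj, down_proj
norms = 2 * d_model                                 # input_ and post_attention_layernorm
per_layer = attn + mlp + norms

# embeddings + decoder layers + final norm + lm_head (assumed untied)
total = embed + n_layers * per_layer + d_model + vocab * d_model
print(f"{per_layer:,} per layer, {total:,} total")
# 154,836,480 per layer, 3,961,839,360 total
```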

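Quantization and sharding

You can load the model with quantization by passing load_in_8bit=True or load_in_4bit=True (both rely on the bitsandbytes package; newer transformers releases prefer wrapping these flags in a BitsAndBytesConfig passed as quantization_config). Setting device_map="auto" additionally shards the model across multiple GPUs when more than one is available.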
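As a rough guide to what the quantized loading options save, the sketch below estimates weight storage only, at 16, 8, and 4 bits per parameter. It takes the roughly 3.96-billion parameter count obtained by summing the weight shapes in the architecture printout, and it ignores quantization constants, layers kept in higher precision, activations, and the KV cache, so actual memory use will be somewhat higher.

```python
N_PARAMS = 3_961_839_360  # sum of the weight shapes in the architecture printout


def weight_gib(bits_per_param: float, n_params: int = N_PARAMS) -> float:
    """Weights-only size in GiB; ignores quantization overhead (rough estimate)."""
    return n_params * bits_per_param / 8 / 2**30


for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gib(bits):.2f} GiB")
```

Halving the bits per weight halves the weight footprint, which is what brings the 4-bit variant into the range of consumer GPUs and phones.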

📚 Documentation

Benchmarks

🤗 Open LLM Leaderboard v2

Benchmark	acc_n
Average	16.21
IFEval	50.21
BBH	10.94
MATH Lvl 5	2.11
GPQA	4.7
MUSR	10.2
MMLU-PRO	19.1

🤗 Open LLM Leaderboard v1

Benchmark	acc_n
Average	61.05
ARC-challenge	57.68
Hellaswag	79.05
MMLU	55.61
TruthfulQA	46.84
Winogrande	75.93
GSM8K	51.18

MT-Bench

First Turn: 7.394
Second Turn: 5.563
Average: 6.478
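The "Average" rows above are arithmetic means of the listed scores (MT-Bench averages its two turns), so they can be checked directly from the tables:

```python
from statistics import mean

# Scores copied from the tables above.
leaderboard_v2 = {"IFEval": 50.21, "BBH": 10.94, "MATH Lvl 5": 2.11,
                  "GPQA": 4.7, "MUSR": 10.2, "MMLU-PRO": 19.1}
leaderboard_v1 = {"ARC-challenge": 57.68, "Hellaswag": 79.05, "MMLU": 55.61,
                  "TruthfulQA": 46.84, "Winogrande": 75.93, "GSM8K": 51.18}
mt_bench_turns = [7.394, 5.563]

print(round(mean(leaderboard_v2.values()), 2))  # 16.21
print(round(mean(leaderboard_v1.values()), 2))  # 61.05
print(mean(mt_bench_turns))                      # about 6.4785, reported as 6.478
```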

📄 License

This project is licensed under the Apache 2.0 License.

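⚠️ Disclaimer

Please read this disclaimer carefully before using the large language model provided in this repository. By using the model, you agree to the following terms and conditions.

Biases and offensiveness: The large language model is trained on a wide range of internet text data, which may contain biased, racist, offensive, or otherwise inappropriate content. By using this model, you acknowledge and accept that the generated content may sometimes exhibit biases or be offensive or inappropriate. The developers of this repository do not endorse, support, or promote any such content or viewpoints.
Limitations: The large language model is an AI-based tool, not a human. It may produce incorrect, nonsensical, or irrelevant responses. It is the user's responsibility to critically evaluate the generated content and decide whether to use it.
Use at your own risk: Users of this large language model must assume full responsibility for any consequences that may arise from their use of the tool. The developers and contributors of this repository shall not be held liable for any damages, losses, or harm resulting from the use or misuse of the model.
Ethical considerations: Users are encouraged to use the large language model responsibly and ethically. By using this model, you agree not to use it to promote hate speech, discrimination, harassment, or any form of illegal or harmful activity.
Reporting issues: If you encounter biased, offensive, or otherwise inappropriate content generated by the large language model, please report it to the repository maintainers through the provided channels. Your feedback will help improve the model and mitigate potential issues.
Changes to this disclaimer: The developers of this repository reserve the right to modify or update this disclaimer at any time without prior notice. It is the user's responsibility to review the disclaimer periodically to stay informed of any changes.

By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions set out in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.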