Mythalion-Kimiko-v2-AWQ Open-Source Model: Efficient, Accurate, Fast Inference

Mythalion Kimiko V2 AWQ

Developed by TheBloke

Mythalion Kimiko v2 - AWQ is the AWQ-quantized version of the Mythalion Kimiko v2 model created by nRuaif, offering efficient, accurate, and fast inference.

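Large language model

Transformers

License: other #4-bit quantization #efficient inference #multi-framework compatibility

Downloads: 403

Released: 12/14/2023

Model overview

This model is the AWQ-quantized version of Mythalion Kimiko v2. It uses 4-bit quantization and is intended for efficient inference.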

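Model features

Efficient inference

Uses AWQ 4-bit quantization, which offers faster Transformers-based inference than GPTQ.

Multiple quantized versions

AWQ, GPTQ, and GGUF quantizations of the model are available for different inference scenarios.

Broad compatibility

Works with several inference tools and frameworks, including Text Generation Webui, vLLM, TGI, and Transformers.

Model capabilities

Text generation

Efficient inference

Use cases

Text generation

AI question answering

Answering questions about artificial intelligence

Story writing

Generating stories on a given topic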

🚀 Mythalion Kimiko v2 - AWQ

Mythalion Kimiko v2 - AWQ is the AWQ-quantized version of nRuaif's Mythalion Kimiko v2 model, offering efficient, accurate, and fast inference.

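🚀 Quick start

This repository contains AWQ model files for nRuaif's Mythalion Kimiko v2. The files were quantized using hardware kindly provided by Massed Compute.

About AWQ

AWQ is an efficient, accurate, and very fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

AWQ models are currently supported only on Linux and Windows with NVIDIA GPUs. macOS users should use the GGUF models instead.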
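The platform requirement above can be checked before downloading anything. The sketch below is a heuristic and not part of the original instructions: the function names are illustrative, and it assumes that an `nvidia-smi` executable on the PATH indicates an installed NVIDIA driver.

```python
import platform
import shutil


def awq_runtime_supported(system: str, has_nvidia_gpu: bool) -> bool:
    """Return True when the platform matches the stated AWQ requirements.

    Per the text above: Linux or Windows, with an NVIDIA GPU.
    """
    return system in ("Linux", "Windows") and has_nvidia_gpu


def detect_nvidia_gpu() -> bool:
    # Heuristic (assumption): an `nvidia-smi` binary on PATH suggests that an
    # NVIDIA driver is installed. It does not prove that a GPU is usable.
    return shutil.which("nvidia-smi") is not None


system = platform.system()  # "Linux", "Windows", or "Darwin" on macOS
if awq_runtime_supported(system, detect_nvidia_gpu()):
    print("This machine looks suitable for the AWQ files.")
else:
    print("AWQ requirements not met; the GGUF files may be a better fit.")
```

A stricter check could query the GPU through PyTorch once it is installed; the version here deliberately uses only the standard library.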

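AWQ is supported by the following applications:

Text Generation Webui - using Loader: AutoAWQ
vLLM - version 0.2.2 or later, for support for all model types
Hugging Face Text Generation Inference (TGI)
Transformers version 4.35.0 and later, from any code or client that supports Transformers
AutoAWQ - for use from Python code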

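✨ Key features

Multiple quantized versions: AWQ, GPTQ, and GGUF quantizations of the model are available for different inference scenarios.
Broad compatibility: works with several inference tools and frameworks, including Text Generation Webui, vLLM, TGI, and Transformers.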

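📦 Installation

Downloading and using this model in text-generation-webui

Make sure you are using the latest version of text-generation-webui. Using the text-generation-webui one-click installers is strongly recommended unless you are sure you know how to install manually.

Click the Model tab.
Under Download custom model or LoRA, enter TheBloke/Mythalion-Kimiko-v2-AWQ.
Click Download.
The model will start downloading. Once it has finished, it will say "Done".
In the top left, click the refresh icon next to Model.
In the Model dropdown, choose the model you just downloaded: Mythalion-Kimiko-v2-AWQ.
Select Loader: AutoAWQ.
Click Load. The model will load and is then ready for use.
If you want any custom settings, set them, then click Save settings for this model followed by Reload the Model in the top right.
Once you're ready, click the Text Generation tab and enter a prompt to get started!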
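For scripted setups, the same files can also be fetched without the web UI. This is a minimal sketch that assumes the `huggingface_hub` package is installed; the function name and the local directory are placeholders chosen for illustration, not paths the web UI requires.

```python
REPO_ID = "TheBloke/Mythalion-Kimiko-v2-AWQ"


def download_awq_model(local_dir: str = "models/Mythalion-Kimiko-v2-AWQ") -> str:
    """Download every file from the main branch and return the local folder path."""
    # Imported lazily so the function can be defined without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        revision="main",      # the only branch listed for this repository
        local_dir=local_dir,  # hypothetical target directory
    )


# Usage (downloads several GB):
# path = download_awq_model()
# print("Model files are in", path)
```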

Inference from Python code using Transformers

Install the necessary packages

Requires: Transformers 4.35.0 or later.
Requires: AutoAWQ 0.1.6 or later.

pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"

Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and want to stay on PyTorch 2.0.1, run this command instead:

pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

If you have problems installing AutoAWQ using the pre-built wheels, install it from source instead:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
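Whichever installation route is used, it can help to confirm that the installed packages meet the stated minimums (Transformers 4.35.0, AutoAWQ 0.1.6) before loading the model. The sketch below is illustrative: the helper names are made up for this example, and the version parser is deliberately simplified (it compares only the first three numeric components and treats pre-releases like final releases).

```python
from importlib import metadata

# Minimum versions taken from the requirements listed in this section.
MINIMUM_VERSIONS = {"transformers": "4.35.0", "autoawq": "0.1.6"}


def version_tuple(version: str) -> tuple:
    """Turn '0.1.6+cu118' into (0, 1, 6); non-numeric suffixes are ignored."""
    public = version.split("+")[0]  # drop a local tag such as '+cu118'
    parts = []
    for piece in public.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    return version_tuple(installed) >= version_tuple(minimum)


def check_requirements() -> dict:
    """Map each package to True/False, or None when it is not installed."""
    results = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


print(check_requirements())
```

For stricter comparisons, including proper pre-release ordering, the `packaging` library's version parsing is a more robust choice than this hand-written parser.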

💻 Usage examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "TheBloke/Mythalion-Kimiko-v2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    low_cpu_mem_usage=True,
    device_map="cuda:0"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
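As the comment in the listing notes, the non-streamed `generate` call returns the prompt tokens followed by the new tokens. A minimal sketch for keeping only the completion, assuming the usual causal-LM layout in which the prompt ids come first; the helper name is illustrative, and the demo uses a plain list as a stand-in so it runs without a model.

```python
def completion_only(output_ids, prompt_length: int):
    """Drop the first `prompt_length` ids, leaving only newly generated ones.

    Works on any sliceable sequence: a Python list or a 1-D tensor.
    """
    return output_ids[prompt_length:]


# Stand-in data: 3 prompt ids followed by 2 generated ids.
fake_output = [101, 102, 103, 7, 8]
print(completion_only(fake_output, prompt_length=3))  # [7, 8]

# With the objects from the listing above, the equivalent call would be:
# new_ids = completion_only(generation_output[0], tokens.shape[-1])
# print(tokenizer.decode(new_ids, skip_special_tokens=True))
```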

Advanced usage

Multi-user inference serving with vLLM

from vllm import LLM, SamplingParams

prompts = [
    "Tell me about AI",
    "Write a story about llamas",
    "What is 291 - 150?",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]
# Plain (non-f) string: {prompt} is a placeholder filled in by .format() below
prompt_template = '''{prompt}
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/Mythalion-Kimiko-v2-AWQ", quantization="awq", dtype="auto")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
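The vLLM listing sets only `temperature` and `top_p`, whereas the Transformers example uses a fuller set of generation settings. vLLM's `max_tokens` default is short (16 in releases from this period), so completions may be cut off unless it is set. The sketch below maps the Transformers-style dictionary to vLLM-style keyword names. It is an assumption-laden helper, not part of vLLM: the function name is made up, and it assumes the installed vLLM accepts `top_k`, `max_tokens`, and `repetition_penalty` on `SamplingParams`.

```python
def to_vllm_sampling_kwargs(generation_params: dict) -> dict:
    """Rename Transformers-style generation settings to vLLM-style names."""
    kwargs = dict(generation_params)  # copy, so the caller's dict is untouched

    # vLLM has no do_sample switch; temperature 0 selects greedy decoding.
    do_sample = kwargs.pop("do_sample", True)
    if not do_sample:
        kwargs["temperature"] = 0.0
        # Some vLLM versions reject top_p/top_k together with greedy decoding.
        kwargs.pop("top_p", None)
        kwargs.pop("top_k", None)

    # Transformers counts new tokens as max_new_tokens; vLLM calls it max_tokens.
    if "max_new_tokens" in kwargs:
        kwargs["max_tokens"] = kwargs.pop("max_new_tokens")
    return kwargs


generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1,
}
print(to_vllm_sampling_kwargs(generation_params))

# Usage with vLLM installed:
# from vllm import SamplingParams
# sampling_params = SamplingParams(**to_vllm_sampling_kwargs(generation_params))
```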

Multi-user inference serving with Hugging Face Text Generation Inference (TGI)

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''{prompt}
'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print("Model output: ", response)
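For interactive clients, the same TGI endpoint can return text incrementally instead of as one response. The sketch below assumes `huggingface_hub` is installed and relies on `InferenceClient.text_generation` accepting `stream=True`; the function name is illustrative and the endpoint URL is the same placeholder used above.

```python
from typing import Iterator

ENDPOINT_URL = "https://your-endpoint-url-here"  # placeholder, replace before use


def stream_completion(prompt: str, endpoint_url: str = ENDPOINT_URL,
                      max_new_tokens: int = 128) -> Iterator[str]:
    """Yield generated text pieces one at a time as the server produces them."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from huggingface_hub import InferenceClient

    client = InferenceClient(endpoint_url)
    # With stream=True (and details left at its default of False),
    # text_generation returns an iterator of strings.
    for piece in client.text_generation(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.95,
        top_k=40,
        repetition_penalty=1.1,
        stream=True,
    ):
        yield piece


# Usage, once a real endpoint URL is configured:
# for piece in stream_completion("Tell me about AI\n"):
#     print(piece, end="", flush=True)
```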

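📚 Detailed documentation

Repositories available

Prompt template

{prompt}

Provided files and AWQ parameters

Only 128g GEMM models are released at present. Adding group size 32 models and GEMV kernel models is being actively considered.

Models are released as sharded safetensors files.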

Branch	Bits	Group size	AWQ dataset	Sequence length	Size
main	4	128	VMware Open Instruct	4096	7.25 GB
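The group size in the table affects storage overhead, which is one reason a group size 32 variant is a trade-off rather than a free improvement. The sketch below estimates effective bits per weight for group-wise quantization. It rests on an assumption not stated in the source: each group stores one 16-bit scale and one 4-bit zero point. Actual AWQ file layouts may differ, and this is not a derivation of the 7.25 GB figure.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Weight bits plus per-group metadata, averaged over the group.

    scale_bits and zero_bits are assumed values for illustration.
    """
    return bits + (scale_bits + zero_bits) / group_size


for group_size in (128, 32):
    print(f"group size {group_size}: "
          f"{effective_bits_per_weight(4, group_size)} bits per weight")
```

Under these assumptions, group size 128 costs about 4.16 bits per weight and group size 32 about 4.63, so smaller groups buy finer-grained scaling at the price of a larger file.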

Compatibility

The provided files have been tested to work with:

text-generation-webui, using Loader: AutoAWQ.
vLLM version 0.2.0 and later.
Hugging Face Text Generation Inference (TGI) version 1.1.0 and later.
Transformers version 4.35.0 and later.
AutoAWQ version 0.1.1 and later.

📄 License

This project is released under a license listed as "other".

🔗 Related links

Model creator: nRuaif
Original model: Mythalion Kimiko v2
Discord server: TheBloke AI's Discord server
Patreon page: https://patreon.com/TheBlokeAI
Ko-Fi page: https://ko-fi.com/TheBlokeAI

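🙏 Thanks and how to contribute

Thanks to the chirper.ai team! Thanks to Clay from gpus.llm-utils.org!

Many people have asked whether they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine-tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Aemon Algiz.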

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

Thank you to all my generous patrons and donors! And thank you again to a16z for their generous grant.