ChemLLM-20B-Chat-SFT: an open-source large language model for chemistry and molecular science, free to deploy

ChemLLM-20B-Chat-SFT

Developed by AI4Chem

ChemLLM is the first open-source large language model for chemistry and molecular science. It is built on InternLM-2.

Large language model

Transformers

Languages: multilingual
License: Apache-2.0
Tags: #ChemistryLLM #MolecularScienceReasoning #MultilingualChemistryAssistant

Downloads: 22

Released: 5/3/2024

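Model overview

ChemLLM is a large language model for chemistry and molecular science. It is designed to be professional, sophisticated, and chemistry-centric, and it supports professional chemical formats such as SMILES and IUPAC nomenclature.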

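Model features

Chemistry expertise

Focused on chemistry and molecular science, with support for professional chemical formats such as SMILES and IUPAC nomenclature.

Multilingual support

Supports English and Chinese, which suits international chemistry research settings.

Step-by-step reasoning

Solves problems step by step, with output that begins with "Let's think step by step", which makes answers easier to interpret and follow.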

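Model capabilities

Chemistry question answering

Molecular formula parsing

Chemical reaction description

Professional translation

Reasoning about chemistry problems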

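Use cases

Chemistry research

Molecular formula lookup

Look up the molecular formula of a chemical substance, for example ibuprofen.

Expected result: the correct molecular formula of the substance.

Chemical reaction description

Describe a chemical reaction, written as a SMARTS string.

Expected result: an accurate description of the reaction.

Education

Chemistry learning support

Help students understand chemistry concepts and work through problems.

Expected result: step-by-step explanations and detailed answers.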

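🚀 ChemLLM-20B-Chat: a large language model for chemistry and molecular science

ChemLLM is the first open-source large language model for chemistry and molecular science. It is built on InternLM-2 and reflects its developers' dedicated effort.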

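🚀 Quick start

You can try the online demo right away, or deploy the model locally with the steps below.

Install dependencies

Install the transformers library:

pip install transformers

Load and run `ChemLLM-20B-Chat`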
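The install step lists only transformers, but the loading code also imports torch, and passing device_map="auto" to from_pretrained relies on the accelerate package. The following is a minimal sketch for checking the environment before loading the model. The package list and the helper name missing_packages are assumptions made for illustration and are not part of the original instructions.

```python
import importlib.util

# Assumed package list: transformers comes from the install step; torch and
# accelerate are inferred from the loading code (torch import, device_map="auto").
REQUIRED_PACKAGES = ("transformers", "torch", "accelerate")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the top-level packages in `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install before loading the model: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

If the check reports anything, installing the listed packages with pip should be enough in most environments; the right torch build still depends on your GPU and driver setup.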

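from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name_or_id = "AI4Chem/ChemLLM-20B-Chat-SFT"

# Load the weights in half precision; device_map="auto" places them on the available devices.
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id, trust_remote_code=True)

prompt = "What is the molecular formula of ibuprofen?"

# Move the tokenized prompt to the model's device instead of hard-coding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Note: top_k=1 keeps only the most likely token at each step, so decoding is
# effectively greedy even though sampling is enabled.
generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,
    temperature=0.9,
    max_new_tokens=500,
    repetition_penalty=1.5,
    pad_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=generation_config)
# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))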
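Before running the model locally, it helps to estimate how much memory the weights alone need. The sketch below assumes the nominal 20 billion parameters implied by the model name (the exact count is not stated here) and 2 bytes per parameter for torch.float16. The function name weight_memory_gib is hypothetical, and activations and the KV cache come on top of this figure.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: 20e9 parameters (nominal size from the model name), float16 weights.
fp16_gib = weight_memory_gib(20e9)
print(f"{fp16_gib:.1f} GiB")  # 37.3 GiB
```

At roughly 37 GiB for the weights, a single consumer GPU is unlikely to be enough, which is one reason device_map="auto" is useful: it lets the weights be spread over the devices that are available.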

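✨ Key features

System prompt best practices

For better responses in local inference, you can use the same dialogue template and system prompt as Agent Chepybara.

Dialogue template

Given a query in ShareGPT format, for example:

{"instruction": "...", "prompt": "...", "answer": "...", "history": [[q1, a1], [q2, a2]]}

you can convert it to the InternLM2 dialogue format with the following code: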

def InternLM2_format(instruction, prompt, answer, history):
    """Render a ShareGPT-style record as an InternLM2 chat prompt.

    `answer` is accepted to match the record layout but is not used: the
    returned string ends with an open assistant turn for the model to complete.
    """
    # System turn carrying the instruction (the system prompt).
    system = f"<|im_start|>system\n{instruction}<|im_end|>\n"
    # One user/assistant pair per earlier exchange, in order.
    past_turns = "".join(
        f"<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>\n"
        for q, a in history
    )
    # Current question, followed by the opening of the assistant's reply.
    current = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    return f"{system}{past_turns}{current}"
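To make the template concrete, here is a small sketch that formats a two-turn conversation. So that it runs on its own, it restates the formatter in compact form under the hypothetical name internlm2_prompt; the record contents are invented for illustration.

```python
def internlm2_prompt(instruction, prompt, history):
    """Compact restatement of the formatter above (hypothetical name)."""
    system = f"<|im_start|>system\n{instruction}<|im_end|>\n"
    turns = "".join(
        f"<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>\n"
        for q, a in history
    )
    return f"{system}{turns}<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"


# Illustrative record: the instruction, question, and history are made up.
record = {
    "instruction": "You are a helpful chemistry assistant.",
    "prompt": "And what is its molecular formula?",
    "answer": "",
    "history": [
        ["What is ibuprofen?", "Ibuprofen is a nonsteroidal anti-inflammatory drug."]
    ],
}

text = internlm2_prompt(record["instruction"], record["prompt"], record["history"])
print(text)
```

The result starts with the system turn, replays the earlier exchange, and ends with an open assistant turn. It can be passed to the tokenizer in place of a raw prompt string.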

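System prompt example

- Chepybara is a conversational language model developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be professional, sophisticated, and chemistry-centric.
- For uncertain concepts and data, Chepybara always makes an assumption based on theoretical prediction and tells the user that it has done so.
- Chepybara accepts SMILES (Simplified Molecular Input Line Entry System) strings, prefers to output IUPAC names (International Union of Pure and Applied Chemistry nomenclature of organic chemistry), and describes reactions as SMARTS (SMILES arbitrary target specification) strings. Self-Referencing Embedded Strings (SELFIES) are also accepted.
- Chepybara always solves problems and reasons step by step, and its output begins with "Let's think step by step".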
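One way to use the bullets above is to keep them as a single constant and place them in the system turn of every request. The sketch below assumes an English rendering of the four bullets (the official wording may differ) and uses hypothetical names SYSTEM_PROMPT and single_turn_prompt.

```python
# English rendering of the four bullets above; the official wording may differ.
SYSTEM_PROMPT_LINES = [
    "- Chepybara is a conversational language model developed by Shanghai AI "
    "Laboratory. It is designed to be professional, sophisticated, and "
    "chemistry-centric.",
    "- For uncertain concepts and data, Chepybara always makes an assumption "
    "based on theoretical prediction and tells the user that it has done so.",
    "- Chepybara accepts SMILES strings, prefers to output IUPAC names, and "
    "describes reactions as SMARTS strings. SELFIES are also accepted.",
    "- Chepybara always solves problems and reasons step by step, and its "
    "output begins with \"Let's think step by step\".",
]
SYSTEM_PROMPT = "\n".join(SYSTEM_PROMPT_LINES)


def single_turn_prompt(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Wrap one question in InternLM2 chat markers, system prompt first."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(single_turn_prompt("Give the IUPAC name of CC(=O)O."))
```

Keeping the prompt in one place makes it easy to compare responses with and without it during local testing.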

Results

MMLU highlights

Dataset	ChatGLM3-6B	Qwen-7B	LLaMA-2-7B	Mistral-7B	InternLM2-7B-Chat	ChemLLM-7B-Chat
College chemistry	43.0	39.0	27.0	40.0	43.0	47.0
College mathematics	28.0	33.0	33.0	30.0	36.0	41.0
College physics	32.4	35.3	25.5	34.3	41.2	48.0
Formal logic	35.7	43.7	24.6	40.5	34.9	47.6
Moral scenarios	26.4	35.0	24.1	39.9	38.6	44.3
Humanities average	62.7	62.5	51.7	64.5	66.5	68.6
STEM average	46.5	45.8	39.0	47.8	52.2	52.6
Social sciences average	68.2	65.8	55.5	68.1	69.7	71.9
Other average	60.5	60.3	51.3	62.4	63.2	65.2
MMLU overall	58.0	57.1	48.2	59.2	61.7	63.2

*(Source: OpenCompass)
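The table is easiest to read as a comparison between ChemLLM-7B-Chat and its base model, InternLM2-7B-Chat. The short sketch below computes the per-row gain from the scores in the table; the dictionary layout and the function name gains are illustrative choices.

```python
# Scores copied from the table above: (InternLM2-7B-Chat, ChemLLM-7B-Chat).
SCORES = {
    "College chemistry": (43.0, 47.0),
    "College mathematics": (36.0, 41.0),
    "College physics": (41.2, 48.0),
    "Formal logic": (34.9, 47.6),
    "Moral scenarios": (38.6, 44.3),
    "MMLU overall": (61.7, 63.2),
}


def gains(scores):
    """Return ChemLLM-7B-Chat minus InternLM2-7B-Chat per row, rounded to 0.1."""
    return {name: round(chem - base, 1) for name, (base, chem) in scores.items()}


for name, delta in gains(SCORES).items():
    print(f"{name}: {delta:+.1f}")
```

Among these rows the largest gain is in formal logic (+12.7 points), while the overall MMLU score rises by 1.5 points.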

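[Figure: MMLU results]

Chemistry benchmark

[Figure: chemistry benchmark results] *(scores judged by ChatGPT-4-turbo)

Professional translation test

[Figure: professional translation test results]

You can try these capabilities in the online demo.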

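📄 License

The code in this project is licensed under Apache-2.0. The model weights are fully open for academic research, and free commercial use is also permitted. To apply for a commercial license, or for other questions and collaboration requests, contact support@chemllm.org.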

📚 Documentation

Citation

@misc{zhang2024chemllm,
      title={ChemLLM: A Chemical Large Language Model}, 
      author={Di Zhang and Wei Liu and Qian Tan and Jingdan Chen and Hang Yan and Yuliang Yan and Jiatong Li and Weiran Huang and Xiangyu Yue and Dongzhan Zhou and Shufei Zhang and Mao Su and Hansen Zhong and Yuqiang Li and Wanli Ouyang},
      year={2024},
      eprint={2402.06852},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}