🚀 ChemLLM-7B-Chat: LLM for Chemistry and Molecule Science
ChemLLM-7B-Chat is the first open-source large language model for chemistry and molecular science, built on InternLM-2.
⚠️ Important Note
We recommend using the newer version of ChemLLM! Check out AI4Chem/ChemLLM-7B-Chat-1.5-DPO or AI4Chem/ChemLLM-7B-Chat-1.5-SFT.

📦 Installation
Install `transformers` with the following command:

```bash
pip install transformers
```
💻 Usage Examples
Basic Usage
You can try the online demo instantly, or load ChemLLM-7B-Chat locally and run the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name_or_id = "AI4Chem/ChemLLM-7B-Chat"

# Load the model in fp16 and let accelerate place it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id, trust_remote_code=True)

prompt = "What is Molecule of Ibuprofen?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,                  # with top_k=1, sampling is effectively greedy
    temperature=0.9,
    max_new_tokens=500,
    repetition_penalty=1.5,
    pad_token_id=tokenizer.eos_token_id,
)

outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
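If the released tokenizer ships a chat template (an assumption worth checking in its `tokenizer_config.json`), you can also build the prompt with the generic `apply_chat_template` helper from `transformers` instead of formatting strings by hand:

```python
# A minimal sketch, assuming the tokenizer defines a chat template.
messages = [
    {"role": "system", "content": "You are a helpful chemistry assistant."},
    {"role": "user", "content": "What is Molecule of Ibuprofen?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```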
Advanced Usage
You can reuse the Dialogue Templates and System Prompt from Agent Chepybara to get better responses in local inference.
Dialogue Templates
For queries in ShareGPT format like:
```python
{"instruction": "...", "prompt": "...", "answer": "...", "history": [[q1, a1], [q2, a2]]}
```
you can convert it to the InternLM2 dialogue format as follows:
```python
def InternLM2_format(instruction, prompt, answer, history):
    # Build the InternLM2 chat markup: a system block, then alternating
    # user/assistant blocks, ending with an open assistant turn.
    # `answer` is unused when building a generation prompt.
    prefix_template = [
        "<|im_start|>system\n",
        "{}",
        "<|im_end|>\n",
    ]
    prompt_template = [
        "<|im_start|>user\n",
        "{}",
        "<|im_end|>\n",
        "<|im_start|>assistant\n",
        "{}",
        "<|im_end|>\n",
    ]
    system = f"{prefix_template[0]}{prefix_template[1].format(instruction)}{prefix_template[2]}"
    history = "".join(
        f"{prompt_template[0]}{prompt_template[1].format(qa[0])}{prompt_template[2]}"
        f"{prompt_template[3]}{prompt_template[4].format(qa[1])}{prompt_template[5]}"
        for qa in history
    )
    prompt = f"{prompt_template[0]}{prompt_template[1].format(prompt)}{prompt_template[2]}{prompt_template[3]}"
    return f"{system}{history}{prompt}"
```
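For example, a ShareGPT-style record with one turn of history can be rendered and fed to the model like this (the field values are illustrative, not from the original card):

```python
record = {
    "instruction": "You are a professional chemistry assistant.",
    "prompt": "What is the SMILES of aspirin?",
    "answer": "",
    "history": [["What is the SMILES of ethanol?", "CCO"]],
}
text = InternLM2_format(record["instruction"], record["prompt"], record["answer"], record["history"])
inputs = tokenizer(text, return_tensors="pt").to(model.device)
```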
System Prompt Example
- Chepybara is a conversational language model developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be professional, sophisticated, and chemistry-centric.
- For uncertain notions and data, Chepybara always treats them as theoretical predictions and notifies users accordingly.
- Chepybara accepts SMILES (Simplified Molecular Input Line Entry System) strings, prefers to output IUPAC names (International Union of Pure and Applied Chemistry nomenclature of organic chemistry), and depicts reactions as SMARTS (SMILES arbitrary target specification) strings. Self-Referencing Embedded Strings (SELFIES) are also accepted.
- Chepybara always solves problems and thinks step by step, beginning its output with "Let's think step by step."
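A minimal sketch of plugging this system prompt into the formatter above (the condensed prompt text and the query are illustrative):

```python
# Condensed version of the Chepybara system prompt shown above.
CHEPYBARA_SYSTEM = (
    "Chepybara is a conversational language model developed by Shanghai AI Laboratory. "
    "It is designed to be professional, sophisticated, and chemistry-centric. "
    "Chepybara always solves problems and thinks step by step, "
    'beginning its output with "Let\'s think step by step."'
)
query = "Convert this SMILES to an IUPAC name: CC(=O)Oc1ccccc1C(=O)O"  # aspirin
text = InternLM2_format(CHEPYBARA_SYSTEM, query, "", [])
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```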
📚 Documentation
Results

| Property | Details |
|----------|---------|
| Model Type | ChemLLM-7B-Chat |
| Training Data | Not provided |

MMLU Highlights
| dataset | ChatGLM3-6B | Qwen-7B | LLaMA-2-7B | Mistral-7B | InternLM2-7B-Chat | ChemLLM-7B-Chat |
|---|---|---|---|---|---|---|
| college chemistry | 43.0 | 39.0 | 27.0 | 40.0 | 43.0 | 47.0 |
| college mathematics | 28.0 | 33.0 | 33.0 | 30.0 | 36.0 | 41.0 |
| college physics | 32.4 | 35.3 | 25.5 | 34.3 | 41.2 | 48.0 |
| formal logic | 35.7 | 43.7 | 24.6 | 40.5 | 34.9 | 47.6 |
| moral scenarios | 26.4 | 35.0 | 24.1 | 39.9 | 38.6 | 44.3 |
| humanities average | 62.7 | 62.5 | 51.7 | 64.5 | 66.5 | 68.6 |
| stem average | 46.5 | 45.8 | 39.0 | 47.8 | 52.2 | 52.6 |
| social science average | 68.2 | 65.8 | 55.5 | 68.1 | 69.7 | 71.9 |
| other average | 60.5 | 60.3 | 51.3 | 62.4 | 63.2 | 65.2 |
| mmlu | 58.0 | 57.1 | 48.2 | 59.2 | 61.7 | 63.2 |
*(Evaluated with OpenCompass)

Chemical Benchmark
*(Scores judged by GPT-4-turbo)
Professional Translation

You can try it online.
Cite this work
```bibtex
@misc{zhang2024chemllm,
      title={ChemLLM: A Chemical Large Language Model},
      author={Di Zhang and Wei Liu and Qian Tan and Jingdan Chen and Hang Yan and Yuliang Yan and Jiatong Li and Weiran Huang and Xiangyu Yue and Dongzhan Zhou and Shufei Zhang and Mao Su and Hansen Zhong and Yuqiang Li and Wanli Ouyang},
      year={2024},
      eprint={2402.06852},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
```
Disclaimer
⚠️ Important Note
LLMs may generate incorrect answers. Please proofread model outputs and use them at your own risk.
Open Source License
The code is licensed under Apache-2.0, while the model weights are fully open for academic research and also allow free commercial use. To apply for a commercial license, or for other questions and collaborations, please contact support@chemllm.org.
Demo
Agent Chepybara

Contact
[AI4Physics Science, Shanghai AI Lab](mailto:support@chemllm.org)