ChemLLM-7B-Chat open-source large language model: free answers to chemistry and molecular science questions in English and Chinese

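ChemLLM-7B-Chat

Developed by AI4Chem

ChemLLM-7B-Chat is the first open-source large language model for chemistry and molecular science. It is built on the InternLM-2 architecture and supports Chinese and English.

Large language model

Transformers

Multilingual

Open-source license: Apache-2.0

#Chemistry Q&A #Molecular science reasoning #SMILES parsing

Downloads: 775

Release date: 1/15/2024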

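Model Overview

This model specializes in chemistry and molecular science. It handles chemistry-related text generation, Q&A, and translation tasks, and is particularly strong at processing chemical terminology and molecular structures.

Model Features

Chemistry specialization

Optimized for chemistry and molecular science, so it can process complex chemical terminology and molecular structures.

Multilingual support

Supports Chinese and English, and is particularly suited to translating and understanding chemistry literature.

Open source, commercial use permitted

Released under the Apache-2.0 license, which permits both academic research and commercial use.

Step-by-step reasoning

Works through problems step by step, which allows more structured explanations.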

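Model Capabilities

Chemistry Q&A

Molecular formula parsing

Chemistry literature translation

Chemical reaction description

Chemical knowledge reasoning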

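Use Cases

Chemistry education

Explaining chemistry concepts

Helps students understand complex chemistry concepts and reaction mechanisms

Provides clear step-by-step explanations

Research support

Literature translation

Translates specialized chemistry literature between Chinese and English

Preserves the accuracy of technical terms

Drug development

Molecular property analysis

Analyzes the structure and properties of drug molecules

Provides molecular formula and structural information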

🚀 ChemLLM-7B-Chat: A Large Language Model for Chemistry and Molecular Science

ChemLLM-7B-Chat is the first open-source large language model for chemistry and molecular science. It is built on InternLM-2.

🚀 Quick Start

⚠️ Important Note

We recommend using a newer version of ChemLLM: AI4Chem/ChemLLM-7B-Chat-1.5-DPO or AI4Chem/ChemLLM-7B-Chat-1.5-SFT.

✨ Features

News

ChemLLM-1.5 has been released! Two versions are available: AI4Chem/ChemLLM-7B-Chat-1.5-DPO and AI4Chem/ChemLLM-7B-Chat-1.5-SFT. [2024-4-2]
ChemLLM-1.5 has been updated! Try it on the demo site or through the API reference. [2024-3-23]
ChemLLM was featured on HuggingFace's "Daily Papers" page. [2024-2-13]
The ChemLLM arXiv preprint has been released: ChemLLM: A Chemical Large Language Model. [2024-2-10]
News report from Shanghai AI Laboratory. [2024-1-26]
ChemLLM-7B-Chat ver 1.0 was released. https://chemllm.org/ [2024-1-18]
ChemLLM-7B-Chat ver 1.0 was open-sourced. [2024-1-17]
The online demo of Chepybara ver 0.2 went live. https://chemllm.org/ [2023-12-9]

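📦 Installation

Install transformers. The usage example also imports torch, and loading with device_map="auto" requires the accelerate package.

pip install transformers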

💻 Usage Examples

Basic Usage

You can try the online demo right away, or run the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_name_or_id = "AI4Chem/ChemLLM-7B-Chat"

# trust_remote_code=True lets transformers run the model code shipped in the repository.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id, trust_remote_code=True)

prompt = "What is Molecule of Ibuprofen?"

# Put the inputs on the device that device_map="auto" chose for the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    top_k=1,  # keeps only the most likely token, so sampling is effectively greedy
    temperature=0.9,
    max_new_tokens=500,
    repetition_penalty=1.5,
    pad_token_id=tokenizer.eos_token_id,
)

outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

System Prompt Best Practices

For better responses in local inference, you can use the same dialogue template and system prompt as Agent Chepybara.

Dialogue Template

Given a query in ShareGPT format,

{'instruction': "...", "prompt": "...", "answer": "...", "history": [[q1, a1], [q2, a2]]}

you can convert it to the InternLM2 dialogue format.

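def InternLM2_format(instruction, prompt, answer, history):
    # Build an InternLM2 chat prompt that ends with an open assistant turn.
    # `answer` is not used here; it is kept so the signature mirrors the record fields.
    prefix_template = [
        "<|im_start|>system\n",
        "{}",
        "<|im_end|>\n"
    ]
    prompt_template = [
        "<|im_start|>user\n",
        "{}",
        "<|im_end|>\n",  # this comma is required; without it the list has 5 items and index 5 below fails
        "<|im_start|>assistant\n",
        "{}",
        "<|im_end|>\n"
    ]
    # System block holding the instruction.
    system = f'{prefix_template[0]}{prefix_template[1].format(instruction)}{prefix_template[2]}'
    # One complete user/assistant exchange per [question, answer] pair in history.
    history = "".join([f'{prompt_template[0]}{prompt_template[1].format(qa[0])}{prompt_template[2]}{prompt_template[3]}{prompt_template[4].format(qa[1])}{prompt_template[5]}' for qa in history])
    # Current user turn, followed by the assistant header for the model to continue.
    prompt = f'{prompt_template[0]}{prompt_template[1].format(prompt)}{prompt_template[2]}{prompt_template[3]}'
    return f"{system}{history}{prompt}"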
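To make the resulting layout concrete, here is a minimal sketch that renders one record with a single history turn. The helper name `build_internlm2_prompt` and the sample record contents are hypothetical and not part of ChemLLM; the helper only produces the same `<|im_start|>` / `<|im_end|>` layout that the conversion function above is meant to emit.

```python
def build_internlm2_prompt(record: dict) -> str:
    """Render a record as an InternLM2-style prompt ending with an open assistant turn."""

    def turn(role: str, text: str) -> str:
        # One closed turn: <|im_start|>role\ntext<|im_end|>\n
        return f"<|im_start|>{role}\n{text}<|im_end|>\n"

    parts = [turn("system", record["instruction"])]
    for question, reply in record.get("history", []):
        parts.append(turn("user", question))
        parts.append(turn("assistant", reply))
    parts.append(turn("user", record["prompt"]))
    # Leave the assistant turn open so the model writes the answer.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical record in the instruction/prompt/answer/history layout shown above.
record = {
    "instruction": "You are a helpful chemistry assistant.",
    "prompt": "What is its molecular formula?",
    "answer": "",
    "history": [
        ["What is ibuprofen?", "Ibuprofen is a nonsteroidal anti-inflammatory drug."]
    ],
}

prompt_text = build_internlm2_prompt(record)
print(prompt_text)
```

The printed string can be passed to the tokenizer in place of the bare question used in the basic usage example.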

The following is a good example of a system prompt.

- Chepybara is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be Professional, Sophisticated, and Chemical-centric. 
- For uncertain notions and data, Chepybara always assumes it with theoretical prediction and notices users then.
- Chepybara can accept SMILES (Simplified Molecular Input Line Entry System) string, and prefer output IUPAC names (International Union of Pure and Applied Chemistry nomenclature of organic chemistry), depict reactions in SMARTS (SMILES arbitrary target specification) string. Self-Referencing Embedded Strings (SELFIES) are also accepted.
- Chepybara always solves problems and thinks in step-by-step fashion, Output begin with *Let's think step by step*.
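One way to use these guidelines locally is to join them into a single string and place it in the `instruction` field of a record. The sketch below assumes the "- " bullet prefix and newline separators should be kept, which the source does not specify, and the sample question is hypothetical.

```python
# The four guideline lines from the example above, copied verbatim.
guidelines = [
    "Chepybara is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be Professional, Sophisticated, and Chemical-centric.",
    "For uncertain notions and data, Chepybara always assumes it with theoretical prediction and notices users then.",
    "Chepybara can accept SMILES (Simplified Molecular Input Line Entry System) string, and prefer output IUPAC names (International Union of Pure and Applied Chemistry nomenclature of organic chemistry), depict reactions in SMARTS (SMILES arbitrary target specification) string. Self-Referencing Embedded Strings (SELFIES) are also accepted.",
    "Chepybara always solves problems and thinks in step-by-step fashion, Output begin with *Let's think step by step*.",
]

# Assumption: keep the "- " bullet prefix and separate the lines with newlines.
system_prompt = "\n".join(f"- {line}" for line in guidelines)

# Record in the instruction/prompt/answer/history layout used by the dialogue template.
record = {
    "instruction": system_prompt,
    "prompt": "What is the IUPAC name of CCO?",  # hypothetical question; CCO is the SMILES for ethanol
    "answer": "",
    "history": [],
}

print(record["instruction"])
```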

📚 Documentation

Results

MMLU Highlights

Dataset	ChatGLM3-6B	Qwen-7B	LLaMA-2-7B	Mistral-7B	InternLM2-7B-Chat	ChemLLM-7B-Chat
College Chemistry	43.0	39.0	27.0	40.0	43.0	47.0
College Mathematics	28.0	33.0	33.0	30.0	36.0	41.0
College Physics	32.4	35.3	25.5	34.3	41.2	48.0
Formal Logic	35.7	43.7	24.6	40.5	34.9	47.6
Moral Scenarios	26.4	35.0	24.1	39.9	38.6	44.3
Humanities average	62.7	62.5	51.7	64.5	66.5	68.6
STEM average	46.5	45.8	39.0	47.8	52.2	52.6
Social Science average	68.2	65.8	55.5	68.1	69.7	71.9
Other average	60.5	60.3	51.3	62.4	63.2	65.2
MMLU	58.0	57.1	48.2	59.2	61.7	63.2
*(OpenCompass)
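Since ChemLLM-7B-Chat is built on InternLM-2, the most direct comparison in this table is against the InternLM2-7B-Chat column. The short sketch below copies those two columns from the table and computes the per-row difference in points; the English row labels and the variable names are illustrative choices, not part of the original results.

```python
# Scores copied from the MMLU table: (InternLM2-7B-Chat, ChemLLM-7B-Chat).
scores = {
    "College Chemistry": (43.0, 47.0),
    "College Mathematics": (36.0, 41.0),
    "College Physics": (41.2, 48.0),
    "Formal Logic": (34.9, 47.6),
    "Moral Scenarios": (38.6, 44.3),
    "Humanities average": (66.5, 68.6),
    "STEM average": (52.2, 52.6),
    "Social Science average": (69.7, 71.9),
    "Other average": (63.2, 65.2),
    "MMLU": (61.7, 63.2),
}

# Difference in points, rounded to one decimal to hide floating-point noise.
gains = {task: round(chem - base, 1) for task, (base, chem) in scores.items()}

# Print rows from largest to smallest gain.
for task, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{task:<24} {gain:+.1f}")

largest = max(gains, key=gains.get)
```

Every row shows a positive difference in this table, with Formal Logic the largest and the STEM average the smallest.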

Chemistry Benchmark

[Figure: chemistry benchmark results] *(evaluated by ChatGPT-4-turbo)

Professional Translation

[Figure: professional translation]

You can try it online.

Cite This Work

@misc{zhang2024chemllm,
      title={ChemLLM: A Chemical Large Language Model}, 
      author={Di Zhang and Wei Liu and Qian Tan and Jingdan Chen and Hang Yan and Yuliang Yan and Jiatong Li and Weiran Huang and Xiangyu Yue and Dongzhan Zhou and Shufei Zhang and Mao Su and Hansen Zhong and Yuqiang Li and Wanli Ouyang},
      year={2024},
      eprint={2402.06852},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}