ChemVLM-8B Open-Source Chemical Multimodal Large Model - Free Processing of Textual and Visual Chemical Information

ChemVLM-8B

Developed by AI4Chem

ChemVLM-8B is an 8-billion-parameter multimodal large language model specifically designed for the chemistry domain, capable of processing both text and visual chemical information.

Image-to-Text

Transformers

Open Source License: Apache-2.0

#Chemical Multimodal #Molecular Structure Understanding #Bilingual Chemical Reasoning

Downloads 117

Release Time: 10/28/2024

Model Overview

ChemVLM-8B is a multimodal large language model for the chemistry domain, capable of integrating text and visual information to handle tasks such as molecular structures, chemical reactions, and chemistry exam questions.

Model Features

Multimodal Capability

Capable of processing both text and visual chemical information, including molecular structures, chemical reactions, and chemistry exam questions.

Chemistry-Specific

Specifically designed for the chemistry domain, excelling in tasks such as chemical OCR, molecular understanding, and chemical reasoning.

Open Source

The model and codebase are publicly available, supporting further development and improvement by the community.

Model Capabilities

Chemical OCR

Molecular Understanding

Chemical Reasoning

Image-to-Text Conversion

Multimodal Chemical Information Processing

Use Cases

Chemistry Education

Chemistry Exam Question Analysis

Analyze exam questions containing molecular structures and chemical reactions, providing answers and explanations.

Chemistry Research

Molecular Structure Analysis

Identify and analyze molecular structures from images.

Reaction Type Identification

Identify the type and mechanism of chemical reactions.

Accuracy: 16.79%

🚀 ChemVLM-8B: A Multimodal Large Language Model for Chemistry

ChemVLM-8B is an 8-billion-parameter multimodal large language model tailored for chemical applications. It processes both visual and textual chemical information.

🚀 Quick Start

Prerequisites

Install the required libraries using the following command:

pip install "transformers>=4.37.0" sentencepiece einops timm "accelerate>=0.26.0"

Ensure that torch and torchvision are also installed.
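
Before loading the model, it can help to confirm that the installed packages satisfy the minimum versions listed in the command above. The following is a minimal sketch that uses only the Python standard library. The helper names `parse_version`, `meets_minimum`, and `check_requirements` are illustrative and not part of ChemVLM, and the simple parser assumes plain dotted numeric versions such as 4.37.0 (any suffix such as `.dev0` or `+cu121` is ignored).

```python
import re
from importlib import metadata

# Minimum versions taken from the pip command; None means "any version".
REQUIREMENTS = {
    "transformers": "4.37.0",
    "accelerate": "0.26.0",
    "sentencepiece": None,
    "einops": None,
    "timm": None,
    "torch": None,
    "torchvision": None,
}


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if `installed` is at least `minimum`, comparing numeric components."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "4.37" and "4.37.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(requirements=REQUIREMENTS):
    """Return a dict mapping each problematic package to a short description."""
    problems = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if minimum is not None and not meets_minimum(installed, minimum):
            problems[name] = f"found {installed}, need >= {minimum}"
    return problems
```

Calling `check_requirements()` returns an empty dict when every package is present at a sufficient version; otherwise each key names a package to install or upgrade.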

Code Example

import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModelForCausalLM, AutoTokenizer


# Per-channel ImageNet statistics used to normalize input images
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # aspect ratio of the input image
    aspect_ratio = orig_width / orig_height

    # enumerate candidate (columns, rows) tile grids with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=6):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


tokenizer = AutoTokenizer.from_pretrained('AI4Chem/ChemVLM-8B', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "AI4Chem/ChemVLM-8B",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).cuda().eval()

query = "Please describe the molecule in the image."
image_path = "your image path"
pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda()

gen_kwargs = {"max_length": 1000, "do_sample": True, "temperature": 0.7, "top_p": 0.9}

response = model.chat(tokenizer, pixel_values, query, gen_kwargs)
print(response)
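
The preprocessing in this example splits each image into 448×448 tiles arranged in the grid whose aspect ratio best matches the input, and appends a thumbnail of the whole image when more than one tile is produced. To see which grid a given image size would receive without opening any image, the selection logic can be reproduced in isolation. This is a sketch that mirrors `find_closest_aspect_ratio` and `dynamic_preprocess`; the names `tile_grid` and `num_tiles` are illustrative only, and candidates with the same tile count are ordered by (columns, rows) here purely to make the result deterministic.

```python
def tile_grid(width, height, min_num=1, max_num=6, image_size=448):
    """Return the (columns, rows) grid chosen for an image of the given size."""
    aspect_ratio = width / height
    area = width * height

    # Candidate grids, ordered by tile count as in dynamic_preprocess.
    candidates = sorted(
        {
            (cols, rows)
            for cols in range(1, max_num + 1)
            for rows in range(1, max_num + 1)
            if min_num <= cols * rows <= max_num
        },
        key=lambda grid: (grid[0] * grid[1], grid[0], grid[1]),
    )

    best_diff = float("inf")
    best = (1, 1)
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, switch to the later candidate only if the image is large enough.
            best = (cols, rows)
    return best


def num_tiles(width, height, max_num=6, image_size=448, use_thumbnail=True):
    """Number of image tensors produced for one image, including the thumbnail."""
    cols, rows = tile_grid(width, height, max_num=max_num, image_size=image_size)
    tiles = cols * rows
    # A thumbnail of the whole image is appended when more than one tile exists.
    return tiles + 1 if use_thumbnail and tiles > 1 else tiles


print(tile_grid(448, 448), num_tiles(448, 448))      # (1, 1) 1
print(tile_grid(1000, 1000), num_tiles(1000, 1000))  # (2, 2) 5
print(tile_grid(800, 400), num_tiles(800, 400))      # (2, 1) 3
```

The tile count is the first dimension of `pixel_values`, so larger `max_num` values increase both the visual detail available to the model and the memory needed for a single image.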

✨ Features

Multimodal Capability: ChemVLM-8B can handle both visual and textual chemical information, such as molecular structures, reactions, and chemistry exam questions.
Bilingual Training: Trained on a bilingual multimodal dataset, enhancing its cross-language understanding in the chemical domain.
Competitive Performance: Achieves competitive results on various chemical tasks compared to other open-source and proprietary multimodal large language models.

📚 Documentation

Paper

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Abstract

Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require the processing of visual information, which cannot be successfully handled by existing chemical LLMs. This brings a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce ChemVLM, an open-source chemical multimodal large language model specifically designed for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.

Model Description

The architecture of ChemVLM is based on InternVLM and incorporates both vision and language processing components. The model is trained on a bilingual multimodal dataset containing chemical information, including molecular structures, reactions, and chemistry exam questions. More details about the architecture can be found in the GitHub README.

🔧 Technical Details

The architecture of ChemVLM-8B is built upon InternVLM, integrating vision and language processing modules. It is trained on a bilingual multimodal dataset rich in chemical information, enabling it to understand and process various chemical data types.

📄 License

The model is released under the Apache-2.0 license.

📦 Installation

Install the necessary libraries using the following command:

pip install "transformers>=4.37.0" sentencepiece einops timm "accelerate>=0.26.0"

Ensure that torch and torchvision are also installed.

📊 Performance of the 8B model on several tasks

Dataset	MMChemOCR	CMMU	MMCR-bench	Reaction type
Metric	Tanimoto similarity / tani@1.0	Score (%, judged with the help of GPT-4o)	Score (%, judged with the help of GPT-4o)	Accuracy (%)
ChemVLM-8B	81.75 / 57.69	52.7 (SOTA)	33.6	16.79
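
For readers unfamiliar with the MMChemOCR metrics: Tanimoto similarity compares two molecules through their fingerprints, as the number of shared "on" bits divided by the number of bits set in either fingerprint, and tani@1.0 is read here as the percentage of predictions whose similarity to the reference equals 1.0. The sketch below illustrates both numbers on made-up fingerprint bit sets. The fingerprint type used by the benchmark is not stated on this page, so the sets, the helper names `tanimoto` and `summarize`, and the aggregation are assumptions for illustration rather than the benchmark's actual evaluation code.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(union)


def summarize(pairs):
    """Return (mean similarity in %, share of exact matches in %) for (pred, ref) pairs."""
    scores = [tanimoto(pred, ref) for pred, ref in pairs]
    mean_pct = 100.0 * sum(scores) / len(scores)
    exact_pct = 100.0 * sum(score == 1.0 for score in scores) / len(scores)
    return mean_pct, exact_pct


# Hypothetical fingerprints: each set lists the indices of bits that are set.
pairs = [
    ({1, 4, 9, 12}, {1, 4, 9, 12}),  # identical: similarity 1.0
    ({1, 4, 9}, {1, 4, 9, 12}),      # 3 shared of 4 total: 0.75
    ({2, 7}, {1, 4, 9, 12}),         # nothing shared: 0.0
    ({3, 5, 8, 10}, {3, 5, 8, 11}),  # 3 shared of 5 total: 0.6
]
mean_pct, exact_pct = summarize(pairs)
print(f"{mean_pct:.2f} / {exact_pct:.2f}")  # 58.75 / 25.00
```

Under this reading, the second number can never exceed the first, because every exact match contributes a full 1.0 to the average while partial matches contribute only to the average.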

📝 Citation

@inproceedings{li2025chemvlm,
  title={Chemvlm: Exploring the power of multimodal large language models in chemistry area},
  author={Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and others},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={1},
  pages={415--423},
  year={2025}
}

💻 Codebase and Datasets

The codebase and datasets can be found at https://github.com/AI4Chem/ChemVlm.
