# Unicom: A Visionary Feature Extraction Model
Unicom is a vision feature extraction model for multimodal applications. It is built on a Vision Transformer backbone, trained on large-scale image-caption datasets, and delivers strong performance when used as the vision tower of Multimodal Large Language Models (MLLMs).
[Paper] [GitHub]
## Quick Start
This README provides an in-depth overview of the Unicom model, including its architecture, training data, performance evaluation, and limitations.
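For a quick start, the sketch below extracts an image embedding with the vision encoder. It assumes the released checkpoint is compatible with the CLIP-style vision classes in `transformers`; the repository ID and image path are placeholders, not official values.

```python
# Minimal sketch: extract an image embedding with the Unicom/MLCD vision tower.
# Assumes the checkpoint loads with the CLIP vision classes in `transformers`;
# the repo ID and image path below are placeholders.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "your-org/unicom-vit-l-14-336"  # placeholder repo ID

processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

features = outputs.pooler_output  # (1, hidden_size) pooled image embedding
print(features.shape)
```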
## Features
- Powerful Vision Architecture: Uses the same Vision Transformer architecture, ViT-L/14@336px, as CLIP.
- Large-Scale Training Data: Trained on publicly available image-caption data from LAION400M and COYO700M.
- Exceptional Performance: Shows excellent performance in MLLMs and various linear probe evaluations.
## Documentation
### Model
We used the same Vision Transformer architecture as CLIP: ViT-L/14@336px.

### Data
Our model was trained on publicly available image-caption data from the LAION400M and COYO700M datasets.
### Performance and Limitations
#### A. MLLMs Evaluation Results
In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate its performance in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The modified model performs well across multiple benchmarks, validating the effectiveness of MLCD within MLLMs. A rough sketch of the vision-tower swap follows the table.
| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| ScienceQA_img | 78.09 | 76.35 |
| GQA | 64.17 | 63.31 |
| InfoVQA_val | 43.48 | 38.88 |
| MMBench_cn_dev | 74.83 | 72.51 |
| MMBench_en_dev | 76.37 | 74.57 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |
| SeedBench | 68.20 | 66.80 |
| SeedBench_img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| OCRBench | 531.00 | 525.00 |
| ChartQA | 67.84 | 66.52 |
| DocVQA_val | 76.46 | 75.21 |
| POPE | 88.69 | 88.83 |
| TextVQA_val | 61.69 | 62.47 |
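As a rough illustration of the swap described above, the sketch below loads two interchangeable ViT-L/14@336px vision towers with the CLIP vision classes in `transformers`. The MLCD repository ID is a placeholder, and the actual LLaVA-NeXT integration relies on that repo's own vision-tower builder rather than this simplified wiring.

```python
# Conceptual sketch of swapping the vision tower: both encoders are ViT-L/14@336px,
# so they produce patch features of the same shape and the LLM plus multimodal
# projector can stay unchanged. The MLCD repo ID below is a placeholder.
import torch
from transformers import CLIPVisionModel

def load_vision_tower(path: str) -> CLIPVisionModel:
    """Load a frozen ViT-L/14@336px vision encoder."""
    return CLIPVisionModel.from_pretrained(path).eval()

clip_tower = load_vision_tower("openai/clip-vit-large-patch14-336")
mlcd_tower = load_vision_tower("your-org/mlcd-vit-l-14-336")  # placeholder ID

dummy = torch.randn(1, 3, 336, 336)  # one 336x336 RGB image
with torch.no_grad():
    clip_tokens = clip_tower(pixel_values=dummy).last_hidden_state
    mlcd_tokens = mlcd_tower(pixel_values=dummy).last_hidden_state

# Identical shapes, e.g. (1, 577, 1024): 576 patch tokens + 1 [CLS] token,
# which is why the swap is drop-in for the rest of the MLLM.
print(clip_tokens.shape, mlcd_tokens.shape)
```

Because both towers emit token sequences of the same shape, the multimodal projector and the Qwen2.5-7B language model need no architectural changes.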
#### B. Linear Probe Evaluation Results
This table presents linear probe evaluation results comparing the CLIP and MLCD models with the ViT_L_14_336px architecture across various datasets. A linear probe freezes the pre-trained model's weights and trains a linear classifier on top of its features, which measures how well the learned representations generalize to different tasks. A minimal sketch of this procedure follows the table.
| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| AVG | 87.15 | 85.35 |
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| MNIST | 98.67 | 99.20 |
| STL-10 | 99.28 | 99.70 |
| EuroSAT | 99.06 | 98.10 |
| RESISC45 | 95.48 | 94.90 |
| GTSRB | 92.32 | 92.40 |
| KITTI | 75.39 | 69.20 |
| Country211 | 38.12 | 46.40 |
| PatchCamelyon | 88.00 | 85.60 |
| UCF101 | 92.86 | 92.00 |
| Kinetics-700 | 73.35 | 73.00 |
| CLEVR | 64.40 | 60.30 |
| Hateful Memes | 72.00 | 77.30 |
| SST-2 | 76.33 | 80.50 |
| ImageNet | 86.30 | 85.40 |
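The following sketch illustrates the linear probe protocol on a small CIFAR-10 subset: features come from the frozen encoder and only a logistic-regression classifier is trained. The repository ID is a placeholder, and the paper's exact protocol (full training splits, hyperparameter sweeps) may differ.

```python
# Linear-probe sketch: freeze the vision encoder, extract pooled features, and
# train only a linear classifier on top. The repo ID is a placeholder and the
# small CIFAR-10 subsets are for illustration only.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "your-org/unicom-vit-l-14-336"  # placeholder repo ID
processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()  # backbone stays frozen

@torch.no_grad()
def extract_features(dataset, n, batch_size=32):
    """Return pooled embeddings and labels for the first n images of a dataset."""
    feats, labels = [], []
    for start in range(0, n, batch_size):
        batch = [dataset[i] for i in range(start, min(start + batch_size, n))]
        images = [img for img, _ in batch]
        labels += [label for _, label in batch]
        inputs = processor(images=images, return_tensors="pt")
        feats.append(model(**inputs).pooler_output)
    return torch.cat(feats).numpy(), np.array(labels)

train_set = CIFAR10(root="data", train=True, download=True)
test_set = CIFAR10(root="data", train=False, download=True)
X_train, y_train = extract_features(train_set, 1024)
X_test, y_test = extract_features(test_set, 256)

# Only this linear classifier is trained on top of the frozen features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("linear-probe accuracy:", clf.score(X_test, y_test))
```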
#### C. Limitations
Models with higher input resolution handle OCR-related tasks better. We are currently training such higher-resolution models and will release them soon.
## Acknowledgments
We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and Yumeng Wang for their significant contributions to the experimental validation in MLLMs.
## License
This project is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).