
MLCD ViT Large Patch14 336

Developed by DeepGlint-AI
A visual feature extraction model based on the ViT-L/14@336px architecture, surpassing CLIP on multiple multimodal benchmarks
Downloads: 1,450
Release Date: 10/11/2024

Model Overview

This model adopts the same Vision Transformer architecture as CLIP and focuses on image feature extraction, with specific optimization for use in Multimodal Large Language Models (MLLMs).
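
As an illustration, the snippet below sketches how such a checkpoint could be used for feature extraction through Hugging Face's standard CLIP vision classes. The repo id DeepGlint-AI/mlcd-vit-large-patch14-336 and its compatibility with CLIPVisionModel are assumptions here, not confirmed by this page; consult the official model card for the exact loading code.

```python
# Minimal feature-extraction sketch, assuming the checkpoint loads through
# transformers' standard CLIP vision classes (an assumption, not confirmed
# by this page).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # assumed repo id
processor = CLIPImageProcessor.from_pretrained(model_id)
model = CLIPVisionModel.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")       # any local image
inputs = processor(images=image, return_tensors="pt")  # resizes to 336x336

with torch.no_grad():
    outputs = model(**inputs)

# Patch-level features: (1, 577, 1024) for ViT-L/14 at 336px
# (24x24 = 576 patch tokens plus one [CLS] token, hidden size 1024).
patch_features = outputs.last_hidden_state
# Single global image feature: the pooled [CLS] representation, (1, 1024).
global_feature = outputs.pooler_output
```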

Model Features

Multimodal Optimization
Specifically optimized for Multimodal Large Language Models (MLLMs), demonstrating excellent performance in frameworks like LLaVA-NeXT
High-Performance Feature Extraction
Outperforms CLIP models of the same architecture across 20+ benchmarks, with an average improvement of 1.8-2.0 percentage points
Large-Scale Training Data
Trained on two major public datasets, LAION-400M and COYO-700M, covering a wide range of visual concepts

Model Capabilities

Image Feature Extraction
Multimodal Representation Learning
Visual Question Answering Support
Image Classification
Cross-Modal Retrieval

Use Cases

Multimodal Large Language Models
LLaVA-NeXT Visual Backbone
Integrated as the visual encoder in the LLaVA-NeXT framework; see the configuration sketch after this section
Surpasses CLIP on 12 benchmarks, including AI2D (76.98) and ScienceQA_img (78.09)
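
For illustration, the sketch below shows the kind of configuration change this swap involves in a LLaVA-style setup. The mm_vision_tower, mm_vision_select_layer, and mm_vision_select_feature keys mirror fields found in public LLaVA checkpoint configs; the repo id and the drop-in compatibility are assumptions, not instructions from this model card.

```python
# Hedged illustration of pointing a LLaVA-NeXT-style config at MLCD
# instead of the default CLIP backbone.
llava_vision_config = {
    # Default LLaVA-NeXT visual backbone would be:
    # "mm_vision_tower": "openai/clip-vit-large-patch14-336",
    # Assumed drop-in MLCD replacement (same ViT-L/14@336px geometry,
    # so the projector's 1024-dim input is unchanged):
    "mm_vision_tower": "DeepGlint-AI/mlcd-vit-large-patch14-336",
    "mm_vision_select_layer": -2,        # LLaVA reads penultimate-layer features
    "mm_vision_select_feature": "patch", # use patch tokens, drop the [CLS] token
}
```

Because the patch grid and hidden size match CLIP ViT-L/14@336px, swapping the backbone leaves the rest of the multimodal pipeline unchanged.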
Computer Vision
Linear Classification Tasks
Linear probing with a frozen feature extractor; a minimal sketch follows below
Significantly outperforms CLIP on tasks such as CIFAR-100 (93.69) and FGVC Aircraft (86.38)
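
The sketch below illustrates the standard linear-probing protocol: freeze the backbone, cache pooled features, and train only a linear classifier on top. It assumes frozen MLCD features have already been extracted and saved; the file names and regularization strength are illustrative, not the evaluation's exact settings.

```python
# Linear-probing sketch: only the linear classifier is trained,
# the ViT backbone stays frozen throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed precomputed inputs:
# X_train / X_test: (N, 1024) arrays of frozen MLCD pooled features,
# y_train / y_test: integer class labels (e.g. 100 classes for CIFAR-100).
X_train, y_train = np.load("train_feats.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_feats.npy"), np.load("test_labels.npy")

# Multinomial logistic regression is equivalent to a single linear layer
# trained to convergence on top of the frozen features.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, y_train)
print(f"linear-probe top-1 accuracy: {probe.score(X_test, y_test):.4f}")
```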