# Unicom: A Visionary Feature Extraction Model
Unicom is a vision feature extraction model for multimodal applications. It is built on a Vision Transformer backbone, trained on large-scale image-caption datasets, and delivers strong performance when used as the vision tower of Multimodal Large Language Models (MLLMs).
[Paper] [GitHub]
## Quick Start
This README provides an in-depth overview of the Unicom model, including its architecture, training data, performance evaluation, and limitations.
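For a quick start, the sketch below extracts an image embedding with the vision encoder. It assumes the released checkpoint is compatible with the CLIP-style vision classes in `transformers`; the repository ID and image path are placeholders, not official values.

```python
# Minimal sketch: extract an image embedding with the Unicom/MLCD vision tower.
# Assumes the checkpoint loads with the CLIP vision classes in `transformers`;
# the repo ID and image path below are placeholders.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "your-org/unicom-vit-l-14-336"  # placeholder repo ID

processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

features = outputs.pooler_output  # (1, hidden_size) pooled image embedding
print(features.shape)
```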
## Features
- Powerful Vision Architecture: Uses the same Vision Transformer architecture, ViT-L/14@336px, as CLIP.
- Large-Scale Training Data: Trained on publicly available image-caption data from LAION400M and COYO700M.
- Exceptional Performance: Shows excellent performance in MLLMs and various linear probe evaluations.
## Documentation
### Model
We used the same Vision Transformer architecture as CLIP: ViT-L/14@336px.

### Data
Our model was trained on publicly available image-caption data from the LAION400M and COYO700M datasets.
### Performance and Limitations
#### A. MLLMs Evaluation Results
In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate its performance in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The modified model performs well across multiple benchmarks, validating the effectiveness of MLCD within MLLMs. A rough sketch of the vision-tower swap follows the table.
| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| ScienceQA_img | 78.09 | 76.35 |
| GQA | 64.17 | 63.31 |
| InfoVQA_val | 43.48 | 38.88 |
| MMBench_cn_dev | 74.83 | 72.51 |
| MMBench_en_dev | 76.37 | 74.57 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |
| SeedBench | 68.20 | 66.80 |
| SeedBench_img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| OCRBench | 531.00 | 525.00 |
| ChartQA | 67.84 | 66.52 |
| DocVQA_val | 76.46 | 75.21 |
| POPE | 88.69 | 88.83 |
| TextVQA_val | 61.69 | 62.47 |
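As a rough illustration of the swap described above, the sketch below loads two interchangeable ViT-L/14@336px vision towers with the CLIP vision classes in `transformers`. The MLCD repository ID is a placeholder, and the actual LLaVA-NeXT integration relies on that repo's own vision-tower builder rather than this simplified wiring.

```python
# Conceptual sketch of swapping the vision tower: both encoders are ViT-L/14@336px,
# so they produce patch features of the same shape and the LLM plus multimodal
# projector can stay unchanged. The MLCD repo ID below is a placeholder.
import torch
from transformers import CLIPVisionModel

def load_vision_tower(path: str) -> CLIPVisionModel:
    """Load a frozen ViT-L/14@336px vision encoder."""
    return CLIPVisionModel.from_pretrained(path).eval()

clip_tower = load_vision_tower("openai/clip-vit-large-patch14-336")
mlcd_tower = load_vision_tower("your-org/mlcd-vit-l-14-336")  # placeholder ID

dummy = torch.randn(1, 3, 336, 336)  # one 336x336 RGB image
with torch.no_grad():
    clip_tokens = clip_tower(pixel_values=dummy).last_hidden_state
    mlcd_tokens = mlcd_tower(pixel_values=dummy).last_hidden_state

# Identical shapes, e.g. (1, 577, 1024): 576 patch tokens + 1 [CLS] token,
# which is why the swap is drop-in for the rest of the MLLM.
print(clip_tokens.shape, mlcd_tokens.shape)
```

Because both towers emit token sequences of the same shape, the multimodal projector and the Qwen2.5-7B language model need no architectural changes.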
#### B. Linear Probe Evaluation Results
This table presents linear probe evaluation results comparing the CLIP and MLCD models with the ViT_L_14_336px architecture across various datasets. A linear probe freezes the pre-trained model's weights and trains a linear classifier on top of its features, which measures how well the learned representations generalize to different tasks. A minimal sketch of this procedure follows the table.
| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|---|---|---|
| AVG | 87.15 | 85.35 |
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| MNIST | 98.67 | 99.20 |
| STL-10 | 99.28 | 99.70 |
| EuroSAT | 99.06 | 98.10 |
| RESISC45 | 95.48 | 94.90 |
| GTSRB | 92.32 | 92.40 |
| KITTI | 75.39 | 69.20 |
| Country211 | 38.12 | 46.40 |
| PatchCamelyon | 88.00 | 85.60 |
| UCF101 | 92.86 | 92.00 |
| Kinetics-700 | 73.35 | 73.00 |
| CLEVR | 64.40 | 60.30 |
| Hateful Memes | 72.00 | 77.30 |
| SST-2 | 76.33 | 80.50 |
| ImageNet | 86.30 | 85.40 |
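The following sketch illustrates the linear probe protocol on a small CIFAR-10 subset: features come from the frozen encoder and only a logistic-regression classifier is trained. The repository ID is a placeholder, and the paper's exact protocol (full training splits, hyperparameter sweeps) may differ.

```python
# Linear-probe sketch: freeze the vision encoder, extract pooled features, and
# train only a linear classifier on top. The repo ID is a placeholder and the
# small CIFAR-10 subsets are for illustration only.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "your-org/unicom-vit-l-14-336"  # placeholder repo ID
processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()  # backbone stays frozen

@torch.no_grad()
def extract_features(dataset, n, batch_size=32):
    """Return pooled embeddings and labels for the first n images of a dataset."""
    feats, labels = [], []
    for start in range(0, n, batch_size):
        batch = [dataset[i] for i in range(start, min(start + batch_size, n))]
        images = [img for img, _ in batch]
        labels += [label for _, label in batch]
        inputs = processor(images=images, return_tensors="pt")
        feats.append(model(**inputs).pooler_output)
    return torch.cat(feats).numpy(), np.array(labels)

train_set = CIFAR10(root="data", train=True, download=True)
test_set = CIFAR10(root="data", train=False, download=True)
X_train, y_train = extract_features(train_set, 1024)
X_test, y_test = extract_features(test_set, 256)

# Only this linear classifier is trained on top of the frozen features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("linear-probe accuracy:", clf.score(X_test, y_test))
```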
#### C. Limitations
Models with higher input resolution handle OCR-related tasks better. We are currently training such higher-resolution models and will release them soon.
## Acknowledgments
We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and Yumeng Wang for their significant contributions to the experimental validation in MLLMs.
## License
This project is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).