# gme-Qwen2-VL-7B-Instruct

Developed by Alibaba-NLP · Downloads: 3,844 · Released: 2024-12-21

gme-Qwen2-VL-7B-Instruct is a multimodal embedding model built on the Qwen2-VL architecture. It supports both Chinese and English and is suited to a wide range of retrieval and natural language processing tasks.

## Model Overview

This 7B-parameter vision-language model accepts text and image inputs and produces embeddings for text similarity, classification, clustering, retrieval, reranking, and other multimodal tasks.
## Model Features

- **Multimodal Capability**: Supports text and image inputs and can understand and process multimodal information.
- **Multilingual Support**: Supports English and Chinese, suitable for cross-language application scenarios.
- **High Performance**: Strong results across multiple benchmarks, especially on text similarity and classification tasks.

## Model Capabilities

- Text similarity calculation
- Text classification
- Text clustering
- Information retrieval
- Reranking
- Multimodal understanding
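These capabilities are all exposed through embedding vectors. The exact loading API is not documented in this card, so the following is only a minimal sketch: it assumes the checkpoint is published under a repo id like `Alibaba-NLP/gme-Qwen2-VL-7B-Instruct` and exposes a sentence-transformers-compatible interface (both are assumptions; adjust to the actual release).

```python
# Minimal text-similarity sketch. The repo id and loading interface below are
# assumptions based on this card's tags (sentence-transformers, sentence-similarity);
# verify them against the actual release.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "Alibaba-NLP/gme-Qwen2-VL-7B-Instruct",  # assumed repo id
    trust_remote_code=True,                  # in case the release ships custom encoding code
)

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise cosine similarities between the sentences.
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```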
## Use Cases

- **E-commerce: Product Review Classification**: sentiment analysis and classification of product reviews; 97.33% accuracy on the Amazon polarity review classification task.
- **Finance: Bank Customer Service Classification**: automatic classification of bank customer inquiries; 84.76% accuracy on the Banking77 dataset.
- **Academic Research: Paper Clustering**: topic clustering of academic papers; 54.96% V-measure on the ArXiv paper clustering task.
## 🚀 gme-Qwen2-VL-7B-Instruct
This is an embedding model built on Qwen2-VL-7B-Instruct, supporting multiple languages including English and Chinese. It has been evaluated on a broad set of MTEB tasks (semantic textual similarity, classification, clustering, retrieval, and reranking) and shows strong performance.
## 📚 Documentation
### Model Information
| Property | Details |
|----------|---------|
| Model Type | gme-Qwen2-VL-7B-Instruct |
| Base Model | Qwen/Qwen2-VL-7B-Instruct |
| Supported Languages | English, Chinese |
| Tags | mteb, sentence-transformers, transformers, Qwen2-VL, sentence-similarity, vidore |
### Evaluation Results
#### 1. Semantic Textual Similarity (STS)
- **Dataset: C-MTEB/AFQMC (Validation Split)**
| Metric | Value |
|--------|-------|
| cos_sim_pearson | 64.72351048394194 |
| cos_sim_spearman | 71.66842612591344 |
| euclidean_pearson | 70.0342809043895 |
| euclidean_spearman | 71.66842612323917 |
| manhattan_pearson | 69.94743870947117 |
| manhattan_spearman | 71.53159630946965 |
- **Dataset: C-MTEB/ATEC (Test Split)**
| Metric | Value |
|--------|-------|
| cos_sim_pearson | 52.38188106868689 |
| cos_sim_spearman | 55.468235529709766 |
| euclidean_pearson | 56.974786979175086 |
| euclidean_spearman | 55.468231026153745 |
| manhattan_pearson | 56.94467132566259 |
| manhattan_spearman | 55.39037386224014 |
- **Dataset: mteb/biosses-sts (Test Split)**
| Metric | Value |
|--------|-------|
| cos_sim_pearson | 86.2557839280406 |
| cos_sim_spearman | 82.58200216886888 |
| euclidean_pearson | 84.80588838508498 |
| euclidean_spearman | 82.58200216886888 |
| manhattan_pearson | 84.53082035185592 |
| manhattan_spearman | 82.4964580510134 |
- **Dataset: C-MTEB/BQ (Test Split)**
| Metric | Value |
|--------|-------|
| cos_sim_pearson | 76.98420285210636 |
| cos_sim_spearman | 78.95549489000658 |
| euclidean_pearson | 79.14591532018991 |
| euclidean_spearman | 78.95549488953284 |
| manhattan_pearson | 79.26212116856509 |
| manhattan_spearman | 79.02104262086006 |
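For reference, the STS numbers above are Pearson and Spearman correlations (scaled by 100) between model-assigned pair similarities and human gold scores; the `euclidean_*` and `manhattan_*` rows use negated distances in place of cosine similarity. A minimal sketch of the cosine variant, with the embeddings and gold scores as placeholder inputs:

```python
# Sketch of the cos_sim_pearson / cos_sim_spearman metrics: correlate per-pair
# cosine similarities with human gold scores. Inputs are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def sts_scores(emb1: np.ndarray, emb2: np.ndarray, gold: np.ndarray) -> dict:
    """emb1/emb2: (n, d) embeddings of each pair's two sentences; gold: (n,) human scores."""
    emb1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    cos = (emb1 * emb2).sum(axis=1)  # cosine similarity per pair
    return {
        "cos_sim_pearson": pearsonr(cos, gold)[0] * 100,
        "cos_sim_spearman": spearmanr(cos, gold)[0] * 100,
    }
```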
#### 2. Classification
- **Dataset: mteb/amazon_counterfactual (Test Split - English)**
| Metric | Value |
|--------|-------|
| accuracy | 77.61194029850746 |
| ap | 41.29789064067677 |
| f1 | 71.69633278678522 |
- **Dataset: mteb/amazon_polarity (Test Split)**
| Metric | Value |
|--------|-------|
| accuracy | 97.3258 |
| ap | 95.91845683387056 |
| f1 | 97.32526074864263 |
- **Dataset: mteb/amazon_reviews_multi (Test Split - English)**
| Metric | Value |
|--------|-------|
| accuracy | 64.794 |
| f1 | 63.7329780206882 |
- **Dataset: mteb/amazon_reviews_multi (Test Split - Chinese)**
| Metric | Value |
|--------|-------|
| accuracy | 55.099999999999994 |
| f1 | 53.115528412999666 |
- **Dataset: mteb/banking77 (Test Split)**
| Metric | Value |
|--------|-------|
| accuracy | 84.76298701298703 |
| f1 | 84.24881789367576 |
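The classification scores above follow the usual embedding-benchmark protocol: the embedding model stays frozen, a lightweight classifier (MTEB typically uses logistic regression) is trained on embeddings of the train split, and accuracy/F1 are reported on the test split. A hedged sketch, where `model` (any encoder with `.encode`), the text lists, and the labels are placeholders:

```python
# Linear-probe sketch of the classification protocol. Macro F1 averaging is
# assumed here; the benchmark's exact averaging may differ.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_classification(model, train_texts, train_labels, test_texts, test_labels):
    X_train = model.encode(train_texts, normalize_embeddings=True)
    X_test = model.encode(test_texts, normalize_embeddings=True)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)
    preds = clf.predict(X_test)

    return {
        "accuracy": accuracy_score(test_labels, preds) * 100,
        "f1": f1_score(test_labels, preds, average="macro") * 100,
    }
```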
#### 3. Retrieval
- **Dataset: mteb/arguana (Test Split)**
| Metric | Value |
|--------|-------|
| map_at_1 | 40.541 |
| map_at_10 | 56.315000000000005 |
| map_at_100 | 56.824 |
| map_at_1000 | 56.825 |
| map_at_3 | 51.778 |
| map_at_5 | 54.623 |
| mrr_at_1 | 41.038000000000004 |
| mrr_at_10 | 56.532000000000004 |
| mrr_at_100 | 57.034 |
| mrr_at_1000 | 57.034 |
| mrr_at_3 | 52.015 |
| mrr_at_5 | 54.835 |
| ndcg_at_1 | 40.541 |
| ndcg_at_10 | 64.596 |
| ndcg_at_100 | 66.656 |
| ndcg_at_1000 | 66.666 |
| ndcg_at_3 | 55.415000000000006 |
| ndcg_at_5 | 60.527 |
| precision_at_1 | 40.541 |
| precision_at_10 | 9.083 |
| precision_at_100 | 0.996 |
| precision_at_1000 | 0.1 |
| precision_at_3 | 21.977 |
| precision_at_5 | 15.661 |
| recall_at_1 | 40.541 |
| recall_at_10 | 90.825 |
| recall_at_100 | 99.57300000000001 |
| recall_at_1000 | 99.644 |
| recall_at_3 | 65.932 |
| recall_at_5 | 78.307 |
- **Dataset: BeIR/cqadupstack (Test Split - Android)**
| Metric | Value |
|--------|-------|
| map_at_1 | 28.848000000000003 |
| map_at_10 | 40.453 |
| map_at_100 | 42.065000000000005 |
| map_at_1000 | 42.176 |
| map_at_3 | 36.697 |
| map_at_5 | 38.855000000000004 |
| mrr_at_1 | 34.764 |
| mrr_at_10 | 45.662000000000006 |
| mrr_at_100 | 46.56 |
| mrr_at_1000 | 46.597 |
| mrr_at_3 | 42.632 |
| mrr_at_5 | 44.249 |
| ndcg_at_1 | 34.764 |
| ndcg_at_10 | 47.033 |
| ndcg_at_100 | 53.089 |
| ndcg_at_1000 | 54.818 |
| ndcg_at_3 | 41.142 |
| ndcg_at_5 | 43.928 |
| precision_at_1 | 34.764 |
| precision_at_10 | 9.027000000000001 |
| precision_at_100 | 1.465 |
| precision_at_1000 | 0.192 |
| precision_at_3 | 19.695 |
| precision_at_5 | 14.535 |
| recall_at_1 | 28.848000000000003 |
| recall_at_10 | 60.849 |
| recall_at_100 | 85.764 |
| recall_at_1000 | 96.098 |
| recall_at_3 | 44.579 |
| recall_at_5 | 51.678999999999995 |
- **Dataset: BeIR/cqadupstack (Test Split - English)**
| Metric | Value |
|--------|-------|
| map_at_1 | 30.731 |
| map_at_10 | 41.859 |
| map_at_100 | 43.13 |
| map_at_1000 | 43.257 |
| map_at_3 | 38.384 |
| map_at_5 | 40.284 |
| mrr_at_1 | 38.471 |
| mrr_at_10 | 47.531 |
| mrr_at_100 | 48.199 |
| mrr_at_1000 | 48.24 |
| mrr_at_3 | 44.989000000000004 |
| mrr_at_5 | 46.403 |
| ndcg_at_1 | 38.471 |
| ndcg_at_10 | 48.022999999999996 |
| ndcg_at_100 | 52.32599999999999 |
| ndcg_at_1000 | 54.26 |
| ndcg_at_3 | 42.986999999999995 |
| ndcg_at_5 | 45.23 |
| precision_at_1 | 38.471 |
| precision_at_10 | 9.248000000000001 |
| precision_at_100 | 1.469 |
| precision_at_1000 | 0.193 |
| precision_at_3 | 20.892 |
| precision_at_5 | 14.892 |
| recall_at_1 | 30.731 |
| recall_at_10 | 59.561 |
| recall_at_100 | 77.637 |
| recall_at_1000 | 89.64999999999999 |
| recall_at_3 | 44.897999999999996 |
| recall_at_5 | 51.181 |
- **Dataset: BeIR/cqadupstack (Test Split - Gaming)**
| Metric | Value |
|--------|-------|
| map_at_1 | 34.949000000000005 |
| map_at_10 | 48.117 |
| map_at_100 | 49.355 |
| map_at_1000 | 49.409 |
| map_at_3 | 44.732 |
| map_at_5 | 46.555 |
| mrr_at_1 | 40.188 |
| mrr_at_10 | 51.452 |
| mrr_at_100 | 52.219 |
| mrr_at_1000 | 52.24100000000001 |
| mrr_at_3 | 48.642 |
| mrr_at_5 | 50.134 |
| ndcg_at_1 | 40.188 |
| ndcg_at_10 | 54.664 |
| ndcg_at_100 | 59.38099999999999 |
| ndcg_at_1000 | 60.363 |
| ndcg_at_3 | 48.684 |
| ndcg_at_5 | 51.406 |
| precision_at_1 | 40.188 |
| precision_at_10 | 9.116 |
| precision_at_100 | 1.248 |
| precision_at_1000 | 0.13699999999999998 |
| precision_at_3 | ... |

*(The remaining metrics for this dataset are incomplete in the original.)*
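The retrieval metrics above (map/mrr/ndcg/precision/recall at various cutoffs, scaled by 100) come from ranking a corpus by embedding similarity to each query. A simplified sketch of the ranking step plus recall@k and binary-relevance nDCG@k, with `model`, `queries`, `corpus`, and `relevant_ids` as placeholders:

```python
# Sketch of embedding-based retrieval and two of the cutoff metrics reported above.
import numpy as np

def retrieve_and_score(model, queries, corpus, relevant_ids, k=10):
    """relevant_ids[i] is the set of corpus indices relevant to queries[i]."""
    q = model.encode(queries, normalize_embeddings=True)
    d = model.encode(corpus, normalize_embeddings=True)
    scores = q @ d.T                      # cosine similarity (embeddings are unit-norm)
    ranked = np.argsort(-scores, axis=1)  # best-first ranking per query

    recall, ndcg = [], []
    for i, rel in enumerate(relevant_ids):
        top_k = ranked[i, :k]
        hits = [1.0 if doc in rel else 0.0 for doc in top_k]
        recall.append(sum(hits) / max(len(rel), 1))
        dcg = sum(h / np.log2(rank + 2) for rank, h in enumerate(hits))
        idcg = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(rel), k)))
        ndcg.append(dcg / idcg if idcg > 0 else 0.0)

    return {f"recall_at_{k}": np.mean(recall) * 100, f"ndcg_at_{k}": np.mean(ndcg) * 100}
```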
#### 4. Clustering
- **Dataset: mteb/arxiv-clustering-p2p (Test Split)**
| Metric | Value |
|--------|-------|
| v_measure | 54.96111428218386 |
- **Dataset: mteb/arxiv-clustering-s2s (Test Split)**
| Metric | Value |
|--------|-------|
| v_measure | 50.637711388838945 |
- **Dataset: mteb/biorxiv-clustering-p2p (Test Split)**
| Metric | Value |
|--------|-------|
| v_measure | 46.86757924102047 |
- **Dataset: mteb/biorxiv-clustering-s2s (Test Split)**
| Metric | Value |
|--------|-------|
| v_measure | 43.86043680479362 |
- **Dataset: C-MTEB/CLSClusteringP2P (Test Split)**
| Metric | Value |
|--------|-------|
| v_measure | 45.684222588040605 |
- **Dataset: C-MTEB/CLSClusteringS2S (Test Split)**
| Metric | Value |
|--------|-------|
| v_measure | 45.45639765303432 |
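The clustering results report V-measure: the corpus is embedded, clustered (typically with a k-means-style algorithm), and the predicted assignment is compared against gold category labels. A minimal sketch with `model`, `texts`, and `gold_labels` as placeholders:

```python
# Sketch of how a V-measure like the ones above is obtained: cluster the
# embeddings and compare the assignment to gold category labels.
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score

def clustering_v_measure(model, texts, gold_labels):
    embeddings = model.encode(texts, normalize_embeddings=True)
    n_clusters = len(set(gold_labels))
    kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
    predicted = kmeans.fit_predict(embeddings)
    return v_measure_score(gold_labels, predicted) * 100
```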
#### 5. Reranking
- **Dataset: mteb/askubuntudupquestions-reranking (Test Split)**
| Metric | Value |
|--------|-------|
| map | 64.0741897266483 |
| mrr | 76.11440882909028 |
- **Dataset: C-MTEB/CMedQAv1-reranking (Test Split)**
| Metric | Value |
|--------|-------|
| map | 88.7058672660788 |
| mrr | 90.5795634920635 |
- **Dataset: C-MTEB/CMedQAv2-reranking (Test Split)**
| Metric | Value |
|--------|-------|
| map | 90.50750030424048 |
| mrr | 92.3970634920635 |
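Reranking tasks score a set of candidate documents per query by embedding similarity and report MAP and MRR over the induced ordering. A sketch of the per-query reciprocal rank, with `model`, `query`, `candidates`, and `positive_set` as placeholders:

```python
# Sketch of embedding-based reranking and the per-query reciprocal rank behind MRR.
import numpy as np

def rerank_reciprocal_rank(model, query, candidates, positive_set):
    """Rank candidate texts by cosine similarity to the query; return the reciprocal
    rank of the first relevant candidate (averaged over queries in a full evaluation)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    c = model.encode(candidates, normalize_embeddings=True)
    order = np.argsort(-(c @ q))  # best-first
    for rank, idx in enumerate(order, start=1):
        if candidates[idx] in positive_set:
            return 1.0 / rank
    return 0.0
```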
## 📄 License
This project is licensed under the Apache-2.0 license.