Taiyi-vit-87M-D Open-source Visual Encoder - Pretrained on Specific Datasets with Practical Image Encoding Function

Taiyi Vit 87M D

Developed by IDEA-CCNL

An English MAP visual encoder specially pretrained on COCO and Visual Genome datasets, based on ViT-base architecture

Image-to-Text

Transformers

Open Source License:Apache-2.0 #Multimodal Pretraining #Image Classification Enhancement #ViT Architecture Optimization

Downloads 24

Release Time : 5/4/2022

Model Overview

This model is a visual encoder based on the CLIP-ViT-base architecture, infused with multimodal information through specialized training tasks, suitable for visual tasks like image classification

Model Features

Special Pretraining Scheme

Utilizes novel pretraining method D to inject multimodal information through specialized training tasks

High Performance

In the authors' local tests, reaches 98.7% on CIFAR10 and 82.4% on ImageNet1k, compared with the official 96.2% and 80.2% of clip-vit-base-patch16-224

Multimodal Representation

Pretrained on MSCOCO and VG datasets, enabling multimodal understanding capabilities

Model Capabilities

Image Classification

Visual Feature Extraction

Multimodal Representation Learning

Use Cases

Computer Vision

Image Classification

Classifies input images, supporting ImageNet 1000-class tasks

Achieves 82.4% accuracy on ImageNet1k

Visual Feature Extraction

Extracts high-level visual features from images for downstream tasks

🚀 Taiyi-vit-87M-D

A ViT-base visual encoder for the English version of MAP (temporary name), with special pre-training on COCO and VG.

Main Page: Fengshenbang
Github: Fengshenbang-LM

🚀 Quick Start

Taiyi-vit-87M-D is a visual encoder based on ViT-base, which has been specially pre-trained on COCO and VG. It is designed for the English version of MAP (temporary).

✨ Features

Special pre-training on COCO and VG to introduce multimodal information.
A new pre-training method denoted by "D".
Designed several special training objectives for special multimodal representations.

📦 Installation

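The original model card gives no installation steps. The usage example below imports transformers, PIL (Pillow) and requests, and asks for PyTorch tensors (return_tensors="pt"), so these packages must be available, for example via pip install transformers torch pillow requests.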

💻 Usage Examples

Basic Usage

import requests
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

# Download a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing configuration and the model weights
# (newer transformers releases deprecate ViTFeatureExtractor in favor of ViTImageProcessor)
feature_extractor = ViTFeatureExtractor.from_pretrained('IDEA-CCNL/Taiyi-vit-87M-D')
model = ViTForImageClassification.from_pretrained('IDEA-CCNL/Taiyi-vit-87M-D')

# Preprocess the image and run inference without tracking gradients
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
# Predicted class: Egyptian cat
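
Feature Extraction (sketch)

The model card also lists visual feature extraction as a use case but gives no code for it. The sketch below shows one possible approach, assuming the checkpoint loads as ViTForImageClassification (as in the basic usage example), that its backbone is reachable as model.vit, and that the final-layer [CLS] token is an acceptable image-level feature. The helper names extract_cls_features and cosine_similarity are hypothetical and not part of the original card. ViTImageProcessor is the name under which newer transformers releases expose the ViT preprocessing class. Nothing is downloaded until extract_cls_features is called.

```python
import math

MODEL_NAME = "IDEA-CCNL/Taiyi-vit-87M-D"  # checkpoint name from the model card


def extract_cls_features(image, model_name=MODEL_NAME):
    """Return the final-layer [CLS] embedding of a PIL image as a list of floats.

    Assumption: the [CLS] token is used as the image-level feature; the model
    card does not prescribe a pooling strategy.
    """
    # Heavy dependencies are imported lazily so this module imports without them.
    import torch
    from transformers import ViTForImageClassification, ViTImageProcessor

    processor = ViTImageProcessor.from_pretrained(model_name)
    model = ViTForImageClassification.from_pretrained(model_name)
    model.eval()

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        # model.vit is the ViT backbone without the classification head
        hidden = model.vit(pixel_values=inputs["pixel_values"]).last_hidden_state
    # hidden has shape (batch, tokens, hidden_size); token 0 is [CLS]
    return hidden[0, 0].tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Given two PIL images img_a and img_b, cosine_similarity(extract_cls_features(img_a), extract_cls_features(img_b)) yields a score between -1 and 1. For repeated use, load the processor and model once instead of on every call.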

📚 Documentation

Model Taxonomy

Property	Details
Demand	Special
Task	Multimodal
Series	Taiyi
Model	TBD
Parameter	89M
Extra	Special pre-training method D

Model Information

Starting from the pre-trained clip-vit-base (patch 16, resolution 224x224), we inject multimodal information through special pre-training tasks. The "D" denotes a special training method. To obtain special multimodal representations, we design several special training objectives, described in our paper. The pre-training datasets are MSCOCO and VG. Our code and the details of the pre-training tasks will be made publicly available upon paper acceptance.
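
To make the "patch 16, resolution 224x224" setting concrete, the following sketch computes how many tokens the encoder processes per image. It assumes the standard ViT layout (non-overlapping square patches plus a single [CLS] token), which the card does not spell out; the function name vit_sequence_length is illustrative only.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16, extra_tokens: int = 1) -> int:
    """Number of tokens a ViT processes for one square image.

    extra_tokens=1 accounts for the [CLS] token (assumed standard ViT layout).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


print(vit_sequence_length())  # 14 * 14 patches + 1 [CLS] token = 197
```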

Performance on Downstream Tasks

Model	CIFAR10 accuracy (%)	ImageNet1k accuracy (%)
clip-vit-base-patch16-224 (official)	96.2	80.2
Taiyi-vit-87M-D (local)	98.7	82.4

The local test settings are: learning rate = 2e-5, batch size = 128, num train epochs = 5, weight decay = 0.01
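
These settings map naturally onto a Hugging Face Trainer configuration. The sketch below is one possible setup, not the authors' script: it assumes the batch size of 128 is per device, uses a hypothetical output directory, and leaves every other option at its default. The step count additionally assumes the full 50,000-image CIFAR10 training split and no gradient accumulation.

```python
import math

# Hyperparameters quoted in the local test settings
FINETUNE_SETTINGS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 128,  # assumption: 128 is the per-device batch size
    "num_train_epochs": 5,
    "weight_decay": 0.01,
}


def build_training_args(output_dir="./taiyi-vit-finetune"):  # hypothetical path
    """Create transformers.TrainingArguments from the settings above."""
    # Imported lazily; requires transformers with a PyTorch backend.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **FINETUNE_SETTINGS)


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


# CIFAR10 has 50,000 training images: ceil(50000 / 128) = 391 steps per epoch
print(total_optimizer_steps(50_000, 128, 5))  # 1955
```

The object returned by build_training_args() can be passed to transformers.Trainer together with the model and a labeled dataset.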

🔧 Technical Details

Based on pre-trained clip-vit-base (patch 16, resolution 224x224), we introduce multimodal information through special training tasks. The "D" in the model name represents a new pre-training method. We designed several different training objectives for special multimodal representations in the paper. The pre-training datasets are MSCOCO and VG. The code and details of the pre-training tasks will be made public after the paper is accepted.

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you use this resource in your work, please cite our paper:

@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
