Taiyi-vit-87M-D
A ViT-base visual encoder for the English version of MAP (temporary name), with special pre-training on COCO and VG.
Quick Start
Taiyi-vit-87M-D is a visual encoder based on ViT-base that has been specially pre-trained on COCO and VG (Visual Genome). It is designed for the English version of MAP (temporary name).
Features
- Special pre-training on COCO and VG to introduce multimodal information.
- A new pre-training method denoted by "D".
- Several special training objectives designed to learn special multimodal representations.
Usage Examples
Basic Usage
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

# Download an example image from the COCO validation set.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessor and the model weights from the Hugging Face Hub.
feature_extractor = ViTFeatureExtractor.from_pretrained('IDEA-CCNL/Taiyi-vit-87M-D')
model = ViTForImageClassification.from_pretrained('IDEA-CCNL/Taiyi-vit-87M-D')

# Preprocess the image into PyTorch tensors and run a forward pass.
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# Take the highest-scoring class index and map it to its label.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
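The last two lines of the example reduce the logits to a single label with an argmax. The sketch below shows that step in plain Python and extends it to a top-k ranking with softmax probabilities. The logits and the `example_id2label` mapping are made-up illustrative values, not outputs of Taiyi-vit-87M-D, and `top_k_labels` is a hypothetical helper, not part of `transformers`.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (max is subtracted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Illustrative values only (assumption): a real model emits one logit per class.
example_logits = [1.2, 3.4, 0.3, 2.8]
example_id2label = {0: "cat", 1: "remote", 2: "couch", 3: "blanket"}

for label, prob in top_k_labels(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With a real model, `logits[0].tolist()` and `model.config.id2label` could be passed in place of the illustrative values.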
Documentation
Model Taxonomy
| Property | Details |
| --- | --- |
| Demand | Special |
| Task | Multimodal |
| Series | Taiyi |
| Model | TBD |
| Parameter | 89M |
| Extra | Special pre-training method D |
Model Information
Taiyi-vit-87M-D starts from the pre-trained clip-vit-base (patch size 16, resolution 224x224) and injects multimodal information through special pre-training tasks. The "D" denotes a special training method. To learn special multimodal representations, we design several special training objectives, which are described in our paper. The pre-training datasets are MSCOCO and VG. Our code and the details of the pre-training tasks will be made publicly available upon paper acceptance.
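To make the "patch 16, resolution 224x224" setting concrete, the following sketch works out how many tokens such an encoder sees per image. It assumes a standard ViT layout with RGB input and a single class token; `vit_sequence_shape` is a hypothetical helper written for this illustration, not part of the released model code.

```python
def vit_sequence_shape(image_size=224, patch_size=16, channels=3, class_tokens=1):
    """Derive the token layout of a ViT from its image and patch size."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "patches_per_side": patches_per_side,
        "num_patches": num_patches,
        # Assumption: one learned class token is prepended to the patch tokens.
        "sequence_length": num_patches + class_tokens,
        # Number of raw pixel values flattened into each patch before projection.
        "patch_dim": patch_size * patch_size * channels,
    }


shape = vit_sequence_shape()
for key, value in shape.items():
    print(f"{key}: {value}")
```

Under these assumptions, a 224x224 image is split into a 14x14 grid of 196 patches, giving a sequence of 197 tokens, and each patch flattens to 768 pixel values.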
Performance on Downstream Tasks
| Model | CIFAR10 | ImageNet1k |
| --- | --- | --- |
| clip-vit-base-patch16-224 (official) | 96.2 | 80.2 |
| Taiyi-vit-87M-D (local) | 98.7 | 82.4 |
The local test settings are:
learning rate = 2e-5,
batch size = 128,
num train epochs = 5,
weight decay = 0.01
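As a rough illustration of what these settings imply, the sketch below stores them in a small config object and derives the number of optimizer steps per run. The source does not state which trainer or optimizer was used, so `FineTuneSettings` is a hypothetical container, and the step counts assume no gradient accumulation, an effective batch size of 128, and that the last partial batch is kept. The training-set sizes are the standard ones for CIFAR-10 (50,000 images) and ImageNet-1k (1,281,167 images).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSettings:
    """Hyperparameters listed above; the class itself is an illustrative assumption."""
    learning_rate: float = 2e-5
    batch_size: int = 128
    num_train_epochs: int = 5
    weight_decay: float = 0.01

    def steps_per_epoch(self, num_train_examples: int) -> int:
        # The last, possibly smaller, batch still counts as one step.
        return math.ceil(num_train_examples / self.batch_size)

    def total_steps(self, num_train_examples: int) -> int:
        return self.steps_per_epoch(num_train_examples) * self.num_train_epochs


settings = FineTuneSettings()
TRAIN_SET_SIZES = {"CIFAR10": 50_000, "ImageNet1k": 1_281_167}

for name, size in TRAIN_SET_SIZES.items():
    print(name, settings.steps_per_epoch(size), settings.total_steps(size))
```

The same values map directly onto the corresponding fields of most training configurations, for example a learning-rate schedule that needs the total step count in advance.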
Technical Details
Starting from the pre-trained clip-vit-base (patch size 16, resolution 224x224), we introduce multimodal information through special training tasks. The "D" in the model name denotes a new pre-training method. We designed several different training objectives for special multimodal representations, as described in the paper. The pre-training datasets are MSCOCO and VG. The code and the details of the pre-training tasks will be made public after the paper is accepted.
License
This project is licensed under the Apache-2.0 license.
Citation
If you use this resource in your work, please cite our paper:
@article{fengshenbang,
author = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
title = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
journal = {CoRR},
volume = {abs/2209.02970},
year = {2022}
}
You can also cite our website:
@misc{Fengshenbang-LM,
title={Fengshenbang-LM},
author={IDEA-CCNL},
year={2021},
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}