ViLT-B32-MLM Open-Source Model: A Free Tool for Joint Image and Text Understanding

ViLT B32 MLM

Developed by dandelin

ViLT is a vision-and-language Transformer model pretrained on the GCC+SBU+COCO+VG datasets for joint image-text understanding tasks.

Text-to-Image

Transformers

Open Source License: Apache-2.0

#Multimodal Pretraining #Masked Language Modeling #Vision-Language Understanding

Downloads 7,761

Release Time: 3/2/2022

Model Overview

This model processes images and text with a single Transformer, without convolution or region supervision, and is suited to joint image-text understanding tasks.

Model Features

No Convolution or Region Supervision

The model takes image and text inputs directly, embedding the image as patches rather than relying on a convolutional neural network or on region supervision.
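As a rough illustration of feeding image patches to a Transformer without a convolutional network, the sketch below cuts an image into fixed-size square patches and flattens each one. The 32-pixel patch size is an assumption read from the "B32" in the model name, and `patchify` and the toy image are hypothetical names used only for illustration; this is not the ViLT code. In a real model, each flattened patch would then pass through a learned linear layer.

```python
def patchify(image, patch=32):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Assumes H and W are multiples of `patch`; real pipelines resize or pad first.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append the channel values of one pixel
            patches.append(flat)
    return patches


# Toy 64 x 96 RGB image filled with zeros (hypothetical input).
toy_image = [[[0, 0, 0] for _ in range(96)] for _ in range(64)]
toy_patches = patchify(toy_image)
print(len(toy_patches), len(toy_patches[0]))  # 6 patches, each 32*32*3 = 3072 values
```

The number of patches, and so the image part of the input sequence, grows with image area divided by the squared patch size.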

Joint Vision-Language Understanding

The model processes an image and its accompanying text together, so it can capture how the two relate.

Transformer-Based Architecture

A Transformer architecture handles the text and image inputs jointly.
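One way to picture a single Transformer handling both modalities is as one combined input sequence. The sketch below assumes that text-token and image-patch embeddings share a hidden size, adds a modality-type vector to each embedding, and concatenates the two lists. All names and toy sizes are hypothetical; this is an illustration of the idea, not the library's implementation.

```python
def build_joint_sequence(text_embeds, patch_embeds, text_type, image_type):
    """Tag each embedding with its modality vector, then concatenate.

    Every argument is a plain list of floats (or a list of such lists);
    a real model would use learned tensors instead.
    """
    def add(vec, offset):
        return [v + o for v, o in zip(vec, offset)]

    text_part = [add(vec, text_type) for vec in text_embeds]
    image_part = [add(vec, image_type) for vec in patch_embeds]
    return text_part + image_part


hidden = 4                                          # toy hidden size
text_embeds = [[0.1] * hidden for _ in range(3)]    # 3 text tokens
patch_embeds = [[0.2] * hidden for _ in range(6)]   # 6 image patches
joint = build_joint_sequence(
    text_embeds,
    patch_embeds,
    text_type=[0.0] * hidden,
    image_type=[1.0] * hidden,
)
print(len(joint), len(joint[0]))  # sequence length 9, hidden size 4
```

Because both modalities end up in one sequence, self-attention can relate any text token to any image patch.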

Model Capabilities

Image Understanding

Text Understanding

Multimodal Representation Learning

Masked Language Modeling

Use Cases

Multimodal Understanding

Image Captioning

Generate or complete textual descriptions based on image content

Visual Question Answering

Answer questions about image content (this requires fine-tuning, since this checkpoint only includes the language modeling head)

🚀 Vision-and-Language Transformer (ViLT), pre-trained only

Vision-and-Language Transformer (ViLT) model pre-trained on GCC+SBU+COCO+VG for 200k steps. It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository. Note that this model only includes the language modeling head.

Disclaimer: The team releasing ViLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

🚀 Quick Start

You can use the raw model for masked language modeling given an image and a piece of text with [MASK] tokens.

💻 Usage Examples

Basic Usage

from transformers import ViltProcessor, ViltForMaskedLM
import requests
from PIL import Image
import re
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a bunch of [MASK] laying on a [MASK]."

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")

# prepare inputs; the image tensor is reused at every decoding step
encoding = processor(image, text, return_tensors="pt")
pixel_values = encoding.pixel_values

# number of [MASK] tokens to fill, and the tokenizer's id for [MASK]
tl = len(re.findall(r"\[MASK\]", text))
mask_id = processor.tokenizer.mask_token_id
inferred_token = [text]

# gradually fill in the MASK tokens, one by one
with torch.no_grad():
    for i in range(tl):
        encoded = processor.tokenizer(inferred_token)
        input_ids = torch.tensor(encoded.input_ids)
        # token ids without the CLS and SEP tokens
        encoded = encoded["input_ids"][0][1:-1]
        outputs = model(input_ids=input_ids, pixel_values=pixel_values)
        mlm_logits = outputs.logits[0]  # shape (seq_len, vocab_size)
        # only take into account text features (minus CLS and SEP token)
        mlm_logits = mlm_logits[1 : input_ids.shape[1] - 1, :]
        mlm_values, mlm_ids = mlm_logits.softmax(dim=-1).max(dim=-1)
        # ignore positions that are not [MASK]
        mlm_values[torch.tensor(encoded) != mask_id] = 0
        # fill the mask the model is most confident about
        select = mlm_values.argmax().item()
        encoded[select] = mlm_ids[select].item()
        inferred_token = [processor.decode(encoded)]

encoded = processor.tokenizer(inferred_token)
output = processor.decode(encoded.input_ids[0], skip_special_tokens=True)
print(output)
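The loop in the example fills one mask per pass, choosing the position where the top prediction is most confident and then re-scoring the remaining masks. That selection strategy can be sketched without the model. Here `predict` is a hypothetical callback standing in for a model forward pass, and `toy_predict` returns made-up probabilities purely for illustration.

```python
def fill_masks_greedily(tokens, predict, mask="[MASK]"):
    """Fill masked positions one at a time, most confident prediction first.

    `predict(tokens)` is a hypothetical callback returning, for every
    position, a (best_token, probability) pair given the current tokens.
    """
    tokens = list(tokens)
    while mask in tokens:
        predictions = predict(tokens)
        masked = [i for i, tok in enumerate(tokens) if tok == mask]
        # pick the masked position with the highest top-1 probability
        best = max(masked, key=lambda i: predictions[i][1])
        tokens[best] = predictions[best][0]
    return tokens


def toy_predict(tokens):
    """Stand-in for a model: fixed guesses with invented confidences."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != "[MASK]":
            out.append((tok, 1.0))
        elif i == 3:
            out.append(("cats", 0.9))
        else:
            # more confident about the last word once "cats" is in place
            out.append(("couch", 0.8 if "cats" in tokens else 0.4))
    return out


sentence = "a bunch of [MASK] laying on a [MASK] .".split()
filled = fill_masks_greedily(sentence, toy_predict)
print(" ".join(filled))
```

Re-running the predictor after each fill lets earlier choices inform later ones, at the cost of one forward pass per mask.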


📄 License

This project is licensed under the Apache-2.0 license.

BibTeX entry and citation info

@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision}, 
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}
