🚀 4M: Massively Multimodal Masked Modeling
A framework for training any-to-any multimodal foundation models. Scalable. Open-sourced. Across tens of modalities and tasks.
Website
| GitHub
| BibTeX
Official implementation and pre-trained models for:
4M: Massively Multimodal Masked Modeling, NeurIPS 2023 (Spotlight)
David Mizrahi*, Roman Bachmann*, Oğuzhan Fatih Kar, Teresa Yeo, Mingfei Gao, Afshin Dehghan, Amir Zamir
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities, arXiv 2024
Roman Bachmann*, Oğuzhan Fatih Kar*, David Mizrahi*, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir
4M is a framework for training "any-to-any" foundation models, using tokenization and masking to scale to many diverse modalities.
Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models.
We are releasing code and models for "4M: Massively Multimodal Masked Modeling" (here denoted 4M-7), as well as "4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities" (here denoted 4M-21).
🚀 Quick Start
4M is a framework for training multimodal foundation models. You can start exploring its capabilities with the installation instructions and usage examples below.
✨ Features
- Any-to-Any Training: Enables training of foundation models across various modalities.
- Scalable: Can scale to handle a large number of diverse modalities.
- Open-Sourced: Allows the community to contribute and use freely.
- Multimodal Generative: Flexible and steerable multimodal generative models.
📦 Installation
For install instructions, please see https://github.com/apple/ml-4m.
💻 Usage Examples
Basic Usage
This model can be loaded from Hugging Face Hub as follows:
```python
from fourm.models.fm import FM

fm = FM.from_pretrained('EPFL-VILAB/4M-21_XL')
```
Please see README_GENERATION.md for more detailed instructions and https://github.com/apple/ml-4m for other 4M model and tokenizer checkpoints.
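The object returned by `FM.from_pretrained` is a standard PyTorch module, so the usual device placement and inference handling apply. A minimal sketch of that pattern is shown below; it uses a small stand-in `nn.Module` so it runs without downloading the 4M checkpoint, and the device/eval/`inference_mode` steps are what you would apply to the loaded `fm` model in the same way.

```python
import torch
import torch.nn as nn

# Stand-in for the loaded 4M model; in practice you would use
# `fm = FM.from_pretrained('EPFL-VILAB/4M-21_XL')` here instead.
model = nn.Linear(16, 4)

# Move to GPU if available, and switch to eval mode for inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Disable gradient tracking for faster, memory-efficient inference.
with torch.inference_mode():
    x = torch.randn(2, 16, device=device)
    out = model(x)

print(out.shape)
```

See README_GENERATION.md in the GitHub repository for the actual generation interface and masking-based sampling options.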
📚 Documentation
The official documentation provides in-depth information about the 4M framework, including model architecture and training details. See the official website and GitHub repository for more information.
📄 License
The model weights in this repository are released under the Sample Code license as found in the LICENSE file.
📄 Citation
If you find this repository helpful, please consider citing our work:
@inproceedings{4m,
title={{4M}: Massively Multimodal Masked Modeling},
author={David Mizrahi and Roman Bachmann and O{\u{g}}uzhan Fatih Kar and Teresa Yeo and Mingfei Gao and Afshin Dehghan and Amir Zamir},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
}
@article{4m21,
title={{4M-21}: An Any-to-Any Vision Model for Tens of Tasks and Modalities},
author={Roman Bachmann and O{\u{g}}uzhan Fatih Kar and David Mizrahi and Ali Garjani and Mingfei Gao and David Griffiths and Jiaming Hu and Afshin Dehghan and Amir Zamir},
journal={arXiv 2024},
year={2024},
}