4M-21_B

Developed by EPFL-VILAB
4M is an 'any-to-any' foundation model training framework that scales across modalities through tokenization and masking
Downloads: 324
Release Date: 6/12/2024

Model Overview

Multimodal foundation models trained with the 4M framework can perform a wide range of vision tasks out of the box, transfer to unseen tasks and modalities, and offer flexible, controllable multimodal generation.
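The 'tokenization and masking' scheme behind the framework can be sketched in plain Python: every modality is first encoded as discrete tokens, and during training a random subset of tokens drawn across all modalities serves as the model input while a disjoint subset becomes the prediction target. The function and token values below are illustrative only, not the actual 4M API.

```python
import random

def multimodal_mask(modalities, num_input, num_target, seed=0):
    """Split tokens from several modalities into input and target sets.

    `modalities` maps a modality name to its list of discrete token ids
    (as produced by a per-modality tokenizer). A random subset of all
    (modality, position, token) triples is kept as model input; a
    disjoint subset is held out as the prediction target.
    """
    rng = random.Random(seed)
    pool = [
        (name, pos, tok)
        for name, toks in modalities.items()
        for pos, tok in enumerate(toks)
    ]
    rng.shuffle(pool)
    inputs = pool[:num_input]
    targets = pool[num_input:num_input + num_target]
    return inputs, targets

# Toy example: three "modalities" already encoded as token ids.
tokens = {
    "rgb": [101, 102, 103, 104],
    "depth": [201, 202],
    "caption": [301, 302, 303],
}
inp, tgt = multimodal_mask(tokens, num_input=4, num_target=3)
```

Because input and target sets are sampled independently of modality boundaries, the same objective covers every input/output combination, which is what makes any-to-any conversion possible at inference time.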

Model Features

Any-to-Any Multimodal Transformation
Converts between its supported modalities in any direction, using any combination of them as input or output
Task Transfer Capability
Can transfer to unseen tasks and modalities
Controllable Generation
Generation can be conditioned on any combination of modalities for flexible, fine-grained control
Open-Source Framework
Provides a complete training framework and pretrained models

Model Capabilities

Multimodal Data Processing
Visual Task Processing
Cross-Modal Transformation
Controllable Content Generation

Use Cases

Computer Vision
Image Understanding and Generation
Processes various visual understanding tasks and generates related content
Multimodal Applications
Cross-Modal Transformation
Converts and processes data between different modalities