4M-21 L

Developed by EPFL-VILAB
4M is an 'any-to-any' foundation model training framework that scales to many modalities and tasks through per-modality tokenization and multimodal masked training
Downloads: 49
Release Date: 6/12/2024

Model Overview

Models trained with 4M can perform a wide range of vision tasks out of the box, transfer to unseen tasks and modalities, and offer flexible, controllable multimodal generation
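The core training recipe pairs per-modality tokenization with masked prediction: every modality is converted into discrete tokens, a random subset of tokens is hidden, and the model learns to predict the hidden tokens from the visible ones. The sketch below illustrates this idea with a toy PyTorch encoder; the vocabulary size, mask-token convention, and single-encoder architecture are simplifying assumptions, and the actual 4M implementation differs (it uses an encoder-decoder with separate input/target sampling).

```python
# Minimal sketch of a 4M-style masked multimodal modeling step.
# All names and shapes here are illustrative, not the real 4M code.
import torch
import torch.nn as nn

VOCAB = 1024          # assumed shared discrete-token vocabulary size
D, N_MODALITIES = 256, 3

class TinyAnyToAny(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.mod_emb = nn.Embedding(N_MODALITIES, D)   # marks which modality a token belongs to
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)                # predicts discrete target tokens

    def forward(self, tokens, modality_ids):
        x = self.tok_emb(tokens) + self.mod_emb(modality_ids)
        return self.head(self.encoder(x))

# Toy batch: each modality has already been tokenized into discrete IDs.
B, L = 2, 16
tokens = torch.randint(1, VOCAB, (B, N_MODALITIES * L))
modality_ids = torch.arange(N_MODALITIES).repeat_interleave(L).expand(B, -1)

# Masking: a random subset of tokens becomes the prediction target,
# the rest remain visible as input.
MASK_ID = 0                                            # assumed reserved mask token
target_mask = torch.rand(B, N_MODALITIES * L) < 0.5
inputs = tokens.masked_fill(target_mask, MASK_ID)

model = TinyAnyToAny()
logits = model(inputs, modality_ids)
loss = nn.functional.cross_entropy(logits[target_mask], tokens[target_mask])
print(f"masked-prediction loss: {loss.item():.3f}")
```

Because every modality is mapped into the same kind of discrete token sequence, a single objective of this form can cover image-like, sequence-like, and metadata modalities alike.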

Model Features

Any-to-Any Multimodal Processing
Handles tens of diverse modalities and tasks within a single model
Scalability
The framework is designed to be extended to new modalities and tasks
Transfer Learning Capability
Transfers to tasks and modalities not seen during training
Controllable Multimodal Generation
Generation can be conditioned on arbitrary combinations of input modalities

Model Capabilities

Multimodal Masked Modeling
Visual Task Processing
Cross-Modal Transfer Learning
Controllable Content Generation

Use Cases

Computer Vision
Multimodal Visual Understanding
Process and understand data from multiple visual modalities
Generative AI
Controllable Content Generation
Generate multimodal content conditioned on inputs from other modalities, as sketched below
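At inference time, any-to-any generation can be framed as iterative unmasking: tokens of the conditioning modality stay fixed while the target modality's tokens are filled in over several refinement steps. The snippet below is a hypothetical, self-contained illustration of that loop; `predict_logits` is a random stand-in for a trained model and is not part of the 4M API.

```python
# Hypothetical sketch of controllable generation via iterative unmasking.
import torch

VOCAB, L, STEPS = 1024, 16, 4
MASK_ID = -1                              # sentinel for not-yet-generated positions

def predict_logits(tokens):
    # Stand-in for a trained any-to-any model; returns random logits here.
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)

cond = torch.randint(0, VOCAB, (1, L))    # e.g. tokenized caption (given, stays fixed)
target = torch.full((1, L), MASK_ID)      # e.g. image tokens (to be generated)

for step in range(STEPS):
    tokens = torch.cat([cond, target], dim=1)
    logits = predict_logits(tokens)[:, L:]            # logits for target positions only
    conf, pred = logits.softmax(-1).max(-1)
    still_masked = target == MASK_ID
    conf = conf.masked_fill(~still_masked, -1.0)      # never re-commit finished positions
    # Commit the most confident still-masked predictions this step.
    k = max(1, int(still_masked.sum() // (STEPS - step)))
    idx = conf.topk(k, dim=1).indices
    target[0, idx[0]] = pred[0, idx[0]]

print(target)   # generated target-modality tokens (random with this stub model)
```

Swapping which modality is fixed and which is filled in gives the "any-to-any" behavior described above, without changing the decoding loop.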