Perception LM 1B
🚀 Perception Language Model (PLM)
Perception Language Model (PLM) is a state-of-the-art, fully open and reproducible multimodal large language model (MLLM) designed for transparent research in image and video understanding. It addresses the challenges in these fields by leveraging advanced techniques and high-quality data, providing valuable insights and tools for researchers.
🚀 Quick Start
The training and evaluation code for PLM is available in the perception_models codebase. You can find detailed instructions and more information in the GitHub repo.
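For a first taste of inference, the checkpoints can also be loaded through Hugging Face transformers. The sketch below is a minimal, hedged example: it assumes transformers ships PLM support and that the checkpoint is published under the id `facebook/Perception-LM-1B`; verify both against the GitHub repo and the model hub before relying on it.

```python
# Minimal inference sketch (assumptions: transformers has PLM support and the
# hub id "facebook/Perception-LM-1B" exists; adjust to your actual checkpoint).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="facebook/Perception-LM-1B")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.jpg"},  # any local path or URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# The pipeline applies the chat template, runs the model, and returns the generated text.
print(pipe(text=messages, max_new_tokens=64))
```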
✨ Features
- Advanced Architecture: PLM consists of a vision encoder paired with a small-scale (<8B parameters) LLM decoder (see the illustrative sketch after this list).
- Data-Driven Approach: The work starts with an analysis of standard training pipelines using available data, without relying on proprietary model distillation.
- Large-Scale Synthetic Data: It investigates large-scale synthetic data and establishes key scaling laws to identify data gaps in video understanding, especially for spatio-temporal reasoning and fine-grained understanding tasks.
- High-Quality Human-Labeled Data: To fill the identified gaps, 2.8M high-quality human-labeled samples are created, nearly an order of magnitude more than the largest existing video datasets.
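To make the first bullet concrete, the toy PyTorch module below illustrates the overall composition (vision encoder → projector → LLM decoder). It is a sketch only: the module names, the single linear projector, and the dimensions are assumptions for illustration, not the actual perception_models implementation.

```python
# Illustrative sketch of the vision-encoder + LLM-decoder composition described
# in the Features list. Names, projector design, and dimensions are assumptions;
# the real implementation lives in the perception_models codebase.
import torch
import torch.nn as nn


class ToyPerceptionLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm_decoder: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder            # produces visual tokens from pixels
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps visual tokens into LLM space
        self.llm_decoder = llm_decoder                   # small-scale (<8B) language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode the image/video frames, project into the decoder's embedding space,
        # then prepend the visual tokens to the text embeddings for autoregressive decoding.
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        decoder_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm_decoder(decoder_inputs)
```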
📚 Documentation
Model Overview
PLM was introduced in "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding". You can also refer to the tech report and the GitHub repository.
Resources
Property | Details | Documentation |
---|---|---|
Evaluation | Evaluation of PLM using lmms-eval | docs/evaluation.md |
Training / Finetuning | Training and finetuning instructions for PLM | docs/training.md |
PLM-VideoBench | Evaluation on PLM-VideoBench using lmms-eval | docs/plm_videobench.md |
End-to-End Finetuning Example | End-to-end finetuning example on radiology images | docs/finetune_example.md |
Generating Response | Generate responses using a trained model with generate.py | generate.py |
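The authoritative path for generation is the repo's generate.py (last row of the table above). For readers who prefer the Hugging Face API, a roughly equivalent flow looks like the hedged sketch below; the hub id, dtype choice, and chat-template behaviour are assumptions to check against the official docs and generate.py.

```python
# Hedged sketch of response generation via transformers, mirroring what the
# repo's generate.py does conceptually. The hub id and processor behaviour are
# assumptions; consult docs/ and generate.py for the supported workflow.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "facebook/Perception-LM-1B"  # assumed hub id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/xray.png"},  # placeholder image
            {"type": "text", "text": "What does this image show?"},
        ],
    }
]

# Build model inputs from the chat messages (template application + tokenization).
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens before decoding so only the new response is printed.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0]
print(answer)
```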
Benchmark Results
PLM Image Benchmark Results
Model | DocVQA | ChartQA | TextVQA | InfoQA | AI2D | OCRBench | COCO | Nocap | Flickr | MMMU | VQAv2 | OKVQA | VizWiz | MME | SEED | BLINK | CVBench | RealWorldQA | VSR | POPE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PLM1B | 90.7 | 78.6 | 82.1 | 63.0 | 84.9 | 807 | 138.6 | 124.2 | 100.5 | 34.8 | 81.7 | 61.0 | 59.7 | 1603 | 76.3 | 46.8 | 73.8 | 67.1 | 68.8 | 88.4 |
PLM3B | 93.8 | 84.3 | 84.3 | 74.6 | 90.9 | 830 | 144.9 | 126.5 | 98.0 | 41.2 | 84.3 | 66.8 | 64.0 | 1879 | 78.5 | 55.4 | 81.4 | 72.4 | 80.4 | 88.7 |
PLM8B | 94.6 | 85.5 | 86.5 | 80.9 | 92.7 | 870 | 146.7 | 129.9 | 105.6 | 46.1 | 85.6 | 69.6 | 67.0 | 1989 | 79.3 | 56.0 | 81.3 | 75.0 | 82.8 | 89.9 |
PLM Video Benchmark Results
Model | VATEX | DREAM-1K | How2QA | MVBench | NExTQA | PerceptionTest (test) | STAR | TVQA | VideoMME | TVBench | ActivityNetQA | EgoSchema (test) | TemporalBench | TOMATO | MotionBench (dev) | TempCompass (MCQ) | CGBench (clue) | Charades-STA | VideoHallucer | EventHallusion |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PLM1B | 92.5 | 34.3 | 86.4 | 70.1 | 80.3 | 72.7 | 83.7 | 50.3 | 49.2 | 50.4 | 62.5 | 60.4 | 18.2 | 25.5 | 52.2 | 64.6 | 43.6 | 55.2 | 49.2 | 79.5 |
PLM3B | 96.1 | 37.4 | 89.4 | 74.7 | 83.4 | 79.3 | 84.8 | 55.3 | 54.9 | 58.9 | 66.2 | 66.9 | 23.4 | 30.9 | 60.4 | 69.3 | 47.2 | 57.7 | 55.5 | 76.5 |
PLM8B | 99.7 | 35.9 | 90.7 | 77.1 | 84.1 | 82.7 | 84.9 | 59.3 | 58.3 | 63.5 | 67.3 | 68.8 | 28.3 | 33.2 | 61.4 | 72.7 | 46.4 | 58.6 | 57.7 | 77.3 |
📄 License
The use of PLM is subject to the FAIR Noncommercial Research License. By clicking “I Accept” or using or distributing any portion of the Research Materials, you agree to be bound by this Agreement.
Key Terms
- Acceptable Use Policy: The FAIR Acceptable Use Policy applicable to Research Materials.
- Agreement: The terms and conditions for using, reproducing, distributing, and modifying the Research Materials.
- Documentation: Specifications, manuals, and documentation accompanying the Research Materials distributed by Meta.
- Licensee: You, your employer, or any other person or entity entering into this Agreement.
- Meta: Meta Platforms Ireland Limited (if in the EEA or Switzerland) or Meta Platforms, Inc. (outside the EEA or Switzerland).
- Noncommercial Research Uses: Non-commercial research use cases not primarily for commercial advantage or monetary compensation.
- Research Materials: Documentation, models, software, algorithms, and related elements distributed by Meta under this Agreement.
Prohibited Uses
You agree not to use the Research Materials for:
- Illegal or Unlawful Activities: Such as violence, terrorism, exploitation of children, human trafficking, sexual solicitation, and other criminal activities.
- Harassment and Discrimination: Engaging in, promoting, or facilitating harassment, abuse, discrimination, or other unlawful or harmful conduct.
- Unauthorized Professional Practice: Unauthorized or unlicensed practice of any profession.
- Sensitive Information: Collecting, processing, or disclosing sensitive personal information without proper consent.
- Infringement: Engaging in actions that infringe, misappropriate, or violate third - party rights.
- Malicious Code: Creating or facilitating the creation of malicious code or anything that could harm a website or computer system.
- Dangerous Activities: Engaging in activities presenting a risk of death or bodily harm, such as military, warfare, or illegal weapon-related activities.
- Deception: Intentionally deceiving or misleading others, including generating fraud, disinformation, or spam.
- Failure to Disclose: Failing to appropriately disclose any known dangers of the Research Materials.
Please report any violations of this Policy at https://docs.google.com/forms/d/e/1FAIpQLSeb11cryAopJ7LNrC4nxEUXrHY26hfkXQMf_uH-oFgA3WlYZQ/viewform.
📚 Citation
If you find our code useful for your research, please consider citing:
@article{cho2025PerceptionLM,
title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}
@article{bolya2025PerceptionEncoder,
title={Perception Encoder: The best visual embeddings are not at the output of the network},
author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
journal={arXiv},
year={2025}
}

