PE-Lang-G14-448: Open-Source Perception Encoder for Image and Video Understanding with Strong Generalization

PE Lang G14 448

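Developed by facebook

Perception Encoder is a state-of-the-art encoder for image and video understanding, trained with vision-language learning, with strong generalization.

Text-to-Image  License: Apache-2.0  #MultimodalVisualUnderstanding #LanguageAlignmentOptimization #DocumentOCREnhancement

Downloads: 247

Released: 4/11/2025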

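Model Overview

Perception Encoder (PE) is a family of large-scale vision encoder models that perform strongly across a range of vision tasks. Contrastive pretraining followed by fine-tuning on synthetically aligned videos gives it strong classification and retrieval results and features that generalize to downstream tasks.

Model Features

Strong generalization

The features PE produces internally generalize well and extend to many downstream tasks.

Language alignment

The PE lang variant is tuned for generality, so it suits a wide range of multimodal language modeling setups.

Strong document processing

Performs especially well on OCR and document tasks.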

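Model Capabilities

Image understanding

Video understanding

Document question answering (DocVQA)

Infographic question answering (InfoQA)

Text question answering (TextVQA)

Multimodal language modeling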

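Use Cases

Document processing

Document question answering (DocVQA)

Answers questions based on document content

Scores 94.6 on the DocVQA test set (reported for PLM-8B with the PE-Core-G14-448 encoder and tiling)

Visual question answering

Infographic question answering (InfoQA)

Answers questions based on image or video content

Scores 78.8 on the InfoQA test set (reported for PLM-8B with the PE-Core-G14-448 encoder and tiling)

Multimodal understanding

PerceptionTest

Evaluates how well the model understands visual content

Scores 82.7 on the PerceptionTest test set (reported for PLM-8B with the PE-Core-G14-448 encoder and tiling)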

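🚀 Perception Encoder

Perception Encoder is a state-of-the-art encoder for image and video understanding, trained via simple vision-language learning. It performs well across many vision tasks and provides strong general-purpose features for downstream tasks.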

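🚀 Quick Start

Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks. By using a robust contrastive pretraining recipe and fine-tuning on synthetically aligned videos, PE is reported to outperform all existing models on classification and retrieval, and it also internally produces strong, general features that scale to downstream tasks. With alignment tuning, PE lets large-scale contrastive pretraining transfer to downstream tasks and make full use of those general features.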
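To make "contrastive pretraining" concrete, here is a minimal sketch of a generic CLIP-style symmetric contrastive loss in plain Python. It is not PE's actual training recipe: the function names, the toy embeddings, and the temperature of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over a batch of pairs.

    Hypothetical helper for illustration; the temperature is an assumed value.
    """
    imgs = [l2_normalize(e) for e in image_embs]
    txts = [l2_normalize(e) for e in text_embs]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: pair k of images matches pair k of texts in `matched` only.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < shuffled)  # True
```

The loss is small when each image embedding is closest to its own caption and large when the pairs are mismatched, which is the signal that pulls matching image and text embeddings together during pretraining.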

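✨ Key Features

Perception Encoder: Language

PE lang takes the strong language performance found in the intermediate layers of PE core and aligns it further for language modeling, following PLM. We specifically tuned PE lang to be versatile for any multimodal language modeling use case, including different language model decoders (e.g., Llama / Qwen) and different evaluation settings (e.g., native resolution / tiling). PE lang performs particularly well on OCR and document tasks.

We release two PE Lang checkpoints, L14-448 and G14-448. Below are their results in our benchmark setting, with a frozen encoder and a 2.6M-sample SFT data mix, using 448px only (i.e., no tiling) and Llama 3.1 8B as the decoder: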

| Encoder | Checkpoint | DocVQA (val) | InfoQA (val) | TextVQA | MVBench | PerceptionTest (val) | EgoSchema (val) |
|---|---|---|---|---|---|---|---|
| L/14 448px | [PE-Lang-L14-448](https://huggingface.co/facebook/PE-Lang-L14-448) | 81.9 | 46.4 | 73.0 | 52.3 | 54.7 | 59.8 |
| G/14 448px | [PE-Lang-G14-448](https://huggingface.co/facebook/PE-Lang-G14-448) | 84.4 | 48.3 | 75.2 | 52.4 | 56.0 | 62.0 |
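As a small worked example of what "G/14 448px" implies, the sketch below computes the patch grid for a single 448px view. It assumes the "14" denotes a 14-pixel square patch (the usual ViT naming convention) and counts only patch tokens, ignoring any extra tokens the encoder may add; `patch_grid` is a hypothetical helper, not part of the released code.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side

side, tokens = patch_grid(448, 14)
print(side, tokens)  # 32 1024
```

Under these assumptions, one 448px image without tiling corresponds to a 32 x 32 grid, or 1024 patch tokens, for both the L/14 and G/14 encoders.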

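Below is a sample of the performance obtainable by using PE Core G further aligned with [PLM-8B](https://huggingface.co/facebook/Perception-LM-8B) (stage 3), with 36+1 image tiles / 32 video frames and Llama 3.1 8B as the decoder:

| Model | Encoder | DocVQA (test) | InfoQA (test) | TextVQA | MVBench | PerceptionTest (test) | EgoSchema (test) |
|---|---|---|---|---|---|---|---|
| PLM-8B | [PE-Core-G14-448](https://huggingface.co/facebook/PE-Core-G14-448)* | 94.6 | 78.8 | 86.5 | 77.1 | 82.7 | 68.8 |

* The PE-Core-G14-448 checkpoint was further trained using tiling. We will release the tiling-aligned checkpoint soon.

See the paper for full performance evaluations and fair comparisons with other models.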
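To illustrate what a "36+1 image tiles" setting could look like, here is a rough sketch that picks a tile grid for an image and counts the resulting encoder views. It rests on assumptions not stated in the source: that "36" is a maximum number of 448px tiles, that "+1" is a single downscaled overview of the whole image, and that the grid is chosen by closest aspect ratio. The actual PLM tiling logic may differ, and both function names are hypothetical.

```python
def choose_tile_grid(width: int, height: int, max_tiles: int = 36) -> tuple[int, int]:
    """Pick (rows, cols) with rows * cols <= max_tiles that best matches the aspect ratio."""
    target = width / height
    best = None
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            # Prefer the closest aspect ratio; break ties by using more tiles.
            key = (abs(cols / rows - target), -(rows * cols))
            if best is None or key < best[0]:
                best = (key, (rows, cols))
    return best[1]

def views_for_image(width: int, height: int, max_tiles: int = 36) -> int:
    """Number of encoder inputs: all tiles plus one assumed global overview."""
    rows, cols = choose_tile_grid(width, height, max_tiles)
    return rows * cols + 1

print(choose_tile_grid(1000, 1000))  # (6, 6)
print(views_for_image(1000, 1000))   # 37
```

With this scheme a square image uses the full 6 x 6 grid plus the overview, giving 37 views, which matches the "36+1" figure; wider or taller images end up with fewer tiles.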

📚 Documentation

Model loading code

Model loading code is provided at https://github.com/facebookresearch/perception_models. See the GitHub repository for more details.
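The repository above is the reference for building the model. As a complementary sketch, the snippet below only fetches the checkpoint files from the Hugging Face Hub using `snapshot_download` from the `huggingface_hub` package. The helper names are hypothetical, and nothing here assumes a particular file layout inside the checkpoint repository.

```python
from typing import Optional

# The two PE Lang checkpoints named in this document.
PE_LANG_VARIANTS = {"L14-448", "G14-448"}

def checkpoint_repo_id(variant: str, org: str = "facebook") -> str:
    """Build the Hub repo id for a PE Lang checkpoint (hypothetical helper)."""
    if variant not in PE_LANG_VARIANTS:
        raise ValueError(f"unknown PE Lang variant: {variant!r}")
    return f"{org}/PE-Lang-{variant}"

def download_checkpoint(variant: str, local_dir: Optional[str] = None) -> str:
    """Download all files of the checkpoint repo and return the local folder path."""
    # Imported lazily so the rest of the module works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=checkpoint_repo_id(variant), local_dir=local_dir)

print(checkpoint_repo_id("G14-448"))  # facebook/PE-Lang-G14-448
```

Calling `download_checkpoint("G14-448")` requires network access and the `huggingface_hub` package; instantiating the encoder from the downloaded files is left to the loading code in the perception_models repository.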

📄 License

This project is licensed under Apache-2.0.

📖 Citation

If you find our code useful for your research, please consider citing:

@article{bolya2025PerceptionEncoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

@article{cho2025PerceptionLM,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}