InternLM-XComposer2.5-OL: an open-source multimodal system for long-term streaming video and audio interaction

internlm-xcomposer2d5-ol-7b

Developed by internlm

InternLM-XComposer2.5-OL is a comprehensive multimodal system that supports long-term streaming video and audio interaction.

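Text-to-image

Safetensors

License: Other #Long-term streaming audio and video interaction #Comprehensive multimodal system #Audio understanding

Downloads: 79

Release date: 12/11/2024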

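Model overview

The model is a multimodal system that supports long-term streaming video and audio interaction and handles tasks such as image understanding and audio understanding.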

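Model features

Multimodal interaction

Supports multimodal input and interaction across images and audio.

Long-term streaming

Handles long-duration streaming video and audio data.

Efficient inference

Offers fast inference, making it suitable for real-time applications.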

Model capabilities

Image understanding

Audio understanding

Speech recognition

Multimodal interaction

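Use cases

Multimedia analysis

Image content analysis

Analyzes the content of an image and provides a detailed description.

Can accurately identify the objects and scenes in an image.

Speech recognition

Recognizes speech and converts it to text.

Supports speech recognition in multiple languages.

Real-time interaction

Real-time video analysis

Processes live video streams and returns analysis results in real time.

Suitable for monitoring and real-time feedback systems.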

🚀 InternLM-XComposer-2.5-OL

InternLM-XComposer-2.5-OL is a comprehensive multimodal system for long-term streaming video and audio interaction.

[💻 GitHub repository](https://github.com/InternLM/InternLM-XComposer)

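🚀 Quick start

The simple examples below show how to use InternLM-XComposer-2.5-OL with 🤗 Transformers. For the complete guide, see the project's GitHub repository.

💻 Usage examples

Basic usage

The following code loads the base large language model with Transformers: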

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Initialize the model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

The following code loads the base audio model with MS-Swift:

import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

Advanced usage

Audio understanding

import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# Chinese automatic speech recognition (ASR)
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
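
To run the same recognition prompt over several audio files, the call above can be wrapped in a small loop. The sketch below is illustrative only: `transcribe_many` and its `run_inference` parameter are hypothetical names, not part of MS-Swift. In practice, `run_inference` would be a thin wrapper around the call shown above, for example `lambda q, p: inference(model, template, q, audios=p)[0]`.

```python
from typing import Callable, Dict, Iterable

# Prompt copied from the ASR example above.
ASR_QUERY = '<audio>Detect the language and recognize the speech.'


def transcribe_many(
    audio_paths: Iterable[str],
    run_inference: Callable[[str, str], str],
    query: str = ASR_QUERY,
) -> Dict[str, str]:
    """Apply one prompt to each audio file and collect the responses by path.

    `run_inference(query, audio_path)` is a hypothetical callable supplied by
    the caller; it is expected to return the model's text response.
    """
    results: Dict[str, str] = {}
    for path in audio_paths:
        results[path] = run_inference(query, path)
    return results


if __name__ == '__main__':
    # Stand-in for the real model call so the sketch runs without a GPU.
    def fake_inference(query: str, audio_path: str) -> str:
        return f'(response for {audio_path})'

    out = transcribe_many(['examples/audios/chinese.mp3'], fake_inference)
    for path, text in out.items():
        print(path, '->', text)
```

Passing the inference call in as a parameter keeps the loop independent of how the model was loaded, so the same helper can be exercised with a stub before a model is available.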

Image understanding

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# Initialize the model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
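
Because `image` is a list of file paths, a mistyped path would typically surface only once the model tries to load the file. A small pre-flight check can fail earlier with a clearer message. This is a sketch under stated assumptions: `check_image_paths` is a hypothetical helper, not part of the model's remote code, and it only verifies that each path points to an existing regular file.

```python
from pathlib import Path
from typing import List, Sequence


def check_image_paths(paths: Sequence[str]) -> List[str]:
    """Return the paths as a list if every one is an existing file.

    Raises FileNotFoundError naming all missing paths at once.
    """
    missing = [p for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f'image file(s) not found: {missing}')
    return list(paths)


if __name__ == '__main__':
    try:
        image = check_image_paths(['examples/images/dubai.png'])
        print('ready to pass to model.chat:', image)
    except FileNotFoundError as err:
        print(err)
```

The returned list can be passed as the `image` argument in the example above.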

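📄 License

The code is licensed under Apache 2.0, while the model weights are fully open for academic research and also allow free commercial use. To apply for a commercial license, please fill in the application form (English/Chinese). For other questions or collaboration requests, please contact internlm@pjlab.org.cn.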

Citation

If you find InternLM-XComposer-2.5-OL helpful for your research and applications, please cite it using the following BibTeX:

@misc{zhang2024internlmxcomposer25omnilivecomprehensivemultimodallongterm,
      title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions}, 
      author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09596}, 
}