InternLM-XComposer2.5-OL: Open-Source Multimodal System Supporting Long-Term Streaming Video and Audio Interaction


Internlm Xcomposer2d5 Ol 7b

Developed by internlm

InternLM-XComposer2.5-OL is a comprehensive multimodal system that supports long-term streaming video and audio interaction.

Text-to-Image

Safetensors

Open-source license: Other #Long-term streaming audio and video interaction #Comprehensive multimodal system #Audio understanding

Downloads: 79

Release date: 12/11/2024

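Model Overview

This model is a multimodal system that supports long-term streaming video and audio interaction and handles a range of tasks, including image understanding and audio understanding.

Model Features

Multimodal interaction

Supports multimodal input of images and audio.

Long-term streaming processing

Processes long-running streaming video and audio data.

Efficient inference

Provides efficient inference, making it suitable for real-time applications.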

Model Capabilities

Image understanding

Audio understanding

Speech recognition

Multimodal interaction

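Use Cases

Multimedia analysis

Image content analysis

Analyzes the content of an image and provides a detailed description and analysis.

Identifies the objects and scenes in an image.

Speech recognition

Recognizes speech content and converts it to text.

Supports speech recognition in multiple languages.

Real-time interaction

Real-time video analysis

Processes real-time video streams and provides immediate analysis results.

Suited to monitoring and real-time feedback systems.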
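
The real-time stream handling described above can be pictured as consuming frames in bounded pieces rather than as one long sequence. The following is a minimal sketch of that generic pattern, not the system's actual pipeline; `chunk_stream` and `chunk_size` are hypothetical names introduced for illustration, and plain integers stand in for video frames.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_stream(stream: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Lazily group items from `stream` into lists of at most `chunk_size` items.

    Hypothetical helper: it only illustrates bounded-size processing of a
    long stream and does not reflect how the model handles video internally.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(stream)
    while True:
        # Pull the next `chunk_size` items without reading the whole stream.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Integers stand in for frames arriving from a camera or a file.
for chunk in chunk_stream(range(7), 3):
    print(chunk)
```

Because the generator pulls items only when the next chunk is requested, the same function works on an unbounded source; the final chunk may be shorter than `chunk_size`.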

🚀 InternLM-XComposer-2.5-OL

InternLM-XComposer2.5-OL is a comprehensive multimodal system for long-term streaming video and audio interactions.


💻Github Repo

🚀 Quick Start

Below is a simple example of using InternLM-XComposer-2.5-OL with 🤗 Transformers. For the complete guide, please refer to here.

💻 Usage Examples

Basic Usage

To load the base large language model (LLM) with Transformers, use the following code.

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

To load the base audio model with MS-Swift, use the following code.

import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

Advanced Usage

Audio Understanding

import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# Chinese ASR
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
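
In the example above, the query starts with a single `<audio>` tag and is paired with one audio file. Assuming each `<audio>` tag stands for one audio input (an assumption drawn from that example, not confirmed by this page), a small helper can keep the tags and the file list in step. `build_audio_query` is a hypothetical helper, not part of MS-Swift.

```python
from typing import List, Sequence, Tuple

AUDIO_TAG = "<audio>"  # placeholder tag as it appears in the query above


def build_audio_query(instruction: str, audio_paths: Sequence[str]) -> Tuple[str, List[str]]:
    """Prefix `instruction` with one tag per audio file.

    Hypothetical helper: the one-tag-per-file convention is an assumption
    based on the single-file example, not a documented MS-Swift rule.
    """
    if not audio_paths:
        raise ValueError("at least one audio path is required")
    query = AUDIO_TAG * len(audio_paths) + instruction
    return query, list(audio_paths)


query, audios = build_audio_query(
    "Detect the language and recognize the speech.",
    ["examples/audios/chinese.mp3"],
)
print(query)
```

The returned `query` matches the string used in the example above. Whether `inference` accepts a list for `audios` should be checked against the installed MS-Swift version.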

Image Understanding

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
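
Because `model.chat` is only called after the weights are loaded, a mistyped image path costs a full model load before it is noticed. A minimal sketch of a pre-flight check follows; `validate_image_paths` and the `IMAGE_SUFFIXES` set are hypothetical and not part of the model's API, and the accepted suffixes are an assumption.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed set of image suffixes; adjust to whatever the model actually accepts.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}


def validate_image_paths(paths: Iterable[str]) -> List[str]:
    """Return `paths` as a list, raising early if any entry looks unusable.

    Hypothetical helper: it checks only the suffix and that the file exists,
    not whether the file decodes as an image.
    """
    checked: List[str] = []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            raise ValueError(f"unexpected image suffix: {raw}")
        if not path.is_file():
            raise FileNotFoundError(f"image not found: {raw}")
        checked.append(str(path))
    return checked
```

It could be called as `image = validate_image_paths(['examples/images/dubai.png'])` just before the `model.chat` call.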

Citation

If you find InternLM-XComposer-2.5-OL useful for your research or applications, please cite it using the following BibTeX.

@misc{zhang2024internlmxcomposer25omnilivecomprehensivemultimodallongterm,
      title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions}, 
      author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09596}, 
}

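📄 License

The code is licensed under Apache 2.0. The model weights are fully open for academic research and also allow free commercial use. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact internlm@pjlab.org.cn.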