InternVideo2.5 is a video multimodal large language model (MLLM) built upon InternVL2.5, enhanced with Long and Rich Context (LRC) modeling, capable of perceiving fine-grained details and capturing long-term temporal structures.
Video-to-Text
Transformers English