VideoLLaMA 2 is a video large language model that strengthens spatiotemporal modeling and audio understanding, and supports multimodal video question answering and video description tasks.
Text-to-Video
Transformers English