Tarsier-34b

Developed by omni-research
Tarsier-34b is an open-source large-scale video-language model that focuses on generating high-quality video captions and achieves leading results on multiple public benchmarks.
Downloads 103
Release Time: 7/3/2024

Model Overview

Tarsier-34b is a large video-language model designed to generate high-quality video captions while also performing well on general video understanding tasks.

Model Features

Two-stage Training Strategy
Trains in two stages: multi-task pre-training followed by multi-granularity instruction fine-tuning.
Parameter-efficient Training
Freezes the Vision Transformer (ViT) parameters and trains only the projection layer and the large language model.
Leading in Multiple Benchmarks
Achieves state-of-the-art (SOTA) results on 6 public benchmarks.
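
The parameter-efficient training feature listed above (frozen ViT, trainable projection layer and language model) can be illustrated with a small, framework-free sketch of the selection logic. This is not the Tarsier training code: the parameter names, the `vit.` / `projector.` / `llm.` prefixes, and the sizes are all hypothetical and chosen only for illustration. In PyTorch the same idea is usually expressed by setting `requires_grad = False` on the frozen parameters.

```python
from dataclasses import dataclass


@dataclass
class Param:
    name: str               # hypothetical parameter name
    numel: int              # number of scalar values in the parameter
    trainable: bool = True  # whether the optimizer would update it


def freeze_by_prefix(params, frozen_prefixes):
    """Mark parameters whose name starts with a frozen prefix as non-trainable."""
    prefixes = tuple(frozen_prefixes)
    for p in params:
        p.trainable = not p.name.startswith(prefixes)
    return params


def trainable_summary(params):
    """Return (trainable element count, total element count)."""
    total = sum(p.numel for p in params)
    trainable = sum(p.numel for p in params if p.trainable)
    return trainable, total


# Toy parameter list; names and sizes are made up for illustration.
params = [
    Param("vit.patch_embed.weight", 1_000),
    Param("vit.blocks.0.attn.weight", 4_000),
    Param("projector.linear.weight", 500),
    Param("llm.layers.0.mlp.weight", 8_000),
]

# Freeze the vision encoder; the projector and language model stay trainable.
freeze_by_prefix(params, ["vit."])
trainable, total = trainable_summary(params)
print(f"trainable elements: {trainable} / {total}")
```

Only the parameters left marked as trainable would be handed to the optimizer, so the vision encoder's weights stay fixed during training.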

Model Capabilities

Video Caption Generation
Video Question Answering
Video Understanding
Multimodal Reasoning

Use Cases

Video Content Analysis
Automatic Video Caption Generation
Generates high-quality textual descriptions for videos
Performs strongly on benchmarks such as DREAM-1K
Video Question Answering System
Answers questions about video content
Achieves leading scores on benchmarks such as MVBench and NExT-QA
Research Applications
Multimodal Model Research
Supports research and development of large multimodal models