Tarsier-34b Open-source Video Language Model - Free Deployment for Generating High-quality Video Descriptions

Tarsier 34b

Developed by omni-research

Tarsier-34b is an open-source large-scale video-language model focused on generating high-quality video captions and achieving leading results in multiple public benchmarks.

Video-to-Text

Transformers

Open Source License: Apache-2.0 #Video Caption Generation #Multimodal Video Understanding #Open-source Large Video Model

Downloads 103

Release Time: 7/3/2024

Model Overview

Tarsier-34b is a large video-language model designed to generate high-quality video captions while possessing excellent general video understanding capabilities.

Model Features

Two-stage Training Strategy

Adopts a two-stage training method involving multi-task pre-training and multi-granularity instruction fine-tuning.

Parameter-efficient Training

Freezes ViT parameters and only trains the projection layer and large language model parameters.

Leading in Multiple Benchmarks

Achieves state-of-the-art (SOTA) results on 6 public benchmarks.

Model Capabilities

Video Caption Generation

Video Question Answering

Video Understanding

Multimodal Reasoning

Use Cases

Video Content Analysis

Automatic Video Caption Generation

Generates high-quality textual descriptions for videos

Performs strongly on video description benchmarks such as DREAM-1K

Video Question Answering System

Answers various questions about video content

Achieves leading scores on datasets like MVBench and NExT-QA

Research Applications

Multimodal Model Research

Used for research and development of large multimodal models

🚀 Tarsier Model Card

Tarsier-34b is an open-source large-scale video-language model. It can generate high-quality video descriptions and has excellent general video understanding capabilities, achieving SOTA results on 6 open benchmarks.

🚀 Quick Start

For usage details, please refer to the GitHub repository: https://github.com/bytedance/tarsier

✨ Features

Capable of generating high-quality video descriptions.
Achieves SOTA results on 6 open benchmarks for general video understanding.

📚 Documentation

Model Details

Property	Details
Model Type	Tarsier-34b is an open-source large-scale video-language model designed to generate high-quality video descriptions, with good general video understanding capabilities (SOTA results on 6 open benchmarks).
Model Date	Tarsier-34b was trained in June 2024.
Paper or Resources for More Information	GitHub repo: https://github.com/bytedance/tarsier; paper: https://arxiv.org/abs/2407.00634

Intended Use

Primary Intended Uses

The primary use of Tarsier is research on large multimodal models, especially video description.

Primary Intended Users

The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training Dataset

Tarsier adopts a two-stage training strategy:

Stage-1: Multi-task Pre-training on 13M data
Stage-2: Multi-grained Instruction Tuning on 500K data

In both stages, ViT is frozen and all the parameters of the projection layer and LLM are trained.
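
The training recipe above can be written down as a small configuration sketch. This is only an illustration of which components are updated in each stage; the names `Stage`, `COMPONENTS`, `STAGES`, and `describe` are hypothetical and do not come from the Tarsier codebase.

```python
from dataclasses import dataclass

# Illustrative component labels (assumed names, not Tarsier identifiers).
COMPONENTS = ("vit", "projection", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    num_samples: int      # dataset size as stated in the model card
    frozen: frozenset     # components whose parameters are not updated

    def trainable(self) -> tuple:
        """Return the components that are updated in this stage."""
        return tuple(c for c in COMPONENTS if c not in self.frozen)


# Both stages freeze the ViT and train the projection layer and the LLM.
STAGES = (
    Stage("multi-task pre-training", 13_000_000, frozenset({"vit"})),
    Stage("multi-grained instruction tuning", 500_000, frozenset({"vit"})),
)


def describe(stages=STAGES) -> list:
    """Summarize each stage as one human-readable line."""
    return [
        f"{s.name}: train {', '.join(s.trainable())}; "
        f"freeze {', '.join(sorted(s.frozen))}"
        for s in stages
    ]


if __name__ == "__main__":
    for line in describe():
        print(line)
```

In a PyTorch training script, freezing a component usually amounts to calling `requires_grad_(False)` on the corresponding module; how the actual Tarsier code does this is not stated here.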

Evaluation Dataset

A challenging video description dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
Multi-choice VQA: MVBench, [NExT-QA](https://github.com/doc-doc/NExT-QA) and EgoSchema
Open-ended VQA: MSVD-QA, [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
Video Caption: MSVD-Caption, [MSRVTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)
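
For the multi-choice VQA benchmarks listed above, results are commonly reported as accuracy over the selected option. The sketch below shows one simple way to score such predictions; it is an assumption about the general procedure, not the official evaluation script of any of these benchmarks, and the helper names `normalize_option` and `multi_choice_accuracy` are hypothetical.

```python
def normalize_option(text: str) -> str:
    """Reduce answers such as '(b)', 'B.' or ' b ' to a bare upper-case label."""
    return text.strip().strip("().").upper()


def multi_choice_accuracy(predictions, answers) -> float:
    """Fraction of questions where the predicted option matches the reference."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize_option(p) == normalize_option(a)
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # Toy example with made-up predictions, for illustration only.
    print(multi_choice_accuracy(["A", "(b)", "C."], ["A", "B", "D"]))
```

Open-ended VQA and captioning benchmarks use different metrics and are not covered by this sketch.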

Where to Send Questions or Comments about the Model

https://github.com/bytedance/tarsier/issues

📄 License

NousResearch/Nous-Hermes-2-Yi-34B license.
