Tarsier-7b Open-source Video Language Model - Freely Generate High-quality Video Descriptions with General Understanding Ability

Tarsier 7b

Developed by omni-research

Tarsier-7b is an open-source large-scale video-language model from the Tarsier series, specializing in generating high-quality video descriptions with excellent general video understanding capabilities.

Video-to-Text

Transformers

Release Time: 7/4/2024

Model Overview

Tarsier-7b is an open-source large-scale video-language model designed to generate high-quality video descriptions while possessing outstanding general video understanding capabilities. It is a member of the Tarsier series, built upon the liuhaotian/llava-v1.6-vicuna-7b model.

Model Features

High-Quality Video Description Generation

Capable of generating high-quality video descriptions suitable for various video content.

General Video Understanding Capabilities

Possesses excellent general video understanding capabilities, performing well across multiple benchmarks.

Two-Stage Training Strategy

Adopts a two-stage training strategy of multi-task pre-training and multi-granularity instruction fine-tuning to enhance model performance.

Model Capabilities

Video Description Generation

Video Question Answering

Multi-Granularity Video Understanding

Open-Ended Video Question Answering

Video Caption Generation

Use Cases

Video Content Analysis

Video Description Generation

Generates detailed textual descriptions for videos, suitable for video content indexing and retrieval.

High-quality video descriptions

Video Question Answering

Answers complex questions about video content, applicable in fields like education and entertainment.

Accurate video question answering results

Video Caption Generation

Automatic Caption Generation

Automatically generates captions for videos to enhance accessibility.

High-quality caption content

🚀 Tarsier Model Card

Tarsier-7b is an open-source large-scale video-language model. It generates high-quality video descriptions and has good general video understanding ability. This README describes the model's features, training, evaluation, and usage.

✨ Features

Model Details

Property	Details
Model Type	Tarsier-7b is one of the Tarsier family of open-source large-scale video-language models, designed to generate high-quality video descriptions with good general video understanding capability (Tarsier-34b achieves SOTA results on 6 open benchmarks). Base LLM: [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b)
Model Date	Tarsier-7b was trained in June 2024.
Paper or Resources for More Information	GitHub repo https://github.com/bytedance/tarsier and paper https://arxiv.org/abs/2407.00634

Intended Use

Primary Intended Uses: The primary use of Tarsier is research on large multimodal models, especially video description.
Primary Intended Users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

📚 Documentation

Training Dataset

Tarsier uses a two-stage training strategy.

Stage-1: Multi-task Pre-training on 13M data
Stage-2: Multi-grained Instruction Tuning on 500K data

In both stages, we freeze the ViT and train all parameters of the projection layer and the LLM.
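The freezing scheme described above can be illustrated with a minimal PyTorch sketch. This is not the Tarsier training code: `ToyVideoLM`, its layer sizes, and the attribute names `vit`, `projector`, and `llm` are hypothetical stand-ins chosen only to show how a vision encoder can be frozen while the projection layer and the LLM stay trainable.

```python
import torch
from torch import nn


class ToyVideoLM(nn.Module):
    """Hypothetical stand-in with the three parts named in the text."""

    def __init__(self, vis_dim=32, llm_dim=64, vocab=100):
        super().__init__()
        self.vit = nn.Linear(3 * 8 * 8, vis_dim)      # stands in for the vision encoder
        self.projector = nn.Linear(vis_dim, llm_dim)  # stands in for the projection layer
        self.llm = nn.Linear(llm_dim, vocab)          # stands in for the language model

    def forward(self, frames):
        return self.llm(self.projector(self.vit(frames)))


def freeze_vision_encoder(model):
    """Disable gradients for the ViT and return the parameters left trainable."""
    for param in model.vit.parameters():
        param.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]


model = ToyVideoLM()
trainable = freeze_vision_encoder(model)

# Only the projector and LLM parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One dummy backward pass: the frozen encoder receives no gradients.
frames = torch.randn(4, 3 * 8 * 8)
loss = model(frames).sum()
loss.backward()
```

Passing only the trainable parameters to the optimizer keeps optimizer state from being allocated for the frozen encoder.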

Evaluation Dataset

A challenging video description dataset: [DREAM-1K](https://huggingface.co/datasets/omni-research/DREAM-1K)
Multi-choice VQA: MVBench, [NExT-QA](https://github.com/doc-doc/NExT-QA) and EgoSchema
Open-ended VQA: MSVD-QA, [MSR-VTT-QA](https://opendatalab.com/OpenDataLab/MSR-VTT), [ActivityNet-QA](https://github.com/MILVLG/activitynet-qa) and [TGIF-QA](https://opendatalab.com/OpenDataLab/TGIF-QA)
Video Caption: MSVD-Caption, [MSRVTT-Caption](https://opendatalab.com/OpenDataLab/MSR-VTT), [VATEX](https://eric-xw.github.io/vatex-website/about.html)
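For the multi-choice VQA benchmarks listed above, results are typically reported as accuracy over the answer options. The sketch below shows that computation under simple assumptions: `multiple_choice_accuracy` is a hypothetical helper, and predictions and ground-truth answers are assumed to already be reduced to option letters. It does not reproduce any benchmark's official scoring script.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of predicted option letters that match the ground truth.

    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        pred.strip().upper() == ans.strip().upper()
        for pred, ans in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up example data, not real model outputs.
predicted = ["A", "b", "C", "D"]
ground_truth = ["A", "B", "D", "D"]
accuracy = multiple_choice_accuracy(predicted, ground_truth)
print(f"accuracy = {accuracy:.2f}")  # prints "accuracy = 0.75"
```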

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage

📄 License

lmsys/vicuna-7b-v1.5 license.

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues
