LLaVA-UHD-v2-Vicuna-7B: Open-Source Multimodal Model for Capturing Different Visual Granularities

Llava UHD V2 Vicuna 7B

Developed by YipengZhang

LLaVA-UHD v2 is an advanced multimodal large language model built around a hierarchical window transformer, capable of capturing different visual granularities through a high-resolution feature pyramid.

Multimodal Fusion

Transformers

#High-resolution visual understanding #Multimodal large language model #Feature pyramid integration

Downloads 103

Release Time: 11/26/2024

Model Overview

The model is primarily intended for research on large multimodal models and chatbots, in fields such as computer vision and natural language processing.

Model Features

High-resolution feature pyramid

Captures different visual granularities by constructing and integrating a high-resolution feature pyramid.

Hierarchical window transformer

Uses a hierarchical window transformer to construct the high-resolution feature pyramid and integrate it into the model.

Large-scale multimodal training

Uses a mixed dataset of 858k samples for supervised fine-tuning (SFT).

Model Capabilities

Multimodal understanding

Vision-language interaction

High-resolution image analysis

Natural language generation

Use Cases

Academic research

Multimodal model research

Used to explore advanced model architectures that combine vision and language

Chatbot development

Build an intelligent dialogue system with visual understanding capabilities

Industrial applications

Intelligent content analysis

Conduct joint analysis and understanding of image and text content

🚀 LLaVA-UHD v2 Model Card

LLaVA-UHD v2 is an advanced multimodal large language model (MLLM) centered on a hierarchical window transformer. It captures diverse visual granularity by constructing and integrating a high-resolution feature pyramid, which makes it useful for research on large multimodal models and chatbots.
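To make the idea of "different visual granularities" concrete, here is a toy sketch of a feature pyramid whose levels are summarized window by window. It is not the authors' implementation: the function names (`avg_pool_2x`, `build_pyramid`, `pool_windows`) are hypothetical, the pyramid is built by plain average pooling, and a fixed max over each window stands in for the learned attention that a real hierarchical window transformer would use. See the paper and repository for the actual architecture.

```python
from typing import List

Grid = List[List[float]]  # one scalar "feature" per spatial cell (toy assumption)


def avg_pool_2x(grid: Grid) -> Grid:
    """Halve the resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(grid), len(grid[0])
    return [
        [(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


def build_pyramid(finest: Grid, levels: int) -> List[Grid]:
    """Return [finest, coarser, ...]; each level has half the previous resolution."""
    pyramid = [finest]
    for _ in range(levels - 1):
        pyramid.append(avg_pool_2x(pyramid[-1]))
    return pyramid


def pool_windows(pyramid: List[Grid], out_size: int) -> List[List[List[float]]]:
    """Summarize every level on the same out_size x out_size grid of windows.

    Each output cell holds one value per pyramid level, so a single "token"
    mixes fine and coarse views of the same image region. Max is only a
    stand-in for a learned, attention-based aggregation.
    """
    tokens = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            token = []
            for level in pyramid:
                win_h = len(level) // out_size
                win_w = len(level[0]) // out_size
                window = [level[r][c]
                          for r in range(i * win_h, (i + 1) * win_h)
                          for c in range(j * win_w, (j + 1) * win_w)]
                token.append(max(window))
            row.append(token)
        tokens.append(row)
    return tokens


if __name__ == "__main__":
    finest = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
    pyramid = build_pyramid(finest, levels=3)   # 8x8, 4x4, 2x2
    tokens = pool_windows(pyramid, out_size=2)  # 2x2 tokens, 3 values each
    print(tokens[0][0])
```

In this toy, the finest level keeps the sharpest response inside each window while coarser levels report progressively smoothed values, which is the intuition behind combining several granularities in one token.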

📚 Documentation

Model details

Property	Details
Model Type	LLaVA-UHD v2, an advanced MLLM centered on a hierarchical window transformer that captures diverse visual granularity by constructing and integrating a high-resolution feature pyramid.
Model Date	LLaVA-UHD v2 was trained in November 2024.
Base LLM Model	lmsys/vicuna-7b-v1.5
Paper or resources for more information	https://github.com/thunlp/LLaVA-UHD
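Because the base LLM is lmsys/vicuna-7b-v1.5, chat prompts are likely to follow the Vicuna v1 conversation format. The sketch below builds such a prompt. It is an assumption, not something this card states: the system text and USER/ASSISTANT roles follow the published Vicuna v1.5 format, the `<image>` placeholder follows the LLaVA convention, and `build_prompt` is a hypothetical helper. Check the repository for the template the model was actually trained with.

```python
from typing import List, Optional, Tuple

# Assumed: the system prompt used by Vicuna v1.5 chat models.
VICUNA_V1_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
# Assumed: LLaVA-style placeholder that is later replaced by visual tokens.
IMAGE_PLACEHOLDER = "<image>"


def build_prompt(turns: List[Tuple[str, Optional[str]]], with_image: bool = True) -> str:
    """Build a Vicuna-v1-style prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; use None as
    the reply of the last turn to leave the prompt open for generation.
    """
    prompt = VICUNA_V1_SYSTEM + " "
    for index, (user_msg, assistant_msg) in enumerate(turns):
        if index == 0 and with_image:
            # The image placeholder goes in front of the first user message.
            user_msg = IMAGE_PLACEHOLDER + "\n" + user_msg
        prompt += "USER: " + user_msg + " "
        if assistant_msg is None:
            prompt += "ASSISTANT:"
        else:
            prompt += "ASSISTANT: " + assistant_msg + "</s>"
    return prompt


if __name__ == "__main__":
    print(build_prompt([("What is in the picture?", None)]))
```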

License

Where to send questions or comments about the model: https://github.com/thunlp/LLaVA-UHD/issues

Intended use

Primary intended uses: The primary use of LLaVA-UHD v2 is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

VDIM Pretrain: MS-COCO stuff 2017
Pretrain: LLaVA-Pretrain 558K (filtered image-text pairs from LAION/CC/SBU, captioned by BLIP)
SFT: 858k-mixed dataset in https://huggingface.co/datasets/YipengZhang/LLaVA-UHD-v2-SFT-Data
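The SFT data listed above is hosted as a Hugging Face dataset repository, so it can be fetched with the `huggingface_hub` client. The following is a minimal sketch under assumptions: `dataset_repo_id` and `download_sft_data` are hypothetical helper names, the `huggingface_hub` package must be installed separately, and nothing is assumed about the file layout inside the repository. The download may be large.

```python
from urllib.parse import urlparse

SFT_DATA_URL = "https://huggingface.co/datasets/YipengZhang/LLaVA-UHD-v2-SFT-Data"


def dataset_repo_id(url: str) -> str:
    """Extract "<owner>/<name>" from a huggingface.co dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError("not a Hugging Face dataset URL: " + url)
    return parts[1] + "/" + parts[2]


def download_sft_data(local_dir: str) -> str:
    """Download the whole dataset repository and return the local path.

    Imported lazily so that the rest of this module works without the
    third-party dependency installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=dataset_repo_id(SFT_DATA_URL),
        repo_type="dataset",
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(dataset_repo_id(SFT_DATA_URL))
    # download_sft_data("./llava_uhd_v2_sft")  # uncomment to actually download
```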

Citation

If you find LLaVA-UHD v2 useful for your research and applications, please cite using this BibTeX:

@article{zhang2024llavauhdv2,
  title={LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer},
  author={Yipeng Zhang and Yifan Liu and Zonghao Guo and Yidan Zhang and Xuesong Yang and Chi Chen and Jun Song and Bo Zheng and Yuan Yao and Zhiyuan Liu and Tat-Seng Chua and Maosong Sun},
  journal={arXiv preprint arXiv:2412.13871},
  year={2024}
}
