NousResearch_Nous-Hermes-2-Vision-GGUF Open-Source Model - A Vision-Language Tool Supporting Multimodal Interaction


Developed by PsiPi

A vision-language model based on Mistral-7B that integrates the SigLIP-400M visual encoder with function calling capabilities to support multimodal interaction.

Image-to-Text | English | Open Source License: Apache-2.0
#Visual Language Function Calling #SigLIP Efficient Encoding #Multimodal Dialogue System

Downloads 905

Release Time: 12/7/2023

Model Overview

This vision-language model pairs a SigLIP vision encoder with training on function calling data, so it can handle visual-language tasks and return structured outputs that downstream code can use for automation.

Model Features

Efficient Visual Encoding

Uses the SigLIP-400M encoder in place of the 3B-parameter vision encoders used by comparable models, which keeps the architecture lightweight; the authors report a significant performance boost.

Function Calling Capability

Trained on 150K private function calling examples; given a JSON schema after a <fn_call> tag, the model returns structured output that matches the schema.

Multimodal Interaction

Supports joint processing of image understanding and text generation for complex visual-language tasks

Model Capabilities

Image understanding

Visual question answering

Structured data extraction

Multi-turn dialogue

Automated task execution

Use Cases

Intelligent Customer Service

Product Identification and Recommendation

Provides detailed information and suggestions based on product images uploaded by users

Accurately identifies food items in menus and generates structured outputs

Automation Systems

Visual Data Extraction

Extracts structured information from images and converts it into JSON format

Successfully extracts attributes such as bus color, features, and status

🚀 Nous-Hermes-2-Vision - Mistral 7B

Nous-Hermes-2-Vision is a Vision-Language Model built on Mistral 7B. It combines a SigLIP-400M vision encoder with a custom training dataset that includes function calling, and it targets both conversation and visual tasks.

📦 Metadata

Property	Details
Language	English
License	Apache-2.0
Tags	mistral, instruct, finetune, chatml, gpt4, synthetic data, distillation, multimodal, llava
Base Model	mistralai/Mistral-7B-v0.1
Pipeline Tag	image-text-to-text
Model Name	Nous-Hermes-2-Vision

✨ Features

GGUF Quants by Twobob, with thanks to @jartine and @cmp-nct for the assistance.
The prompt format is Vicuna; see the Prompt Format section below.
There is a known bug in the inference that is likely to be fixed upstream.


Model description

Nous-Hermes-2-Vision is a pioneering Vision-Language Model that builds on the advancements of the renowned OpenHermes-2.5-Mistral-7B by teknium. It features two key enhancements:

SigLIP-400M Integration: Instead of relying on large 3B vision encoders, it uses the powerful SigLIP-400M. This not only simplifies the model architecture, making it more lightweight, but also takes advantage of SigLIP's capabilities, resulting in a significant performance boost.
Custom Dataset Enriched with Function Calling: The training data includes function calling, transforming Nous-Hermes-2-Vision into a Vision-Language Action Model. Developers can use it to create various automations.

This project is led by qnguyen3 and teknium.

📦 Training

Dataset

220K from LVIS-INSTRUCT4V
60K from ShareGPT4V
150K Private Function Calling Data
50K conversations from teknium's OpenHermes-2.5
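
As a quick arithmetic check on the mix above, the sketch below adds up the four sources and prints each one's share. The counts are the rounded "K" figures from the list, so the shares are approximate, and the dictionary keys are labels chosen here for illustration.

```python
# Rounded example counts taken from the dataset list above.
mix = {
    "LVIS-INSTRUCT4V": 220_000,
    "ShareGPT4V": 60_000,
    "Private function calling": 150_000,
    "OpenHermes-2.5": 50_000,
}

total = sum(mix.values())
# Fraction of the whole training mix contributed by each source.
shares = {name: count / total for name, count in mix.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total}")
```

By these figures, function calling data makes up a little under a third of the training mix.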

💻 Usage Examples

Prompt Format

Similar to other LLaVA variants, this model uses Vicuna-V1 as its prompt template. Refer to conv_llava_v1 in this file.
For the Gradio UI, visit this GitHub Repo.
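
To make the prompt format concrete, here is a minimal sketch of assembling a Vicuna-V1 style prompt in plain Python. The system text, the separators, and the `<image>` placeholder are assumptions modeled on LLaVA's conv_llava_v1 template and should be checked against the referenced file; `build_prompt` is a hypothetical helper, not part of any library.

```python
# Assumed values modeled on LLaVA's conv_llava_v1; verify against the real template.
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)
SEP = " "            # follows the system text and each user turn
SEP2 = "</s>"        # follows each completed assistant turn
IMAGE_TOKEN = "<image>"  # placeholder where the image features are inserted


def build_prompt(turns, has_image=True):
    """Build a prompt from a list of (user_message, assistant_reply_or_None) pairs.

    A reply of None marks the turn the model is expected to complete.
    """
    prompt = SYSTEM + SEP
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and has_image:
            # The image placeholder goes at the start of the first user turn.
            user = IMAGE_TOKEN + "\n" + user
        prompt += "USER: " + user + SEP
        if assistant is None:
            prompt += "ASSISTANT:"
        else:
            prompt += "ASSISTANT: " + assistant + SEP2
    return prompt


print(build_prompt([("What is in the picture?", None)]))
```

How the `<image>` placeholder is replaced with image features depends on the inference stack, so treat that part as illustrative only.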

Function Calling

For function calling, the message should start with a <fn_call> tag. Here is an example:

<fn_call>{
  "type": "object",
  "properties": {
    "bus_colors": {
      "type": "array",
      "description": "The colors of the bus in the image.",
      "items": {
        "type": "string",
        "enum": ["red", "blue", "green", "white"]
      }
    },
    "bus_features": {
      "type": "string",
      "description": "The features seen on the back of the bus."
    },
    "bus_location": {
      "type": "string",
      "description": "The location of the bus (driving or pulled off to the side).",
      "enum": ["driving", "pulled off to the side"]
    }
  }
}

Output:

{
  "bus_colors": ["red", "white"],
  "bus_features": "An advertisement",
  "bus_location": "driving"
}
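
The exchange above can be wrapped in two small helpers: one that prefixes a schema with the <fn_call> tag, and one that pulls the JSON object out of the model's reply. This is a sketch using only the Python standard library; `build_fn_call_message` and `parse_fn_call_reply` are hypothetical names, and the assumption that a reply may contain text around the JSON object is a precaution, not documented behavior.

```python
import json

FN_CALL_TAG = "<fn_call>"


def build_fn_call_message(schema):
    """Return the user message: the <fn_call> tag followed by the schema as JSON."""
    return FN_CALL_TAG + json.dumps(schema, indent=2)


def parse_fn_call_reply(reply):
    """Decode the first JSON object found in the reply.

    Raises ValueError if the reply has no '{' or the JSON is malformed
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode stops at the end of the first JSON value and ignores trailing text.
    obj, _end = json.JSONDecoder().raw_decode(reply[start:])
    if not isinstance(obj, dict):
        raise ValueError("reply JSON is not an object")
    return obj


schema = {
    "type": "object",
    "properties": {
        "bus_location": {
            "type": "string",
            "enum": ["driving", "pulled off to the side"],
        }
    },
}
message = build_fn_call_message(schema)
result = parse_fn_call_reply('{"bus_location": "driving"}')
print(result["bus_location"])
```

Sending `message` to the model together with the image is left to whichever inference stack is in use.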

📚 Examples

Chat

[Image: example chat screenshot]

Function Calling

Input image: [image not shown]

Input message:

<fn_call>{
    "type": "object",
    "properties": {
      "food_list": {
        "type": "array",
        "description": "List of all the food",
        "items": {
          "type": "string"
        }
      }
    }
}

Output:

{
    "food_list": [
        "Double Burger",
        "Cheeseburger",
        "French Fries",
        "Shakes",
        "Coffee"
    ]
}
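
Because the model's output is free-form text, it can be worth checking a reply against the schema before acting on it. The sketch below is a hand-written checker covering only the schema features used in the examples here (string, array, enum); it is not a full JSON Schema validator, `check_against_schema` is a hypothetical helper, and it treats every listed property as required, which is stricter than JSON Schema itself.

```python
def _check_value(name, spec, value):
    """Check one value against a spec with 'type', optional 'enum', optional 'items'."""
    problems = []
    kind = spec.get("type")
    if kind == "string":
        if not isinstance(value, str):
            problems.append(f"{name}: expected string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name}: {value!r} not in enum")
    elif kind == "array":
        if not isinstance(value, list):
            problems.append(f"{name}: expected array")
        else:
            # Recurse into each element using the 'items' spec.
            for i, item in enumerate(value):
                problems.extend(_check_value(f"{name}[{i}]", spec.get("items", {}), item))
    return problems


def check_against_schema(schema, data):
    """Return a list of problems; an empty list means the data fits the schema subset."""
    problems = []
    for name, spec in schema.get("properties", {}).items():
        if name not in data:
            # Assumption: every listed property is expected in the output.
            problems.append(f"missing field: {name}")
            continue
        problems.extend(_check_value(name, spec, data[name]))
    return problems


food_schema = {
    "type": "object",
    "properties": {
        "food_list": {
            "type": "array",
            "description": "List of all the food",
            "items": {"type": "string"},
        }
    },
}
output = {"food_list": ["Double Burger", "Cheeseburger", "French Fries", "Shakes", "Coffee"]}
print(check_against_schema(food_schema, output))
```

For schemas that go beyond these few keywords, a dedicated JSON Schema validation library would be a better fit than this checker.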

⚠️ Important Note

Inference still has a known bug, which is likely to be fixed upstream.
