Fluxi AI - Small Vision
Fluxi AI - Small Vision is a versatile AI assistant with multimodal intelligence, multilingual comprehension, function execution capabilities, advanced RAG, and natural and friendly interaction.
Quick Start
We offer a set of tools to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"JJhooww/Fluxi_AI_Small_Vision", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in scenarios with multiple images and videos.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "JJhooww/Fluxi_AI_Small_Vision",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# Default processor
processor = AutoProcessor.from_pretrained("JJhooww/Fluxi_AI_Small_Vision")
# The default range for the number of visual tokens per image in the model is 4 - 16384. You can configure min_pixels and max_pixels according to your needs, such as a token count range of 256 - 1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("JJhooww/Fluxi_AI_Small_Vision", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Without qwen_vl_utils
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in reduced precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"JJhooww/Fluxi_AI_Small_Vision", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("JJhooww/Fluxi_AI_Small_Vision")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preprocess inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
# Inference: Generate output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
output_ids[len(input_ids) :]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Inference with multiple images
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Inference with video
# Messages containing a list of images as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Batch inference
# Example messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these images?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"}
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
Features
- Multimodal Intelligence: Capable of handling multimodal interactions, including text, images, and videos.
- Multilingual Comprehension: Understands and processes multiple languages, including Portuguese, English, Spanish, French, German, Japanese, Korean, Arabic, and Vietnamese.
- Function Execution Capability: Can execute predefined functions, with structured input/output handling and support for complex parameters.
- Advanced RAG: Integrates context with documents, extracts relevant information, and provides contextual and adaptive responses.
- Natural and Friendly Interaction: Offers a more natural and user-friendly interaction experience.
Installation
The code for Qwen2-VL is available in the latest version of Hugging Face Transformers. We recommend building from the source code with the following command:
pip install git+https://github.com/huggingface/transformers
Otherwise, you may encounter the following error:
KeyError: 'qwen2_vl'
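If you are not sure whether your installed transformers build already includes the Qwen2-VL classes, a quick check like the one below can save a confusing traceback. This is only a small sketch; it prints the installed version and attempts the import.
import transformers
# Print the installed version and check that the Qwen2-VL classes are available.
print("transformers version:", transformers.__version__)
try:
    from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
    print("Qwen2-VL support detected.")
except ImportError:
    print("Qwen2-VL classes not found; install transformers from source as shown above.")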
Usage Examples
Basic Usage
1. Function Call Example
# Structure of a function call dataset
messages = [
{
"role": "system",
"content": [{
"type": "text",
"text": """# Tools
You can call one or more functions to assist in the user's query.
You receive function signatures in the XML tags <tools></tools>:
<tools>
{
"type": "function",
"function": {
"name": "criar_contato",
"description": "Create a new contact",
"parameters": {
"type": "object",
"properties": {
"nome": {"type": "string", "description": "The name of the contact"},
"email": {"type": "string", "description": "The email address of the contact"}
},
"required": ["nome", "email"]
}
}
}
</tools>"""
}]
},
{
"role": "user",
"content": [{
"type": "text",
"text": "I need to create a new contact for my friend John Doe. His email is johndoe@example.com."
}]
}
]
# Assistant's response
response = {
"role": "assistant",
"content": [{
"type": "text",
"text": "<tool_call>\n{'name': 'criar_contato', 'arguments': {'nome': 'John Doe', 'email': 'johndoe@example.com'}}</tool_call>"
}]
}
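The model emits the call as plain text wrapped in <tool_call> tags, so your application has to parse that payload and dispatch it to a real implementation. The sketch below shows one way to do this, assuming the response dictionary from the example above; parse_tool_call and the local criar_contato implementation are illustrative helpers, not part of the model or dataset.
import ast
import re

def parse_tool_call(text):
    # Pull the payload out of the <tool_call>...</tool_call> tags.
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    # The example payload uses Python-style single quotes, so ast.literal_eval
    # is more forgiving than json.loads here (an assumption about the format).
    return ast.literal_eval(match.group(1))

# Hypothetical local implementation of the declared "criar_contato" function.
def criar_contato(nome, email):
    return {"status": "created", "nome": nome, "email": email}

call = parse_tool_call(response["content"][0]["text"])
if call is not None and call["name"] == "criar_contato":
    print(criar_contato(**call["arguments"]))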
2. RAG Implementation Example
# Structure of a RAG dataset
messages = [
{
"role": "user",
"content": [{
"type": "text",
"text": "What are the main countries of origin of foreign tourists visiting Brazil?"
}]
},
{
"role": "system",
"content": [{
"type": "text",
"text": "{'doc04': {'text': 'Brazil received more than 6 million foreign tourists, mainly from Argentina, the United States, Paraguay, Chile, Uruguay, and France...'}}"
}]
}
]
# Assistant's response
response = {
"role": "assistant",
"content": [{
"type": "text",
"text": "The main countries of origin of foreign tourists visiting Brazil are:\n\n- Argentina\n- United States\n- Paraguay\n- Chile\n- Uruguay\n- France"
}]
}
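In a real pipeline the document payload in the system message comes from your own retrieval step. The sketch below shows one way to assemble that message and generate an answer, reusing the model and processor loaded in the Quick Start; retrieve_documents and the doc ID scheme are placeholder assumptions.
# Hypothetical retrieval step -- replace with your own search or vector store.
def retrieve_documents(query):
    return {"doc04": {"text": "Brazil received more than 6 million foreign tourists, "
                              "mainly from Argentina, the United States, Paraguay, "
                              "Chile, Uruguay, and France..."}}

query = "What are the main countries of origin of foreign tourists visiting Brazil?"
messages = [
    {"role": "user", "content": [{"type": "text", "text": query}]},
    # Retrieved documents are injected as a system message, mirroring the dataset structure above.
    {"role": "system", "content": [{"type": "text", "text": str(retrieve_documents(query))}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])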
3. System-Guided Agent Example
# Configuration of a system-guided agent
messages = [
{
"role": "system",
"content": [{
"type": "text",
"text": "You are an expert in various scientific disciplines, including physics, chemistry, and biology. Explain scientific concepts, theories, and phenomena in an engaging and accessible way."
}]
},
{
"role": "user",
"content": [{
"type": "text",
"text": "Can you help me write an essay on deforestation?"
}]
}
]
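A system-guided conversation like this is normally multi-turn. The sketch below wraps text-only generation in a small helper (reusing the model and processor from the Quick Start) and appends the assistant's reply before asking a follow-up; the follow-up question is only an illustration.
def generate_reply(messages, max_new_tokens=512):
    # Text-only generation with the model and processor loaded in the Quick Start.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

reply = generate_reply(messages)
messages.append({"role": "assistant", "content": [{"type": "text", "text": reply}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Can you focus the essay on the Amazon region?"}]})
print(generate_reply(messages))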
Documentation
General Model Overview
This is a versatile AI assistant capable of handling multimodal interactions, including text, images, and videos. The model supports function calls, RAG (Retrieval-Augmented Generation), and system-guided interactions, with enhanced capabilities in Portuguese.
Base Model
This assistant is based on the Qwen2-VL-7B-Instruct model, a powerful multimodal language model developed by Qwen. The main features include:
- 7 billion parameters
- Advanced architecture for vision and language
- Support for multiple image resolutions
- Video processing capability
- Specific optimizations for multimodal tasks
Main Functionalities
Multimodal Processing
- Text generation and comprehension
- Image analysis and understanding
- Video comprehension (videos over 20 minutes long)
- Support for various input formats (see the sketch after this list):
  - Local files
  - Base64 images
  - URLs
  - Interleaved combinations of images and videos
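qwen_vl_utils accepts these source forms directly in the message content. The entries below are a minimal sketch of how each form looks; the local paths and the base64 payload are placeholders.
# Illustrative content entries (paths and the base64 payload are placeholders):
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/your/image.jpg"},  # local file
        {"type": "image", "image": "http://path/to/your/image.jpg"},   # URL
        {"type": "image", "image": "data:image;base64,/9j/..."},       # base64-encoded image
        {"type": "text", "text": "Describe these images."},
    ],
}]
# These entries are consumed by process_vision_info exactly as in the Quick Start snippet.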
Multilingual Support
The model understands and processes multiple languages, including:
- Portuguese (enhanced support)
- English
- Spanish, French, German, and other European languages
- Japanese and Korean
- Arabic and Vietnamese
Key Features
1. Function Calls
- Ability to execute predefined functions
- Structured input/output handling
- Support for complex parameters
- Optimization for function calls in Portuguese
2. Retrieval-Augmented Generation (RAG)
- Integration of context with documents
- Extraction of relevant information
- Contextual and adaptive responses
- Optimization for Portuguese-language content
3. System-Guided Interactions
- Function- and role-based responses
- Adaptation to different knowledge areas
- Enhanced contextual understanding
- Specific optimization for Portuguese-language agents
Portuguese Language Optimizations
Function Calls
- Function names and descriptions in Portuguese
- Brazilian conventions for parameter naming
- Localized error messages and responses
- Function selection based on Brazilian use cases
Advanced RAG
- Optimized content retrieval for Portuguese
- Priority for Brazilian context
- Higher precision in local information extraction
- Improved language pattern recognition
Specific Enhancements for Agents
- Enhanced Brazilian cultural context
- Integration with regional knowledge
- Improved understanding of Portuguese nuances
- Optimization for specific Brazilian domains
Technical Details
The model is built upon the Qwen2-VL-7B-Instruct architecture, which has 7 billion parameters. It uses an advanced architecture for vision and language processing, supporting multiple image resolutions and video processing. The RAG mechanism integrates external documents to provide more accurate and relevant information, and the function call feature allows the model to interact with external systems.
License
This project is licensed under the Apache-2.0 license.
Citations
Base Model Citation
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}
Model Limitations
- No audio support
- Training data limited to June 2023
- Restricted recognition of specific individuals and brands
- Reduced performance on complex multi-step tasks
- Difficulty with accurate object counting
- Limited 3D spatial reasoning






