Fluxi AI - Small Vision
Fluxi AI - Small Vision is a versatile AI assistant with multimodal intelligence, multilingual comprehension, function execution capabilities, advanced RAG, and natural and friendly interaction.
Quick Start
We offer a set of tools to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"JJhooww/Fluxi_AI_Small_Vision", torch_dtype="auto", device_map="auto"
)
# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in scenarios with multiple images and videos.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
# "JJhooww/Fluxi_AI_Small_Vision",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# Default processor
processor = AutoProcessor.from_pretrained("JJhooww/Fluxi_AI_Small_Vision")
# The default range for the number of visual tokens per image in the model is 4 - 16384. You can configure min_pixels and max_pixels according to your needs, such as a token count range of 256 - 1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("JJhooww/Fluxi_AI_Small_Vision", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Without qwen_vl_utils
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
# Load the model in reduced precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"JJhooww/Fluxi_AI_Small_Vision", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("JJhooww/Fluxi_AI_Small_Vision")
# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Preprocess inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'
inputs = processor(
text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")
# Inference: Generate output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
output_ids[len(input_ids) :]
for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Inference with multiple images
# Messages containing multiple images and a text query
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "Identify the similarities between these images."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Inference with video
# Messages containing a list of images as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Batch inference
# Example messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these images?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"}
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
Features
- Multimodal Intelligence: Capable of handling multimodal interactions, including text, images, and videos.
- Multilingual Comprehension: Understands and processes multiple languages, including Portuguese, English, Spanish, French, German, Japanese, Korean, Arabic, and Vietnamese.
- Function Execution Capability: Can execute predefined functions, with structured input/output handling and support for complex parameters.
- Advanced RAG: Integrates context with documents, extracts relevant information, and provides contextual and adaptive responses.
- Natural and Friendly Interaction: Offers a more natural and user-friendly interaction experience.
Installation
The code for Qwen2-VL is available in the latest version of Hugging Face Transformers. We recommend building from the source code with the following command:
pip install git+https://github.com/huggingface/transformers
Otherwise, you may encounter the following error:
KeyError: 'qwen2_vl'
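If you are not sure whether your installed transformers build already includes the Qwen2-VL classes, a quick check like the one below can save a confusing traceback. This is only a small sketch; it prints the installed version and attempts the import.
import transformers
# Print the installed version and check that the Qwen2-VL classes are available.
print("transformers version:", transformers.__version__)
try:
    from transformers import Qwen2VLForConditionalGeneration  # noqa: F401
    print("Qwen2-VL support detected.")
except ImportError:
    print("Qwen2-VL classes not found; install transformers from source as shown above.")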
Usage Examples
Basic Usage
1. Function Call Example
# Structure of a function call dataset
messages = [
{
"role": "system",
"content": [{
"type": "text",
"text": """# Tools
You can call one or more functions to assist in the user's query.
You receive function signatures in the XML tags <tools></tools>:
<tools>
{
"type": "function",
"function": {
"name": "criar_contato",
"description": "Create a new contact",
"parameters": {
"type": "object",
"properties": {
"nome": {"type": "string", "description": "The name of the contact"},
"email": {"type": "string", "description": "The email address of the contact"}
},
"required": ["nome", "email"]
}
}
}
</tools>"""
}]
},
{
"role": "user",
"content": [{
"type": "text",
"text": "I need to create a new contact for my friend John Doe. His email is johndoe@example.com."
}]
}
]
# Assistant's response
response = {
"role": "assistant",
"content": [{
"type": "text",
"text": "<tool_call>\n{'name': 'criar_contato', 'arguments': {'nome': 'John Doe', 'email': 'johndoe@example.com'}}</tool_call>"
}]
}
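The model emits the call as plain text wrapped in <tool_call> tags, so your application has to parse that payload and dispatch it to a real implementation. The sketch below shows one way to do this, assuming the response dictionary from the example above; parse_tool_call and the local criar_contato implementation are illustrative helpers, not part of the model or dataset.
import ast
import re

def parse_tool_call(text):
    # Pull the payload out of the <tool_call>...</tool_call> tags.
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    # The example payload uses Python-style single quotes, so ast.literal_eval
    # is more forgiving than json.loads here (an assumption about the format).
    return ast.literal_eval(match.group(1))

# Hypothetical local implementation of the declared "criar_contato" function.
def criar_contato(nome, email):
    return {"status": "created", "nome": nome, "email": email}

call = parse_tool_call(response["content"][0]["text"])
if call is not None and call["name"] == "criar_contato":
    print(criar_contato(**call["arguments"]))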
2. RAG Implementation Example
# Structure of a RAG dataset
messages = [
{
"role": "user",
"content": [{
"type": "text",
"text": "What are the main countries of origin of foreign tourists visiting Brazil?"
}]
},
{
"role": "system",
"content": [{
"type": "text",
"text": "{'doc04': {'text': 'Brazil received more than 6 million foreign tourists, mainly from Argentina, the United States, Paraguay, Chile, Uruguay, and France...'}}"
}]
}
]
# Assistant's response
response = {
"role": "assistant",
"content": [{
"type": "text",
"text": "The main countries of origin of foreign tourists visiting Brazil are:\n\n- Argentina\n- United States\n- Paraguay\n- Chile\n- Uruguay\n- France"
}]
}
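In a real pipeline the document payload in the system message comes from your own retrieval step. The sketch below shows one way to assemble that message and generate an answer, reusing the model and processor loaded in the Quick Start; retrieve_documents and the doc ID scheme are placeholder assumptions.
# Hypothetical retrieval step -- replace with your own search or vector store.
def retrieve_documents(query):
    return {"doc04": {"text": "Brazil received more than 6 million foreign tourists, "
                              "mainly from Argentina, the United States, Paraguay, "
                              "Chile, Uruguay, and France..."}}

query = "What are the main countries of origin of foreign tourists visiting Brazil?"
messages = [
    {"role": "user", "content": [{"type": "text", "text": query}]},
    # Retrieved documents are injected as a system message, mirroring the dataset structure above.
    {"role": "system", "content": [{"type": "text", "text": str(retrieve_documents(query))}]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], padding=True, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])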
3. System-Guided Agent Example
# Configuration of a system-guided agent
messages = [
{
"role": "system",
"content": [{
"type": "text",
"text": "You are an expert in various scientific disciplines, including physics, chemistry, and biology. Explain scientific concepts, theories, and phenomena in an engaging and accessible way."
}]
},
{
"role": "user",
"content": [{
"type": "text",
"text": "Can you help me write an essay on deforestation?"
}]
}
]
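A system-guided conversation like this is normally multi-turn. The sketch below wraps text-only generation in a small helper (reusing the model and processor from the Quick Start) and appends the assistant's reply before asking a follow-up; the follow-up question is only an illustration.
def generate_reply(messages, max_new_tokens=512):
    # Text-only generation with the model and processor loaded in the Quick Start.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], padding=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

reply = generate_reply(messages)
messages.append({"role": "assistant", "content": [{"type": "text", "text": reply}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Can you focus the essay on the Amazon region?"}]})
print(generate_reply(messages))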
Documentation
General Model Overview
This is a versatile AI assistant capable of handling multimodal interactions, including text, images, and videos. The model supports function calls, RAG (Retrieval-Augmented Generation), and system-guided interactions, with enhanced capabilities in Portuguese.
Base Model
This assistant is based on the Qwen2-VL-7B-Instruct model, a powerful multimodal language model developed by Qwen. The main features include:
- 7 billion parameters
- Advanced architecture for vision and language
- Support for multiple image resolutions
- Video processing capability
- Specific optimizations for multimodal tasks
Main Functionalities
Multimodal Processing
- Text generation and comprehension
- Image analysis and understanding
- Video comprehension (videos over 20 minutes long)
- Support for various input formats (see the sketch after this list):
  - Local files
  - Base64 images
  - URLs
  - Interleaved combinations of images and videos
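qwen_vl_utils accepts these source forms directly in the message content. The entries below are a minimal sketch of how each form looks; the local paths and the base64 payload are placeholders.
# Illustrative content entries (paths and the base64 payload are placeholders):
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/your/image.jpg"},  # local file
        {"type": "image", "image": "http://path/to/your/image.jpg"},   # URL
        {"type": "image", "image": "data:image;base64,/9j/..."},       # base64-encoded image
        {"type": "text", "text": "Describe these images."},
    ],
}]
# These entries are consumed by process_vision_info exactly as in the Quick Start snippet.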
Multilingual Support
The model understands and processes multiple languages, including:
- Portuguese (enhanced support)
- English
- Spanish, French, German, and other European languages
- Japanese and Korean
- Arabic and Vietnamese
Key Features
1. Function Calls
- Ability to execute predefined functions
- Structured input/output handling
- Support for complex parameters
- Optimization for function calls in Portuguese
2. Retrieval-Augmented Generation (RAG)
- Integration of context with documents
- Extraction of relevant information
- Contextual and adaptive responses
- Optimization for Portuguese-language content
3. System-Guided Interactions
- Function- and role-based responses
- Adaptation to different knowledge areas
- Enhanced contextual understanding
- Specific optimization for Portuguese-language agents
Portuguese Language Optimizations
Function Calls
- Function names and descriptions in Portuguese
- Brazilian conventions for parameter naming
- Localized error messages and responses
- Function selection based on Brazilian use cases
Advanced RAG
- Optimized content retrieval for Portuguese
- Priority for Brazilian context
- Higher precision in local information extraction
- Improved language pattern recognition
Specific Enhancements for Agents
- Enhanced Brazilian cultural context
- Integration with regional knowledge
- Improved understanding of Portuguese nuances
- Optimization for specific Brazilian domains
Technical Details
The model is built upon the Qwen2-VL-7B-Instruct architecture, which has 7 billion parameters. It uses an advanced architecture for vision and language processing, supporting multiple image resolutions and video processing. The RAG mechanism integrates external documents to provide more accurate and relevant information, and the function call feature allows the model to interact with external systems.
License
This project is licensed under the Apache-2.0 license.
Citations
Base Model Citation
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}
Model Limitations
- No audio support
- Training data limited to June 2023
- Restricted recognition of specific individuals and brands
- Reduced performance on complex multi-step tasks
- Difficulty with accurate object counting
- Limited 3D spatial reasoning






