Fintor-GUI-S2: A Free, Open-Source Model for GUI Multimodal Tasks and Interface Operations

Fintor GUI S2

Developed by Fintor

Fintor-GUI-S2 is a GUI foundation model fine-tuned from UI-TARS-7B-DPO that specializes in multimodal tasks for graphical user interfaces (GUIs).

Image-to-Text

Transformers

Open Source License: Apache-2.0 #GUI multimodal understanding #Screen element localization #Instruction fine-tuning enhancement

Downloads 190

Release Time : 3/12/2025

Model Overview

This is a multimodal model optimized for graphical user interfaces (GUIs). It takes GUI screenshots and text as input and generates text about the interface.

Model Features

GUI optimization

Specially fine-tuned for graphical user interface tasks, demonstrating excellent performance in GUI-related tasks.

Multimodal capability

Capable of processing both image and text information simultaneously, achieving cross-modal understanding and generation.

Performance improvement

Outperforms the base model on both Screenspot benchmarks: 33.0 vs. 27.0 on Screenspot Pro and 91.8 vs. 83.0 on Screenspot v2.

Model Capabilities

GUI image understanding

Cross-modal text generation

GUI element recognition

Multimodal reasoning

Use Cases

GUI automation

GUI element description generation

Generate descriptive text for interface elements based on GUI screenshots

Achieved a score of 91.8 on the Screenspot v2 benchmark

GUI operation guidance

Generate step-by-step instructions based on GUI images

🚀 Fintor-GUI-S2

Fintor-GUI-S2 is a GUI grounding model. It is fine-tuned from UI-TARS-7B-DPO and scores higher than that base model on the Screenspot Pro and Screenspot v2 benchmarks.

🚀 Quick Start

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Fintor/Ui-Tars-7B-Instruct-Finetuned-Os-Atlas", 
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
# default processor
processor = AutoProcessor.from_pretrained("Fintor/Ui-Tars-7B-Instruct-Finetuned-Os-Atlas")
# Example input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
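
For grounding, the decoded text usually has to be turned into a screen position. The sketch below is one hedged way to do that. It assumes, and this is not stated in the model card, that the output contains a point such as `(235, 512)` on a 0 to 1000 relative scale. `parse_point`, `to_pixels`, and the sample output string are all hypothetical; inspect the model's real output before relying on the regex or the scale.

```python
import re
from typing import Optional, Tuple

# Assumption: the model output contains a point like "(235, 512)"
# expressed on a 0-1000 relative scale. Verify against real outputs.
_POINT_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)")


def parse_point(output_text: str) -> Optional[Tuple[float, float]]:
    """Return the first "(x, y)" pair found in the text, or None."""
    match = _POINT_RE.search(output_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def to_pixels(
    point: Tuple[float, float], width: int, height: int, scale: float = 1000.0
) -> Tuple[int, int]:
    """Map a relative point (0..scale) to pixel coordinates on the screenshot."""
    x, y = point
    return round(x / scale * width), round(y / scale * height)


# Hypothetical output string, used only to exercise the helpers.
point = parse_point("click(start_box='(235, 512)')")
if point is not None:
    print(to_pixels(point, width=1920, height=1080))
```

If the model turns out to emit absolute pixel coordinates instead, pass `scale` equal to the image dimension or skip `to_pixels` entirely.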

✨ Features

Fine-tuned Model: Fintor-GUI-S2 is fine-tuned from [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO), leveraging the pre-trained knowledge of the base model.
Multimodal Capability: The model handles image-text-to-text tasks, which makes it suitable for GUI grounding scenarios.

💻 Usage Examples

Basic Usage

The quick-start code above shows the basic usage: loading the model, preparing inputs, and generating outputs.
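
To reuse the quick-start flow with a screenshot and a task instruction, the `messages` structure can be built by a small helper. This is a minimal sketch: `build_messages` is a hypothetical name, and the instruction wording is illustrative only, since the model card does not specify a preferred prompt format.

```python
from typing import Any, Dict, List


def build_messages(image_path: str, instruction: str) -> List[Dict[str, Any]]:
    """Build a single-turn chat message with one image and one text prompt,
    in the same shape as the quick-start example."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]


# Illustrative path and instruction (assumptions, not from the model card).
messages = build_messages("path/to/screenshot.png", "Click the Save button.")
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the quick-start code.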

📚 Documentation

Model Description

Fintor-GUI-S2 is a GUI grounding model fine-tuned from [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO).

Evaluation Results

We evaluated our model using [Screenspot](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding) on two benchmarks: Screenspot Pro and Screenspot v2. We also include the evaluation scripts used on these benchmarks. The table below compares our model's performance against the base model's.

Model	Size	Screenspot Pro	Screenspot v2
[UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)	7B	27.0	83.0
Ui-Tars-7B-Instruct-Finetuned-Os-Atlas (ours)	7B	33.0	91.8

Note: The base model scores slightly lower here than the scores reported in its paper because the prompts used for that evaluation are not publicly available. We used the default prompts when evaluating both the base and fine-tuned models.
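
The gains implied by the reported scores can be worked out directly. The snippet below uses only the four reported numbers; `gains` is a hypothetical helper name, and the relative figure is simply the absolute gain divided by the base score.

```python
# Scores copied from the evaluation table.
scores = {
    "Screenspot Pro": {"base": 27.0, "finetuned": 33.0},
    "Screenspot v2": {"base": 83.0, "finetuned": 91.8},
}


def gains(base: float, finetuned: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent),
    both rounded to one decimal place."""
    absolute = round(finetuned - base, 1)
    relative = round(100.0 * (finetuned - base) / base, 1)
    return absolute, relative


for name, s in scores.items():
    absolute, relative = gains(s["base"], s["finetuned"])
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

This gives +6.0 points (about 22.2% relative) on Screenspot Pro and +8.8 points (about 10.6% relative) on Screenspot v2.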

Training procedure

This model was fine-tuned on the OS-Copilot dataset: [OS-Copilot](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data/tree/main).

[Visualize in Weights & Biases](https://wandb.ai/am_fintor-neuralleap/huggingface/runs/hl90xquy?nw=nwuseram_fintor)

This model was trained with SFT and LoRA.
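
LoRA keeps the pretrained weights frozen and trains a pair of low-rank matrices for each adapted layer, so a weight of shape d_out × d_in adapted at rank r adds r · (d_in + d_out) trainable parameters. The card does not state the rank or which layers were adapted, so the dimensions and rank below are illustrative assumptions only, and the function names are hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one weight matrix:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix itself."""
    return d_in * d_out


# Hypothetical values, chosen only to illustrate the scale of the saving.
d_in = d_out = 3584
rank = 8

lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"LoRA: {lora} trainable vs {full} frozen ({100 * lora / full:.2f}%)")
```

With these assumed values, the adapter trains well under 1% of the parameters in the layer it wraps.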

Evaluation Scripts:

Evaluation scripts are available here: [Screenspot_Ui-Tars](https://github.com/ma-neuralleap/ScreenSpot-Pro-GUI-Grounding/blob/main/models/uitaris.py)
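
The linked script is the reference for how the scores were computed. As a rough illustration only, grounding benchmarks of this kind typically count a prediction as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes that rule and `(x1, y1, x2, y2)` pixel boxes; the function names and sample data are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def point_in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box, edges included."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def click_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


# Made-up predictions and targets: the first is a hit, the second a miss.
preds = [(50, 50), (10, 10)]
targets = [(0, 0, 100, 100), (20, 20, 40, 40)]
print(click_accuracy(preds, targets))
```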

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	GUI grounding model
Training Data	[OS-Copilot](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data/tree/main)
Base Model	[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)
Pipeline Tag	image-text-to-text
Library Name	transformers
Tags	multimodel, gui
