Fintor-GUI-S2
Fintor-GUI-S2 is a GUI grounding model fine-tuned from UI-TARS-7B-DPO. On the Screenspot Pro and Screenspot v2 benchmarks it scores higher than the base model (see Evaluation Results).
Quick Start
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint. flash_attention_2 requires the flash-attn
# package; drop that argument to fall back to the default attention.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Fintor/Ui-Tars-7B-Instruct-Finetuned-Os-Atlas",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Fintor/Ui-Tars-7B-Instruct-Finetuned-Os-Atlas")

# A single user turn containing one image and one text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and extract the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Features
- Fine-tuned Model: Fintor-GUI-S2 is fine-tuned from [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO), leveraging the pre-trained knowledge of the base model.
- Multimodal Capability: It carries the `multimodel` tag and handles image-text-to-text tasks, which suits GUI grounding scenarios.
Usage Examples
Basic Usage
The quick-start code above shows the basic usage: loading the model, preparing inputs, and generating outputs.
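For grounding queries, the same message structure can be wrapped in a small helper so that only the screenshot and the instruction change between calls. This is a minimal sketch: `build_grounding_messages`, the file path, and the instruction wording are hypothetical, and the prompt format the model responds to best is not documented here.

```python
def build_grounding_messages(image_path, instruction):
    """Return a single-turn chat message list pairing a screenshot with an instruction.

    Hypothetical helper: it only mirrors the message layout used in the quick start.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]


messages = build_grounding_messages(
    "path/to/screenshot.png",   # hypothetical screenshot path
    "Locate the Save button.",  # hypothetical instruction wording
)
print(messages[0]["content"][1]["text"])
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the quick start.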
Documentation
Model Description
Fintor-GUI-S2 is a GUI grounding model fine-tuned from [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO).
Evaluation Results
We evaluated our model using [Screenspot](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding) on two benchmarks: Screenspot Pro and Screenspot v2.
We also include the evaluation scripts used on these benchmarks. The table below compares our model's performance against that of the base model.
| Model | Size | Screenspot Pro | Screenspot v2 |
|---|---|---|---|
| [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO) | 7B | 27.0 | 83.0 |
| Ours | | | |
| Ui-Tars-7B-Instruct-Finetuned-Os-Atlas | 7B | 33.0 | 91.8 |
Note: The base model scores slightly lower here than in its paper because the prompts used for the paper's evaluation are not publicly available. We used the default prompts when evaluating both the base and fine-tuned models.
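As an illustration of how a grounding score of this kind is typically computed, the sketch below assumes the common ScreenSpot-style convention that a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The function names and the toy data are made up for the example; the linked evaluation scripts are the authoritative reference. The last lines compute the absolute gains from the scores reported in the table.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside the (left, top, right, bottom) box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions, boxes):
    """Percentage of predicted points that land inside their target boxes."""
    if not boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return 100.0 * hits / len(boxes)


# Toy data, invented for illustration only.
predictions = [(50, 40), (300, 220), (10, 10), (120, 95)]
boxes = [
    (30, 20, 80, 60),
    (280, 200, 340, 240),
    (100, 100, 150, 150),
    (100, 80, 140, 110),
]
accuracy = grounding_accuracy(predictions, boxes)
print(accuracy)  # 75.0: three of the four points fall inside their boxes

# Absolute gains over the base model, using the scores reported above.
reported = {"Screenspot Pro": (27.0, 33.0), "Screenspot v2": (83.0, 91.8)}
gains = {name: round(tuned - base, 1) for name, (base, tuned) in reported.items()}
print(gains)  # {'Screenspot Pro': 6.0, 'Screenspot v2': 8.8}
```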
Training procedure
This model was fine-tuned on the OS-Copilot dataset: [OS-Copilot](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data/tree/main).
[Training run on Weights & Biases](https://wandb.ai/am_fintor-neuralleap/huggingface/runs/hl90xquy?nw=nwuseram_fintor)
This model was trained with SFT and LoRA.
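For readers unfamiliar with LoRA, the sketch below shows the general idea in plain Python: the pretrained weight `W` stays frozen, and training only updates two small matrices `A` and `B` whose product is added to it as `W + (alpha / r) * B @ A`. The rank, `alpha`, and target layers used for this model are not stated in this document, so every value below is illustrative, and `matmul` and `lora_effective_weight` are hypothetical helper names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank (number of rows in A)."""
    r = len(a)            # A has shape (r, in_features)
    scale = alpha / r
    delta = matmul(b, a)  # (out_features, r) @ (r, in_features)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative values only: a frozen 2x3 weight and a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]  # trainable, shape (1, 3)
b = [[0.5], [0.0]]     # trainable, shape (2, 1)
w_eff = lora_effective_weight(w, a, b, alpha=2.0)
print(w_eff)  # [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
```

Because `A` and `B` together hold far fewer parameters than `W` when the rank is small, only a small fraction of the model's weights needs to be trained and stored.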
Evaluation Scripts:
The evaluation scripts are available here: [Screenspot_Ui-Tars](https://github.com/ma-neuralleap/ScreenSpot-Pro-GUI-Grounding/blob/main/models/uitaris.py)
License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|---|---|
| Model Type | GUI grounding model |
| Training Data | [OS-Copilot](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data/tree/main) |
| Base Model | [bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO) |
| Pipeline Tag | image-text-to-text |
| Library Name | transformers |
| Tags | multimodel, gui |