Florence-2-base-Castollux-v0.4 Open-source Image Description Model

Developed by PJMixers-Images

An image captioning model fine-tuned from microsoft/Florence-2-base, focused on improving caption quality and formatting.

Image-to-Text

Transformers

English

#Image Caption Generation #High-Precision Detail Recognition #Natural Scene Understanding

Downloads 23

Release Time: 2/4/2025

Model Overview

This is an image-to-text model fine-tuned from microsoft/Florence-2-base and optimized for the quality and formatting of generated image captions. It was trained with the <CAPTION> task prompt and is intended for producing detailed image descriptions.

Model Features

High-Quality Image Captions

Generates more detailed image descriptions than the base model, as illustrated by the side-by-side predictions in the Model Evaluation table

Format Optimization

Specifically optimized for the format and structure of descriptions

Task Prompt Support

Trained with the <CAPTION> task prompt; according to the model card, training on <DETAILED_CAPTION> or <MORE_DETAILED_CAPTION> did not significantly improve quality

Model Capabilities

Image Caption Generation

Detailed Scene Analysis

Object Recognition and Description

Use Cases

Content Generation

Automatic Image Annotation

Generates detailed descriptive text for images

Produces more accurate and detailed descriptions than the base model

Accessibility Assistance

Provides image content descriptions for visually impaired users

Offers more comprehensive scene understanding

Media Processing

Media Content Analysis

Automatically analyzes image content and generates descriptions

Can be used for content classification and retrieval

🚀 Florence-2-base-Castollux-v0.4

This is a fine-tuned model based on microsoft/Florence-2-base that aims to improve the quality and formatting of image captions. It uses the <CAPTION> task prompt; in the author's experiments, training on <DETAILED_CAPTION> or <MORE_DETAILED_CAPTION> did not significantly improve quality compared to <CAPTION>.

📦 Datasets and Model Information

Property	Details
Datasets	PJMixers-Images/Castollux-Dataset
Base Model	microsoft/Florence-2-base
Pipeline Tag	image-to-text
Library Name	transformers
New Version	PJMixers-Images/Florence-2-base-Castollux-v0.5

🚀 Quick Start

This model is a fine-tuned version of microsoft/Florence-2-base, designed to improve image captioning quality. In earlier versions (v0.1 and v0.2), captions longer than 1000 Florence-2 tokens were filtered out. From v0.3 onwards, captions longer than 512 T5 tokens are filtered out.
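The section above does not include an inference snippet, so here is a minimal sketch of how a Florence-2 checkpoint is typically run with the transformers library. The repository id `PJMixers-Images/Florence-2-base-Castollux-v0.4` is an assumption inferred from the developer and model names on this page, and the decoding settings (`max_new_tokens`, `num_beams`) are illustrative defaults rather than values recommended by the author. Loading with `trust_remote_code=True` follows the usual Florence-2 pattern; newer transformers releases may behave differently.

```python
# Assumed repo id, inferred from the developer and model names on this page.
MODEL_ID = "PJMixers-Images/Florence-2-base-Castollux-v0.4"
# The task prompt the model card says the model was trained with.
TASK_PROMPT = "<CAPTION>"


def generation_kwargs(max_new_tokens: int = 512, num_beams: int = 3) -> dict:
    """Decoding settings; illustrative defaults, not values from the model card."""
    return {
        "max_new_tokens": max_new_tokens,
        "num_beams": num_beams,
        "do_sample": False,
    }


def caption_image(image_path: str, device: str = "cpu") -> str:
    """Generate a caption for one image file with the <CAPTION> prompt."""
    # Heavy imports are local so this module can be imported without them.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 checkpoints ship custom modeling code, hence trust_remote_code.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = model.to(device).eval()
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=TASK_PROMPT, images=image, return_tensors="pt").to(device)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        **generation_kwargs(),
    )
    # Keep special tokens: the Florence-2 post-processor parses them itself.
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw_text, task=TASK_PROMPT, image_size=(image.width, image.height)
    )
    return parsed[TASK_PROMPT].strip()


# Example call ("example.jpg" is a hypothetical local file):
# print(caption_image("example.jpg"))
```

The heavy imports sit inside `caption_image` so that the constants and `generation_kwargs` can be reused without loading the model.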

✨ Features

Improved Captioning: Enhances the quality and formatting of image captioning.
Task Prompt Utilization: Uses the <CAPTION> task prompt effectively.
Token Filtering: Drops captions that exceed a token limit (512 T5 tokens from v0.3 onwards).
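The token filtering feature can be illustrated with a short sketch. This is not the author's filtering code: the helper names are hypothetical, and a whitespace counter stands in for a real T5 tokenizer so the example runs without extra dependencies.

```python
from typing import Callable, Iterable, List

# Limit stated in the model card for v0.3 onwards (counted in T5 tokens).
MAX_CAPTION_TOKENS = 512


def filter_captions(
    captions: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_CAPTION_TOKENS,
) -> List[str]:
    """Keep only captions whose token count does not exceed max_tokens."""
    return [c for c in captions if count_tokens(c) <= max_tokens]


def whitespace_token_count(text: str) -> int:
    """Stand-in counter for demonstration; real filtering would use a T5 tokenizer."""
    return len(text.split())


captions = ["A red stop sign on a grey pole.", "word " * 600]
kept = filter_captions(captions, whitespace_token_count)
print(len(kept))  # the 600-word caption is dropped
```

To count real T5 tokens, `count_tokens` could instead wrap a Hugging Face tokenizer, for example `lambda s: len(tokenizer(s).input_ids)`. Which T5 tokenizer checkpoint the author used is not stated here.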

📚 Documentation

Model Evaluation

The images selected for evaluation are not cherry-picked; they come from the first batch of images in the randomly chosen eval split.

Ground Truth	Original Model Prediction	Fine-tuned Model Prediction
A two-toned wall is visible on the left, featuring an orange upper section and a blue lower section, with a thin horizontal line separating the two colors. A small, square window with vertical metal bars is embedded in the orange section of the wall. A light orange pipe runs vertically down the orange section of the wall, adjacent to the color dividing line. To the right of this wall is a textured, light grey wall. A grey pipe runs vertically down this wall. A red octagonal stop sign with white lettering that reads "STOP" is mounted on a grey pole, positioned in front of the grey wall. Further to the right, a blue shuttered window is set into the grey wall. A decorative dark blue railing is attached to the wall below the window.	A stop sign sitting on the side of a building.	A red octagonal stop sign with white text "STOP" is positioned on the left side of the image, attached to a light beige stucco wall. To the right of the stop sign is a closed blue door with peeling paint and metal bars, and a black wrought-iron balcony with decorative elements on the right side. The wall is textured and has visible cracks and imperfections. A white pipe runs vertically down the wall between the orange and blue sections of the wall. A small rectangular window with dark metal bars is visible inside the orange section.
A brown and white dog with floppy ears and an open mouth stands in a dense field of green leafy plants with small white flowers. The dog is positioned in the lower center of the image, facing slightly to the left. The field of plants extends throughout the foreground and midground, creating a thick carpet of green. Behind the dog and the lower field, a steeper bank covered in similar green leafy plants rises up, filling the upper portion of the image. A narrow, dark, dirt path or small stream runs horizontally across the midground, separating the lower field from the higher bank. A fallen log lies along the edge of this path on the right side of the image. The leaves on the higher bank appear slightly smaller and more densely packed than those in the foreground. Some thin, bare branches are visible on the right side of the higher bank.	A dog is sitting in the middle of a lush green field.	A brown and white dog with floppy ears is positioned in the center of the image, facing slightly to the left, with its tongue sticking out. The dog is surrounded by a field of green leafy plants with small white flowers. Behind the dog and to the right is a dense green wall covered in leafy foliage. The ground in the foreground appears to be a dirt path or path. The lighting in the image is soft and natural.
A very tall waterfall cascades down the center of a dark grey rocky cliff into a dark body of water at the bottom. The cliff is textured and has patches of snow. Sparse trees are at the top of the cliff and on the cliff face. The dark water reflects the cliff and waterfall. The sky is grey.	A large waterfall in the middle of a body of water.	A tall waterfall cascades down a rocky cliff face in the center of the image, with the water appearing dark blue and reflecting the sky above. The cliff face is composed of grey and brown rock, with visible cracks and crevices. At the top of the cliff, a dense forest of dark green trees is visible. The sky is overcast and grey. The water in the foreground is dark blue.
A brightly colored fish ride with a white head, blue eye, open orange mouth, and multicolored scales of blue, orange, yellow, and white is positioned in the center of the image, mounted on a black base with black handlebars. Behind the fish ride is a cream-colored decorative architectural element with arched openings. To the left of the fish ride are several metal pots and pans, including a large blue enamel pot tipped over with a red handle facing the viewer, and other smaller pots stacked behind it. To the right of the fish ride are several black buckets stacked on top of each other. The fish ride and buckets are situated on a red platform. In the background, there is a weathered blue wooden wall with a white-framed window, and to the left of the wall are aged wooden doors with decorative carvings. There is text visible on a white label attached to the cream-colored architectural element.	A large fish statue sitting on top of a table.	A large metal fish sculpture with orange, white, and blue scales is positioned in the center of the image, facing slightly to the right. The fish has a black eye and a yellow beak. It is mounted on a black metal stand. To the left of the fish is a red bucket with a blue rim and a red handle. Further to the left is a white bucket. In the foreground, there is a stack of black plastic buckets. Behind the fish and bucket are several other metal buckets and containers. The background shows a building with green wooden siding and a white window with multiple panes. A wooden door is visible on the left side, with a metal railing above it. A small white sign with black text is attached to the side of the building.
A young girl with brown hair and brown eyes is smiling and looking directly at the viewer while standing with her hands on her hips. She is wearing a light blue headband with a flower on the right side of her head. She has on a light blue sleeveless top and a long skirt that is multi-colored with pastel shades of yellow, pink, purple, and blue, and appears to be made of tulle or a similar sheer fabric. She is barefoot and standing on a plain white background.	A little girl wearing a dress with a flower in her hair.	A young girl with fair skin and brown eyes is smiling and looking directly at the viewer, wearing a light blue headband with a large blue flower in the center, a white sleeveless dress with thin straps, and a rainbow-colored tulle skirt. Her hands are placed on her hips, and her feet are bare. The background is a plain white wall.
A woman with brown wavy hair and fair skin is sitting in a yellow armchair. She is wearing a red sequined dress with thin straps and black high-heeled shoes. Her legs are crossed at the knees, and her right hand is resting on the arm of the chair. She is looking directly at the viewer with brown eyes and wearing makeup including eyeliner and lipstick. To the left of the armchair is a small round white table on which sits an orange vase containing a bouquet of flowers with pink, purple, and yellow blooms. Behind the woman and the table are white vertical blinds and blue curtains. The wall behind the blinds and curtains appears to be a light grey or blue color. The floor is partially visible and appears to be a light color. The bottom of a green bed or couch is visible in the foreground. The lighting in the image is studio-like, with a focus on the woman and the immediate surroundings.	A woman in a red dress sitting on a yellow chair.	A woman with long brown wavy hair is sitting in a yellow armchair. She is wearing a red sleeveless dress with thin straps. Her legs are crossed and she is wearing black high-heeled shoes. The armchair has a curved back and armrests. To the left of the armchair is a small round white table. On the table is a clear glass vase filled with pink and white flowers. Behind the vase is a window with white vertical blinds and a dark blue curtain on the right side. The background is a light blue wall.
A clear blue sky with a few wispy white clouds in the upper left corner is above a turquoise and clear sea with visible ripples and wave patterns. A small wave with white foam is breaking on the white sandy beach in the foreground. The water is very transparent, allowing the sandy bottom to be seen. The horizon line is straight and separates the sea and sky.	A sandy beach with clear blue water and white sand.	A body of turquoise water fills the majority of the frame, with gentle waves crashing onto a sandy beach in the foreground. The water is a vibrant shade of green, and the waves are white and foamy. The sand is a light beige color. The horizon line is visible in the distance between the water and the beach. The sky above is a clear blue with a few wispy white clouds scattered across it.
A landscape view shows a waterfall cascading down a rocky cliff face that is illuminated by bright sunlight, with a rainbow visible at the base of the falls, and two small figures standing near the base of the waterfall. The cliff face is mostly brown and grey rock with patches of green vegetation. Behind the sunlit cliff face is a darker, shadowed mountain range with visible rock striations and some sparse vegetation. In the far background to the left is the distinct shape of Half Dome under a cloudy sky. The foreground is filled with a dense, dark green forest of trees, contrasting with the brightly lit waterfall and cliff face. The sky above is mostly cloudy with dark grey clouds, but there are some brighter areas peeking through.	A large waterfall in the middle of a mountain range.	A waterfall cascades down a rocky cliff face on the right side of the image, with the water appearing yellow and white. The cliff face is composed of grey and brown rocks, with patches of green vegetation growing on the top and sides. The waterfall is located in the center of the cliff face and is surrounded by a dense forest of dark green trees. In the background, a large mountain range is visible, covered in dark green coniferous trees. The sky above is cloudy with dark grey and white clouds. The overall scene is dramatic and dramatic, with a focus on the waterfall and the forest below.
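The evaluation note says the examples in the table come from the first batch of a randomly chosen eval split. A minimal sketch of that selection is shown below, using the `eval_split_ratio`, `seed`, and `eval_batch_size` values from the training config. The shuffling scheme and function names are assumptions for illustration; the trainer's actual split logic is not described here, so this will not reproduce the same images.

```python
import random
from typing import List, Sequence, Tuple


def split_train_eval(
    items: Sequence[str], eval_ratio: float = 0.05, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically, then carve off an eval split (assumed scheme)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]


def first_batch(items: Sequence[str], batch_size: int = 8) -> List[str]:
    """Return the first batch of a split, with no manual selection."""
    return list(items[:batch_size])


# Hypothetical file names standing in for the roughly 17K training images.
image_names = [f"image_{i:05d}.jpg" for i in range(17_000)]
train_items, eval_items = split_train_eval(image_names)
eval_examples = first_batch(eval_items)
print(len(train_items), len(eval_items), len(eval_examples))
```

Fixing the seed makes the split repeatable, and always taking the first batch removes any opportunity to hand-pick flattering examples.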

[Figure: val_loss plot]

Training Settings

The model was trained with Florence-2ner using the following config and approximately 17K images:

{
    "model_name": "microsoft/Florence-2-base",
    "task_prompt": "<CAPTION>",
    "dataset_path": "./0000_Datasets/Gemini-512lim",
    "wandb_project_name": "Florence-2-base",
    "run_name": "Florence-2-base-Castollux-v0.4-run7",
    "epochs": 2,
    "optimizer": "CAME",
    "learning_rate": 5e-6,
    "lr_scheduler": "REX",
    "gradient_checkpointing": true,
    "freeze_vision": false,
    "freeze_language": false,
    "freeze_other": false,
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "gradient_accumulation_steps": 8,
    "clip_grad_norm": 1,
    "weight_decay": 1e-2,
    "save_total_limit": 3,
    "save_steps": 10,
    "eval_steps": 10,
    "warmup_steps": 10,
    "eval_split_ratio": 0.05,
    "seed": 42,
    "filtering_processes": 128,
    "attn_implementation": "sdpa"
}
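A few quantities implied by this config can be worked out with simple arithmetic. The sketch below estimates the effective batch size and the number of optimizer steps, and evaluates a REX-style learning-rate curve. It rests on several assumptions: training on a single device, the "approximately 17K images" figure including the eval split, partial accumulation batches counting as a step, and linear warmup before the decay. The REX factor follows the formula as given in the cited REX paper; how Florence-2ner implements warmup and the schedule is not stated here.

```python
import math

# Values copied from the config above.
config = {
    "epochs": 2,
    "learning_rate": 5e-6,
    "train_batch_size": 8,
    "gradient_accumulation_steps": 8,
    "warmup_steps": 10,
    "eval_split_ratio": 0.05,
}
NUM_IMAGES = 17_000  # "approximately 17K images"; an approximation


def effective_batch_size(cfg: dict) -> int:
    """Samples per optimizer step on one device (single device is an assumption)."""
    return cfg["train_batch_size"] * cfg["gradient_accumulation_steps"]


def optimizer_steps(cfg: dict, num_images: int) -> int:
    """Rough total optimizer steps, assuming the eval split is taken from num_images."""
    n_train = num_images - round(num_images * cfg["eval_split_ratio"])
    steps_per_epoch = math.ceil(n_train / effective_batch_size(cfg))
    return steps_per_epoch * cfg["epochs"]


def rex_factor(step: int, total_steps: int) -> float:
    """REX decay factor: (1 - z) / (0.5 + 0.5 * (1 - z)) with z = step / total."""
    z = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - z) / (0.5 + 0.5 * (1.0 - z))


def lr_at(step: int, cfg: dict, total_steps: int) -> float:
    """Learning rate at a step: linear warmup (assumed), then REX decay."""
    warmup = cfg["warmup_steps"]
    if step < warmup:
        return cfg["learning_rate"] * (step + 1) / warmup
    return cfg["learning_rate"] * rex_factor(step - warmup, total_steps - warmup)


total = optimizer_steps(config, NUM_IMAGES)
print(effective_batch_size(config), total)
for s in (0, 9, total // 2, total - 1):
    print(s, lr_at(s, config, total))
```

Under these assumptions each optimizer step sees 64 images, so `save_steps` and `eval_steps` of 10 correspond to a checkpoint and an evaluation roughly every 640 training images.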

📄 License

No license information is provided in the original document.

📖 Citations

@misc{xiao2023florence2advancingunifiedrepresentation,
      title={Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks}, 
      author={Bin Xiao and Haiping Wu and Weijian Xu and Xiyang Dai and Houdong Hu and Yumao Lu and Michael Zeng and Ce Liu and Lu Yuan},
      year={2023},
      eprint={2311.06242},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2311.06242}, 
}
@misc{wolf2020huggingfacestransformersstateoftheartnatural,
      title={HuggingFace's Transformers: State-of-the-art Natural Language Processing},
      author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush},
      year={2020},
      eprint={1910.03771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/1910.03771},
}
@misc{dao2023flashattention2fasterattentionbetter,
      title={FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning},
      author={Tri Dao},
      year={2023},
      eprint={2307.08691},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2307.08691},
}
@misc{luo2023cameconfidenceguidedadaptivememory,
      title={CAME: Confidence-guided Adaptive Memory Efficient Optimization}, 
      author={Yang Luo and Xiaozhe Ren and Zangwei Zheng and Zhuo Jiang and Xin Jiang and Yang You},
      year={2023},
      eprint={2307.02047},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2307.02047}, 
}
@misc{chen2021rexrevisitingbudgetedtraining,
      title={REX: Revisiting Budgeted Training with an Improved Schedule}, 
      author={John Chen and Cameron Wolfe and Anastasios Kyrillidis},
      year={2021},
      eprint={2107.04197},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2107.04197}, 
}
@misc{geminiteam2024geminifamilyhighlycapable,
      title={Gemini: A Family of Highly Capable Multimodal Models}, 
      author={Gemini Team and Rohan Anil and Sebastian Borgeaud and Jean-Baptiste Alayrac and Jiahui Yu and Radu Soricut and Johan Schalkwyk and Andrew M. Dai and Anja Hauth and Katie Millican and others},
}
