# Euclid-convnext-xxlarge
A multimodal large language model specifically trained for strong low-level geometric perception.
## Quick Start
Clone our Euclid repo first, set up the environment, then run:
pip install -U "huggingface_hub[cli]"
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge
python euclid/eval/run_euclid_geo.py --model_path $MODEL_PATH --device cuda
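For scripted setups, the same steps can be driven from Python. The sketch below is a minimal illustration and not part of the Euclid repo: `download_euclid` and `eval_command` are hypothetical helper names, `/path/to/models` is a placeholder, and `snapshot_download` comes from the `huggingface_hub` package installed above. The script path and flags are copied from the commands above.

```python
import shlex

REPO_ID = "EuclidAI/Euclid-convnext-xxlarge"


def download_euclid(model_path):
    """Download the checkpoint into model_path, mirroring `--cache-dir $MODEL_PATH`."""
    # Imported lazily so the rest of the sketch works without the package installed.
    from huggingface_hub import snapshot_download

    # Returns the local folder that holds the downloaded snapshot.
    return snapshot_download(repo_id=REPO_ID, cache_dir=model_path)


def eval_command(model_path, device="cuda"):
    """Build the Quick Start evaluation command as an argument list."""
    return [
        "python", "euclid/eval/run_euclid_geo.py",
        "--model_path", model_path,
        "--device", device,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; "/path/to/models" is a placeholder.
    # Pass the list to subprocess.run(..., check=True) to execute it.
    print(shlex.join(eval_command("/path/to/models")))
```

Run it from the root of the cloned repo so that the relative script path resolves.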
## Features
- Trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach.
- Combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector.
- Capable of performing precise low-level geometric perception tasks such as point-on-line detection, point-on-circle detection, angle classification, length comparison, and geometric annotation understanding.
## Installation
Clone our Euclid repo first, set up the environment, then run:
pip install -U "huggingface_hub[cli]"
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge
## Documentation
Model Details
Model Description
Euclid is trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach. It combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector.
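To make the role of the connector concrete, here is a minimal pure-Python sketch of a 2-layer MLP that maps per-patch visual features into the language model's embedding space. It is an illustration under assumptions, not the Euclid implementation: the GELU activation, the toy dimensions, the random initialization, and the `MLPConnector` name are all hypothetical, and a real implementation would use a tensor library rather than Python lists.

```python
import math
import random


def gelu(x):
    # Exact GELU. The activation is an assumption: the model card only says "2-layer MLP".
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # x has in_dim entries; weight has out_dim rows of in_dim entries each.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Small uniform weights and zero biases, for illustration only.
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MLPConnector:
    """Hypothetical connector: Linear -> GELU -> Linear, applied to each patch."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(vision_dim, llm_dim, rng)
        self.fc2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        # patch_features: one feature vector per image patch from the visual encoder.
        tokens = []
        for patch in patch_features:
            hidden = [gelu(h) for h in linear(patch, *self.fc1)]
            tokens.append(linear(hidden, *self.fc2))
        return tokens


# Toy usage: 3 patches with 8-dim features become 3 tokens of width 4.
connector = MLPConnector(vision_dim=8, llm_dim=4)
visual_tokens = connector([[0.1 * i] * 8 for i in range(3)])
```

Each output vector plays the role of one visual token that is fed to the language model alongside the text tokens.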
Model Sources
- Repository: https://github.com/euclid-multimodal/Euclid
- Paper: https://arxiv.org/abs/2412.08737
- Demo: https://euclid-multimodal.github.io/
Uses
The model is trained for precise low-level geometric perception and can perform the following tasks:
- Point-on-line detection
- Point-on-circle detection
- Angle classification
- Length comparison
- Geometric annotation understanding
Please refer to our repo for the full input format.
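To illustrate what some of these tasks ask, the sketch below shows how a ground-truth answer could be derived when the coordinates of a diagram are known, as they are for synthetic images. It is a hypothetical illustration: the function names, the tolerance, and the answer labels are assumptions, and it does not reproduce the repo's data pipeline or input format.

```python
import math


def point_on_line(p, a, b, tol=1e-9):
    """True if p lies on the infinite line through a and b, within tol."""
    # The cross product of (b - a) and (p - a) is zero exactly when the points are collinear.
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) <= tol * max(1.0, math.dist(a, b))


def point_on_circle(p, center, radius, tol=1e-9):
    """True if the distance from p to center equals radius, within tol."""
    return abs(math.dist(p, center) - radius) <= tol * max(1.0, radius)


def compare_lengths(seg1, seg2, tol=1e-9):
    """Return 'first', 'second', or 'equal' for the longer of two segments."""
    d1 = math.dist(*seg1)
    d2 = math.dist(*seg2)
    if abs(d1 - d2) <= tol:
        return "equal"
    return "first" if d1 > d2 else "second"


# Toy usage with hand-picked coordinates.
on_line = point_on_line((1, 1), (0, 0), (2, 2))
on_circle = point_on_circle((3, 4), (0, 0), 5)
longer = compare_lengths(((0, 0), (3, 4)), ((0, 0), (1, 0)))
```

The model itself answers such questions from pixels alone; exact checks like these are only available when the underlying coordinates are known.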
Limitations and Applications
Our model is not designed to handle:
- Comprehensive image understanding tasks
- Advanced cognitive reasoning beyond geometric analysis
However, the model demonstrates strength in low-level visual perception, which makes it a potentially valuable base model for specialized downstream fine-tuning, including:
- Robotic vision and automation systems
- Medical imaging and diagnostic support
- Industrial quality assurance and inspection
- Geometric education and visualization tools
Evaluation Results
Performance on Geoperception benchmark tasks:
| Model | POL | POC | ALC | LHC | PEP | PRA | EQL | Overall |
|-------|-----|-----|-----|-----|-----|-----|-----|---------|
| Random Baseline | 0.43 | 2.63 | 59.92 | 51.36 | 0.25 | 0.00 | 0.02 | 16.37 |
| Pixtral-12B | 22.85 | 53.21 | 47.33 | 51.43 | 22.53 | 37.11 | 58.45 | 41.84 |
| Gemini-1.5-Pro | 24.42 | 69.80 | 57.96 | 79.05 | 39.60 | 77.59 | 52.27 | 57.24 |
| EUCLID-ConvNeXt-Large | 80.54 | 57.76 | 86.37 | 88.24 | 42.23 | 64.94 | 34.45 | 64.93 |
| EUCLID-ConvNeXt-XXLarge | 82.98 | 61.45 | 90.56 | 90.82 | 46.96 | 70.52 | 31.94 | 67.89 |
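The Overall values are consistent with an unweighted mean of the seven per-task scores. This is inferred from the numbers rather than stated in the source, so treat it as an assumption. The sketch below recomputes the mean from the table values and checks it against the reported figure, within rounding.

```python
# Per-task accuracies (%) copied from the table above, in the order
# POL, POC, ALC, LHC, PEP, PRA, EQL, followed by the reported Overall.
RESULTS = {
    "Random Baseline": ([0.43, 2.63, 59.92, 51.36, 0.25, 0.00, 0.02], 16.37),
    "Pixtral-12B": ([22.85, 53.21, 47.33, 51.43, 22.53, 37.11, 58.45], 41.84),
    "Gemini-1.5-Pro": ([24.42, 69.80, 57.96, 79.05, 39.60, 77.59, 52.27], 57.24),
    "EUCLID-ConvNeXt-Large": ([80.54, 57.76, 86.37, 88.24, 42.23, 64.94, 34.45], 64.93),
    "EUCLID-ConvNeXt-XXLarge": ([82.98, 61.45, 90.56, 90.82, 46.96, 70.52, 31.94], 67.89),
}


def overall(task_scores):
    """Unweighted mean of the per-task accuracies."""
    return sum(task_scores) / len(task_scores)


def check_overall(results, tol=0.005):
    """Map each model to True if the recomputed mean matches the reported value."""
    # tol is half of the last reported digit, so it only absorbs rounding.
    return {
        name: abs(overall(scores) - reported) <= tol
        for name, (scores, reported) in results.items()
    }


matches = check_overall(RESULTS)
```

A strong average can hide weak individual tasks: EUCLID-ConvNeXt-XXLarge has the highest Overall score, yet its EQL score (31.94) is below both Pixtral-12B and Gemini-1.5-Pro.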
Citation
If you find Euclid useful for your research and applications, please cite using this BibTeX:
@article{zhang2024euclid,
title={Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions},
author={Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie},
journal={arXiv preprint arXiv:2412.08737},
year={2024}
}
## License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|----------|---------|
| Base Model | Qwen/Qwen2.5-1.5B-Instruct, laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup |
| Pipeline Tag | question-answering |
| Metrics | accuracy |
| Library Name | transformers |