Euclid-convnext-xxlarge-120524 Open-source Multimodal Model - Reinforcing Geometric Perception for High-fidelity Visual Analysis

Developed by euclid-multimodal

A multimodal large language model specifically trained to enhance low-level geometric perception, improving geometric analysis capabilities through high-fidelity synthetic visual descriptions

Question Answering

Transformers

English

License: Apache-2.0

Tags: #Geometry-aware Enhancement #Synthetic Data Training #Robotic Vision

Release Time : 12/3/2024

Model Overview

A multimodal model that combines a ConvNeXt visual encoder with a Qwen-2.5 language model. It is trained on 1.6 million synthetic geometric images with Q&A pairs and targets precise detection and analysis of geometric relationships.

Model Features

High-fidelity Geometric Perception

Trained on synthetic geometric images with precise Q&A annotations, targeting accurate recognition of low-level geometric relationships

Curriculum Learning Architecture

Adopts a progressive training strategy that moves from simple geometric elements to more complex relationships
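
A minimal sketch of what such a progressive schedule could look like. This card does not describe the actual schedule, so the stage names, the accuracy threshold, and the rule "advance once accuracy clears the threshold" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Curriculum:
    """Toy curriculum: stages are ordered from simplest to hardest."""
    stages: list                 # stage names are illustrative, not from the card
    threshold: float = 0.9       # hypothetical accuracy required to advance
    index: int = 0
    history: list = field(default_factory=list)

    @property
    def current(self):
        return self.stages[self.index]

    def update(self, accuracy):
        """Record accuracy on the current stage; advance if it clears the threshold."""
        self.history.append((self.current, accuracy))
        if accuracy >= self.threshold and self.index < len(self.stages) - 1:
            self.index += 1
        return self.current


curriculum = Curriculum(["points and lines", "angles and lengths", "multi-shape relations"])
for acc in [0.72, 0.93, 0.95, 0.88]:
    curriculum.update(acc)
print(curriculum.current)
```

In this toy run the first evaluation keeps training on the easiest stage, the next two each unlock a harder stage, and the last one stays on the final stage because there is nothing left to advance to.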

Multimodal Fusion

Aligns ConvNeXt visual features with the language model through a two-layer MLP connector

Model Capabilities

Point-line relationship detection

Point-circle relationship detection

Angle classification

Length comparison

Geometric annotation understanding

Perpendicular and parallel annotation recognition (PEP, PRA)

Equality annotation recognition (EQL)

Use Cases

Industrial Inspection

Mechanical Part Dimension Measurement

Could help detect key dimensional relationships in part drawings after downstream fine-tuning

Achieves 90.82% accuracy on the length comparison (LHC) task of the Geoperception benchmark

Medical Imaging

Anatomical Structure Analysis

Identifies geometric features of organs in medical images

EdTech

Geometry Proof Assistance

Verifies steps in student-submitted geometric proofs

Achieves 70.52% accuracy on the PRA task of the Geoperception benchmark, which measures low-level perception rather than proof verification

🚀 Euclid-convnext-xxlarge

A multimodal large language model specifically trained for strong low-level geometric perception.

🚀 Quick Start

Clone our Euclid repo first, set up the environment, then run:

pip install -U "huggingface_hub[cli]"
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge
python euclid/eval/run_euclid_geo.py --model_path $MODEL_PATH --device cuda
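
If you prefer to drive these steps from Python, the sketch below builds the same two commands as argument lists suitable for `subprocess.run`. The helper name `quick_start_commands` and the fallback path `./euclid_weights` are hypothetical; the repo id, script path, and flags are copied from the commands above.

```python
import os
import shlex

REPO_ID = "EuclidAI/Euclid-convnext-xxlarge"  # repo id from the command above


def quick_start_commands(model_path, device="cuda"):
    """Return the download and evaluation commands as argument lists."""
    download = ["huggingface-cli", "download", "--cache-dir", model_path, REPO_ID]
    evaluate = ["python", "euclid/eval/run_euclid_geo.py",
                "--model_path", model_path, "--device", device]
    return [download, evaluate]


if __name__ == "__main__":
    # "./euclid_weights" is a hypothetical default used when MODEL_PATH is unset.
    model_path = os.environ.get("MODEL_PATH", "./euclid_weights")
    for cmd in quick_start_commands(model_path):
        print(shlex.join(cmd))
        # To execute instead of printing: subprocess.run(cmd, check=True)
```

Passing argument lists rather than a single shell string avoids quoting problems when the model path contains spaces.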

✨ Features

Trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach.
Combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector.
Capable of performing precise low-level geometric perception tasks such as point-on-line detection, point-on-circle detection, angle classification, length comparison, and geometric annotation understanding.

📦 Installation

Clone our Euclid repo first, set up the environment, then run:

pip install -U "huggingface_hub[cli]"
huggingface-cli download --cache-dir $MODEL_PATH EuclidAI/Euclid-convnext-xxlarge

📚 Documentation

Model Details

Model Description

Euclid is trained on 1.6M synthetic geometry images with high-fidelity question-answer pairs using a curriculum learning approach. It combines a ConvNeXt visual encoder with a Qwen-2.5 language model, connected through a 2-layer MLP multimodal connector.
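
To illustrate the connector idea, here is a dependency-free sketch of a 2-layer MLP that maps each visual patch feature into the language model's embedding space. It is not the released implementation: the GELU activation, the hidden width, the initialization, and the toy dimensions (8 and 4) are assumptions, and a real connector would be written in a tensor library with much larger dimensions.

```python
import math
import random


def gelu(x):
    """Exact GELU; the choice of activation is an assumption."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Compute W @ vec + bias, with weight stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def init_linear(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


class MLPConnector:
    """Two linear layers with a nonlinearity in between: vision_dim -> llm_dim -> llm_dim."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, llm_dim, rng)
        self.w2, self.b2 = init_linear(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        """Project every patch feature to one embedding of size llm_dim."""
        projected = []
        for feat in patch_features:
            hidden = [gelu(h) for h in linear(feat, self.w1, self.b1)]
            projected.append(linear(hidden, self.w2, self.b2))
        return projected


# Toy example: 3 visual patches with 8 features each, projected to width 4.
patches = [[0.1 * (i + j) for j in range(8)] for i in range(3)]
tokens = MLPConnector(vision_dim=8, llm_dim=4)(patches)
print(len(tokens), len(tokens[0]))
```

The output keeps one vector per visual patch, so the language model can consume the projected patches as a sequence of input embeddings alongside the text tokens.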

Model Sources

Repository: https://github.com/euclid-multimodal/Euclid
Paper: https://arxiv.org/abs/2412.08737
Demo: https://euclid-multimodal.github.io/

Uses

The model is trained for precise low-level geometric perception tasks and can perform:

Point-on-line detection
Point-on-circle detection
Angle classification
Length comparison
Geometric annotation understanding

Please refer to our repo for the full input format.
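
To make the task list concrete, the sketch below shows how ground-truth answers for the first four tasks could be computed when the coordinates of a synthetic diagram are known. It is an illustration only: the function names, the tolerance, and the answer labels ("acute", "right", "obtuse", "first", "second", "equal") are assumptions and do not reflect the repo's input or output format.

```python
import math


def point_on_line(p, a, b, tol=1e-6):
    """True if p lies on segment ab: collinear with a and b, and between them."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if abs(cross) > tol:
        return False
    dot = (p[0] - a[0]) * (b[0] - a[0]) + (p[1] - a[1]) * (b[1] - a[1])
    length_sq = (b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2
    return -tol <= dot <= length_sq + tol


def point_on_circle(p, center, radius, tol=1e-6):
    """True if the distance from p to the center equals the radius."""
    return abs(math.dist(p, center) - radius) <= tol


def classify_angle(a, vertex, b, tol=1e-6):
    """Classify the angle a-vertex-b from the sign of the dot product."""
    dot = ((a[0] - vertex[0]) * (b[0] - vertex[0])
           + (a[1] - vertex[1]) * (b[1] - vertex[1]))
    if abs(dot) <= tol:
        return "right"
    return "acute" if dot > 0 else "obtuse"


def longer_segment(seg1, seg2, tol=1e-6):
    """Compare two segments, each given as a pair of endpoints."""
    diff = math.dist(*seg1) - math.dist(*seg2)
    if abs(diff) <= tol:
        return "equal"
    return "first" if diff > 0 else "second"


print(point_on_line((1, 1), (0, 0), (2, 2)))
print(point_on_circle((3, 4), (0, 0), 5))
print(classify_angle((1, 0), (0, 0), (-1, 1)))
print(longer_segment(((0, 0), (3, 4)), ((0, 0), (1, 1))))
```

Because a synthetic generator knows every coordinate exactly, labels like these can be produced without human annotation, which is what makes large-scale question-answer pairs feasible.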

Limitations and Applications

Our model is not designed to handle:

Comprehensive image understanding tasks
Advanced cognitive reasoning beyond geometric analysis

However, the model demonstrates strength in low-level visual perception. This makes it a potentially valuable base model for specialized downstream fine-tuning, including:

Robotic vision and automation systems
Medical imaging and diagnostic support
Industrial quality assurance and inspection
Geometric education and visualization tools

Evaluation Results

Performance on Geoperception benchmark tasks:

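| Model | POL | POC | ALC | LHC | PEP | PRA | EQL | Overall |
|-------|-----|-----|-----|-----|-----|-----|-----|---------|
| Random Baseline | 0.43 | 2.63 | 59.92 | 51.36 | 0.25 | 0.00 | 0.02 | 16.37 |
| Pixtral-12B | 22.85 | 53.21 | 47.33 | 51.43 | 22.53 | 37.11 | 58.45 | 41.84 |
| Gemini-1.5-Pro | 24.42 | 69.80 | 57.96 | 79.05 | 39.60 | 77.59 | 52.27 | 57.24 |
| EUCLID-ConvNeXt-Large | 80.54 | 57.76 | 86.37 | 88.24 | 42.23 | 64.94 | 34.45 | 64.93 |
| EUCLID-ConvNeXt-XXLarge | 82.98 | 61.45 | 90.56 | 90.82 | 46.96 | 70.52 | 31.94 | 67.89 |

All values are accuracy (%). POL = point-on-line, POC = point-on-circle, ALC = angle classification, LHC = length comparison; PEP, PRA, and EQL are the geometric annotation understanding tasks (perpendicular, parallel, and equal).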
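
The Overall column appears to be the unweighted mean of the seven task accuracies. The card does not state this, so treat it as an inference from the numbers; the short check below recomputes the mean for each row using the values from the results table.

```python
# Per-task accuracies copied from the results table, in column order.
TASKS = ["POL", "POC", "ALC", "LHC", "PEP", "PRA", "EQL"]

# model -> (per-task scores, reported Overall)
RESULTS = {
    "Random Baseline": ([0.43, 2.63, 59.92, 51.36, 0.25, 0.00, 0.02], 16.37),
    "Pixtral-12B": ([22.85, 53.21, 47.33, 51.43, 22.53, 37.11, 58.45], 41.84),
    "Gemini-1.5-Pro": ([24.42, 69.80, 57.96, 79.05, 39.60, 77.59, 52.27], 57.24),
    "EUCLID-ConvNeXt-Large": ([80.54, 57.76, 86.37, 88.24, 42.23, 64.94, 34.45], 64.93),
    "EUCLID-ConvNeXt-XXLarge": ([82.98, 61.45, 90.56, 90.82, 46.96, 70.52, 31.94], 67.89),
}


def mean_accuracy(task_scores):
    """Unweighted mean over the task accuracies."""
    return sum(task_scores) / len(task_scores)


for model, (scores, reported) in RESULTS.items():
    assert len(scores) == len(TASKS)
    print(f"{model:24s} mean={mean_accuracy(scores):.2f} reported={reported:.2f}")
```

For every row the recomputed mean matches the reported Overall value to two decimals. Note that an unweighted mean lets the XXLarge model's gains on POL, ALC, and LHC outweigh its lower EQL score relative to the Large variant.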

Citation

If you find Euclid useful for your research and applications, please cite using this BibTeX:

@article{zhang2024euclid,
  title={Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions},
  author={Zhang, Jiarui and Liu, Ollie and Yu, Tianyu and Hu, Jinyi and Neiswanger, Willie},
  journal={arXiv preprint arXiv:2412.08737},
  year={2024}
}

## 📄 License
This project is licensed under the Apache-2.0 license.

| Property | Details |
|----------|---------|
| Base Model | Qwen/Qwen2.5-1.5B-Instruct, laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup |
| Pipeline Tag | question-answering |
| Metrics | accuracy |
| Library Name | transformers |
