EchoLLaMA-1B Open-Source Multimodal AI - 3D Vision to Speech, Voice Conversation Interaction Supported

Echollama 1B

Developed by AquaLabs

EchoLLaMA is a multimodal AI system capable of converting 3D visual data into natural speech descriptions while supporting interactive dialogue through voice input.

Image-to-Text

Transformers

#3D Scene Voice Synthesis #Multimodal AI System #Depth-Aware Description

Downloads 75

Release Time : 3/31/2025

Model Overview

Implementation based on the LLaMA-3.2-1B-Instruct model, fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes.

Model Features

3D Object Detection Matrix

Constructs grid-based spatial coordinate representations for detected objects

Depth-Aware Scene Understanding

Integrates relative depth values to capture 3D spatial relationships

Natural Language Generation

Generates coherent and context-rich descriptions

High-Quality Voice Synthesis

Converts text descriptions into natural and fluent speech

Model Capabilities

3D Scene Description Generation

Voice Interaction

Multimodal Data Processing

Object Detection

Depth Estimation

Use Cases

Assistive Technology

Visual Assistance

Provides environmental descriptions for visually impaired individuals

Helps users understand their surroundings through voice output

Smart Home

Smart Environment Interaction

Interacts with smart home systems via voice

Enables natural language control of home devices

🚀 EchoLLaMA: 3D-to-Speech with Multimodal AI

EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions and enables interactive dialogue through speech input.

🚀 Quick Start

To get started with EchoLLaMA, you need to install the necessary components.

📦 Installation

# Clone the repository
git clone https://github.com/The-Aqua-Labs/EchoLLaMA-Pipeline.git
cd EchoLLaMA-Pipeline

After cloning the repository, run the Jupyter Notebook file.

✨ Features

3D Object Detection Matrix: Constructs a grid-based representation of detected objects with spatial coordinates
Depth-Aware Scene Understanding: Incorporates relative depth values to capture 3D relationships
Natural Language Generation: Produces coherent and contextually rich descriptions
High-Quality Speech Synthesis: Converts textual descriptions into natural-sounding speech

📚 Documentation

Overview

EchoLLaMA is a multimodal AI system that transforms 3D visual data into natural spoken descriptions while enabling interactive dialogue through speech input. This repository contains the implementation of the LLaMA-3.2-1B-Instruct model fine-tuned with Direct Preference Optimization (DPO) for generating rich textual descriptions of 3D scenes.

Model Architecture

The EchoLLaMA pipeline integrates four specialized models:

Image Analysis:
- DETR (DEtection TRansformer) for object detection
- MiDaS for monocular depth estimation
- Moondream for holistic image captioning
Text Generation:
- LLaMA-3.2-1B-Instruct fine-tuned with DPO
Speech Synthesis:
- Orpheus-3B-0.1-ft TTS model fine-tuned on the Elise English speech dataset
Speech Recognition:
- SpeechRecognition package for transcribing user speech input

Pipeline Flow

Image is processed with DETR for object detection and MiDaS for depth estimation
Moondream generates a caption describing the image content
The object detection matrix and caption are combined into a prompt
LLaMA-3.2-1B-Instruct generates a detailed textual description
Orpheus-3B-0.1-ft converts the text into speech

Dataset

The training dataset contains 1999 samples, each consisting of:

An image-derived prompt with object detection matrix and caption
A chosen response from DeepSeek-V3-0324
A rejected response from LLaMA-3.2-1B-Instruct

You can access the dataset at AquaLabs/Spatial-DPO-Dataset

Model Weights

LLaMA-3.2-1B-Instruct (fine-tuned): AquaLabs/EchoLLaMA-1B
Orpheus-3B-0.1-ft (fine-tuned): AquaLabs/Orpheus-3B-0.1-ft-Elise

🔧 Technical Details

LLaMA Model

The LLaMA-3.2-1B-Instruct model was fine-tuned using:

Technique: Direct Preference Optimization (DPO) with LoRA
Dataset: 2000 samples from COCO 2017 processed with DETR, and Moondream
Chosen Responses: Generated by DeepSeek-V3-0324
Rejected Responses: Generated by pre-fine-tuned LLaMA-3.2-1B-Instruct
Training Parameters:
- LoRA Rank: 8
- β (DPO): 0.1
- Learning Rate: 2×10⁻⁵ with cosine decay
- Batch Size: 16 (with 2×8 accumulation)
- Sequence Length: 8192
Hardware: 2×T4 GPU
Training Time: 1 hour 40 minutes

Orpheus Model

The Orpheus-3B-0.1-ft TTS model was fine-tuned using:

Technique: Low-Rank Adaptation (LoRA)
Dataset: Elise English speech dataset
Training Parameters:
- LoRA Rank (r): 64
- LoRA Alpha (α): 64
- LoRA Dropout: 0
- Learning Rate: 2×10⁻⁴
Hardware: 2×T4 GPU
Training Time: 47 minutes

📄 License

This project is licensed under the Apache-2.0 License. Details are provided in the paper.

👥 Contributors

Ahmet Erdem Pamuk - GitHub | Hugging Face
Emir Kaan Özdemir - GitHub | Hugging Face
Şuayp Talha Kocabay - GitHub | Hugging Face

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご