Fish-Agent-v0.1-3B Open-source Speech Model - Accurately Capture Environmental Audio, Support Text-to

Fish Agent V0.1 3B

Developed by fishaudio

A groundbreaking speech-to-speech model capable of accurately capturing and generating environmental audio information, while featuring advanced text-to-speech capabilities.

Speech Synthesis, Supports Multiple Languages

#Non-semantic speech generation #Multilingual TTS #Environmental audio modeling

Downloads 653

Release date: 10/29/2024

Model Overview

Fish Agent V0.1 3B is a versatile speech processing model that supports speech-to-speech and text-to-speech tasks. Its non-semantic token architecture eliminates reliance on traditional semantic encoders/decoders.

Model Features

Non-semantic token architecture

Eliminates reliance on traditional semantic encoders/decoders like Whisper or CosyVoice for more efficient speech processing

Multilingual support

Supports speech processing in 8 languages: English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic

Large-scale training data

Trained on roughly 700,000 hours of multilingual audio

Versatile speech processing

Supports both speech-to-speech and text-to-speech tasks, covering a broad range of applications

Model Capabilities

Speech-to-speech

Text-to-speech

Speech-to-text

Multilingual speech processing

Use Cases

Speech synthesis

Multilingual speech synthesis

Convert text into natural and fluent speech output

Supports speech synthesis in 8 languages

Voice conversion

Voice style conversion

Transform input speech into output with different styles or characteristics

🚀 Fish Agent V0.1 3B

Fish Agent V0.1 3B is a voice-to-voice model designed to capture and generate environmental audio information accurately. Its semantic-token-free architecture sets it apart: it does not need traditional semantic encoders/decoders such as Whisper and CosyVoice. It is also a text-to-speech (TTS) model, trained on 700,000 hours of multilingual audio content. The model was produced by continued pretraining of Qwen-2.5-3B-Instruct on 200B voice and text tokens.

🚀 Quick Start

For detailed information and implementation guidelines, please visit the Fish Speech GitHub repository: https://github.com/fishaudio/fish-speech

✨ Features

High-accuracy Audio Processing: Capable of capturing and generating environmental audio information with unprecedented accuracy.
Semantic-token-free Architecture: Eliminates the need for traditional semantic encoders/decoders.
Multilingual TTS: Trained on 700,000 hours of multilingual audio content.

📚 Documentation

Supported Languages

The model supports the following languages with their respective training data sizes:

Language	Training data size
English (en)	~300,000 hours
Chinese (zh)	~300,000 hours
German (de)	~20,000 hours
Japanese (ja)	~20,000 hours
French (fr)	~20,000 hours
Spanish (es)	~20,000 hours
Korean (ko)	~20,000 hours
Arabic (ar)	~20,000 hours
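
The per-language figures are approximate, so they need not add up exactly to the 700,000-hour headline number (they total about 720,000 hours). The following is a minimal Python sketch that sums the table, reports each language's share, and checks whether a language code is listed. The names `TRAINING_HOURS`, `total_hours`, `share`, and `is_supported` are illustrative only and are not part of Fish Speech.

```python
# Approximate training hours per language, copied from the table above.
# The dictionary and helper names are hypothetical, for illustration only.
TRAINING_HOURS = {
    "en": 300_000,
    "zh": 300_000,
    "de": 20_000,
    "ja": 20_000,
    "fr": 20_000,
    "es": 20_000,
    "ko": 20_000,
    "ar": 20_000,
}


def total_hours(hours=TRAINING_HOURS):
    """Sum the approximate hours over all listed languages."""
    return sum(hours.values())


def share(lang, hours=TRAINING_HOURS):
    """Fraction of the listed training data that belongs to one language."""
    return hours[lang] / total_hours(hours)


def is_supported(lang, hours=TRAINING_HOURS):
    """True if the language code appears in the table."""
    return lang in hours


if __name__ == "__main__":
    print(f"total: {total_hours():,} hours")
    for code in TRAINING_HOURS:
        print(f"{code}: {share(code):.1%}")
```

On these rounded figures, English and Chinese each account for about 42% of the data, and each of the other six languages for about 3%.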

Citation

If you find this repository helpful in your work, please consider citing:

@misc{fish-agent-0.1,
    author = {Shijia Liao and Tianyu Li and Rcell and others},
    title = {Fish Agent V0.1 3B},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/fishaudio/fish-speech}}
}

📄 License

This model and its associated code are released under the CC BY-NC-SA 4.0 license, which allows non-commercial use with appropriate attribution and requires derivatives to be shared under the same license.
