SAUTE
Developed by JustinDuc
SAUTE is a lightweight Transformer architecture with speaker perception ability, designed for effectively modeling multi-speaker dialogues.
Downloads: 229
Release time: 6/9/2025
Model Overview
SAUTE combines EDU-level utterance embeddings, speaker-sensitive memory, and an efficient linear attention mechanism to encode rich dialogue context with minimal overhead, making it well suited to multi-turn dialogues, multi-speaker interactions, and long-distance dialogue dependencies.
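The combination of utterance embeddings and speaker-sensitive memory can be illustrated with a minimal sketch. This is a hypothetical simplification, not SAUTE's actual code: `encode_dialogue`, `EMB_DIM`, and the running-average memory update are all illustrative assumptions.

```python
# Minimal sketch of a speaker-aware encoding flow (illustrative only;
# the function name, EMB_DIM, and the memory update rule are assumptions).
import numpy as np

EMB_DIM = 8

def encode_dialogue(utterance_embs, speaker_ids):
    """Keep one memory vector per speaker and mix it into each utterance
    embedding, yielding a speaker-aware contextual representation."""
    memories = {}   # speaker id -> running memory vector
    contextual = []
    for emb, spk in zip(utterance_embs, speaker_ids):
        mem = memories.get(spk, np.zeros(EMB_DIM))
        contextual.append(emb + mem)           # condition on that speaker's history
        memories[spk] = 0.5 * mem + 0.5 * emb  # simple running-average update
    return np.stack(contextual)

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, EMB_DIM))
out = encode_dialogue(embs, ["A", "B", "A", "B"])
print(out.shape)  # (4, 8)
```

Because each speaker's memory is stored separately, the third utterance (speaker A again) is conditioned only on A's earlier turn, which is the structured per-speaker context the overview describes.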
Model Features
Speaker-aware memory
Represents the dialogue context for each speaker in a structured way
Linear attention mechanism
Efficient and scalable to long dialogues, avoiding the quadratic cost of full self-attention
Compatible with pre-trained Transformers
Can be attached to a frozen or fine-tuned BERT encoder
Lightweight design
Fewer parameters yet stronger performance than a traditional multi-layer Transformer
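The linear attention mentioned above can be sketched with the standard kernel trick: applying a positive feature map to queries and keys and reassociating the matrix products drops the cost from O(n²) to O(n) in sequence length. The feature map below is an illustrative assumption, not necessarily SAUTE's exact choice.

```python
# Sketch of linear attention via the kernel trick (assumed feature map;
# not SAUTE's exact formulation).
import numpy as np

def phi(x):
    return np.maximum(x, 0) + 1e-6  # simple positive feature map

def linear_attention(Q, K, V):
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V             # (d, d_v): summarize keys/values once
    Z = Qp @ Kp.sum(axis=0)   # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]  # linear in n; no n x n attention matrix

n, d = 6, 4
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (6, 4)
```

By associativity, `Qp @ (Kp.T @ V)` equals `(Qp @ Kp.T) @ V`, so the result matches the quadratic formulation while never materializing the n×n matrix.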
Model Capabilities
Multi-speaker dialogue modeling
Capturing long-distance dialogue dependencies
Masked language modeling
Generating utterance-level embeddings
Use Cases
Dialogue systems
Multi-turn dialogue understanding
Tracks the context of different speakers in complex dialogues, with a significant improvement in masked language modeling (MLM) accuracy on the SODA dataset
Meeting record analysis
Identifies and distinguishes the contributions of multiple participants