đ Atla Selene Mini
Atla Selene Mini is a state-of-the-art small language model-as-a-judge (SLMJ). It achieves performance comparable to models 10x its size and outperforms GPT-4o on multiple benchmarks.
đ Quick Start
Installation
You can install the necessary libraries and load the model as follows:
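If the libraries are not already installed, they can typically be obtained with `pip install transformers accelerate torch` (this card does not pin versions; `accelerate` is assumed here because the loading code uses `device_map="auto"`).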
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model_id = "AtlaAI/Selene-1-Mini-Llama-3.1-8B"
# Load the model weights across available devices, plus the matching tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
Usage
prompt = "I heard you can evaluate my responses?"
messages = [{"role": "user", "content": prompt}]
# Apply the Llama 3 chat template and move the inputs to the GPU
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
# Generate, strip the prompt tokens from the output, then decode
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
⨠Features
- High performance: Atla Selene Mini achieves performance comparable to models 10x its size and outperforms GPT-4o on RewardBench, EvalBiasBench, and AutoJ.
- Multilingual support: Supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Versatile evaluation: Can be used as a general-purpose evaluation model, supporting different inputs and scoring scales, generating structured evaluation outputs, and providing qualitative critiques with reasoning.
- Long context length: Has a context length of 128K tokens.
đĻ Model Details
| Property | Details |
|----------|---------|
| Developed by | Atla |
| Model Type | Post-trained from Llama-3.1-8B |
| Language(s) (NLP) | Primarily English but supports German, French, Italian, Portuguese, Hindi, Spanish, Thai |
| Context length | 128K |
đģ Usage Examples
Basic Usage
The basic loading and generation example is shown in the Quick Start section above.
Advanced Usage
To achieve the best results, use the prompts we used for training, available here, and remember to apply the Llama 3 conversation template.
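As an illustration, here is a minimal sketch of sending an evaluation request through the Llama 3 chat template. It reuses the `model`, `tokenizer`, and `device` from the Quick Start, and the rubric wording below is a hypothetical placeholder rather than one of the actual training prompts:
# NOTE: the rubric below is an illustrative placeholder, not an official Selene prompt.
eval_prompt = """You are evaluating a response to an instruction against a scoring rubric.
Give a score from 1 to 5, followed by a brief critique explaining your reasoning.
Instruction: {instruction}
Response: {response}"""
messages = [{
    "role": "user",
    "content": eval_prompt.format(
        instruction="Summarise the water cycle in one sentence.",
        response="Water evaporates, condenses into clouds, and falls back as rain or snow.",
    ),
}]
# apply_chat_template inserts the Llama 3 special tokens, so the model sees the conversation format it expects
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True)
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])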
đ Documentation
- Absolute scoring: Try our cookbook to get started.
- RAG hallucination: Check out our cookbook for this use case.
đ License
This model is licensed under the Apache 2.0 license.
đ Contact
đ Citation
@misc{alexandru2025atlaseleneminigeneral,
title={Atla Selene Mini: A General Purpose Evaluation Model},
author={Andrei Alexandru and Antonia Calvi and Henry Broomfield and Jackson Golden and Kyle Dai and Mathias Leys and Maurice Burger and Max Bartolo and Roman Engeler and Sashank Pisupati and Toby Drane and Young Sun Park},
year={2025},
eprint={2501.17195},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.17195},
}
â ī¸ Important Note
Remember to apply the Llama 3 conversation template (as done via `tokenizer.apply_chat_template` in the examples above); not doing so might lead to unexpected behavior.
đĄ Usage Tip
To achieve the best results, use the prompts we used for training, available here.