🚀 DRAMA-large (0.3B): Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
DRAMA-large (0.3B) is a dense retrieval model built on a pruned large language model backbone. By pruning a large language model and then fine-tuning it, DRAMA achieves efficient and generalizable multilingual text retrieval. Despite having only 0.3B non-embedding parameters, it performs strongly on both English and multilingual retrieval tasks, thanks to high-quality data augmentation from large language models.
The default embedding size of drama-large is 1024. With Matryoshka Representation Learning, the dimensionality can be flexibly truncated to 512 or 256.
Check our paper for more details.
🚀 Quick Start
✨ Features
- Built on a pruned large language model backbone for efficient retrieval.
- Fine-tuned for multilingual text retrieval.
- Achieves strong performance in both English and multilingual tasks with a compact size.
- Supports flexible dimensionality with Matryoshka Representation Learning.
📦 Installation
The usage examples below require only the transformers or sentence-transformers Python package.
💻 Usage Examples
Basic Usage
Below are examples of using drama-large to encode query and document examples from the MIRACL dataset, using either Transformers or Sentence Transformers:
Transformers
import torch
from transformers import AutoTokenizer, AutoModel
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
model_name = "facebook/drama-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
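# trust_remote_code=True loads the model's custom drama_modeling.py, which provides
# the encode_queries / encode_documents helpers used below.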
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)
query_embs = model.encode_queries(tokenizer, queries)
doc_embs = model.encode_documents(tokenizer, documents)
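# Relevance scores: dot product between each query embedding and each document embedding.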
scores = query_embs @ doc_embs.T
print(scores.tolist())
⚠️ Important Note
Setting trust_remote_code=True uses our customized drama_modeling.py, which differs in two details:
- We use bi-directional attention instead of uni-directional attention.
- We add "Query: " as a prefix to query text. (No prefix is added to documents.)
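For readability, here is a minimal sketch that prints the best-matching document for each query; it simply reuses the queries, documents, and scores variables from the snippet above:
# scores is a (num_queries x num_documents) matrix; a higher score means a better match.
for query, row in zip(queries, scores.tolist()):
    best_idx = max(range(len(row)), key=lambda i: row[i])
    print(query, "->", documents[best_idx][:60])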
DRAMA models are trained using Matryoshka Representation Learning (MRL) to support flexible dimensionality. Both queries and documents can be encoded into smaller dimensions, such as 256, using the following:
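# dim=256 requests 256-dimensional (MRL-truncated) embeddings.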
query_embs = model.encode_queries(tokenizer, queries, dim=256)
doc_embs = model.encode_documents(tokenizer, documents, dim=256)
scores = query_embs @ doc_embs.T
print(scores.tolist())
Sentence Transformers
from sentence_transformers import SentenceTransformer
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
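# trust_remote_code=True loads the custom drama_modeling.py (bi-directional attention).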
model = SentenceTransformer("facebook/drama-large", trust_remote_code=True)
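# prompt_name="query" applies the "Query: " prefix to queries; documents are encoded without a prefix.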
query_embs = model.encode(queries, prompt_name="query")
doc_embs = model.encode(documents)
scores = model.similarity(query_embs, doc_embs)
print(scores.tolist())
⚠️ Important Note
- Setting trust_remote_code=True uses our customized drama_modeling.py, which uses bi-directional attention instead of uni-directional attention.
- For queries, you have to use prompt_name="query" to select the prompt named "query", or prompt="Query: " to specify the prompt string manually.
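For retrieval over a larger collection, the same embeddings can also be fed to Sentence Transformers' built-in top-k search utility. A minimal sketch reusing query_embs, doc_embs, queries, and documents from the snippet above (top_k=1 is an arbitrary choice; by default semantic_search ranks with cosine similarity):
from sentence_transformers import util

# Find the single best document for each query.
hits = util.semantic_search(query_embs, doc_embs, top_k=1)
for query, query_hits in zip(queries, hits):
    best = query_hits[0]
    print(query, "->", documents[best["corpus_id"]], f"(score {best['score']:.3f})")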
DRAMA models are trained using Matryoshka Representation Learning (MRL) to support flexible dimensionality. Both queries and documents can be encoded into smaller dimensions, such as 256, using the following:
from sentence_transformers import SentenceTransformer
queries = [
'What percentage of the Earth\'s atmosphere is oxygen?',
'意大利首都是哪里?',
]
documents = [
"The amount of oxygen in the atmosphere has fluctuated over the last 600 million years, reaching a peak of 35% during the Carboniferous period, significantly higher than today's 21%.",
"羅馬是欧洲国家意大利首都和罗马首都广域市的首府及意大利全国的政治、经济、文化和交通中心,位于意大利半島中部的台伯河下游平原地,建城初期在七座小山丘上,故又名“七丘之城”。按城市范围内的人口计算,罗马是意大利人口最多的城市,也是欧盟人口第三多的城市。",
]
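# truncate_dim=256 makes encode return 256-dimensional (MRL-truncated) embeddings.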
model = SentenceTransformer("facebook/drama-large", truncate_dim=256, trust_remote_code=True)
query_embs = model.encode(queries, prompt_name="query")
doc_embs = model.encode(documents)
scores = model.similarity(query_embs, doc_embs)
print(scores.tolist())
📚 Documentation
The model has been evaluated on multiple retrieval benchmarks, including [BEIR](https://github.com/beir-cellar/beir), [MIRACL](https://github.com/project-miracl/miracl), MLDR, and several multilingual retrieval tasks in [MTEB](https://github.com/embeddings-benchmark/mteb). It shows strong performance in both English and multilingual retrieval tasks.
The drama-large model released on this page corresponds to DRAMA-0.3B, with 265M non-embedding parameters.
🔧 Technical Details
DRAMA-large was initialized from [Llama3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) (originally pruned from [Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)). During pruning and retriever training, the training data covered the following 20 languages (sorted alphabetically):
Arabic, Bengali, Chinese, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Portuguese, Russian, Spanish, Swahili, Telugu, Thai, Yoruba
Other languages may see degraded performance.
📄 License
The model is licensed under cc-by-nc-4.0.
📖 Citation
If you find our paper or models helpful, please consider citing as follows:
@article{drama,
title={{Drama}: Diverse Augmentation from Large Language Models To Smaller Dense Retrievers},
author={Ma, Xueguang and Lin, Victoria Xi and Oguz, Barlas and Lin, Jimmy and Yih, Wen-tau and Chen, Xilun},
journal={arXiv:2502.18460},
year={2025}
}