Dewey_en_beta Open-Source Long Context Embedding Model - Super-Long Window Empowers Long Document Retrieval!

Dewey En Beta

Developed by infgrad

Dewey is a novel long-context embedding model based on the ModernBERT architecture, supporting a 128k context window and excelling in long-document retrieval tasks.

Text Embedding

Transformers

English | Open Source License: MIT | #128k long text processing #multi-vector retrieval #instruction-based embedding

Downloads 447

Release Time : 3/23/2025

Model Overview

The Dewey model focuses on improving retrieval performance in long-document scenarios. It employs instruction-based training to align embeddings with tasks, supports both single-vector and multi-vector representations, and features a flexible text chunking mechanism.

Model Features

Ultra-long context support

Supports processing of ultra-long contexts up to 128k tokens.

Multi-vector representation

Supports ColBERT-like multi-vector representation, but with far fewer vectors (only 0.5% of the token count).

Efficient encoding

Benefits from the advantages of the ModernBERT architecture, maintaining efficiency even during long-text encoding.

Flexible chunking

Supports fully customizable text chunking strategies to adapt to different application scenarios.

Model Capabilities

Long-document retrieval

Semantic similarity calculation

Text classification

Text clustering

Use Cases

Information retrieval

Long-document retrieval

Efficient retrieval in databases containing ultra-long documents.

Achieved a score of 0.86 in the LongEmbed benchmark, surpassing multiple commercial models.

Semantic analysis

Semantic similarity calculation

Calculates semantic similarity between texts.

Performed excellently in short-text evaluation (MTEB-eng-v2), surpassing multiple 7B-scale models.

🚀 Dewey Long Context Embedding Model: A Technical Report

In this technical report, we introduce Dewey, a novel long context embedding model designed to enhance retrieval performance in long document scenarios. It builds upon the ModernBERT architecture and incorporates an instruction-based training approach.

🚀 Quick Start

The Dewey Long Context Embedding Model was presented in the paper "Dewey Long Context Embedding Model: A Technical Report". The released model was developed in cooperation with Richinfo and trained using a novel approach. Although we do not yet fully understand the underlying principles, the results are promising, so we have decided to open-source the model and hope that others will test it and give us feedback!

The technical report: https://arxiv.org/abs/2503.20376

The core training method of this model will be implemented in the RAG-Retrieval repository, open-sourced by the NovaSearch Team. Stars are welcome!

✨ Features

Max length and parameter size: The maximum length is 128k tokens, the parameter size is 395M, and only English is supported.
Vector representation: Supports both single-vector and multi-vector representations (similar to ColBERT, but with far fewer vectors: only 0.5% of the number of tokens).
Short text performance: Achieves strong results on the short text evaluation (MTEB-eng-v2) without using the MTEB training set, even surpassing several 7B-sized models.
Long text performance: On the long text evaluation LongEmbed, the single-vector variant surpasses many large and commercial models. With multi-vector, the model takes first place by average score: 0.86, compared with 0.79 for the current first-place model.
Encoding speed: Thanks to the architectural advantages of ModernBERT, encoding remains very fast even for long texts.
Flexible combination: The multi-vector representation works at the span or chunk level, not the token level, so how chunks are specified can be fully customized for your own scenario.

💻 Usage Examples

Basic Usage

We suggest you read the following contents with the model architecture diagram.

[Figure: model architecture diagram]

We recommend reading modeling_dewey_v1.py and custom_st.py carefully. The code is easy to read and will help you a lot!

Prompts

Our model is an instruction-based embedding model, so you should add a prompt before the text when using it.

For STS tasks, you MUST use our provided prompt: <|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>
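As a minimal sketch of the prompt convention, the helper below prepends an instruction prompt to raw text. The prompt strings are the ones quoted in this document; the `PROMPTS` dictionary, its keys, and the `with_prompt` function are hypothetical names introduced only for illustration.

```python
# Instruction prompts quoted in this document; the dict keys are hypothetical labels.
PROMPTS = {
    "sts": "<|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>",
    "retrieval_query": "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>",
    "retrieval_passage": "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>",
}


def with_prompt(task: str, text: str) -> str:
    """Prepend the instruction prompt for `task` to `text`."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task {task!r}; expected one of {sorted(PROMPTS)}")
    return PROMPTS[task] + text


print(with_prompt("retrieval_query", "why the sky is blue"))
```

Keeping the prompts in one place makes it harder to mix up the query-side and passage-side instructions.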

Single Vector

For single-vector use, our model is compatible with SentenceTransformer.

import os

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
import torch
from sentence_transformers import SentenceTransformer

RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = SentenceTransformer(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2"
    },
    config_kwargs={"single_vector_type": "mean"}
).cuda().bfloat16().eval()
# the choice of single_vector_type:
## for short text (<1k): cls_add_mean
## for long text (>1k): mean

# the max length of model is 128*1024
model.max_seq_length = 32 * 1024

query_vectors = model.encode(
    sentences=[f"{RETRIEVE_Q_PROMPT}What is a computer composed of?", f"{RETRIEVE_Q_PROMPT}why the sky is blue"]
)
passage_vectors = model.encode(
    sentences=[
        f"{RETRIEVE_P_PROMPT}Central processing unit (CPU), memory (RAM), storage (hard drive or SSD), input/output devices (keyboard, mouse, monitor), and a motherboard",
        f"{RETRIEVE_P_PROMPT}Shorter wavelengths of light, such as blue and violet, are scattered more by gases and particles in Earth's atmosphere.",
    ]
)

print(query_vectors @ passage_vectors.T)
# the output is:
# [[0.52512825 0.19771025]
#  [0.17617573 0.5918883 ]]
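To turn a score matrix like the one printed above into a ranking, one option is to sort the passages for each query by score. The sketch below copies the printed scores into a literal array so that it runs without loading the model; the variable names are illustrative and not part of the model's API.

```python
import numpy as np

# Scores copied from the printed output above: rows are queries, columns are passages.
scores = np.array([
    [0.52512825, 0.19771025],
    [0.17617573, 0.5918883],
])

# Passage indices sorted from best to worst for each query.
ranking = np.argsort(-scores, axis=1)
best_passage = ranking[:, 0]

for q_idx, p_idx in enumerate(best_passage):
    print(f"query {q_idx} -> passage {p_idx} (score {scores[q_idx, p_idx]:.4f})")
```

Here each query is matched to the passage that actually answers it, which is what the larger diagonal entries indicate.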

Multi Vectors

Our multi-vectors are based on text spans (i.e. chunks), so each vector can be considered a contextual chunk vector. To get the multi-vectors of a document, you first need its chunks and their spans.

The detailed steps for getting multi-vectors are:

Step 1: Chunk the document to get chunks and spans. You can use our encode function for this, or chunk documents yourself according to your scenario. Note that if you chunk by yourself, your chunks and spans must not include the prompt!
Step 2: Encode the text to get token embeddings.
Step 3: Use each span (i.e. start_position and end_position) to get a chunk vector. We use the mean of the span's token embeddings as the chunk vector, i.e. normalize(token_embed[start_position:end_position].mean(axis=0)).
Step 4: Repeat Step 3 for every span until you have all chunk vectors. You can also add span(0, 1) and span(1+prompt_len, text_len-1) to get global vectors.
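The span pooling described in the steps above can be sketched with NumPy alone. This is a minimal illustration on random data, not the model's implementation: the names `l2_normalize` and `pool_spans`, the toy shapes, and the example spans are assumptions, and the sketch leaves out any projection layers the model may apply (the `module_name` fields in the later examples suggest such layers exist).

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def pool_spans(token_embed: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool token embeddings over each [start, end) span, then normalize."""
    return np.stack([l2_normalize(token_embed[s:e].mean(axis=0)) for s, e in spans])


# Toy stand-in for the token embeddings: 12 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
token_embed = rng.normal(size=(12, 8))

# Hypothetical spans: a [CLS] span (0, 1) plus two overlapping chunk spans.
spans = [(0, 1), (4, 8), (7, 11)]
chunk_vectors = pool_spans(token_embed, spans)
print(chunk_vectors.shape)  # (3, 8)
```

Each row of `chunk_vectors` is one unit-length vector per span, so a dot product with a normalized query vector gives a cosine similarity.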

For retrieval tasks, the query should be a single vector, so the final score between a query and a document is the maximum score between the query vector and each of the document's vectors. This is compatible with FAISS, Milvus, and similar vector stores: just enlarge the top-k and de-duplicate the retrieved documents.
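The "enlarge the top-k, then de-duplicate" idea can be sketched with a brute-force NumPy search standing in for a vector store such as FAISS or Milvus. The function name `search_documents`, the `oversample` factor, and the toy data are assumptions for illustration only.

```python
import numpy as np


def search_documents(query_vec, chunk_vecs, chunk_doc_ids, top_k=2, oversample=4):
    """Return up to `top_k` (doc_id, score) pairs, scoring each document by its best chunk."""
    scores = chunk_vecs @ query_vec                   # one score per chunk vector
    n_candidates = min(len(scores), top_k * oversample)
    candidates = np.argsort(-scores)[:n_candidates]   # enlarged top-k over chunks
    results = {}
    for idx in candidates:                            # best-first, so the first hit per doc is its max
        doc_id = chunk_doc_ids[idx]
        if doc_id not in results:
            results[doc_id] = float(scores[idx])
        if len(results) == top_k:
            break
    return list(results.items())


# Toy index: three documents with 2, 1, and 2 chunk vectors (2-D for readability).
chunk_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6], [-1.0, 0.0]])
chunk_doc_ids = ["doc_a", "doc_a", "doc_b", "doc_c", "doc_c"]
query_vec = np.array([0.6, 0.8])

print(search_documents(query_vec, chunk_vecs, chunk_doc_ids))
```

Because candidates are visited best-first, the first score recorded for a document is its maximum over chunks. If `oversample` is too small, a document whose best chunk ranks low among all chunks can be missed, so the factor is a recall and cost trade-off.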

Chunk text in the `encode` function

You can use the encode method of our model directly to get multi-vectors; it chunks the text automatically. Choose the chunking strategy with the fast_chunk parameter: if fast_chunk is true, the text is chunked directly on input ids; otherwise RecursiveCharacterTextSplitter is used.

import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int
    e: int
    text: Optional[str] = None
    module_name: str


RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).cuda().bfloat16()
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

q_list = ["why the sky is blue"]
p_list = [
    """
    I’ve been trying to understand why the sky changes colors, and I think I understand most of it, but something in the online explanations doesn’t make it clear for me:

I’ve read:

sky is blue because blue light gets scattered the most during the day.

in the evening it turns red because now even more of the blue light gets scattered

So a few questions:

The scattering of light during the day: does it mean that blue light gets reflected off air particles and reaches our eyes, while the rest of the frequencies pass through and reach the ground?

Surely some of the other frequencies also get scattered during the day, just in much smaller amounts?

So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?

And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?\

Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?\

It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?\

Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?

Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere. Blue is scattered more than other colors because it travels as shorter, smaller waves. 
This is why we see a blue sky most of the time. Closer to the horizon, the sky fades to a lighter blue or white.
    """
]

# query should be a single vector, so we set chunk_size as -1 to avoid chunk.
# If chunk size is -1, the model will return an array with shape of (2,2048) consisting of cls_vector and mean_vector(mean of all token embeddings).
query_vectors = model.encode(
    sentences=q_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=32,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False

)[0]
# The query does not need multi vectors; keep only the mean vector (row 1) as its single vector.
query_vectors = [vecs[1:2, :] for vecs in query_vectors]

# spans_list contains each chunk's span; you can use a span to recover the chunk text
spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, spans_list = model.encode(
    sentences=p_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=8,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_P_PROMPT,
    fast_chunk=True,  # if fast_chunk is true, directly chunk on input ids, else using RecursiveCharacterTextSplitter
)
# spans_list stores each passage's spans, passage_vectors_list stores each passage's vectors so len(spans_list) == len(p_list) and len(spans_list) == len(passage_vectors_list)
# for a passage's spans and vectors, each span corresponds to a vector (1*2048). So, len(spans_list[idx]) ==  len(passage_vectors_list[idx])
print((query_vectors[0] @ passage_vectors_list[0].T).max())
# output 0.7331543
# get each chunk's content by decoding the token ids covered by its span
for spans, passage in zip(spans_list, p_list):
    text_ids = model.tokenizer.encode(RETRIEVE_P_PROMPT + passage)
    for span in spans:
        s, e = span.s, span.e
        chunk_text = model.tokenizer.decode(
            text_ids[s:e],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()
        print(chunk_text)
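To make the fast_chunk idea of chunking directly on input ids more concrete, here is a minimal sketch of fixed-size windows with overlap over token positions. It is an assumption about the general technique, not the model's actual chunker, which may differ in details such as how it treats the final short chunk. The `sliding_spans` function and its `offset` parameter (meant to skip the [CLS] and prompt tokens) are hypothetical.

```python
def sliding_spans(num_tokens: int, chunk_size: int, chunk_overlap: int, offset: int = 0):
    """Return (start, end) token spans of at most `chunk_size` tokens, overlapping by `chunk_overlap`."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    spans = []
    start = 0
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)
        spans.append((offset + start, offset + end))
        if end == num_tokens:
            break
        start += step
    return spans


# 150 content tokens, with chunk_size=64 and chunk_overlap=8 as in the example above.
print(sliding_spans(150, chunk_size=64, chunk_overlap=8))
# [(0, 64), (56, 120), (112, 150)]
```

Each span starts 56 tokens after the previous one, so neighbouring chunks share 8 tokens of context.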

Chunk text by yourself

If you want to chunk text yourself, just set the batch_text_spans parameter in the encode function.

import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int
    e: int
    text: Optional[str] = None
    module_name: str


prompt = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

# load model
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

# chunk text
passage = "this sentence 1. this sentence 2. this sentence 3"
chunks = ["this sentence 1. this sentence 2.", "this sentence 2. this sentence 3"]
prompt_length = len(model.tokenizer.tokenize(prompt))
text_spans = [
    # s=0, e=1 means that this vector is cls vector, so the module_name is cls_linear, otherwise the module_name is chunk_linear
    TextSpan(s=0, e=1, module_name="cls_linear")
]
for chunk in chunks:
    s = passage.find(chunk)
    e = s + len(chunk)
    text_spans.append(
        TextSpan(
            # add 1, as there is a [CLS] token at the beginning of text.
            s=1 + prompt_length + len(model.tokenizer.tokenize(passage[:s])),
            e=1 + prompt_length + len(model.tokenizer.tokenize(passage[:e])),
            module_name="chunk_linear"
        )
    )

spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, _ = model.encode(
    sentences=[passage],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=prompt,
    fast_chunk=True,
    batch_text_spans=[text_spans]
)
print(passage_vectors_list[0].shape, passage_vectors_list[0][:, 2])
# the output is (3, 2048) [0.01461297 0.02085092 0.0022509 ]
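The span arithmetic in the example above (character offsets converted to token positions, shifted by one [CLS] token plus the prompt length) can be checked without downloading the model. The sketch below swaps in a whitespace tokenizer as a stand-in, so the token counts are not those of the real tokenizer; `toy_tokenize`, `chunk_token_span`, and the prompt length of 3 are hypothetical.

```python
def toy_tokenize(text: str) -> list[str]:
    """Whitespace tokenizer used only as a stand-in for the model's tokenizer."""
    return text.split()


def chunk_token_span(passage: str, chunk: str, prompt_length: int) -> tuple[int, int]:
    """Map a chunk's character offsets in `passage` to a (start, end) token span."""
    s = passage.find(chunk)
    if s < 0:
        raise ValueError("chunk not found in passage")
    e = s + len(chunk)
    base = 1 + prompt_length  # 1 for the leading [CLS] token, then the prompt tokens
    return (base + len(toy_tokenize(passage[:s])), base + len(toy_tokenize(passage[:e])))


passage = "this sentence 1. this sentence 2. this sentence 3"
chunks = ["this sentence 1. this sentence 2.", "this sentence 2. this sentence 3"]
prompt_length = 3  # hypothetical prompt token count for this toy example

# The first span (0, 1) is the [CLS] position, followed by one span per chunk.
spans = [(0, 1)] + [chunk_token_span(passage, c, prompt_length) for c in chunks]
print(spans)
# [(0, 1), (4, 10), (7, 13)]
```

Two caveats apply to this approach in general: `str.find` returns the first occurrence of a chunk, and with a subword tokenizer the tokens of a prefix may not line up exactly with the tokens of the full text at the boundary, so it is worth decoding a few spans to confirm they cover the intended text.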

📚 Documentation

Evaluation

MTEB(eng, v2)

URL: http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v2%29

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_mteb_dewey_en_beta.py

Model	Zero-shot	Parameters	Dimensions	Max Tokens	Mean (Task)	Mean (TaskType)	Classification	Clustering	Pair Classification	Reranking	Retrieval	STS	Summarization
gemini-embedding-exp-03-07	95%	Unknown	3072	8192	73.3	67.67	90.05	59.39	87.7	48.59	64.35	85.29	38.28
jasper_en_vision_language_v1	56%	1B	8960	131072	71.41	66.65	90.27	60.52	88.14	50	56.05	84.37	37.19
gte-Qwen2-7B-instruct	NA	7B	3584	32768	70.72	65.77	88.52	58.97	85.9	50.47	58.09	82.69	35.74
stella_en_1.5B_v5	56%	1B	8960	131072	69.43	65.32	89.38	57.06	88.02	50.19	52.42	83.27	36.91
SFR-Embedding-2_R	85%	7B	4096	32768	69.82	65.31	90.54	59.39	88.09	48.99	53.75	80.86	35.54
Linq-Embed-Mistral	95%	7B	4096	32768	69.8	65.29	83	54.07	88.44	49.44	60.14	84.69	37.26
NV-Embed-v2	56%	7B	4096	32768	69.81	65	87.19	47.66	88.69	49.61	62.84	83.82	35.21
SFR-Embedding-Mistral	85%	7B	4096	32768	69.31	64.94	80.47	54.93	88.59	50.15	59.33	84.77	36.32
stella_en_400M_v5	56%	435M	4096	8192	69.39	64.84	88.25	57.65	87.17	49.6	52.73	83.93	34.53
text-embedding-004	95%	Unknown	768	2048	69.53	64.82	86.03	51.52	87.65	48.48	59.06	84.84	36.12
text-embedding-005	95%	Unknown	768	2048	69.6	64.77	86.03	51.91	87.62	48.84	58.77	85.18	35.05
e5-mistral-7b-instruct	95%	7B	4096	32768	67.97	64	79.85	51.44	88.42	49.78	57.62	84.32	36.57
text-multilingual-embedding-002	95%	Unknown	768	2048	67.67	63.52	84.65	50.41	86.6	47.48	54.7	83.94	36.84
NV-Embed-v1	56%	7B	4096	32768	68.32	63.37	84.11	49.5	87.05	49.16	60.13	82.2	31.4
infgrad/dewey_en_beta	95%	395M	2048	131072	0.68	63.30	81.83	51.75	86.82	46.35	56.32	84.21	35.79
gte-Qwen2-1.5B-instruct	NA	1B	8960	32768	67.2	63.26	85.84	53.54	87.52	49.25	50.25	82.51	33.94
GritLM-7B	95%	7B	4096	4096	67.07	63.22	81.25	50.82	87.29	49.59	54.95	83.03	35.65
GritLM-8x7B	95%	57B	4096	4096	66.16	62.42	79.98	51.48	85.23	49.22	52.46	82.93	35.65
text-embedding-3-large	NA	Unknown	3072	8191	66.43	62.15	79.15	48.9	85.81	47.45	57.98	81.44	34.31
mxbai-embed-large-v1	100%	335M	1024	512	66.26	62.04	79.1	47.48	87.2	48.05	55.4	84.42	32.63
GIST-large-Embedding-v0	80%	335M	1024	512	66.25	61.96	78.91	48.84	86.7	48.76	54.52	84.44	31.52
bge-large-en-v1.5	100%	335M	1024	512	65.89	61.87	78.34	48.01	87.13	48.26	55.44	82.79	33.13
UAE-Large-V1	100%	335M	1024	512	66.4	61.85	79.08	47.86	87.25	48.35	55.91	84.37	30.13

LongEmbed

URL: http://mteb-leaderboard.hf.space/?benchmark_name=LongEmbed

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_long_embed.py

Model	Zero-shot	Number of Parameters	Embedding Dimensions	Max Tokens	Mean (Task)	Mean (TaskType)	Retrieval
infgrad/dewey_en_beta-MultiVectors	100%	395M	2048	131072	86.59	86.59	86.59
voyage-multilingual-2	100%	Unknown	1024	32000	79.17	79.17	79.17
voyage-law-2	100%	Unknown	1024	16000	78.85	78.85	78.85
infgrad/dewey_en_beta-SingleVector	100%	395M	2048	131072	77.98	77.98	77.98
voyage-3	100%	Unknown	1024	32000	74.06	74.06	74.06
inf-retriever-v1	100%	7B	3584	32768	73.19	73.19	73.19

LoCoV1

URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py

Metric: NDCG@10

Result:

dataset-name	bge-m3-8k	gte-modernbert-base-8k	Linq-Embed-Mistral-4k	Linq-Embed-Mistral-8k	SFR-Embedding-Mistral-8k	e5-mistral-7b-instruct-8k	dewey_en_beta-8k	dewey_en_beta_64k	dewey_en_beta_64k-multi-vectors
2wikimqa_test	0.9271	0.8658	0.8884	0.9067	0.8965	0.8901	0.8953	0.9051	0.9775
courtlistener_HTML_test	0.1933	0.2349	0.3551	0.3670	0.3647	0.3543	0.3415	0.3616	0.4775
courtlistener_Plain_Text_test	0.1888	0.2478	0.3675	0.3761	0.3679	0.3579	0.3377	0.3485	0.4426
gov_report_test	0.9869	0.9750	0.9832	0.9837	0.9816	0.9823	0.9855	0.9883	0.9853
legal_case_reports_test	0.3702	0.4476	0.5398	0.5432	0.5319	0.4850	0.5474	0.5875	0.6534
multifieldqa_test	0.9373	0.9341	0.9345	0.9327	0.9450	0.9321	0.9687	0.9564	0.9754
passage_retrieval_test	0.4493	0.5271	0.3470	0.3407	0.2902	0.3248	0.7562	0.7389	0.8550
qasper_abstract_test	1.0000	0.9806	0.9982	0.9982	0.9973	0.9965	0.9973	0.9982	0.9982
qasper_title_test	0.9860	0.8892	0.9838	0.9833	0.9861	0.9812	0.9742	0.9742	0.9840
qmsum_test	0.6668	0.6307	0.6816	0.7237	0.7169	0.7148	0.7438	0.7613	0.8154
stackoverflow_test	0.9634	0.9087	0.9760	0.9760	0.9766	0.9690	0.9362	0.9369	0.9443
summ_screen_fd_test	0.9320	0.9379	0.9747	0.9635	0.9656	0.9580	0.9796	0.9821	0.9788
Average	0.7168	0.7150	0.7525	0.7579	0.7517	0.7455	0.7886	0.7949	0.8406
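For readers unfamiliar with the metric in the table above, here is a minimal sketch of NDCG@10 in its textbook form. It assumes linear gain and a log2 rank discount; the linked reproduction script may rely on a library implementation with different conventions. The function names and the toy relevance lists are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance values (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy query: two relevant documents exist; the system ranks them 1st and 3rd.
ranked = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranked, all_relevances=[1, 1], k=10)
print(round(score, 4))
```

A perfect ranking scores 1.0, and pushing a relevant document further down the list lowers the score logarithmically.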

🔧 Technical Details

This model is based on answerdotai/ModernBERT-large, an excellent model. Thanks to its authors for sharing it!

📄 License

The model is under the MIT license.

❗ Limitations

Language support: Only English text.
Short text performance: On short text tasks, the performance might not be as good as that of conventional short text embedding models.
Model stage: This model is still in an alpha or beta stage and may show some unexpected behaviour.

📚 Cite

@misc{zhang2025deweylongcontextembedding,
      title={Dewey Long Context Embedding Model: A Technical Report}, 
      author={Dun Zhang and Panxiang Zou and Yudong Zhou},
      year={2025},
      eprint={2503.20376},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2503.20376}, 
}
