# Turkish Long-Context ColBERT-Based Reranker
This model is a Turkish long-context reranker based on ColBERT, fine-tuned from 99eren99/ModernBERT-base-Turkish-uncased-mlm using PyLate. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity with the MaxSim operator.
## Features

- Sentence Similarity: Maps sentences and paragraphs to sequences of 128-dimensional dense vectors for semantic textual similarity.
- Reranking: Can be used to rerank documents in a retrieval pipeline.
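Conceptually, the MaxSim operator scores a query–document pair by taking, for each query token embedding, the maximum similarity against all document token embeddings, then summing those maxima. A minimal NumPy sketch with toy data (the shapes and random inputs are illustrative only, not PyLate's API):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim score: for each query token, take the
    maximum similarity over all document tokens, then sum the maxima."""
    # (num_query_tokens, dim) @ (dim, num_doc_tokens) -> token similarity matrix
    sim = query_emb @ doc_emb.T
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))   # 4 query tokens, 128-dim vectors (toy data)
d = rng.normal(size=(10, 128))  # 10 document tokens

# L2-normalize rows so the dot products are cosine similarities
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)

score = maxsim(q, d)
```

With normalized embeddings, each per-token maximum is at most 1, so the score of a 4-token query is bounded by 4; documents are ranked by this score.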
## Installation

First, install the PyLate library:

```bash
pip install -U einops flash_attn
pip install -U pylate
```
Then, normalize your text before encoding with `lambda x: x.replace("İ", "i").replace("I", "ı").lower()`.
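Turkish casing is the reason for this custom normalizer: Python's default `str.lower()` maps `"I"` to `"i"`, whereas in Turkish uppercase `"I"` should lowercase to dotless `"ı"` (and `"İ"` to `"i"`). A short illustration:

```python
# Normalizer from above: handle Turkish İ/I before the generic lowercase step
normalize = lambda x: x.replace("İ", "i").replace("I", "ı").lower()

print(normalize("İstanbul"))  # -> "istanbul"
print(normalize("ISPARTA"))   # -> "ısparta" (plain .lower() would give "isparta")
```

Apply the same normalization to both queries and documents so that tokenization is consistent with the uncased training data.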
## Usage Examples

### Basic Usage

The following steps show how to index documents and retrieve the top-k most relevant documents for a given set of queries.

#### Indexing documents
```python
from pylate import indexes, models, retrieve

# Maximum document length (in tokens) used at encoding time
document_length = 180

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
    document_length=document_length,
)

# ModernBERT does not use token_type_ids; drop them if present
try:
    model.tokenizer.model_input_names.remove("token_type_ids")
except ValueError:
    pass

# Create a Voyager index to store the document embeddings
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

# Encode the documents (is_query=False selects the document-side encoding)
documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Add the document embeddings to the index
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you can reuse the index later by loading it:

```python
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```
#### Retrieving top-k documents for queries

```python
retriever = retrieve.ColBERT(index=index)

# Encode the queries (is_query=True selects the query-side encoding)
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Retrieve the top-10 documents for each query
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,
)
```
### Advanced Usage

If you only want to use the ColBERT model to rerank documents on top of your first-stage retrieval pipeline without building an index, you can use the following code:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Rerank each query's candidate documents by their MaxSim scores
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## Documentation

### Evaluation Results

nDCG and Recall scores for long-context late-interaction retrieval models, together with the test code and detailed metrics, can be found in "./assets".

## License

This model is licensed under the Apache 2.0 license.
## Model Information

| Property | Details |
|----------|---------|
| Base Model | 99eren99/ModernBERT-base-Turkish-uncased-mlm |
| Language | tr |
| Library Name | PyLate |
| Pipeline Tag | sentence-similarity |
| Tags | ColBERT, PyLate, sentence-transformers, sentence-similarity, generated_from_trainer, reranker, bert |
| License | apache-2.0 |