# Turkish Long-Context ColBERT-Based Reranker
This model is a Turkish long-context reranker based on ColBERT, fine-tuned from 99eren99/ModernBERT-base-Turkish-uncased-mlm using PyLate. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity with the MaxSim operator.
## Features

- Sentence Similarity: Maps sentences and paragraphs to sequences of 128-dimensional dense vectors for semantic textual similarity.
- Reranking: Can be used to rerank documents in a retrieval pipeline.
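Conceptually, the MaxSim operator scores a query–document pair by taking, for each query token embedding, the maximum similarity against all document token embeddings, then summing those maxima. A minimal NumPy sketch with toy data (the shapes and random inputs are illustrative only, not PyLate's API):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction MaxSim score: for each query token, take the
    maximum similarity over all document tokens, then sum the maxima."""
    # (num_query_tokens, dim) @ (dim, num_doc_tokens) -> token similarity matrix
    sim = query_emb @ doc_emb.T
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128))   # 4 query tokens, 128-dim vectors (toy data)
d = rng.normal(size=(10, 128))  # 10 document tokens

# L2-normalize rows so the dot products are cosine similarities
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)

score = maxsim(q, d)
```

With normalized embeddings, each per-token maximum is at most 1, so the score of a 4-token query is bounded by 4; documents are ranked by this score.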
## Installation

First, install the PyLate library:

```bash
pip install -U einops flash_attn
pip install -U pylate
```
Then, normalize your text before encoding with `lambda x: x.replace("İ", "i").replace("I", "ı").lower()`.
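Turkish casing is the reason for this custom normalizer: Python's default `str.lower()` maps `"I"` to `"i"`, whereas in Turkish uppercase `"I"` should lowercase to dotless `"ı"` (and `"İ"` to `"i"`). A short illustration:

```python
# Normalizer from above: handle Turkish İ/I before the generic lowercase step
normalize = lambda x: x.replace("İ", "i").replace("I", "ı").lower()

print(normalize("İstanbul"))  # -> "istanbul"
print(normalize("ISPARTA"))   # -> "ısparta" (plain .lower() would give "isparta")
```

Apply the same normalization to both queries and documents so that tokenization is consistent with the uncased training data.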
## Usage Examples

### Basic Usage

The following steps show how to index documents and retrieve the top-k most relevant documents for a given set of queries.

#### Indexing documents
```python
from pylate import indexes, models, retrieve

# Maximum document length (in tokens) used at encoding time
document_length = 180

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
    document_length=document_length,
)

# ModernBERT does not use token_type_ids; drop them if present
try:
    model.tokenizer.model_input_names.remove("token_type_ids")
except ValueError:
    pass

# Create a Voyager index to store the document embeddings
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

# Encode the documents (is_query=False selects the document-side encoding)
documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Add the document embeddings to the index
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
```
Note that you can reuse the index later by loading it:

```python
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)
```
#### Retrieving top-k documents for queries

```python
retriever = retrieve.ColBERT(index=index)

# Encode the queries (is_query=True selects the query-side encoding)
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Retrieve the top-10 documents for each query
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,
)
```
### Advanced Usage

If you only want to use the ColBERT model to rerank documents on top of your first-stage retrieval pipeline without building an index, you can use the following code:

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Rerank each query's candidate documents by their MaxSim scores
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## Documentation

### Evaluation Results

nDCG and Recall scores for long-context late-interaction retrieval models, together with the test code and detailed metrics, can be found in "./assets".

## License

This model is licensed under the Apache 2.0 license.
## Model Information

| Property | Details |
|----------|---------|
| Base Model | 99eren99/ModernBERT-base-Turkish-uncased-mlm |
| Language | tr |
| Library Name | PyLate |
| Pipeline Tag | sentence-similarity |
| Tags | ColBERT, PyLate, sentence-transformers, sentence-similarity, generated_from_trainer, reranker, bert |
| License | apache-2.0 |