ColBERT Open-Source Turkish Model: Free Sentence Similarity and Document Reranking

Colbert ModernBERT Base Turkish Uncased

Developed by 99eren99

A Turkish model fine-tuned with PyLate from ModernBERT-base-Turkish-uncased-mlm, intended for sentence similarity and document reranking.

Text embedding

Safetensors

License: Apache-2.0 #Turkish semantic retrieval #Long-document reranking #ColBERT architecture

Downloads: 74

Released: 2/14/2025

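Model Overview

The model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and scores semantic textual similarity with the MaxSim operator. It is suited to Turkish text retrieval and reranking.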
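
To make the MaxSim scoring concrete, here is a minimal pure-Python sketch. It assumes the usual ColBERT late-interaction definition (for each query token, take the highest dot product against any document token, then sum over query tokens). The helper names `dot` and `maxsim` and the toy 2-dimensional vectors are illustrative only; the real model emits 128-dimensional vectors and PyLate computes the scores internally.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (illustrative values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains a close match for each query token
doc_b = [[0.5, 0.5]]                          # only a partial match for each query token

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because every query token picks its own best match, a document is rewarded for covering all parts of the query rather than for matching a single pooled vector.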

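Model Features

Long-context processing

Handles documents of up to 8192 tokens, which suits long-text retrieval.

Efficient retrieval

Uses a Voyager HNSW index for fast document retrieval.

Multi-granularity representation

Produces sequences of 128-dimensional dense vectors that preserve fine-grained semantic information.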

Model Capabilities

Semantic textual similarity

Document retrieval

Query-document matching

Search result reranking

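Use Cases

Information retrieval

Document search engine

Build a Turkish document search engine with more relevant search results.

Aims to improve nDCG and recall.

Question answering

Rerank candidate answers in a question-answering system.

Aims to improve answer accuracy.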

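🚀 Turkish Long-Context ColBERT-Based Reranker

This model was fine-tuned with the PyLate library from the base model 99eren99/ModernBERT-base-Turkish-uncased-mlm. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and computes semantic textual similarity with the MaxSim operator.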

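✨ Key Features

Built on the ColBERT architecture for Turkish long-context reranking.
Maps text to sequences of 128-dimensional dense vectors for semantic similarity.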

📦 Installation

First install the PyLate library, along with einops and flash_attn:

pip install -U einops flash_attn
pip install -U pylate

Then normalize all input text before encoding (the model is uncased): lambda x: x.replace("İ", "i").replace("I", "ı").lower()
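
The explicit replacements matter because Python's default `str.lower()` is not Turkish-aware: it maps "I" to "i" rather than to the dotless "ı", and it maps "İ" to "i" followed by a combining dot. The sketch below wraps the lambda above in a named function; the name `normalize_turkish` and the sample string are illustrative, not part of the model card.

```python
def normalize_turkish(text: str) -> str:
    """Turkish-aware lowercasing: handle dotted and dotless I before calling lower()."""
    return text.replace("İ", "i").replace("I", "ı").lower()


raw = "İSTANBUL ve IĞDIR"
normalized = normalize_turkish(raw)
print(normalized)

# For comparison, the default lowercasing loses the dotless ı.
print(raw.lower())
```

Apply the same function to both queries and documents so that the two sides are tokenized consistently.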

💻 Usage Examples

Basic usage

Indexing documents

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
document_length = 180  # Truncation length in tokens, an integer in [0, 8192]; RoPE scaling may help with longer inputs
model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased", document_length=document_length
)
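# Drop token_type_ids from the tokenizer's input names, if listed
try:
    model.tokenizer.model_input_names.remove("token_type_ids")
except ValueError:
    pass  # Not in the list, nothing to remove
# model.to("cuda")  # Uncomment to run on a GPU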

# Step 2: Initialize the Voyager index
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

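Note that you do not need to recreate the index and re-encode the documents every time. Once the index has been created and the documents added, you can load and reuse it as follows: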

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving documents

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings, 
    k=10,  # Retrieve the top 10 matches for each query
)
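
The following sketch shows one way to read the retrieval output. It assumes that `retriever.retrieve` returns one list per query, each hit being a dict with "id" and "score" keys; that shape is an assumption here, so check it against your installed PyLate version. The sample data and the helper `best_hits` are hypothetical and exist only to keep the example self-contained.

```python
# Hypothetical sample shaped like the assumed output of retriever.retrieve:
# one inner list per query, each hit a dict with "id" and "score".
sample_scores = [
    [{"id": "3", "score": 11.2}, {"id": "1", "score": 7.9}],
    [{"id": "1", "score": 10.4}, {"id": "2", "score": 6.1}],
]
sample_queries = ["query for document 3", "query for document 1"]


def best_hits(queries, scores):
    """Pair each query with the id of its highest-scoring document."""
    return {
        query: max(hits, key=lambda hit: hit["score"])["id"]
        for query, hits in zip(queries, scores)
    }


best = best_hits(sample_queries, sample_scores)
for query, doc_id in best.items():
    print(f"{query!r} -> document {doc_id}")
```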

Reranking documents

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="99eren99/ColBERT-ModernBERT-base-Turkish-uncased",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
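
Since reranking works with ids rather than text, a final step usually maps the ids back to the candidate documents. The sketch below assumes that `rank.rerank` returns, for each query, a list of dicts with "id" and "score" keys sorted by descending score; treat that shape as an assumption. The scores in `sample_reranked` are invented for illustration and `attach_texts` is a hypothetical helper.

```python
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Hypothetical rerank output (scores invented for illustration).
sample_reranked = [
    [{"id": 2, "score": 9.5}, {"id": 1, "score": 4.0}],
    [{"id": 3, "score": 8.1}, {"id": 1, "score": 5.5}, {"id": 2, "score": 2.3}],
]


def attach_texts(reranked, ids, texts):
    """Replace each hit's id with the matching document text, per query."""
    output = []
    for hits, query_ids, query_texts in zip(reranked, ids, texts):
        lookup = dict(zip(query_ids, query_texts))  # ids are local to each query
        output.append([(lookup[hit["id"]], hit["score"]) for hit in hits])
    return output


ranked_texts = attach_texts(sample_reranked, documents_ids, documents)
for per_query in ranked_texts:
    print(per_query)
```

The lookup is rebuilt for every query because the same id can refer to different candidates in different queries.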