col1-210M-EuroBERT open-source model: free semantic textual similarity for Spanish and English

Col1 210M EuroBERT

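Developed by fjmgAI

A ColBERT model fine-tuned from EuroBERT-210m for semantic textual similarity in Spanish and English.

Text embeddings

Safetensors

Multilingual support. Open-source license: Apache-2.0. #SpanishSemanticSearch #HighAccuracySimilarity #TokenLevelMaxSimRetrieval

Downloads: 16

Released: 4/3/2025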

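Model overview

The model was trained contrastively with the PyLate library on the rag-comprehensive-triplets dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and is suited to semantic search and document retrieval.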

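Model features

Efficient semantic search

Compares embeddings at the token level with the MaxSim operator, which enables efficient semantic search.

Optimized for Spanish

Optimized and filtered specifically for Spanish-language applications.

High accuracy

Reaches an accuracy of 0.9848 on the evaluation dataset.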

Model capabilities

Semantic textual similarity

Document retrieval

Question answering support

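Use cases

Information retrieval

Document similarity matching

Find the documents most relevant to a query sentence.

High-accuracy matching results

Question answering systems

Answer retrieval

Retrieve the most relevant answers from a knowledge base.

High-quality answers based on semantic similarity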

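🚀 fjmgAI/col1-210M-EuroBERT model

fjmgAI/col1-210M-EuroBERT is a model fine-tuned from EuroBERT/EuroBERT-210m. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors for semantic textual similarity tasks and performs well in efficient semantic search for Spanish-language applications.

🚀 Quick start

Install dependencies

First, install the PyLate library:

pip install -U pylate

Compute similarity

The following example uses the model to compute sentence similarity: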

import torch
from pylate import models

# Load the ColBERT model 
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Example data for similarity comparison
query = "¿Cuál es la capital de España?"  # Query sentence
positive_doc = "La capital de España es Madrid."  # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos."  # Irrelevant document
sentences = [query, positive_doc, negative_doc]  # Combine all texts

# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)

# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
    embeddings_dict = model(inputs)  
    embeddings = embeddings_dict['token_embeddings']

# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses maximum similarity (MaxSim) between individual tokens.
    
    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]
    
    Returns:
        Scalar tensor: the MaxSim score averaged over the query tokens
    """
    # Compute dot product between all token pairs
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)  
    
    # Get maximum similarity for each query token (MaxSim)
    max_similarities = similarity_matrix.max(dim=1)[0]
    
    # Return average of maximum similarities (normalized by query length)
    return max_similarities.sum() / query_emb.shape[0]

# Extract embeddings for each text
query_emb = embeddings[0]  
positive_emb = embeddings[1]  
negative_emb = embeddings[2]

# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)

print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
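One point the example above leaves open: when texts of different lengths are encoded in one batch, shorter texts are typically padded, and padded document positions can then take part in the max. The following is a minimal sketch of a MaxSim variant that skips them. It is an illustration on toy vectors, not PyLate's own scoring code; the function names and the doc_mask argument (1 for a real token, 0 for padding) are hypothetical. It also L2-normalizes each vector so that the dot product is a cosine similarity, which is an assumption about the intended scoring rather than something the model card states.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_ignoring_doc_padding(query_vecs, doc_vecs, doc_mask):
    """Mean over query tokens of the best cosine match among real document tokens.

    doc_mask is assumed to hold 1 for real tokens and 0 for padding.
    """
    queries = [l2_normalize(v) for v in query_vecs]
    docs = [l2_normalize(v) for v, keep in zip(doc_vecs, doc_mask) if keep]
    if not queries or not docs:
        return 0.0
    total = 0.0
    for q in queries:
        total += max(sum(a * b for a, b in zip(q, d)) for d in docs)
    return total / len(queries)


# Toy 2-dimensional embeddings (the model itself produces 128 dimensions).
query = [[1.0, 0.0], [0.0, 2.0]]
doc = [[3.0, 0.0], [0.0, 5.0]]  # second position plays the role of padding

masked = maxsim_ignoring_doc_padding(query, doc, doc_mask=[1, 0])
unmasked = maxsim_ignoring_doc_padding(query, doc, doc_mask=[1, 1])
print(f"masked={masked:.2f} unmasked={unmasked:.2f}")
```

In this toy case the padded position would have matched the second query token, so leaving it in doubles the score. Only the document side is masked here; how query-side padding should be treated is a separate design choice that this sketch does not address.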

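✨ Key features

Fine-tuned from EuroBERT/EuroBERT-210m to improve performance on semantic similarity.
Fine-tuned with PyLate using contrastive training on the rag-comprehensive-triplets dataset.
Maps sentences and paragraphs to sequences of 128-dimensional dense vectors for semantic textual similarity tasks.
Uses the MaxSim operator to compare embeddings at the token level, which suits Spanish-language applications such as question answering and document retrieval.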

📦 Installation

Install the required library:

pip install -U pylate

📚 Documentation

Base model

EuroBERT/EuroBERT-210m

Fine-tuning method

Fine-tuned with PyLate using contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset.

Dataset

[baconnier/rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets): the dataset was filtered to 303,000 Spanish examples designed for rag-comprehensive-triplets tasks.

Fine-tuning details

Trained with contrastive training.
Evaluated with pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator.
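The source does not spell out the training objective beyond "contrastive training", so the following is only a sketch of one common form: a softmax cross-entropy that pushes the positive document's score above the negatives' scores. The function name and the temperature parameter are hypothetical and are not taken from PyLate.

```python
import math


def triplet_contrastive_loss(pos_score, neg_scores, temperature=1.0):
    """Cross-entropy of picking the positive among [positive, negatives...]."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum - logits[0]


# The loss shrinks as the positive pulls ahead of the negative.
tied = triplet_contrastive_loss(0.5, [0.5])
well_separated = triplet_contrastive_loss(0.9, [0.1])
wrong_order = triplet_contrastive_loss(0.1, [0.9])
print(f"tied={tied:.4f} separated={well_separated:.4f} wrong={wrong_order:.4f}")
```

With tied scores the loss equals log 2; it falls below that when the positive scores higher and rises above it when the negative does.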

Evaluation metrics

Metric	Value
Accuracy	0.9848
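For orientation, a triplet accuracy figure like the one above is usually read as the fraction of (query, positive, negative) triplets in which the positive document scores higher than the negative one. The sketch below computes that fraction from precomputed scores. It illustrates the assumed definition and is not the code of ColBERTTripletEvaluator; the function name and the sample scores are made up.

```python
def triplet_accuracy(scored_triplets):
    """Fraction of (positive_score, negative_score) pairs with positive > negative."""
    if not scored_triplets:
        raise ValueError("need at least one scored triplet")
    correct = sum(1 for pos, neg in scored_triplets if pos > neg)
    return correct / len(scored_triplets)


# Hypothetical scores for four triplets; the second one is ranked wrongly.
scores = [(0.9, 0.2), (0.7, 0.8), (0.6, 0.1), (0.5, 0.4)]
accuracy = triplet_accuracy(scores)
print(f"accuracy={accuracy:.2f}")
```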

Framework versions

Package	Version
Python	3.10.12
Sentence Transformers	3.4.1
PyLate	1.1.7
Transformers	4.48.2
PyTorch	2.5.1+cu121
Accelerate	1.2.1
Datasets	3.3.1
Tokenizers	0.21.0
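To compare a local environment against the versions listed above, a small check with the standard library's importlib.metadata could look like the sketch below. The distribution names in EXPECTED are assumptions about how each package is published on PyPI, and the helper name is hypothetical; a mismatch only means the environment differs from the one reported here, not that the model will fail.

```python
from importlib import metadata

# Versions copied from the table above; the keys are assumed PyPI names.
EXPECTED = {
    "sentence-transformers": "3.4.1",
    "pylate": "1.1.7",
    "transformers": "4.48.2",
    "torch": "2.5.1+cu121",
    "accelerate": "1.2.1",
    "datasets": "3.3.1",
    "tokenizers": "0.21.0",
}


def compare_versions(expected):
    """Return {name: (wanted, installed_or_None, matches)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report


report = compare_versions(EXPECTED)
for name, (wanted, installed, matches) in report.items():
    status = "ok" if matches else "differs"
    print(f"{name}: wanted {wanted}, found {installed} ({status})")
```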

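🔧 Technical details

The model is fine-tuned from EuroBERT/EuroBERT-210m using contrastive training with the PyLate library on the rag-comprehensive-triplets dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and computes semantic textual similarity by comparing those embeddings at the token level with the MaxSim operator.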
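For the retrieval use case, the same MaxSim score can be used to order candidate documents for a query. The sketch below ranks a few toy documents and keeps the top results. It works on hand-written 2-dimensional vectors instead of real model output, and the function names, document ids, and the top_k parameter are hypothetical.

```python
def maxsim(query_vecs, doc_vecs):
    """Mean over query tokens of the largest dot product with any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    ) / len(query_vecs)


def rank_documents(query_vecs, docs, top_k=2):
    """Return the top_k (doc_id, score) pairs, best score first."""
    scored = [(doc_id, maxsim(query_vecs, vecs)) for doc_id, vecs in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy token embeddings standing in for the model's 128-dimensional vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "madrid": [[0.9, 0.1], [0.1, 0.9]],   # matches both query tokens
    "florida": [[0.2, 0.1], [0.1, 0.3]],  # weak match on both
    "partial": [[1.0, 0.0]],              # matches only the first query token
}

ranking = rank_documents(query, docs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.2f}")
```

Scoring every document this way is fine for a handful of candidates; larger collections usually call for an index, which is outside the scope of this sketch.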