# 🚀 PyLate model based on EuroBERT/EuroBERT-210m
This fine-tuned model, `fjmgAI/col1-210M-EuroBERT`, is based on `EuroBERT/EuroBERT-210m`. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity. It is especially suitable for Spanish question-answering and document-retrieval applications.

## ✨ Features
- Based on the `EuroBERT/EuroBERT-210m` base model.
- Fine-tuned with PyLate using contrastive training.
- Maps sentences and paragraphs to sequences of 128-dimensional dense vectors.
- Supports semantic textual similarity via the MaxSim operator.
- Designed for Spanish question-answering and document-retrieval applications.
## 📦 Installation
First, install the PyLate library:
```bash
pip install -U pylate
```
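To confirm which version was installed, standard pip metadata inspection works (nothing here is PyLate-specific):
```bash
pip show pylate
```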
## 💻 Usage Examples
### Basic Usage
```python
import torch

from pylate import models

# Load the ColBERT-style model from the Hugging Face Hub.
model = models.ColBERT("fjmgAI/col1-210M-EuroBERT", trust_remote_code=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

query = "¿Cuál es la capital de España?"
positive_doc = "La capital de España es Madrid."
negative_doc = "Florida es un estado en los Estados Unidos."

sentences = [query, positive_doc, negative_doc]

# Tokenize and move the batch to the target device.
inputs = model.tokenize(sentences)
inputs = {key: value.to(device) for key, value in inputs.items()}

with torch.no_grad():
    embeddings_dict = model(inputs)
embeddings = embeddings_dict["token_embeddings"]

def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses the maximum similarity (MaxSim) between individual tokens.

    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]

    Returns:
        Similarity score normalized by the number of query tokens.
    """
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)
    max_similarities = similarity_matrix.max(dim=1)[0]
    return max_similarities.sum() / query_emb.shape[0]

query_emb = embeddings[0]
positive_emb = embeddings[1]
negative_emb = embeddings[2]

positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)

print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
```
## 📚 Documentation
### Base Model
[EuroBERT/EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m)
### Fine-Tuning Method
Fine-tuning was performed with PyLate, using contrastive training on the [rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets) dataset. The resulting model maps sentences and paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity via the MaxSim operator.
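Concretely, the MaxSim operator scores a query $q$ against a document $d$ by matching each query token embedding to its most similar document token embedding and aggregating the maxima; with the query-length normalization used in the usage example above, this is:

$$
s(q, d) = \frac{1}{|q|} \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \mathbf{q}_i \cdot \mathbf{d}_j
$$

where $\mathbf{q}_i$ and $\mathbf{d}_j$ are the 128-dimensional token embeddings of the query and the document.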
### Dataset
[baconnier/rag-comprehensive-triplets](https://huggingface.co/datasets/baconnier/rag-comprehensive-triplets)

**Description**: the dataset has been filtered for the Spanish language, yielding 303,000 triplet examples for comprehensive RAG training.
### Fine-Tuning Details
- The model was trained using contrastive training (a workflow sketch follows the table below).
- Evaluated with `pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator`.
| Property | Details |
|----------|---------|
| Model Type | PyLate model based on EuroBERT/EuroBERT-210m |
| Training Data | baconnier/rag-comprehensive-triplets |
| Metric | Accuracy: 0.9848384857177734 |
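For context, here is a minimal sketch of a contrastive-training and evaluation run following PyLate's documented training API; the dataset split, output directory, and the evaluator's triplet lists are illustrative assumptions, not the exact configuration used for this model:
```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)

from pylate import evaluation, losses, models, utils

model = models.ColBERT("EuroBERT/EuroBERT-210m", trust_remote_code=True)

# The Contrastive loss expects triplet-style columns; check the dataset card
# for the actual schema and filter to Spanish as described above.
dataset = load_dataset("baconnier/rag-comprehensive-triplets", split="train")

train_loss = losses.Contrastive(model=model)

# Triplet evaluator referenced above; takes parallel lists of texts.
dev_evaluator = evaluation.ColBERTTripletEvaluator(
    anchors=["¿Cuál es la capital de España?"],
    positives=["La capital de España es Madrid."],
    negatives=["Florida es un estado en los Estados Unidos."],
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(output_dir="col1-210M-EuroBERT"),
    train_dataset=dataset,
    loss=train_loss,
    evaluator=dev_evaluator,
    data_collator=utils.ColBERTCollator(model.tokenize),
)
trainer.train()
```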
### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.4.1
- PyLate: 1.1.7
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu121
- Accelerate: 1.2.1
- Datasets: 3.3.1
- Tokenizers: 0.21.0
### Purpose
This fine-tuned model is designed for Spanish applications that require efficient semantic search, comparing embeddings at the token level with the MaxSim operation, which makes it well suited for question answering and document retrieval.
## 📄 License
- Developed by: fjmgAI
- License: apache-2.0
