Semantic_XLMR Open-source Multilingual Sentence Embedding Model - Optimizing Bengali Semantic Computation and Cluster Analysis

Semantic Xlmr

Developed by headlesstech

A multilingual sentence embedding model built on sentence-transformers and fine-tuned for Bengali, suited to semantic similarity calculation and clustering analysis

Text Embedding

Transformers

#Multilingual Semantic Similarity #Bengali Optimization #Knowledge Distillation Model

Release Time: 4/5/2023

Model Overview

This model can map sentences and paragraphs into a 768-dimensional dense vector space, mainly used for tasks such as semantic similarity calculation, clustering analysis, and semantic search

Model Features

Multilingual Support

Based on the XLM-RoBERTa architecture, supports multiple languages, with special optimization for Bengali

Knowledge Distillation Training

Uses paraphrase-distilroberta-base-v2 as the teacher model for knowledge distillation, so the student model learns to reproduce the teacher's sentence embeddings

Efficient Semantic Encoding

Can convert text into 768-dimensional dense vectors, preserving semantic information, suitable for large-scale semantic search

Model Capabilities

Sentence Similarity Calculation

Text Clustering Analysis

Semantic Search

Multilingual Text Encoding

Use Cases

Information Retrieval

Document Retrieval System

Build a semantic-based document retrieval system to improve the relevance of search results

Recommendation System

Content Recommendation

Provide personalized recommendations based on user history and content semantic similarity

Intelligent Customer Service

FAQ Matching

Match user questions with common questions in the knowledge base through semantic analysis
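
One way to sketch the FAQ matching described above: embed every knowledge-base question once, L2-normalize the vectors, and rank them by dot product against the normalized query vector. The 3-dimensional vectors below are made-up stand-ins for the model's 768-dimensional embeddings, and the helper names `l2_normalize` and `top_k_matches` are hypothetical, not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def top_k_matches(query_vec, faq_vecs, k=2):
    """Return (index, score) pairs for the k FAQ vectors most similar to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(faq_vecs):
        v = l2_normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (assumed values); in practice they would come from model.encode(...)
faq_questions = ["How do I reset my password?", "Where is my order?", "How do I cancel?"]
faq_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of a user question

for idx, score in top_k_matches(query_vec, faq_vecs):
    print(f"{score:.3f}  {faq_questions[idx]}")
```

Normalizing the stored vectors once, ahead of time, means each query costs only one dot product per FAQ entry, which is why this layout is common for larger collections.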

🚀 `semantic_xlmr`

This is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

✨ Features

Multilingual Capability: Fine-tuned for the Bengali language, suitable for multilingual tasks.
Versatile Applications: Can be used for semantic similarity, clustering, semantic search, document retrieval, information retrieval, recommendation systems, chatbot systems, and FAQ systems.

📦 Installation

To use this model, first install sentence-transformers:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

model = SentenceTransformer('headlesstech/semantic_xlmr')
embeddings = model.encode(sentences)
print(embeddings)
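
To compare the two embeddings, a common choice is cosine similarity. The sketch below is plain Python and uses short made-up vectors in place of the real 768-dimensional output; `cosine_similarity` is a hypothetical helper, and with the real model you would pass `embeddings[0]` and `embeddings[1]` instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Made-up stand-ins for the English and Bengali sentence embeddings
english_vec = [0.2, 0.7, 0.1]
bengali_vec = [0.25, 0.65, 0.05]
unrelated_vec = [0.9, -0.3, 0.4]

print(cosine_similarity(english_vec, bengali_vec))    # close to 1 for similar vectors
print(cosine_similarity(english_vec, unrelated_vec))  # much lower for dissimilar vectors
```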

Advanced Usage

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('headlesstech/semantic_xlmr')
model = AutoModel.from_pretrained('headlesstech/semantic_xlmr')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
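
To see why the attention mask matters in `mean_pooling`, here is a small plain-Python sketch of the same computation for a single sentence. The token vectors and mask are made-up values, and `masked_mean` is a hypothetical helper rather than part of transformers.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            totals[i] += vec[i] * keep
    count = max(sum(attention_mask), 1e-9)  # guard against division by zero
    return [t / count for t in totals]

# Two real tokens followed by one padding token (mask 0); values are assumed
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_embeddings, attention_mask)
naive = [sum(col) / len(token_embeddings) for col in zip(*token_embeddings)]

print(pooled)  # [2.0, 3.0], padding ignored
print(naive)   # padding vector pulls the average away from the real tokens
```

Because sentences in a batch are padded to the same length, an unmasked average would make a sentence's embedding depend on how long its neighbours in the batch happen to be.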

📚 Documentation

Model Details

Property	Details
Model Name	semantic_xlmr
Model Version	1.0
Architecture	Sentence Transformer
Language	Multilingual (fine-tuned for Bengali)

Training

The model was fine-tuned using the Multilingual Knowledge Distillation method. We took paraphrase-distilroberta-base-v2 as the teacher model and xlm-roberta-large as the student model.
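
In multilingual knowledge distillation as it is usually described, the student is trained so that its embedding of a source sentence and of the translated sentence both land close to the teacher's embedding of the source sentence, typically with a mean squared error loss. The sketch below computes that loss for one sentence pair in plain Python. The vectors are made-up, `distillation_loss` is a hypothetical name, and this is an illustration of the general idea, not the authors' training code.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_src, student_src, student_tgt):
    """Loss for one (source, translation) pair.

    Both student vectors are pulled toward the teacher's source vector.
    """
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)

# Assumed toy embeddings: teacher on the English sentence,
# student on the English sentence and on its Bengali translation
teacher_src = [0.5, -0.2, 0.1]
student_src = [0.4, -0.1, 0.1]
student_tgt = [0.1, 0.3, -0.2]

loss = distillation_loss(teacher_src, student_src, student_tgt)
print(loss)
```

Driving this loss toward zero places a Bengali sentence and its English translation at nearly the same point in the vector space, which is what makes cross-lingual similarity and search possible.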

Intended Use

Primary Use Cases: Semantic similarity, clustering, and semantic search
Potential Use Cases: Document retrieval, information retrieval, recommendation systems, chatbot systems, FAQ systems

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
