# s-xlmr-bn
This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
## Features
- **Multilingual capability**: a multilingual model fine-tuned for Bengali, so it can embed Bengali text alongside other languages in a shared vector space.
- **Diverse use cases**: semantic similarity, clustering, semantic search, document retrieval, information retrieval, recommendation systems, chatbots, and FAQ systems (see the semantic-search sketch after this list).
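For example, here is a minimal semantic-search sketch; the corpus, query, and `top_k` value are illustrative choices, not part of the original model card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('afschowdhury/s-xlmr-bn')

# Hypothetical corpus; in practice these would be your own documents.
corpus = [
    "Dhaka is the capital of Bangladesh.",
    "আমি বাংলায় গান গাই",  # "I sing in Bengali"
    "The weather is nice today.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("Who sings songs in Bengali?", convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])
```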
## Installation
To use this model, install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```
## Usage Examples
### Basic Usage

With the sentence-transformers library installed, you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# The same sentence in English and Bengali ("I sing in Bengali").
sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

model = SentenceTransformer('afschowdhury/s-xlmr-bn')
embeddings = model.encode(sentences)  # array of shape (2, 768)
print(embeddings)
```
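Because the two sentences are translations of each other, their embeddings should land close together. A quick cross-lingual check with cosine similarity, reusing `embeddings` from the block above (the exact score will vary):

```python
from sentence_transformers import util

# Cosine similarity between the English and the Bengali embedding.
print(util.cos_sim(embeddings[0], embeddings[1]))
```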
### Advanced Usage

If you don't have the sentence-transformers library installed, you can use the model with the transformers library directly: pass your input through the transformer model, then apply mean pooling over the token embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens do not contribute.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

tokenizer = AutoTokenizer.from_pretrained('afschowdhury/s-xlmr-bn')
model = AutoModel.from_pretrained('afschowdhury/s-xlmr-bn')

# Tokenize, run the encoder without gradients, then pool.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
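As an optional sanity check (assuming both libraries are installed), these mean-pooled embeddings should match what `SentenceTransformer.encode` produces for this model, since its pooling layer is plain mean pooling:

```python
from sentence_transformers import SentenceTransformer

st_embeddings = SentenceTransformer('afschowdhury/s-xlmr-bn').encode(sentences)
print(torch.allclose(sentence_embeddings, torch.from_numpy(st_embeddings), atol=1e-5))
```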
## Documentation

### Model Details
#### Training

The model was fine-tuned using the multilingual knowledge distillation method, with `paraphrase-distilroberta-base-v2` as the teacher model and `xlm-roberta-large` as the student model.
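For reference, here is a minimal sketch of how this kind of distillation is typically set up with the sentence-transformers training utilities; the parallel-data file, hyperparameters, and projection layer below are illustrative assumptions, not the actual training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: a strong monolingual model; student: a multilingual encoder.
teacher = SentenceTransformer('paraphrase-distilroberta-base-v2')

word_emb = models.Transformer('xlm-roberta-large', max_seq_length=512)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode='mean')
# For MSE distillation the student's output dimension must match the
# teacher's (768 here); a Dense projection is one way to reconcile them.
dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(),
                     out_features=teacher.get_sentence_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling, dense])

# Parallel English-Bengali pairs (tab-separated, one pair per line); the
# student learns to map both sides onto the teacher's English embedding.
data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
data.load_data('parallel-en-bn.tsv')  # hypothetical file name
loader = DataLoader(data, shuffle=True, batch_size=32)
loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1000)
```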

### Intended Use
- **Primary use cases**: semantic similarity, clustering, and semantic search.
- **Potential use cases**: document retrieval, information retrieval, recommendation systems, chatbot systems, and FAQ systems (a clustering sketch follows this list).
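To illustrate the clustering use case, the embeddings plug directly into standard clustering algorithms; scikit-learn and the toy corpus here are my own assumptions, not part of the model card:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('afschowdhury/s-xlmr-bn')

# Toy corpus mixing two topics, in English and Bengali.
sentences = [
    "I love listening to music.",
    "আমি বাংলায় গান গাই",  # "I sing in Bengali"
    "The stock market fell sharply today.",
    "Investors are worried about inflation.",
]
embeddings = model.encode(sentences)

# Group the sentences into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```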
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
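Loading the model and printing it reproduces this structure, which is a quick way to confirm the pooling configuration and the 768-dimensional output:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('afschowdhury/s-xlmr-bn')
print(model)                                     # prints the module structure above
print(model.get_sentence_embedding_dimension())  # 768
```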
### Point of Contact

Asif Faisal Chowdhury

E-mail: afschowdhury@gmail.com | LinkedIn: afschowdhury