# s-xlmr-bn
This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.
## Features
- **Multilingual capability**: a multilingual model fine-tuned for Bengali, so it can embed Bengali text alongside other languages in a shared vector space.
- **Diverse use cases**: semantic similarity, clustering, semantic search, document retrieval, information retrieval, recommendation systems, chatbots, and FAQ systems (see the semantic-search sketch after this list).
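For example, here is a minimal semantic-search sketch; the corpus, query, and `top_k` value are illustrative choices, not part of the original model card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('afschowdhury/s-xlmr-bn')

# Hypothetical corpus; in practice these would be your own documents.
corpus = [
    "Dhaka is the capital of Bangladesh.",
    "আমি বাংলায় গান গাই",  # "I sing in Bengali"
    "The weather is nice today.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("Who sings songs in Bengali?", convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])
```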
## Installation
To use this model, install the sentence-transformers library:

```bash
pip install -U sentence-transformers
```
## Usage Examples
### Basic Usage

With the sentence-transformers library installed, you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# The same sentence in English and Bengali ("I sing in Bengali").
sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

model = SentenceTransformer('afschowdhury/s-xlmr-bn')
embeddings = model.encode(sentences)  # array of shape (2, 768)
print(embeddings)
```
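Because the two sentences are translations of each other, their embeddings should land close together. A quick cross-lingual check with cosine similarity, reusing `embeddings` from the block above (the exact score will vary):

```python
from sentence_transformers import util

# Cosine similarity between the English and the Bengali embedding.
print(util.cos_sim(embeddings[0], embeddings[1]))
```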
### Advanced Usage

If you don't have the sentence-transformers library installed, you can use the model with the transformers library directly: pass your input through the transformer model, then apply mean pooling over the token embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens do not contribute.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["I sing in bengali", "আমি বাংলায় গান গাই"]

tokenizer = AutoTokenizer.from_pretrained('afschowdhury/s-xlmr-bn')
model = AutoModel.from_pretrained('afschowdhury/s-xlmr-bn')

# Tokenize, run the encoder without gradients, then pool.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
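As an optional sanity check (assuming both libraries are installed), these mean-pooled embeddings should match what `SentenceTransformer.encode` produces for this model, since its pooling layer is plain mean pooling:

```python
from sentence_transformers import SentenceTransformer

st_embeddings = SentenceTransformer('afschowdhury/s-xlmr-bn').encode(sentences)
print(torch.allclose(sentence_embeddings, torch.from_numpy(st_embeddings), atol=1e-5))
```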
## Documentation

### Model Details
#### Training

The model was fine-tuned using the multilingual knowledge distillation method, with `paraphrase-distilroberta-base-v2` as the teacher model and `xlm-roberta-large` as the student model.
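For reference, here is a minimal sketch of how this kind of distillation is typically set up with the sentence-transformers training utilities; the parallel-data file, hyperparameters, and projection layer below are illustrative assumptions, not the actual training configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: a strong monolingual model; student: a multilingual encoder.
teacher = SentenceTransformer('paraphrase-distilroberta-base-v2')

word_emb = models.Transformer('xlm-roberta-large', max_seq_length=512)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode='mean')
# For MSE distillation the student's output dimension must match the
# teacher's (768 here); a Dense projection is one way to reconcile them.
dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(),
                     out_features=teacher.get_sentence_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling, dense])

# Parallel English-Bengali pairs (tab-separated, one pair per line); the
# student learns to map both sides onto the teacher's English embedding.
data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
data.load_data('parallel-en-bn.tsv')  # hypothetical file name
loader = DataLoader(data, shuffle=True, batch_size=32)
loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1000)
```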

### Intended Use
- **Primary use cases**: semantic similarity, clustering, and semantic search.
- **Potential use cases**: document retrieval, information retrieval, recommendation systems, chatbot systems, and FAQ systems (a clustering sketch follows this list).
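To illustrate the clustering use case, the embeddings plug directly into standard clustering algorithms; scikit-learn and the toy corpus here are my own assumptions, not part of the model card:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('afschowdhury/s-xlmr-bn')

# Toy corpus mixing two topics, in English and Bengali.
sentences = [
    "I love listening to music.",
    "আমি বাংলায় গান গাই",  # "I sing in Bengali"
    "The stock market fell sharply today.",
    "Investors are worried about inflation.",
]
embeddings = model.encode(sentences)

# Group the sentences into two clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```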
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
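Loading the model and printing it reproduces this structure, which is a quick way to confirm the pooling configuration and the 768-dimensional output:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('afschowdhury/s-xlmr-bn')
print(model)                                     # prints the module structure above
print(model.get_sentence_embedding_dimension())  # 768
```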
### Point of Contact

Asif Faisal Chowdhury

E-mail: afschowdhury@gmail.com | LinkedIn: afschowdhury