Bangla Sentence Transformer: Open-Source Model for Bengali Sentence Similarity Calculation and Semantic Search

Bangla Sentence Transformer

Developed by shihab17

A Bangla sentence embedding model fine-tuned from stsb-xlm-r-multilingual, supporting sentence similarity calculation and semantic search

Text Embedding

Safetensors

Tags: Supports Multiple Languages, Bangla Sentence Embedding, Multilingual Similarity Calculation, Knowledge Distillation Optimization

Downloads 1,257

Release Time: 5/18/2023

Model Overview

The model encodes Bangla and English sentences into high-dimensional embedding vectors that suit tasks such as text classification, information retrieval, and semantic search.

Model Features

Multilingual Support

Supports sentence embeddings for both Bangla and English

Efficient Semantic Representation

Generates sentence embeddings that capture sentence-level meaning, so that semantically similar sentences map to nearby vectors

Knowledge Distillation Optimization

Optimized using multilingual knowledge distillation techniques to enhance cross-lingual performance
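
As a rough illustration of the idea, the sketch below assumes the commonly used multilingual knowledge distillation objective: a student model is trained so that its embeddings of a source sentence and of its translation both match a teacher model's embedding of the source sentence. The function names and toy vectors are hypothetical, and the source does not state the exact loss used for this model.

```python
from typing import Sequence


def mse(u: Sequence[float], v: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_tgt) -> float:
    """Batch average of MSE(teacher(src), student(src)) + MSE(teacher(src), student(tgt)).

    Each argument is a list of embedding vectors, one per sentence pair.
    """
    total = 0.0
    for t, s_src, s_tgt in zip(teacher_src, student_src, student_tgt):
        total += mse(t, s_src) + mse(t, s_tgt)
    return total / len(teacher_src)


# Toy 2-dimensional vectors stand in for real embeddings (assumption for illustration).
teacher = [[1.0, 0.0]]
perfect = distillation_loss(teacher, [[1.0, 0.0]], [[1.0, 0.0]])
imperfect = distillation_loss(teacher, [[1.0, 0.0]], [[0.0, 0.0]])
print(perfect, imperfect)
```

The loss is zero only when the student reproduces the teacher's vector for both the source sentence and its translation, which is what pulls the two languages into a shared embedding space.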

Model Capabilities

Sentence Embedding Generation

Semantic Similarity Calculation

Cross-lingual Information Retrieval

Text Feature Extraction

Use Cases

Information Retrieval

Bangla Document Search

Used to build Bangla search engines, retrieving relevant documents based on semantic similarity
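
A minimal sketch of the ranking step in such a search engine, assuming the documents and the query have already been embedded (for example with `model.encode(...)`). The helper names `cosine_similarity` and `rank_documents` and the toy vectors are hypothetical and only illustrate the technique.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from best to worst match."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors stand in for real sentence embeddings.
toy_docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.9, 0.1, 0.0]
results = rank_documents(toy_query, toy_docs, top_k=2)
print(results)  # document 0 ranks first, then document 1
```

For large collections, an approximate nearest-neighbour index would usually replace the linear scan shown here.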

Text Analysis

Bangla Text Classification

Uses sentence embeddings as input features for a classifier to perform Bangla text classification
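
One simple way to classify on top of sentence embeddings is a nearest-centroid classifier, sketched below. This is an illustrative stand-in rather than a method described in the source. The labels, the helper names, and the toy vectors are hypothetical, and in practice the vectors would come from the model.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_centroids(embeddings: Sequence[Sequence[float]], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Average the embeddings of each class to get one centroid per label."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {
        label: [sum(column) / len(vecs) for column in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy 2-dimensional vectors stand in for real sentence embeddings.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["sports", "sports", "politics", "politics"]
centroids = fit_centroids(train_vecs, train_labels)
print(predict([0.8, 0.2], centroids))  # closest to the "sports" centroid
```

Any standard classifier (logistic regression, SVM, a small neural network) can take the place of the centroid rule, with the embeddings as its input features.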

🚀 Bangla Sentence Transformer

A Sentence Transformer is a natural language processing (NLP) model that encodes sentences into high-dimensional embeddings, so that sentences with similar meaning lie close together in the vector space. These embeddings support applications such as text classification, information retrieval, and semantic search.

This model is fine-tuned from stsb-xlm-r-multilingual and is now available on Hugging Face! 🎉🎉

🚀 Quick Start

✨ Features

Supports multiple languages, including Bengali (bn) and English (en).
Suited to sentence-similarity tasks.
Built on the sentence-transformers library for feature extraction.

📦 Installation

The model is easiest to use with sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

# Example Bangla sentences to embed
sentences = ['আমি আপেল খেতে পছন্দ করি।', 'আমার একটি আপেল মোবাইল আছে।', 'আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Load the model from the Hugging Face Hub and encode all sentences in one batch
model = SentenceTransformer('shihab17/bangla-sentence-transformer')
embeddings = model.encode(sentences)
print(embeddings)

Advanced Usage

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, using the attention mask so padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['আমি আপেল খেতে পছন্দ করি।', 'আমার একটি আপেল মোবাইল আছে।', 'আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shihab17/bangla-sentence-transformer')
model = AutoModel.from_pretrained('shihab17/bangla-sentence-transformer')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
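
To see why `mean_pooling` weights by the attention mask, the following sketch repeats the same computation on a tiny hand-made example in plain Python. The function name `masked_mean` and the numbers are hypothetical. The last token row plays the role of padding.

```python
from typing import List, Sequence


def masked_mean(token_embeddings: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average only the token vectors whose mask value is 1; padding (0) is skipped."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k, value in enumerate(vec):
                summed[k] += value
    count = max(count, 1)  # avoid division by zero for an all-padding input
    return [s / count for s in summed]


tokens = [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]]  # third row is a padding position
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)                        # [3.0, 6.0]
naive = [sum(col) / len(tokens) for col in zip(*tokens)]  # distorted by the padding row
print(pooled, naive)
```

Without the mask, the padding row would dominate the average, and the embedding of a sentence would depend on how much padding its batch happened to need.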

Calculating Sentence Similarity

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import pytorch_cos_sim


# Load the model from the Hugging Face Hub
transformer = SentenceTransformer('shihab17/bangla-sentence-transformer')

sentences = ['আমি আপেল খেতে পছন্দ করি।', 'আমার একটি আপেল মোবাইল আছে।', 'আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Embed all sentences in one batch
sentences_embeddings = transformer.encode(sentences)

# Score each pair once; j starts at i, so each sentence is also compared with itself
for i in range(len(sentences)):
    for j in range(i, len(sentences)):
        sen_1 = sentences[i]
        sen_2 = sentences[j]
        sim_score = float(pytorch_cos_sim(sentences_embeddings[i], sentences_embeddings[j]))
        print(sen_1, '----->', sen_2, sim_score)
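
For reference, the score printed for each pair is the cosine similarity of the two embedding vectors $u$ and $v$:

```latex
\[
\operatorname{cos\_sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
= \frac{\sum_{k} u_k v_k}{\sqrt{\sum_{k} u_k^2}\,\sqrt{\sum_{k} v_k^2}},
\qquad -1 \le \operatorname{cos\_sim}(u, v) \le 1 .
\]
```

A sentence compared with itself scores 1 (up to floating-point rounding), and higher scores indicate closer meaning.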

📚 Documentation

Best MSE: 2.5556

📄 License

No license information provided in the original document.

📖 Citation

If you use this model, please cite the following paper:

@INPROCEEDINGS{10754765,
  author={Uddin, Md. Shihab and Haque, Mohd Ariful and Rifat, Rakib Hossain and Kamal, Marufa and Gupta, Kishor Datta and George, Roy},
  booktitle={2024 IEEE 15th Annual Ubiquitous Computing, Electronics \& Mobile Communication Conference (UEMCON)},
  title={Bangla SBERT - Sentence Embedding Using Multilingual Knowledge Distillation},
  year={2024},
  pages={495-500},
  keywords={Sentiment analysis;Machine learning algorithms;Accuracy;Text categorization;Semantics;Transformers;Mobile communication;Information retrieval;Machine translation;Sentence Similarity;Sentence Transformer;SBERT;Knowledge Distillation;Bangla NLP},
  doi={10.1109/UEMCON62879.2024.10754765}
}
