🚀 IndicSBERT
IndicSBERT is a sentence-BERT model based on MuRIL (google/muril-base-cased), trained on the NLI datasets of ten major Indian languages. It supports English, Hindi, Marathi, Kannada, Tamil, Telugu, Malayalam, Gujarati, Odia, Punjabi, and Bengali, and has cross-lingual capabilities.
📋 Model Information
| Property | Details |
|---|---|
| Pipeline Tag | sentence-similarity |
| License | cc-by-4.0 |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers |
| Language | multilingual, en, hi, mr, kn, ta, te, ml, gu, or, pa, bn |
🔍 Example Widgets
- Monolingual - Marathi
  - Source Sentence: दिवाळी आपण मोठ्या उत्साहाने साजरी करतो (We celebrate Diwali with great enthusiasm)
  - Sentences:
    - दिवाळी आपण आनंदाने साजरी करतो (We celebrate Diwali with joy)
    - दिवाळी हा दिव्यांचा सण आहे (Diwali is the festival of lights)
- Monolingual - Hindi
  - Source Sentence: हम दीपावली उत्साह के साथ मनाते हैं (We celebrate Diwali with enthusiasm)
  - Sentences:
    - हम दीपावली खुशियों से मनाते हैं (We celebrate Diwali with joy)
    - दिवाली रोशनी का त्योहार है (Diwali is the festival of lights)
- Monolingual - Gujarati
  - Source Sentence: અમે ઉત્સાહથી દિવાળી ઉજવીએ છીએ (We celebrate Diwali with enthusiasm)
  - Sentences:
    - દિવાળી આપણે ખુશીઓથી ઉજવીએ છીએ (We celebrate Diwali with joy)
    - દિવાળી એ રોશનીનો તહેવાર છે (Diwali is the festival of lights)
- Cross-lingual 1 (We are proud to be Indian)
  - Source Sentence (Marathi): आम्हाला भारतीय असल्याचा अभिमान आहे
  - Sentences:
    - हमें भारतीय होने पर गर्व है (Hindi)
    - భారతీయులమైనందుకు గర్విస్తున్నాం (Telugu)
    - અમને ભારતીય હોવાનો ગર્વ છે (Gujarati)
- Cross-lingual 2 (The garden looks beautiful after the rain)
  - Source Sentence (Punjabi): ਬਾਰਿਸ਼ ਤੋਂ ਬਾਅਦ ਬਗੀਚਾ ਸੁੰਦਰ ਦਿਖਾਈ ਦਿੰਦਾ ਹੈ
  - Sentences:
    - മഴയ്ക്ക് ശേഷം പൂന്തോട്ടം മനോഹരമായി കാണപ്പെടുന്നു (Malayalam)
    - ବର୍ଷା ପରେ ବଗିଚା ସୁନ୍ଦର ଦେଖାଯାଏ। (Odia)
    - बारिश के बाद बगीचा सुंदर दिखता है (Hindi)
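The widgets above rank each candidate sentence against the source by the cosine similarity of their embeddings. Below is a minimal sketch of that scoring step using toy NumPy vectors in place of real `model.encode` output; the `cosine_similarity` helper is illustrative, not part of any library here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for model.encode(...) output:
source = [0.9, 0.1, 0.2]
candidates = {
    "paraphrase": [0.8, 0.2, 0.1],  # close in meaning -> high score
    "unrelated":  [0.1, 0.9, 0.7],  # different meaning -> low score
}

for name, vec in candidates.items():
    print(name, round(cosine_similarity(source, vec), 3))
```

With real embeddings, the candidate with the highest cosine score is the one the widget ranks as most similar to the source sentence.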
📄 Related Links
- This model is released as a part of project MahaNLP: [https://github.com/l3cube-pune/MarathiNLP](https://github.com/l3cube-pune/MarathiNLP)
- A better sentence similarity model (a fine-tuned version of this model) is shared here: [https://huggingface.co/l3cube-pune/indic-sentence-similarity-sbert](https://huggingface.co/l3cube-pune/indic-sentence-similarity-sbert)
- More details on the dataset, models, and baseline results can be found in our paper.
📚 Citations
```bibtex
@article{deode2023l3cube,
  title={L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT},
  author={Deode, Samruddhi and Gadre, Janhavi and Kajale, Aditi and Joshi, Ananya and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2304.11434},
  year={2023}
}
```
🔗 Other Related Models
- Monolingual Indic sentence BERT models:
  - [Marathi SBERT](https://huggingface.co/l3cube-pune/marathi-sentence-bert-nli)
  - [Hindi SBERT](https://huggingface.co/l3cube-pune/hindi-sentence-bert-nli)
  - [Kannada SBERT](https://huggingface.co/l3cube-pune/kannada-sentence-bert-nli)
  - [Telugu SBERT](https://huggingface.co/l3cube-pune/telugu-sentence-bert-nli)
  - [Malayalam SBERT](https://huggingface.co/l3cube-pune/malayalam-sentence-bert-nli)
  - [Tamil SBERT](https://huggingface.co/l3cube-pune/tamil-sentence-bert-nli)
  - [Gujarati SBERT](https://huggingface.co/l3cube-pune/gujarati-sentence-bert-nli)
  - [Oriya SBERT](https://huggingface.co/l3cube-pune/odia-sentence-bert-nli)
  - [Bengali SBERT](https://huggingface.co/l3cube-pune/bengali-sentence-bert-nli)
  - [Punjabi SBERT](https://huggingface.co/l3cube-pune/punjabi-sentence-bert-nli)
  - [Indic SBERT (multilingual)](https://huggingface.co/l3cube-pune/indic-sentence-bert-nli)
- Monolingual similarity models:
  - [Marathi Similarity](https://huggingface.co/l3cube-pune/marathi-sentence-similarity-sbert)
  - [Hindi Similarity](https://huggingface.co/l3cube-pune/hindi-sentence-similarity-sbert)
  - [Kannada Similarity](https://huggingface.co/l3cube-pune/kannada-sentence-similarity-sbert)
  - [Telugu Similarity](https://huggingface.co/l3cube-pune/telugu-sentence-similarity-sbert)
  - [Malayalam Similarity](https://huggingface.co/l3cube-pune/malayalam-sentence-similarity-sbert)
  - [Tamil Similarity](https://huggingface.co/l3cube-pune/tamil-sentence-similarity-sbert)
  - [Gujarati Similarity](https://huggingface.co/l3cube-pune/gujarati-sentence-similarity-sbert)
  - [Oriya Similarity](https://huggingface.co/l3cube-pune/odia-sentence-similarity-sbert)
  - [Bengali Similarity](https://huggingface.co/l3cube-pune/bengali-sentence-similarity-sbert)
  - [Punjabi Similarity](https://huggingface.co/l3cube-pune/punjabi-sentence-similarity-sbert)
  - [Indic Similarity (multilingual)](https://huggingface.co/l3cube-pune/indic-sentence-similarity-sbert)
🚀 Usage Examples
🔧 Using Sentence-Transformers
Using this model is straightforward once you have sentence-transformers installed:

```shell
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)
print(embeddings)
```
🔧 Using HuggingFace Transformers
Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, weighted by the attention mask
# so that padding tokens do not contribute.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling; in this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
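The role of the attention mask in `mean_pooling` is to average only over real tokens and ignore padding. This small NumPy re-implementation (illustrative only, with made-up toy values) makes that visible without loading a model:

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Masked mean over the sequence axis, mirroring the torch version."""
    mask = attention_mask[:, :, None].astype(float)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # sum of real-token embeddings
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # number of real tokens
    return summed / counts

# One sentence, three token positions (the last is padding), 2-dim embeddings:
emb = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])  # padding position is masked out

print(mean_pooling_np(emb, mask))  # -> [[2. 2.]]: the padding row never contributes
```

Without the mask, the large padding embedding would dominate the average, which is why summing `token_embeddings * input_mask_expanded` and dividing by the mask count (clamped to avoid division by zero) is the standard pooling step.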