# PunjabiSBERT
PunjabiSBERT is a Punjabi sentence-similarity model. It is a PunjabiBERT model (l3cube-pune/punjabi-bert) trained on the NLI dataset, released as part of project MahaNLP: [https://github.com/l3cube-pune/MarathiNLP](https://github.com/l3cube-pune/MarathiNLP). A multilingual version supporting major Indic languages and cross-lingual tasks is available at [indic-sentence-bert-nli](https://huggingface.co/l3cube-pune/indic-sentence-bert-nli), and a better fine-tuned sentence-similarity model is available at [punjabi-sentence-similarity-sbert](https://huggingface.co/l3cube-pune/punjabi-sentence-similarity-sbert).
## Model Information

| Property | Details |
|---|---|
| Pipeline Tag | sentence-similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers |
| License | cc-by-4.0 |
| Language | pa (Punjabi) |
## Widget Examples

- Example 1
  - Source Sentence: "ਪੇਂਟਿੰਗ ਮੇਰਾ ਸ਼ੌਕ ਹੈ" ("Painting is my hobby")
  - Comparison Sentences:
    - "ਨੱਚਣਾ ਮੇਰਾ ਸ਼ੌਕ ਹੈ" ("Dancing is my hobby")
    - "ਮੇਰੇ ਬਹੁਤ ਸਾਰੇ ਸ਼ੌਕ ਹਨ" ("I have many hobbies")
    - "ਮੈਨੂੰ ਪੇਂਟਿੰਗ ਅਤੇ ਡਾਂਸ ਦੋਵਾਂ ਦਾ ਆਨੰਦ ਆਉਂਦਾ ਹੈ" ("I enjoy both painting and dance")
- Example 2
  - Source Sentence: "ਕੁਝ ਲੋਕ ਜਾ ਰਹੇ ਹਨ" ("Some people are going")
  - Comparison Sentences:
    - "ਲੋਕਾਂ ਦਾ ਇੱਕ ਸਮੂਹ ਜਾ ਰਿਹਾ ਹੈ" ("A group of people is going")
    - "ਇੱਕ ਬਿੱਲੀ ਦੁੱਧ ਪੀ ਰਹੀ ਹੈ" ("A cat is drinking milk")
    - "ਦੋ ਆਦਮੀ ਲੜ ਰਹੇ ਹਨ" ("Two men are fighting")
- Example 3
  - Source Sentence: "ਮੇਰੇ ਘਰ ਵਿੱਚ ਤੁਹਾਡਾ ਸੁਆਗਤ ਹੈ" ("Welcome to my home")
  - Comparison Sentences:
    - "ਮੈਂ ਤੁਹਾਡੇ ਘਰ ਵਿੱਚ ਤੁਹਾਡਾ ਸੁਆਗਤ ਕਰਾਂਗਾ" ("I will welcome you to your home")
    - "ਮੇਰਾ ਘਰ ਕਾਫੀ ਵੱਡਾ ਹੈ" ("My house is quite big")
    - "ਅੱਜ ਮੇਰੇ ਘਰ ਵਿੱਚ ਰਹੋ" ("Stay at my home today")
## Related Papers

More details on the dataset, models, and baseline results can be found in our papers:

```bibtex
@article{deode2023l3cube,
  title={L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT},
  author={Deode, Samruddhi and Gadre, Janhavi and Kajale, Aditi and Joshi, Ananya and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2304.11434},
  year={2023}
}

@article{joshi2022l3cubemahasbert,
  title={L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi},
  author={Joshi, Ananya and Kajale, Aditi and Gadre, Janhavi and Deode, Samruddhi and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2211.11187},
  year={2022}
}
```

- [monolingual Indic SBERT paper](https://arxiv.org/abs/2211.11187)
- [multilingual Indic SBERT paper](https://arxiv.org/abs/2304.11434)
## Other Monolingual Indic Sentence BERT Models

- [Marathi SBERT](https://huggingface.co/l3cube-pune/marathi-sentence-bert-nli)
- [Hindi SBERT](https://huggingface.co/l3cube-pune/hindi-sentence-bert-nli)
- [Kannada SBERT](https://huggingface.co/l3cube-pune/kannada-sentence-bert-nli)
- [Telugu SBERT](https://huggingface.co/l3cube-pune/telugu-sentence-bert-nli)
- [Malayalam SBERT](https://huggingface.co/l3cube-pune/malayalam-sentence-bert-nli)
- [Tamil SBERT](https://huggingface.co/l3cube-pune/tamil-sentence-bert-nli)
- [Gujarati SBERT](https://huggingface.co/l3cube-pune/gujarati-sentence-bert-nli)
- [Oriya SBERT](https://huggingface.co/l3cube-pune/odia-sentence-bert-nli)
- [Bengali SBERT](https://huggingface.co/l3cube-pune/bengali-sentence-bert-nli)
- [Punjabi SBERT](https://huggingface.co/l3cube-pune/punjabi-sentence-bert-nli)
- [Indic SBERT (multilingual)](https://huggingface.co/l3cube-pune/indic-sentence-bert-nli)
## Other Monolingual Similarity Models

- [Marathi Similarity](https://huggingface.co/l3cube-pune/marathi-sentence-similarity-sbert)
- [Hindi Similarity](https://huggingface.co/l3cube-pune/hindi-sentence-similarity-sbert)
- [Kannada Similarity](https://huggingface.co/l3cube-pune/kannada-sentence-similarity-sbert)
- [Telugu Similarity](https://huggingface.co/l3cube-pune/telugu-sentence-similarity-sbert)
- [Malayalam Similarity](https://huggingface.co/l3cube-pune/malayalam-sentence-similarity-sbert)
- [Tamil Similarity](https://huggingface.co/l3cube-pune/tamil-sentence-similarity-sbert)
- [Gujarati Similarity](https://huggingface.co/l3cube-pune/gujarati-sentence-similarity-sbert)
- [Oriya Similarity](https://huggingface.co/l3cube-pune/odia-sentence-similarity-sbert)
- [Bengali Similarity](https://huggingface.co/l3cube-pune/bengali-sentence-similarity-sbert)
- [Punjabi Similarity](https://huggingface.co/l3cube-pune/punjabi-sentence-similarity-sbert)
- [Indic Similarity (multilingual)](https://huggingface.co/l3cube-pune/indic-sentence-similarity-sbert)
## Quick Start

### Prerequisites

Using this model is easiest with sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
## Usage Examples

### Basic Usage (Sentence-Transformers)

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('l3cube-pune/punjabi-sentence-bert-nli')
embeddings = model.encode(sentences)
print(embeddings)
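```

Beyond printing raw embeddings, the same model can drive retrieval. The following is a minimal sketch (not from the original card) that uses sentence-transformers' built-in `util.semantic_search` to find the corpus sentence closest to a query; the corpus and query strings are placeholder examples:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('l3cube-pune/punjabi-sentence-bert-nli')

# Placeholder corpus and query; in practice these would be Punjabi sentences.
corpus = ["This is an example sentence", "Each sentence is converted"]
query = "An example sentence"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the single most similar corpus sentence for the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], f"(score: {hit['score']:.3f})")
```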
### Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can still use the model: pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('l3cube-pune/punjabi-sentence-bert-nli')
model = AutoModel.from_pretrained('l3cube-pune/punjabi-sentence-bert-nli')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
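To turn the pooled embeddings into similarity scores, one common pattern (a sketch continuing from the `sentence_embeddings` computed above, not from the original card) is to L2-normalize and take a dot product, which yields pairwise cosine similarities:

```python
import torch.nn.functional as F

# L2-normalize; the dot product of unit vectors is the cosine similarity.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_scores = normalized @ normalized.T  # shape: (num_sentences, num_sentences)
print(cosine_scores)
```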
## License

This model is released under the cc-by-4.0 license.