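🚀 Bangla Sentence Transformer
A sentence transformer is a natural language processing (NLP) model that encodes sentences into high-dimensional embedding vectors. These embeddings can be used for tasks such as text classification, information retrieval, and semantic search.
This model is fine-tuned from stsb-xlm-r-multilingual and is now available on Hugging Face! 🎉🎉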
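🚀 Quick Start
✨ Key Features
- Supports Bangla (bn) and English (en).
- Can be used to compute sentence similarity.
- Built on sentence-transformers and can be used for feature extraction.
📦 Installation
To use this model, install sentence-transformers: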
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer
sentences = ['আমি আপেল খেতে পছন্দ করি। ', 'আমার একটি আপেল মোবাইল আছে।','আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']
model = SentenceTransformer('shihab17/bangla-sentence-transformer')
embeddings = model.encode(sentences)
print(embeddings)
Advanced Usage
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask to ignore padding
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['আমি আপেল খেতে পছন্দ করি। ', 'আমার একটি আপেল মোবাইল আছে।','আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('shihab17/bangla-sentence-transformer')
model = AutoModel.from_pretrained('shihab17/bangla-sentence-transformer')

# Tokenize the sentences into a padded batch of PyTorch tensors
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool the token embeddings into one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
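To make the role of the attention mask in mean_pooling concrete, here is a minimal pure-Python sketch of the same idea for a single sentence. The function name masked_mean and the toy numbers are illustrative assumptions and are not part of the model or of any library.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                total[k] += vec[k]
    # Guard against division by zero, mirroring the clamp(min=1e-9) used above
    return [t / max(count, 1e-9) for t in total]


# Hypothetical toy input: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)
```

The padding vector does not affect the result because its mask value is 0, so only the two real tokens are averaged.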
How to Compute Sentence Similarity
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import pytorch_cos_sim

transformer = SentenceTransformer('shihab17/bangla-sentence-transformer')

sentences = ['আমি আপেল খেতে পছন্দ করি। ', 'আমার একটি আপেল মোবাইল আছে।','আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Encode all sentences once, then compare every pair (including each sentence with itself)
sentences_embeddings = transformer.encode(sentences)

for i in range(len(sentences)):
    for j in range(i, len(sentences)):
        sen_1 = sentences[i]
        sen_2 = sentences[j]
        # Cosine similarity between the two sentence embeddings
        sim_score = float(pytorch_cos_sim(sentences_embeddings[i], sentences_embeddings[j]))
        print(sen_1, '----->', sen_2, sim_score)
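The same cosine score can also drive a simple semantic search: score a query embedding against every corpus embedding and sort. The sketch below uses plain Python and made-up 3-dimensional vectors in place of real model output; the function names cosine_similarity and rank_by_similarity are hypothetical helpers, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings (assumed values, for illustration only)
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query = [2.0, 0.0, 0.0]
ranking = rank_by_similarity(query, corpus)
print(ranking)
```

With real data, the toy vectors would be replaced by the output of transformer.encode for the corpus and the query. Cosine similarity ignores vector length, so the query [2.0, 0.0, 0.0] matches [1.0, 0.0, 0.0] exactly.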
Best mean squared error (MSE): 2.5556
📚 Documentation
Citation
If you use this model, please cite the following paper:
@INPROCEEDINGS{10754765,
author={Uddin, Md. Shihab and Haque, Mohd Ariful and Rifat, Rakib Hossain and Kamal, Marufa and Gupta, Kishor Datta and George, Roy},
booktitle={2024 IEEE 15th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON)},
title={Bangla SBERT - Sentence Embedding Using Multilingual Knowledge Distillation},
year={2024},
volume={},
number={},
pages={495-500},
keywords={Sentiment analysis;Machine learning algorithms;Accuracy;Text categorization;Semantics;Transformers;Mobile communication;Information retrieval;Machine translation;Sentence Similarity;Sentence Transformer;SBERT;Knowledge Distillation;Bangla NLP},
doi={10.1109/UEMCON62879.2024.10754765}}