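🚀 Bangla Sentence Transformer
A sentence transformer is a natural language processing (NLP) model that encodes sentences into high-dimensional embedding vectors. These embeddings can be used for tasks such as text classification, information retrieval, and semantic search.
This model is fine-tuned from stsb-xlm-r-multilingual and is now available on Hugging Face! 🎉🎉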
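🚀 Quick Start
✨ Features
- Supports Bangla (bn) and English (en).
- Can be used to compute sentence similarity.
- Built on sentence-transformers, so it can also be used for feature extraction.
📦 Installation
To use this model, install sentence-transformers: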
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

# Four Bangla example sentences to embed.
sentences = ['আমি আপেল খেতে পছন্দ করি। ', 'আমার একটি আপেল মোবাইল আছে।','আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Load the model from Hugging Face and encode each sentence into one embedding vector.
model = SentenceTransformer('shihab17/bangla-sentence-transformer')
embeddings = model.encode(sentences)
print(embeddings)
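Advanced Usage
# Using the model through Hugging Face Transformers directly, without sentence-transformers.
from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings, using the attention mask to ignore padding.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['আমি আপেল খেতে পছন্দ করি। ', 'আমার একটি আপেল মোবাইল আছে।','আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']

# Load the tokenizer and model from Hugging Face.
tokenizer = AutoTokenizer.from_pretrained('shihab17/bangla-sentence-transformer')
model = AutoModel.from_pretrained('shihab17/bangla-sentence-transformer')

# Tokenize the sentences as one padded batch.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients.
with torch.no_grad():
    model_output = model(**encoded_input)

# Pool the token embeddings into one vector per sentence.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)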
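To make the masking in mean_pooling easier to follow, here is a minimal pure-Python sketch of the same idea on a single toy sentence. The helper name mean_pool and the two-dimensional vectors are hypothetical and chosen only for illustration; the real function operates on batched torch tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padded positions."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # mask value 1 = real token, 0 = padding
            for k in range(dim):
                summed[k] += vector[k]
    # Same guard as clamp(min=1e-9): avoid dividing by zero for an all-padding row.
    count = max(sum(attention_mask), 1e-9)
    return [value / count for value in summed]


# Toy sentence (hypothetical values): two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector [9.0, 9.0] does not affect the result, which is the reason the attention mask is multiplied in before summing.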
How to Compute Sentence Similarity
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import pytorch_cos_sim

transformer = SentenceTransformer('shihab17/bangla-sentence-transformer')
sentences = ['আমি আপেল খেতে পছন্দ করি। ', 'আমার একটি আপেল মোবাইল আছে।','আপনি কি এখানে কাছাকাছি থাকেন?', 'আশেপাশে কেউ আছেন?']
sentences_embeddings = transformer.encode(sentences)

# Compare every pair of sentences, including each sentence with itself.
for i in range(len(sentences)):
    for j in range(i, len(sentences)):
        sen_1 = sentences[i]
        sen_2 = sentences[j]
        # Cosine similarity between the two embeddings, converted to a plain float.
        sim_score = float(pytorch_cos_sim(sentences_embeddings[i], sentences_embeddings[j]))
        print(sen_1, '----->', sen_2, sim_score)
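The score printed above is a cosine similarity: the dot product of two embeddings divided by the product of their lengths. The following is a minimal sketch of that formula in plain Python. The function name cosine_similarity and the three-dimensional vectors are hypothetical stand-ins for real model embeddings, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real model output.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), orthogonal)  # 1.0 0.0
```

Vectors that point in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0, so higher scores indicate sentences the model considers more similar.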
Best mean squared error (MSE): 2.5556
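This card does not say how the MSE figure was measured or scaled. As an assumption only, given that the cited paper describes multilingual knowledge distillation, one plausible reading is the mean squared error between a student model's embeddings and a teacher model's embeddings for the same sentences. The sketch below shows that calculation on made-up numbers; the function name and both toy matrices are hypothetical and do not reproduce the reported value.

```python
def mean_squared_error(student, teacher):
    """Mean of squared element-wise differences between two embedding matrices."""
    total = 0.0
    count = 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Hypothetical embeddings for two sentences (assumption: not real model output).
student = [[1.0, 2.0], [3.0, 4.0]]
teacher = [[1.0, 1.0], [3.0, 2.0]]
mse = mean_squared_error(student, teacher)
print(mse)  # 1.25
```

Under this reading, a lower value would mean the student's embeddings sit closer to the teacher's.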
📚 Documentation
Citation
If you use this model, please cite the following paper:
@INPROCEEDINGS{10754765,
author={Uddin, Md. Shihab and Haque, Mohd Ariful and Rifat, Rakib Hossain and Kamal, Marufa and Gupta, Kishor Datta and George, Roy},
booktitle={2024 IEEE 15th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON)},
title={Bangla SBERT - Sentence Embedding Using Multilingual Knowledge Distillation},
year={2024},
volume={},
number={},
pages={495-500},
keywords={Sentiment analysis;Machine learning algorithms;Accuracy;Text categorization;Semantics;Transformers;Mobile communication;Information retrieval;Machine translation;Sentence Similarity;Sentence Transformer;SBERT;Knowledge Distillation;Bangla NLP},
doi={10.1109/UEMCON62879.2024.10754765}}