# 🚀 Greek Media SBERT (uncased)

This model is based on the Greek Media BERT (uncased) model and uses sentence-transformers to map sentences and paragraphs to a 768-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.
## 🚀 Quick Start

### Installation

Using this model is straightforward once you have sentence-transformers installed:

```bash
pip install -U sentence-transformers
```
### Usage Examples

#### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')
embeddings = model.encode(sentences)
print(embeddings)
```
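Since the model is intended for clustering and semantic search, you will typically compare embeddings rather than print them. Below is a minimal semantic-search sketch using `sentence_transformers.util.cos_sim`; the corpus and query strings are hypothetical placeholders (in practice they would be Greek text, which the model was trained on):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')

# Hypothetical corpus and query; real inputs would be Greek text
corpus = ["First document", "Second document", "Third document"]
query = "A search query"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus entry
scores = util.cos_sim(query_embedding, corpus_embeddings)  # shape: (1, len(corpus))
best = scores.argmax().item()
print(f"Best match: {corpus[best]} (score={scores[0, best].item():.4f})")
```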
#### Advanced Usage

Without sentence-transformers, you can use the model as follows: first pass your input through the Transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('dimitriz/st-greek-media-bert-base-uncased')
model = AutoModel.from_pretrained('dimitriz/st-greek-media-bert-base-uncased')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
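If you intend to compare these embeddings with cosine similarity or a dot product, a common follow-up step (not part of the original snippet, just standard practice) is to L2-normalize them:

```python
import torch.nn.functional as F

# L2-normalize so that a dot product between embeddings equals their cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
```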
## 📊 Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
## 🔧 Training Details

### Training Data

The model was trained on a custom dataset of triplets from the Greek "internet", "social media" and "news media" domains, described in detail in the DACL paper.

- The dataset was created by sampling sentence triplets from the same domain, where the first two sentences are more similar to each other than to the third.
- The training objective maximizes the similarity between the first two sentences and minimizes the similarity between the first and the third, as sketched below.
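The following is a minimal sketch of that triplet objective with the sentence-transformers API, filled in with the hyperparameters documented below; the base checkpoint name and the three example strings are assumptions for illustration (the actual training data is the custom Greek triplet dataset):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Assumed base checkpoint; training started from Greek Media BERT (uncased)
word_embedding_model = models.Transformer('dimitriz/greek-media-bert-base-uncased', max_seq_length=512)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each example is (anchor, positive, negative): the first two are more similar
train_examples = [
    InputExample(texts=["anchor sentence", "similar sentence", "dissimilar sentence"]),
    # ... one InputExample per triplet in the dataset
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Euclidean triplet loss with margin 5, as in the configuration below
train_loss = losses.TripletLoss(model=model,
                                distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
                                triplet_margin=5)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=3,
          warmup_steps=17290,
          optimizer_params={'lr': 2e-05},
          weight_decay=0.01)
```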
### Training Parameters

- Epochs: 3
- Batch size: 16
- Maximum sequence length: 512 tokens
- Hardware: a single NVIDIA RTX A6000 GPU (48 GB memory)
### Detailed Configuration

**DataLoader**

`torch.utils.data.dataloader.DataLoader` of length 10807 with parameters:

```python
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**

`sentence_transformers.losses.TripletLoss.TripletLoss` with parameters:

```python
{'distance_metric': 'TripletDistanceMetric.EUCLIDEAN', 'triplet_margin': 5}
```
**Parameters of the fit() method:**

```json
{
    "epochs": 3,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.TripletEvaluator.TripletEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 17290,
    "weight_decay": 0.01
}
```
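The configuration above uses a TripletEvaluator, which produces the accuracy metrics reported in the model information table at the end of this card. A minimal sketch of how such an evaluation can be run, with hypothetical placeholder triplets:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')

# Hypothetical held-out triplets: anchor, positive (more similar), negative (less similar)
anchors = ["anchor sentence"]
positives = ["similar sentence"]
negatives = ["dissimilar sentence"]

evaluator = TripletEvaluator(anchors, positives, negatives)
# Fraction of triplets where the positive is closer to the anchor than the negative
# (recent sentence-transformers versions return a dict of metrics instead of a float)
result = evaluator(model)
print(result)
```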
## 📚 Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
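This printout is what sentence-transformers itself reports for the model; you can reproduce it by loading the model and printing it:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')
print(model)  # prints the Transformer + Pooling module list shown above
```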
## 📄 Citation and Authors

This model was released with the paper "DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks" by Dimitrios Zaikis, Stylianos Kokkas and Ioannis Vlahavas, published in "Iliadis, L., Maglogiannis, I., Alonso, S., Jayne, C., Pimenidis, E. (eds) Engineering Applications of Neural Networks. EANN 2023. Communications in Computer and Information Science, vol 1826. Springer, Cham".

If you use this model, please cite:
```bibtex
@InProceedings{10.1007/978-3-031-34204-2_47,
  author="Zaikis, Dimitrios
  and Kokkas, Stylianos
  and Vlahavas, Ioannis",
  editor="Iliadis, Lazaros
  and Maglogiannis, Ilias
  and Alonso, Serafin
  and Jayne, Chrisina
  and Pimenidis, Elias",
  title="DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks",
  booktitle="Engineering Applications of Neural Networks",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="585--598",
  isbn="978-3-031-34204-2"
}
```
## 📋 Model Information

| Property | Details |
|---|---|
| Language | Greek |
| Task type | Sentence similarity |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers |
| Evaluation metrics | accuracy_cosinus, accuracy_euclidean, accuracy_manhattan |
| Model name | st-greek-media-bert-base-uncased |
| Dataset | all_custom_greek_media_triplets |
| Dataset type | sentence-pair |
| Cosine similarity accuracy | 0.9563965089445283 |
| Euclidean distance accuracy | 0.9566394253292384 |
| Manhattan distance accuracy | 0.9565353183072198 |