🚀 Greek Media SBERT (Uncased)
Built on top of the Greek Media BERT (uncased) model, this sentence-transformers model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
🚀 Quick Start
Installation
Using this model is straightforward once sentence-transformers is installed:
pip install -U sentence-transformers
Usage Examples
Basic Usage
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and compute one 768-dimensional embedding per sentence
model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')
embeddings = model.encode(sentences)
print(embeddings)
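Beyond printing raw embeddings, the same model can drive the clustering or semantic-search use cases mentioned above. The following is a minimal semantic-search sketch (not from the original card); it relies on the sentence_transformers.util.cos_sim helper, and the Greek query and corpus sentences are illustrative placeholders:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')

# Placeholder corpus and query (not taken from the original card)
corpus = [
    "Η κυβέρνηση ανακοίνωσε νέα οικονομικά μέτρα.",
    "Η ομάδα κέρδισε τον αγώνα στο τελευταίο λεπτό.",
]
query = "νέα μέτρα της κυβέρνησης"

# Encode corpus and query into 768-dimensional embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank the corpus by cosine similarity to the query
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```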
Advanced Usage
Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('dimitriz/st-greek-media-bert-base-uncased')
model = AutoModel.from_pretrained('dimitriz/st-greek-media-bert-base-uncased')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
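Continuing the snippet above (an optional sketch, not part of the original card), the mean-pooled embeddings can be compared with plain PyTorch, for example via cosine similarity after L2-normalization:

```python
import torch.nn.functional as F

# Continues from the snippet above: `sentence_embeddings` holds the mean-pooled vectors
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
cosine_similarities = normalized @ normalized.T
print(cosine_similarities)
```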
📊 Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
🔧 Training Details
Training Data
The model was trained on a custom dataset of triplets drawn from the Greek "internet", "social media" and "news media" domains, as described in the DACL paper.
- The dataset was created by sampling sentence triplets from the same domain, where the first two sentences are more similar to each other than to the third.
- The training objective is to maximize the similarity between the first two sentences while minimizing the similarity between the first and the third (see the sketch after this list).
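For illustration only, a single training example in this triplet format could be wrapped with the standard sentence-transformers InputExample class; the sentences below are placeholders, not samples from the actual dataset:

```python
from sentence_transformers import InputExample

triplet = InputExample(texts=[
    "anchor sentence sampled from one media domain",   # anchor
    "a sentence from the same context as the anchor",  # positive (more similar)
    "a sentence that is less related to the anchor",   # negative (less similar)
])
```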
Training Parameters
- Epochs: 3
- Batch size: 16
- Maximum sequence length: 512 tokens
- Hardware: a single NVIDIA RTX A6000 GPU (48 GB of GPU memory)
Detailed Parameter Configuration
DataLoader
torch.utils.data.dataloader.DataLoader of length 10807, with parameters:
{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
Loss
sentence_transformers.losses.TripletLoss.TripletLoss, with parameters:
{'distance_metric': 'TripletDistanceMetric.EUCLIDEAN', 'triplet_margin': 5}
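With the Euclidean distance metric and a margin of 5, this corresponds to the standard triplet objective minimized per (anchor $a$, positive $p$, negative $n$) triplet, where $f(\cdot)$ denotes the pooled sentence embedding:

$$\mathcal{L}(a, p, n) = \max\left(\lVert f(a) - f(p)\rVert_2 - \lVert f(a) - f(n)\rVert_2 + 5,\; 0\right)$$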
Parameters of the fit() method
{
"epochs": 3,
"evaluation_steps": 1000,
"evaluator": "sentence_transformers.evaluation.TripletEvaluator.TripletEvaluator",
"max_grad_norm": 1,
"optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
"optimizer_params": {
"lr": 2e-05
},
"scheduler": "WarmupLinear",
"steps_per_epoch": null,
"warmup_steps": 17290,
"weight_decay": 0.01
}
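Put together, the configuration above roughly corresponds to a fit() call like the following sketch. It is not taken from the original card: the base checkpoint name and the train_triplets stand-in are assumptions for illustration, and the TripletEvaluator on held-out triplets is omitted.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint (the Greek Media BERT uncased model) and a tiny
# stand-in for the real triplet dataset described above.
model = SentenceTransformer('dimitriz/greek-media-bert-base-uncased')
train_triplets = [InputExample(texts=["anchor", "positive", "negative"])]

train_dataloader = DataLoader(train_triplets, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(
    model=model,
    distance_metric=losses.TripletDistanceMetric.EUCLIDEAN,
    triplet_margin=5,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    evaluation_steps=1000,          # TripletEvaluator on held-out triplets omitted here
    warmup_steps=17290,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
    scheduler='WarmupLinear',
)
```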
📚 Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
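To confirm this module stack locally, a quick check (a sketch, assuming the published model loads from the Hub) is:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('dimitriz/st-greek-media-bert-base-uncased')
print(model)                                     # Transformer + Pooling stack
print(model.get_sentence_embedding_dimension())  # 768
```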
📄 Citation and Authors
This model was officially released with the paper "DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks" by Dimitrios Zaikis, Stylianos Kokkas and Ioannis Vlahavas, published in "Iliadis, L., Maglogiannis, I., Alonso, S., Jayne, C., Pimenidis, E. (eds) Engineering Applications of Neural Networks. EANN 2023. Communications in Computer and Information Science, vol 1826. Springer, Cham".
If you use this model, please cite:
@InProceedings{10.1007/978-3-031-34204-2_47,
author="Zaikis, Dimitrios
and Kokkas, Stylianos
and Vlahavas, Ioannis",
editor="Iliadis, Lazaros
and Maglogiannis, Ilias
and Alonso, Serafin
and Jayne, Chrisina
and Pimenidis, Elias",
title="DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks",
booktitle="Engineering Applications of Neural Networks",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="585--598",
isbn="978-3-031-34204-2"
}
📋 Model Information
Property | Details |
---|---|
Language | Greek |
Task type | Sentence similarity |
Tags | sentence-transformers, feature-extraction, sentence-similarity, transformers |
Evaluation metrics | accuracy_cosinus, accuracy_euclidean, accuracy_manhattan |
Model name | st-greek-media-bert-base-uncased |
Dataset | all_custom_greek_media_triplets |
Dataset type | sentence-pair |
Cosine similarity accuracy | 0.9563965089445283 |
Euclidean distance accuracy | 0.9566394253292384 |
Manhattan distance accuracy | 0.9565353183072198 |







