# SPhilBERTa
SPhilBERTa is a Sentence Transformer model based on PhilBERTa that identifies cross-lingual references between Latin and Ancient Greek texts, contributing to the field of Classical Philology.
## 🚀 Quick Start
The paper *Exploring Language Models for Classical Philology* is the first attempt to systematically offer state-of-the-art language models for Classical Philology. Leveraging PhilBERTa, we present SPhilBERTa, a Sentence Transformer model designed to identify cross-lingual references between Latin and Ancient Greek texts. We adopt the knowledge distillation method proposed by Reimers and Gurevych (2020); a sketch of the idea follows below. Our paper can be accessed [here](https://arxiv.org/abs/2308.12008).
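The core idea of this distillation: a monolingual (English) teacher model produces target sentence embeddings, and the multilingual student is trained so that both an English sentence and its Latin or Ancient Greek translation are mapped close to the teacher's embedding. Below is a minimal sketch using the classic `sentence-transformers` training API; the teacher checkpoint, student checkpoint, and data file are illustrative assumptions, not the paper's exact configuration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Hypothetical choices: the teacher, student, and data file are placeholders.
teacher = SentenceTransformer('all-mpnet-base-v2')   # monolingual (English) teacher
student = SentenceTransformer('bowphs/PhilBerta')    # multilingual student

# Parallel data: each line holds an English sentence and its translation,
# separated by a tab (English \t Latin/Ancient Greek).
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('parallel-sentences.tsv')

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
# The student is trained to reproduce the teacher's embeddings (MSE loss).
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```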
## ✨ Features
- Multilingual Support: supports Ancient Greek (`grc`), English (`en`), and Latin (`la`).
- Sentence Similarity: specialized for sentence-similarity tasks.
- Knowledge Distillation: trained with the knowledge distillation method of Reimers and Gurevych (2020) to align sentence embeddings across languages.
## 📦 Installation
This README does not provide specific installation steps. However, to use the model, you may need to install the relevant libraries, such as `sentence-transformers` or `transformers`.
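Both are available from PyPI; for example:

```bash
pip install -U sentence-transformers transformers
```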
## 💻 Usage Examples
### Basic Usage
#### Sentence-Transformers
When you have `sentence-transformers` installed, you can use the model as follows:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace '{MODEL_NAME}' with the model's identifier on the Hugging Face Hub
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)
print(embeddings)
```
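Since the model embeds Latin and Ancient Greek sentences into a shared space, it can score cross-lingual pairs directly. A minimal sketch using `sentence_transformers.util.cos_sim`; the example pair (the openings of the *Aeneid* and the *Odyssey*) is purely illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# Replace '{MODEL_NAME}' with the model's identifier on the Hugging Face Hub
model = SentenceTransformer('{MODEL_NAME}')

latin = "Arma virumque cano"          # Vergil, Aeneid 1.1
greek = "ἄνδρα μοι ἔννεπε, μοῦσα"     # Homer, Odyssey 1.1

embeddings = model.encode([latin, greek], convert_to_tensor=True)

# Cosine similarity between the Latin and the Greek sentence embedding
score = util.cos_sim(embeddings[0], embeddings[1])
print(score.item())
```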
#### HuggingFace Transformers
Without `sentence-transformers`, you can use the model in the following way: first, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch


def mean_pooling(model_output, attention_mask):
    # Mean pooling: average the token embeddings, using the attention mask
    # so that padding tokens do not contribute to the sentence embedding.
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


sentences = ['This is an example sentence', 'Each sentence is converted']

# Replace '{MODEL_NAME}' with the model's identifier on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
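To compare the resulting embeddings, e.g. for cross-lingual similarity search, you can again use cosine similarity; a short sketch with plain PyTorch:

```python
import torch.nn.functional as F

# Cosine similarity between the two sentence embeddings computed above
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(similarity.item())
```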
## 📚 Documentation
### Contact
If you have any questions or encounter problems, feel free to [reach out](mailto:riemenschneider@cl.uni-heidelberg.de).
### Citation
```bibtex
@incollection{riemenschneiderfrank:2023b,
  author    = "Riemenschneider, Frederick and Frank, Anette",
  title     = "{Graecia capta ferum victorem cepit. Detecting Latin Allusions to Ancient Greek Literature}",
  year      = "2023",
  url       = "https://arxiv.org/abs/2308.12008",
  note      = "to appear",
  publisher = "Association for Computational Linguistics",
  booktitle = "Proceedings of the First Workshop on Ancient Language Processing",
  address   = "Varna, Bulgaria"
}
```
## 📄 License

This project is licensed under the `apache-2.0` license.
| Property | Details |
|----------|---------|
| Pipeline Tag | sentence-similarity |
| Supported Languages | multilingual, grc, en, la |
| License | apache-2.0 |
| Tags | sentence-transformers, sentence-similarity |