🚀 shlm-grc-en
This model generates sentence embeddings for English and Ancient Greek text in a shared vector space.
🚀 Quick Start
The base model uses a modified version of the HLM architecture described in "Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers" (arXiv). It was trained to produce sentence embeddings using the multilingual knowledge distillation method and datasets described in "Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation" (arXiv). This model was distilled from BAAI/bge-base-en-v1.5, so English and Ancient Greek sentences are embedded in the same vector space.
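To make the distillation idea concrete, here is a minimal sketch of the objective commonly used in multilingual knowledge distillation: the student's embedding of an English sentence and of its Ancient Greek translation are both pulled toward the teacher's embedding of the English sentence using mean squared error. This is the general formulation of the technique, not code from this model's training pipeline; the function names and the toy vectors are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en: Vector, student_en: Vector, student_grc: Vector) -> float:
    """Loss for one (English, Ancient Greek) translation pair.

    teacher_en:  teacher embedding of the English sentence (fixed target)
    student_en:  student embedding of the same English sentence
    student_grc: student embedding of the Ancient Greek translation
    """
    return mse(teacher_en, student_en) + mse(teacher_en, student_grc)


# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
teacher_en = [1.0, 0.0, 2.0]
student_en = [1.0, 0.0, 2.0]    # already matches the teacher
student_grc = [0.0, 0.0, 2.0]   # differs from the teacher in one dimension

loss = distillation_loss(teacher_en, student_en, student_grc)
print(loss)
```

Minimizing this loss over many translation pairs is what places both languages in the teacher's vector space.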
✨ Features
- Generate sentence embeddings for English and Ancient Greek in a shared vector space.
- Based on a modified HLM architecture.
- Trained with multilingual knowledge distillation.
📦 Installation
This model is currently incompatible with the latest version of the sentence-transformers library. For now, either use HuggingFace Transformers directly (see below) or the following fork of sentence-transformers:
https://github.com/kevinkrahn/sentence-transformers
💻 Usage Examples
Basic Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('kevinkrahn/shlm-grc-en')
embeddings = model.encode(sentences)
print(embeddings)
Advanced Usage (HuggingFace Transformers)
Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply the appropriate pooling operation (here, CLS pooling) to the contextualized token embeddings.
from transformers import AutoTokenizer, AutoModel
import torch

def cls_pooling(model_output):
    # The first element of the model output holds the token embeddings;
    # keep only the first (CLS) position of each sentence.
    return model_output[0][:, 0]

sentences = ['This is an English sentence', 'Ὁ Παρθενών ἐστιν ἱερὸν καλὸν τῆς Ἀθήνης.']

# trust_remote_code=True lets Transformers load the custom model code shipped with the checkpoint.
model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only, so gradients are not needed.
with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = cls_pooling(model_output)

print("Sentence embeddings:")
print(sentence_embeddings)
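Because both languages share one vector space, a typical next step is to compare an English sentence against Ancient Greek candidates by cosine similarity. The sketch below is self-contained and uses plain Python lists with made-up 2-dimensional vectors; the helper names are hypothetical. With real embeddings you could pass `sentence_embeddings.tolist()`, or use `torch.nn.functional.cosine_similarity` directly on the tensors.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def similarity_matrix(queries: Sequence[Vector], candidates: Sequence[Vector]) -> List[List[float]]:
    """scores[i][j] is the similarity between queries[i] and candidates[j]."""
    return [[cosine_similarity(q, c) for c in candidates] for q in queries]


# Toy vectors standing in for real embeddings (made-up values).
english = [[1.0, 0.0], [0.0, 1.0]]
greek = [[2.0, 0.0], [1.0, 1.0]]

scores = similarity_matrix(english, greek)

# For each English vector, index of the most similar Greek vector.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(scores)
print(best)
```

Cosine similarity ignores vector length, so it does not matter whether the embeddings are normalized beforehand.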
📚 Documentation
If you use this model, please cite the following papers:
@inproceedings{riemenschneider-krahn-2024-heidelberg,
title = "Heidelberg-Boston @ {SIGTYP} 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers",
author = "Riemenschneider, Frederick and
Krahn, Kevin",
editor = "Hahn, Michael and
Sorokin, Alexey and
Kumar, Ritesh and
Shcherbakov, Andreas and
Otmakhova, Yulia and
Yang, Jinrui and
Serikov, Oleg and
Rani, Priya and
Ponti, Edoardo M. and
Murado{\u{g}}lu, Saliha and
Gao, Rena and
Cotterell, Ryan and
Vylomova, Ekaterina",
booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
month = mar,
year = "2024",
address = "St. Julian's, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigtyp-1.16",
pages = "131--141",
}
@inproceedings{krahn-etal-2023-sentence,
title = "Sentence Embedding Models for {A}ncient {G}reek Using Multilingual Knowledge Distillation",
author = "Krahn, Kevin and
Tate, Derrick and
Lamicela, Andrew C.",
editor = "Anderson, Adam and
Gordin, Shai and
Li, Bin and
Liu, Yudong and
Passarotti, Marco C.",
booktitle = "Proceedings of the Ancient Language Processing Workshop",
month = sep,
year = "2023",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2023.alp-1.2",
pages = "13--22",
}
📄 License
This project is licensed under the MIT license.