shlm-grc-en Open-Source Model - Freely Create Sentence Embeddings for Ancient Greek and English Texts

Developed by kevinkrahn

This model creates sentence embeddings for Ancient Greek and English texts in a shared vector space. It is based on a modified HLM architecture and trained with multilingual knowledge distillation.

Text Embedding

Transformers

Supports Multiple Languages

Open Source License: MIT

Tags: Ancient Greek-English Embeddings, Character-Aware Hierarchical Transformer, Multilingual Knowledge Distillation

Downloads 62

Release Date: 5/29/2024

Model Overview

This model is used to generate sentence embeddings for English and Ancient Greek texts, supporting cross-lingual semantic search and sentence similarity computation.

Model Features

Cross-Lingual Shared Vector Space

Embeds English and Ancient Greek sentences in the same vector space, which enables cross-lingual semantic search.

Character-Aware Architecture

Uses a modified HLM architecture (a character-aware hierarchical transformer), designed for low-resource languages such as Ancient Greek.

Knowledge Distillation Training

Distilled from the BAAI/bge-base-en-v1.5 model using multilingual knowledge distillation.

Model Capabilities

Sentence Embedding Generation

Cross-Lingual Semantic Search

Sentence Similarity Computation

Feature Extraction

Use Cases

Academic Research

Classical Literature Analysis

Used to analyze semantic correspondences between Ancient Greek texts and their English translations.

Information Retrieval

Cross-Lingual Document Retrieval

Performs semantic search in databases containing Ancient Greek and English documents.

🚀 shlm-grc-en

This model generates sentence embeddings for English and Ancient Greek text in a shared vector space.

🚀 Quick Start

The base model uses a modified version of the HLM architecture described in Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers (arXiv). It is trained to produce sentence embeddings using the multilingual knowledge distillation method and datasets described in Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation (arXiv). The model was distilled from BAAI/bge-base-en-v1.5.

✨ Features

Generate sentence embeddings for English and Ancient Greek in a shared vector space.
Based on a modified HLM architecture.
Trained with multilingual knowledge distillation.

📦 Installation

This model is currently incompatible with the latest version of the sentence-transformers library. For now, use either HuggingFace Transformers directly (see the Advanced Usage example) or the following fork of sentence-transformers: https://github.com/kevinkrahn/sentence-transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

# Requires the sentence-transformers fork listed under Installation
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('kevinkrahn/shlm-grc-en')
embeddings = model.encode(sentences)
print(embeddings)
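
The embeddings returned above can be compared with cosine similarity, which is the usual basis for the semantic search and sentence similarity use cases. The following is a minimal sketch in plain Python, not part of the model's own API: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the short toy vectors stand in for real embeddings (for example, the result of `embeddings.tolist()`).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (corpus_index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings (assumption)
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

ranking = rank_by_similarity(query, corpus)
for index, score in ranking:
    print(index, round(score, 3))
```

Because English and Ancient Greek sentences share one vector space, the query and the corpus may be in different languages.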

Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply the appropriate pooling operation to the contextualized token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output):
    # model_output[0] holds the per-token embeddings; keep the first (CLS) position of each sentence
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an English sentence', 'Ὁ Παρθενών ἐστιν ἱερὸν καλὸν τῆς Ἀθήνης.']

# Load model from HuggingFace Hub
model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output)

print("Sentence embeddings:")
print(sentence_embeddings)
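
For the classical literature use case, one possible way to pair Ancient Greek sentences with their English translations is mutual nearest-neighbour matching on cosine similarity. The sketch below is an illustration under stated assumptions, not the authors' method: `cosine` and `align_mutual_nearest` are hypothetical helpers, and the 2-dimensional toy vectors stand in for real embeddings (for example, `sentence_embeddings.tolist()` split by language).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align_mutual_nearest(greek_vecs, english_vecs):
    """Pair Greek sentence i with English sentence j only when each is the
    other's most similar sentence. Returns a list of (i, j, score) tuples."""
    if not greek_vecs or not english_vecs:
        return []
    # Full similarity matrix: rows are Greek sentences, columns are English ones
    sim = [[cosine(g, e) for e in english_vecs] for g in greek_vecs]
    pairs = []
    for i, row in enumerate(sim):
        j = max(range(len(row)), key=lambda col: row[col])
        best_i = max(range(len(sim)), key=lambda r: sim[r][j])
        if best_i == i:
            pairs.append((i, j, row[j]))
    return pairs


# Toy vectors standing in for real embeddings (assumption)
greek_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
english_vecs = [[0.0, 1.0], [1.0, 0.1]]

pairs = align_mutual_nearest(greek_vecs, english_vecs)
print([(i, j) for i, j, _ in pairs])
```

Sentences without a mutual best match (the third Greek vector here) are left unpaired, which trades recall for fewer spurious alignments.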

📚 Documentation

If you use this model, please cite the following papers:

@inproceedings{riemenschneider-krahn-2024-heidelberg,
    title = "Heidelberg-Boston @ {SIGTYP} 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers",
    author = "Riemenschneider, Frederick  and
      Krahn, Kevin",
    editor = "Hahn, Michael  and
      Sorokin, Alexey  and
      Kumar, Ritesh  and
      Shcherbakov, Andreas  and
      Otmakhova, Yulia  and
      Yang, Jinrui  and
      Serikov, Oleg  and
      Rani, Priya  and
      Ponti, Edoardo M.  and
      Murado{\u{g}}lu, Saliha  and
      Gao, Rena  and
      Cotterell, Ryan  and
      Vylomova, Ekaterina",
    booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = mar,
    year = "2024",
    address = "St. Julian's, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigtyp-1.16",
    pages = "131--141",
}

@inproceedings{krahn-etal-2023-sentence,
    title = "Sentence Embedding Models for {A}ncient {G}reek Using Multilingual Knowledge Distillation",
    author = "Krahn, Kevin  and
      Tate, Derrick  and
      Lamicela, Andrew C.",
    editor = "Anderson, Adam  and
      Gordin, Shai  and
      Li, Bin  and
      Liu, Yudong  and
      Passarotti, Marco C.",
    booktitle = "Proceedings of the Ancient Language Processing Workshop",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.alp-1.2",
    pages = "13--22",
}

📄 License

This project is licensed under the MIT license.
