Sbert Cased Finnish Paraphrase

Developed by TurkuNLP

Finnish Sentence BERT model fine-tuned from FinBERT, used for sentence similarity calculation and feature extraction

Tags: Text Embedding, Transformers, Finnish sentence embeddings, Paraphrase pair detection, Semantic similarity calculation

Release Time: 3/2/2022

Model Overview

This is a Finnish Sentence BERT model fine-tuned from FinBERT for Finnish sentence similarity calculation and feature extraction. It can be loaded through either SentenceTransformers or HuggingFace Transformers.

Model Features

Finnish language optimization

Trained specifically for Finnish, with the cased FinBERT model (TurkuNLP/bert-base-finnish-cased-v1) as the base

Large-scale training data

Trained on the Finnish Paraphrase Corpus together with automatically collected paraphrase candidates (500K positive and 5M negative examples)

Efficient sentence encoding

Encodes each sentence into a 768-dimensional embedding vector (mean pooling over tokens) for subsequent similarity calculations

Model Capabilities

Sentence feature extraction

Sentence similarity calculation

Finnish text processing

Use Cases

Information retrieval

Similar sentence retrieval

Finding semantically similar sentences from a large text database

Can retrieve the most similar sentences from a 400-million-sentence dataset via the demo system

Text analysis

Paraphrase recognition

Identifying whether two Finnish sentences are paraphrases of each other

🚀 Cased Finnish Sentence BERT model

This is a Finnish Sentence BERT model trained from FinBERT. It can be used for tasks like retrieving the most similar sentences from a large dataset.

🚀 Quick Start

A demo that retrieves the most similar sentences from a dataset of 400 million sentences is available here.

✨ Features

Language: Finnish
Pipeline Tag: Sentence-similarity
Tags: sentence-transformers, feature-extraction, sentence-similarity, transformers
Widget Example: "Minusta täällä on ihana asua!"

📦 Installation

The model runs on the library used in training. Refer to its official documentation for installation instructions:

sentence-transformers

💻 Usage Examples

Basic Usage

Usage follows the HuggingFace documentation for the English Sentence Transformer models. The model can be loaded either through SentenceTransformer or through HuggingFace Transformers.

SentenceTransformer

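from sentence_transformers import SentenceTransformer

# Sentences we want sentence embeddings for
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

# Load the model from the HuggingFace Hub
model = SentenceTransformer('TurkuNLP/sbert-cased-finnish-paraphrase')

# Encode the sentences: one 768-dimensional vector per sentence
embeddings = model.encode(sentences)
print(embeddings)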
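
The embeddings returned by model.encode can be compared with cosine similarity to rank candidate sentences against a query. The following is a minimal sketch and not part of the original model card: the helper names cosine_similarity and rank_by_similarity are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional vectors the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model.encode(...) output (assumption).
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print(ranking)
```

With real embeddings, one row of the array returned by model.encode would serve as the query and the remaining rows as candidates. For a collection as large as the 400-million-sentence demo, an exhaustive scan like this would likely be replaced by an approximate nearest-neighbour index.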

HuggingFace Transformers

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')
model = AutoModel.from_pretrained('TurkuNLP/sbert-cased-finnish-paraphrase')

# Tokenize sentences (max_length=128 matches the max_seq_length in the model architecture)
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
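
To see why the attention mask matters in mean_pooling, consider a padded batch: padding positions still produce token vectors, and averaging over them would distort the sentence embedding. The sketch below reimplements the same masked average in plain Python on made-up numbers. The function name masked_mean and the toy values are illustrative assumptions, not model output.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                totals[j] += value
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# One sentence with two real tokens and one padding token (toy 2-d vectors).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))       # padding ignored → [2.0, 3.0]
print(masked_mean(tokens, [1, 1, 1]))  # padding wrongly included in the average
```

The second call shows how a single padding vector can dominate the average when the mask is not applied.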

🔧 Technical Details

Training

Library: sentence-transformers
FinBERT model: TurkuNLP/bert-base-finnish-cased-v1
Data: The data provided here, including the Finnish Paraphrase Corpus and the automatically collected paraphrase candidates (500K positive and 5M negative)
Pooling: mean pooling
Task: Binary prediction of whether two sentences are paraphrases. Labels 3 and 4 are treated as paraphrases, and labels 1 and 2 as non-paraphrases. Details on the labels are given in the corpus documentation.
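
The label mapping described under Task can be written as a small function. This is a sketch under the assumption that each sentence pair carries a single integer label from 1 to 4; the function name to_binary_label is hypothetical, and the actual corpus files may encode labels differently.

```python
def to_binary_label(label):
    """Map a 1-4 paraphrase label to 1 (paraphrase) or 0 (non-paraphrase)."""
    if label not in (1, 2, 3, 4):
        raise ValueError(f"expected a label from 1 to 4, got {label!r}")
    return 1 if label >= 3 else 0


labels = [4, 1, 3, 2]
binary = [to_binary_label(x) for x in labels]
print(binary)  # → [1, 0, 1, 0]
```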

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
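
The printed architecture corresponds to a two-module pipeline: a BERT transformer followed by mean pooling. A sketch of how such a pipeline could be assembled with the sentence-transformers models API follows, assuming the FinBERT checkpoint named under Training as the starting point. It starts from the base FinBERT weights without the paraphrase fine-tuning, so it is not equivalent to the published model; the function name build_encoder is hypothetical.

```python
def build_encoder(base_model="TurkuNLP/bert-base-finnish-cased-v1"):
    """Assemble Transformer + mean Pooling, mirroring the printed architecture."""
    # Imported lazily so that defining the function does not require the library.
    from sentence_transformers import SentenceTransformer, models

    transformer = models.Transformer(
        base_model,
        max_seq_length=128,   # as in the printed config
        do_lower_case=False,  # cased model
    )
    pooling = models.Pooling(
        transformer.get_word_embedding_dimension(),  # 768 for a BERT-base model
        pooling_mode_mean_tokens=True,
        pooling_mode_cls_token=False,
        pooling_mode_max_tokens=False,
    )
    return SentenceTransformer(modules=[transformer, pooling])
```

Calling build_encoder() downloads the base checkpoint, and the returned object exposes the same encode method used in the usage examples.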

📚 Documentation

Citing & Authors

While the publication is being drafted, please cite this page.

References

J. Kanerva, F. Ginter, LH. Chang, I. Rastas, V. Skantsi, J. Kilpeläinen, HM. Kupari, J. Saarni, M. Sevón, and O. Tarkka. Finnish Paraphrase Corpus. In NoDaLiDa 2021, 2021.
N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP, pages 3982–3992, 2019.
A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076, 2019.
