sbert-uncased-finnish-paraphrase: Open-Source Model for Finnish Sentence Similarity and Feature Extraction

Sbert Uncased Finnish Paraphrase

Developed by TurkuNLP

Finnish Sentence BERT model trained from FinBERT, used for sentence similarity calculation and feature extraction

Text Embedding

Transformers

Other #Finnish sentence semantic matching #Case-insensitive #Sentence embedding generation

Downloads 895

Release Time: 3/2/2022

Model Overview

This is a sentence transformer model trained from FinBERT for Finnish sentence similarity calculation and feature extraction. It produces a sentence embedding by mean pooling over the token embeddings and is suited to tasks such as paraphrase identification.

Model Features

Case-insensitive

The model is uncased: input is lowercased (do_lower_case is True in the model configuration), so differences in capitalization do not affect the embeddings

High-quality Finnish training

Trained on the Finnish Paraphrase Corpus and automatically collected paraphrase candidates (500,000 positive examples, 5 million negative examples)

Efficient sentence embeddings

Generates 768-dimensional sentence-level embeddings by mean pooling over token embeddings

Model Capabilities

Sentence feature extraction

Sentence similarity calculation

Semantic similarity comparison

Finnish text processing

Use Cases

Text similarity

Paraphrase identification

Identify whether two Finnish sentences are paraphrases

Trained for binary paraphrase prediction on the Finnish Paraphrase Corpus; evaluation results have not yet been published

Semantic search

Retrieve semantically similar sentences from large-scale text

Can be used to build a semantic retrieval system; a demo built with the cased model retrieves similar sentences from a dataset of 400 million sentences

Feature extraction

Sentence embedding generation

Generate sentence-level feature representations for downstream tasks

Produces 768-dimensional sentence embedding vectors

🚀 Uncased Finnish Sentence BERT model

This is a Finnish Sentence BERT model trained from FinBERT, which can be used for sentence similarity tasks.

🚀 Quick Start

Finnish Sentence BERT is trained from FinBERT. A demo on retrieving the most similar sentences from a dataset of 400 million sentences using the cased model can be found here.
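
The retrieval idea behind such a demo can be illustrated with a minimal brute-force sketch. This is not the demo's implementation, which is not described on this page. The helper names (normalize, top_k) and the 3-dimensional stand-in vectors are hypothetical; real embeddings from this model have 768 dimensions, and a corpus of hundreds of millions of sentences would most likely need an approximate nearest-neighbour index rather than a linear scan.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def top_k(query, corpus, k=2):
    """Return (index, cosine score) pairs for the k corpus vectors closest to query."""
    q = normalize(query)
    scores = []
    for index, vector in enumerate(corpus):
        c = normalize(vector)
        scores.append((index, sum(a * b for a, b in zip(q, c))))
    # Highest cosine similarity first
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Hypothetical stand-in embeddings; in practice these come from model.encode(...)
corpus_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_embedding = [1.0, 0.05, 0.0]

hits = top_k(query_embedding, corpus_embeddings, k=2)
for index, score in hits:
    print(index, round(score, 4))
```

Normalizing both sides first means the dot product equals cosine similarity, so corpus vectors can be normalized once and stored if the corpus is fixed.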

✨ Features

Language: Finnish
Pipeline Tag: Sentence Similarity
Tags: sentence-transformers, feature-extraction, sentence-similarity, transformers

📦 Installation

No model-specific installation is required. Install the sentence-transformers library (for example with pip install -U sentence-transformers), which pulls in transformers and torch as dependencies; see the official documentation of each library for details.

💻 Usage Examples

Basic Usage

Usage follows the standard HuggingFace documentation. The model can be loaded either through SentenceTransformer or through HuggingFace Transformers.

SentenceTransformer

from sentence_transformers import SentenceTransformer
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

model = SentenceTransformer('TurkuNLP/sbert-uncased-finnish-paraphrase')
embeddings = model.encode(sentences)
print(embeddings)
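
To turn two embeddings into a similarity score, cosine similarity is a common choice. The following is a minimal sketch in plain Python, assuming the embeddings are available as lists of floats. The 4-dimensional vectors are hypothetical stand-ins so the example runs without downloading the model; embeddings returned by model.encode have 768 dimensions. Any threshold for deciding "paraphrase or not" would have to be tuned on labeled data and is not specified here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


# Hypothetical stand-in embeddings (not produced by the model)
embedding_a = [0.2, 0.1, -0.4, 0.3]
embedding_b = [0.2, 0.1, -0.4, 0.3]   # identical to embedding_a
embedding_c = [-0.3, 0.4, 0.1, 0.2]   # orthogonal to embedding_a

score_same = cosine_similarity(embedding_a, embedding_b)
score_diff = cosine_similarity(embedding_a, embedding_c)
print(score_same, score_diff)
```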

HuggingFace Transformers

from transformers import AutoTokenizer, AutoModel
import torch


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TurkuNLP/sbert-uncased-finnish-paraphrase')
model = AutoModel.from_pretrained('TurkuNLP/sbert-uncased-finnish-paraphrase')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
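
What the attention mask contributes to mean pooling is easiest to see on a tiny example. The sketch below re-implements the same idea in plain Python for a single sentence; the function name masked_mean and the 2-dimensional token vectors are hypothetical and chosen only for illustration.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1; padding (0) is ignored."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against division by zero, mirroring torch.clamp(..., min=1e-9)
    denominator = max(count, 1e-9)
    return [total / denominator for total in totals]


# One sentence: three real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [3.0, 4.0]
```

Without the mask, the padding vector would be averaged in and distort the sentence embedding, which is why padded batches need the mask-aware version.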

📚 Documentation

Training

Library: sentence-transformers
FinBERT model: TurkuNLP/bert-base-finnish-uncased-v1
Data: The data provided here, including the Finnish Paraphrase Corpus and the automatically collected paraphrase candidates (500K positive and 5M negative)
Pooling: mean pooling
Task: Binary prediction, whether two sentences are paraphrases or not. Note: the labels 3 and 4 are considered paraphrases, and labels 1 and 2 non-paraphrases. Details on labels
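
The label mapping described in the Task entry can be sketched as a small helper. This is an illustration, not the authors' training code. The function name is_paraphrase is hypothetical, and the handling of labels that carry extra characters after the digit (such as "4<") is an assumption about the corpus format; only the 3/4 versus 1/2 split comes from this page.

```python
def is_paraphrase(label):
    """Map a corpus label to the binary target: 3 and 4 -> True, 1 and 2 -> False."""
    text = str(label).strip()
    # Assumption: the base label is the leading digit; trailing characters are ignored
    if not text or text[0] not in "1234":
        raise ValueError(f"unexpected label: {label!r}")
    return int(text[0]) >= 3


examples = ["4", "3", "2", "1", "4<"]
targets = [is_paraphrase(label) for label in examples]
print(list(zip(examples, targets)))
```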

Evaluation Results

A publication detailing the evaluation results is currently being drafted.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 License

No license information is provided in the original document.

🔗 References

J. Kanerva, F. Ginter, LH. Chang, I. Rastas, V. Skantsi, J. Kilpeläinen, HM. Kupari, J. Saarni, M. Sevón, and O. Tarkka. Finnish Paraphrase Corpus. In NoDaLiDa 2021, 2021.
N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP, pages 3982–3992, 2019.
A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076, 2019.

📝 Citing & Authors

While the publication is being drafted, please cite this page.
