turemb_512: Open-Source Sentence and Paragraph Embedding Model for Clustering and Semantic Search

Turemb 512

Developed by cenfis

This is a model based on sentence-transformers that maps sentences and paragraphs into a 512-dimensional dense vector space, suitable for tasks like clustering or semantic search.

Text Embedding

Transformers

Release date: 11/16/2023

Model Overview

This model is specifically designed for vectorized representation of sentences and paragraphs, generating 512-dimensional dense vectors that can be used for natural language processing tasks such as text similarity calculation, semantic search, and clustering analysis.

Model Features

High-Dimensional Vector Representation

Generates 512-dimensional dense vectors capable of capturing rich semantic information.

Sentence-Level Semantic Understanding

Targets sentence- and paragraph-level text: token embeddings are mean-pooled into a single vector per input.

Efficient Feature Extraction

Quickly converts text into vector representations for subsequent processing and analysis.

Model Capabilities

Sentence Vectorization

Semantic Similarity Calculation

Text Clustering

Semantic Search

Use Cases

Information Retrieval

Semantic Search Engine

Build a search engine based on semantics rather than keywords.

Improves the relevance and accuracy of search results.
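
As a rough illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query vector. The function name `cosine_search` and the toy 4-dimensional vectors are assumptions made for this example; in practice the vectors would be the 512-dimensional embeddings produced by the model.

```python
import numpy as np

def cosine_search(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k documents most similar to the query."""
    # L2-normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending score and keep the best top_k.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors standing in for real embeddings (hypothetical data).
doc_vecs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
query_vec = np.array([1.0, 0.0, 0.0, 0.0])

results = cosine_search(query_vec, doc_vecs, top_k=2)
print(results)  # documents 0 and 2 rank highest
```

For large collections a brute-force scan like this is usually replaced by an approximate nearest-neighbour index, but the scoring idea stays the same.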

Text Analysis

Document Clustering

Automatically group documents with similar content.

Enables automatic classification and organization of documents.
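
One simple way to group documents by their embeddings is sketched below. It is a greedy threshold clustering chosen for brevity, not a method the model card prescribes; the function name `greedy_cluster`, the threshold of 0.8, and the toy vectors are all assumptions for illustration.

```python
import numpy as np

def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed has cosine similarity >= threshold."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    seeds = []   # index of the vector that started each cluster
    labels = []  # cluster id for every input vector
    for i, v in enumerate(unit):
        for cluster_id, seed in enumerate(seeds):
            if float(unit[seed] @ v) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(i)
            labels.append(len(seeds) - 1)
    return labels

# Toy vectors standing in for document embeddings (hypothetical data).
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
])
labels = greedy_cluster(vectors, threshold=0.8)
print(labels)  # [0, 0, 1, 1]
```

The result depends on input order and on the threshold, so a standard algorithm such as k-means or agglomerative clustering is likely a better fit for real workloads.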

Recommendation System

🚀 turemb_512

This is a sentence-transformers model that maps sentences and paragraphs to a 512-dimensional dense vector space. It can be used for tasks such as clustering or semantic search.

✨ Features

Maps sentences and paragraphs to a 512-dimensional dense vector space.
Suitable for clustering and semantic search tasks.

📦 Installation

The model is easiest to use with sentence-transformers, which you can install or upgrade with the following command:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

If you have sentence-transformers installed, you can use the model like this:

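from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# '{MODEL_NAME}' is an unfilled template placeholder; replace it with the model's Hugging Face Hub identifier.
model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)  # one 512-dimensional vector per sentence
print(embeddings)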

Advanced Usage

Without sentence-transformers, you can use the model by passing your input through the transformer model and then applying the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: average the token embeddings, using the attention mask so padding tokens are ignored
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
# ('{MODEL_NAME}' is an unfilled template placeholder; replace it with the model's Hub identifier)
tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
model = AutoModel.from_pretrained('{MODEL_NAME}')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
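
To see why the attention mask matters in the pooling step above, the following sketch repeats the same computation in NumPy on a tiny hand-made batch. The name `mean_pooling_np` and the toy values are assumptions for illustration; the first sequence ends in a padding token with deliberately large values, which the mask should exclude from the average.

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Masked mean over the sequence axis.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third token is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = mean_pooling_np(token_embeddings, attention_mask)
print(pooled)  # rows are [2, 3] and [2, 2]
```

Without the mask, the first row would be pulled toward the padding values instead of averaging only the two real tokens.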

📚 Documentation

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 14435 with parameters:

{'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
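
The loss configuration above can be illustrated with a small NumPy sketch of the idea behind a multiple-negatives ranking loss: each anchor should score highest against its own positive, with the other positives in the batch acting as negatives. This is a simplified re-implementation for intuition, not the library's code; the function name `mnr_loss` and the toy vectors are assumptions, while `scale=20.0` and cosine similarity come from the parameters listed above.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities, with the matching pair as the target."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                        # (batch, batch) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct positive for anchor i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = np.array([[1.0, 0.0], [0.0, 1.0]])   # positives match their anchors
swapped = aligned[::-1]                         # positives paired with the wrong anchors

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)  # near 0 when aligned, near 20 when swapped
```

The scale factor sharpens the softmax: with cosine similarities bounded in [-1, 1], multiplying by 20 lets a well-separated pair reach a loss close to zero.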

Parameters of the fit()-Method:

{
    "epochs": 12,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 0.0001
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 866,
    "weight_decay": 0.005
}
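
The numbers above imply a learning-rate schedule that can be sketched in a few lines. This assumes that, with `steps_per_epoch` unset, each epoch runs through all 14435 batches, giving 14435 × 12 = 173220 optimizer steps in total, and that WarmupLinear means a linear ramp up to the peak rate followed by a linear decay to zero. The function name `warmup_linear_lr` is hypothetical.

```python
TOTAL_STEPS = 14435 * 12   # batches per epoch x epochs = 173220 (assumed)
WARMUP_STEPS = 866         # from the fit() parameters
PEAK_LR = 1e-4             # optimizer_params["lr"]

def warmup_linear_lr(step):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Ramp up from 0 to PEAK_LR over the warmup phase.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from PEAK_LR to 0 over the remaining steps.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

for step in (0, 433, 866, 100000, TOTAL_STEPS):
    print(step, warmup_linear_lr(step))
```

Under these assumptions the warmup covers roughly 0.5% of training (866 of 173220 steps).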

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': None, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 512, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 Citation

The citations for this model are as follows:

@article{,
  title={Translation Aligned Sentence Embeddings for Turkish Language},
  author={Unlu, Eren and Ciftci, Unver},
  journal={arXiv preprint arXiv:2311.09748},
  year={2023}
}

@article{chung2022scaling,
  title={Scaling instruction-finetuned language models},
  author={Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Yunxuan and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and others},
  journal={arXiv preprint arXiv:2210.11416},
  year={2022}
}

@article{budur2020data,
  title={Data and representation for turkish natural language inference},
  author={Budur, Emrah and {\"O}z{\c{c}}elik, R{\i}za and G{\"u}ng{\"o}r, Tunga and Potts, Christopher},
  journal={arXiv preprint arXiv:2004.14963},
  year={2020}
}

@article{tiedemann2020tatoeba,
  title={The Tatoeba Translation Challenge--Realistic Data Sets for Low Resource and Multilingual MT},
  author={Tiedemann, J{\"o}rg},
  journal={arXiv preprint arXiv:2010.06354},
  year={2020}
}

@article{unal2016tasviret,
  title={Tasviret: G{\"o}r{\"u}nt{\"u}lerden otomatik t{\"u}rk{\c{c}}e a{\c{c}}{\i}klama olusturma I{\c{c}}in bir denekta{\c{c}}{\i} veri k{\"u}mesi (TasvirEt: A benchmark dataset for automatic Turkish description generation from images)},
  author={Unal, Mesut Erhan and Citamak, Begum and Yagcioglu, Semih and Erdem, Aykut and Erdem, Erkut and Cinbis, Nazli Ikizler and Cakici, Ruket},
  journal={IEEE Sinyal Isleme ve Iletisim Uygulamalar{\i} Kurultay{\i} (SIU 2016)},
  year={2016}
}
