DeCLUTR-base Open Source Sentence Encoder Model - Generate High-Quality Text Representations for Free

Declutr Base

Developed by johngiorgi

DeCLUTR-base is a universal sentence encoder model trained through deep contrastive learning for generating high-quality text representations.

Text Embedding, English. Open Source License: Apache-2.0. Tags: #Unsupervised sentence embedding, #Text similarity calculation, #Contrastive learning

Downloads 99

Release Time: 3/2/2022

Model Overview

This model is designed as a universal sentence encoder, capable of converting text into high-dimensional vector representations for tasks such as calculating sentence similarity.

Model Features

Unsupervised learning: Trained through deep contrastive learning without requiring labeled data.
Universal sentence encoding: Capable of converting any text into high-quality vector representations.
Efficient similarity calculation: The generated embedding vectors can be used for efficient semantic similarity calculations.

Model Capabilities

Text feature extraction

Sentence similarity calculation

Semantic search

Use Cases

Information retrieval

Semantic search

Improving search results by calculating semantic similarity between queries and documents

Enhancing the relevance of search results

Text analysis

Document clustering

Automatically grouping documents based on semantic similarity

Discovering thematic structures in document collections
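
The document clustering use case above can be sketched in a few lines. This is a minimal illustration, not a method prescribed by the DeCLUTR authors: it uses a simple greedy threshold grouping, and the toy 2-dimensional vectors stand in for real sentence embeddings. The helper names `cosine_similarity` and `group_by_similarity`, and the threshold of 0.8, are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def group_by_similarity(vectors, threshold=0.8):
    """Greedy grouping (hypothetical helper): each vector joins the first
    cluster whose first member is at least `threshold` similar to it,
    otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            representative = vectors[cluster[0]]
            if cosine_similarity(vec, representative) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough
            clusters.append([i])
    return clusters


# Toy stand-ins for document embeddings
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [0.1, 0.9],
]
clusters = group_by_similarity(vectors, threshold=0.8)
print(clusters)  # [[0, 2], [1, 3]]
```

For real document collections, a standard clustering algorithm such as k-means applied to the embeddings would likely be a better choice; the greedy grouping here only shows how similarity scores translate into groups.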

🚀 DeCLUTR-base

The "DeCLUTR-base" model is designed for sentence similarity tasks, offering a powerful solution for encoding sentences and computing semantic similarities.

🚀 Quick Start

The "DeCLUTR-base" model comes from the paper DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations (Giorgi et al., 2021). It is intended to be used as a universal sentence encoder, similar to Google's Universal Sentence Encoder or Sentence Transformers.

✨ Features

Universal Sentence Encoder: Can be used as a general-purpose sentence encoder for various natural language processing tasks.
Semantic Similarity Computation: Capable of computing semantic similarities between sentences effectively.

📦 Installation

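The original model card gives no installation steps. The usage examples below import sentence_transformers, transformers, torch, and scipy, so those packages need to be installed first (they are published on PyPI as sentence-transformers, transformers, torch, and scipy).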

💻 Usage Examples

Basic Usage

With SentenceTransformers

from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("johngiorgi/declutr-base")

# Prepare some text to embed
texts = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
]

# Embed the text
embeddings = model.encode(texts)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
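
The same cosine similarity can be used to rank several documents against a query, which is the core of the semantic search use case. The sketch below is a minimal illustration under stated assumptions: the toy 3-dimensional vectors stand in for the output of `model.encode(...)`, and `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of SentenceTransformers or SciPy.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for a query embedding and three document embeddings
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]
ranking = rank_documents(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

With real embeddings, `query` would be one encoded query string and `docs` the encoded candidate texts; the ranking logic stays the same.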

With 🤗 Transformers

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-base")
model = AutoModel.from_pretrained("johngiorgi/declutr-base")

# Prepare some text to embed
text = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

# Embed the text
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Mean pool the token-level embeddings to get sentence-level embeddings,
# using the attention mask so that padding tokens do not contribute
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
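
The mean pooling step above is compact, so a small worked example may help show what it computes. The sketch below reproduces the idea for a single sentence in plain Python, without PyTorch: token vectors at padded positions (mask value 0) are excluded from the average. The helper name `masked_mean_pool` and the toy values are assumptions made for illustration, and the `max(count, 1)` guard plays a role similar to the `torch.clamp` call.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    count = max(count, 1)  # avoid division by zero if every position is padding
    return [x / count for x in total]


# Two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the reason for multiplying by the attention mask before summing in the PyTorch version.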

📚 Documentation

For full details, please see our repo.

📄 License

This model is licensed under the Apache-2.0 license.

BibTeX entry and citation info

@inproceedings{giorgi-etal-2021-declutr,
    title        = {{D}e{CLUTR}: Deep Contrastive Learning for Unsupervised Textual Representations},
    author       = {Giorgi, John  and Nitski, Osvald  and Wang, Bo  and Bader, Gary},
    year         = 2021,
    month        = aug,
    booktitle    = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    publisher    = {Association for Computational Linguistics},
    address      = {Online},
    pages        = {879--895},
    doi          = {10.18653/v1/2021.acl-long.72},
    url          = {https://aclanthology.org/2021.acl-long.72}
}

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity
Language	en
License	apache-2.0
Datasets	openwebtext
