Keyphrase-mpnet-v1 Open-source Model - Empowering Phrase Processing, Suitable for Clustering and Semantic Search Tasks

Keyphrase Mpnet V1

Developed by uclanlp

A sentence transformer model optimized for phrases, mapping phrases into a 768-dimensional dense vector space, suitable for tasks like clustering or semantic search.

Text Embedding

Transformers

#Keyphrase Embedding #Semantic Similarity Calculation #SimCSE Optimization

Downloads 4,278

Release Time: 5/8/2023

Model Overview

This model is based on sentence-transformers/all-mpnet-base-v2 and fine-tuned with the SimCSE method on about 1 million keyphrases. Its primary use is computing semantic-based evaluation metrics for keyphrase models.

Model Features

Phrase Optimization

Specifically optimized for phrases, better capturing phrase-level semantics compared to general sentence embedding models.

SimCSE Fine-tuning

Fine-tuned using the SimCSE method on 1 million keyphrase data entries to enhance semantic representation quality.

Multi-domain Applicability

Training data covers multiple domains including science, news, online forums, and web pages, ensuring broad applicability.

Model Capabilities

Phrase Vectorization

Semantic Similarity Calculation

Keyphrase Clustering

Semantic Search

Use Cases

Academic Research

Keyphrase Evaluation

Used to compute semantic-based keyphrase model evaluation metrics.

Served as the embedding model behind the semantic-based metrics in the KPEval paper.

Information Retrieval

Semantic Search

Maps query phrases and document phrases into the same vector space for similarity matching.

🚀 keyphrase-mpnet-v1

This is a sentence-transformers model specialized for phrases. It maps phrases to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search. In the original paper (KPEval), the model is used to compute semantic-based evaluation metrics for keyphrase models. It is based on sentence-transformers/all-mpnet-base-v2 and further fine-tuned on about 1 million keyphrases with SimCSE.

🚀 Quick Start

This model maps phrases to a 768-dimensional dense vector space, enabling tasks like clustering and semantic search. It is based on sentence-transformers/all-mpnet-base-v2 and fine-tuned on keyphrase data.

✨ Features

Specialized for phrases, mapping them to a 768-dimensional dense vector space.
Applicable for clustering and semantic search tasks.
Used for calculating semantic-based evaluation metrics of keyphrase models in the original paper.

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
phrases = ["information retrieval", "text mining", "natural language processing"]

model = SentenceTransformer('uclanlp/keyphrase-mpnet-v1')
embeddings = model.encode(phrases)
print(embeddings)
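
Once phrases are embedded, similarity matching reduces to comparing vectors. The sketch below ranks candidate phrases against a query by cosine similarity. It is a minimal illustration rather than part of the original model card: the helper names `cosine` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 768-dimensional rows returned by `model.encode`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Return (phrase, score) pairs sorted from most to least similar.

    `candidates` maps each phrase to its embedding vector.
    """
    scored = [(phrase, cosine(query_vec, vec)) for phrase, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): in practice these would be rows of
# model.encode(phrases), each of length 768.
candidates = {
    "text mining": [0.9, 0.1, 0.0],
    "natural language processing": [0.6, 0.6, 0.1],
    "protein folding": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of "information retrieval"

ranking = rank_by_similarity(query, candidates)
for phrase, score in ranking:
    print(f"{phrase}: {score:.3f}")
```

With real embeddings, `sentence_transformers.util.cos_sim` computes the same quantity for whole batches at once.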

Advanced Usage

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation (here, mean pooling) on top of the contextualized token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Phrases we want embeddings for
phrases = ["information retrieval", "text mining", "natural language processing"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('uclanlp/keyphrase-mpnet-v1')
model = AutoModel.from_pretrained('uclanlp/keyphrase-mpnet-v1')

# Tokenize phrases
encoded_input = tokenizer(phrases, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
phrase_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Phrase embeddings:")
print(phrase_embeddings)
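
To see why `mean_pooling` multiplies by the attention mask, it may help to work through a tiny case by hand. The sketch below is a pure-Python analogue of the same computation for a single phrase; the helper name `masked_mean` and the 2-dimensional toy vectors are illustrative assumptions, not part of the model.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                totals[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    # when every position is masked out.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Two real tokens followed by one padding token (mask 0).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

An unmasked average over all three rows would be pulled toward the padding vector, which is why shorter phrases in a padded batch need the mask to get a correct embedding.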

📚 Documentation

Training

The model is trained on phrases from four keyphrase datasets covering a wide range of domains.

Property	Details
Dataset Name	KP20k (Science, 715369 phrases), KPTimes (News, 113456 phrases), StackEx (Online Forum, 8149 phrases), OpenKP (Web, 200335 phrases), Total: 1030309 phrases
DataLoader	`torch.utils.data.dataloader.DataLoader` of length 2025 with parameters: `{'batch_size': 512, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}`
Loss	`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters: `{'scale': 20.0, 'similarity_fct': 'cos_sim'}`
Fit() - Method Parameters	`{ "epochs": 1, "evaluation_steps": 0, "evaluator": "NoneType", "max_grad_norm": 1, "optimizer_class": "<class 'torch.optim.adamw.AdamW'>", "optimizer_params": { "lr": 1e-06 }, "scheduler": "WarmupLinear", "steps_per_epoch": null, "warmup_steps": 203, "weight_decay": 0.01 }`
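
The figures in this table can be cross-checked with a little arithmetic. The sketch below sums the listed per-dataset counts and relates them to the batch size, DataLoader length, and warmup steps. Two points are assumptions rather than facts from the source: that warmup covers 10% of the training steps (rounded up), and that the DataLoader length corresponds to the number of full batches. Note that the listed counts sum to 1037309, which differs from the stated total of 1030309; the source does not say which figure is correct.

```python
import math

# Per-dataset phrase counts as listed in the table above.
phrase_counts = {
    "KP20k": 715369,
    "KPTimes": 113456,
    "StackEx": 8149,
    "OpenKP": 200335,
}
batch_size = 512

total_phrases = sum(phrase_counts.values())
full_batches, remainder = divmod(total_phrases, batch_size)

# Assumption: warmup covers 10% of the training steps, rounded up.
warmup_steps = math.ceil(full_batches / 10)

print(total_phrases)  # 1037309
print(full_batches)   # 2025
print(remainder)      # 509
print(warmup_steps)   # 203
```

Under these assumptions the summed counts reproduce both the DataLoader length of 2025 and the 203 warmup steps. How the remaining 509 phrases were handled is not stated.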

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 12, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 License

Citing & Authors

Paper: KPEval: Towards Fine-grained Semantic-based Keyphrase Evaluation

@inproceedings{wu-etal-2024-kpeval,
    title = "{KPE}val: Towards Fine-Grained Semantic-Based Keyphrase Evaluation",
    author = "Wu, Di  and
      Yin, Da  and
      Chang, Kai-Wei",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.117",
    pages = "1959--1981",
}
