🚀 Keyword Extraction from Short Texts with T5
Our vlT5 model is a keyword generation model based on an encoder-decoder architecture using Transformer blocks. It can generate precise keyphrases for scientific articles based on their abstracts and titles.
🚀 Quick Start
The vlT5 model extracts keywords from short texts. Trained on a corpus of scientific-article abstracts and titles, it predicts a concise set of keyphrases that summarize the input.
✨ Features
- Transferability: The vlT5 model works well across all domains and text types.
- Dual-mode operation: It can work both extractively and abstractively.
- Multilingual support: While trained on Polish and English, it performs relatively well with other languages too.
📦 Installation
There are no model-specific installation steps; the model is loaded directly from the Hugging Face Hub with the transformers library.
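A minimal setup sketch, assuming a PyTorch backend (the T5Tokenizer also needs the sentencepiece package):

```bash
pip install transformers sentencepiece torch
```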
💻 Usage Examples
Basic Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the keyword-generation model and its tokenizer from the Hugging Face Hub.
model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

# Inputs must be prefixed with the keyword-generation task prompt.
task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    input_sequences = [task_prefix + sample]
    input_ids = tokenizer(
        input_sequences, return_tensors="pt", truncation=True
    ).input_ids
    # Beam search with trigram-repetition blocking gave the best results (see Inference below).
    output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)
```
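The loop above processes one text at a time. The tokenizer can also pad a whole batch so that all inputs go through a single generate call; a minimal sketch (the padding and attention_mask handling is standard transformers usage, not something prescribed by this model card):

```python
# Tokenize all prefixed inputs together; padding aligns them to a common length.
encoded = tokenizer(
    [task_prefix + sample for sample in inputs],
    return_tensors="pt",
    truncation=True,
    padding=True,
)
outputs = model.generate(
    input_ids=encoded.input_ids,
    attention_mask=encoded.attention_mask,
    no_repeat_ngram_size=3,
    num_beams=4,
)
for sample, output in zip(inputs, outputs):
    print(sample, "\n --->", tokenizer.decode(output, skip_special_tokens=True))
```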
📚 Documentation
vlT5
The biggest advantage of the vlT5 model is its transferability: it performs well across domains and text types. However, it works best on inputs whose length and number of keywords are similar to those of its training data. Longer texts need to be split into smaller chunks before being fed into the model, as sketched below.
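The model card does not prescribe a chunking strategy. A minimal sketch, reusing model, tokenizer, and task_prefix from the Quick Start example; the chunk_text helper, the 200-word budget, and the comma-splitting of predictions are illustrative assumptions:

```python
def chunk_text(text, max_words=200):
    """Greedily pack sentences into chunks of at most max_words words each."""
    chunks, current, count = [], [], 0
    for sentence in text.split(". "):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(". ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(". ".join(current))
    return chunks

def keywords_for_long_text(text):
    """Generate keywords per chunk and merge the comma-separated predictions."""
    keywords = set()
    for chunk in chunk_text(text):
        ids = tokenizer(task_prefix + chunk, return_tensors="pt", truncation=True).input_ids
        output = model.generate(ids, no_repeat_ngram_size=3, num_beams=4)
        decoded = tokenizer.decode(output[0], skip_special_tokens=True)
        # Assumption: the model emits keywords as a comma-separated list.
        keywords.update(k.strip() for k in decoded.split(","))
    return keywords
```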
Overview
Corpus
The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project.
| Domains | Documents | With keywords |
|---|---|---|
| Engineering and technical sciences | 58 974 | 57 165 |
| Social sciences | 58 166 | 41 799 |
| Agricultural sciences | 29 811 | 15 492 |
| Humanities | 22 755 | 11 497 |
| Exact and natural sciences | 13 579 | 9 185 |
| Humanities, Social sciences | 12 809 | 7 063 |
| Medical and health sciences | 6 030 | 3 913 |
| Medical and health sciences, Social sciences | 828 | 571 |
| Humanities, Medical and health sciences, Social sciences | 601 | 455 |
| Engineering and technical sciences, Humanities | 312 | 312 |
Tokenizer
As in the original plT5 implementation, the training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
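For illustration, the subword segmentation can be inspected directly through the tokenizer loaded in the Quick Start example (a small sketch; the sample phrase is arbitrary):

```python
# Show how the unigram sentencepiece model segments text into subword pieces.
print(tokenizer.tokenize("Keyword extraction from short texts"))
print(tokenizer.vocab_size)  # ~50k subword tokens
```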
Inference
Our experiments showed that the best generation results were achieved with no_repeat_ngram_size=3 and num_beams=4.
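With beam search it is also possible to return several candidate keyword sets for the same input by keeping more than one beam. A small sketch reusing input_ids from the Quick Start loop; num_return_sequences is standard generate usage, not part of the original example:

```python
# Keep all four beams instead of only the best one to get alternative keyword sets.
output = model.generate(
    input_ids,
    no_repeat_ngram_size=3,
    num_beams=4,
    num_return_sequences=4,
)
for candidate in tokenizer.batch_decode(output, skip_special_tokens=True):
    print(candidate)
```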
Results
| Method | Rank | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 |
|---|---|---|---|---|---|---|---|
| extremeText | 1 | 0.175 | 0.038 | 0.063 | 0.007 | 0.004 | 0.005 |
| | 3 | 0.117 | 0.077 | 0.093 | 0.011 | 0.011 | 0.011 |
| | 5 | 0.090 | 0.099 | 0.094 | 0.013 | 0.016 | 0.015 |
| | 10 | 0.060 | 0.131 | 0.082 | 0.015 | 0.025 | 0.019 |
| vlT5kw | 1 | 0.345 | 0.076 | 0.124 | 0.054 | 0.047 | 0.050 |
| | 3 | 0.328 | 0.212 | 0.257 | 0.133 | 0.127 | 0.129 |
| | 5 | 0.318 | 0.237 | 0.271 | 0.143 | 0.140 | 0.141 |
| KeyBERT | 1 | 0.030 | 0.007 | 0.011 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.015 | 0.010 | 0.012 | 0.006 | 0.004 | 0.005 |
| | 5 | 0.011 | 0.012 | 0.011 | 0.006 | 0.005 | 0.005 |
| TermoPL | 1 | 0.118 | 0.026 | 0.043 | 0.004 | 0.003 | 0.003 |
| | 3 | 0.070 | 0.046 | 0.056 | 0.006 | 0.005 | 0.006 |
| | 5 | 0.051 | 0.056 | 0.053 | 0.007 | 0.007 | 0.007 |
| | all | 0.025 | 0.339 | 0.047 | 0.017 | 0.030 | 0.022 |
| extremeText | 1 | 0.210 | 0.077 | 0.112 | 0.037 | 0.017 | 0.023 |
| | 3 | 0.139 | 0.152 | 0.145 | 0.045 | 0.042 | 0.043 |
| | 5 | 0.107 | 0.196 | 0.139 | 0.049 | 0.063 | 0.055 |
| | 10 | 0.072 | 0.262 | 0.112 | 0.041 | 0.098 | 0.058 |
| vlT5kw | 1 | 0.377 | 0.138 | 0.202 | 0.119 | 0.071 | 0.089 |
| | 3 | 0.361 | 0.301 | 0.328 | 0.185 | 0.147 | 0.164 |
| | 5 | 0.357 | 0.316 | 0.335 | 0.188 | 0.153 | 0.169 |
| KeyBERT | 1 | 0.018 | 0.007 | 0.010 | 0.003 | 0.001 | 0.001 |
| | 3 | 0.009 | 0.010 | 0.009 | 0.004 | 0.001 | 0.002 |
| | 5 | 0.007 | 0.012 | 0.009 | 0.004 | 0.001 | 0.002 |
| TermoPL | 1 | 0.076 | 0.028 | 0.041 | 0.002 | 0.001 | 0.001 |
| | 3 | 0.046 | 0.051 | 0.048 | 0.003 | 0.001 | 0.002 |
| | 5 | 0.033 | 0.061 | 0.043 | 0.003 | 0.001 | 0.002 |
| | all | 0.021 | 0.457 | 0.040 | 0.004 | 0.008 | 0.005 |
📄 License
CC BY 4.0
🔧 Technical Details
The model uses an encoder-decoder architecture with Transformer blocks. The training dataset was tokenized using a sentencepiece unigram model with a vocabulary size of 50k tokens.
📖 Citation
If you use this model, please cite the following paper:
[Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42](https://link.springer.com/chapter/10.1007/978-3-031-36021-3_42)
OR
Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk, Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer, ACIIDS 2022
👥 Authors
The model was trained by the NLP Research Team at Voicelab.ai, who can be contacted through the Voicelab.ai website.