Keyword Extraction Model
This keyword extraction model is built on a fine-tuned Flan-T5 architecture. It extracts key phrases from paragraphs, helping users quickly summarize text, generate tags, and identify main themes.
Quick Start
The model is a fine-tuned version of the [Flan-T5 small](https://huggingface.co/google/flan-t5-small) model, designed specifically for extracting keywords from paragraphs. It uses the T5 architecture to identify and output key phrases that capture the essence of the input text.
Features
- Text Summarization: Summarize long texts by extracting key phrases.
- Tag Generation: Generate tags for articles or blog posts.
- Theme Identification: Identify main themes in documents.
Installation
The model runs through the transformers library, which you can install with the command below. The usage example requests PyTorch tensors (return_tensors="pt"), so PyTorch must be installed as well.
pip install transformers
Usage Examples
Basic Usage
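from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "agentlans/flan-t5-small-keywords"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Your paragraph here..."
# Truncate the input to the 512-token limit
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=512)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Keywords are separated by '||'; strip whitespace and drop duplicates and empty fragments
keywords = list({k.strip() for k in decoded_output.split("||") if k.strip()})
print(keywords)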
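The post-processing step above can be factored into a small helper. The following is a minimal sketch, not part of the model's published code: the function name `parse_keywords` and the sample string are hypothetical, and the only thing taken from the usage example is the `||` separator. Unlike `set`, it keeps keywords in the order the model produced them.

```python
def parse_keywords(decoded_output: str, separator: str = "||") -> list[str]:
    """Split raw model output into a de-duplicated keyword list, keeping first-seen order."""
    seen = {}
    for part in decoded_output.split(separator):
        keyword = part.strip()
        if keyword:  # skip empty fragments, e.g. from a trailing separator
            seen.setdefault(keyword, None)
    return list(seen)


# Hypothetical decoded string, for illustration only
raw = "quaint bookstore|| curated collections ||quaint bookstore||"
print(parse_keywords(raw))  # ['quaint bookstore', 'curated collections']
```

Preserving order can be useful if earlier keywords tend to be more relevant, although the source does not state that the model ranks its output.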
Example Input and Output
Example input paragraph:
In the heart of the bustling city, a hidden gem awaits discovery: a quaint little bookstore that seems to have escaped the relentless march of time. As you step inside, the scent of aged paper and rich coffee envelops you, creating an inviting atmosphere that beckons you to explore its shelves. Each corner is adorned with carefully curated collections, from classic literature to contemporary bestsellers, inviting readers of all tastes to lose themselves in the pages of a good book. The soft glow of warm lighting casts a cozy ambiance, while the gentle hum of conversation among fellow book lovers adds to the charm. This bookstore is not just a place to buy books; it's a sanctuary for those seeking solace, inspiration, and a sense of community in the fast-paced world outside.
Example output keywords:
['old paper coffee scent', 'cosy hum of conversation', 'quaint bookstore', 'community in the fast-paced world', 'solace inspiration', 'curated collections']
Documentation
Intended Uses & Limitations
Intended Uses:
- Quick summarization of long paragraphs.
- Generating metadata for content management systems.
- Assisting in SEO keyword identification.
Limitations:
- The model may sometimes generate irrelevant keywords.
- Performance may vary depending on the length and complexity of the input text.
- For best results, use long, clean texts.
- Input length is limited to 512 tokens, a constraint of the Flan-T5 architecture; longer inputs must be truncated or split.
- The model is trained on English text and may not perform well on other languages.
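One possible way to work within the 512-token limit is to split a long document into chunks and extract keywords from each chunk separately. The sketch below is an illustration under stated assumptions, not a method described by the model authors: `chunk_sentences` and `count_tokens` are hypothetical names, the input is assumed to be already split into sentences, and a word counter stands in for the real tokenizer so the example runs without downloading the model.

```python
from typing import Callable, List


def chunk_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack sentences into chunks whose token count stays within max_tokens."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the chunk that is already full
            current = sentence      # start a new chunk with this sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Stand-in counter for illustration: whitespace-separated words.
# With the real tokenizer, len(tokenizer(text)["input_ids"]) would track the actual limit.
word_count = lambda text: len(text.split())
print(chunk_sentences(["one two three.", "four five.", "six."], word_count, max_tokens=5))
# ['one two three. four five.', 'six.']
```

A single sentence longer than `max_tokens` still ends up in its own oversized chunk and would be truncated by the tokenizer. Because the model works best on long, clean texts, very small chunks may give weaker keywords than one well-formed paragraph.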
Training and Evaluation
The model was fine-tuned on a dataset of English Wikipedia paragraphs and their corresponding keywords, which covers a diverse range of topics to ensure broad applicability.
Limitations and Bias
This model has been trained on English Wikipedia paragraphs, which may introduce biases. Users should be aware that the keywords generated might reflect these biases and should use the output judiciously.
Ethical Considerations
When using this model, consider the potential impact of automated keyword extraction on content creation and SEO practices. Ensure that the use of this model complies with relevant guidelines and does not contribute to the creation of misleading or spammy content.
Technical Details
Training Details
- Training Data: dataset of Wikipedia paragraphs and keywords
- Training Procedure: Fine-tuning of google/flan-t5-small
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10.0
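To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at a given step under linear warmup followed by linear decay to zero. Several values are assumptions rather than facts from this card: zero warmup steps (no warmup value is listed above), a single device with no gradient accumulation, and a hypothetical dataset size used only to derive a step count. The function name `linear_lr` is also hypothetical.

```python
import math


def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples = 8_000  # hypothetical; the real dataset size is not stated here
batch_size, num_epochs = 8, 10  # from the hyperparameter list above
total_steps = math.ceil(num_examples / batch_size) * num_epochs

# Learning rate at the start, midpoint, and end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate starts at 5e-05, reaches half that value midway through training, and falls to zero at the final step.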
Framework versions
- Transformers 4.45.1
- Pytorch 2.4.1+cu121
- Datasets 3.0.1
- Tokenizers 0.20.0
License
This project is licensed under the MIT license.
Information Table
| Property | Details |
|---|---|
| Model Type | Fine-tuned Flan-T5 small for keyword extraction |
| Training Data | Dataset of English Wikipedia paragraphs and their corresponding keywords |
| Base Model | google/flan-t5-small |
| Library Name | transformers |
| Tags | keyword-extraction, text-summarization, flan-t5 |
| License | MIT |
| Datasets | agentlans/wikipedia-paragraph-keywords |