🚀 NoticIA-7B: A Model for Clickbait Article Summarization in Spanish
NoticIA-7B is a 7B parameter model trained on Spanish clickbait news. It can generate concise summaries for clickbait articles, helping users quickly understand the real content behind the sensational headlines.
- 📖 Model card in Spanish: https://huggingface.co/somosnlp/NoticIA-7B/blob/main/README_es.md
✨ Features
- Clickbait Summarization: Generates single-sentence summaries of clickbait articles that reveal the truth behind the headline.
- Research-Oriented: Suited to scientific research, especially for comparing task-specific models against instruction-tuned models in zero-shot settings.
📦 Installation
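No separate installation steps are provided. The usage examples below depend on torch, transformers, bitsandbytes, newspaper3k, and datasets, each installable with pip as noted in the import comments.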
💻 Usage Examples
Basic Usage
Summarizing a clickbait article from the web
import torch  # pip install torch
from newspaper import Article  # pip3 install newspaper3k
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig  # pip install transformers
from transformers import BitsAndBytesConfig  # pip install bitsandbytes

# Download and parse the article to obtain its headline and body text.
article_url = "https://www.huffingtonpost.es/virales/le-compra-abrigo-abuela-97nos-reaccion-fantasia.html"
article = Article(article_url)
article.download()
article.parse()
headline = article.title
body = article.text
def build_prompt(
    headline: str,
    body: str,
) -> str:
    """
    Generate the prompt for the model.

    Args:
        headline (`str`):
            The headline of the article.
        body (`str`):
            The body of the article.

    Returns:
        `str`: The formatted prompt.
    """
    return (
        f"Ahora eres una Inteligencia Artificial experta en desmontar titulares sensacionalistas o clickbait. "
        f"Tu tarea consiste en analizar noticias con titulares sensacionalistas y "
        f"generar un resumen de una sola frase que revele la verdad detrás del titular.\n"
        f"Este es el titular de la noticia: {headline}\n"
        f"El titular plantea una pregunta o proporciona información incompleta. "
        f"Debes buscar en el cuerpo de la noticia una frase que responda lo que se sugiere en el título. "
        f"Siempre que puedas cita el texto original, especialmente si se trata de una frase que alguien ha dicho. "
        f"Si citas una frase que alguien ha dicho, usa comillas para indicar que es una cita. "
        f"Usa siempre las mínimas palabras posibles. No es necesario que la respuesta sea una oración completa. "
        f"Puede ser sólo el foco de la pregunta. "
        f"Recuerda responder siempre en Español.\n"
        f"Este es el cuerpo de la noticia:\n"
        f"{body}\n"
    )


# The function is named build_prompt so that the resulting string does not shadow it.
prompt = build_prompt(headline=headline, body=body)
tokenizer = AutoTokenizer.from_pretrained("somosnlp/NoticIA-7B")

# Load the weights in 4-bit precision to reduce memory use.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "somosnlp/NoticIA-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)

# Wrap the prompt in the model's chat template.
formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(
    [formatted_prompt], return_tensors="pt", add_special_tokens=False
)

# Greedy decoding, capped at 64 new tokens.
model_output = model.generate(
    **model_inputs.to(model.device),
    generation_config=GenerationConfig(
        max_new_tokens=64,
        min_new_tokens=1,
        do_sample=False,
        num_beams=1,
        use_cache=True,
    ),
)

summary = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]
print(summary.strip().split("\n")[-1])  # Get only the summary, without the prompt.
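The final line of the example keeps only the last line of the decoded text, because the decoded output also contains the prompt. The sketch below isolates that step so its behaviour is easier to see, and shows one possible alternative that drops the prompt by token count instead. The helper names `extract_summary` and `strip_prompt_tokens`, the sample text, and the toy token ids are all hypothetical and are not part of the NoticIA code.

```python
def extract_summary(decoded: str) -> str:
    """Return the last line of the decoded text, as the example above does."""
    return decoded.strip().split("\n")[-1]


def strip_prompt_tokens(output_ids: list[int], prompt_ids: list[int]) -> list[int]:
    """Alternative: drop the prompt by length and keep only the newly generated ids."""
    return output_ids[len(prompt_ids):]


# Hypothetical decoded text: prompt lines followed by the generated answer.
decoded = "Este es el titular de la noticia: ...\nEste es el cuerpo de la noticia:\n...\nRespuesta generada\n"
print(extract_summary(decoded))

# Toy token ids (made up) standing in for real tokenizer output.
prompt_ids = [1, 15, 27, 42]
output_ids = prompt_ids + [7, 8, 9]
print(strip_prompt_tokens(output_ids, prompt_ids))
```

One consequence of the last-line approach is that a generated answer spanning several lines would be cut down to its final line. Slicing by prompt length avoids that, assuming the generated sequence begins with the prompt tokens.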
Performing inference on the NoticIA dataset
import torch  # pip install torch
from datasets import load_dataset  # pip install datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig  # pip install transformers
from transformers import BitsAndBytesConfig  # pip install bitsandbytes

# Load the test split of the instruction-formatted NoticIA dataset.
dataset = load_dataset("somosnlp/NoticIA-it", split="test")

tokenizer = AutoTokenizer.from_pretrained("somosnlp/NoticIA-7B")

# Load the weights in 4-bit precision to reduce memory use.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "somosnlp/NoticIA-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)

# The "pregunta" field of each example already holds the full prompt.
formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": dataset[0]["pregunta"]}],
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(
    [formatted_prompt], return_tensors="pt", add_special_tokens=False
)

# Greedy decoding, capped at 64 new tokens.
model_output = model.generate(
    **model_inputs.to(model.device),
    generation_config=GenerationConfig(
        max_new_tokens=64,
        min_new_tokens=1,
        do_sample=False,
        num_beams=1,
        use_cache=True,
    ),
)

summary = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]
print(summary.strip().split("\n")[-1])  # Get only the summary, without the prompt.
📚 Documentation
Model Details
Model Description
A clickbait article attracts readers through curiosity. Its headline poses a question or makes an incomplete, sensationalist, exaggerated, or misleading statement. The answer usually appears at the end after a lot of irrelevant content. The goal is to get users to view more ads. Clickbait articles are of low quality and offer little value.
NoticIA-7B is a 7B parameter model trained on the NoticIA-it dataset. It can generate high-quality summaries for clickbait articles.
- Developed by: [Iker García-Ferrero](https://ikergarcia1996.github.io/Iker-Garcia-Ferrero/), [Begoña Altuna](https://www.linkedin.com/in/bego%C3%B1a-altuna-78014139)
- Funded by: SomosNLP, HuggingFace, HiTZ Zentroa
- Model type: Language model, instruction tuned
- Language(s): es-ES
- License: apache-2.0
- Fine-tuned from model: [openchat/openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106)
- Dataset used: https://huggingface.co/datasets/somosnlp/NoticIA-it
Model Sources
- 💻 Repository: https://github.com/ikergarcia1996/NoticIA
- 📖 Paper: NoticIA: A Clickbait Article Summarization Dataset in Spanish
- 🤖 Dataset and Pre-Trained Models: [https://huggingface.co/collections/Iker/noticia-and-clickbaitfighter-65fdb2f80c34d7c063d3e48e](https://huggingface.co/collections/Iker/noticia-and-clickbaitfighter-65fdb2f80c34d7c063d3e48e)
- 🔌 Demo: https://huggingface.co/spaces/somosnlp/NoticIA-demo
- ▶️ Video presentation (Spanish): https://youtu.be/xc60K_NzUgk?si=QMqk6OzQZfKP1EUS
- 🐱💻 Hackathon #Somos600M: https://somosnlp.org/hackathon
Uses
Direct Use
- 📖 Summarization of clickbait articles.
- 📈 Evaluation of language models in Spanish.
- 📚 Development of new academic resources (e.g., synthetic data generation).
- 🎓 Any other academic research purpose.
Out-of-Scope Use
The use of this model for any action that may harm the legitimacy or economic viability of legitimate and professional media outlets is prohibited.
Bias, Risks, and Limitations
The model was trained mainly on news from Spain, and the data annotators are also from Spain. It is therefore expected to perform well on European Spanish, but its performance on Latin American news or on other languages is not guaranteed.
Training Details
Training Data
A clickbait article attracts readers by arousing curiosity. Its headline often poses a question or makes an incomplete, sensational, exaggerated, or misleading statement. The answer to the question in the headline usually comes at the end, after a large amount of irrelevant content. The aim is to get users to view more ads. Clickbait articles are generally of low quality and offer little value beyond initial curiosity. This phenomenon undermines public trust in news sources and affects the advertising revenue of legitimate content creators.
The model is trained on [NoticIA](https://huggingface.co/datasets/somosnlp/NoticIA-it), a dataset of 850 Spanish clickbait news articles, each paired with a high-quality, single-sentence, human-written summary.
Training Procedure
A custom training and annotation library (https://github.com/ikergarcia1996/NoticIA) was developed for this project. It builds on 🤗 Transformers, 🤗 PEFT, Bitsandbytes, and DeepSpeed.
For the hackathon, a 7-billion-parameter model was trained. With 4-bit quantization, a model of this size can run on consumer hardware. After evaluating many LLMs, [openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106) was chosen for its strong performance without the need for further pretraining. To limit the impact on the model's prior knowledge, training used Low-Rank Adaptation (LoRA).
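As a rough illustration of why 4-bit quantization and LoRA matter at this scale, the sketch below estimates the memory needed for the weights alone and the number of trainable parameters LoRA adds to a single weight matrix. It is back-of-the-envelope arithmetic, not a measurement: the nominal count of 7e9 parameters, the 4096 x 4096 matrix shape, and the rank of 16 are assumptions chosen for illustration and are not taken from the NoticIA training configuration.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA replaces the update to a d_out x d_in matrix with two factors:
    one of shape d_out x rank and one of shape rank x d_in."""
    return rank * (d_out + d_in)


NUM_PARAMS = 7e9  # nominal "7B"; the exact parameter count is an assumption

bf16 = weight_memory_gib(NUM_PARAMS, 16)
int4 = weight_memory_gib(NUM_PARAMS, 4)
print(f"bfloat16 weights: ~{bf16:.1f} GiB")
print(f"4-bit weights:    ~{int4:.1f} GiB")

# Hypothetical layer shape and rank, for illustration only.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=16)
print(f"LoRA trains {lora:,} of {full:,} parameters in this matrix ({lora / full:.2%})")
```

The estimate ignores activations, the key-value cache, and the small overhead of quantization constants, so real memory use will be higher.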
Training Hyperparameters
| Property | Details |
|---|---|
| Training regime | bfloat16 |
| Training method | LoRA + DeepSpeed ZeRO-3 |
| Batch size | 64 |
| Sequence length | 8192 |
| Epochs | 3 |
| Optimizer | AdamW |
| Software | Hugging Face, PEFT, PyTorch, DeepSpeed |
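To give a sense of how short this training run is, the batch size and epoch count in the table above can be turned into an approximate number of optimizer steps. The sketch below treats all 850 articles as training data and keeps the last partial batch. Both are assumptions: part of the dataset is held out as a test split, so the figure is an upper bound rather than the authors' actual step count.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps when the final partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 850 articles (upper bound on the training set), batch size 64, 3 epochs.
print(optimizer_steps(num_examples=850, batch_size=64, epochs=3))  # 42
```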
📄 License
The model is licensed under the Apache 2.0 license.






