NoticIA - 7B Open-Source Spanish Summarization Model - Generate Concise and High-Quality Summaries for Sensational News

NoticIA-7B

Developed by somosnlp

A 7-billion-parameter Spanish clickbait news summarization model capable of generating concise, high-quality summaries for sensational news

Text Generation

Transformers

Spanish

Open Source License: Apache-2.0

#Spanish News Summarization #Clickbait Analysis #LoRA Fine-tuning

Downloads 17

Release Time : 3/27/2024

Model Overview

Specially designed for Spanish clickbait news, this model extracts the core truth from news content by analyzing exaggerated or misleading headlines

Model Features

Spanish Clickbait Analysis

Specifically designed for Spanish clickbait news, effectively identifying exaggerated or misleading headlines

Concise Summary Generation

Generates single-sentence summaries focusing on core issues while removing irrelevant content

Original Text Handling

Prioritizes direct quotes from news content, using quotation marks to ensure accuracy

Efficient Inference

Optimized with 4-bit quantization technology for operation on consumer-grade hardware

Model Capabilities

Spanish Text Comprehension

News Content Summarization

Clickbait Content Recognition

Key Information Extraction

Use Cases

News Media

Clickbait News Analysis

Provides automatic summarization services for clickbait content to news organizations

Helps readers quickly grasp the core content of news

News Quality Assessment

Identifies exaggerated or misleading news headlines

Assists news editors in improving headline quality

Academic Research

Language Model Evaluation

Serves as a benchmark for evaluating Spanish language model performance

Synthetic Data Generation

Generates training data for other NLP tasks

🚀 NoticIA-7B: A Model for Clickbait Article Summarization in Spanish

NoticIA-7B is a 7B parameter model trained on Spanish clickbait news. It can generate concise summaries for clickbait articles, helping users quickly understand the real content behind the sensational headlines.

📖 Spanish Model Card: https://huggingface.co/somosnlp/NoticIA-7B/blob/main/README_es.md

✨ Features

Clickbait Summarization: Capable of generating single-sentence summaries for clickbait articles, revealing the truth behind the headlines.
Research-Oriented: Ideal for scientific research, especially for evaluating the performance of task-specific models compared to instruction-tuned models in zero-shot settings.

📦 Installation

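The README gives no dedicated installation steps. The usage examples depend on these pip packages, as noted in their import comments: torch, transformers, bitsandbytes, newspaper3k (first example only), and datasets (second example only).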

💻 Usage Examples

Basic Usage

Summarizing a clickbait article from the web

import torch  # pip install torch
from newspaper import Article  # pip install newspaper3k
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig  # pip install transformers
from transformers import BitsAndBytesConfig  # pip install bitsandbytes accelerate (accelerate is required by device_map="auto")

# Download the article and extract its headline and body text.
article_url = "https://www.huffingtonpost.es/virales/le-compra-abrigo-abuela-97nos-reaccion-fantasia.html"
article = Article(article_url)
article.download()
article.parse()
headline = article.title
body = article.text

def prompt(
    headline: str,
    body: str,
) -> str:
    """
    Generate the prompt for the model.

    Args:
        headline (`str`):
            The headline of the article.
        body (`str`):
            The body of the article.
    Returns:
        `str`: The formatted prompt.
    """

    return (
        f"Ahora eres una Inteligencia Artificial experta en desmontar titulares sensacionalistas o clickbait. "
        f"Tu tarea consiste en analizar noticias con titulares sensacionalistas y "
        f"generar un resumen de una sola frase que revele la verdad detrás del titular.\n"
        f"Este es el titular de la noticia: {headline}\n"
        f"El titular plantea una pregunta o proporciona información incompleta. "
        f"Debes buscar en el cuerpo de la noticia una frase que responda lo que se sugiere en el título. "
        f"Siempre que puedas cita el texto original, especialmente si se trata de una frase que alguien ha dicho. "
        f"Si citas una frase que alguien ha dicho, usa comillas para indicar que es una cita. "
        f"Usa siempre las mínimas palabras posibles. No es necesario que la respuesta sea una oración completa. "
        f"Puede ser sólo el foco de la pregunta. "
        f"Recuerda responder siempre en Español.\n"
        f"Este es el cuerpo de la noticia:\n"
        f"{body}\n"
    )

# Build the prompt; a new name keeps the `prompt` function from being overwritten.
user_prompt = prompt(headline=headline, body=body)

tokenizer = AutoTokenizer.from_pretrained("somosnlp/NoticIA-7B")

# Load the weights in 4-bit precision to reduce memory use.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "somosnlp/NoticIA-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)

# Wrap the prompt in the model's chat template.
formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(
    [formatted_prompt], return_tensors="pt", add_special_tokens=False
)

# Greedy decoding: no sampling, a single beam, at most 64 new tokens.
model_output = model.generate(
    **model_inputs.to(model.device),
    generation_config=GenerationConfig(
        max_new_tokens=64,
        min_new_tokens=1,
        do_sample=False,
        num_beams=1,
        use_cache=True,
    ),
)

summary = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]

print(summary.strip().split("\n")[-1])  # Keep only the last line (the summary), dropping the prompt.
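
The last line of the example recovers the summary by splitting the decoded text on newlines, which depends on how the chat template lays out the prompt. A common alternative for decoder-only models is to drop the prompt tokens before decoding, since `generate` returns the prompt followed by the new tokens. The sketch below illustrates that slicing on plain Python lists; `strip_prompt_tokens` is a hypothetical helper name and the token ids are made up.

```python
from typing import List, Sequence


def strip_prompt_tokens(
    output_ids: Sequence[Sequence[int]], prompt_length: int
) -> List[List[int]]:
    """Return only the tokens generated after the prompt, for each sequence.

    Assumes every row starts with the same `prompt_length` prompt tokens,
    as is the case for a batch containing a single prompt.
    """
    return [list(row[prompt_length:]) for row in output_ids]


# Made-up ids: 4 prompt tokens followed by 3 generated tokens.
prompt_ids = [1, 15, 27, 42]
output_ids = [prompt_ids + [7, 8, 9]]

new_tokens = strip_prompt_tokens(output_ids, prompt_length=len(prompt_ids))
print(new_tokens)
```

With tensors, the equivalent slice would be `model_output[:, model_inputs["input_ids"].shape[1]:]`, passed to `tokenizer.batch_decode` in place of `model_output`.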

Performing inference on the NoticIA dataset

import torch  # pip install torch
from datasets import load_dataset  # pip install datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig  # pip install transformers
from transformers import BitsAndBytesConfig  # pip install bitsandbytes accelerate (accelerate is required by device_map="auto")

# Load the test split of the NoticIA-it dataset.
dataset = load_dataset("somosnlp/NoticIA-it", split="test")

tokenizer = AutoTokenizer.from_pretrained("somosnlp/NoticIA-7B")

# Load the weights in 4-bit precision to reduce memory use.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "somosnlp/NoticIA-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)

# The "pregunta" field of the first test example is used directly as the user message.
formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": dataset[0]["pregunta"]}],
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(
    [formatted_prompt], return_tensors="pt", add_special_tokens=False
)

# Greedy decoding: no sampling, a single beam, at most 64 new tokens.
model_output = model.generate(
    **model_inputs.to(model.device),
    generation_config=GenerationConfig(
        max_new_tokens=64,
        min_new_tokens=1,
        do_sample=False,
        num_beams=1,
        use_cache=True,
    ),
)

summary = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]

print(summary.strip().split("\n")[-1])  # Keep only the last line (the summary), dropping the prompt.
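
Both examples load the model with `load_in_4bit=True`. As a rough, back-of-the-envelope illustration of why this helps on consumer hardware, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, the KV cache, and any layers that stay unquantized, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    bytes_total = num_parameters * bits_per_parameter / 8
    return bytes_total / 1e9


# "7B" taken at face value; the exact parameter count may differ.
NOMINAL_PARAMETERS = 7e9

bf16_gb = weight_memory_gb(NOMINAL_PARAMETERS, 16)
four_bit_gb = weight_memory_gb(NOMINAL_PARAMETERS, 4)

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{four_bit_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the bfloat16 weights.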

📚 Documentation

Model Details

Model Description

A clickbait article attracts readers through curiosity. Its headline poses a question or makes an incomplete, sensationalist, exaggerated, or misleading statement. The answer usually appears at the end after a lot of irrelevant content. The goal is to get users to view more ads. Clickbait articles are of low quality and offer little value.

NoticIA-7B is a 7B parameter model trained on the NoticIA-it dataset. It can generate high-quality summaries for clickbait articles.

Developed by: [Iker García-Ferrero](https://ikergarcia1996.github.io/Iker-Garcia-Ferrero/), [Begoña Altuna](https://www.linkedin.com/in/bego%C3%B1a-altuna-78014139)
Funded by: SomosNLP, HuggingFace, HiTZ Zentroa
Model type: Language model, instruction tuned
Language(s): es-ES
License: apache-2.0
Fine-tuned from model: [openchat/openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106)
Dataset used: https://huggingface.co/datasets/somosnlp/NoticIA-it

Model Sources

💻 Repository: https://github.com/ikergarcia1996/NoticIA
📖 Paper: NoticIA: A Clickbait Article Summarization Dataset in Spanish
🤖 Dataset and Pre-Trained Models: [https://huggingface.co/collections/Iker/noticia-and-clickbaitfighter-65fdb2f80c34d7c063d3e48e](https://huggingface.co/collections/Iker/noticia-and-clickbaitfighter-65fdb2f80c34d7c063d3e48e)
🔌 Demo: https://huggingface.co/spaces/somosnlp/NoticIA-demo
▶️ Video presentation (Spanish): https://youtu.be/xc60K_NzUgk?si=QMqk6OzQZfKP1EUS
🐱‍💻 Hackathon #Somos600M: https://somosnlp.org/hackathon

Uses

Direct Use

📖 Summarization of clickbait articles
📈 Evaluation of language models in Spanish
📚 Development of new academic resources (e.g., synthetic data generation)
🎓 Any other academic research purpose.

Out-of-Scope Use

The use of this model for any action that may harm the legitimacy or economic viability of legitimate and professional media outlets is prohibited.

Bias, Risks, and Limitations

The model was trained mainly on news from Spain, and the data annotators are also from Spain. It is therefore expected to perform well on European Spanish; its performance on Latin American news or in other languages is not guaranteed.

Training Details

Training Data

A clickbait article attracts readers by arousing curiosity. Its headline often poses a question or makes an incomplete, sensational, exaggerated, or misleading statement. The answer to the question in the headline usually comes at the end, after a large amount of irrelevant content. The aim is to get users to view more ads. Clickbait articles are generally of low quality and offer little value beyond initial curiosity. This phenomenon undermines public trust in news sources and affects the advertising revenue of legitimate content creators.

The model is trained on [NoticIA](https://huggingface.co/datasets/somosnlp/NoticIA-it), a dataset with 850 Spanish clickbait news articles, each paired with high-quality, single-sentence human-written summaries.

Training Procedure

A custom training and annotation library (https://github.com/ikergarcia1996/NoticIA) was developed; it uses 🤗 Transformers, 🤗 PEFT, bitsandbytes, and DeepSpeed.

For the hackathon, a 7-billion-parameter model was trained. With 4-bit quantization, the model can run on consumer-grade hardware. After evaluating many LLMs, [openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106) was chosen for its strong performance without additional pretraining. To minimize the impact on the model's prior knowledge, training used Low-Rank Adaptation (LoRA).
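
To give a sense of why LoRA leaves the base model largely untouched, the sketch below counts the parameters a single adapter adds to one frozen weight matrix. The dimensions and rank are illustrative assumptions only; they are not the settings used to train NoticIA-7B, and `lora_added_parameters` is a hypothetical helper.

```python
def lora_added_parameters(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight.

    LoRA keeps the original weight frozen and learns two small matrices,
    A (rank x d_in) and B (d_out x rank), whose product is added to it.
    """
    return rank * d_in + d_out * rank


# Illustrative values only: a 4096 x 4096 projection with rank 8.
D_IN, D_OUT, RANK = 4096, 4096, 8

full = D_IN * D_OUT
added = lora_added_parameters(D_IN, D_OUT, RANK)
fraction = added / full

print(f"frozen weight: {full:,} parameters")
print(f"LoRA adapter:  {added:,} parameters ({fraction:.2%} of the weight)")
```

Because only the small A and B matrices receive gradient updates, the pretrained weights themselves are never modified during training.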

Training Hyperparameters

Property	Details
Training regime	bfloat16
Training method	LoRA + DeepSpeed ZeRO-3
Batch size	64
Sequence Length	8192
Epochs	3
Optimizer	AdamW
Software	Hugging Face Transformers, PEFT, PyTorch, DeepSpeed
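
As a small worked example of what these settings imply, the sketch below computes an upper bound on the number of tokens processed per optimizer step. It assumes that the batch size of 64 is the effective batch per optimizer step and that every sequence is padded or packed to the full 8192 tokens; in practice most articles are shorter, so the real figure is lower. `max_tokens_per_step` is a hypothetical helper.

```python
# Values copied from the hyperparameter table.
BATCH_SIZE = 64
SEQUENCE_LENGTH = 8192


def max_tokens_per_step(batch_size: int, sequence_length: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence at full length)."""
    return batch_size * sequence_length


tokens_per_step = max_tokens_per_step(BATCH_SIZE, SEQUENCE_LENGTH)
print(f"At most {tokens_per_step:,} tokens per optimizer step")
```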

📄 License

The model is licensed under the Apache-2.0 license.
