ClickbaitFighter-10B Open-Source Model - Free to Deploy, Revealing the Truth Behind Spanish Clickbait News

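ClickbaitFighter 10B

Developed by Iker

A Spanish clickbait news summarization model fine-tuned on the NoticIA dataset. It reveals the actual content behind sensational headlines.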

Large language model

Transformers

Spanish #Spanish clickbait analysis #News summarization #High ROUGE score

Release date: 3/22/2024

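Model overview

The model analyzes Spanish clickbait news articles and generates a concise summary that reveals the real information behind the headline. It is fine-tuned from Nous-Hermes-2-SOLAR-10.7B and reaches a ROUGE score of 52.01 on the NoticIA dataset.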

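Model features

Clickbait news analysis

Designed specifically for Spanish clickbait news; it identifies and dismantles sensational headlines.

Precise summary generation

Generates a one-sentence summary focused on the core facts, quoting direct speech from the article where possible.

High-quality fine-tuning

Fine-tuned on the NoticIA dataset, reaching a ROUGE score of 52.01.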

Model capabilities

Spanish text understanding

Clickbait content detection

News summary generation

Direct quote extraction

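Use cases

News media

Fact-checking clickbait news

Automatically generate truth-revealing summaries of clickbait articles for news platforms

Help readers quickly grasp the substance of a news story

Content moderation

Social media content moderation

Identify misleading headlines on social media

Help human moderation teams work more efficiently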

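🚀 Clickbait News Summarization Model

This project is based on the NousResearch/Nous-Hermes-2-SOLAR-10.7B model, fine-tuned on the Iker/NoticIA dataset. It analyzes clickbait news articles and generates a one-sentence summary that reveals the truth behind the headline.

🚀 Quick Start

The model is fine-tuned on the NoticIA dataset and generates summaries of clickbait news articles.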

Model information

Property	Details
Model type	Causal language model based on the `transformers` library
Training data	`Iker/NoticIA` dataset
Evaluation metric	ROUGE

Open-source models

Model name	Iker/ClickbaitFighter-2B	Iker/ClickbaitFighter-7B	Iker/ClickbaitFighter-10B
Parameters	2B	7B	10.7B
ROUGE score	36.26	49.81	52.01
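
The scores above are ROUGE values, which measure word overlap between a generated summary and a reference summary. The page does not say which ROUGE variant or tokenization was used, so the following is only a minimal sketch of a unigram-overlap (ROUGE-1 style) F1 score, assuming lowercased whitespace tokenization. The function name `rouge1_f1` and the example sentences are hypothetical, and this is not the evaluation script behind the reported numbers.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pair, for illustration only.
score = rouge1_f1("la abuela llora de alegría", "la abuela llora")
print(f"{100 * score:.2f}")  # scaled to 0-100 for readability
```

In this toy pair, all three reference words appear in the five-word prediction, so recall is 1.0, precision is 0.6, and F1 is 0.75.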

💻 Usage examples

Summarizing a web article

import torch  # pip install torch
from newspaper import Article  # pip install newspaper3k
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig  # pip install transformers

# Download and parse the article to obtain its headline and body text.
article_url = "https://www.huffingtonpost.es/virales/le-compra-abrigo-abuela-97nos-reaccion-fantasia.html"
article = Article(article_url)
article.download()
article.parse()
headline = article.title
body = article.text

def build_prompt(
    headline: str,
    body: str,
) -> str:
    """
    Generate the prompt for the model.

    Args:
        headline (`str`):
            The headline of the article.
        body (`str`):
            The body of the article.
    Returns:
        `str`: The formatted prompt.
    """

    return (
        f"Ahora eres una Inteligencia Artificial experta en desmontar titulares sensacionalistas o clickbait. "
        f"Tu tarea consiste en analizar noticias con titulares sensacionalistas y "
        f"generar un resumen de una sola frase que revele la verdad detrás del titular.\n"
        f"Este es el titular de la noticia: {headline}\n"
        f"El titular plantea una pregunta o proporciona información incompleta. "
        f"Debes buscar en el cuerpo de la noticia una frase que responda lo que se sugiere en el título. "
        f"Siempre que puedas cita el texto original, especialmente si se trata de una frase que alguien ha dicho. "
        f"Si citas una frase que alguien ha dicho, usa comillas para indicar que es una cita. "
        f"Usa siempre las mínimas palabras posibles. No es necesario que la respuesta sea una oración completa. "
        f"Puede ser sólo el foco de la pregunta. "
        f"Recuerda responder siempre en Español.\n"
        f"Este es el cuerpo de la noticia:\n"
        f"{body}\n"
    )

# The helper has its own name so that assigning `prompt` does not shadow it.
prompt = build_prompt(headline=headline, body=body)

tokenizer = AutoTokenizer.from_pretrained("Iker/ClickbaitFighter-10B")
model = AutoModelForCausalLM.from_pretrained(
    "Iker/ClickbaitFighter-10B", torch_dtype=torch.bfloat16, device_map="auto"
)

formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(
    [formatted_prompt], return_tensors="pt", add_special_tokens=False
)

# Greedy decoding, limited to 32 new tokens.
model_output = model.generate(
    **model_inputs.to(model.device),
    generation_config=GenerationConfig(
        max_new_tokens=32,
        min_new_tokens=1,
        do_sample=False,
        num_beams=1,
        use_cache=True,
    ),
)

summary = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]

print(summary.strip().split("\n")[-1])  # Get only the summary, without the prompt.
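
The final line of the example relies on the summary being the last line of the decoded text. Because `generate` on a causal language model returns the prompt tokens followed by the newly generated ones, another option is to drop the prompt by length before decoding. The sketch below illustrates the idea on plain Python lists with made-up token ids. `strip_prompt_tokens` is a hypothetical helper, and the tensor form in the closing comment is an assumption that has not been run against this model.

```python
from typing import List


def strip_prompt_tokens(output_ids: List[List[int]], prompt_length: int) -> List[List[int]]:
    """Drop the first `prompt_length` token ids from each generated sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Toy ids: a 4-token prompt followed by 3 newly generated tokens.
prompt_ids = [101, 7, 8, 9]
generated = [prompt_ids + [42, 43, 44]]

new_tokens = strip_prompt_tokens(generated, prompt_length=len(prompt_ids))
print(new_tokens)  # [[42, 43, 44]]

# With the tensors from the example above, the equivalent slice would be (assumption):
#   new_ids = model_output[:, model_inputs["input_ids"].shape[1]:]
#   summary = tokenizer.batch_decode(new_ids, skip_special_tokens=True)[0]
```

Slicing by length keeps the whole generated text even if the model emits a line break inside the summary.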

Running inference on the NoticIA dataset

import torch  # pip install torch
from datasets import load_dataset  # pip install datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig  # pip install transformers

# Load the dataset and take the first test example's headline and body.
dataset = load_dataset("Iker/NoticIA")
example = dataset["test"][0]
headline = example["web_headline"]
body = example["web_text"]

def build_prompt(
    headline: str,
    body: str,
) -> str:
    """
    Generate the prompt for the model.

    Args:
        headline (`str`):
            The headline of the article.
        body (`str`):
            The body of the article.
    Returns:
        `str`: The formatted prompt.
    """

    return (
        f"Ahora eres una Inteligencia Artificial experta en desmontar titulares sensacionalistas o clickbait. "
        f"Tu tarea consiste en analizar noticias con titulares sensacionalistas y "
        f"generar un resumen de una sola frase que revele la verdad detrás del titular.\n"
        f"Este es el titular de la noticia: {headline}\n"
        f"El titular plantea una pregunta o proporciona información incompleta. "
        f"Debes buscar en el cuerpo de la noticia una frase que responda lo que se sugiere en el título. "
        f"Siempre que puedas cita el texto original, especialmente si se trata de una frase que alguien ha dicho. "
        f"Si citas una frase que alguien ha dicho, usa comillas para indicar que es una cita. "
        f"Usa siempre las mínimas palabras posibles. No es necesario que la respuesta sea una oración completa. "
        f"Puede ser sólo el foco de la pregunta. "
        f"Recuerda responder siempre en Español.\n"
        f"Este es el cuerpo de la noticia:\n"
        f"{body}\n"
    )

# The helper has its own name so that assigning `prompt` does not shadow it.
prompt = build_prompt(headline=headline, body=body)

tokenizer = AutoTokenizer.from_pretrained("Iker/ClickbaitFighter-10B")
model = AutoModelForCausalLM.from_pretrained(
    "Iker/ClickbaitFighter-10B", torch_dtype=torch.bfloat16, device_map="auto"
)

formatted_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer(
    [formatted_prompt], return_tensors="pt", add_special_tokens=False
)

# Greedy decoding, limited to 32 new tokens.
model_output = model.generate(
    **model_inputs.to(model.device),
    generation_config=GenerationConfig(
        max_new_tokens=32,
        min_new_tokens=1,
        do_sample=False,
        num_beams=1,
        use_cache=True,
    ),
)

summary = tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]

print(summary.strip().split("\n")[-1])  # Get only the summary, without the prompt.
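
The example above processes only the first test record. To cover several records, one user conversation can be prepared per record before generation. The sketch below assumes each record exposes the `web_headline` and `web_text` fields used above. `build_conversations`, `toy_prompt`, and the toy records are hypothetical stand-ins so that the snippet runs without downloading the dataset or the model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_conversations(
    examples: List[Dict[str, str]],
    prompt_fn: Callable[..., str],
) -> List[List[Message]]:
    """Build one single-turn user conversation per dataset record."""
    conversations = []
    for example in examples:
        content = prompt_fn(headline=example["web_headline"], body=example["web_text"])
        conversations.append([{"role": "user", "content": content}])
    return conversations


def toy_prompt(headline: str, body: str) -> str:
    """Hypothetical stand-in for the full Spanish prompt shown above."""
    return f"Titular: {headline}\nCuerpo:\n{body}\n"


# Toy records that mimic the two fields read from the dataset above.
toy_examples = [
    {"web_headline": "Titular A", "web_text": "Cuerpo A"},
    {"web_headline": "Titular B", "web_text": "Cuerpo B"},
]

conversations = build_conversations(toy_examples, toy_prompt)
print(len(conversations))
print(conversations[0][0]["role"])
```

Each inner list has the same shape as the single-message list passed to `tokenizer.apply_chat_template` above, so the conversations can be formatted and generated one at a time with the same calls.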

📄 License

This project is released under the cc-by-nc-sa-4.0 license.

📚 Citation

@misc{noticia2024,
      title={NoticIA: A Clickbait Article Summarization Dataset in Spanish}, 
      author={Iker García-Ferrero and Begoña Altuna},
      year={2024},
      eprint={2404.07611},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}