interpress-turkish-news-classification: a free, open-source model for high-accuracy Turkish news classification

Interpress Turkish News Classification

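Developed by serdarakyol

A Turkish news classification model trained on the Interpress news dataset, with a reported accuracy of 97%.

Text classification · Other · #Turkish news classification #High accuracy (97%) #Multi-class news recognition

Downloads: 40

Published: 3/2/2022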

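Model Overview

The model classifies Turkish news articles into 10 categories, including Politics, Economy, and World.

Model Features

High accuracy

Reaches 97% accuracy on both the training and validation data

Multi-class classification

Distinguishes 10 different news categories

Turkish language support

Optimized specifically for Turkish news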

Model Capabilities

Turkish text classification

News content analysis

Multi-class prediction

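Use Cases

News media

Automatic news classification

Automatically assigns news articles to one of the 10 predefined categories

97% accuracy

Content analysis

News trend analysis

Uses the classification results to analyze news trends over a given period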
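
As a rough illustration of the trend-analysis use case, the sketch below counts predicted categories per day. It assumes you already have (date, predicted label index) pairs from the model; the sample data and the helper name `category_trends` are hypothetical, while the label names are the ones listed in the usage example further down.

```python
from collections import Counter, defaultdict

# Index-to-category mapping, as listed in the usage example.
LABELS = {
    0: "Culture-Art", 1: "Economy", 2: "Politics", 3: "Education", 4: "World",
    5: "Sport", 6: "Technology", 7: "Magazine", 8: "Health", 9: "Agenda",
}

def category_trends(predictions):
    """Count predicted categories per day.

    `predictions` is an iterable of (day, label_index) pairs.
    Returns a dict mapping each day to a Counter of category names.
    """
    trends = defaultdict(Counter)
    for day, label_id in predictions:
        trends[day][LABELS[label_id]] += 1
    return trends

# Made-up predictions, for illustration only.
sample = [
    ("2021-03-01", 4),
    ("2021-03-01", 4),
    ("2021-03-01", 1),
    ("2021-03-02", 5),
]

trends = category_trends(sample)
for day in sorted(trends):
    print(day, trends[day].most_common())
# 2021-03-01 [('World', 2), ('Economy', 1)]
# 2021-03-02 [('Sport', 1)]
```

Grouping by day is only one possible choice; the same counting works for weeks or months if the date key is changed accordingly.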

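🚀 INTERPRESS News Classification

This project covers INTERPRESS news classification: a model trained on a specific dataset assigns Turkish news articles to categories, which supports efficient processing of news content.

🚀 Quick Start

The project shows how to run news classification predictions with either Torch or Tensorflow; choose whichever fits your setup.

✨ Key Features

Real-world dataset: the data was downloaded from INTERPRESS and, after filtering, 108K records were used to train the model.
High accuracy: the model reaches 97% accuracy on both the training and validation data.
Multi-framework support: the model can be used with either Torch or Tensorflow.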

📦 Installation

Torch

pip install transformers
To pin the version instead: pip install transformers==4.3.3

Tensorflow

pip install transformers
To pin the version instead: pip install transformers==4.3.3
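
After installing, it can help to confirm which package versions are actually present. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical, and the package names checked are assumptions about a typical setup (only one of `torch` or `tensorflow` is needed, depending on the framework you choose).

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Report the packages the usage examples rely on.
for name in ("transformers", "torch", "tensorflow"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```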

💻 Usage Examples

Torch

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("serdarakyol/interpress-turkish-news-classification")
model = AutoModelForSequenceClassification.from_pretrained("serdarakyol/interpress-turkish-news-classification")

# Use the GPU when one is available, otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('GPU name is:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
model = model.to(device)
model.eval()  # inference mode: disables dropout

def prediction(news):
    """Return the predicted class index for a single news text."""
    # Tokenize as a batch of one, padded or truncated to 512 tokens.
    indices = tokenizer.batch_encode_plus(
        [news],
        max_length=512,
        add_special_tokens=True,
        return_attention_mask=True,
        padding='max_length',
        truncation=True,
        return_tensors='pt')

    inputs = indices["input_ids"].to(device)
    masks = indices["attention_mask"].to(device)

    with torch.no_grad():
        output = model(inputs, token_type_ids=None, attention_mask=masks)

    # output[0] holds the logits, shape (batch_size, num_classes).
    logits = output[0].cpu().numpy()
    # Cast to a plain int so the result can be used as a dict key.
    pred = int(np.argmax(logits, axis=1)[0])
    return pred

news = "ABD'den Prens Selman'a yaptırım yok Beyaz Saray Sözcüsü Psaki, Muhammed bin Selman'a yaptırım uygulamamanın \"doğru karar\" olduğunu savundu. Psaki, \"Tarihimizde, Demokrat ve Cumhuriyetçi başkanların yönetimlerinde diplomatik ilişki içinde olduğumuz ülkelerin liderlerine yönelik yaptırım getirilmemiştir\" dedi."
# You can find this news item at: https://www.ntv.com.tr/dunya/abdden-prens-selmana-yaptirim-yok,YTeWNv0-oU6Glbhnpjs1JQ (news date: 02/03/2021)

labels = {
    0 : "Culture-Art",
    1 : "Economy",
    2 : "Politics",
    3 : "Education",
    4 : "World",
    5 : "Sport",
    6 : "Technology",
    7 : "Magazine",
    8 : "Health",
    9 : "Agenda"
}
pred = prediction(news)
print(labels[pred])
# > World
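
The example above returns only the top class. If you also want a rough confidence for each category, the logits can be passed through a softmax. The sketch below shows that step in plain Python; the helper name `top_k` and the logit values are made up for illustration and are not model output.

```python
import math

# Category names in index order, as in the labels dict above.
LABELS = [
    "Culture-Art", "Economy", "Politics", "Education", "World",
    "Sport", "Technology", "Magazine", "Health", "Agenda",
]

def top_k(logits, k=3):
    """Return the k most probable (label, probability) pairs for one logit vector."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits for a single article (illustration only).
example_logits = [0.1, 1.2, 2.0, -0.5, 5.3, -1.0, 0.0, -0.7, 0.3, 1.1]

for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the Torch example, the row `logits[0]` computed inside `prediction` could be converted with `.tolist()` and passed to such a helper. Softmax scores are not calibrated probabilities, so treat them as a relative ranking rather than a guarantee.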

Tensorflow

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification
import numpy as np

tokenizer = BertTokenizer.from_pretrained('serdarakyol/interpress-turkish-news-classification')
model = TFBertForSequenceClassification.from_pretrained("serdarakyol/interpress-turkish-news-classification")

news = "ABD'den Prens Selman'a yaptırım yok Beyaz Saray Sözcüsü Psaki, Muhammed bin Selman'a yaptırım uygulamamanın \"doğru karar\" olduğunu savundu. Psaki, \"Tarihimizde, Demokrat ve Cumhuriyetçi başkanların yönetimlerinde diplomatik ilişki içinde olduğumuz ülkelerin liderlerine yönelik yaptırım getirilmemiştir\" dedi."

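# Index-to-category mapping (same as in the Torch example).
labels = {
    0 : "Culture-Art",
    1 : "Economy",
    2 : "Politics",
    3 : "Education",
    4 : "World",
    5 : "Sport",
    6 : "Technology",
    7 : "Magazine",
    8 : "Health",
    9 : "Agenda"
}

# Truncate to the 512-token limit used in the Torch example.
inputs = tokenizer(news, truncation=True, max_length=512, return_tensors="tf")

# No labels are passed: only the logits are needed for prediction.
outputs = model(inputs)
logits = outputs.logits
pred = int(np.argmax(logits, axis=1)[0])
print(labels[pred])
# > World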