🚀 Turkish BERT Base Sentiment Analysis Model
This model performs sentiment analysis for Turkish and is built on BERTurk, a BERT model for Turkish. Model: https://huggingface.co/savasy/bert-base-turkish-sentiment-cased ; BERTurk: https://huggingface.co/dbmdz/bert-base-turkish-cased
🚀 Quick Start
To get started, install the `transformers` library (see the Installation Guide below) and run the basic example in the Usage Examples section.
✨ Key Features
- Built on the BERTurk model and tuned for Turkish sentiment analysis.
- Ships with detailed training and usage examples, making it easy to get started.
- Reached roughly 95.4% accuracy in the reported experiments.
📦 Installation Guide
To use the model, install the `transformers` library:

```
pip install transformers
```
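Note: the `pipeline` API also needs a deep learning backend. Assuming you use PyTorch (a common choice for this model), install it alongside `transformers` with `pip install torch`.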
💻 Usage Examples
Basic usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer, then build a pipeline.
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# LABEL_1 corresponds to positive sentiment, LABEL_0 to negative.
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True

p = sa("Film çok kötü ve çok sahteydi")
print(p)
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
```
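The pipeline returns raw ids (`LABEL_0`, `LABEL_1`) rather than readable names. A minimal sketch for mapping them, continuing from the snippet above; the `to_readable` helper and the mapping reflect our reading of the examples (`LABEL_1` = positive, `LABEL_0` = negative) and are not an official part of the model card:

```python
# Hypothetical helper: map the model's raw ids to readable labels.
# Assumes LABEL_1 = positive and LABEL_0 = negative, as the examples suggest.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def to_readable(prediction):
    """Turn one pipeline result dict into a (label, score) pair."""
    return LABEL_NAMES[prediction["label"]], prediction["score"]

print(to_readable(sa("bu telefon modelleri çok kaliteli")[0]))
# e.g. ('positive', 0.98...)
```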
Advanced usage
Sentiment analysis over reviews in a file
Suppose your file contains one review per line together with a label (1 or 0), separated by a tab:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

input_file = "/path/to/your/file/yourfile.tsv"

i, crr = 0, 0
with open(input_file) as f:
    for line in f:
        fields = line.strip().split("\t")
        if len(fields) == 2:  # expect "review<TAB>label"
            i = i + 1
            if i % 100 == 0:
                print(i)  # progress indicator
            pred = sa(fields[0])
            pred = pred[0]["label"].split("_")[1]  # "LABEL_1" -> "1"
            if pred == fields[1]:
                crr = crr + 1

print(crr, i, crr / i)  # correct predictions, total, accuracy
```
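Scoring one line at a time is slow on large files. A batched variant is sketched below, reusing the `sa` pipeline from above; the `evaluate_file` name and the batch size are illustrative assumptions, not part of the original card:

```python
def evaluate_file(sa, path, batch_size=32):
    """Hypothetical helper: batched accuracy over a 'review<TAB>label' TSV."""
    texts, labels = [], []
    with open(path) as f:
        for line in f:
            fields = line.strip().split("\t")
            if len(fields) == 2:
                texts.append(fields[0])
                labels.append(fields[1])
    correct = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        preds = sa(batch)  # pipelines accept a list of strings
        for pred, gold in zip(preds, labels[start:start + batch_size]):
            if pred["label"].split("_")[1] == gold:
                correct += 1
    return correct / len(texts) if texts else 0.0

print(evaluate_file(sa, "/path/to/your/file/yourfile.tsv"))
```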
📚 Documentation
Datasets
- Study [2] collected movie and product reviews; the products cover books, DVDs, electronics, and kitchen appliances. The movie dataset comes from a cinema website (Beyazperde) and contains 5,331 positive and 5,331 negative sentences. Reviews on that site are rated by their authors on a 0-5 scale, and the study treats a review as positive if its rating is 4 or higher and negative if it is 2 or lower (this thresholding rule is sketched in code after this list). The authors also built a Turkish product-review dataset from an online retailer's website, forming a benchmark of reviews for several product categories (books, DVDs, etc.). These reviews are rated from 1 to 5, with most rated 5. Each category has 700 positive and 700 negative reviews; negative reviews average a rating of 2.27 and positive reviews average 4.5. The same dataset was also used by study [1].
- Study [3] collected a dataset of tweets and proposed a new method for automatically classifying the sentiment of microblog messages, based on strong feature representations and fusion.
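The rating-to-polarity rule from [2] is a simple threshold. A small illustrative helper follows; the function name is hypothetical, and ratings that fall between the thresholds return `None` because the study discards them:

```python
def rating_to_label(rating):
    """Map a 0-5 review rating to a polarity label, per study [2].

    Ratings >= 4 are positive ("1"), ratings <= 2 are negative ("0"),
    and anything in between is dropped (returns None).
    """
    if rating >= 4:
        return "1"
    if rating <= 2:
        return "0"
    return None
```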
Merged dataset

| Size  | Data      |
|-------|-----------|
| 8000  | dev.tsv   |
| 8262  | test.tsv  |
| 32000 | train.tsv |
| 48290 | Total     |
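To work with these splits programmatically, one option (not part of the original card) is the Hugging Face `datasets` library; this sketch assumes the files are headerless, tab-separated `text<TAB>label` pairs, as in the evaluation example above:

```python
from datasets import load_dataset

# Assumption: each TSV holds headerless "text<TAB>label" rows.
data = load_dataset(
    "csv",
    data_files={"train": "train.tsv", "validation": "dev.tsv", "test": "test.tsv"},
    delimiter="\t",
    column_names=["text", "label"],
)
print(data)  # DatasetDict with train/validation/test splits
```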
Papers citing the dataset
[1] Yildirim, Savaş. (2020). "Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language." doi:10.1007/978-981-15-1216-2_12.
[2] Demirtas, Erkin, and Mykola Pechenizkiy. (2013). "Cross-lingual Polarity Detection with Machine Translation." In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM '13).
[3] Hayran, A., and Sert, M. (2017). "Sentiment Analysis on Microblog Data Based on Word Embedding and Fusion Techniques." IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey.
Training

```shell
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2

python3 run_glue.py \
  --model_type bert \
  --model_name_or_path dbmdz/bert-base-turkish-uncased \
  --task_name "SST-2" \
  --do_train \
  --do_eval \
  --data_dir "./sst-2-newall" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
```
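`run_glue.py` is the GLUE fine-tuning script from the `transformers` examples. For readers who prefer the `Trainer` API, a rough equivalent is sketched below; the hyperparameters mirror the command above, the data loading assumes the TSV layout described earlier, and the whole block is illustrative rather than the exact script that produced this model:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "dbmdz/bert-base-turkish-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumption: headerless "text<TAB>label" TSV splits, as described above.
data = load_dataset(
    "csv",
    data_files={"train": "train.tsv", "validation": "dev.tsv"},
    delimiter="\t",
    column_names=["text", "label"],
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./model",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3.0,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # lets Trainer pad batches dynamically
)
trainer.train()
```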
Results

```
05/10/2020 17:00:43 - INFO - transformers.trainer - ****** Running Evaluation ******
05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
05/10/2020 17:01:17 - INFO - __main__ - ****** Eval results sst-2 ******
05/10/2020 17:01:17 - INFO - __main__ - acc = 0.9539942492811602
05/10/2020 17:01:17 - INFO - __main__ - loss = 0.16348013816401363
```

Accuracy is approximately 95.4%.
📄 Citation
If you use this model in your research, please cite:

```bibtex
@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks},
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@book{yildirim2021mastering,
      title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
      author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
      year={2021},
      publisher={Packt Publishing Ltd}
}
```








