🚀 Turkish BERT Base Sentiment Analysis Model
This model performs sentiment analysis for Turkish and is built on BERTurk, a BERT model for Turkish. Model: https://huggingface.co/savasy/bert-base-turkish-sentiment-cased ; BERTurk: https://huggingface.co/dbmdz/bert-base-turkish-cased
🚀 Quick Start
To get started, install the `transformers` library (see the Installation Guide below) and run the basic example in the Usage Examples section.
✨ Key Features
- Built on the BERTurk model and tuned for Turkish sentiment analysis.
- Ships with detailed training and usage examples, making it easy to get started.
- Reached roughly 95.4% accuracy in the reported experiments.
📦 Installation Guide
To use the model, install the `transformers` library:

```
pip install transformers
```
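Note: the `pipeline` API also needs a deep learning backend. Assuming you use PyTorch (a common choice for this model), install it alongside `transformers` with `pip install torch`.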
💻 Usage Examples
Basic usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer, then build a pipeline.
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# LABEL_1 corresponds to positive sentiment, LABEL_0 to negative.
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True

p = sa("Film çok kötü ve çok sahteydi")
print(p)
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
```
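The pipeline returns raw ids (`LABEL_0`, `LABEL_1`) rather than readable names. A minimal sketch for mapping them, continuing from the snippet above; the `to_readable` helper and the mapping reflect our reading of the examples (`LABEL_1` = positive, `LABEL_0` = negative) and are not an official part of the model card:

```python
# Hypothetical helper: map the model's raw ids to readable labels.
# Assumes LABEL_1 = positive and LABEL_0 = negative, as the examples suggest.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}

def to_readable(prediction):
    """Turn one pipeline result dict into a (label, score) pair."""
    return LABEL_NAMES[prediction["label"]], prediction["score"]

print(to_readable(sa("bu telefon modelleri çok kaliteli")[0]))
# e.g. ('positive', 0.98...)
```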
Advanced usage
Sentiment analysis over reviews in a file
Suppose your file contains one review per line together with a label (1 or 0), separated by a tab:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

input_file = "/path/to/your/file/yourfile.tsv"

i, crr = 0, 0
with open(input_file) as f:
    for line in f:
        fields = line.strip().split("\t")
        if len(fields) == 2:  # expect "review<TAB>label"
            i = i + 1
            if i % 100 == 0:
                print(i)  # progress indicator
            pred = sa(fields[0])
            pred = pred[0]["label"].split("_")[1]  # "LABEL_1" -> "1"
            if pred == fields[1]:
                crr = crr + 1

print(crr, i, crr / i)  # correct predictions, total, accuracy
```
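Scoring one line at a time is slow on large files. A batched variant is sketched below, reusing the `sa` pipeline from above; the `evaluate_file` name and the batch size are illustrative assumptions, not part of the original card:

```python
def evaluate_file(sa, path, batch_size=32):
    """Hypothetical helper: batched accuracy over a 'review<TAB>label' TSV."""
    texts, labels = [], []
    with open(path) as f:
        for line in f:
            fields = line.strip().split("\t")
            if len(fields) == 2:
                texts.append(fields[0])
                labels.append(fields[1])
    correct = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        preds = sa(batch)  # pipelines accept a list of strings
        for pred, gold in zip(preds, labels[start:start + batch_size]):
            if pred["label"].split("_")[1] == gold:
                correct += 1
    return correct / len(texts) if texts else 0.0

print(evaluate_file(sa, "/path/to/your/file/yourfile.tsv"))
```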
📚 Documentation
Datasets
- Study [2] collected movie and product reviews; the products cover books, DVDs, electronics, and kitchen appliances. The movie dataset comes from a cinema website (Beyazperde) and contains 5,331 positive and 5,331 negative sentences. Reviews on that site are rated by their authors on a 0-5 scale, and the study treats a review as positive if its rating is 4 or higher and negative if it is 2 or lower (this thresholding rule is sketched in code after this list). The authors also built a Turkish product-review dataset from an online retailer's website, forming a benchmark of reviews for several product categories (books, DVDs, etc.). These reviews are rated from 1 to 5, with most rated 5. Each category has 700 positive and 700 negative reviews; negative reviews average a rating of 2.27 and positive reviews average 4.5. The same dataset was also used by study [1].
- Study [3] collected a dataset of tweets and proposed a new method for automatically classifying the sentiment of microblog messages, based on strong feature representations and fusion.
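The rating-to-polarity rule from [2] is a simple threshold. A small illustrative helper follows; the function name is hypothetical, and ratings that fall between the thresholds return `None` because the study discards them:

```python
def rating_to_label(rating):
    """Map a 0-5 review rating to a polarity label, per study [2].

    Ratings >= 4 are positive ("1"), ratings <= 2 are negative ("0"),
    and anything in between is dropped (returns None).
    """
    if rating >= 4:
        return "1"
    if rating <= 2:
        return "0"
    return None
```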
Merged dataset

| Size  | Data      |
|-------|-----------|
| 8000  | dev.tsv   |
| 8262  | test.tsv  |
| 32000 | train.tsv |
| 48290 | Total     |
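To work with these splits programmatically, one option (not part of the original card) is the Hugging Face `datasets` library; this sketch assumes the files are headerless, tab-separated `text<TAB>label` pairs, as in the evaluation example above:

```python
from datasets import load_dataset

# Assumption: each TSV holds headerless "text<TAB>label" rows.
data = load_dataset(
    "csv",
    data_files={"train": "train.tsv", "validation": "dev.tsv", "test": "test.tsv"},
    delimiter="\t",
    column_names=["text", "label"],
)
print(data)  # DatasetDict with train/validation/test splits
```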
Papers citing the dataset
[1] Yildirim, Savaş. (2020). "Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language." doi:10.1007/978-981-15-1216-2_12.
[2] Demirtas, Erkin, and Mykola Pechenizkiy. (2013). "Cross-lingual Polarity Detection with Machine Translation." In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM '13).
[3] Hayran, A., and Sert, M. (2017). "Sentiment Analysis on Microblog Data Based on Word Embedding and Fusion Techniques." IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey.
Training

```shell
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2

python3 run_glue.py \
  --model_type bert \
  --model_name_or_path dbmdz/bert-base-turkish-uncased \
  --task_name "SST-2" \
  --do_train \
  --do_eval \
  --data_dir "./sst-2-newall" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
```
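`run_glue.py` is the GLUE fine-tuning script from the `transformers` examples. For readers who prefer the `Trainer` API, a rough equivalent is sketched below; the hyperparameters mirror the command above, the data loading assumes the TSV layout described earlier, and the whole block is illustrative rather than the exact script that produced this model:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "dbmdz/bert-base-turkish-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumption: headerless "text<TAB>label" TSV splits, as described above.
data = load_dataset(
    "csv",
    data_files={"train": "train.tsv", "validation": "dev.tsv"},
    delimiter="\t",
    column_names=["text", "label"],
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./model",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3.0,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # lets Trainer pad batches dynamically
)
trainer.train()
```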
Results

```
05/10/2020 17:00:43 - INFO - transformers.trainer - ****** Running Evaluation ******
05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999
05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8
Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
05/10/2020 17:01:17 - INFO - __main__ - ****** Eval results sst-2 ******
05/10/2020 17:01:17 - INFO - __main__ - acc = 0.9539942492811602
05/10/2020 17:01:17 - INFO - __main__ - loss = 0.16348013816401363
```

Accuracy is approximately 95.4%.
📄 Citation
If you use this model in your research, please cite:

```bibtex
@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks},
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@book{yildirim2021mastering,
      title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
      author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
      year={2021},
      publisher={Packt Publishing Ltd}
}
```








