CryptoBERT Open-Source Model - Free Sentiment and Language Analysis of Cryptocurrency Social Media Posts

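Developed by ElKulako

CryptoBERT is a pre-trained natural language processing model built specifically to analyze the sentiment and language of cryptocurrency-related social media posts.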

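Text Classification

Transformers

Language: English

License: MIT

Tags: #cryptocurrency-sentiment-analysis #social-media-text-processing #fine-tuned-BERT-model

Downloads: 276.93k

Released: June 20, 2022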

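Model Overview

The model is based on the BERT architecture and was further trained on cryptocurrency-domain text. It identifies bullish, bearish, and neutral sentiment in social media posts about cryptocurrencies.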

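Model Features

Cryptocurrency-specific

Trained specifically on cryptocurrency-related social media content, so it handles the domain's jargon and phrasing better than a general-purpose model.

Multi-platform training data

Trained on 3.2 million cryptocurrency-related posts from StockTwits, Telegram, Reddit, and Twitter.

Three-class sentiment analysis

Classifies cryptocurrency-related posts as "Bearish", "Neutral", or "Bullish".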

Model Capabilities

Cryptocurrency social media text analysis

Sentiment classification

Natural language understanding

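Use Cases

Market sentiment analysis

Cryptocurrency market sentiment monitoring

Analyze, in real time, the sentiment expressed on social media about a specific cryptocurrency.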

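In the usage example below, the model labels one post Bullish with a confidence score of about 0.873 and another Bearish with a score of about 0.989. These are per-post confidence scores, not measured accuracy rates.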

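Investment decision support

Gives cryptocurrency investors market sentiment data to use as a reference.

Academic research

Cryptocurrency community language research

Analyze the language patterns and expressions specific to cryptocurrency communities.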

🚀 CryptoBERT

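CryptoBERT is a pre-trained natural language processing (NLP) model for analyzing the language and sentiment of cryptocurrency-related social media posts and messages. It was built by further training vinai's bertweet-base language model on the cryptocurrency domain, using a corpus of more than 3.2 million unique cryptocurrency-related social media posts. (A research paper with more details will follow.)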

🚀 Quick Start

Academic Citation

For academic citation, please refer to the following paper: https://ieeexplore.ieee.org/document/10223689

Classification Training

The model was trained with the following labels: "Bearish": 0, "Neutral": 1, "Bullish": 2.
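
The label mapping above can be applied to raw classifier outputs as in the following minimal sketch. It assumes the classification head emits three logits ordered by label id. The `softmax` and `decode` helpers and the example logits are hypothetical and are not part of the CryptoBERT repository.

```python
import math

# Label ids as stated in the model card: Bearish = 0, Neutral = 1, Bullish = 2.
ID2LABEL = {0: "Bearish", 1: "Neutral", 2: "Bullish"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Hypothetical helper: return the top label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Made-up logits for illustration only.
print(decode([-1.2, 0.3, 2.5]))
```

The returned dictionary mirrors the `label`/`score` shape of the pipeline output shown in the usage example.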

CryptoBERT's sentiment classification head was fine-tuned on a balanced dataset of 2 million labeled StockTwits posts sampled from ElKulako/stocktwits-crypto.
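
One common way to build a class-balanced set is to downsample every class to the same size. The sketch below illustrates that idea only: the source does not describe the exact sampling procedure used for CryptoBERT, and `balance_by_label`, the `(text, label)` row format, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def balance_by_label(rows, seed=0):
    """Downsample (text, label) rows so every label has the same count."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    if not by_label:
        return []
    # Every class is cut down to the size of the smallest class.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Toy rows for illustration; labels follow 0 = Bearish, 1 = Neutral, 2 = Bullish.
rows = [("a", 0), ("b", 0), ("c", 1), ("d", 2), ("e", 2), ("f", 2)]
print(balance_by_label(rows))
```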

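CryptoBERT was trained with a maximum sequence length of 128. Technically, it can handle sequences of up to 514 tokens, but going beyond 128 is not recommended.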

💻 Usage Example

Basic Usage

from transformers import TextClassificationPipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the CryptoBERT tokenizer and the 3-class classification model
model_name = "ElKulako/cryptobert"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
# Pad or truncate every post to 64 tokens before classification
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, max_length=64, truncation=True, padding='max_length')
# post_1 & post_3 = bullish, post_2 = bearish
post_1 = " see y'all tomorrow and can't wait to see ada in the morning, i wonder what price it is going to be at. 😎🐂🤠💯😴, bitcoin is looking good go for it and flash by that 45k. "
post_2 = "  alright racers, it’s a race to the bottom! good luck today and remember there are no losers (minus those who invested in currency nobody really uses) take your marks... are you ready? go!!"
post_3 = " i'm never selling. the whole market can bottom out. i'll continue to hold this dumpster fire until the day i die if i need to."
df_posts = [post_1, post_2, post_3]
# Classify all posts in one call; each result holds a label and a confidence score
preds = pipe(df_posts)
print(preds)

Running the code above produces the following output:

[{'label': 'Bullish', 'score': 0.8734585642814636}, {'label': 'Bearish', 'score': 0.9889495372772217}, {'label': 'Bullish', 'score': 0.6595883965492249}]
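
For sentiment monitoring, per-post predictions like the ones above are usually summarized before use. The following sketch tallies predictions per label and averages their scores. `summarize` is a hypothetical helper, not part of the model's API, and the input simply repeats the pipeline output shown above.

```python
from collections import defaultdict


def summarize(preds):
    """Count predictions per label and average their confidence scores."""
    scores = defaultdict(list)
    for pred in preds:
        scores[pred["label"]].append(pred["score"])
    return {
        label: {"count": len(vals), "mean_score": sum(vals) / len(vals)}
        for label, vals in scores.items()
    }


# Pipeline output copied from the example above.
preds = [
    {"label": "Bullish", "score": 0.8734585642814636},
    {"label": "Bearish", "score": 0.9889495372772217},
    {"label": "Bullish", "score": 0.6595883965492249},
]
print(summarize(preds))
```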

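🔧 Technical Details

Training Corpus

CryptoBERT was trained on 3.2 million social media posts about various cryptocurrencies. Only non-duplicate posts longer than 4 words were considered. The corpus sources are as follows:

(1) StockTwits - 1.875M posts about the top 100 cryptocurrencies by trading volume. Posts were collected from November 1, 2021 to June 16, 2022. Dataset: ElKulako/stocktwits-crypto

(2) Telegram - 664K posts from the top 5 Telegram groups: Binance, Bittrex, huobi global, Kucoin, OKEx. Data were collected from November 16, 2020 to January 30, 2021. Thanks to Anton for providing the data.

(3) Reddit - 172K comments from various cryptocurrency investing threads, collected from May 2021 to May 2022.

(4) Twitter - 496K posts with the hashtags #XBT, #Bitcoin, or #BTC. Collected in May 2018. Thanks to Paul for providing the data.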
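
The filtering rule described above (keep only non-duplicate posts longer than 4 words) can be sketched as follows. This is an illustration, not the authors' preprocessing code: `filter_posts` is a hypothetical helper, words are counted by whitespace splitting, and duplicates are detected by exact match after whitespace and case normalization, all of which are assumptions.

```python
def filter_posts(posts, min_words=5):
    """Keep first occurrences of posts with at least `min_words` words."""
    seen = set()
    kept = []
    for post in posts:
        key = " ".join(post.split()).lower()  # normalize whitespace and case
        if len(key.split()) < min_words or key in seen:
            continue
        seen.add(key)
        kept.append(post)
    return kept


posts = [
    "btc to the moon",                     # 4 words: dropped
    "bitcoin is looking good go for it",   # kept
    "Bitcoin is looking good  go for it",  # duplicate after normalization: dropped
    "i'm never selling this coin ever",    # kept
]
print(filter_posts(posts))
```

"Longer than 4 words" is read here as at least 5 words, hence the default `min_words=5`.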

📄 License

This project is licensed under the MIT License.

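Property	Details
Model type	Pre-trained natural language processing (NLP) model
Training data	More than 3.2 million unique cryptocurrency-related social media posts, including data from StockTwits, Telegram, Reddit, and Twitter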