ESG-BERT: An Open-Source Model for Text Mining and ESG Text Classification in Sustainable Investing


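ESG-BERT

By nbroad

A BERT variant specialized for text mining in sustainable investing; it performs strongly on ESG text classification tasks.

Large language model

Transformers

English  #ESG text mining  #Sustainable investment analysis  #Financial NLP

Downloads: 9,800

Released: 3/2/2022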

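Model Overview

A language model based on the BERT architecture and optimized for text analysis in the environmental, social, and governance (ESG) domain. It identifies and classifies unstructured text related to sustainable investing.

Model Features

ESG domain specialization

Further trained on text from the sustainable investing domain, so it outperforms general-purpose BERT on ESG-related tasks.

High-performance text classification

Reaches an F1 score of 0.90 on ESG text classification, compared with 0.79 for general-purpose BERT and 0.67 for traditional methods.

Multi-class classification

Classifies text into 26 ESG-related labels covering business ethics, data security, climate change, and other ESG dimensions.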

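Model Capabilities

ESG text classification

Sustainable investing text analysis

Corporate social responsibility report processing

Unstructured ESG data mining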

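Use Cases

Corporate ESG report analysis

Carbon footprint statement identification

Automatically identify and classify carbon-reduction statements in corporate annual reports.

Identifies key information such as 'reducing carbon footprint' and 'emissions-reduction initiatives'.

Conflict minerals policy detection

Analyze descriptions of mineral sourcing policies in corporate reports.

Identifies policy statements such as 'conflict-free minerals' and 'responsible sourcing'.

Sustainable investing research

ESG factor extraction

Extract key ESG factors from large volumes of corporate documents to support investment decisions.

Automatically classifies 26 types of ESG-related factors, improving research efficiency.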

🚀 ESG-BERT Model Card

A domain-specific BERT model for text mining in sustainable investing.

🚀 Quick Start

Use the code below to get started with the model:


pip install torchserve torch-model-archiver

pip install torchvision

pip install transformers

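Next, set up the handler script. This is a basic handler for text classification and can be improved as needed. Save the script as "handler.py" in your working directory. [1]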

from abc import ABC
import json
import logging
import os

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class TransformersClassifierHandler(BaseHandler, ABC):
    """
    Transformers text classifier handler class. This handler takes a text (string)
    as input and returns the predicted class based on the serialized transformers checkpoint.
    """

    def __init__(self):
        super().__init__()
        self.initialized = False
        # Explicit default so inference() still works when index_to_name.json is absent.
        self.mapping = None

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        gpu_id = properties.get("gpu_id")
        # Use the GPU assigned by TorchServe when one is available, otherwise the CPU.
        use_gpu = torch.cuda.is_available() and gpu_id is not None
        self.device = torch.device("cuda:" + str(gpu_id) if use_gpu else "cpu")
        # Load the serialized model and tokenizer from the archive directory
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model.to(self.device)
        self.model.eval()
        logger.debug("Transformer model from path %s loaded successfully", model_dir)
        # Read the mapping file, index to label name
        mapping_file_path = os.path.join(model_dir, "index_to_name.json")
        if os.path.isfile(mapping_file_path):
            with open(mapping_file_path) as f:
                self.mapping = json.load(f)
        else:
            logger.warning("Missing the index_to_name.json file. Inference output will not include class name.")
        self.initialized = True

    def preprocess(self, data):
        """ Very basic preprocessing code - only tokenizes.
            Extend with your own preprocessing steps as needed.
        """
        text = data[0].get("data")
        if text is None:
            text = data[0].get("body")
        # The request body usually arrives as bytes; accept an already-decoded string too.
        sentences = text.decode("utf-8") if isinstance(text, (bytes, bytearray)) else text
        logger.info("Received text: '%s'", sentences)
        inputs = self.tokenizer.encode_plus(
            sentences,
            add_special_tokens=True,
            truncation=True,  # keep long inputs within the model's maximum sequence length
            return_tensors="pt"
        )
        return inputs

    def inference(self, inputs):
        """
        Predict the class of a text using a trained transformer model.
        """
        # NOTE: This makes the assumption that your model expects text to be tokenized
        # with "input_ids" and "token_type_ids" - which is true for some popular transformer models, e.g. bert.
        # If your transformer model expects different tokenization, adapt this code to suit
        # its expected input format.
        with torch.no_grad():  # no gradients are needed at inference time
            prediction = self.model(
                inputs["input_ids"].to(self.device),
                token_type_ids=inputs["token_type_ids"].to(self.device)
            )[0].argmax().item()
        logger.info("Model predicted: '%s'", prediction)
        if self.mapping:
            prediction = self.mapping[str(prediction)]
        return [prediction]

    def postprocess(self, inference_output):
        # TODO: Add any needed post-processing of the model predictions here
        return inference_output


_service = TransformersClassifierHandler()


def handle(data, context):
    # Initialize lazily on the first request, then run the full pipeline.
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    data = _service.preprocess(data)
    data = _service.inference(data)
    return _service.postprocess(data)
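
The handler above returns only the single highest-scoring class. When a ranked shortlist is more useful, one option is to convert the raw logits into probabilities and keep the top few. The following is a minimal sketch in pure Python, not part of the original handler: the function names `softmax` and `top_k_labels` and the sample values are hypothetical, and `mapping` is assumed to have the same shape as the contents of index_to_name.json (stringified index to label name).

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, mapping, k=3):
    """Return the k most probable (label, probability) pairs.

    `mapping` maps stringified class indices to label names; an index
    without an entry falls back to its own string form.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(mapping.get(str(i), str(i)), probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Illustrative values only, not real model output.
    demo_logits = [0.1, 2.0, -1.0, 0.5]
    demo_mapping = {"0": "Business_Ethics", "1": "Data_Security"}
    for label, prob in top_k_labels(demo_logits, demo_mapping, k=2):
        print(f"{label}: {prob:.3f}")
```

Inside the handler, the logits would come from the first element of the model output (converted to a plain list) before `argmax` is applied.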

TorchServe uses a format called MAR (Model Archive). The following command converts the PyTorch model into a .mar file:

torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"

Move the .mar file into a new directory:

mkdir model_store && mv bert.mar model_store

Finally, start TorchServe with the following command:

torchserve --start --model-store model_store --models bert=bert.mar

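The model can now be queried through the Inference API from another terminal window. Pass a text file containing the text, and the model will attempt to classify it.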

curl -X POST http://127.0.0.1:8080/predictions/bert -T predict.txt
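
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server address and model name from the curl command above; the helper names `build_prediction_request` and `classify` are hypothetical.

```python
import urllib.request


def build_prediction_request(text, model_name="bert", host="http://127.0.0.1:8080"):
    """Build a POST request that carries the raw UTF-8 text as its body."""
    url = f"{host}/predictions/{model_name}"
    return urllib.request.Request(url, data=text.encode("utf-8"), method="POST")


def classify(text, **kwargs):
    """Send the text to a running TorchServe instance and return the reply body."""
    request = build_prediction_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


# Usage (requires TorchServe to be running with the "bert" model loaded):
#     print(classify("Text of a passage from a sustainability report."))
```

Separating request construction from sending makes it possible to inspect the URL and payload without a live server.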

This returns a label number that corresponds to a text label. The labels are stored in the label_dict.txt dictionary file:

__label__Business_Ethics :  0
__label__Data_Security :  1
__label__Access_And_Affordability :  2
__label__Business_Model_Resilience :  3
__label__Competitive_Behavior :  4
__label__Critical_Incident_Risk_Management :  5
__label__Customer_Welfare :  6
__label__Director_Removal :  7
__label__Employee_Engagement_Inclusion_And_Diversity :  8
__label__Employee_Health_And_Safety :  9
__label__Human_Rights_And_Community_Relations :  10
__label__Labor_Practices :  11
__label__Management_Of_Legal_And_Regulatory_Framework :  12
__label__Physical_Impacts_Of_Climate_Change :  13
__label__Product_Quality_And_Safety :  14
__label__Product_Design_And_Lifecycle_Management :  15
__label__Selling_Practices_And_Product_Labeling :  16
__label__Supply_Chain_Management :  17
__label__Systemic_Risk_Management :  18
__label__Waste_And_Hazardous_Materials_Management :  19
__label__Water_And_Wastewater_Management :  20
__label__Air_Quality :  21
__label__Customer_Privacy :  22
__label__Ecological_Impacts :  23
__label__Energy_Management :  24
__label__GHG_Emissions :  25
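
The handler looks for an optional index_to_name.json file to turn the numeric prediction into a label name. A minimal sketch of deriving that file from the listing above follows. It assumes label_dict.txt contains lines in exactly the `__label__Name :  index` form shown here; the helper names are hypothetical, and stripping the `__label__` prefix is a cosmetic choice rather than a requirement.

```python
import json


def parse_label_dict(lines):
    """Parse lines like '__label__Business_Ethics :  0' into {"0": "Business_Ethics"}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        name, sep, index = line.rpartition(":")
        if not sep:
            continue  # skip blank or malformed lines
        label = name.strip()
        if label.startswith("__label__"):
            label = label[len("__label__"):]
        mapping[str(int(index))] = label
    return mapping


def write_index_to_name(src="label_dict.txt", dst="index_to_name.json"):
    """Read the label dictionary file and write the JSON mapping the handler reads."""
    with open(src, encoding="utf-8") as f:
        mapping = parse_label_dict(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2)
    return mapping
```

For the handler to find the generated file, it would presumably need to be packaged into the archive, for example by appending its path to the `--extra-files` list passed to torch-model-archiver.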

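✨ Key Features

Carbon footprint reduction: In fiscal year 2019, the company reduced its comprehensive carbon footprint for the fourth consecutive year, down 35 percent compared with 2015, when Apple's carbon emissions peaked, even as net revenue increased by 11 percent over the same period. In the past year, emissions-reduction initiatives avoided over 10 million metric tons of carbon emissions; for example, the Supplier Clean Energy Program lowered the footprint by 4.4 million metric tons.
Conflict minerals policy: The company believes it is essential to establish validated conflict-free sources of 3TG (tin, tantalum, tungsten, and gold) within the Democratic Republic of the Congo and adjoining countries, so it has adopted a conflict minerals policy and set up an internal team to implement it.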

📦 Installation Guide

Install the dependencies:

pip install torchserve torch-model-archiver
pip install torchvision
pip install transformers

Convert the model to a MAR file:

torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"

mkdir model_store && mv bert.mar model_store

Start TorchServe:

torchserve --start --model-store model_store --models bert=bert.mar

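📚 Documentation

Model Details

Attribute	Details
Developers	Mukut Mukherjee, Charan Pothireddi, and Parabole.ai
Shared by (optional)	HuggingFace
Model type	Language model
Language (NLP)	English
License	More information needed
Related models	Parent model: BERT
Resources for more information	GitHub repository, blog post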

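Uses

Direct Use

Text mining in sustainable investing.

Downstream Use (optional)

ESG-BERT can be applied well beyond text classification: it can be fine-tuned to perform various other downstream NLP tasks in the sustainable investing domain.

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

A substantial body of research has explored bias and fairness issues in language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed to provide further recommendations.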

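Training Details

Training Data

More information needed.

Training Procedure

Preprocessing: More information needed.
Speeds, sizes, times: More information needed.

Evaluation

Testing Data, Factors, and Metrics

Testing data: The model fine-tuned for text classification is also available here. It can be used directly to make predictions in a few simple steps. First, download the fine-tuned pytorch_model.bin, config.json, and vocab.txt.
Factors: More information needed.
Metrics: More information needed.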

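Results

ESG-BERT was further trained on unstructured text data, reaching accuracies of 100% and 98% on the next sentence prediction and masked language modeling tasks, respectively. Fine-tuning ESG-BERT for text classification yielded an F1 score of 0.90. For comparison, the general BERT (BERT-base) model scored 0.79 after fine-tuning, and the scikit-learn approach scored 0.67.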
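
The source does not state how the F1 scores above were averaged across the 26 classes. As one possible reading, the sketch below computes a macro-averaged F1 (the unweighted mean of per-class F1) in pure Python. The function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only.
    print(round(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]), 4))
```

Micro or weighted averaging would give different numbers on imbalanced data, so the averaging scheme matters when comparing against the figures reported here.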

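Model Examination

More information needed.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Attribute	Details
Hardware type	More information needed
Hours used	More information needed
Cloud provider	More information needed
Compute region	More information needed
Carbon emitted	More information needed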
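
Once the hardware type, hours used, and compute region are known, a rough estimate multiplies the energy consumed by the carbon intensity of the local grid. The sketch below is a simplified version of that kind of calculation, not the calculator's exact method. The function name is hypothetical and the example inputs are placeholders, not measurements for ESG-BERT.

```python
def estimate_emissions_kg(power_watts, hours, grid_kg_co2_per_kwh):
    """Estimate emissions in kg CO2eq as energy (kWh) times grid carbon intensity."""
    if power_watts < 0 or hours < 0 or grid_kg_co2_per_kwh < 0:
        raise ValueError("inputs must be non-negative")
    energy_kwh = power_watts / 1000.0 * hours
    return energy_kwh * grid_kg_co2_per_kwh


if __name__ == "__main__":
    # Placeholder inputs: a 250 W accelerator, 10 hours, 0.4 kg CO2eq per kWh.
    print(estimate_emissions_kg(250, 10, 0.4))
```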