ESG-BERT open-source model: supports text mining and ESG text classification for sustainable investing

ESG BERT

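Developed by nbroad

A BERT variant specialized for text mining in sustainable investing, with strong performance on ESG-related text classification tasks

Large language model

Transformers

English #ESG text mining #Sustainable investment analysis #Financial NLP

Downloads: 9,800

Release date: 3/2/2022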

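Model Overview

A language model built on the BERT architecture and specialized for text analysis in the environmental, social, and governance (ESG) domain. It identifies and classifies unstructured text related to sustainable investing.

Model Features

ESG domain specialization

Further trained on text from the sustainable investing domain, it outperforms a general-purpose BERT model on ESG-related tasks.

High-performance text classification

Achieves an F1 score of 0.90 on ESG text classification, well above a general-purpose BERT model (0.79) and a traditional approach (0.67).

Multi-class classification

Classifies text into 26 ESG-related labels covering multiple ESG dimensions such as business ethics, data security, and climate change.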

Model Capabilities

ESG text classification

Sustainable investment text analysis

Corporate social responsibility report processing

Unstructured ESG data mining

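Use Cases

Corporate ESG report analysis

Carbon footprint statement identification

Automatically identifies and classifies carbon-reduction statements in corporate annual reports

Can identify key information such as 'carbon footprint reduction' and 'emission reduction measures'

Conflict minerals policy detection

Analyzes descriptions of mineral sourcing policies in corporate reports

Can identify policy statements such as 'conflict-free minerals' and 'responsible sourcing'

Sustainable investment research

ESG factor extraction

Extracts the key ESG factors needed for investment decisions from large volumes of corporate documents

Automatically classifies text into 26 ESG-related categories, improving research efficiency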

🚀 ESG-BERT Model Card

A domain-specific BERT model for text mining in sustainable investing

🚀 Quick Start

Use the code below to get started with the model.

pip install torchserve torch-model-archiver

pip install torchvision

pip install transformers

Next, set up the handler script. This is a basic handler for text classification and can be improved. Save the script in your directory as "handler.py". [1]

from abc import ABC
import json
import logging
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)

class TransformersClassifierHandler(BaseHandler, ABC):
    """
    Transformers text classifier handler class. This handler takes a text (string)
    as input and returns the predicted class based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(TransformersClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
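        # Use the GPU assigned by TorchServe when there is one; otherwise fall back to CPU
        gpu_id = properties.get("gpu_id")
        self.device = torch.device("cuda:" + str(gpu_id) if torch.cuda.is_available() and gpu_id is not None else "cpu")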
        # Read model serialize/pt file
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model.to(self.device)
        self.model.eval()
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        # Read the mapping file, index to object name
        self.mapping = None  # stays None when no mapping file is bundled, so inference() can test it safely
        mapping_file_path = os.path.join(model_dir, "index_to_name.json")
        if os.path.isfile(mapping_file_path):
            with open(mapping_file_path) as f:
                self.mapping = json.load(f)
        else:
            logger.warning('Missing the index_to_name.json file. Inference output will not include class name.')
        self.initialized = True

    def preprocess(self, data):
        """ Very basic preprocessing code - only tokenizes.
            Extend with your own preprocessing steps as needed.
        """
        text = data[0].get("data")
        if text is None:
            text = data[0].get("body")
        sentences = text.decode('utf-8')
        logger.info("Received text: '%s'", sentences)
        inputs = self.tokenizer.encode_plus(
            sentences,
            add_special_tokens=True,
            truncation=True,   # BERT accepts at most 512 tokens; cut longer inputs
            max_length=512,
            return_tensors="pt"
        )
        return inputs

    def inference(self, inputs):
        """
        Predict the class of a text using a trained transformer model.
        """
        # NOTE: This makes the assumption that your model expects text to be tokenized 
        # with "input_ids" and "token_type_ids" - which is true for some popular transformer models, e.g. bert.
        # If your transformer model expects different tokenization, adapt this code to suit
        # its expected input format.
        with torch.no_grad():  # inference only, so skip gradient tracking
            prediction = self.model(
                inputs['input_ids'].to(self.device),
                token_type_ids=inputs['token_type_ids'].to(self.device)
            )[0].argmax().item()
        logger.info("Model predicted: '%s'", prediction)
        if self.mapping:
            prediction = self.mapping[str(prediction)]
        return [prediction]

    def postprocess(self, inference_output):
        # TODO: Add any needed post-processing of the model predictions here
        return inference_output

_service = TransformersClassifierHandler()

def handle(data, context):
    try:
        if not _service.initialized:
            _service.initialize(context)
        if data is None:
            return None
        data = _service.preprocess(data)
        data = _service.inference(data)
        data = _service.postprocess(data)
        return data
    except Exception as e:
        raise e
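
The handler above returns only the single highest-scoring class. When several ESG topics are plausible for one passage, a ranked list with probabilities can be more informative. The following is a minimal sketch, not part of the original handler: `top_k_labels` is a hypothetical helper that takes the logits as a plain list of floats (for example, the first row of the model output converted with `.tolist()`) and an optional mapping shaped like index_to_name.json.

```python
import math


def top_k_labels(logits, mapping=None, k=3):
    """Return the k most probable classes as (label, probability) pairs.

    logits  -- list of floats, one score per class
    mapping -- optional dict from str(index) to label name
               (same shape the handler expects from index_to_name.json)
    """
    # Numerically stable softmax: subtract the maximum before exponentiating
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by descending probability and keep the first k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Fall back to the raw index when no mapping is available
    return [(mapping[str(i)] if mapping else i, probs[i]) for i in ranked]
```

Inside `inference`, one way to use it would be to keep the logits instead of calling `argmax()` directly and pass `logits[0].tolist()` together with `self.mapping` to this helper.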

TorchServe uses a format called MAR (Model Archive). You can convert your PyTorch model to a .mar file with the following command.

torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"
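
The archiver command refers to three files under ./bert_model. A small pre-flight check can confirm they are in place before archiving. This is only a sketch: `missing_model_files` is a hypothetical helper, and the directory and file names are taken from the command above.

```python
from pathlib import Path

# File names as they appear in the torch-model-archiver command
REQUIRED_FILES = ("pytorch_model.bin", "config.json", "vocab.txt")


def missing_model_files(model_dir="./bert_model", required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    base = Path(model_dir)
    return [name for name in required if not (base / name).is_file()]
```

An empty list means all three files were found; otherwise the returned names show what still needs to be downloaded.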

Move the .mar file into a new directory.

mkdir model_store && mv bert.mar model_store

Finally, start TorchServe with the following command.

torchserve --start --model-store model_store --models bert=bert.mar

You can now query the model from another terminal window using the Inference API. Pass a text file containing the text the model should classify.

curl -X POST http://127.0.0.1:8080/predictions/bert -T predict.txt
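
The same query can be sent from Python. The sketch below uses only the standard library and assumes the endpoint from the curl command (local TorchServe on port 8080, model registered as "bert"). `build_request` and `classify` are hypothetical helper names. The body is sent as raw bytes because the handler decodes the payload itself.

```python
import urllib.request

# Assumed endpoint: same host, port, and model name as in the curl command
PREDICT_URL = "http://127.0.0.1:8080/predictions/bert"


def build_request(text, url=PREDICT_URL):
    """Build a POST request whose body is the UTF-8 encoded text to classify."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        # Send opaque bytes; the handler calls .decode('utf-8') on the payload
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


def classify(text, url=PREDICT_URL, timeout=10):
    """Send the text to the running server and return the response body as a string."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `classify("some sentence from an annual report")` would return the response body as text.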

This returns the number of the predicted label. The mapping between text labels and label numbers is stored in the label_dict.txt dictionary file.

__label__Business_Ethics :  0
__label__Data_Security :  1
__label__Access_And_Affordability :  2
__label__Business_Model_Resilience :  3
__label__Competitive_Behavior :  4
__label__Critical_Incident_Risk_Management :  5
__label__Customer_Welfare :  6
__label__Director_Removal :  7
__label__Employee_Engagement_Inclusion_And_Diversity :  8
__label__Employee_Health_And_Safety :  9
__label__Human_Rights_And_Community_Relations :  10
__label__Labor_Practices :  11
__label__Management_Of_Legal_And_Regulatory_Framework :  12
__label__Physical_Impacts_Of_Climate_Change :  13
__label__Product_Quality_And_Safety :  14
__label__Product_Design_And_Lifecycle_Management :  15
__label__Selling_Practices_And_Product_Labeling :  16
__label__Supply_Chain_Management :  17
__label__Systemic_Risk_Management :  18
__label__Waste_And_Hazardous_Materials_Management :  19
__label__Water_And_Wastewater_Management :  20
__label__Air_Quality :  21
__label__Customer_Privacy :  22
__label__Ecological_Impacts :  23
__label__Energy_Management :  24
__label__GHG_Emissions :  25
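
The handler looks for an index_to_name.json file in the model directory and, when it finds one, returns a label name instead of a number. One way to produce that file from the list above is sketched here. It assumes label_dict.txt contains lines in exactly the format shown, and it strips the `__label__` prefix, which is a presentation choice rather than anything the model requires. The function names are hypothetical.

```python
import json


def parse_label_dict(lines):
    """Turn lines like '__label__Business_Ethics :  0' into {'0': 'Business_Ethics'}."""
    index_to_name = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last colon: label name on the left, index on the right
        name, _, index = line.rpartition(":")
        name = name.strip()
        if name.startswith("__label__"):
            name = name[len("__label__"):]
        # Keys are strings because the handler looks up mapping[str(prediction)]
        index_to_name[str(int(index))] = name
    return index_to_name


def write_index_to_name(label_dict_path, out_path):
    """Read label_dict.txt and write the JSON mapping the handler expects."""
    with open(label_dict_path, encoding="utf-8") as f:
        mapping = parse_label_dict(f)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2)
    return mapping
```

For the handler to pick the file up, it would also need to be bundled into the archive, for example by adding its path to the `--extra-files` list when running torch-model-archiver.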

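✨ Key Features

Text mining for sustainable investing: ESG-BERT is specialized for text mining in the sustainable investing domain.
Application to downstream NLP tasks: beyond text classification, it can be fine-tuned for other downstream NLP tasks.

📦 Installation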

pip install torchserve torch-model-archiver
pip install torchvision
pip install transformers

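📚 Documentation

Model Details

Developed by: Mukut Mukherjee, Charan Pothireddi, and Parabole.ai
Shared by (optional): HuggingFace
Model type: Language model
Language(s) (NLP): English
License: More information needed.
Related models:
- Parent model: BERT
Resources for more information:
GitHub repository
Blog post

Uses

Direct Use

Text mining in sustainable investing

Downstream Use (optional)

Applications of ESG-BERT are not limited to text classification; the model can be extended to other downstream NLP tasks in the sustainable investing domain.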

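Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive social and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Further recommendations: more information needed.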

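Training Details

Training Data

More information needed.

Training Procedure

Preprocessing

More information needed.

Speeds, Sizes, Times

More information needed.

Evaluation

Testing Data, Factors, and Metrics

Testing Data

The model fine-tuned for text classification is also available here. It can be used to make predictions directly in a few steps. First, download the fine-tuned pytorch_model.bin, config.json, and vocab.txt.

Factors

More information needed.

Metrics

More information needed.

Results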

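ESG-BERT was further trained on unstructured text data, reaching accuracies of 100% and 98% on the Next Sentence Prediction and Masked Language Modelling tasks, respectively. Fine-tuning ESG-BERT for text classification yielded an F1 score of 0.90. For comparison, the general BERT (BERT-base) model scored 0.79 after fine-tuning, and the scikit-learn approach scored 0.67.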