ESG-BERT: an open-source model for text mining and ESG text classification in sustainable investing

ESG BERT

Published by nbroad

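A BERT variant focused on text mining in the sustainable investing domain, with strong results on ESG text classification tasks

Large language model

Transformers

English · #ESG text mining #Sustainable investment analysis #Financial NLP

Downloads: 9,800

Released: 3/2/2022

Model overview

A language model built on the BERT architecture and specialized for environmental, social, and governance (ESG) text analysis. It identifies and classifies unstructured text related to sustainable investing.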

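Model features

ESG domain specialization

Further trained on sustainable investing text, so it outperforms general-purpose BERT on ESG tasks

High-performance text classification

Reaches an F1 score of 0.90 on ESG text classification, compared with 0.79 for general-purpose BERT and 0.67 for a traditional scikit-learn approach

Multi-class classification

Classifies text into 26 ESG labels spanning business ethics, data security, climate change, and other ESG dimensions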

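Model capabilities

ESG text classification

Sustainable investing text analysis

Corporate social responsibility report processing

Unstructured ESG data mining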

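Use cases

Corporate ESG report analysis

Carbon footprint statement detection

Automatically identify and classify carbon reduction statements in corporate annual reports

Picks up key phrases such as "reducing the carbon footprint" and "emissions reduction initiatives"

Conflict minerals policy detection

Analyze how corporate reports describe mineral sourcing policies

Recognizes policy statements such as "conflict-free minerals" and "responsible sourcing"

Sustainable investing research

ESG factor extraction

Extract key ESG factors from large volumes of corporate documents to support investment decisions

Automatically classifies text into 26 ESG factors, improving research efficiency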

🚀 ESG-BERT Model Card

A domain-specific BERT model for text mining in sustainable investing

🚀 Quick Start

Use the code below to get started with the model:

pip install torchserve torch-model-archiver

pip install torchvision

pip install transformers

Next, set up the handler script. This is a basic handler for text classification that you can improve as needed. Save it as "handler.py" in your working directory. [1]

from abc import ABC
import json
import logging
import os
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)

class TransformersClassifierHandler(BaseHandler, ABC):
    """
    Transformers text classifier handler class. This handler takes a text (string)
    as input and returns the predicted class based on the serialized transformers checkpoint.
    """
    def __init__(self):
        super(TransformersClassifierHandler, self).__init__()
        self.initialized = False

    def initialize(self, ctx):
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
        # Read model serialize/pt file
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model.to(self.device)
        self.model.eval()
        logger.debug('Transformer model from path {0} loaded successfully'.format(model_dir))
        # Read the mapping file, index to object name
        mapping_file_path = os.path.join(model_dir, "index_to_name.json")
        if os.path.isfile(mapping_file_path):
            with open(mapping_file_path) as f:
                self.mapping = json.load(f)
        else:
            logger.warning('Missing the index_to_name.json file. Inference output will not include class name.')
        self.initialized = True

    def preprocess(self, data):
        """ Very basic preprocessing code - only tokenizes.
            Extend with your own preprocessing steps as needed.
        """
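        text = data[0].get("data")
        if text is None:
            text = data[0].get("body")
        # The payload may arrive as raw bytes or as an already-decoded string
        sentences = text.decode('utf-8') if isinstance(text, (bytes, bytearray)) else text
        logger.info("Received text: '%s'", sentences)
        # Truncate to BERT's 512-token limit so long inputs do not raise an error
        inputs = self.tokenizer.encode_plus(
            sentences,
            add_special_tokens=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )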
        return inputs

    def inference(self, inputs):
        """
        Predict the class of a text using a trained transformer model.
        """
        # NOTE: This makes the assumption that your model expects text to be tokenized 
        # with "input_ids" and "token_type_ids" - which is true for some popular transformer models, e.g. bert.
        # If your transformer model expects different tokenization, adapt this code to suit
        # its expected input format.
        # Gradients are not needed for inference; disabling them saves memory
        with torch.no_grad():
            prediction = self.model(
                inputs['input_ids'].to(self.device),
                token_type_ids=inputs['token_type_ids'].to(self.device)
            )[0].argmax().item()
        logger.info("Model predicted: '%s'", prediction)
        if self.mapping:
            prediction = self.mapping[str(prediction)]
        return [prediction]

    def postprocess(self, inference_output):
        # TODO: Add any needed post-processing of the model predictions here
        return inference_output

_service = TransformersClassifierHandler()

def handle(data, context):
    # Initialize lazily on the first request, then run the standard pipeline
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    data = _service.preprocess(data)
    data = _service.inference(data)
    data = _service.postprocess(data)
    return data
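
The handler above returns only the single highest-scoring class. As a minimal sketch, assuming the model's logits have already been copied into a plain Python list, the following shows how a softmax and a top-k selection could return several candidate labels with scores. The names `softmax` and `top_k_labels` and the three-entry `mapping` are illustrative and are not part of the original handler.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(logits, mapping, k=3):
    """Return the k most probable (label, probability) pairs, best first.

    `mapping` uses string indices as keys, like index_to_name.json in the handler.
    Indices missing from the mapping fall back to the index itself.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(mapping.get(str(i), str(i)), probs[i]) for i in ranked[:k]]

# Hypothetical logits for a toy mapping that covers only the first three labels
mapping = {"0": "Business_Ethics", "1": "Data_Security", "2": "Access_And_Affordability"}
results = top_k_labels([0.2, 1.5, 3.1], mapping, k=2)
```

Inside the handler, the equivalent change would replace the `argmax().item()` call with a conversion of the logits tensor to a list before ranking.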

TorchServe uses a format called MAR (Model Archive). The following command converts the PyTorch model into a .mar file:

torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"
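
Before running the archiver command above, it can help to confirm that the three files it references are actually present. The check below is a small sketch; the helper name `missing_model_files` is hypothetical, and the `./bert_model` directory is taken from the command.

```python
from pathlib import Path

# Files referenced by the torch-model-archiver command
REQUIRED_FILES = ("pytorch_model.bin", "config.json", "vocab.txt")

def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_model_files("./bert_model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All files present; ready to run torch-model-archiver.")
```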

Move the .mar file into a new directory:

mkdir model_store && mv bert.mar model_store

Finally, start TorchServe with the following command:

torchserve --start --model-store model_store --models bert=bert.mar

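You can now query the model from another terminal window through the inference API. Pass a text file containing the text, and the model will try to classify it.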

curl -X POST http://127.0.0.1:8080/predictions/bert -T predict.txt
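
The same query can be issued from Python. This is a sketch using only the standard library. The helper names `build_request` and `classify` are illustrative, and the `application/octet-stream` content type is an assumption chosen so that the body reaches the handler as raw bytes, matching its decode step.

```python
import urllib.request

# Inference endpoint used by the curl command above
URL = "http://127.0.0.1:8080/predictions/bert"

def build_request(text, url=URL):
    """Build a POST request whose body is the UTF-8 encoded text."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        # Assumption: send raw bytes so the handler's decode step applies
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

def classify(text, url=URL, timeout=10):
    """Send text to a running TorchServe instance and return the response body."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With TorchServe running, `print(classify("We reduced our carbon footprint."))` would print whatever the handler returns for that sentence.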

The query returns a label number that corresponds to a text label. The labels are stored in the label_dict.txt dictionary file.

__label__Business_Ethics :  0
__label__Data_Security :  1
__label__Access_And_Affordability :  2
__label__Business_Model_Resilience :  3
__label__Competitive_Behavior :  4
__label__Critical_Incident_Risk_Management :  5
__label__Customer_Welfare :  6
__label__Director_Removal :  7
__label__Employee_Engagement_Inclusion_And_Diversity :  8
__label__Employee_Health_And_Safety :  9
__label__Human_Rights_And_Community_Relations :  10
__label__Labor_Practices :  11
__label__Management_Of_Legal_And_Regulatory_Framework :  12
__label__Physical_Impacts_Of_Climate_Change :  13
__label__Product_Quality_And_Safety :  14
__label__Product_Design_And_Lifecycle_Management :  15
__label__Selling_Practices_And_Product_Labeling :  16
__label__Supply_Chain_Management :  17
__label__Systemic_Risk_Management :  18
__label__Waste_And_Hazardous_Materials_Management :  19
__label__Water_And_Wastewater_Management :  20
__label__Air_Quality :  21
__label__Customer_Privacy :  22
__label__Ecological_Impacts :  23
__label__Energy_Management :  24
__label__GHG_Emissions :  25
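
The handler looks for an optional index_to_name.json file whose keys are class indices stored as strings. As a sketch, the label list above could be converted into that shape as follows. The function name `parse_label_dict` is hypothetical, and the three sample lines are copied from the list.

```python
import json

def parse_label_dict(lines):
    """Turn lines like '__label__Business_Ethics :  0' into {'0': 'Business_Ethics'}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, index = line.rpartition(":")
        label = name.strip()
        if label.startswith("__label__"):
            label = label[len("__label__"):]
        mapping[str(int(index))] = label
    return mapping

sample = [
    "__label__Business_Ethics :  0",
    "__label__Data_Security :  1",
    "__label__GHG_Emissions :  25",
]
index_to_name = parse_label_dict(sample)

# JSON text that could be saved as index_to_name.json alongside the model files
payload = json.dumps(index_to_name, indent=2)
```

For the handler to pick the file up, it would presumably also need to be listed in the `--extra-files` argument when the archive is built.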

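✨ Key Features

Carbon footprint reduction: In fiscal year 2019, the company reduced its comprehensive carbon footprint for the fourth consecutive year, down 35% compared with 2015, when Apple's carbon emissions peaked, while net revenue grew 11% over the same period. In the past year, emissions reduction initiatives avoided more than 10 million metric tons of carbon emissions; for example, the Supplier Clean Energy Program lowered the footprint by 4.4 million metric tons.
Conflict minerals policy: The company considers it essential to establish validated conflict-free sources of 3TG (tin, tantalum, tungsten, and gold) in the Democratic Republic of the Congo and adjoining countries, so it has adopted a conflict minerals policy and formed an internal team to implement it.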

📦 Installation

Install the dependencies:

pip install torchserve torch-model-archiver
pip install torchvision
pip install transformers

Convert the model to a MAR file:

torch-model-archiver --model-name "bert" --version 1.0 --serialized-file ./bert_model/pytorch_model.bin --extra-files "./bert_model/config.json,./bert_model/vocab.txt" --handler "./handler.py"

mkdir model_store && mv bert.mar model_store

Start TorchServe:

torchserve --start --model-store model_store --models bert=bert.mar

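📚 Documentation

Model details

Attribute	Details
Developed by	Mukut Mukherjee, Charan Pothireddi, and Parabole.ai
Shared by (optional)	HuggingFace
Model type	Language model
Language (NLP)	English
License	More information needed
Related models	Parent model: BERT
Resources for more information	GitHub repository, blog post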

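Uses

Direct use

Text mining in sustainable investing.

Downstream use (optional)

ESG-BERT can be applied well beyond text classification: it can be fine-tuned for a range of other downstream NLP tasks in the sustainable investing domain.

Out-of-scope use

The model should not be used to intentionally create hostile or alienating environments for people.

Bias, risks, and limitations

Significant research has explored bias and fairness issues in language models (see, for example, Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. More information is needed to provide further recommendations.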

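Training details

Training data

More information needed.

Training procedure

Preprocessing: More information needed.
Speeds, sizes, times: More information needed.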

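Evaluation

Testing data, factors, and metrics

Testing data: The model fine-tuned for text classification is also available. You can use it directly for predictions in a few simple steps. First, download the fine-tuned pytorch_model.bin, config.json, and vocab.txt.
Factors: More information needed.
Metrics: More information needed.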

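Results

ESG-BERT was further trained on unstructured text data, reaching accuracies of 100% and 98% on the next sentence prediction and masked language modelling tasks, respectively. Fine-tuning ESG-BERT for text classification yielded an F1 score of 0.90. For comparison, the general BERT (BERT-base) model scored 0.79 after fine-tuning, and the scikit-learn approach scored 0.67.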
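
The source does not say how the F1 scores above were averaged across the 26 classes. For readers unfamiliar with the metric, here is a minimal sketch of per-class F1 and its unweighted (macro) average, computed on made-up labels. The macro choice and the sample data are assumptions for illustration only and do not reproduce the reported evaluation.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)

# Made-up gold labels and predictions using class indices from the label list
y_true = [0, 0, 1, 1, 25, 25]
y_pred = [0, 1, 1, 1, 25, 0]
score = macro_f1(y_true, y_pred)
```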

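Model examination

More information needed.

Environmental impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Attribute	Details
Hardware type	More information needed
Hours used	More information needed
Cloud provider	More information needed
Compute region	More information needed
Carbon emitted	More information needed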
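
Estimates of this kind generally multiply the energy consumed by the carbon intensity of the local grid. The sketch below shows that arithmetic with placeholder inputs. The function name `estimate_co2_kg` and every number are hypothetical, since the hardware, hours, and region for ESG-BERT are not reported.

```python
def estimate_co2_kg(power_watts, hours, carbon_intensity_kg_per_kwh, device_count=1):
    """Estimate emissions as energy used (kWh) times grid carbon intensity (kg CO2eq/kWh)."""
    energy_kwh = power_watts * device_count * hours / 1000.0
    return energy_kwh * carbon_intensity_kg_per_kwh

# Placeholder inputs, not ESG-BERT figures:
# one 300 W accelerator running for 10 hours on a 0.4 kg CO2eq/kWh grid
example = estimate_co2_kg(power_watts=300, hours=10, carbon_intensity_kg_per_kwh=0.4)
```

With these placeholder values the estimate comes to about 1.2 kg CO2eq. A real figure would need the missing entries in the table above.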