SEC-BERT
SEC-BERT is a family of BERT models for the financial domain, intended to support financial NLP research and FinTech applications with models tailored to financial text.
Quick Start
SEC-BERT consists of the following models:
- SEC-BERT-BASE: Has the same architecture as BERT-BASE and is trained on financial documents.
- SEC-BERT-NUM: Same as SEC-BERT-BASE, but every number token is replaced with a [NUM] pseudo-token, so all numeric expressions are handled uniformly.
- SEC-BERT-SHAPE (this model): Same as SEC-BERT-BASE, but numbers are replaced with pseudo-tokens that represent each number's shape, so numeric expressions are not fragmented into subwords; e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]' (see the short sketch after this list).
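As a quick illustration of the SHAPE replacement described above, each digit is mapped to 'X' and the result is wrapped in brackets. This is only a minimal sketch, mirroring the fuller preprocessing function shown under Usage Examples below; it is not taken verbatim from the original card.

```python
import re

# Map each digit to 'X' and wrap the result in brackets to obtain the shape pseudo-token.
for number in ("53.2", "40,200.5"):
    print(number, "->", "[" + re.sub(r"\d", "X", number) + "]")
# 53.2 -> [XX.X]
# 40,200.5 -> [XX,XXX.X]
```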

Features
SEC-BERT is designed specifically for the financial domain, so it produces more accurate and relevant results on financial NLP tasks than general-purpose models. Its variants use different strategies for handling numbers in financial text, such as replacing them with pseudo-tokens, which improves the models' ability to understand and process numeric financial data.
Installation
No specific installation steps are provided in the original document. The usage examples below assume that the `transformers` and `spacy` packages (and the `en_core_web_sm` spaCy pipeline) are available in your environment.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")
```
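Once the tokenizer and model are loaded, SEC-BERT-SHAPE can be run like any other BERT encoder. The snippet below is a minimal sketch using the standard Transformers API (it is not part of the original card); it continues from the Basic Usage block above and assumes a sentence that has already been preprocessed into shape pseudo-tokens.

```python
import torch

# Encode a preprocessed sentence and read the contextual embeddings of the last layer.
inputs = tokenizer(
    "total net sales decreased [X] % or $ [X.X] billion during [XXXX] compared to [XXXX] .",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```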
Advanced Usage
```python
import re
import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
spacy_tokenizer = spacy.load("en_core_web_sm")

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."

def sec_bert_shape_preprocess(text):
    # Tokenize with spaCy, then replace every number with its shape pseudo-token,
    # falling back to [NUM] for shapes that are not in the vocabulary.
    tokens = [t.text for t in spacy_tokenizer(text)]
    processed_text = []
    for token in tokens:
        if re.fullmatch(r"(\d+[\d,.]*)|([,.]\d+)", token):
            shape = '[' + re.sub(r'\d', 'X', token) + ']'
            if shape in tokenizer.additional_special_tokens:
                processed_text.append(shape)
            else:
                processed_text.append('[NUM]')
        else:
            processed_text.append(token)
    return ' '.join(processed_text)

tokenized_sentence = tokenizer.tokenize(sec_bert_shape_preprocess(sentence))
print(tokenized_sentence)
"""
['total', 'net', 'sales', 'decreased', '[X]', '%', 'or', '$', '[X.X]', 'billion', 'during', '[XXXX]', 'compared', 'to', '[XXXX]', '.']
"""
```
Documentation
Pre-training corpus
The model was pre-trained on 260,773 10-K filings from 1993-2019, publicly available at the U.S. Securities and Exchange Commission (SEC).
Pre-training details
- A new vocabulary of 30k subwords was created by training a BertWordPieceTokenizer from scratch on the pre-training corpus (see the sketch after this list).
- BERT was trained using the official code provided in Google BERT's GitHub repository.
- The TF checkpoint was converted to the desired format using Hugging Face's Transformers conversion script, so users can load the model in two lines of code for both PyTorch and TF2.
- A model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters) was released.
- The same training set-up was followed: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
- A single Google Cloud TPU v3-8, provided for free by the TensorFlow Research Cloud (TRC), was used, along with GCP research credits.
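The vocabulary step in the list above can be reproduced roughly with the Hugging Face `tokenizers` library. The sketch below is an assumption about the setup, not the authors' script: the corpus file path is hypothetical and the special-token list is the library default, since the exact training configuration is not given in the card.

```python
from tokenizers import BertWordPieceTokenizer

# Train a 30k-subword WordPiece vocabulary from scratch on the pre-training corpus.
# "sec_10k_filings.txt" is a hypothetical path standing in for the 10-K filings corpus.
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train(
    files=["sec_10k_filings.txt"],
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.save_model(".", "sec-bert")  # writes sec-bert-vocab.txt
```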
Using SEC-BERT variants as Language Models
| Sample | Masked Token |
| --- | --- |
| Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018. | decreased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | increased (0.221), were (0.131), are (0.103), rose (0.075), of (0.058) |
| SEC-BERT-BASE | increased (0.678), decreased (0.282), declined (0.017), grew (0.016), rose (0.004) |
| SEC-BERT-NUM | increased (0.753), decreased (0.211), grew (0.019), declined (0.010), rose (0.006) |
| SEC-BERT-SHAPE | increased (0.747), decreased (0.214), grew (0.021), declined (0.013), rose (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018. | billion |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | billion (0.841), million (0.097), trillion (0.028), ##m (0.015), ##bn (0.006) |
| SEC-BERT-BASE | million (0.972), billion (0.028), millions (0.000), ##million (0.000), m (0.000) |
| SEC-BERT-NUM | million (0.974), billion (0.012), , (0.010), thousand (0.003), m (0.000) |
| SEC-BERT-SHAPE | million (0.978), billion (0.021), % (0.000), , (0.000), millions (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased [MASK]% or $5.4 billion during 2019 compared to 2018. | 2 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 20 (0.031), 10 (0.030), 6 (0.029), 4 (0.027), 30 (0.027) |
| SEC-BERT-BASE | 13 (0.045), 12 (0.040), 11 (0.040), 14 (0.035), 10 (0.035) |
| SEC-BERT-NUM | [NUM] (1.000), one (0.000), five (0.000), three (0.000), seven (0.000) |
| SEC-BERT-SHAPE | [XX] (0.316), [XX.X] (0.253), [X.X] (0.237), [X] (0.188), [X.XX] (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2[MASK] or $5.4 billion during 2019 compared to 2018. | % |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | % (0.795), percent (0.174), ##fold (0.009), billion (0.004), times (0.004) |
| SEC-BERT-BASE | % (0.924), percent (0.076), points (0.000), , (0.000), times (0.000) |
| SEC-BERT-NUM | % (0.882), percent (0.118), million (0.000), units (0.000), bps (0.000) |
| SEC-BERT-SHAPE | % (0.961), percent (0.039), bps (0.000), , (0.000), bcf (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $[MASK] billion during 2019 compared to 2018. | 5.4 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 1 (0.074), 4 (0.045), 3 (0.044), 2 (0.037), 5 (0.034) |
| SEC-BERT-BASE | 1 (0.218), 2 (0.136), 3 (0.078), 4 (0.066), 5 (0.048) |
| SEC-BERT-NUM | [NUM] (1.000), l (0.000), 1 (0.000), - (0.000), 30 (0.000) |
| SEC-BERT-SHAPE | [X.X] (0.787), [X.XX] (0.095), [XX.X] (0.049), [X.XXX] (0.046), [X] (0.013) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during [MASK] compared to 2018. | 2019 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.485), 2018 (0.169), 2016 (0.164), 2015 (0.070), 2014 (0.022) |
| SEC-BERT-BASE | 2019 (0.990), 2017 (0.007), 2018 (0.003), 2020 (0.000), 2015 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), as (0.000), fiscal (0.000), year (0.000), when (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), as (0.000), year (0.000), periods (0.000), , (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during 2019 compared to [MASK]. | 2018 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.100), 2016 (0.097), above (0.054), inflation (0.050), previously (0.037) |
| SEC-BERT-BASE | 2018 (0.999), 2019 (0.000), 2017 (0.000), 2016 (0.000), 2014 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), year (0.000), last (0.000), sales (0.000), fiscal (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), year (0.000), sales (0.000), prior (0.000), years (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company [MASK] $67.1 billion of its common stock and paid dividend equivalents of $14.1 billion. | repurchased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | held (0.229), sold (0.192), acquired (0.172), owned (0.052), traded (0.033) |
| SEC-BERT-BASE | repurchased (0.913), issued (0.036), purchased (0.029), redeemed (0.010), sold (0.003) |
| SEC-BERT-NUM | repurchased (0.917), purchased (0.054), reacquired (0.013), issued (0.005), acquired (0.003) |
| SEC-BERT-SHAPE | repurchased (0.902), purchased (0.068), issued (0.010), reacquired (0.008), redeemed (0.006) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common [MASK] and paid dividend equivalents of $14.1 billion. | stock |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | stock (0.835), assets (0.039), equity (0.025), debt (0.021), bonds (0.017) |
| SEC-BERT-BASE | stock (0.857), shares (0.135), equity (0.004), units (0.002), securities (0.000) |
| SEC-BERT-NUM | stock (0.842), shares (0.157), equity (0.000), securities (0.000), units (0.000) |
| SEC-BERT-SHAPE | stock (0.888), shares (0.109), equity (0.001), securities (0.001), stocks (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid [MASK] equivalents of $14.1 billion. | dividend |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | cash (0.276), net (0.128), annual (0.083), the (0.040), debt (0.027) |
| SEC-BERT-BASE | dividend (0.890), cash (0.018), dividends (0.016), share (0.013), tax (0.010) |
| SEC-BERT-NUM | dividend (0.735), cash (0.115), share (0.087), tax (0.025), stock (0.013) |
| SEC-BERT-SHAPE | dividend (0.655), cash (0.248), dividends (0.042), share (0.019), out (0.003) |
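Predictions like those in the tables above can be obtained with the standard Transformers fill-mask pipeline. The snippet below is a sketch under that assumption (the original card does not include the evaluation script); note that for SEC-BERT-SHAPE the input must already contain shape pseudo-tokens, e.g. via `sec_bert_shape_preprocess` from the Advanced Usage example.

```python
from transformers import pipeline

# Query SEC-BERT-SHAPE as a masked language model; the top 5 predictions are returned by default.
fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-shape")

masked_sentence = (
    "total net sales [MASK] [X] % or $ [X.X] billion during [XXXX] compared to [XXXX] ."
)
for prediction in fill_mask(masked_sentence):
    print(f"{prediction['token_str']} ({prediction['score']:.3f})")
```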
Technical Details
The model's pre-training on a large corpus of 10-K filings from 1993-2019 gives it a rich understanding of financial language. The purpose-built vocabulary and the shape-based handling of numbers in financial text contribute to its performance on financial NLP tasks.
License
This model is licensed under the [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.