SEC-BERT
SEC-BERT is a family of BERT models for the financial domain, intended to support financial NLP research and FinTech applications with models tailored to financial text.
Quick Start
SEC-BERT consists of the following models:
- SEC-BERT-BASE: Has the same architecture as BERT-BASE and is trained on financial documents.
- SEC-BERT-NUM: Same as SEC-BERT-BASE, but every number token is replaced with a [NUM] pseudo-token, so all numeric expressions are handled uniformly.
- SEC-BERT-SHAPE (this model): Same as SEC-BERT-BASE, but numbers are replaced with pseudo-tokens that represent each number's shape, so numeric expressions are not fragmented into subwords; e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]' (see the short sketch after this list).
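As a quick illustration of the SHAPE replacement described above, each digit is mapped to 'X' and the result is wrapped in brackets. This is only a minimal sketch, mirroring the fuller preprocessing function shown under Usage Examples below; it is not taken verbatim from the original card.

```python
import re

# Map each digit to 'X' and wrap the result in brackets to obtain the shape pseudo-token.
for number in ("53.2", "40,200.5"):
    print(number, "->", "[" + re.sub(r"\d", "X", number) + "]")
# 53.2 -> [XX.X]
# 40,200.5 -> [XX,XXX.X]
```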

Features
SEC-BERT is designed specifically for the financial domain, so it produces more accurate and relevant results on financial NLP tasks than general-purpose models. Its variants use different strategies for handling numbers in financial text, such as replacing them with pseudo-tokens, which improves the models' ability to understand and process numeric financial data.
Installation
No specific installation steps are provided in the original document. The usage examples below assume that the `transformers` and `spacy` packages (and the `en_core_web_sm` spaCy pipeline) are available in your environment.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")
```
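Once the tokenizer and model are loaded, SEC-BERT-SHAPE can be run like any other BERT encoder. The snippet below is a minimal sketch using the standard Transformers API (it is not part of the original card); it continues from the Basic Usage block above and assumes a sentence that has already been preprocessed into shape pseudo-tokens.

```python
import torch

# Encode a preprocessed sentence and read the contextual embeddings of the last layer.
inputs = tokenizer(
    "total net sales decreased [X] % or $ [X.X] billion during [XXXX] compared to [XXXX] .",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```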
Advanced Usage
```python
import re
import spacy
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
spacy_tokenizer = spacy.load("en_core_web_sm")

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."

def sec_bert_shape_preprocess(text):
    # Tokenize with spaCy, then replace every number with its shape pseudo-token,
    # falling back to [NUM] for shapes that are not in the vocabulary.
    tokens = [t.text for t in spacy_tokenizer(text)]
    processed_text = []
    for token in tokens:
        if re.fullmatch(r"(\d+[\d,.]*)|([,.]\d+)", token):
            shape = '[' + re.sub(r'\d', 'X', token) + ']'
            if shape in tokenizer.additional_special_tokens:
                processed_text.append(shape)
            else:
                processed_text.append('[NUM]')
        else:
            processed_text.append(token)
    return ' '.join(processed_text)

tokenized_sentence = tokenizer.tokenize(sec_bert_shape_preprocess(sentence))
print(tokenized_sentence)
"""
['total', 'net', 'sales', 'decreased', '[X]', '%', 'or', '$', '[X.X]', 'billion', 'during', '[XXXX]', 'compared', 'to', '[XXXX]', '.']
"""
```
Documentation
Pre-training corpus
The model was pre-trained on 260,773 10-K filings from 1993-2019, publicly available at the U.S. Securities and Exchange Commission (SEC).
Pre-training details
- A new vocabulary of 30k subwords was created by training a BertWordPieceTokenizer from scratch on the pre-training corpus (see the sketch after this list).
- BERT was trained using the official code provided in Google BERT's GitHub repository.
- The TF checkpoint was converted to the desired format using Hugging Face's Transformers conversion script, so users can load the model in two lines of code for both PyTorch and TF2.
- A model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters) was released.
- The same training set-up was followed: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
- A single Google Cloud TPU v3-8, provided for free by the TensorFlow Research Cloud (TRC), was used, along with GCP research credits.
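The vocabulary step in the list above can be reproduced roughly with the Hugging Face `tokenizers` library. The sketch below is an assumption about the setup, not the authors' script: the corpus file path is hypothetical and the special-token list is the library default, since the exact training configuration is not given in the card.

```python
from tokenizers import BertWordPieceTokenizer

# Train a 30k-subword WordPiece vocabulary from scratch on the pre-training corpus.
# "sec_10k_filings.txt" is a hypothetical path standing in for the 10-K filings corpus.
wp_tokenizer = BertWordPieceTokenizer(lowercase=True)
wp_tokenizer.train(
    files=["sec_10k_filings.txt"],
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.save_model(".", "sec-bert")  # writes sec-bert-vocab.txt
```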
Using SEC-BERT variants as Language Models
| Sample | Masked Token |
| --- | --- |
| Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018. | decreased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | increased (0.221), were (0.131), are (0.103), rose (0.075), of (0.058) |
| SEC-BERT-BASE | increased (0.678), decreased (0.282), declined (0.017), grew (0.016), rose (0.004) |
| SEC-BERT-NUM | increased (0.753), decreased (0.211), grew (0.019), declined (0.010), rose (0.006) |
| SEC-BERT-SHAPE | increased (0.747), decreased (0.214), grew (0.021), declined (0.013), rose (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018. | billion |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | billion (0.841), million (0.097), trillion (0.028), ##m (0.015), ##bn (0.006) |
| SEC-BERT-BASE | million (0.972), billion (0.028), millions (0.000), ##million (0.000), m (0.000) |
| SEC-BERT-NUM | million (0.974), billion (0.012), , (0.010), thousand (0.003), m (0.000) |
| SEC-BERT-SHAPE | million (0.978), billion (0.021), % (0.000), , (0.000), millions (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased [MASK]% or $5.4 billion during 2019 compared to 2018. | 2 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 20 (0.031), 10 (0.030), 6 (0.029), 4 (0.027), 30 (0.027) |
| SEC-BERT-BASE | 13 (0.045), 12 (0.040), 11 (0.040), 14 (0.035), 10 (0.035) |
| SEC-BERT-NUM | [NUM] (1.000), one (0.000), five (0.000), three (0.000), seven (0.000) |
| SEC-BERT-SHAPE | [XX] (0.316), [XX.X] (0.253), [X.X] (0.237), [X] (0.188), [X.XX] (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2[MASK] or $5.4 billion during 2019 compared to 2018. | % |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | % (0.795), percent (0.174), ##fold (0.009), billion (0.004), times (0.004) |
| SEC-BERT-BASE | % (0.924), percent (0.076), points (0.000), , (0.000), times (0.000) |
| SEC-BERT-NUM | % (0.882), percent (0.118), million (0.000), units (0.000), bps (0.000) |
| SEC-BERT-SHAPE | % (0.961), percent (0.039), bps (0.000), , (0.000), bcf (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $[MASK] billion during 2019 compared to 2018. | 5.4 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 1 (0.074), 4 (0.045), 3 (0.044), 2 (0.037), 5 (0.034) |
| SEC-BERT-BASE | 1 (0.218), 2 (0.136), 3 (0.078), 4 (0.066), 5 (0.048) |
| SEC-BERT-NUM | [NUM] (1.000), l (0.000), 1 (0.000), - (0.000), 30 (0.000) |
| SEC-BERT-SHAPE | [X.X] (0.787), [X.XX] (0.095), [XX.X] (0.049), [X.XXX] (0.046), [X] (0.013) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during [MASK] compared to 2018. | 2019 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.485), 2018 (0.169), 2016 (0.164), 2015 (0.070), 2014 (0.022) |
| SEC-BERT-BASE | 2019 (0.990), 2017 (0.007), 2018 (0.003), 2020 (0.000), 2015 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), as (0.000), fiscal (0.000), year (0.000), when (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), as (0.000), year (0.000), periods (0.000), , (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during 2019 compared to [MASK]. | 2018 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.100), 2016 (0.097), above (0.054), inflation (0.050), previously (0.037) |
| SEC-BERT-BASE | 2018 (0.999), 2019 (0.000), 2017 (0.000), 2016 (0.000), 2014 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), year (0.000), last (0.000), sales (0.000), fiscal (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), year (0.000), sales (0.000), prior (0.000), years (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company [MASK] $67.1 billion of its common stock and paid dividend equivalents of $14.1 billion. | repurchased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | held (0.229), sold (0.192), acquired (0.172), owned (0.052), traded (0.033) |
| SEC-BERT-BASE | repurchased (0.913), issued (0.036), purchased (0.029), redeemed (0.010), sold (0.003) |
| SEC-BERT-NUM | repurchased (0.917), purchased (0.054), reacquired (0.013), issued (0.005), acquired (0.003) |
| SEC-BERT-SHAPE | repurchased (0.902), purchased (0.068), issued (0.010), reacquired (0.008), redeemed (0.006) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common [MASK] and paid dividend equivalents of $14.1 billion. | stock |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | stock (0.835), assets (0.039), equity (0.025), debt (0.021), bonds (0.017) |
| SEC-BERT-BASE | stock (0.857), shares (0.135), equity (0.004), units (0.002), securities (0.000) |
| SEC-BERT-NUM | stock (0.842), shares (0.157), equity (0.000), securities (0.000), units (0.000) |
| SEC-BERT-SHAPE | stock (0.888), shares (0.109), equity (0.001), securities (0.001), stocks (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid [MASK] equivalents of $14.1 billion. | dividend |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | cash (0.276), net (0.128), annual (0.083), the (0.040), debt (0.027) |
| SEC-BERT-BASE | dividend (0.890), cash (0.018), dividends (0.016), share (0.013), tax (0.010) |
| SEC-BERT-NUM | dividend (0.735), cash (0.115), share (0.087), tax (0.025), stock (0.013) |
| SEC-BERT-SHAPE | dividend (0.655), cash (0.248), dividends (0.042), share (0.019), out (0.003) |
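Predictions like those in the tables above can be obtained with the standard Transformers fill-mask pipeline. The snippet below is a sketch under that assumption (the original card does not include the evaluation script); note that for SEC-BERT-SHAPE the input must already contain shape pseudo-tokens, e.g. via `sec_bert_shape_preprocess` from the Advanced Usage example.

```python
from transformers import pipeline

# Query SEC-BERT-SHAPE as a masked language model; the top 5 predictions are returned by default.
fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-shape")

masked_sentence = (
    "total net sales [MASK] [X] % or $ [X.X] billion during [XXXX] compared to [XXXX] ."
)
for prediction in fill_mask(masked_sentence):
    print(f"{prediction['token_str']} ({prediction['score']:.3f})")
```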
Technical Details
The model's pre-training on a large corpus of 10-K filings from 1993-2019 gives it a rich understanding of financial language. The purpose-built vocabulary and the shape-based handling of numbers in financial text contribute to its performance on financial NLP tasks.
License
This model is licensed under the [CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.