SEC-BERT
SEC-BERT is a family of BERT models tailored for the financial domain. It aims to support financial NLP research and FinTech applications, offering specialized models for more accurate financial text analysis.
Quick Start
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
```
Features
SEC-BERT consists of the following models:
- SEC-BERT-BASE (this model): It has the same architecture as BERT-BASE and is trained on financial documents.
- SEC-BERT-NUM: Similar to SEC-BERT-BASE, but every number token is replaced with a [NUM] pseudo-token, so all numeric expressions are handled uniformly and are not fragmented into subwords.
- SEC-BERT-SHAPE: Also similar to SEC-BERT-BASE, but numbers are replaced with pseudo-tokens representing the number's shape, so numeric expressions (of known shapes) are no longer fragmented. For example, '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'. (A preprocessing sketch for these two variants follows this list.)
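A minimal sketch of the kind of number preprocessing these two variants rely on; the helper names and the regular expression below are illustrative assumptions, not the official preprocessing code:

```python
import re

# Matches integers, decimals, and comma-grouped numbers (illustrative pattern).
NUM_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def sec_num_preprocess(text: str) -> str:
    """Replace every number with the [NUM] pseudo-token (SEC-BERT-NUM style)."""
    return NUM_RE.sub("[NUM]", text)

def sec_shape_preprocess(text: str) -> str:
    """Replace every number with a pseudo-token encoding its shape (SEC-BERT-SHAPE style)."""
    return NUM_RE.sub(lambda m: "[" + re.sub(r"\d", "X", m.group(0)) + "]", text)

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."
print(sec_num_preprocess(sentence))
# Total net sales decreased [NUM]% or $[NUM] billion during [NUM] compared to [NUM].
print(sec_shape_preprocess(sentence))
# Total net sales decreased [X]% or $[X.X] billion during [XXXX] compared to [XXXX].
```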
Installation
No dedicated installation steps are required; the models are loaded through the Hugging Face `transformers` library, as shown in the Quick Start above.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
```
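A short continuation of the basic example, showing how the loaded model can encode a financial sentence into contextual embeddings (the example sentence is illustrative):

```python
import torch

# Tokenize a financial sentence and run it through SEC-BERT-BASE.
inputs = tokenizer(
    "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per (sub)word token.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```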
Advanced Usage
The following tables show how different SEC-BERT variants perform as language models in various financial text scenarios (a sketch for reproducing such predictions with the fill-mask pipeline follows the tables):
| Sample | Masked Token |
| --- | --- |
| Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018. | decreased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | increased (0.221), were (0.131), are (0.103), rose (0.075), of (0.058) |
| SEC-BERT-BASE | increased (0.678), decreased (0.282), declined (0.017), grew (0.016), rose (0.004) |
| SEC-BERT-NUM | increased (0.753), decreased (0.211), grew (0.019), declined (0.010), rose (0.006) |
| SEC-BERT-SHAPE | increased (0.747), decreased (0.214), grew (0.021), declined (0.013), rose (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018. | billion |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | billion (0.841), million (0.097), trillion (0.028), ##m (0.015), ##bn (0.006) |
| SEC-BERT-BASE | million (0.972), billion (0.028), millions (0.000), ##million (0.000), m (0.000) |
| SEC-BERT-NUM | million (0.974), billion (0.012), , (0.010), thousand (0.003), m (0.000) |
| SEC-BERT-SHAPE | million (0.978), billion (0.021), % (0.000), , (0.000), millions (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased [MASK]% or $5.4 billion during 2019 compared to 2018. | 2 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 20 (0.031), 10 (0.030), 6 (0.029), 4 (0.027), 30 (0.027) |
| SEC-BERT-BASE | 13 (0.045), 12 (0.040), 11 (0.040), 14 (0.035), 10 (0.035) |
| SEC-BERT-NUM | [NUM] (1.000), one (0.000), five (0.000), three (0.000), seven (0.000) |
| SEC-BERT-SHAPE | [XX] (0.316), [XX.X] (0.253), [X.X] (0.237), [X] (0.188), [X.XX] (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2[MASK] or $5.4 billion during 2019 compared to 2018. | % |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | % (0.795), percent (0.174), ##fold (0.009), billion (0.004), times (0.004) |
| SEC-BERT-BASE | % (0.924), percent (0.076), points (0.000), , (0.000), times (0.000) |
| SEC-BERT-NUM | % (0.882), percent (0.118), million (0.000), units (0.000), bps (0.000) |
| SEC-BERT-SHAPE | % (0.961), percent (0.039), bps (0.000), , (0.000), bcf (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $[MASK] billion during 2019 compared to 2018. | 5.4 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 1 (0.074), 4 (0.045), 3 (0.044), 2 (0.037), 5 (0.034) |
| SEC-BERT-BASE | 1 (0.218), 2 (0.136), 3 (0.078), 4 (0.066), 5 (0.048) |
| SEC-BERT-NUM | [NUM] (1.000), l (0.000), 1 (0.000), - (0.000), 30 (0.000) |
| SEC-BERT-SHAPE | [X.X] (0.787), [X.XX] (0.095), [XX.X] (0.049), [X.XXX] (0.046), [X] (0.013) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during [MASK] compared to 2018. | 2019 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.485), 2018 (0.169), 2016 (0.164), 2015 (0.070), 2014 (0.022) |
| SEC-BERT-BASE | 2019 (0.990), 2017 (0.007), 2018 (0.003), 2020 (0.000), 2015 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), as (0.000), fiscal (0.000), year (0.000), when (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), as (0.000), year (0.000), periods (0.000), , (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during 2019 compared to [MASK]. | 2018 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.100), 2016 (0.097), above (0.054), inflation (0.050), previously (0.037) |
| SEC-BERT-BASE | 2018 (0.999), 2019 (0.000), 2017 (0.000), 2016 (0.000), 2014 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), year (0.000), last (0.000), sales (0.000), fiscal (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), year (0.000), sales (0.000), prior (0.000), years (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company [MASK] $67.1 billion of its common stock and paid dividend equivalents of $14.1 billion. | repurchased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | held (0.229), sold (0.192), acquired (0.172), owned (0.052), traded (0.033) |
| SEC-BERT-BASE | repurchased (0.913), issued (0.036), purchased (0.029), redeemed (0.010), sold (0.003) |
| SEC-BERT-NUM | repurchased (0.917), purchased (0.054), reacquired (0.013), issued (0.005), acquired (0.003) |
| SEC-BERT-SHAPE | repurchased (0.902), purchased (0.068), issued (0.010), reacquired (0.008), redeemed (0.006) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common [MASK] and paid dividend equivalents of $14.1 billion. | stock |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | stock (0.835), assets (0.039), equity (0.025), debt (0.021), bonds (0.017) |
| SEC-BERT-BASE | stock (0.857), shares (0.135), equity (0.004), units (0.002), securities (0.000) |
| SEC-BERT-NUM | stock (0.842), shares (0.157), equity (0.000), securities (0.000), units (0.000) |
| SEC-BERT-SHAPE | stock (0.888), shares (0.109), equity (0.001), securities (0.001), stocks (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid [MASK] equivalents of $14.1 billion. | dividend |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | cash (0.276), net (0.128), annual (0.083), the (0.040), debt (0.027) |
| SEC-BERT-BASE | dividend (0.890), cash (0.018), dividends (0.016), share (0.013), tax (0.010) |
| SEC-BERT-NUM | dividend (0.735), cash (0.115), share (0.087), tax (0.025), stock (0.013) |
| SEC-BERT-SHAPE | dividend (0.655), cash (0.248), dividends (0.042), share (0.019), out (0.003) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid dividend [MASK] of $14.1 billion. | equivalents |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | revenue (0.085), earnings (0.078), rates (0.065), amounts (0.064), proceeds (0.062) |
| SEC-BERT-BASE | payments (0.790), distributions (0.087), equivalents (0.068), cash (0.013), amounts (0.004) |
| SEC-BERT-NUM | payments (0.845), equivalents (0.097), distributions (0.024), increases (0.005), dividends (0.004) |
| SEC-BERT-SHAPE | payments (0.784), equivalents (0.093), distributions (0.043), dividends (0.015), requirements (0.009) |
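Predictions of this kind can be reproduced with the fill-mask pipeline from `transformers`; a minimal sketch is shown below (exact probabilities may differ slightly across library and model versions):

```python
from transformers import pipeline

# Use SEC-BERT-BASE as a masked language model.
fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-base")

sentence = "Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018."
for prediction in fill_mask(sentence, top_k=5):
    print(f"{prediction['token_str']} ({prediction['score']:.3f})")
```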
Documentation
Pre-training corpus
The model was pre-trained on 260,773 10-K filings from 1993-2019, publicly available from the U.S. Securities and Exchange Commission (SEC).
Pre-training details
- A new vocabulary of 30k subwords was created by training a BertWordPieceTokenizer from scratch on the pre-training corpus (see the sketch after this list).
- BERT was trained using the official code provided in Google BERT's GitHub repository.
- Hugging Face's Transformers conversion script was used to convert the TF checkpoint into the desired format, so that users of both PyTorch and TF2 can load the model in two lines of code.
- A model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters) was released.
- The same training set-up was followed: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
- A single Google Cloud TPU v3-8, provided for free by the [TensorFlow Research Cloud (TRC)](https://sites.research.google/trc), was used, along with GCP research credits.
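A minimal sketch of the vocabulary-creation step; the corpus file path and the hyper-parameters other than the 30k vocabulary size are illustrative assumptions:

```python
from tokenizers import BertWordPieceTokenizer

# Train a 30k-subword WordPiece vocabulary from scratch on the pre-training corpus.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["sec_10k_filings.txt"],  # hypothetical path to the 10-K corpus text
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("sec-bert-vocab")  # writes vocab.txt for use with BERT pre-training
```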
Technical Details
The technical details cover the pre-training process: vocabulary creation, model training with Google's official BERT code, checkpoint conversion for Transformers, and the training setup on a Google Cloud TPU funded through TRC and GCP research credits.
License
This project is licensed under the cc-by-sa-4.0 license.
Publication
If you use this model, please cite the following article:
FiNER: Financial Numeric Entity Recognition for XBRL Tagging
Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos and George Paliouras
In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) (Long Papers), Dublin, Republic of Ireland, May 22-27, 2022
@inproceedings{loukas-etal-2022-finer,