SEC-BERT
SEC-BERT is a family of BERT models tailored for the financial domain. It aims to support financial NLP research and FinTech applications, offering specialized models for more accurate financial text analysis.
Quick Start
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
```
Features
SEC-BERT consists of the following models:
- SEC-BERT-BASE (this model): It has the same architecture as BERT-BASE and is trained on financial documents.
- SEC-BERT-NUM: Similar to SEC-BERT-BASE, but every number token is replaced with a [NUM] pseudo-token, so all numeric expressions are handled uniformly and are not fragmented into subwords.
- SEC-BERT-SHAPE: Also similar to SEC-BERT-BASE, but numbers are replaced with pseudo-tokens representing the number's shape, so numeric expressions (of known shapes) are no longer fragmented. For example, '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'. (A preprocessing sketch for these two variants follows this list.)
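A minimal sketch of the kind of number preprocessing these two variants rely on; the helper names and the regular expression below are illustrative assumptions, not the official preprocessing code:

```python
import re

# Matches integers, decimals, and comma-grouped numbers (illustrative pattern).
NUM_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def sec_num_preprocess(text: str) -> str:
    """Replace every number with the [NUM] pseudo-token (SEC-BERT-NUM style)."""
    return NUM_RE.sub("[NUM]", text)

def sec_shape_preprocess(text: str) -> str:
    """Replace every number with a pseudo-token encoding its shape (SEC-BERT-SHAPE style)."""
    return NUM_RE.sub(lambda m: "[" + re.sub(r"\d", "X", m.group(0)) + "]", text)

sentence = "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018."
print(sec_num_preprocess(sentence))
# Total net sales decreased [NUM]% or $[NUM] billion during [NUM] compared to [NUM].
print(sec_shape_preprocess(sentence))
# Total net sales decreased [X]% or $[X.X] billion during [XXXX] compared to [XXXX].
```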
Installation
No dedicated installation steps are required; the models are loaded through the Hugging Face `transformers` library, as shown in the Quick Start above.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
```
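A short continuation of the basic example, showing how the loaded model can encode a financial sentence into contextual embeddings (the example sentence is illustrative):

```python
import torch

# Tokenize a financial sentence and run it through SEC-BERT-BASE.
inputs = tokenizer(
    "Total net sales decreased 2% or $5.4 billion during 2019 compared to 2018.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per (sub)word token.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```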
Advanced Usage
The following tables show how different SEC-BERT variants perform as language models in various financial text scenarios (a sketch for reproducing such predictions with the fill-mask pipeline follows the tables):
| Sample | Masked Token |
| --- | --- |
| Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018. | decreased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | increased (0.221), were (0.131), are (0.103), rose (0.075), of (0.058) |
| SEC-BERT-BASE | increased (0.678), decreased (0.282), declined (0.017), grew (0.016), rose (0.004) |
| SEC-BERT-NUM | increased (0.753), decreased (0.211), grew (0.019), declined (0.010), rose (0.006) |
| SEC-BERT-SHAPE | increased (0.747), decreased (0.214), grew (0.021), declined (0.013), rose (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 [MASK] during 2019 compared to 2018. | billion |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | billion (0.841), million (0.097), trillion (0.028), ##m (0.015), ##bn (0.006) |
| SEC-BERT-BASE | million (0.972), billion (0.028), millions (0.000), ##million (0.000), m (0.000) |
| SEC-BERT-NUM | million (0.974), billion (0.012), , (0.010), thousand (0.003), m (0.000) |
| SEC-BERT-SHAPE | million (0.978), billion (0.021), % (0.000), , (0.000), millions (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased [MASK]% or $5.4 billion during 2019 compared to 2018. | 2 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 20 (0.031), 10 (0.030), 6 (0.029), 4 (0.027), 30 (0.027) |
| SEC-BERT-BASE | 13 (0.045), 12 (0.040), 11 (0.040), 14 (0.035), 10 (0.035) |
| SEC-BERT-NUM | [NUM] (1.000), one (0.000), five (0.000), three (0.000), seven (0.000) |
| SEC-BERT-SHAPE | [XX] (0.316), [XX.X] (0.253), [X.X] (0.237), [X] (0.188), [X.XX] (0.002) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2[MASK] or $5.4 billion during 2019 compared to 2018. | % |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | % (0.795), percent (0.174), ##fold (0.009), billion (0.004), times (0.004) |
| SEC-BERT-BASE | % (0.924), percent (0.076), points (0.000), , (0.000), times (0.000) |
| SEC-BERT-NUM | % (0.882), percent (0.118), million (0.000), units (0.000), bps (0.000) |
| SEC-BERT-SHAPE | % (0.961), percent (0.039), bps (0.000), , (0.000), bcf (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $[MASK] billion during 2019 compared to 2018. | 5.4 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 1 (0.074), 4 (0.045), 3 (0.044), 2 (0.037), 5 (0.034) |
| SEC-BERT-BASE | 1 (0.218), 2 (0.136), 3 (0.078), 4 (0.066), 5 (0.048) |
| SEC-BERT-NUM | [NUM] (1.000), l (0.000), 1 (0.000), - (0.000), 30 (0.000) |
| SEC-BERT-SHAPE | [X.X] (0.787), [X.XX] (0.095), [XX.X] (0.049), [X.XXX] (0.046), [X] (0.013) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during [MASK] compared to 2018. | 2019 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.485), 2018 (0.169), 2016 (0.164), 2015 (0.070), 2014 (0.022) |
| SEC-BERT-BASE | 2019 (0.990), 2017 (0.007), 2018 (0.003), 2020 (0.000), 2015 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), as (0.000), fiscal (0.000), year (0.000), when (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), as (0.000), year (0.000), periods (0.000), , (0.000) |

| Sample | Masked Token |
| --- | --- |
| Total net sales decreased 2% or $5.4 billion during 2019 compared to [MASK]. | 2018 |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | 2017 (0.100), 2016 (0.097), above (0.054), inflation (0.050), previously (0.037) |
| SEC-BERT-BASE | 2018 (0.999), 2019 (0.000), 2017 (0.000), 2016 (0.000), 2014 (0.000) |
| SEC-BERT-NUM | [NUM] (1.000), year (0.000), last (0.000), sales (0.000), fiscal (0.000) |
| SEC-BERT-SHAPE | [XXXX] (1.000), year (0.000), sales (0.000), prior (0.000), years (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company [MASK] $67.1 billion of its common stock and paid dividend equivalents of $14.1 billion. | repurchased |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | held (0.229), sold (0.192), acquired (0.172), owned (0.052), traded (0.033) |
| SEC-BERT-BASE | repurchased (0.913), issued (0.036), purchased (0.029), redeemed (0.010), sold (0.003) |
| SEC-BERT-NUM | repurchased (0.917), purchased (0.054), reacquired (0.013), issued (0.005), acquired (0.003) |
| SEC-BERT-SHAPE | repurchased (0.902), purchased (0.068), issued (0.010), reacquired (0.008), redeemed (0.006) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common [MASK] and paid dividend equivalents of $14.1 billion. | stock |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | stock (0.835), assets (0.039), equity (0.025), debt (0.021), bonds (0.017) |
| SEC-BERT-BASE | stock (0.857), shares (0.135), equity (0.004), units (0.002), securities (0.000) |
| SEC-BERT-NUM | stock (0.842), shares (0.157), equity (0.000), securities (0.000), units (0.000) |
| SEC-BERT-SHAPE | stock (0.888), shares (0.109), equity (0.001), securities (0.001), stocks (0.000) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid [MASK] equivalents of $14.1 billion. | dividend |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | cash (0.276), net (0.128), annual (0.083), the (0.040), debt (0.027) |
| SEC-BERT-BASE | dividend (0.890), cash (0.018), dividends (0.016), share (0.013), tax (0.010) |
| SEC-BERT-NUM | dividend (0.735), cash (0.115), share (0.087), tax (0.025), stock (0.013) |
| SEC-BERT-SHAPE | dividend (0.655), cash (0.248), dividends (0.042), share (0.019), out (0.003) |

| Sample | Masked Token |
| --- | --- |
| During 2019, the Company repurchased $67.1 billion of its common stock and paid dividend [MASK] of $14.1 billion. | equivalents |

| Model | Predictions (Probability) |
| --- | --- |
| BERT-BASE-UNCASED | revenue (0.085), earnings (0.078), rates (0.065), amounts (0.064), proceeds (0.062) |
| SEC-BERT-BASE | payments (0.790), distributions (0.087), equivalents (0.068), cash (0.013), amounts (0.004) |
| SEC-BERT-NUM | payments (0.845), equivalents (0.097), distributions (0.024), increases (0.005), dividends (0.004) |
| SEC-BERT-SHAPE | payments (0.784), equivalents (0.093), distributions (0.043), dividends (0.015), requirements (0.009) |
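Predictions of this kind can be reproduced with the fill-mask pipeline from `transformers`; a minimal sketch is shown below (exact probabilities may differ slightly across library and model versions):

```python
from transformers import pipeline

# Use SEC-BERT-BASE as a masked language model.
fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-base")

sentence = "Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018."
for prediction in fill_mask(sentence, top_k=5):
    print(f"{prediction['token_str']} ({prediction['score']:.3f})")
```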
Documentation
Pre-training corpus
The model was pre-trained on 260,773 10-K filings from 1993-2019, publicly available from the U.S. Securities and Exchange Commission (SEC).
Pre-training details
- A new vocabulary of 30k subwords was created by training a BertWordPieceTokenizer from scratch on the pre-training corpus (see the sketch after this list).
- BERT was trained using the official code provided in Google BERT's GitHub repository.
- Hugging Face's Transformers conversion script was used to convert the TF checkpoint into the desired format, so that users of both PyTorch and TF2 can load the model in two lines of code.
- A model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters) was released.
- The same training set-up was followed: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
- A single Google Cloud TPU v3-8, provided for free by the [TensorFlow Research Cloud (TRC)](https://sites.research.google/trc), was used, along with GCP research credits.
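A minimal sketch of the vocabulary-creation step; the corpus file path and the hyper-parameters other than the 30k vocabulary size are illustrative assumptions:

```python
from tokenizers import BertWordPieceTokenizer

# Train a 30k-subword WordPiece vocabulary from scratch on the pre-training corpus.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["sec_10k_filings.txt"],  # hypothetical path to the 10-K corpus text
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("sec-bert-vocab")  # writes vocab.txt for use with BERT pre-training
```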
Technical Details
The technical details cover the pre-training process: vocabulary creation, model training with Google's official BERT code, checkpoint conversion for Transformers, and the training setup on a Google Cloud TPU funded through TRC and GCP research credits.
License
This project is licensed under the cc-by-sa-4.0 license.
Publication
If you use this model, please cite the following article:
FiNER: Financial Numeric Entity Recognition for XBRL Tagging
Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos and George Paliouras
In the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) (Long Papers), Dublin, Republic of Ireland, May 22-27, 2022
@inproceedings{loukas-etal-2022-finer,