BERT small Japanese finance
This is a BERT model pretrained on Japanese texts, which can be used for financial text mining and related tasks.
Quick Start
The code used for pretraining is available at retarfi/language-pretraining.
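A minimal usage sketch with Hugging Face Transformers is shown below. The model identifier izumi-lab/bert-small-japanese-fin is an assumption (substitute the actual repository name), and loading the tokenizer requires the fugashi and ipadic packages for MeCab.

```python
# Minimal fill-mask sketch (assumes the model is published as
# "izumi-lab/bert-small-japanese-fin"; substitute the actual repository name).
# Requires: pip install transformers torch fugashi ipadic
import torch
from transformers import BertJapaneseTokenizer, BertForMaskedLM

model_name = "izumi-lab/bert-small-japanese-fin"  # assumed identifier
tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# "This period's [MASK] increased year on year."
text = "当期の[MASK]は前年同期比で増加した。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the top-5 predictions for the masked position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```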
Features
- The model architecture is based on BERT small, suitable for Japanese text processing.
- Trained on both Wikipedia and financial corpora, enhancing its performance in the financial domain.
Documentation
Model architecture
The model architecture is the same as BERT small in the original ELECTRA paper: 12 layers, 256-dimensional hidden states, and 4 attention heads.
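For reference, these settings correspond roughly to the Hugging Face BertConfig sketched below; the intermediate (feed-forward) size and maximum sequence length are assumptions based on common BERT-small conventions, not values given in this card.

```python
from transformers import BertConfig

# Illustrative only: BERT-small-style configuration matching the stated
# 12 layers, 256-dimensional hidden states, 4 attention heads, and the
# 32768-token vocabulary mentioned under Tokenization.
config = BertConfig(
    vocab_size=32768,
    hidden_size=256,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=1024,        # assumed: 4 x hidden_size
    max_position_embeddings=512,   # assumed default
)
print(config)
```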
Training Data
| Property | Details |
|----------|---------|
| Model Type | BERT small Japanese finance |
| Training Data | The models are trained on a Wikipedia corpus and a financial corpus. The Wikipedia corpus is generated from the Japanese Wikipedia dump file as of June 1, 2021; the corpus file is 2.9GB, consisting of approximately 20M sentences. The financial corpus consists of two corpora: summaries of financial results from October 9, 2012, to December 31, 2020, and securities reports from February 8, 2018, to December 31, 2020; the financial corpus file is 5.2GB, consisting of approximately 27M sentences. |
Tokenization
The texts are first tokenized by MeCab with the IPA dictionary and then split into subwords by the WordPiece algorithm. The vocabulary size is 32768.
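This two-stage tokenization can be exercised through BertJapaneseTokenizer, as in the sketch below; the model identifier and the explicit tokenizer arguments are assumptions, since a published tokenizer configuration would normally set them already.

```python
from transformers import BertJapaneseTokenizer

# Sketch of the MeCab (word-level) + WordPiece (subword-level) pipeline.
# The identifier is assumed; the saved tokenizer config would normally
# already select MeCab and WordPiece.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/bert-small-japanese-fin",  # assumed identifier
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
)

# "Analyze the summary of financial results."
print(tokenizer.tokenize("決算短信を分析する。"))
print(tokenizer.vocab_size)  # 32768
```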
Training
The models are trained with the same configuration as BERT small in the original ELECTRA paper: 128 tokens per instance, 128 instances per batch, and 1.45M training steps.
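As a rough reference, the stated hyperparameters can be summarized as below; the field names are illustrative, and the authoritative configuration lives in the retarfi/language-pretraining repository.

```python
# Illustrative summary of the stated pretraining setup (field names are not
# taken from the actual training scripts).
pretraining_config = {
    "max_seq_length": 128,          # tokens per instance
    "batch_size": 128,              # instances per batch
    "training_steps": 1_450_000,    # 1.45M steps
}

# Approximate number of token positions processed during pretraining:
# 128 * 128 * 1.45M ≈ 23.8 billion.
total_positions = 128 * 128 * 1_450_000
print(f"{total_positions:,}")  # 23,756,800,000
```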
Citation
@article{Suzuki-etal-2023-ipm,
title = {Constructing and analyzing domain-specific language model for financial text mining},
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
journal = {Information Processing \& Management},
volume = {60},
number = {2},
pages = {103194},
year = {2023},
doi = {10.1016/j.ipm.2022.103194}
}
License
The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license.
Technical Details
The model is built on the BERT small architecture, which has proven effective across natural language processing tasks. Training on a large-scale Japanese Wikipedia corpus together with a financial corpus allows the model to better capture the semantics and context of Japanese financial texts. Tokenization with MeCab and the WordPiece algorithm helps handle the structure of the Japanese language.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP21K12010.