🚀 ChronoBERT
ChronoBERT is a series of high-performance, chronologically consistent large language models. It is designed to eliminate lookahead bias and training leakage while maintaining strong language understanding in time-sensitive applications. Models in the series achieve GLUE benchmark scores that surpass standard BERT, which makes them well suited to historical analysis and economic or financial modeling.
🚀 Quick Start
The model is compatible with the `transformers` library starting from v4.48.0. You can install the necessary libraries using the following commands:

```bash
pip install -U "transformers>=4.48.0"
pip install flash-attn
```
Here is an example of using the model:
```python
from transformers import AutoTokenizer, AutoModel

device = 'cuda:0'

# Load the 1999-12-31 vintage of ChronoBERT and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("manelalab/chrono-bert-v1-19991231")
model = AutoModel.from_pretrained("manelalab/chrono-bert-v1-19991231").to(device)

text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality. -- Dr. Brown, Back to the Future Part II"

# Tokenize the text and run a forward pass.
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
```
✨ Features
- Chronological Consistency: Pretrained on diverse, high-quality, open-source, and timestamped text to eliminate lookahead bias and training leakage.
- High Performance: Achieves GLUE benchmark scores that surpass standard BERT.
- Strong Language Understanding: Maintains strong language understanding in time-sensitive applications.
📦 Installation
The model is compatible with the `transformers` library starting from v4.48.0. Use the following commands to install the required libraries:

```bash
pip install -U "transformers>=4.48.0"
pip install flash-attn
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModel

device = 'cuda:0'

# Load the 1999-12-31 vintage of ChronoBERT and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("manelalab/chrono-bert-v1-19991231")
model = AutoModel.from_pretrained("manelalab/chrono-bert-v1-19991231").to(device)

text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality. -- Dr. Brown, Back to the Future Part II"

# Tokenize the text and run a forward pass.
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
```
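The base AutoModel returns token-level hidden states. If a single vector per text is needed (for example, as a document embedding in a downstream forecasting pipeline), one option is mean pooling over non-padding tokens. The snippet below is a minimal sketch that continues the example above; the pooling choice is an illustration and is not prescribed by the model card.

```python
# Continues the example above: outputs.last_hidden_state has shape
# (batch, seq_len, hidden_dim). Mean-pool over non-padding tokens to get
# one embedding per input text (the pooling choice is illustrative only).
hidden = outputs.last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (batch, hidden_dim)
```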
📚 Documentation
Model Description
ChronoBERT is a series of high-performance, chronologically consistent large language models (LLMs) designed to eliminate lookahead bias and training leakage while maintaining strong language understanding in time-sensitive applications. Each model is pretrained only on diverse, high-quality, open-source, and timestamped text available up to its cutoff date, which preserves the integrity of historical analysis and enables more reliable economic and financial modeling. All models in the series achieve GLUE benchmark scores that surpass standard BERT.
- Developed by: Songrun He, Linying Lv, Asaf Manela, Jimmy Wu
- Model type: Transformer-based bidirectional encoder (ModernBERT architecture)
- Language(s) (NLP): English
- License: MIT License
Model Sources
- Paper: "Chronologically Consistent Large Language Models" (He, Lv, Manela, Wu, 2025)
🔧 Technical Details
Training Data
- Pretraining corpus: The initial model, chrono-bert-v1-19991231, is pretrained on 460 billion tokens of pre-2000, diverse, high-quality, and open-source text data, ensuring no leakage of data from later periods.
- Incremental updates: Yearly updates from 2000 to 2024 with an additional 65 billion tokens of timestamped text (a vintage-selection sketch follows this list).
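Because each vintage is trained only on text available up to its cutoff, a backtest can stay free of lookahead bias by loading the newest checkpoint dated on or before the data being scored. The helper below is a hypothetical sketch: it assumes yearly December 31 checkpoint names of the form chrono-bert-v1-YYYYMMDD, following the update schedule above; the actual list of published checkpoints should be confirmed on the Hugging Face hub.

```python
from datetime import date

# Assumed vintages: yearly December 31 cutoffs from 1999 through 2024.
# Verify the actual checkpoint names on the Hugging Face hub before relying on this.
VINTAGES = [date(year, 12, 31) for year in range(1999, 2025)]

def chronologically_safe_checkpoint(as_of: date) -> str:
    """Return the newest ChronoBERT vintage whose cutoff is on or before `as_of`."""
    eligible = [cutoff for cutoff in VINTAGES if cutoff <= as_of]
    if not eligible:
        raise ValueError(f"No ChronoBERT vintage available before {as_of}")
    return f"manelalab/chrono-bert-v1-{max(eligible).strftime('%Y%m%d')}"

# Text dated mid-2005 should be scored with the 2004-12-31 vintage.
print(chronologically_safe_checkpoint(date(2005, 6, 30)))
# manelalab/chrono-bert-v1-20041231
```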
Training Procedure
- Architecture: ModernBERT-based model with rotary embeddings and flash attention.
- Objective: Masked token prediction (a fill-mask sketch follows this list).
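Since the pretraining objective is masked token prediction, a checkpoint can in principle also be loaded with a masked-LM head through the standard transformers fill-mask pipeline. The snippet below is a hedged sketch: the model card does not state whether the public checkpoints ship masked-LM head weights, and the example sentence is ours.

```python
from transformers import pipeline

# Assumes the checkpoint includes masked-LM head weights; if it does not,
# the head will be randomly initialized and the predictions meaningless.
fill = pipeline("fill-mask", model="manelalab/chrono-bert-v1-19991231")
mask = fill.tokenizer.mask_token  # use the tokenizer's mask token string

for pred in fill(f"The market reaction to the earnings announcement was {mask}."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```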
Evaluation
Testing Data, Factors & Metrics
- Language understanding: Evaluated on GLUE benchmark tasks.
- Financial forecasting: Evaluated on a return prediction task based on Dow Jones Newswire data.
- Comparison models: ChronoBERT was benchmarked against BERT, FinBERT, StoriesLM-v1-1963, and Llama 3.1.
Results
- GLUE Score: chrono-bert-v1-19991231 and chrono-bert-v1-20241231 achieve GLUE scores of 84.71 and 85.54, respectively, outperforming standard BERT (84.52).
- Stock return predictions: Over the sample from 2008-01 to 2023-07, chrono-bert-v1-realtime achieves a long-short portfolio Sharpe ratio of 4.80, outperforming BERT, FinBERT, and StoriesLM-v1-1963, and comparable to Llama 3.1 8B (4.90). A sketch of the Sharpe computation follows this list.
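For reference, a long-short portfolio Sharpe ratio is typically computed as the annualized mean of the long-short return series divided by its annualized volatility. The sketch below is a generic illustration with simulated returns, not the paper's exact procedure; the annualization factor assumes daily rebalancing.

```python
import numpy as np

def annualized_sharpe(long_short_returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a long-short return series (252 assumes daily returns)."""
    mean_ret = long_short_returns.mean() * periods_per_year
    vol = long_short_returns.std(ddof=1) * np.sqrt(periods_per_year)
    return mean_ret / vol

# Toy example with simulated daily long-short returns (not ChronoBERT results).
rng = np.random.default_rng(0)
print(round(annualized_sharpe(rng.normal(0.001, 0.01, size=252 * 5)), 2))
```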
📄 License
The model is released under the MIT License.
Citation
```bibtex
@article{He2025ChronoBERT,
  title   = {Chronologically Consistent Large Language Models},
  author  = {He, Songrun and Lv, Linying and Manela, Asaf and Wu, Jimmy},
  journal = {Working Paper},
  year    = {2025}
}
```
Model Card Authors
- Songrun He (Washington University in St. Louis, h.songrun@wustl.edu)
- Linying Lv (Washington University in St. Louis, llyu@wustl.edu)
- Asaf Manela (Washington University in St. Louis, amanela@wustl.edu)
- Jimmy Wu (Washington University in St. Louis, jimmywu@wustl.edu)