CrisisTransformers
CrisisTransformers is a collection of pre-trained language models and sentence encoders. It addresses the need for effective processing of crisis-related social media texts, offering high-performance solutions for tasks such as classification and sentence encoding.
Quick Start
CrisisTransformers is introduced in the papers "CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts" and "Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts". The models are trained on a corpus of over 15 billion word tokens from tweets related to more than 30 crisis events, including disease outbreaks, natural disasters, and conflicts. For more details, refer to the associated papers.
Features
- High-Performance: Evaluated on 18 public crisis-specific datasets, the pre-trained models outperform strong baselines in classification tasks across all datasets. The best-performing mono-lingual sentence encoder outperforms the state-of-the-art by over 17% in sentence encoding tasks.
- Multi-lingual Support: The multi-lingual sentence encoders support 50+ languages, approximating the embedding space of the best-performing mono-lingual encoder.
Documentation
Uses
CrisisTransformers includes 8 pre-trained models, 1 mono-lingual sentence encoder, and 2 multi-lingual sentence encoders. Similar to [BERT](https://huggingface.co/bert-base-cased) and [RoBERTa](https://huggingface.co/roberta-base), the pre-trained models need to be fine-tuned for downstream tasks. The sentence encoders can be used directly, like [Sentence-Transformers](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), for tasks such as semantic search, clustering, and topic modelling.
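The following is a minimal sketch of both entry points, assuming the `transformers` and `sentence-transformers` packages are installed; the example sentence and the printed shape are illustrative only.

```python
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

# Pre-trained model: loaded like BERT/RoBERTa, then fine-tuned for a downstream task
tokenizer = AutoTokenizer.from_pretrained("crisistransformers/CT-M1-Complete")
model = AutoModel.from_pretrained("crisistransformers/CT-M1-Complete")

# Sentence encoder: usable directly, no fine-tuning required
encoder = SentenceTransformer("crisistransformers/CT-M1-Complete-SE")
embeddings = encoder.encode(["Flood waters are rising near the bridge."])  # illustrative tweet
print(embeddings.shape)  # (1, embedding_dim)
```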
Models and naming conventions
- Training Differences: CT-M1 models are trained from scratch for up to 40 epochs. CT-M2 models are initialized with pre-trained RoBERTa's weights, and CT-M3 models are initialized with pre-trained BERTweet's weights, both trained for up to 20 epochs.
- Checkpoint Meanings: OneLook represents the checkpoint after 1 epoch, BestLoss represents the checkpoint with the lowest loss during training, and Complete represents the checkpoint after all epochs. SE represents sentence encoder.
Pre-trained models

| Model | Hugging Face link |
|---|---|
| CT-M1-BestLoss | [crisistransformers/CT-M1-BestLoss](https://huggingface.co/crisistransformers/CT-M1-BestLoss) |
| CT-M1-Complete | [crisistransformers/CT-M1-Complete](https://huggingface.co/crisistransformers/CT-M1-Complete) |
| CT-M2-OneLook | [crisistransformers/CT-M2-OneLook](https://huggingface.co/crisistransformers/CT-M2-OneLook) |
| CT-M2-BestLoss | [crisistransformers/CT-M2-BestLoss](https://huggingface.co/crisistransformers/CT-M2-BestLoss) |
| CT-M2-Complete | [crisistransformers/CT-M2-Complete](https://huggingface.co/crisistransformers/CT-M2-Complete) |
| CT-M3-OneLook | [crisistransformers/CT-M3-OneLook](https://huggingface.co/crisistransformers/CT-M3-OneLook) |
| CT-M3-BestLoss | [crisistransformers/CT-M3-BestLoss](https://huggingface.co/crisistransformers/CT-M3-BestLoss) |
| CT-M3-Complete | [crisistransformers/CT-M3-Complete](https://huggingface.co/crisistransformers/CT-M3-Complete) |
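As an illustration of fine-tuning, the sketch below attaches a classification head to one of the checkpoints above; the checkpoint choice, `num_labels=3`, and the example text are hypothetical placeholders, and the head remains untrained until fine-tuned on a labelled dataset.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any checkpoint from the table above works; CT-M2-OneLook is shown here.
checkpoint = "crisistransformers/CT-M2-OneLook"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels=3 is a placeholder for a hypothetical 3-class crisis dataset;
# a randomly initialised classification head is added on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

inputs = tokenizer("Power lines are down across the city.",
                   return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # shape (1, 3); meaningful only after fine-tuning
```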
Sentence encoders

| Model | Hugging Face link |
|---|---|
| CT-M1-Complete-SE (mono-lingual: EN) | [crisistransformers/CT-M1-Complete-SE](https://huggingface.co/crisistransformers/CT-M1-Complete-SE) |
| CT-XLMR-SE (multi-lingual) | [crisistransformers/CT-XLMR-SE](https://huggingface.co/crisistransformers/CT-XLMR-SE) |
| CT-mBERT-SE (multi-lingual) | [crisistransformers/CT-mBERT-SE](https://huggingface.co/crisistransformers/CT-mBERT-SE) |
The multi-lingual sentence encoders support languages such as Albanian, Arabic, Armenian, and many others.
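To illustrate the shared cross-lingual embedding space, here is a small sketch assuming `sentence-transformers`; the English/Spanish pair is an illustrative translation, not an example from the papers.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("crisistransformers/CT-XLMR-SE")

# Semantically equivalent sentences in English and Spanish (illustrative)
texts = [
    "The earthquake destroyed several buildings downtown.",
    "El terremoto destruyó varios edificios en el centro.",
]
embeddings = encoder.encode(texts, convert_to_tensor=True)

# Cosine similarity should be high for equivalent sentences across languages
print(float(util.cos_sim(embeddings[0], embeddings[1])))
```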
License
Citation
If you use CrisisTransformers or the mono-lingual sentence encoder, please cite the following paper:

```bibtex
@article{lamsal2023crisistransformers,
  title={CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts},
  author={Rabindra Lamsal and Maria Rodriguez Read and Shanika Karunasekera},
  journal={Knowledge-Based Systems},
  pages={111916},
  year={2024},
  publisher={Elsevier}
}
```
If you use the multi-lingual sentence encoders, please cite the following paper:

```bibtex
@article{lamsal2024semantically,
  title={Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts},
  author={Rabindra Lamsal and Maria Rodriguez Read and Shanika Karunasekera},
  year={2024},
  eprint={2403.16614},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```