CrisisTransformers
CrisisTransformers is a collection of pre-trained language models and sentence encoders. It addresses the need for effective processing of crisis-related social media texts, offering high-performance solutions for tasks such as classification and sentence encoding.
Quick Start
CrisisTransformers is introduced in the papers "CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts" and "Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts". The models are trained on a corpus of over 15 billion word tokens from tweets related to more than 30 crisis events, including disease outbreaks, natural disasters, and conflicts. For more details, refer to the associated papers.
Features
- High-Performance: Evaluated on 18 public crisis-specific datasets, the pre-trained models outperform strong baselines in classification tasks across all datasets. The best-performing mono-lingual sentence encoder outperforms the state-of-the-art by over 17% in sentence encoding tasks.
- Multi-lingual Support: The multi-lingual sentence encoders support 50+ languages, approximating the embedding space of the best-performing mono-lingual encoder.
Documentation
Uses
CrisisTransformers includes 8 pre-trained models, 1 mono-lingual sentence encoder, and 2 multi-lingual sentence encoders. Similar to [BERT](https://huggingface.co/bert-base-cased) and [RoBERTa](https://huggingface.co/roberta-base), the pre-trained models need to be fine-tuned for downstream tasks. The sentence encoders can be used directly, like [Sentence-Transformers](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), for tasks such as semantic search, clustering, and topic modelling.
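The following is a minimal sketch of both entry points, assuming the `transformers` and `sentence-transformers` packages are installed; the example sentence and the printed shape are illustrative only.

```python
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

# Pre-trained model: loaded like BERT/RoBERTa, then fine-tuned for a downstream task
tokenizer = AutoTokenizer.from_pretrained("crisistransformers/CT-M1-Complete")
model = AutoModel.from_pretrained("crisistransformers/CT-M1-Complete")

# Sentence encoder: usable directly, no fine-tuning required
encoder = SentenceTransformer("crisistransformers/CT-M1-Complete-SE")
embeddings = encoder.encode(["Flood waters are rising near the bridge."])  # illustrative tweet
print(embeddings.shape)  # (1, embedding_dim)
```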
Models and naming conventions
- Training Differences: CT-M1 models are trained from scratch for up to 40 epochs. CT-M2 models are initialized with pre-trained RoBERTa's weights, and CT-M3 models are initialized with pre-trained BERTweet's weights, both trained for up to 20 epochs.
- Checkpoint Meanings: OneLook represents the checkpoint after 1 epoch, BestLoss represents the checkpoint with the lowest loss during training, and Complete represents the checkpoint after all epochs. SE represents sentence encoder.
Pre-trained models

| Model | Hugging Face link |
|---|---|
| CT-M1-BestLoss | [crisistransformers/CT-M1-BestLoss](https://huggingface.co/crisistransformers/CT-M1-BestLoss) |
| CT-M1-Complete | [crisistransformers/CT-M1-Complete](https://huggingface.co/crisistransformers/CT-M1-Complete) |
| CT-M2-OneLook | [crisistransformers/CT-M2-OneLook](https://huggingface.co/crisistransformers/CT-M2-OneLook) |
| CT-M2-BestLoss | [crisistransformers/CT-M2-BestLoss](https://huggingface.co/crisistransformers/CT-M2-BestLoss) |
| CT-M2-Complete | [crisistransformers/CT-M2-Complete](https://huggingface.co/crisistransformers/CT-M2-Complete) |
| CT-M3-OneLook | [crisistransformers/CT-M3-OneLook](https://huggingface.co/crisistransformers/CT-M3-OneLook) |
| CT-M3-BestLoss | [crisistransformers/CT-M3-BestLoss](https://huggingface.co/crisistransformers/CT-M3-BestLoss) |
| CT-M3-Complete | [crisistransformers/CT-M3-Complete](https://huggingface.co/crisistransformers/CT-M3-Complete) |
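As an illustration of fine-tuning, the sketch below attaches a classification head to one of the checkpoints above; the checkpoint choice, `num_labels=3`, and the example text are hypothetical placeholders, and the head remains untrained until fine-tuned on a labelled dataset.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any checkpoint from the table above works; CT-M2-OneLook is shown here.
checkpoint = "crisistransformers/CT-M2-OneLook"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels=3 is a placeholder for a hypothetical 3-class crisis dataset;
# a randomly initialised classification head is added on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

inputs = tokenizer("Power lines are down across the city.",
                   return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # shape (1, 3); meaningful only after fine-tuning
```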
Sentence encoders

| Model | Hugging Face link |
|---|---|
| CT-M1-Complete-SE (mono-lingual: EN) | [crisistransformers/CT-M1-Complete-SE](https://huggingface.co/crisistransformers/CT-M1-Complete-SE) |
| CT-XLMR-SE (multi-lingual) | [crisistransformers/CT-XLMR-SE](https://huggingface.co/crisistransformers/CT-XLMR-SE) |
| CT-mBERT-SE (multi-lingual) | [crisistransformers/CT-mBERT-SE](https://huggingface.co/crisistransformers/CT-mBERT-SE) |
The multi-lingual sentence encoders support languages such as Albanian, Arabic, Armenian, and many others.
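To illustrate the shared cross-lingual embedding space, here is a small sketch assuming `sentence-transformers`; the English/Spanish pair is an illustrative translation, not an example from the papers.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("crisistransformers/CT-XLMR-SE")

# Semantically equivalent sentences in English and Spanish (illustrative)
texts = [
    "The earthquake destroyed several buildings downtown.",
    "El terremoto destruyó varios edificios en el centro.",
]
embeddings = encoder.encode(texts, convert_to_tensor=True)

# Cosine similarity should be high for equivalent sentences across languages
print(float(util.cos_sim(embeddings[0], embeddings[1])))
```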
License
Citation
If you use CrisisTransformers or the mono-lingual sentence encoder, please cite the following paper:

```bibtex
@article{lamsal2023crisistransformers,
  title={CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts},
  author={Rabindra Lamsal and Maria Rodriguez Read and Shanika Karunasekera},
  journal={Knowledge-Based Systems},
  pages={111916},
  year={2024},
  publisher={Elsevier}
}
```
If you use the multi-lingual sentence encoders, please cite the following paper:

```bibtex
@article{lamsal2024semantically,
  title={Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts},
  author={Rabindra Lamsal and Maria Rodriguez Read and Shanika Karunasekera},
  year={2024},
  eprint={2403.16614},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```