🚀 BiodivBERT
BiodivBERT is a pre-trained language model tailored for the biodiversity domain, offering high-performance solutions for named entity recognition and relation extraction tasks.
🚀 Quick Start
You can use BiodivBERT via the Hugging Face transformers library as follows:
💻 Usage Examples
Basic Usage
```python
# Load BiodivBERT for masked language modeling (the pre-training objective)
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")

# Load BiodivBERT for token classification (e.g., Named Entity Recognition)
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")

# Load BiodivBERT for sequence classification (e.g., Relation Extraction)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
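Because the published checkpoint ships a masked-language-model head (as in the first snippet above), a quick way to sanity-check the model is the fill-mask pipeline. This is a minimal sketch; the input sentence is an illustrative example, not taken from the original card.

```python
# Minimal fill-mask sanity check for the pre-trained checkpoint.
# The example sentence is illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
print(fill_mask("Deforestation is a major driver of [MASK] loss."))
```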
✨ Features
- BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature.
- It uses the tokenizer from the BERT base cased model.
- BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
- BiodivBERT is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain.
📦 Installation
BiodivBERT is distributed through the Hugging Face Hub, so no model-specific installation is required; installing the transformers library (for example, pip install transformers) is sufficient to load it.
📚 Documentation
Model Description
BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature. It uses the tokenizer from the BERT base cased model, is pre-trained on abstracts and full text from biodiversity literature, and is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain. Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.
Training Data
- BiodivBERT is pre-trained on abstracts and full text from publications related to the biodiversity domain.
- We used both the Elsevier and Springer APIs to crawl this data (a rough sketch of such a crawl follows this list).
- We covered publications from 1990 to 2020.
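The crawling step could look roughly like the sketch below. The endpoint URL, query parameters, and response schema are placeholders, not the actual Elsevier or Springer API contracts, so this is illustrative rather than the authors' pipeline.

```python
# Hypothetical sketch of crawling publisher abstracts by keyword and year range.
# Endpoint, parameters, and response schema are placeholders; consult the
# publishers' API documentation for the real contracts and authentication.
import requests

KEYWORDS = [
    "biodivers", "genetic diversity", "omic diversity", "phylogenetic diversity",
    "soil diversity", "population diversity", "species diversity",
    "ecosystem diversity", "functional diversity", "microbial diversity",
]

def fetch_abstracts(base_url: str, api_key: str, keyword: str,
                    start_year: int = 1990, end_year: int = 2020) -> list:
    """Query a publisher search API (placeholder) for abstracts matching one keyword."""
    response = requests.get(
        base_url,  # placeholder: the publisher's search endpoint
        params={"q": keyword, "date": f"{start_year}-{end_year}", "apiKey": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])  # placeholder response schema
```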
Evaluation Results
BiodivBERT outperformed BERT_base_cased, biobert_v1.1, and a BiLSTM baseline on the downstream tasks.
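Performance is reported with f1, precision, recall, and accuracy (see the table below). The following is a minimal sketch of how such scores could be computed with scikit-learn for a relation-extraction run; the labels are toy values and scikit-learn is an assumption rather than the authors' exact evaluation tooling.

```python
# Illustrative computation of the reported metrics (precision, recall, f1, accuracy)
# for a relation-extraction (sequence classification) run; labels are toy values.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # toy gold relation labels
y_pred = [1, 0, 0, 1, 0]  # toy model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy_score(y_true, y_pred):.2f}")
```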
Other Information
| Property | Details |
|----------|---------|
| Thumbnail | [https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680](https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680) |
| Tags | bert-base-cased, biodiversity, token-classification, sequence-classification |
| License | apache-2.0 |
| Citation | Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain. |
| Paper | [https://ceur-ws.org/Vol-3415/paper-7.pdf](https://ceur-ws.org/Vol-3415/paper-7.pdf) |
| Metrics | f1, precision, recall, accuracy |
| Evaluation Datasets | URL: https://doi.org/10.5281/zenodo.6554208; Named Entity Recognition: COPIOUS, QEMP, BiodivNER, LINNAEUS, Species800; Relation Extraction: GAD, EU-ADR, BiodivRE, BioRelEx |
| Training Data | Crawling keywords: biodivers, genetic diversity, omic diversity, phylogenetic diversity, soil diversity, population diversity, species diversity, ecosystem diversity, functional diversity, microbial diversity. Corpora: (+Abs) Springer and Elsevier abstracts from 1990-2020; (+Abs+Full) Springer and Elsevier abstracts plus open-access full publication text from 1990-2020 |
| Pre-training Hyperparameters | MAX_LEN = 512 (default for the BERT tokenizer); MLM_PROP = 0.15 (masking probability for the data collator); num_train_epochs = 3 (the Trainer default, commonly reported as sufficient); per_device_train_batch_size = 16 (an earlier run on the Ara cluster's V100 GPUs with MAX_LEN = 512 fit only 8); per_device_eval_batch_size = 16; gradient_accumulation_steps = 4 (effective batch size of 16 × 4 × number of GPUs) |
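The hyperparameters above map directly onto the Hugging Face Trainer API. Below is a minimal sketch of such a masked-language-modeling pre-training setup; the corpus file, output directory, tokenization details, and the choice to initialize from bert-base-cased are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal MLM pre-training sketch using the hyperparameters listed above.
# Corpus file, output directory, and initialization from bert-base-cased are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MAX_LEN = 512    # default BERT tokenizer length
MLM_PROP = 0.15  # masking probability for the data collator

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical corpus of crawled abstracts/full texts, one document per line.
dataset = load_dataset("text", data_files={"train": "biodiv_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROP
)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,        # effective batch size 16 * 4 * nGPUs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```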
📄 License
The model is licensed under Apache-2.0.