🚀 BiodivBERT
BiodivBERT is a pre-trained language model tailored for the biodiversity domain, offering high-performance solutions for named entity recognition and relation extraction tasks.
🚀 Quick Start
You can use BiodivBERT via the Hugging Face transformers library as follows:
💻 Usage Examples
Basic Usage
```python
# Load BiodivBERT for masked language modeling (the pre-training objective)
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")

# Load BiodivBERT for token classification (e.g., Named Entity Recognition)
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")

# Load BiodivBERT for sequence classification (e.g., Relation Extraction)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
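Because the published checkpoint ships a masked-language-model head (as in the first snippet above), a quick way to sanity-check the model is the fill-mask pipeline. This is a minimal sketch; the input sentence is an illustrative example, not taken from the original card.

```python
# Minimal fill-mask sanity check for the pre-trained checkpoint.
# The example sentence is illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="NoYo25/BiodivBERT")
print(fill_mask("Deforestation is a major driver of [MASK] loss."))
```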
✨ Features
- BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature.
- It uses the tokenizer from the BERT base cased model.
- BiodivBERT is pre-trained on abstracts and full text from biodiversity literature.
- BiodivBERT is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain.
📦 Installation
BiodivBERT is distributed through the Hugging Face Hub, so no model-specific installation is required; installing the transformers library (for example, pip install transformers) is sufficient to load it.
📚 Documentation
Model Description
BiodivBERT is a domain-specific, cased BERT-based model for the biodiversity literature. It uses the tokenizer from the BERT base cased model, is pre-trained on abstracts and full text from biodiversity literature, and is fine-tuned on two downstream tasks, Named Entity Recognition and Relation Extraction, in the biodiversity domain. Please visit our [GitHub Repo](https://github.com/fusion-jena/BiodivBERT) for more details.
Training Data
- BiodivBERT is pre-trained on abstracts and full text from publications related to the biodiversity domain.
- We used both the Elsevier and Springer APIs to crawl this data (a rough sketch of such a crawl follows this list).
- We covered publications from 1990 to 2020.
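The crawling step could look roughly like the sketch below. The endpoint URL, query parameters, and response schema are placeholders, not the actual Elsevier or Springer API contracts, so this is illustrative rather than the authors' pipeline.

```python
# Hypothetical sketch of crawling publisher abstracts by keyword and year range.
# Endpoint, parameters, and response schema are placeholders; consult the
# publishers' API documentation for the real contracts and authentication.
import requests

KEYWORDS = [
    "biodivers", "genetic diversity", "omic diversity", "phylogenetic diversity",
    "soil diversity", "population diversity", "species diversity",
    "ecosystem diversity", "functional diversity", "microbial diversity",
]

def fetch_abstracts(base_url: str, api_key: str, keyword: str,
                    start_year: int = 1990, end_year: int = 2020) -> list:
    """Query a publisher search API (placeholder) for abstracts matching one keyword."""
    response = requests.get(
        base_url,  # placeholder: the publisher's search endpoint
        params={"q": keyword, "date": f"{start_year}-{end_year}", "apiKey": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("results", [])  # placeholder response schema
```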
Evaluation Results
BiodivBERT outperformed BERT_base_cased, biobert_v1.1, and a BiLSTM baseline on the downstream tasks.
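Performance is reported with f1, precision, recall, and accuracy (see the table below). The following is a minimal sketch of how such scores could be computed with scikit-learn for a relation-extraction run; the labels are toy values and scikit-learn is an assumption rather than the authors' exact evaluation tooling.

```python
# Illustrative computation of the reported metrics (precision, recall, f1, accuracy)
# for a relation-extraction (sequence classification) run; labels are toy values.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # toy gold relation labels
y_pred = [1, 0, 0, 1, 0]  # toy model predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy_score(y_true, y_pred):.2f}")
```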
Other Information
| Property | Details |
|----------|---------|
| Thumbnail | [https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680](https://www.fusion.uni-jena.de/fusionmedia/fusionpictures/fusion-service/fusion-transp.png?height=383&width=680) |
| Tags | bert-base-cased, biodiversity, token-classification, sequence-classification |
| License | apache-2.0 |
| Citation | Abdelmageed, N., Löffler, F., & König-Ries, B. (2023). BiodivBERT: a Pre-Trained Language Model for the Biodiversity Domain. |
| Paper | [https://ceur-ws.org/Vol-3415/paper-7.pdf](https://ceur-ws.org/Vol-3415/paper-7.pdf) |
| Metrics | f1, precision, recall, accuracy |
| Evaluation Datasets | URL: https://doi.org/10.5281/zenodo.6554208; Named Entity Recognition: COPIOUS, QEMP, BiodivNER, LINNAEUS, Species800; Relation Extraction: GAD, EU-ADR, BiodivRE, BioRelEx |
| Training Data | Crawling keywords: biodivers, genetic diversity, omic diversity, phylogenetic diversity, soil diversity, population diversity, species diversity, ecosystem diversity, functional diversity, microbial diversity. Corpora: (+Abs) Springer and Elsevier abstracts from 1990-2020; (+Abs+Full) Springer and Elsevier abstracts plus open-access full publication text from 1990-2020 |
| Pre-training Hyperparameters | MAX_LEN = 512 (default for the BERT tokenizer); MLM_PROP = 0.15 (masking probability for the data collator); num_train_epochs = 3 (the Trainer default, commonly reported as sufficient); per_device_train_batch_size = 16 (an earlier run on the Ara cluster's V100 GPUs with MAX_LEN = 512 fit only 8); per_device_eval_batch_size = 16; gradient_accumulation_steps = 4 (effective batch size of 16 × 4 × number of GPUs) |
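The hyperparameters above map directly onto the Hugging Face Trainer API. Below is a minimal sketch of such a masked-language-modeling pre-training setup; the corpus file, output directory, tokenization details, and the choice to initialize from bert-base-cased are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal MLM pre-training sketch using the hyperparameters listed above.
# Corpus file, output directory, and initialization from bert-base-cased are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MAX_LEN = 512    # default BERT tokenizer length
MLM_PROP = 0.15  # masking probability for the data collator

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical corpus of crawled abstracts/full texts, one document per line.
dataset = load_dataset("text", data_files={"train": "biodiv_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=MLM_PROP
)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,        # effective batch size 16 * 4 * nGPUs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```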
📄 License
The model is licensed under Apache-2.0.