🚀 astroBERT: a language model for astrophysics
This project offers an NLP language model customized for astrophysics by NASA/ADS, along with tutorials and related files. The model is cased: it distinguishes between `ads` and `ADS`.
🚀 Quick Start
This public repository contains the work of NASA/ADS on building an NLP language model tailored to astrophysics, along with tutorials and miscellaneous related files.
This model is cased (it treats `ads` and `ADS` differently).
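As a starting point, the sketch below queries the base model through the `transformers` fill-mask pipeline. The repository id `adsabs/astroBERT` is an assumption (it is not stated in this README), and `check_single_mask` and `top_predictions` are hypothetical helper names introduced only for illustration.

```python
from typing import List

# Assumption: the Hugging Face Hub repository id; it is not stated in this README.
MODEL_ID = "adsabs/astroBERT"
MASK_TOKEN = "[MASK]"  # mask token as written in the widget examples


def check_single_mask(text: str) -> str:
    """Return text unchanged if it contains exactly one mask token, else raise."""
    count = text.count(MASK_TOKEN)
    if count != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN}, found {count}")
    return text


def top_predictions(text: str, top_k: int = 3) -> List[str]:
    """Return the top_k candidate tokens for the masked position in text."""
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    results = fill_mask(check_single_mask(text), top_k=top_k)
    return [r["token_str"] for r in results]
```

Calling `top_predictions("M67 is one of the most studied [MASK] clusters.")` downloads the model on first use and returns the three highest-scoring tokens. Because the model is cased, keep the original capitalization of the input text.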
✨ Features
astroBERT models
- Base model: Pretrained on English text with masked language modeling (MLM) and next sentence prediction (NSP) objectives. Introduced in a paper at ADASS 2021 and made public at ADASS 2022.
- NER-DEAL model: Adds a token classification head to the base model, finetuned on the DEAL@WIESP2022 named entity recognition task. Load it from the `revision='NER-DEAL'` branch (see tutorial 2).
- SciX Categorizer: Finetuned to classify text into one of the following categories relevant to SciX: Astronomy, Heliophysics, Planetary Science, Earth Science, NASA-funded Biophysics, Other Physics, Other, Text Garbage.
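The named entity recognition variant in the list above lives on a separate branch, so loading it means passing `revision` to `from_pretrained`. The following is a minimal sketch, not the tutorial's code: the repository id `adsabs/astroBERT` is an assumption, the tokenizer is assumed to be shared with the default branch, and `load_ner_pipeline` and `group_entities` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List

# Assumption: the Hugging Face Hub repository id; it is not stated in this README.
MODEL_ID = "adsabs/astroBERT"
NER_REVISION = "NER-DEAL"  # branch name given in the model list


def load_ner_pipeline():
    """Build a token-classification pipeline from the NER-DEAL branch."""
    # Imported here so group_entities works without transformers installed.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    # Assumption: the tokenizer on the default branch also fits the NER model.
    # do_lower_case=False keeps the input case, since the model is cased.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, do_lower_case=False)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_ID, revision=NER_REVISION
    )
    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy="simple",
    )


def group_entities(predictions: List[dict]) -> Dict[str, List[str]]:
    """Collect predicted entity strings under their entity label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for pred in predictions:
        grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)
```

A typical call would be `group_entities(load_ner_pipeline()("The Kepler satellite observed solar-type stars."))`. The labels returned depend on the tag set the model was finetuned with.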
Tutorials
📄 License
This project is licensed under the MIT license.
BibTeX
@ARTICLE{2021arXiv211200590G,
author = {{Grezes}, Felix and {Blanco-Cuaresma}, Sergi and {Accomazzi}, Alberto and {Kurtz}, Michael J. and {Shapurian}, Golnaz and {Henneken}, Edwin and {Grant}, Carolyn S. and {Thompson}, Donna M. and {Chyla}, Roman and {McDonald}, Stephen and {Hostetler}, Timothy W. and {Templeton}, Matthew R. and {Lockhart}, Kelly E. and {Martinovic}, Nemanja and {Chen}, Shinyi and {Tanner}, Chris and {Protopapas}, Pavlos},
title = "{Building astroBERT, a language model for Astronomy \& Astrophysics}",
journal = {arXiv e-prints},
keywords = {Computer Science - Computation and Language, Astrophysics - Instrumentation and Methods for Astrophysics},
year = 2021,
month = dec,
eid = {arXiv:2112.00590},
pages = {arXiv:2112.00590},
archivePrefix = {arXiv},
eprint = {2112.00590},
primaryClass = {cs.CL},
adsurl = {https://ui.adsabs.harvard.edu/abs/2021arXiv211200590G},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
Widget Examples
| Example Title | Input Text |
| --- | --- |
| M67 | "M67 is one of the most studied [MASK] clusters." |
| solar twin | "A solar twin is a star with [MASK] parameters and chemical composition very similar to our Sun." |
| dynamical evolution | "The dynamical evolution of planets close to their star is affected by [MASK] effects" |
| Kepler satellite | "The Kepler satellite collected high-precision long-term and continuous light [MASK] for more than 100,000 solar-type stars" |
| Local Group | "The Local Group is composed of the Milky Way, the [MASK] Galaxy, and numerous smaller satellite galaxies." |
| Cepheid | "Cepheid variables are used to determine the [MASK] to galaxies in the local universe." |
| Jets | "Jets are created and sustained by [MASK] of matter onto a compact massive object." |
| single star | "A single star of one solar mass will evolve into a [MASK] dwarf." |
| Very Large Array | "The Very Large Array observes the sky at [MASK] wavelengths." |
| Elements | "Elements heavier than [MASK] are generated in supernovae explosions." |
| Spitzer | "Spitzer was the first [MASK] to fly in an Earth-trailing orbit." |
| galaxies collide | "Galaxy [MASK] can occur when two (or more) galaxies collide" |
| hypothetical matter | "Dark [MASK] is a hypothetical form of matter thought to account for approximately 85% of the matter in the universe." |
| CMBR | "The cosmic microwave background (CMB, CMBR), in Big Bang cosmology, is electromagnetic radiation which is a remnant from an early stage of the [MASK]." |
| galaxies pulled | "The Local Group of galaxies is pulled toward The Great [MASK]." |
| Moon | "The Moon is the only [MASK] of the Earth." |
| morphology | "Galaxies are categorized according to their visual morphology as [MASK], spiral, or irregular." |
| Stars mostly | "Stars are made mostly of [MASK]." |
| Comet tails | "Comet tails are created as comets approach the [MASK]." |
| Pluto | "Pluto is a dwarf [MASK] in the Kuiper Belt." |
| Magellanic Clouds | "The Large and Small Magellanic Clouds are irregular [MASK] galaxies and are two satellite galaxies of the Milky Way." |
| Milky Way | "The Milky Way has a [MASK] black hole, Sagittarius A*, at its center." |
| Andromeda | "Andromeda is the nearest large [MASK] to the Milky Way and is roughly its equal in mass." |
| gas and dust | "The [MASK] medium is the gas and dust between stars." |