Model Card for Indus (nasa-smd-ibm-v0.1)
Indus (previously known as nasa-smd-ibm-v0.1) is a RoBERTa-based, encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications. Trained on scientific journals and articles relevant to NASA SMD, it aims to improve natural language technologies such as information retrieval and intelligent search.
Features
- Named Entity Recognition (NER): Identify named entities in text.
- Information Retrieval: Retrieve relevant information from text.
- Sentence Transformers: Generate sentence embeddings.
- Extractive QA: Extract answers from text.
Installation
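The original card lists no explicit installation steps. Assuming the model is used through the Hugging Face `transformers` library with PyTorch, `pip install transformers torch` should be sufficient for the examples below.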
Usage Examples
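The original card does not include code snippets. Below is a minimal sketch of masked-token prediction with the Hugging Face `transformers` library, matching the model's MLM training objective; the Hub ID `nasa-impact/nasa-smd-ibm-v0.1` is assumed from the citation URL below, and the example sentence is purely illustrative.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub ID, taken from the citation URL in this card.
model_id = "nasa-impact/nasa-smd-ibm-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# RoBERTa-style models use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("The Hubble Space Telescope observes in the <mask> spectrum."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

For downstream tasks such as NER or extractive QA, the same checkpoint would typically be fine-tuned with a task-specific head (for example, `AutoModelForTokenClassification` or `AutoModelForQuestionAnswering`).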
Documentation
Model Details
Training Data
- Wikipedia English (Feb 1, 2020)
- AGU Publications
- AMS Publications
- Scientific papers from Astrophysics Data Systems (ADS)
- PubMed abstracts
- PubMed Central (PMC) (commercial license subset)

Training Procedure
- Framework: fairseq 0.12.1 with PyTorch 1.9.1
- transformers Version: 4.2.0
- Strategy: Masked Language Modeling (MLM)
Evaluation
BLURB benchmark
(Standard deviation across 10 random seeds in parentheses. Macro average reported across datasets; micro average computed by averaging scores within each task and then averaging across the task averages.)
Climate Change NER and NASA-QA benchmarks
(Climate Change NER and NASA-QA benchmark results. Standard deviation over multiple runs given in parentheses.)
Please refer to the corresponding dataset cards for further benchmarks and evaluation details.
Uses
This model is suitable for NASA SMD-related scientific use cases, including:
- Named Entity Recognition (NER)
- Information Retrieval
- Sentence Transformers: Generate sentence embeddings (see the sketch after this list).
- Extractive QA
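As an illustration of the sentence-embedding use case listed above, the following is a minimal sketch that mean-pools the encoder's last hidden state. The pooling strategy and example sentences are illustrative assumptions, not a method prescribed by this card; the model ID is taken from the citation URL below.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hub ID, taken from the citation URL in this card.
model_id = "nasa-impact/nasa-smd-ibm-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "Sea level rise accelerates coastal erosion.",
    "MODIS measures land surface temperature.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, dim)

# Mean pooling over non-padding tokens (an illustrative choice, not prescribed).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                  # (2, hidden_size)
```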
Note
The accompanying preprint is available at https://arxiv.org/abs/2405.10725.
Technical Details
The model is a RoBERTa-based, encoder-only transformer, domain-adapted for NASA Science Mission Directorate (SMD) applications. It is trained on relevant scientific journals and articles with a masked language modeling (MLM) objective to improve downstream natural language technologies such as information retrieval and intelligent search.
License
This project is licensed under the Apache-2.0 license.
Citation
If you find this work useful, please cite it using the following BibTeX entry:
@misc{nasa-impact_2023,
  author    = {Masayasu Muraoka and Bishwaranjan Bhattacharjee and Muthukumaran Ramasubramanian and Iksha Gurung and Rahul Ramachandran and Manil Maskey and Kaylin Bugbee and Rong Zhang and Yousef El Kurdi and Bharath Dandala and Mike Little and Elizabeth Fancher and Lauren Sanders and Sylvain Costes and Sergi Blanco-Cuaresma and Kelly Lockhart and Thomas Allen and Felix Grazes and Megan Ansdell and Alberto Accomazzi and Sanaz Vahidinia and Ryan McGranaghan and Armin Mehrabian and Tsendgar Lee},
  title     = {nasa-smd-ibm-v0.1 (Revision f01d42f)},
  year      = {2023},
  url       = {https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1},
  doi       = {10.57967/hf/1429},
  publisher = {Hugging Face}
}
Attribution
IBM Research
- Masayasu Muraoka
- Bishwaranjan Bhattacharjee
- Rong Zhang
- Yousef El Kurdi
- Bharath Dandala
NASA SMD
- Muthukumaran Ramasubramanian
- Iksha Gurung
- Rahul Ramachandran
- Manil Maskey
- Kaylin Bugbee
- Mike Little
- Elizabeth Fancher
- Lauren Sanders
- Sylvain Costes
- Sergi Blanco-Cuaresma
- Kelly Lockhart
- Thomas Allen
- Felix Grazes
- Megan Ansdell
- Alberto Accomazzi
- Sanaz Vahidinia
- Ryan McGranaghan
- Armin Mehrabian
- Tsendgar Lee
Disclaimer
This encoder-only model is currently in an experimental phase. We are working to improve its capabilities and performance, and as we progress we invite the community to engage with the model, provide feedback, and contribute to its evolution.