Model Card for Indus (nasa-smd-ibm-v0.1)
Indus (previously known as nasa-smd-ibm-v0.1) is a RoBERTa-based, encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications. Trained on scientific journals and articles relevant to NASA SMD, it aims to improve natural language technologies such as information retrieval and intelligent search.
Features
- Named Entity Recognition (NER): Identify named entities in text.
- Information Retrieval: Retrieve relevant information from text.
- Sentence Transformers: Generate sentence embeddings.
- Extractive QA: Extract answers from text.
Installation
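The original card lists no explicit installation steps. Assuming the model is used through the Hugging Face `transformers` library with PyTorch, `pip install transformers torch` should be sufficient for the examples below.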
Usage Examples
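The original card does not include code snippets. Below is a minimal sketch of masked-token prediction with the Hugging Face `transformers` library, matching the model's MLM training objective; the Hub ID `nasa-impact/nasa-smd-ibm-v0.1` is assumed from the citation URL below, and the example sentence is purely illustrative.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub ID, taken from the citation URL in this card.
model_id = "nasa-impact/nasa-smd-ibm-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# RoBERTa-style models use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("The Hubble Space Telescope observes in the <mask> spectrum."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

For downstream tasks such as NER or extractive QA, the same checkpoint would typically be fine-tuned with a task-specific head (for example, `AutoModelForTokenClassification` or `AutoModelForQuestionAnswering`).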
Documentation
Model Details
Training Data
- Wikipedia English (Feb 1, 2020)
- AGU Publications
- AMS Publications
- Scientific papers from Astrophysics Data Systems (ADS)
- PubMed abstracts
- PubMed Central (PMC) (commercial license subset)

Training Procedure
- Framework: fairseq 0.12.1 with PyTorch 1.9.1
- transformers Version: 4.2.0
- Strategy: Masked Language Modeling (MLM)
Evaluation
BLURB benchmark
(Standard deviation across 10 random seeds in parentheses. Macro average reported across datasets; micro average computed by averaging scores within each task and then averaging across the task averages.)
Climate Change NER and NASA-QA benchmarks
(Climate Change NER and NASA-QA benchmark results. Standard deviation over multiple runs given in parentheses.)
Please refer to the corresponding dataset cards for further benchmarks and evaluation details.
Uses
This model is suitable for NASA SMD-related scientific use cases, including:
- Named Entity Recognition (NER)
- Information Retrieval
- Sentence Transformers: Generate sentence embeddings (see the sketch after this list).
- Extractive QA
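As an illustration of the sentence-embedding use case listed above, the following is a minimal sketch that mean-pools the encoder's last hidden state. The pooling strategy and example sentences are illustrative assumptions, not a method prescribed by this card; the model ID is taken from the citation URL below.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hub ID, taken from the citation URL in this card.
model_id = "nasa-impact/nasa-smd-ibm-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "Sea level rise accelerates coastal erosion.",
    "MODIS measures land surface temperature.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, dim)

# Mean pooling over non-padding tokens (an illustrative choice, not prescribed).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                  # (2, hidden_size)
```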
Note
The accompanying preprint is available at https://arxiv.org/abs/2405.10725.
Technical Details
The model is a RoBERTa-based, encoder-only transformer, domain-adapted for NASA Science Mission Directorate (SMD) applications. It is trained on relevant scientific journals and articles with a masked language modeling (MLM) objective to improve downstream natural language technologies such as information retrieval and intelligent search.
License
This project is licensed under the Apache-2.0 license.
Citation
If you find this work useful, please cite it using the following BibTeX entry:
@misc{nasa-impact_2023,
  author    = {Masayasu Muraoka and Bishwaranjan Bhattacharjee and Muthukumaran Ramasubramanian and Iksha Gurung and Rahul Ramachandran and Manil Maskey and Kaylin Bugbee and Rong Zhang and Yousef El Kurdi and Bharath Dandala and Mike Little and Elizabeth Fancher and Lauren Sanders and Sylvain Costes and Sergi Blanco-Cuaresma and Kelly Lockhart and Thomas Allen and Felix Grazes and Megan Ansdell and Alberto Accomazzi and Sanaz Vahidinia and Ryan McGranaghan and Armin Mehrabian and Tsendgar Lee},
  title     = {nasa-smd-ibm-v0.1 (Revision f01d42f)},
  year      = {2023},
  url       = {https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1},
  doi       = {10.57967/hf/1429},
  publisher = {Hugging Face}
}
Attribution
IBM Research
- Masayasu Muraoka
- Bishwaranjan Bhattacharjee
- Rong Zhang
- Yousef El Kurdi
- Bharath Dandala
NASA SMD
- Muthukumaran Ramasubramanian
- Iksha Gurung
- Rahul Ramachandran
- Manil Maskey
- Kaylin Bugbee
- Mike Little
- Elizabeth Fancher
- Lauren Sanders
- Sylvain Costes
- Sergi Blanco-Cuaresma
- Kelly Lockhart
- Thomas Allen
- Felix Grazes
- Megan Ansdell
- Alberto Accomazzi
- Sanaz Vahidinia
- Ryan McGranaghan
- Armin Mehrabian
- Tsendgar Lee
Disclaimer
This encoder-only model is currently in an experimental phase. We are working to improve its capabilities and performance, and as we progress we invite the community to engage with the model, provide feedback, and contribute to its evolution.