Indus-Retriever Model Card
Indus-Retriever (nasa-smd-ibm-st-v2) is a bi-encoder sentence-transformer model fine-tuned from the nasa-smd-ibm-v0.1 encoder model. It is an updated version of nasa-smd-ibm-st with better performance. Trained on 271 million examples in total, including a domain-specific dataset of 2.6 million examples curated from NASA Science Mission Directorate (SMD) documents, this model aims to enhance natural-language technologies such as information retrieval and intelligent search for SMD NLP applications.
You can also use the distilled version of the model here: Distilled Model
Quick Start
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nasa-impact/nasa-smd-ibm-st-v2")

input_queries = [
    'query: how much protein should a female eat', 'query: summit define']

input_passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments."]

query_embeddings = model.encode(input_queries)
passage_embeddings = model.encode(input_passages)

print(util.cos_sim(query_embeddings, passage_embeddings))
```
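For readers unfamiliar with the output, `util.cos_sim` returns a matrix with one row per query and one column per passage. A minimal NumPy sketch of that computation (toy vectors stand in for real embeddings; `cos_sim_matrix` is an illustrative helper, not part of the library):

```python
import numpy as np

def cos_sim_matrix(a, b):
    """Pairwise cosine similarity: rows of `a` vs rows of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

q = np.array([[1.0, 0.0], [0.0, 1.0]])  # stand-ins for query embeddings
p = np.array([[1.0, 0.0], [1.0, 1.0]])  # stand-ins for passage embeddings

sim = cos_sim_matrix(q, p)
print(sim.shape)           # (2, 2): one row per query, one column per passage
print(sim.argmax(axis=1))  # [0 1]: best-matching passage index per query
```

For retrieval, taking the argmax (or top-k) along each row ranks the passages for that query.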
Features
- Enhanced Performance: An updated version of nasa-smd-ibm-st with better performance.
- Domain-Specific Training: Trained with a domain-specific dataset from NASA SMD documents.
- Multiple Use Cases: Suitable for information retrieval and sentence similarity search in NASA SMD-related scientific use cases.
Installation
The model is loaded through the sentence-transformers library, which can be installed with `pip install sentence-transformers`.
Documentation
Model Details
| Property | Details |
|----------|---------|
| Base Encoder Model | INDUS |
| Tokenizer | Custom |
| Parameters | 125M |
| Training Strategy | Sentence pairs with a score indicating relevancy. The model encodes the two sentences independently, computes their cosine similarity, and optimizes the similarity against the relevance score. |
Training Data
Figure: Open dataset sources for sentence transformers (269M in total)
Additionally, 2.6M abstract + title pairs collected from NASA SMD documents.
Training Procedure
| Property | Details |
|----------|---------|
| Framework | PyTorch 1.9.1 |
| sentence-transformers version | 4.30.2 |
| Strategy | Sentence pairs |
Evaluation
The following models are evaluated:
- All-MiniLM-l6-v2 [sentence-transformers/all-MiniLM-L6-v2]
- BGE-base [BAAI/bge-base-en-v1.5]
- RoBERTa-base [roberta-base]
- nasa-smd-ibm-rtvr_v0.1 [nasa-impact/nasa-smd-ibm-st]
Figure: BEIR and NASA-IR Evaluation Metrics
Technical Details
The model is fine-tuned from the nasa-smd-ibm-v0.1 encoder model. It is trained on sentence pairs with an associated relevance score: the two sentences are encoded independently, their cosine similarity is computed, and the similarity is optimized against the relevance score.
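The objective described above can be sketched as follows. This is a hedged illustration using random tensors in place of real encoder outputs, in the style of sentence-transformers' CosineSimilarityLoss; it is not the actual training code.

```python
import torch

# Stand-ins for encoder outputs: a batch of 8 sentence pairs, 16-dim embeddings.
emb_a = torch.randn(8, 16, requires_grad=True)  # sentence A embeddings
emb_b = torch.randn(8, 16, requires_grad=True)  # sentence B embeddings
relevance = torch.rand(8)                       # gold relevance scores in [0, 1]

# Encode independently, compare by cosine similarity, regress toward the score.
cos = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=1)
loss = torch.nn.functional.mse_loss(cos, relevance)
loss.backward()  # gradients flow back through both embeddings
```

In real training, `emb_a` and `emb_b` would both come from the shared bi-encoder, so minimizing this loss pulls relevant pairs together and pushes irrelevant pairs apart.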
License
This project is licensed under the Apache-2.0 license.
Citation
If you find this work useful, please cite it with the following BibTeX entry:
```bibtex
@misc{nasa-impact_2024,
  author    = {{NASA-IMPACT}},
  title     = {nasa-smd-ibm-st-v2 (Revision d249d84)},
  year      = 2024,
  url       = {https://huggingface.co/nasa-impact/nasa-smd-ibm-st-v2},
  doi       = {10.57967/hf/1800},
  publisher = {Hugging Face}
}
```
Attribution
- IBM Research: Aashka Trivedi, Masayasu Muraoka, Bishwaranjan Bhattacharjee
- NASA SMD: Muthukumaran Ramasubramanian, Iksha Gurung, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Mike Little, Elizabeth Fancher, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grazes, Megan Ansdell, Alberto Accomazzi, Sanaz Vahidinia, Ryan McGranaghan, Armin Mehrabian, Tsendgar Lee
Disclaimer
This sentence-transformer model is currently in an experimental phase. We are working to improve the model's capabilities and performance, and as we progress, we invite the community to engage with this model, provide feedback, and contribute to its evolution.