🚀 BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials
BioinspiredLLM is an open-source autoregressive transformer LLM fine-tuned on more than a thousand peer-reviewed articles in the mechanics of biological and bio-inspired materials. It can assist in biological materials research, generate hypotheses, and collaborate with other AI models to reshape materials design workflows.
🚀 Quick Start
Installation
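The model card does not pin package versions; a typical environment (an assumption, not from the source) can be set up with `pip install torch transformers accelerate`, plus `llama-index`, `chromadb`, and `langchain` for the RAG examples further below.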
The following code loads the BioinspiredLLM model and its tokenizer, and moves the model to the GPU used by the generation examples below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('lamm-mit/BioinspiredLLM')
tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLLM')

device = 'cuda'
model.to(device)
```
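If the model does not fit on a single GPU, the weights can instead be sharded automatically at load time. This is a standard `transformers`/`accelerate` option, shown here as an alternative rather than the card's prescribed method:

```python
# Alternative: let accelerate place model shards across available devices
model = AutoModelForCausalLM.from_pretrained('lamm-mit/BioinspiredLLM', device_map='auto')
```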
Basic Usage
The following is a basic helper function for text generation:

```python
def generate_response(text_input="Biological materials offer amazing",
                      num_return_sequences=1,
                      temperature=1.0,
                      max_new_tokens=127,
                      num_beams=1,
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.0,
                      eos_token_id=2,
                      verbatim=False,
                      ):
    # Tokenize the prompt without adding special tokens
    inputs = tokenizer.encode(text_input, add_special_tokens=False, return_tensors='pt')
    if verbatim:
        print("Length of input, tokenized:", inputs.shape, inputs)
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs.to(device),
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,  # modulates the next-token probabilities
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 top_p=top_p,
                                 num_return_sequences=num_return_sequences,
                                 eos_token_id=eos_token_id,
                                 do_sample=True,
                                 repetition_penalty=repetition_penalty,
                                 )
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.batch_decode(outputs[:, inputs.shape[1]:].detach().cpu().numpy(),
                                  skip_special_tokens=True)
```
Example of Generation
The following example uses the chat prompt template for text generation:

```python
system_prompt = "You are BioinspiredLLM. You are knowledgeable in biological and bio-inspired materials and provide accurate and qualitative insights about biological materials found in Nature. You are a cautious assistant. You think step by step. You carefully follow instructions."
user_message = "What are hierarchical, biological materials?"

txt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"

# modulate temperature (0.1-1.0) to adjust 'creativity'
# modulate max_new_tokens to change the length of the generated response
output_text = generate_response(text_input=txt,
                                eos_token_id=2,
                                num_return_sequences=1,
                                repetition_penalty=1.1,
                                top_p=0.95,
                                top_k=50,
                                temperature=0.1,
                                max_new_tokens=512,
                                verbatim=False,
                                )
print(output_text)
```
✨ Features
- Knowledge Recall: BioinspiredLLM can accurately recall information about biological materials from a large-scale corpus.
- Enhanced Reasoning: It has enhanced reasoning ability, which helps in answering complex questions.
- Retrieval-Augmented Generation (RAG): This method allows the model to incorporate new data during generation, trace back sources, update the knowledge base, and connect knowledge domains.
- Hypothesis Generation: It can develop sound hypotheses regarding biological materials design, even for materials that have never been explicitly studied before (see the prompt sketch after this list).
- Collaboration with Other AI Models: It shows promise in collaborating with other generative AI models to reshape the traditional materials design process.
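As a sketch of the hypothesis-generation use case, the chat template and `generate_response` helper from the Quick Start section can be reused with a design-oriented question; the prompt wording below is illustrative, not from the original card:

```python
# Hypothetical prompt asking for a materials-design hypothesis
user_message = "Propose a hypothesis for how a suture-like interface could toughen a 3D-printed ceramic."
txt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"
print(generate_response(text_input=txt, temperature=0.5, max_new_tokens=512)[0])
```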
📚 Documentation
Dataset
The dataset used for this model can be downloaded from: [Dataset Link](https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fadvs.202306724&file=advs7235-sup-0002-SuppMat.csv)
Performance
The performance of BioinspiredLLM was assessed through knowledge-recall evaluation experiments covering:
- Total Scores: Scores of different models (Llama 13b-chat, Orca-2 13b, Llama-BioLLM, BioinspiredLLM, and BioinspiredLLM with RAG) on a 100-question biological materials exam.
- Scores by Question Category: Scores separated by question category (general, specific, numerical, and non-biological).
- RAG Method: The framework of the Retrieval-Augmented Generation (RAG) method and examples of BioinspiredLLM's responses when supplemented using RAG, along with the source of the retrieved content.

Retrieval Augmented Generation (RAG)
The following example uses the RAG method via LlamaIndex.

Set up BioinspiredLLM as a custom LLM:
```python
from llama_index.prompts.prompts import SimpleInputPrompt
from llama_index import (
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.llms import HuggingFaceLLM

eos_token = 32000

system_prompt = "You are BioinspiredLLM. You are knowledgeable in biological and bio-inspired materials and provide accurate and qualitative insights about biological materials found in Nature. You are a cautious assistant. You think step by step. You carefully follow instructions."
query_wrapper_prompt = SimpleInputPrompt(
    "<|im_start|>system\n" + system_prompt + "<|im_end|>\n<|im_start|>user\n{query_str}<|im_end|>\n<|im_start|>assistant"
)

llm_custom = HuggingFaceLLM(context_window=2048,
                            max_new_tokens=300,
                            query_wrapper_prompt=query_wrapper_prompt,
                            stopping_ids=[eos_token, 2],
                            model=model,
                            generate_kwargs={"temperature": 0.1, "do_sample": True,
                                             "repetition_penalty": 1.1, "top_p": 0.95,
                                             "top_k": 50, "eos_token_id": [eos_token, 2]},
                            tokenizer=tokenizer)
llm_custom.model_name = 'BioinspiredLLM'
```
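Before wiring the wrapper into an index, it can be sanity-checked with a direct completion call (a standard LlamaIndex LLM method; the question is illustrative):

```python
# Sanity check: query the wrapped model directly, without retrieval
print(llm_custom.complete("What makes nacre tough?"))
```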
Use a Chroma database collection:
```python
import chromadb
from chromadb.config import Settings
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

coll_name = "Bioinspired"
coll_path = './Bioinspired_Chroma'  # path to the Chroma database

client = chromadb.PersistentClient(path=coll_path)
chroma_collection = client.get_or_create_collection(coll_name)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
print(chroma_collection.count())  # number of stored chunks
```
Set up the custom LLM service context and vector store index:
```python
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm_custom,
    chunk_size=1024,
    embed_model="local:BAAI/bge-large-en",
)

index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)
```
Set up the query engine:
```python
from IPython.display import Markdown, display

query_engine = index.as_query_engine(
    # response_mode="tree_summarize",
    # response_mode='compact',
    # response_mode='accumulate',
    # streaming=True,
    similarity_top_k=5,
)
question = "Which horn does not have tubules? A) big horn sheep B) pronghorn C) mountain goat"
response = query_engine.query(question)
display(Markdown(f"<b>{response}</b>"))
```
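Since the Features section notes that RAG lets the model trace back sources, the retrieved chunks behind an answer are worth inspecting. In this LlamaIndex version the response object exposes them as `source_nodes` (attribute names may differ across versions):

```python
# Inspect which retrieved chunks supported the answer
for node_with_score in response.source_nodes:
    print("score:", node_with_score.score)
    print(node_with_score.node.get_text()[:200])  # first 200 characters of the source chunk
```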
Alternatively, load new documents:
```python
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
)

documents_graph = SimpleDirectoryReader(
    input_files=[
        "./XXXXXXXXXX/XXXXX.pdf",
    ]
).load_data()

index_doc = VectorStoreIndex.from_documents(documents_graph,
                                            service_context=service_context,
                                            show_progress=True,
                                            embeddings=embeddings,
                                            )
```
Query
question="Which rapid prototyping techniques would be useful for creating hierarchical, bio - inspired materials?"
response = index_doc.as_query_engine(service_context=service_context,
response_mode="tree_summarize",
similarity_top_k=5,
).query(question,
)
print(response)
📄 License
- Orca 2: Licensed under the Microsoft Research License ([License Link](https://huggingface.co/microsoft/Orca-2-13b/blob/main/LICENSE)).
- Llama 2: Licensed under the LLAMA 2 Community License ([License Link](https://ai.meta.com/llama/license/)).
Bias, Risks, and Limitations
⚠️ Important Note
As with all modeling techniques, errors are possible. Although the base models Llama 2 and Orca 2 are aligned to avoid spreading misinformation, researchers should still verify responses to avoid propagating errors. Employing chain-of-thought prompting and RAG methods can help minimize risks.
💡 Usage Tip
The system prompt of BioinspiredLLM can be edited to steer context, as sketched below. For more details, refer to the main paper.
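For instance, the system prompt can be swapped before building the chat template; the wording below is an illustrative assumption, not taken from the paper:

```python
# Hypothetical edited system prompt steering the model toward design critique
system_prompt = ("You are BioinspiredLLM. You critically evaluate proposed bio-inspired "
                 "material designs, noting mechanisms, trade-offs, and open questions.")
```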
BioinspiredLLM has the following limitations:
- Data Biases: Because of the extensive data used for training, the model may carry biases from the source data, resulting in potentially biased or unfair outputs.
- Lack of Contextual Understanding: Despite its language capabilities, the model has limited real-world understanding, which may lead to inaccurate or nonsensical responses.
- Lack of Transparency: The complexity and size of the model make it a "black box", making it difficult to understand the rationale behind specific outputs. For more information, review the transparency notes from Azure.
🔧 Technical Details
BioinspiredLLM is a 13B-parameter model, fine-tuned from the Orca-2 model, which itself builds on the LLaMA-2 13b base model. For details on the model architecture, refer to the LLaMA-2 technical report ([arXiv:2307.09288](https://arxiv.org/abs/2307.09288)).
📄 Reference
Rachel K. Luu, Markus J. Buehler, "BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials," Adv. Science, 2023, DOI: 10.1002/advs.202306724; preprint: [arXiv:2309.08788](https://arxiv.org/abs/2309.08788).

