BioinspiredLLM: an open-source, freely available specialist large language model for biological and bio-inspired materials science

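BioinspiredLLM

Developed by lamm-mit

A specialist LLM fine-tuned from the 13-billion-parameter Orca-2, focused on biological and bio-inspired materials science

Tags: large language model, multilingual, #biomaterials-knowledge-base, #bio-inspired-design-engine, #retrieval-augmented-generation

Downloads: 129

Released: 12/13/2023

Model overview

An open-source LLM fine-tuned on more than a thousand papers from the biological materials field. It supports knowledge recall, research assistance, and creative idea generation, and works with retrieval-augmented generation (RAG).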

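Model features

Domain knowledge in biological materials

Fine-tuned on more than a thousand peer-reviewed papers, giving it accurate recall of biological materials knowledge

Retrieval-augmented generation (RAG)

Can draw on an external knowledge base and trace information back to its source, which supports knowledge updates and cross-domain connections

Hypothesis generation for biological materials design

Can propose plausible design hypotheses for materials that have not been studied, showing scientific creativity

Multi-model collaborative workflows

Can work alongside other generative AI models to reshape the traditional materials design process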

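Model capabilities

Question answering on biological materials

Analysis of research literature

Idea generation for bio-inspired design

Cross-domain knowledge connection

Hypothesis generation and validation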

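Use cases

Scientific research

Querying biological material properties

Quickly obtain explanations of specialized concepts such as hierarchical biological materials

Reaches 85% accuracy on a 100-question test

Experimental design

Generates feasible protocols for fabricating bio-inspired materials

Can propose novel designs not documented in the literature

Education

Teaching support

Answers student questions on biomechanics

Provides structured, expert-level explanations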

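🚀 BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials

BioinspiredLLM is an open-source autoregressive transformer large language model fine-tuned on more than a thousand peer-reviewed articles on biological and bio-inspired materials. It can be used for information retrieval and research assistance, and it can serve as a creative engine: it shows promise in generating hypotheses for biological materials design and in collaborating with other generative AI models, which helps connect knowledge across scientific fields.

🚀 Quick Start

Loading the model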

from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map='auto' relies on the accelerate package to place the weights on the
# available GPU(s), falling back to CPU when no GPU is present.
model = AutoModelForCausalLM.from_pretrained('lamm-mit/BioinspiredLLM', device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLLM')
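
As a back-of-the-envelope aid for choosing hardware, the sketch below estimates how much memory the weights of a 13-billion-parameter model occupy at a few common precisions. It counts weights only (activations and the generation cache need additional memory), and the helper name `weight_memory_gb` is illustrative rather than part of any library.

```python
# Rough weight-memory estimate for a 13-billion-parameter model.
# Assumption: only the weights are counted; runtime buffers add more.
N_PARAMS = 13e9  # parameter count stated for BioinspiredLLM

# Storage cost of one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params, dtype):
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB")
# float32: ~52 GB
# float16: ~26 GB
# int8: ~13 GB
```

At full 32-bit precision the weights alone exceed the memory of most single GPUs, which is why half precision or multi-device placement is commonly used for models of this size.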

Generating text

import torch

# Send inputs to whichever device holds the model, so the two always match.
device = model.device

def generate_response(text_input="Biological materials offer amazing",
                      num_return_sequences=1,
                      temperature=1.,
                      max_new_tokens=127,
                      num_beams=1,
                      top_k=50,
                      top_p=0.9,
                      repetition_penalty=1.,
                      eos_token_id=2,
                      verbatim=False,
                      exponential_decay_length_penalty_fac=None,  # accepted but not used below
                      ):
    # Tokenize without adding special tokens; the prompt template supplies its own markers.
    inputs = tokenizer.encode(text_input, add_special_tokens=False, return_tensors='pt')
    if verbatim:
        print("Length of input, tokenized: ", inputs.shape, inputs)
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs.to(device),
                                 max_new_tokens=max_new_tokens,
                                 temperature=temperature,  # modulates the next-token probabilities
                                 num_beams=num_beams,
                                 top_k=top_k,
                                 top_p=top_p,
                                 num_return_sequences=num_return_sequences,
                                 eos_token_id=eos_token_id,
                                 do_sample=True,  # sample instead of decoding greedily
                                 repetition_penalty=repetition_penalty,
                                 )
    # Drop the prompt tokens and decode only the newly generated continuation.
    return tokenizer.batch_decode(outputs[:, inputs.shape[1]:].detach().cpu().numpy(), skip_special_tokens=True)

Generation example

system_prompt = "You are BioinspiredLLM. You are knowledgeable in biological and bio-inspired materials and provide accurate and qualitative insights about biological materials found in Nature. You are a cautious assistant. You think step by step. You carefully follow instructions."
user_message = "What are hierarchical, biological materials?"

txt =  f"<|im_start|>system\n{system_prompt}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"

# modulate temperature(0.1-1.0) to adjust 'creativity'
# modulate max_new_tokens to change length of generated response 
output_text=generate_response ( text_input=txt,
                                eos_token_id=2,
                                num_return_sequences=1,
                                repetition_penalty=1.1,
                                top_p=0.95,
                                top_k=50, 
                                temperature=0.1,
                                max_new_tokens=512,
                                verbatim=False, 
                              )

print(output_text)
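
The prompt string in the example above follows a fixed layout: a system turn, a user turn, and an opening assistant tag, each delimited by `<|im_start|>` and `<|im_end|>`. A small helper can keep that layout in one place. This is a minimal sketch; the function name `build_prompt` is an assumption introduced here, not something shipped with the model.

```python
def build_prompt(system_prompt, user_message):
    """Assemble a single-turn prompt in the <|im_start|>/<|im_end|> layout.

    The string ends with an open assistant tag so that the model's
    continuation becomes the assistant's reply.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant"
    )


# Illustrative inputs; any system prompt and question can be substituted.
txt = build_prompt("You are BioinspiredLLM.", "What are hierarchical, biological materials?")
print(txt)
```

The resulting `txt` can be passed as `text_input` to `generate_response` in place of the hand-written f-string.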

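✨ Key Features

Accurate recall: recalls information about biological materials accurately, with enhanced reasoning ability.
Retrieval-augmented generation (RAG): brings new data into the generation process, making it possible to trace sources, update the knowledge base, and connect knowledge domains.
Hypothesis generation: proposes plausible hypotheses for biological materials design, even for materials that have never been explicitly studied.
Collaboration: works with other generative AI models to reshape the traditional materials design process.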
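
To make the RAG idea concrete, here is a minimal, library-free sketch of the two steps it involves: retrieve the passages most similar to a question, then prepend them (tagged with their source) to the prompt. It swaps in a bag-of-words cosine similarity for a real embedding model, and the corpus, source names, and function names are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def _vec(text):
    """Bag-of-words term counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, top_k=2):
    """Return the top_k (source, text) pairs most similar to the query."""
    q = _vec(query)
    ranked = sorted(docs.items(), key=lambda kv: _cosine(q, _vec(kv[1])), reverse=True)
    return ranked[:top_k]


def augment(query, docs, top_k=2):
    """Build a prompt with retrieved context and report which sources were used."""
    hits = retrieve(query, docs, top_k)
    context = "\n".join(f"[{src}] {text}" for src, text in hits)
    return f"Context:\n{context}\n\nQuestion: {query}", [src for src, _ in hits]


# Hypothetical mini corpus; source names and sentences are for illustration only.
docs = {
    "note_a": "Nacre is a layered composite of aragonite platelets and biopolymer.",
    "note_b": "Spider silk combines high strength with high extensibility.",
    "note_c": "Bone is a hierarchical composite of collagen and mineral.",
}
prompt, sources = augment("Why is nacre tough?", docs, top_k=1)
print(sources)  # ['note_a']
print(prompt)
```

Because each retrieved passage carries its source tag, the answer can be traced back to the documents that informed it, and updating the knowledge base only requires changing `docs`, not retraining the model.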

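📦 Installation

The source documentation does not list installation steps. The examples on this page rely on the transformers, accelerate, and torch packages, and the RAG example additionally uses llama_index and chromadb.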

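💻 Usage Examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map='auto' relies on the accelerate package to place the weights on the
# available GPU(s), falling back to CPU when no GPU is present.
model = AutoModelForCausalLM.from_pretrained('lamm-mit/BioinspiredLLM', device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('lamm-mit/BioinspiredLLM')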

Advanced usage

# Retrieval-augmented generation (RAG) example based on Llama Index.
# The imports follow the legacy llama_index (pre-0.10) package layout.
from llama_index import VectorStoreIndex
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts.prompts import SimpleInputPrompt

eos_token = 32000  # extra stop-token id, used alongside id 2

system_prompt = "You are BioinspiredLLM. You are knowledgeable in biological and bio-inspired materials and provide accurate and qualitative insights about biological materials found in Nature. You are a cautious assistant. You think step by step. You carefully follow instructions."

# Wrap every query in the same chat template used for direct generation.
query_wrapper_prompt = SimpleInputPrompt("<|im_start|>system\n" + system_prompt + "<|im_end|>\n<|im_start|>user\n{query_str}<|im_end|>\n<|im_start|>assistant")

# Reuse the model and tokenizer loaded earlier as the LLM behind the query engine.
llm_custom = HuggingFaceLLM(context_window=2048,
                            max_new_tokens=300,
                            query_wrapper_prompt=query_wrapper_prompt,
                            stopping_ids=[eos_token, 2],
                            model=model,
                            generate_kwargs={"temperature": 0.1, "do_sample": True,
                                             "repetition_penalty": 1.1, "top_p": 0.95, "top_k": 50,
                                             "eos_token_id": [eos_token, 2],
                                             },
                            tokenizer=tokenizer)
llm_custom.model_name = 'BioinspiredLLM'

# Use a Chroma database collection as the vector store
import chromadb
from llama_index.vector_stores import ChromaVectorStore

coll_name = "Bioinspired"
coll_path = './Bioinspired_Chroma'    ## PATH TO CHROMA DATABASE

# Open the on-disk database once and fetch the collection (created if missing).
client = chromadb.PersistentClient(path=coll_path)
chroma_collection = client.get_or_create_collection(coll_name)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Sanity check: 0 means the collection is empty and retrieval will return nothing.
print(chroma_collection.count())

# Set up the service context with the custom LLM, then build the vector store index
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    llm=llm_custom,
    chunk_size=1024,
    embed_model="local:BAAI/bge-large-en"  # embedding model run locally
)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)

# Set up the query engine
from IPython.display import Markdown, display

query_engine = index.as_query_engine(
    #response_mode="tree_summarize",
    #response_mode='compact',
    #response_mode='accumulate',
    #streaming=True,
    similarity_top_k=5,  # number of retrieved chunks passed to the LLM
)

question = "Which horn does not have tubules? A) big horn sheep B) pronghorn C) mountain goat"
response = query_engine.query(question)
display(Markdown(f"<b>{response}</b>"))
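
When questions are posed in the multiple-choice form used above ("A) ... B) ... C) ..."), it can be convenient to pull the chosen letter out of the free-text response. The following is a minimal sketch under the assumption that the model repeats the option label in the form `X)`; the helper name `extract_choice` and the sample responses are hypothetical.

```python
import re


def extract_choice(response_text, letters="ABC"):
    """Return the first option label of the form 'X)' in the text, or None.

    Requiring the closing parenthesis avoids mistaking the article "A"
    for an answer.
    """
    match = re.search(rf"\b([{letters}])\)", response_text)
    return match.group(1) if match else None


# Hypothetical response strings, used only to exercise the parser.
print(extract_choice("I would pick C) mountain goat."))        # C
print(extract_choice("A horn is made of keratin. (B) fits."))  # B
print(extract_choice("No option label appears here."))         # None
```

Responses that return `None` are best reviewed by hand rather than counted as wrong automatically.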

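📚 Detailed Documentation

Dataset

Dataset link: https://onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fadvs.202306724&file=advs7235-sup-0002-SuppMat.csv

Performance evaluation

[Figure: performance evaluation]

Overall score: total scores of BioinspiredLLM, Llama 13b-chat, Orca-2 13b, Llama-BioLLM, and other models on a 100-question biological materials exam.
Category scores: exam scores broken down by question category (general, specific, numerical, and non-biological).
Retrieval-augmented generation (RAG): the RAG framework, plus two example responses from BioinspiredLLM when supplemented with RAG, shown together with the sources of the retrieved content.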
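
The overall and per-category scores described above amount to simple tallies over graded answers. The sketch below shows one way to compute them. The category names come from the evaluation description, but the graded answers are made-up placeholders (not results from the paper) and `score_by_category` is an illustrative name.

```python
from collections import defaultdict


def score_by_category(results):
    """Compute the fraction of correct answers per category and overall.

    results: iterable of (category, is_correct) pairs.
    Returns (scores_by_category, overall_score).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    if not totals:
        return {}, 0.0  # nothing graded yet
    scores = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return scores, overall


# Hypothetical graded answers, NOT results from the paper.
results = [
    ("general", True), ("general", True),
    ("specific", True), ("specific", False),
    ("numerical", False), ("non-biological", True),
]
scores, overall = score_by_category(results)
print(scores)
print(f"overall: {overall:.2f}")
```

Reporting both views matters because a strong overall score can hide a weak category, such as numerical questions.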

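🔧 Technical Details

BioinspiredLLM is a 13-billion-parameter model fine-tuned from Orca-2 (https://huggingface.co/microsoft/Orca-2-13b), which is itself built on the LLaMA-2 13b base model. For details of the model architecture, see the LLaMA-2 technical report; the BioinspiredLLM paper is available at https://onlinelibrary.wiley.com/doi/full/10.1002/advs.202306724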

📄 License

Orca 2: subject to the Microsoft Research License (https://huggingface.co/microsoft/Orca-2-13b/blob/main/LICENSE).
Llama 2: subject to the LLAMA 2 Community License (https://ai.meta.com/llama/license/).

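Bias, Risks, and Limitations

Possibility of error: although BioinspiredLLM inherits the base model's tendency to avoid propagating false information and to produce safer responses, researchers should still verify its responses and avoid spreading errors.
Data bias: the model may carry biases present in its source data, which can lead to biased or unfair output.
Limited contextual understanding: the model's grasp of the real world is limited, which can lead to inaccurate or nonsensical output.
Lack of transparency: because of the model's complexity and size, it is hard to understand the rationale behind a particular output or decision.
Content harms: the model can cause various types of content harm; using a content moderation service is recommended.
Hallucination: the model may fabricate content, so caution is needed when relying on it for critical decisions or information.
Potential for misuse: without suitable safeguards, the model could be used maliciously to generate disinformation or harmful content.

The model is intended for research settings only. Before it is used in downstream applications, additional analysis is needed to assess potential harms or biases.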

Citation

Rachel K. Luu and Markus J. Buehler, "BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials," arxiv.org/abs/2309.08788; Adv. Science, 2023, https://doi.org/10.1002/advs.202306724