ner-gene-dna-rna-jnlpba-pubmed open-source model - recognizing genes and other biomedical entities

Ner Gene Dna Rna Jnlpba Pubmed

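Developed by raynardj

This model is trained on the JNLPBA dataset by fine-tuning a RoBERTa model pretrained on PubMed text. It is built to recognize biomedical entities such as genes, DNA, RNA, and proteins.

Sequence labeling

Transformers

Supports multiple languages · License: Apache-2.0 · Tags: #BiomedicalNER #GeneEntityRecognition #RoBERTaFineTuning

Downloads: 149

Released: 3/2/2022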

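Model overview

A named entity recognition model for the biomedical domain that identifies biomolecular entities such as genes, DNA, RNA, and proteins in text.

Model features

Biomedical entity recognition

Optimized specifically for biomedical entities such as genes, DNA, RNA, and proteins.

Pretrained on PubMed data

The base model was pretrained on PubMed biomedical literature, which adapts it to the domain.

Simplified label scheme

The conventional 'B-' and 'I-' prefixes are removed, leaving one plain label per entity type.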
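
To illustrate what the simplified scheme means in practice, here is a minimal sketch that collapses conventional BIO tags into the bare labels this model uses. The helper name `strip_bio_prefix` is hypothetical (it is not part of the model or of transformers), and the sample tags are hand-written rather than taken from the dataset.

```python
def strip_bio_prefix(tag: str) -> str:
    """Collapse a BIO-style tag such as 'B-DNA' or 'I-DNA' to the bare label 'DNA'."""
    # Only strip the two conventional prefixes; leave 'O' and bare labels untouched.
    if tag.startswith(("B-", "I-")):
        return tag[2:]
    return tag


# Hand-written sample tags in the BIO style, for illustration only.
bio_tags = ["O", "B-DNA", "I-DNA", "O", "B-protein", "B-cell_line", "I-cell_line"]
simplified = [strip_bio_prefix(t) for t in bio_tags]
print(simplified)
```

One consequence of dropping the prefixes is that two directly adjacent entities of the same type can no longer be told apart from the labels alone.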

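Model capabilities

Recognize gene entities

Recognize DNA entities

Recognize RNA molecules

Recognize proteins

Recognize cell lines

Recognize cell types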

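Use cases

Biomedical literature mining

Gene literature analysis

Extract gene- and protein-related information from biomedical literature.

Identifies the various biomolecular entities mentioned in the literature.

Biomedical knowledge graph construction

Serves as a preprocessing step for knowledge graph construction by identifying the biological entities in text.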
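
As a rough sketch of how recognized entities might feed a knowledge graph, the snippet below counts how often two entity mentions occur in the same sentence. Everything here is an assumption made for illustration: the function `cooccurrence_edges`, the per-sentence record layout (`word` and `entity` keys), and the sample records, which are hand-written placeholders rather than real model output.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences_entities):
    """Count how often two distinct entity mentions appear in the same sentence."""
    edges = Counter()
    for entities in sentences_entities:
        # Deduplicate within a sentence and sort so each pair has one canonical order.
        nodes = sorted({(e["word"], e["entity"]) for e in entities})
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return edges


# Hand-written placeholder records, not real model output.
sample = [
    [{"word": "IL-2", "entity": "protein"},
     {"word": "T cells", "entity": "cell_type"}],
    [{"word": "IL-2", "entity": "protein"},
     {"word": "T cells", "entity": "cell_type"},
     {"word": "NF-kappa B", "entity": "protein"}],
]
edges = cooccurrence_edges(sample)
for (a, b), count in edges.items():
    print(a, b, count)
```

Co-occurrence is only a starting point; a real graph would normally add relation extraction and entity normalization on top of it.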

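Bioinformatics research

Experimental data analysis

Helps researchers extract key biomolecular information from descriptions of experimental data.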

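🚀 Named entity recognition for genes and gene products

This project uses named entity recognition (NER) to identify genes and gene products in text. The model is trained on the jnlpba dataset by fine-tuning a pubmed-pretrained roberta model, so it handles bioinformatics text and recognizes key entities such as DNA, RNA, and proteins.

🚀 Quick start

Environment setup

Make sure the transformers and pandas libraries are installed. You can install them with:

pip install transformers pandas

Model usage

The following basic example runs named entity recognition with the model: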

from transformers import pipeline

PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("Your text", aggregation_strategy="first")
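
The pipeline returns a list of dictionaries with character offsets. As far as the transformers documentation describes it, aggregated results carry an `entity_group` key, while un-aggregated results (`aggregation_strategy="none"`) carry `entity` and `index` instead. The sketch below reads either shape; `spans_by_label` and `fake_entity` are hypothetical helpers, and the records are hand-built placeholders, not actual predictions from this model.

```python
def spans_by_label(text, entities):
    """Map each label to the text spans found for it, using character offsets."""
    grouped = {}
    for ent in entities:
        # Aggregated results carry "entity_group"; un-aggregated ones carry "entity".
        label = ent.get("entity_group", ent.get("entity"))
        grouped.setdefault(label, []).append(text[ent["start"]:ent["end"]])
    return grouped


text = "IL-2 gene expression requires NF-kappa B"


def fake_entity(label, surface):
    # Build a placeholder record shaped like an aggregated pipeline result.
    start = text.index(surface)
    return {"entity_group": label, "score": 1.0, "word": surface,
            "start": start, "end": start + len(surface)}


# Hand-written placeholders; the labels here are NOT real predictions.
mock_output = [fake_entity("DNA", "IL-2 gene"), fake_entity("protein", "NF-kappa B")]
print(spans_by_label(text, mock_output))
```

Slicing the original text with `start` and `end` avoids depending on how the tokenizer rendered the `word` field.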

Output processing

To make the output more coherent, we provide the following code example:

import pandas as pd
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

def clean_output(outputs):
    results = []
    current = []
    last_idx = 0
    # group consecutive tokens into sub-groups by position
    for output in outputs:
        if output["index"]-1==last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output, ]
        last_idx = output["index"]
    if len(current)>0:
        results.append(current)
    
    # from tokens to string
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])

        new_str = tokenizer.convert_tokens_to_string(tokens)
        if new_str!='':
            strings.append(dict(
                word=new_str,
                start = min(starts),
                end = max(ends),
                entity = c[0]['entity']
            ))
    return strings

def entity_table(pipeline, **pipeline_kw):
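    # clean_output reads the per-token "index" and "entity" keys, which the
    # pipeline only returns with aggregation_strategy="none", so default to that.
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "none"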
    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

# will return a dataframe
entity_table(ner)("YOUR_VERY_CONTENTFUL_TEXT")
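
The grouping step in `clean_output` merges tokens whose `index` values are consecutive. The same idea in isolation, as a minimal sketch: `group_consecutive` is a hypothetical helper written for this illustration, and the index values are made up. Unlike the original loop, this variant never emits an empty group.

```python
def group_consecutive(indices):
    """Split a sorted list of token indices into runs of consecutive values."""
    groups = []
    for idx in indices:
        # Extend the current run when idx directly follows its last element.
        if groups and idx == groups[-1][-1] + 1:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


# Token positions that a tagger might flag as non-"O" (made-up values).
runs = group_consecutive([3, 4, 5, 9, 10, 14])
print(runs)
```

Each run corresponds to one candidate entity; `clean_output` then joins the tokens of a run into a single string and takes the label of the first token.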

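✨ Key features

Multi-entity recognition: recognizes several bioinformatics entity types, including DNA, RNA, protein, cell line, and cell type.
Simplified labels: the 'B-' and 'I-' prefixes are removed from the data labels, making the labels more concise.
Output processing: helper functions merge the results into coherent entity strings for downstream analysis.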

💻 Usage examples

Basic usage

from transformers import pipeline

PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains", aggregation_strategy="first")

Advanced usage

import pandas as pd
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

# Define the output-processing functions
def clean_output(outputs):
    results = []
    current = []
    last_idx = 0
    # Group consecutive tokens by position
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # Convert tokens to strings
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])

        new_str = tokenizer.convert_tokens_to_string(tokens)
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]['entity']
            ))
    return strings

def entity_table(pipeline, **pipeline_kw):
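    # clean_output reads the per-token "index" and "entity" keys, which the
    # pipeline only returns with aggregation_strategy="none", so default to that.
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "none"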

    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

# Returns a DataFrame
entity_table(ner)("It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains")

📚 Detailed documentation

Label reference

All possible labels and their corresponding IDs:

{"label2id": {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1
  }
 }

Note that the 'B-', 'I-', and similar prefixes have been removed from the data labels. 🗡
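
For reference, the mapping above can be inverted to turn predicted class IDs back into label names. This is a small sketch: the `label2id` values are copied from the table above, while `decode` and the sample ID sequence are made up for illustration.

```python
label2id = {"DNA": 2, "O": 0, "RNA": 5, "cell_line": 4, "cell_type": 3, "protein": 1}

# Invert the mapping so predicted class IDs can be turned back into label names.
id2label = {idx: label for label, idx in label2id.items()}


def decode(ids):
    """Translate a sequence of class IDs into label names."""
    return [id2label[i] for i in ids]


# Made-up ID sequence, for illustration only.
decoded = decode([0, 2, 2, 0, 1, 5])
print(decoded)
```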

Other models

You can also check out our other NER models:

📄 License

This project is licensed under Apache-2.0.
