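🚀 Named Entity Recognition for Genes and Gene Products
This project uses named entity recognition (NER) to identify genes and gene products in text. The model was fine-tuned on the jnlpba dataset, starting from a PubMed-pretrained RoBERTa model, and tags DNA, RNA, protein, cell line, and cell type mentions in biomedical text.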
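🚀 Quick Start
Environment setup
Make sure the transformers and pandas libraries are installed. The pipeline also needs a deep learning backend such as PyTorch. You can install the two libraries with:
pip install transformers pandas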
Using the model
The following basic example runs named entity recognition with the model:
from transformers import pipeline
PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("Your text", aggregation_strategy="first")
Output processing
To make the output more coherent, the following helper merges consecutive token predictions into whole entity spans:
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

def clean_output(outputs):
    # Group predictions whose token indices are consecutive.
    results = []
    current = []
    last_idx = 0
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # Merge each group into one record: text, character offsets, label.
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o["word"])
            starts.append(o["start"])
            ends.append(o["end"])
        new_str = tokenizer.convert_tokens_to_string(tokens)
        # Empty groups produce an empty string and are skipped.
        if new_str != "":
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]["entity"],
            ))
    return strings
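def entity_table(pipeline, **pipeline_kw):
    # clean_output reads the token-level "index" and "entity" keys. In current
    # transformers releases the pipeline returns those keys only when
    # aggregation_strategy is "none"; aggregated output uses "entity_group".
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "none"

    def create_table(text):
        # Run the pipeline, merge tokens into spans, and return a DataFrame.
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

entity_table(ner)("YOUR_VERY_CONTENTFUL_TEXT")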
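To see what the grouping step does without downloading the model, the sketch below applies the same consecutive-index rule to a few hand-written token records. The sample sentence, the records, and the helper name group_consecutive are made up for illustration. Unlike clean_output, it slices the span text from the source string by character offsets instead of calling tokenizer.convert_tokens_to_string.

```python
def group_consecutive(records, text):
    """Merge token-level records whose `index` values are consecutive."""
    groups = []
    current = []
    last_idx = None
    for rec in records:
        if last_idx is not None and rec["index"] - 1 == last_idx:
            current.append(rec)
        else:
            if current:
                groups.append(current)
            current = [rec]
        last_idx = rec["index"]
    if current:
        groups.append(current)

    spans = []
    for group in groups:
        start = min(r["start"] for r in group)
        end = max(r["end"] for r in group)
        spans.append({
            "word": text[start:end],         # slice by character offsets
            "start": start,
            "end": end,
            "entity": group[0]["entity"],    # label of the first token
        })
    return spans


# Hypothetical token-level records (not real model output).
# Tokens labeled "O" are assumed to be dropped already, which leaves
# a gap in the index sequence between the two entities.
text = "IL-2 gene expression requires NF-kappa B"
records = [
    {"index": 1, "entity": "DNA", "start": 0, "end": 2},
    {"index": 2, "entity": "DNA", "start": 2, "end": 4},
    {"index": 3, "entity": "DNA", "start": 5, "end": 9},
    {"index": 6, "entity": "protein", "start": 30, "end": 32},
    {"index": 7, "entity": "protein", "start": 32, "end": 33},
    {"index": 8, "entity": "protein", "start": 33, "end": 38},
    {"index": 9, "entity": "protein", "start": 39, "end": 40},
]

spans = group_consecutive(records, text)
for span in spans:
    print(span)
```

The gap between index 3 and index 6 is what separates the two spans; adjacent tokens with different labels would still be merged under the first token's label, which mirrors the behavior of clean_output.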
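✨ Key Features
- Multiple entity types: recognizes DNA, RNA, protein, cell line, and cell type mentions.
- Simplified labels: the 'B-' and 'I-' prefixes are removed from the dataset labels, so each entity type has a single label.
- Output processing: helper functions merge token-level predictions into coherent spans for downstream analysis.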
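The label simplification can be pictured with a short sketch. The preprocessing actually used for training is not shown in this README, so the function name strip_bio_prefix and the sample tags below are assumptions for illustration only.

```python
def strip_bio_prefix(label):
    """Drop a leading 'B-' or 'I-' from a BIO tag; leave other labels unchanged."""
    if label.startswith(("B-", "I-")):
        return label[2:]
    return label


# Hypothetical BIO-style tags as they might appear in a raw dataset.
raw_tags = ["B-protein", "I-protein", "O", "B-DNA", "I-DNA", "B-cell_line"]
simplified = [strip_bio_prefix(tag) for tag in raw_tags]
print(simplified)
```

One consequence of this design is that two adjacent entities of the same type can no longer be told apart by label alone.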
💻 Usage Examples
Basic usage
from transformers import pipeline
PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains", aggregation_strategy="first")
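With an aggregation strategy such as "first", current transformers releases return one dict per entity with the keys entity_group, score, word, start, and end. The sketch below filters such a list by entity type and score. The helper name filter_entities is hypothetical, and the sample records (including their scores) are invented for illustration rather than taken from real model output.

```python
def filter_entities(entities, label, min_score=0.5):
    """Keep aggregated entities of one type whose score reaches min_score."""
    return [
        e for e in entities
        if e["entity_group"] == label and e["score"] >= min_score
    ]


# Hypothetical aggregated records in the shape the pipeline returns.
sample = [
    {"entity_group": "DNA", "score": 0.98, "word": "exons", "start": 18, "end": 23},
    {"entity_group": "protein", "score": 0.91, "word": "glycoprotein", "start": 52, "end": 64},
    {"entity_group": "protein", "score": 0.42, "word": "transmembrane domains", "start": 88, "end": 109},
]

proteins = filter_entities(sample, "protein", min_score=0.5)
print([e["word"] for e in proteins])
```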
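Advanced usage
This example builds an entity table with the clean_output and entity_table helpers from the Output processing section: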
entity_table(ner)("It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains")
📚 Detailed Documentation
Label reference
All possible labels and their IDs:
{"label2id": {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1
  }
}
Note that the 'B-' and 'I-' prefixes were removed from the dataset labels. 🗡
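When working with raw class IDs rather than pipeline output, the mapping above can be inverted to recover label names. This is a minimal sketch: the label2id values are copied from the table above, while the predicted_ids list is a made-up example.

```python
# Mapping copied from the label table in this README.
label2id = {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1,
}

# Invert the mapping so class IDs can be turned back into label names.
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical per-token class IDs, for illustration only.
predicted_ids = [0, 1, 1, 0, 2]
labels = [id2label[i] for i in predicted_ids]
print(labels)
```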
Other models
You can also check out our other NER models:
📄 License
This project is released under the Apache-2.0 license.