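🚀 Named Entity Recognition for Genes and Gene Products
This project uses named entity recognition (NER) to identify genes and gene products in text. The model was trained on the jnlpba dataset, starting from a pubmed-pretrained roberta model, and is intended for biomedical and bioinformatics text, where it tags entities such as DNA, RNA, and proteins.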
🚀 Quick Start
Environment Setup
Make sure the transformers and pandas libraries are installed. You can install them with:
pip install transformers pandas
Using the Model
A basic example of running named entity recognition with this model:
from transformers import pipeline
PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("Your text", aggregation_strategy="first")
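Post-processing the Output
To make the output more coherent, the following helpers merge adjacent tokens into whole entities and return them as a table: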
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

def clean_output(outputs):
    # Expects token-level dicts with "index", "word", "start", "end" and "entity" keys.
    # Note: depending on the transformers version, aggregated results may expose
    # "entity_group" (and no "index") instead; adjust the keys if so.
    results = []
    current = []
    last_idx = 0
    # Group predictions whose token indices are consecutive.
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # Merge each group into one entity string with its character span.
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])
        new_str = tokenizer.convert_tokens_to_string(tokens)
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]['entity']
            ))
    return strings

def entity_table(pipeline, **pipeline_kw):
    # Default to the "first" aggregation strategy unless the caller overrides it.
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "first"

    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

entity_table(ner)("YOUR_VERY_CONTENTFUL_TEXT")
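The grouping step inside clean_output can be tried without downloading the model. The following is a minimal sketch, assuming token-level predictions shaped like the dicts the helper reads; the `group_consecutive` name and the mock predictions are made up for illustration and are not real model output.

```python
def group_consecutive(outputs):
    """Group token-level predictions whose "index" values are consecutive."""
    groups = []
    current = []
    last_idx = None
    for output in outputs:
        if last_idx is not None and output["index"] - 1 == last_idx:
            current.append(output)
        else:
            # A gap in the token indices starts a new group.
            if current:
                groups.append(current)
            current = [output]
        last_idx = output["index"]
    if current:
        groups.append(current)
    return groups


# Hypothetical predictions, invented for this sketch.
mock_outputs = [
    {"index": 3, "word": "gly", "entity": "protein", "start": 10, "end": 13},
    {"index": 4, "word": "co", "entity": "protein", "start": 13, "end": 15},
    {"index": 5, "word": "protein", "entity": "protein", "start": 15, "end": 22},
    {"index": 9, "word": "exons", "entity": "DNA", "start": 30, "end": 35},
]

groups = group_consecutive(mock_outputs)
# One (label, start, end) triple per group, using the widest character span.
spans = [
    (g[0]["entity"], min(o["start"] for o in g), max(o["end"] for o in g))
    for g in groups
]
print(spans)
```

Like the helper above, this sketch groups on token position only, so two adjacent entities of different types would end up in the same group and take the label of the first token.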
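✨ Key Features
- Multiple entity types: recognizes several kinds of bioinformatics-related entities, including DNA, RNA, proteins, cell lines, and cell types.
- Simplified labels: the 'B-' and 'I-' prefixes have been removed from the data labels, so each label is just the entity type.
- Output post-processing: helper functions are provided to merge the raw predictions into coherent entities for downstream analysis.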
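To illustrate what label simplification means in practice, here is a minimal sketch of stripping BIO prefixes from a tag sequence. It is an assumption about how such a step could look, not the project's actual preprocessing code; `strip_bio_prefix` and the sample tags are hypothetical.

```python
def strip_bio_prefix(tag):
    """Drop a leading 'B-' or 'I-' from a BIO tag; leave other tags unchanged."""
    if tag[:2] in ("B-", "I-"):
        return tag[2:]
    return tag


# Hypothetical BIO-tagged sequence, invented for this sketch.
raw_tags = ["O", "B-protein", "I-protein", "O", "B-cell_type", "B-DNA", "I-DNA"]
simplified = [strip_bio_prefix(t) for t in raw_tags]
print(simplified)
```

One consequence of this design is that two entities of the same type that sit directly next to each other can no longer be told apart from the labels alone, since the 'B-' marker that signaled the start of the second entity is gone.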
📦 Installation
Install the required libraries with pip:
pip install transformers pandas
💻 Usage Examples
Basic Usage
from transformers import pipeline
PRETRAINED = "raynardj/ner-gene-dna-rna-jnlpba-pubmed"
ner = pipeline(task="ner", model=PRETRAINED, tokenizer=PRETRAINED)
ner("It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains", aggregation_strategy="first")
Advanced Usage
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)

def clean_output(outputs):
    results = []
    current = []
    last_idx = 0
    # Group predictions whose token indices are consecutive.
    for output in outputs:
        if output["index"] - 1 == last_idx:
            current.append(output)
        else:
            results.append(current)
            current = [output]
        last_idx = output["index"]
    if len(current) > 0:
        results.append(current)

    # Merge each group into one entity string with its character span.
    strings = []
    for c in results:
        tokens = []
        starts = []
        ends = []
        for o in c:
            tokens.append(o['word'])
            starts.append(o['start'])
            ends.append(o['end'])
        new_str = tokenizer.convert_tokens_to_string(tokens)
        if new_str != '':
            strings.append(dict(
                word=new_str,
                start=min(starts),
                end=max(ends),
                entity=c[0]['entity']
            ))
    return strings

def entity_table(pipeline, **pipeline_kw):
    if "aggregation_strategy" not in pipeline_kw:
        pipeline_kw["aggregation_strategy"] = "first"

    def create_table(text):
        return pd.DataFrame(
            clean_output(
                pipeline(text, **pipeline_kw)
            )
        )
    return create_table

entity_table(ner)("It consists of 25 exons encoding a 1,278-amino acid glycoprotein that is composed of 13 transmembrane domains")
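Each merged entity carries `start` and `end` character offsets, so the entity text can also be recovered by slicing the input string rather than re-joining tokens. The sketch below shows that alternative under stated assumptions: the entity dicts are built by hand, the labels are illustrative guesses rather than real model predictions, and `mock_entity` and `attach_surface_text` are hypothetical helper names.

```python
text = "It consists of 25 exons encoding a 1,278-amino acid glycoprotein"


def mock_entity(surface, label):
    """Build a hypothetical entity dict by locating `surface` in `text`."""
    start = text.index(surface)
    return {"entity": label, "start": start, "end": start + len(surface)}


# Hand-made entities for illustration; not produced by the model.
entities = [mock_entity("exons", "DNA"), mock_entity("glycoprotein", "protein")]


def attach_surface_text(text, entities):
    """Return copies of the entity dicts with a 'word' sliced from the text."""
    return [dict(e, word=text[e["start"]:e["end"]]) for e in entities]


rows = attach_surface_text(text, entities)
for row in rows:
    print(row["entity"], row["start"], row["end"], row["word"])
```

Slicing by offsets preserves the original casing and spacing of the input, which can differ from the string rebuilt from subword tokens.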
📚 Documentation
Labels
All possible labels and their corresponding IDs:
{"label2id": {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1
  }
}
Note that the 'B-' and 'I-' prefixes have been removed from the data labels.
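When working with raw class IDs rather than pipeline output, the mapping needs to be inverted. A minimal sketch, using the label2id values listed here; the `predicted_ids` sequence is hypothetical and only serves to show the lookup.

```python
# Mapping copied from the label list in this document.
label2id = {
    "DNA": 2,
    "O": 0,
    "RNA": 5,
    "cell_line": 4,
    "cell_type": 3,
    "protein": 1,
}

# Invert the mapping so that class IDs can be turned back into label names.
id2label = {idx: label for label, idx in label2id.items()}

# Hypothetical per-token class IDs, invented for this sketch.
predicted_ids = [0, 1, 1, 0, 2]
decoded = [id2label[i] for i in predicted_ids]
print(decoded)
```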
Other Models
You can also check out our other NER models.
📄 License
This project is licensed under Apache-2.0.