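🚀 GLiNER - Named Entity Recognition Model
GLiNER is a named entity recognition (NER) model that uses a bidirectional Transformer encoder (similar to BERT) to identify entities of any type. It offers a practical alternative to traditional NER models, which are limited to predefined entity types, and to large language models (LLMs), which are flexible but costly and too large for resource-constrained settings.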
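✨ Key Features
- Multi-entity recognition: identifies an unlimited number of entities in a single pass.
- Faster inference: inference is faster when entity embeddings are precomputed.
- Strong generalization: generalizes better to unseen entities.
- Better label understanding: the post-fusion strategy captures relationships between labels better than a traditional bi-encoder.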
📦 Installation
Install or update the gliner package:
pip install gliner -U
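To confirm which version ended up installed, a quick check with the standard library may help. This is a minimal sketch; the helper name `installed_version` is illustrative and is not part of gliner.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("gliner")
    print("gliner version:", found if found else "not installed")
```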
💻 Usage Examples
Basic Usage
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-poly-small-v1.0")
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""
labels = ["person", "award", "date", "competitions", "teams"]
entities = model.predict_entities(text, labels, threshold=0.25)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
Running the code above produces the following output:
Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
UEFA Champions Leagues => competitions
UEFA European Championship => competitions
UEFA Nations League => competitions
Champions League => competitions
European Championship => competitions
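For downstream use it is often handier to collect the results per label. The sketch below relies only on the "text" and "label" keys used in the basic example; `group_by_label` is a hypothetical helper, not a gliner function, and the sample list is a hand-copied subset of the output above.

```python
from collections import defaultdict


def group_by_label(entities):
    """Map each label to the unique entity texts found for it, in first-seen order."""
    grouped = defaultdict(list)
    for entity in entities:
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return dict(grouped)


# A few entries copied from the output above; only "text" and "label" are used.
sample = [
    {"text": "Al Nassr", "label": "teams"},
    {"text": "Portugal national team", "label": "teams"},
    {"text": "Ballon d'Or", "label": "award"},
    {"text": "UEFA Nations League", "label": "competitions"},
]

for label, texts in group_by_label(sample).items():
    print(label, "=>", texts)
```

With a loaded model, the same helper could be applied directly to the list returned by `model.predict_entities(text, labels, threshold=0.25)`.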
Advanced Usage
If you have a large number of entities and want to pre-embed them, refer to the following code snippet:
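labels = ["your entities"]
texts = ["your texts"]

# Encode the labels once so the embeddings can be reused across calls.
entity_embeddings = model.encode_labels(labels, batch_size=8)

# Pass the list of texts defined above together with the precomputed label embeddings.
outputs = model.batch_predict_with_embeds(texts, entity_embeddings, labels)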
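When the list of texts is long, the precomputed label embeddings can be reused while the texts are fed in smaller slices. The following is a minimal sketch under two assumptions: `predict_in_chunks` is a hypothetical helper (not part of gliner), and the prediction call is assumed to return one result per input text. A stand-in predictor is used so the sketch runs without downloading a model.

```python
def predict_in_chunks(texts, predict_fn, chunk_size=8):
    """Call predict_fn on consecutive slices of texts and concatenate the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    results = []
    for start in range(0, len(texts), chunk_size):
        results.extend(predict_fn(texts[start:start + chunk_size]))
    return results


# With a real model, predict_fn could wrap the call from the snippet above, e.g.:
#   predict_fn = lambda batch: model.batch_predict_with_embeds(
#       batch, entity_embeddings, labels)
# Stand-in used here (assumption: one entity list is returned per input text).
def fake_predict(batch):
    return [[] for _ in batch]


outputs = predict_in_chunks(["a", "b", "c"], fake_predict, chunk_size=2)
print(len(outputs))
```

The labels are encoded once outside the loop, so only the texts are processed again for each slice.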
📚 Documentation
Benchmarks
Below are GLiNER's benchmark results on a range of named entity recognition datasets:
| Dataset | Score |
|---|---|
| ACE 2004 | 25.4% |
| ACE 2005 | 27.2% |
| AnatEM | 17.7% |
| Broad Tweet Corpus | 70.2% |
| CoNLL 2003 | 67.8% |
| FabNER | 22.9% |
| FindVehicle | 40.2% |
| GENIA_NER | 47.7% |
| HarveyNER | 15.5% |
| MultiNERD | 64.5% |
| Ontonotes | 28.7% |
| PolyglotNER | 47.5% |
| TweetNER7 | 39.3% |
| WikiANN en | 56.7% |
| WikiNeural | 80.0% |
| bc2gm | 56.2% |
| bc4chemd | 48.7% |
| bc5cdr | 60.5% |
| ncbi | 53.5% |
| Average | 45.8% |
| | |
| CrossNER_AI | 48.9% |
| CrossNER_literature | 64.0% |
| CrossNER_music | 68.7% |
| CrossNER_politics | 69.0% |
| CrossNER_science | 62.7% |
| mit-movie | 40.3% |
| mit-restaurant | 36.2% |
| Zero-shot benchmark average | 55.7% |
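The two averages in the table can be reproduced from the per-dataset scores. The sketch below assumes they are unweighted arithmetic means rounded to one decimal place, which matches the reported figures; the variable and function names are illustrative.

```python
# Scores copied from the table, in percent.
general = [25.4, 27.2, 17.7, 70.2, 67.8, 22.9, 40.2, 47.7, 15.5, 64.5,
           28.7, 47.5, 39.3, 56.7, 80.0, 56.2, 48.7, 60.5, 53.5]
zero_shot = [48.9, 64.0, 68.7, 69.0, 62.7, 40.3, 36.2]


def mean_score(scores):
    """Unweighted mean of a list of scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


print("Average:", mean_score(general))                        # 45.8
print("Zero-shot benchmark average:", mean_score(zero_shot))  # 55.7
```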
Join Our Discord Community
Join our Discord community for the latest news about the models, technical support, and discussion.
📄 License
This project is licensed under the Apache 2.0 license.