gliner-poly-small-v1.0: an open-source named entity recognition model

Gliner Poly Small V1.0

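Developed by knowledgator

GLiNER is a flexible named entity recognition (NER) model that can identify any entity type, offering a practical alternative to both traditional NER models and large language models.

Sequence labeling

PyTorch

License: Apache-2.0 #multi-type entity recognition #zero-shot learning #bi-encoder architecture

Downloads: 18

Released: August 19, 2024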

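Model Overview

GLiNER is a named entity recognition (NER) model that uses a bidirectional Transformer encoder to identify any entity type. Traditional NER models are limited to a predefined set of entities; GLiNER is more flexible, while being lighter and cheaper to run than large language models (LLMs).

Model Features

Flexible recognition of any entity type

GLiNER can identify any entity type and is not restricted to a predefined entity set.

Bi-encoder architecture

The model uses a bi-encoder architecture with post-fusion. The text encoder is DeBERTa v3 small and the entity label encoder is BGE-small-en, which improves the model's understanding of relationships between labels.

Efficient inference

Inference is faster when the entity embeddings are computed in advance, and an unlimited number of entities can be recognized at once.

Strong generalization

The model generalizes better to unseen entities, which makes it suitable for diverse entity types.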

Model Capabilities

Named entity recognition

Multilingual support

Efficient inference

Flexible entity recognition

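Use Cases

Information extraction

Person and event extraction

Extract entities such as people, dates, awards, competitions, and teams from text.

As the usage example shows, the model identifies entities such as Cristiano Ronaldo, 5 February 1985, and the Ballon d'Or.

Social media analysis

Tweet entity recognition

Extract entities from social media posts for trend analysis or content classification.

In the benchmarks, the model scores 70.2% on the Broad Tweet Corpus dataset.

Biomedical text analysis

Medical entity recognition

Extract entities such as diseases and drugs from biomedical literature.

The model scores 60.5% on the bc5cdr dataset.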

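🚀 GLiNER - Named Entity Recognition Model

GLiNER is a named entity recognition (NER) model that uses a bidirectional Transformer encoder (similar to BERT) to identify entities of any type. It offers a practical alternative to traditional NER models, which are restricted to predefined entities, and to large language models (LLMs), which are flexible but costly and large for resource-constrained scenarios.

✨ Key Features

Multi-entity recognition: an unlimited number of entities can be recognized at once.
Faster inference: inference is faster when the entity embeddings are computed in advance.
Strong generalization: the model generalizes better to unseen entities.
Better label understanding: compared with a conventional bi-encoder, the post-fusion strategy captures relationships between labels more effectively.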

📦 Installation

Install or update the gliner package:

pip install gliner -U

💻 Usage Examples

Basic usage

from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-poly-small-v1.0")

text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""

labels = ["person", "award", "date", "competitions", "teams"]

entities = model.predict_entities(text, labels, threshold=0.25)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Running the code above produces the following output:

Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
UEFA Champions Leagues => competitions
UEFA European Championship => competitions
UEFA Nations League => competitions
Champions League => competitions
European Championship => competitions
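
Downstream code often wants the predictions organized by label rather than as a flat list. The following is a minimal sketch of one way to do that. It assumes only that each prediction is a dict with "text" and "label" keys, as used in the loop above. The helper name group_by_label is hypothetical, and the sample list is copied by hand from the first lines of the printed output rather than produced by the model.

```python
from collections import defaultdict

# Hand-copied sample in the shape used above: dicts with "text" and "label" keys.
# Real predictions may carry additional fields, which this sketch ignores.
entities = [
    {"text": "Cristiano Ronaldo dos Santos Aveiro", "label": "person"},
    {"text": "5 February 1985", "label": "date"},
    {"text": "Al Nassr", "label": "teams"},
    {"text": "Portugal national team", "label": "teams"},
    {"text": "Ballon d'Or", "label": "award"},
]


def group_by_label(entities):
    """Collect entity texts under their label, keeping order and dropping repeats."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["text"] not in grouped[entity["label"]]:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(label, "=>", texts)
```

Grouping this way makes it easy to feed each entity type into a different downstream step, for example sending dates to a parser and team names to a lookup table.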

Advanced usage

If you have a large number of entities and want to pre-embed them, refer to the following snippet:

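labels = ["your entities"]
texts = ["your texts"]

# Encode the labels once so the embeddings can be reused across texts
entity_embeddings = model.encode_labels(labels, batch_size=8)

# Run prediction on the batch of texts with the precomputed label embeddings
outputs = model.batch_predict_with_embeds(texts, entity_embeddings, labels)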
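
Pre-embedding pays off in a bi-encoder because labels are encoded independently of the input text, so each label vector can be computed once and reused. The toy sketch below illustrates only that caching idea. It does not use GLiNER: toy_encode is a hypothetical stand-in for a real label encoder, and LabelEmbeddingCache is an illustrative helper, not part of the gliner package.

```python
import math


def toy_encode(text, dim=8):
    """Hypothetical stand-in for a label encoder: maps characters into a unit vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class LabelEmbeddingCache:
    """Encodes each label once and returns the stored vector on later requests."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.store = {}
        self.encoder_calls = 0  # counts how often the encoder actually ran

    def get(self, labels):
        for label in labels:
            if label not in self.store:
                self.store[label] = self.encoder(label)
                self.encoder_calls += 1
        return [self.store[label] for label in labels]


cache = LabelEmbeddingCache(toy_encode)
labels = ["person", "award", "date"]

# Three rounds of requests (one per "document"), yet each label is encoded only once.
for _ in range(3):
    label_vectors = cache.get(labels)

print(cache.encoder_calls)
```

With a real encoder the saving grows with the number of labels and the number of texts, since the label side of the computation is no longer repeated per text.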

📚 Detailed Documentation

Benchmarks

The following are GLiNER's benchmark results on various named entity recognition datasets:

Dataset	Score
ACE 2004	25.4%
ACE 2005	27.2%
AnatEM	17.7%
Broad Tweet Corpus	70.2%
CoNLL 2003	67.8%
FabNER	22.9%
FindVehicle	40.2%
GENIA_NER	47.7%
HarveyNER	15.5%
MultiNERD	64.5%
Ontonotes	28.7%
PolyglotNER	47.5%
TweetNER7	39.3%
WikiANN en	56.7%
WikiNeural	80.0%
bc2gm	56.2%
bc4chemd	48.7%
bc5cdr	60.5%
ncbi	53.5%
Average	45.8%

CrossNER_AI	48.9%
CrossNER_literature	64.0%
CrossNER_music	68.7%
CrossNER_politics	69.0%
CrossNER_science	62.7%
mit-movie	40.3%
mit-restaurant	36.2%
Zero-shot benchmark average	55.7%
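
The reported averages can be reproduced from the per-dataset scores. The short check below assumes the averages are plain unweighted means rounded to one decimal place, which matches the figures in the table. The variable names are illustrative.

```python
# Scores copied from the first block of the table above (percent), in table order.
main_scores = [
    25.4, 27.2, 17.7, 70.2, 67.8, 22.9, 40.2, 47.7, 15.5, 64.5,
    28.7, 47.5, 39.3, 56.7, 80.0, 56.2, 48.7, 60.5, 53.5,
]

# Scores copied from the zero-shot block of the table above (percent).
zero_shot_scores = {
    "CrossNER_AI": 48.9,
    "CrossNER_literature": 64.0,
    "CrossNER_music": 68.7,
    "CrossNER_politics": 69.0,
    "CrossNER_science": 62.7,
    "mit-movie": 40.3,
    "mit-restaurant": 36.2,
}

# Unweighted means, rounded to one decimal place as in the table.
main_average = round(sum(main_scores) / len(main_scores), 1)
zero_shot_average = round(sum(zero_shot_scores.values()) / len(zero_shot_scores), 1)

print(main_average)
print(zero_shot_average)
```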