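🚀 GLiNER - Named Entity Recognition Model
GLiNER is a named entity recognition (NER) model that uses a bidirectional Transformer encoder (similar to BERT) to identify entities of any type. It offers a practical alternative to traditional NER models, which are limited to predefined entity types, and to large language models (LLMs), which are flexible but too costly and too large for resource-constrained scenarios.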
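✨ Key Features
- Multi-entity recognition: can recognize an unlimited number of entities in a single pass.
- Faster inference: inference is faster when the entity embeddings are precomputed.
- Strong generalization: generalizes better to unseen entities.
- Better label understanding: compared with a classical bi-encoder, the post-fusion strategy captures the relationships between labels more effectively.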
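To make the pre-embedding idea concrete, here is a toy sketch of how a bi-encoder can score text spans against label embeddings that were computed once up front. The hand-written vectors, the `score_spans` helper, and the sigmoid-plus-threshold rule are illustrative assumptions, not GLiNER's actual internals; the real model also applies a post-fusion step that this sketch leaves out.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sigmoid(x: float) -> float:
    """Map a raw similarity to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_spans(
    span_embeddings: Dict[str, Vector],
    label_embeddings: Dict[str, Vector],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (span, label, score) for every pair scoring at or above the threshold."""
    matches = []
    for span, span_vec in span_embeddings.items():
        for label, label_vec in label_embeddings.items():
            score = sigmoid(dot(span_vec, label_vec))
            if score >= threshold:
                matches.append((span, label, score))
    return matches


# Hypothetical vectors standing in for encoder outputs.
# The label embeddings are built once and can be reused for every new text.
label_embeddings = {"person": [1.0, 0.0], "teams": [0.0, 1.0]}
span_embeddings = {"Cristiano Ronaldo": [3.0, -3.0], "Al Nassr": [-3.0, 3.0]}

matches = score_spans(span_embeddings, label_embeddings, threshold=0.5)
for span, label, score in matches:
    print(f"{span} => {label} ({score:.2f})")
```

Because the label side does not depend on the input text, its cost is paid once, which is the intuition behind the faster inference listed above.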
📦 Installation
Install or update the gliner package:
pip install gliner -U
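After installing, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library; the `installed_version` helper is a name introduced here for illustration and is not part of gliner.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("gliner")
if version is None:
    print("gliner is not installed")
else:
    print(f"gliner version: {version}")
```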
💻 Usage Examples
Basic usage
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-poly-small-v1.0")
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""
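# Entity types to look for; any label string can be used
labels = ["person", "award", "date", "competitions", "teams"]
# Low-confidence predictions are filtered out by the threshold
entities = model.predict_entities(text, labels, threshold=0.25)
for entity in entities:
    print(entity["text"], "=>", entity["label"])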
Running the code above produces the following output:
Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
UEFA Champions Leagues => competitions
UEFA European Championship => competitions
UEFA Nations League => competitions
Champions League => competitions
European Championship => competitions
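The flat list of predictions is often easier to work with once it is grouped by label. Below is a minimal sketch that assumes each prediction is a dict with `text`, `label`, and `score` keys; the `group_by_label` helper and the sample scores are placeholders made up for illustration, not real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group entity texts by label, dropping low scores and repeated texts."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        # Treat a missing score as fully confident so plain records still work.
        if entity.get("score", 1.0) < min_score:
            continue
        texts = grouped[entity["label"]]
        if entity["text"] not in texts:
            texts.append(entity["text"])
    return dict(grouped)


# Hand-written records shaped like the predictions; scores are placeholders.
sample_entities = [
    {"text": "Cristiano Ronaldo dos Santos Aveiro", "label": "person", "score": 0.9},
    {"text": "Al Nassr", "label": "teams", "score": 0.8},
    {"text": "Portugal national team", "label": "teams", "score": 0.7},
    {"text": "Al Nassr", "label": "teams", "score": 0.6},
    {"text": "Champions League", "label": "competitions", "score": 0.3},
]

grouped = group_by_label(sample_entities, min_score=0.5)
for label, texts in grouped.items():
    print(label, "=>", texts)
```

With a real model, the list returned by `model.predict_entities(...)` could be passed in place of `sample_entities`.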
Advanced usage
If you have a large number of entities and want to pre-embed them, refer to the following snippet:
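labels = ["your entities"]
texts = ["your texts"]
# Encode the labels once so the embeddings can be reused across texts
entity_embeddings = model.encode_labels(labels, batch_size=8)
# Run prediction on all texts with the precomputed label embeddings
outputs = model.batch_predict_with_embeds(texts, entity_embeddings, labels)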
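When there are many texts, the label embeddings can be computed once and reused for each chunk of input. The sketch below relies only on the two calls shown above; the `chunked` and `predict_in_chunks` helpers, the `chunk_size` parameter, and the assumption that `batch_predict_with_embeds` returns one result per input text are illustrative and should be checked against your installed gliner version.

```python
from typing import Any, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(
    model: Any,
    texts: Sequence[str],
    labels: List[str],
    chunk_size: int = 16,
    label_batch_size: int = 8,
) -> List[Any]:
    """Encode the labels once, then predict over the texts chunk by chunk."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # The label embeddings do not depend on the texts, so compute them a single time.
    entity_embeddings = model.encode_labels(labels, batch_size=label_batch_size)
    outputs: List[Any] = []
    for chunk in chunked(texts, chunk_size):
        # Assumed to return one list of entities per text in the chunk.
        outputs.extend(model.batch_predict_with_embeds(list(chunk), entity_embeddings, labels))
    return outputs
```

With the `model`, `texts`, and `labels` from the snippet above, a call would look like `outputs = predict_in_chunks(model, texts, labels)`.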
📚 Detailed Documentation
Benchmarks
The following are GLiNER's benchmark results on various named entity recognition datasets:
| Dataset | Score |
|---|---|
| ACE 2004 | 25.4% |
| ACE 2005 | 27.2% |
| AnatEM | 17.7% |
| Broad Tweet Corpus | 70.2% |
| CoNLL 2003 | 67.8% |
| FabNER | 22.9% |
| FindVehicle | 40.2% |
| GENIA_NER | 47.7% |
| HarveyNER | 15.5% |
| MultiNERD | 64.5% |
| Ontonotes | 28.7% |
| PolyglotNER | 47.5% |
| TweetNER7 | 39.3% |
| WikiANN en | 56.7% |
| WikiNeural | 80.0% |
| bc2gm | 56.2% |
| bc4chemd | 48.7% |
| bc5cdr | 60.5% |
| ncbi | 53.5% |
| Average score | 45.8% |
| | |
| CrossNER_AI | 48.9% |
| CrossNER_literature | 64.0% |
| CrossNER_music | 68.7% |
| CrossNER_politics | 69.0% |
| CrossNER_science | 62.7% |
| mit-movie | 40.3% |
| mit-restaurant | 36.2% |
| Zero-shot benchmark average score | 55.7% |
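The two average rows can be reproduced from the per-dataset scores. The following sketch recomputes them with the standard library; the variable names `main_scores` and `zero_shot_scores` are introduced here only to separate the two groups in the table.

```python
from statistics import mean

# Per-dataset scores (in percent) copied from the first group of the table.
main_scores = {
    "ACE 2004": 25.4, "ACE 2005": 27.2, "AnatEM": 17.7,
    "Broad Tweet Corpus": 70.2, "CoNLL 2003": 67.8, "FabNER": 22.9,
    "FindVehicle": 40.2, "GENIA_NER": 47.7, "HarveyNER": 15.5,
    "MultiNERD": 64.5, "Ontonotes": 28.7, "PolyglotNER": 47.5,
    "TweetNER7": 39.3, "WikiANN en": 56.7, "WikiNeural": 80.0,
    "bc2gm": 56.2, "bc4chemd": 48.7, "bc5cdr": 60.5, "ncbi": 53.5,
}

# Scores (in percent) copied from the zero-shot group of the table.
zero_shot_scores = {
    "CrossNER_AI": 48.9, "CrossNER_literature": 64.0, "CrossNER_music": 68.7,
    "CrossNER_politics": 69.0, "CrossNER_science": 62.7,
    "mit-movie": 40.3, "mit-restaurant": 36.2,
}

main_average = round(mean(main_scores.values()), 1)
zero_shot_average = round(mean(zero_shot_scores.values()), 1)

print(f"Average score: {main_average}%")                       # Average score: 45.8%
print(f"Zero-shot benchmark average: {zero_shot_average}%")    # Zero-shot benchmark average: 55.7%
```

Both values match the averages reported in the table after rounding to one decimal place.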
Join Our Discord Community
Join our Discord community for the latest news about the model, technical support, and discussion.
📄 License
This project is licensed under the Apache 2.0 license.