Gliner_large_news-v2.1 Open-Source News Entity Recognition Model - Efficiently Extract News Entities from Long Texts

Gliner Large News V2.1

Developed by EmergentMethods

A news-domain entity recognition model fine-tuned from GLiNER. It excels at long-text news entity extraction and achieves up to 7.5% higher zero-shot accuracy than the base model across 18 benchmark datasets.

Sequence Labeling

PyTorch

English

Open Source License: Apache-2.0

#News Entity Extraction #Multilingual Support #Zero-shot Learning

Downloads 2,558

Release Time: 4/18/2024

Model Overview

This model is optimized for entity recognition in the news domain. It is built on the microsoft/deberta architecture and fine-tuned on synthetic data, which improves accuracy on cross-domain topics. It also supports texts translated from multiple languages.

Model Features

Cross-domain Performance Improvement

Achieves up to 7.5% higher zero-shot accuracy compared to the base model on 18 benchmark datasets.

News Domain Optimization

Specifically optimized for long-text news entity extraction scenarios.

Global Perspective Data

Training data designed with enforced diversity in country/language/topic/time dimensions.

Efficient Inference

Compact model size suitable for high-throughput production environments.

Model Capabilities

News Entity Recognition

Multilingual Text Processing

Zero-shot Learning

Long-text Analysis

Use Cases

News Analysis

News Event Entity Extraction

Extract key entities such as persons, locations, and organizations from news reports.

Demonstrates over 90% accuracy in key entity recognition in examples.

Multilingual News Processing

Process news content translated from multiple languages.

Supports English texts translated from 11 source languages.

Content Analysis

Event Correlation Analysis

Establish correlations between news events through entity recognition.

Already applied in the AskNews entity extraction system.
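One simple way to relate news events through their entities is to link articles that mention the same entity. The sketch below is illustrative only and is not the AskNews implementation: the article ids and the `shared_entities` helper are hypothetical, and the entity dictionaries only assume the `text` and `label` keys shown in the model's example output.

```python
from collections import defaultdict
from itertools import combinations


def shared_entities(articles):
    """Map each pair of article ids to the set of entities both articles mention."""
    # Inverted index: (lowercased text, label) -> ids of articles mentioning it.
    index = defaultdict(set)
    for article_id, entities in articles.items():
        for entity in entities:
            index[(entity["text"].lower(), entity["label"])].add(article_id)

    # Every pair of articles that shares an entity gets a link.
    links = defaultdict(set)
    for key, ids in index.items():
        for first, second in combinations(sorted(ids), 2):
            links[(first, second)].add(key)
    return dict(links)


# Hypothetical article ids; the mentions reuse entities from the example output.
articles = {
    "article-1": [
        {"text": "Ciudad Juárez", "label": "location"},
        {"text": "SSPE", "label": "organization"},
    ],
    "article-2": [
        {"text": "SSPE", "label": "organization"},
        {"text": "GMC Yukon", "label": "vehicle"},
    ],
    "article-3": [
        {"text": "February 6", "label": "date"},
    ],
}

links = shared_entities(articles)
for pair, entities in links.items():
    print(pair, "share", sorted(entities))
```

Matching on lowercased text plus label is a deliberately crude notion of identity. A production system would likely need entity resolution (for example, treating "SSPE" and its full name as one organization).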

🚀 Model Card for gliner_large_news-v2.1

This model is a fine-tuned version of GLiNER that aims to improve accuracy in long-context news entity extraction across a broad range of topics.

🚀 Quick Start

To start using the gliner_large_news-v2.1 model, follow the code example below:

# Requires the gliner package (pip install gliner).
from gliner import GLiNER

# Download the fine-tuned weights from the Hugging Face Hub and load them.
model = GLiNER.from_pretrained("EmergentMethods/gliner_large_news-v2.1")

text = """
The Chihuahua State Public Security Secretariat (SSPE) arrested 35-year-old Salomón C. T. in Ciudad Juárez, found in possession of a stolen vehicle, a white GMC Yukon, which was reported stolen in the city's streets. The arrest was made by intelligence and police analysis personnel during an investigation in the border city. The arrest is related to a previous detention on February 6, which involved armed men in a private vehicle. The detainee and the vehicle were turned over to the Chihuahua State Attorney General's Office for further investigation into the case.
"""

# Entity types to extract; any free-form label names can be supplied (zero-shot).
labels = ["person", "location", "date", "event", "facility", "vehicle", "number", "organization"]

# Each returned entity is a dict that includes the matched text and its label.
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

Output:

Chihuahua State Public Security Secretariat => organization
SSPE => organization
35-year-old => number
Salomón C. T. => person
Ciudad Juárez => location
GMC Yukon => vehicle
February 6 => date
Chihuahua State Attorney General's Office => organization
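The output above can be post-processed with plain Python. As a minimal sketch, the snippet below groups mentions by label. The `entities` list is copied by hand from the printed output so that the example runs without downloading the model, and `group_by_label` is a hypothetical helper, not part of the gliner package.

```python
from collections import defaultdict

# (text, label) pairs copied from the example output above.
entities = [
    {"text": "Chihuahua State Public Security Secretariat", "label": "organization"},
    {"text": "SSPE", "label": "organization"},
    {"text": "35-year-old", "label": "number"},
    {"text": "Salomón C. T.", "label": "person"},
    {"text": "Ciudad Juárez", "label": "location"},
    {"text": "GMC Yukon", "label": "vehicle"},
    {"text": "February 6", "label": "date"},
    {"text": "Chihuahua State Attorney General's Office", "label": "organization"},
]


def group_by_label(entities):
    """Return {label: [unique mention texts]} in first-seen order."""
    grouped = defaultdict(list)
    for entity in entities:
        mentions = grouped[entity["label"]]
        if entity["text"] not in mentions:  # skip repeated mentions
            mentions.append(entity["text"])
    return dict(grouped)


grouped = group_by_label(entities)
for label, mentions in grouped.items():
    print(f"{label}: {mentions}")
```

In real use, the list returned by `model.predict_entities(text, labels)` can be passed to `group_by_label` directly, since only the `text` and `label` keys are read.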

✨ Features

Enhanced Accuracy: This fine-tuned model improves on the base GLiNER model's zero-shot accuracy by up to 7.5% across 18 benchmark datasets, especially in long-context news entity extraction.
Diverse Dataset: The underlying dataset, AskNews-NER-v0, enforces diversity across country, language, topic, and time to capture a broad range of global perspectives.
Compact and High-throughput: The model is compact and well suited to high-throughput production use cases.
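For articles that may exceed what the model can process in one pass, a common workaround is to split the text into sentence-aligned chunks, run extraction per chunk, and shift the spans back to document offsets. The sketch below rests on several assumptions: `split_into_chunks` and `predict_long` are hypothetical helpers, `max_words=200` is an arbitrary illustrative budget rather than a documented limit, and `predict` is any callable that returns dictionaries with `start` and `end` character offsets, which `model.predict_entities` is assumed to provide. A stub predictor stands in for the model so the example runs offline.

```python
import re


def split_into_chunks(text, max_words=200):
    """Greedily pack whole sentences into chunks of at most max_words words.

    Returns (offset, chunk_text) pairs, where offset is the chunk's start
    index in the original text. A single sentence longer than max_words
    becomes its own oversized chunk.
    """
    chunks = []
    chunk_start, chunk_end, words = 0, 0, 0
    # Naive sentence splitter: text up to and including ., ! or ? plus spaces.
    for match in re.finditer(r"[^.!?]+[.!?]*\s*", text):
        n = len(match.group().split())
        if words and words + n > max_words:
            chunks.append((chunk_start, text[chunk_start:chunk_end]))
            chunk_start, words = match.start(), 0
        chunk_end = match.end()
        words += n
    if words:
        chunks.append((chunk_start, text[chunk_start:chunk_end]))
    return chunks


def predict_long(text, labels, predict, max_words=200):
    """Run predict(chunk, labels) on each chunk and shift spans to document offsets."""
    results = []
    for offset, chunk in split_into_chunks(text, max_words):
        for entity in predict(chunk, labels):
            shifted = dict(entity)
            shifted["start"] = entity["start"] + offset
            shifted["end"] = entity["end"] + offset
            results.append(shifted)
    return results


def fake_predict(chunk, labels):
    """Stand-in for the model: tags every literal 'Ciudad Juárez' as a location."""
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(), "label": "location"}
        for m in re.finditer("Ciudad Juárez", chunk)
    ]


article = "Police made an arrest in Ciudad Juárez. " * 3
found = predict_long(article, ["location"], fake_predict, max_words=8)
for entity in found:
    # Shifted offsets must point at the same text in the full article.
    assert article[entity["start"]:entity["end"]] == entity["text"]
print(len(found), "entities recovered")
```

With the real model, `fake_predict` would be replaced by `model.predict_entities`. Chunking on sentence boundaries avoids cutting an entity in half, at the cost of losing context that spans chunks.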

📚 Documentation

Model Details

Model Description

The synthetic data for this news fine-tune is sourced from the AskNews API, with diversity enforced across country, language, topic, and time.

[Figures: country distribution, entities, topics]

Developed by: Emergent Methods
Funded by: Emergent Methods
Shared by: Emergent Methods
Model type: microsoft/deberta
Language(s) (NLP): English (en), covering English texts and translations from Spanish (es), Portuguese (pt), German (de), Russian (ru), French (fr), Arabic (ar), Italian (it), Ukrainian (uk), Norwegian (no), Swedish (sv), and Danish (da)
License: Apache 2.0
Finetuned from model: GLiNER

Model Sources

Repository: To be added
Paper: To be added
Demo: To be added

Uses

Direct Use

This model is designed for generalist entity extraction. Although it is fine-tuned on news data, it shows improved accuracy across 18 benchmark datasets. The broad, diversified dataset lets it recognize and extract a wider range of entity types. AskNews currently uses it for entity extraction in its system.

Bias, Risks, and Limitations

Although the dataset aims to reduce bias and improve diversity, it is still biased toward Western languages and countries. This bias comes from the translation and summary-generation abilities of Llama2 and Llama3: any bias in their training data is also present in this dataset.

[Figure: countries distribution]

Training Details

The training dataset is AskNews-NER-v0. Other training details can be found in the companion paper.

Environmental Impact

Hardware Type: 1xA4500
Hours used: 10
Carbon Emitted: 0.6 kg (according to the Machine Learning Impact calculator)
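The reported figures can be sanity-checked with the usual estimate of energy use multiplied by grid carbon intensity. The snippet below is a back-of-the-envelope sketch: the 200 W power draw is an assumption (the card states only the GPU model and hours), so the implied carbon intensity is illustrative rather than a value taken from the source.

```python
# Values reported in the model card.
HOURS_USED = 10
REPORTED_KG_CO2 = 0.6

# Assumption (not stated in the card): one A4500 drawing about 200 W throughout.
ASSUMED_GPU_POWER_KW = 0.2

energy_kwh = ASSUMED_GPU_POWER_KW * HOURS_USED
implied_intensity = REPORTED_KG_CO2 / energy_kwh  # kg CO2 per kWh

print(f"Energy: {energy_kwh:.1f} kWh")
print(f"Implied carbon intensity: {implied_intensity:.2f} kg CO2/kWh")
```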

Citation

BibTeX: To be added
APA: To be added

Model Authors

Elin Törnquist, Emergent Methods, elin at emergentmethods.ai
Robert Caulk, Emergent Methods, rob at emergentmethods.ai

Model Contact

Elin Törnquist, Emergent Methods, elin at emergentmethods.ai
Robert Caulk, Emergent Methods, rob at emergentmethods.ai

📄 License

This model is licensed under the Apache 2.0 license.
