xlm-roberta-large-pooled-cap-media-minor
An xlm-roberta-large model finetuned for multilingual text classification with minor topic codes from the Comparative Agendas Project (CAP) and additional media codes.
Quick Start
To use this model, you can follow the code example below. It demonstrates how to load the model and perform text classification.
```python
from transformers import AutoTokenizer, pipeline

# Load the base tokenizer and build a text-classification pipeline for the gated model.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
pipe = pipeline(
    model="poltextlab/xlm-roberta-large-pooled-cap-media-minor",
    task="text-classification",
    tokenizer=tokenizer,
    use_fast=False,
    truncation=True,
    max_length=512,                      # inputs longer than 512 tokens are truncated
    token="<your_hf_read_only_token>",   # gated model: a read-only HF token is required
)

text = "We will place an immediate 6-month halt on the finance driven closure of beds and wards, and set up an independent audit of needs and facilities."
pipe(text)
```
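The pipeline returns the predicted category label together with a confidence score. On recent Transformers releases you can also request several candidates per input via `top_k`; the label strings below depend on the model's label configuration and are shown only as an illustration.

```python
# Request the three highest-scoring categories for the example text.
# Label strings depend on the model's id2label mapping; shown for illustration only.
results = pipe(text, top_k=3)
for result in results:
    print(f"{result['label']}: {result['score']:.3f}")
```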
Gated Access
Important Note
Due to the gated access, you must pass the `token` parameter when loading the model. In earlier versions of the Transformers package, you may need to use the `use_auth_token` parameter instead.
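For example, on older Transformers releases the same pipeline can be built by swapping the authentication argument (a minimal sketch; supply your own read-only token):

```python
from transformers import pipeline

# Older Transformers releases predate the `token` argument: authenticate with
# `use_auth_token` instead when constructing the pipeline.
pipe = pipeline(
    model="poltextlab/xlm-roberta-large-pooled-cap-media-minor",
    task="text-classification",
    use_auth_token="<your_hf_read_only_token>",
)
```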
Documentation
Model description
An xlm-roberta-large model finetuned on multilingual (English, Danish) training data labelled with minor topic codes from the Comparative Agendas Project. Furthermore, we also used the following 7 media codes (a code-level reference table is sketched after the list):
- State and Local Government Administration (24)
- Weather and Natural Disaster (26)
- Fires (27)
- Sports and Recreation (29)
- Death Notices (30)
- Churches and Religion (31)
- Other, Miscellaneous and Human Interest (99)
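If you want these media codes available programmatically, they can be kept in a simple lookup table next to the pipeline. This is a sketch built from the list above; how the pipeline's output label strings map to these numeric codes depends on the model's label configuration, so check `pipe.model.config.id2label` before relying on it.

```python
# Reference table for the 7 media codes listed above (used in addition to the
# CAP minor topic codes). How the pipeline's output labels map to these numbers
# depends on the model's id2label configuration, so verify before relying on it.
MEDIA_CODES = {
    24: "State and Local Government Administration",
    26: "Weather and Natural Disaster",
    27: "Fires",
    29: "Sports and Recreation",
    30: "Death Notices",
    31: "Churches and Religion",
    99: "Other, Miscellaneous and Human Interest",
}
```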
Model performance
The model was evaluated on a test set of 91,331 examples.
- Weighted Average F1-score: 0.68
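The weighted average F1 can be computed for your own held-out data with scikit-learn; the snippet below is an illustrative sketch with made-up gold and predicted codes, not the authors' evaluation script.

```python
from sklearn.metrics import f1_score

# Illustrative only: stand-in gold and predicted CAP codes for a held-out set.
y_true = [24, 26, 27, 24, 99, 29]
y_pred = [24, 26, 29, 24, 99, 29]

# "weighted" averages the per-class F1 scores by class support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"Weighted Average F1-score: {weighted_f1:.2f}")
```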
Cooperation
Model performance can be significantly improved by extending our training sets. We welcome submissions of CAP-coded corpora (in any domain and language), either by e-mail at poltextlab{at}poltextlab{dot}com or through the CAP Babel Machine.
Debugging and issues
Usage Tip
This architecture uses the `sentencepiece` tokenizer. To run the model with Transformers versions earlier than 4.27, you need to install it manually. If you encounter a `RuntimeError` when loading the model with the `from_pretrained()` method, adding `ignore_mismatched_sizes=True` should solve the issue.
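A sketch of the two workarounds mentioned above (the shell command assumes a pip-based environment; replace the token placeholder with your own):

```python
# On Transformers < 4.27, install the tokenizer dependency manually first:
#   pip install sentencepiece
from transformers import AutoModelForSequenceClassification

# If from_pretrained() raises a RuntimeError about tensor shapes, load the
# gated model with mismatched head sizes ignored.
model = AutoModelForSequenceClassification.from_pretrained(
    "poltextlab/xlm-roberta-large-pooled-cap-media-minor",
    ignore_mismatched_sizes=True,
    token="<your_hf_read_only_token>",  # gated model: read-only token required
)
```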
License
This model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | An xlm-roberta-large model finetuned on multilingual training data. |
| Training Data | Multilingual (English, Danish) data labelled with minor topic codes and 7 media codes. |
| Metrics | Weighted Average F1-score: 0.68 |