xlm-roberta-large-pooled-cap-media-minor
An xlm-roberta-large model finetuned for multilingual text classification with minor topic codes from the Comparative Agendas Project (CAP) and additional media codes.
Quick Start
To use this model, you can follow the code example below. It demonstrates how to load the model and perform text classification.
```python
from transformers import AutoTokenizer, pipeline

# Load the base tokenizer and build a text-classification pipeline for the gated model.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
pipe = pipeline(
    model="poltextlab/xlm-roberta-large-pooled-cap-media-minor",
    task="text-classification",
    tokenizer=tokenizer,
    use_fast=False,
    truncation=True,
    max_length=512,                      # inputs longer than 512 tokens are truncated
    token="<your_hf_read_only_token>",   # gated model: a read-only HF token is required
)

text = "We will place an immediate 6-month halt on the finance driven closure of beds and wards, and set up an independent audit of needs and facilities."
pipe(text)
```
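The pipeline returns the predicted category label together with a confidence score. On recent Transformers releases you can also request several candidates per input via `top_k`; the label strings below depend on the model's label configuration and are shown only as an illustration.

```python
# Request the three highest-scoring categories for the example text.
# Label strings depend on the model's id2label mapping; shown for illustration only.
results = pipe(text, top_k=3)
for result in results:
    print(f"{result['label']}: {result['score']:.3f}")
```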
Gated Access
Important Note
Due to the gated access, you must pass the `token` parameter when loading the model. In earlier versions of the Transformers package, you may need to use the `use_auth_token` parameter instead.
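For example, on older Transformers releases the same pipeline can be built by swapping the authentication argument (a minimal sketch; supply your own read-only token):

```python
from transformers import pipeline

# Older Transformers releases predate the `token` argument: authenticate with
# `use_auth_token` instead when constructing the pipeline.
pipe = pipeline(
    model="poltextlab/xlm-roberta-large-pooled-cap-media-minor",
    task="text-classification",
    use_auth_token="<your_hf_read_only_token>",
)
```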
Documentation
Model description
An xlm-roberta-large model finetuned on multilingual (English, Danish) training data labelled with minor topic codes from the Comparative Agendas Project. Furthermore, we also used the following 7 media codes (a code-level reference table is sketched after the list):
- State and Local Government Administration (24)
- Weather and Natural Disaster (26)
- Fires (27)
- Sports and Recreation (29)
- Death Notices (30)
- Churches and Religion (31)
- Other, Miscellaneous and Human Interest (99)
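If you want these media codes available programmatically, they can be kept in a simple lookup table next to the pipeline. This is a sketch built from the list above; how the pipeline's output label strings map to these numeric codes depends on the model's label configuration, so check `pipe.model.config.id2label` before relying on it.

```python
# Reference table for the 7 media codes listed above (used in addition to the
# CAP minor topic codes). How the pipeline's output labels map to these numbers
# depends on the model's id2label configuration, so verify before relying on it.
MEDIA_CODES = {
    24: "State and Local Government Administration",
    26: "Weather and Natural Disaster",
    27: "Fires",
    29: "Sports and Recreation",
    30: "Death Notices",
    31: "Churches and Religion",
    99: "Other, Miscellaneous and Human Interest",
}
```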
Model performance
The model was evaluated on a test set of 91,331 examples.
- Weighted Average F1-score: 0.68
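The weighted average F1 can be computed for your own held-out data with scikit-learn; the snippet below is an illustrative sketch with made-up gold and predicted codes, not the authors' evaluation script.

```python
from sklearn.metrics import f1_score

# Illustrative only: stand-in gold and predicted CAP codes for a held-out set.
y_true = [24, 26, 27, 24, 99, 29]
y_pred = [24, 26, 29, 24, 99, 29]

# "weighted" averages the per-class F1 scores by class support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"Weighted Average F1-score: {weighted_f1:.2f}")
```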
Cooperation
Model performance can be significantly improved by extending our training sets. We welcome submissions of CAP-coded corpora (in any domain and language), either by e-mail at poltextlab{at}poltextlab{dot}com or through the CAP Babel Machine.
Debugging and issues
Usage Tip
This architecture uses the `sentencepiece` tokenizer. To run the model with Transformers versions earlier than 4.27, you need to install it manually. If you encounter a `RuntimeError` when loading the model with the `from_pretrained()` method, adding `ignore_mismatched_sizes=True` should solve the issue.
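A sketch of the two workarounds mentioned above (the shell command assumes a pip-based environment; replace the token placeholder with your own):

```python
# On Transformers < 4.27, install the tokenizer dependency manually first:
#   pip install sentencepiece
from transformers import AutoModelForSequenceClassification

# If from_pretrained() raises a RuntimeError about tensor shapes, load the
# gated model with mismatched head sizes ignored.
model = AutoModelForSequenceClassification.from_pretrained(
    "poltextlab/xlm-roberta-large-pooled-cap-media-minor",
    ignore_mismatched_sizes=True,
    token="<your_hf_read_only_token>",  # gated model: read-only token required
)
```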
License
This model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | An xlm-roberta-large model finetuned on multilingual training data. |
| Training Data | Multilingual (English, Danish) data labelled with minor topic codes and 7 media codes. |
| Metrics | Weighted Average F1-score: 0.68 |