🚀 🦔 HEDGEhog 🦔: BERT-based multi-class uncertainty cues recognition
A fine-tuned multi-class classification model that detects four different types of uncertainty cues (a.k.a. hedges) on a token level.
✨ Features
Uncertainty types
| Type | Description | Example |
|------|-------------|---------|
| Epistemic (E) | The proposition is possible, but its truth-value cannot be decided at the moment. | She **may** be already asleep. |
| Investigation (I) | The proposition is in the process of having its truth-value determined. | She **examined** the role of NF-kappaB in protein activation. |
| Doxatic (D) | The proposition expresses beliefs and hypotheses, which may be known as true or false by others. | She **believes** that the Earth is flat. |
| Condition (N) | The proposition is true or false based on the truth-value of another proposition. | **If** she gets the job, she will move to Utrecht. |
| Certain (C) | n/a | n/a |
📦 Installation
The model was fine-tuned with the Simple Transformers library, which is available from PyPI (`pip install simpletransformers`). Although Simple Transformers is built on top of Transformers, the model cannot be used directly with the Transformers pipeline and classes; doing so would produce incorrect outputs. For this reason, the inference API on this page is disabled.
💻 Usage Examples
Basic Usage
To generate predictions with the model, use the Simple Transformers library:
```python
from simpletransformers.ner import NERModel

# Load the fine-tuned HEDGEhog model (CPU inference; set use_cuda=True if a GPU is available)
model = NERModel(
    'bert',
    'jeniakim/hedgehog',
    use_cuda=False,
    labels=["C", "D", "E", "I", "N"],
)

example = "As much as I definitely enjoy solitude, I wouldn't mind perhaps spending little time with you (Björk)"
predictions, raw_outputs = model.predict([example])
```
The predictions look like this:
```python
[[{'As': 'C'},
  {'much': 'C'},
  {'as': 'C'},
  {'I': 'C'},
  {'definitely': 'C'},
  {'enjoy': 'C'},
  {'solitude,': 'C'},
  {'I': 'C'},
  {"wouldn't": 'C'},
  {'mind': 'C'},
  {'perhaps': 'E'},
  {'spending': 'C'},
  {'little': 'C'},
  {'time': 'C'},
  {'with': 'C'},
  {'you': 'C'},
  {'(Björk)': 'C'}]]
```
In other words, the token 'perhaps' is recognized as an epistemic uncertainty cue and all the other tokens are not uncertainty cues.
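If you only need the cue tokens, the per-token output can be post-processed with plain Python. A minimal sketch (the `extract_cues` helper is an illustration, not part of the library; any token labelled other than `C` is treated as a cue):

```python
def extract_cues(sentence_predictions):
    """Collect (token, label) pairs for every token whose label is not 'C' (certain).

    `sentence_predictions` is one inner list from `model.predict`, i.e. a list of
    single-entry dicts mapping a token to its predicted label.
    """
    cues = []
    for token_dict in sentence_predictions:
        for token, label in token_dict.items():
            if label != "C":
                cues.append((token, label))
    return cues

# Abridged version of the prediction shown above:
preds = [{'mind': 'C'}, {'perhaps': 'E'}, {'spending': 'C'}]
print(extract_cues(preds))  # → [('perhaps', 'E')]
```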
📚 Documentation
Training Data
HEDGEhog is trained and evaluated on the [Szeged Uncertainty Corpus](https://rgai.inf.u-szeged.hu/node/160) (Szarvas et al., 2012)¹. The original sentence-level XML version of this dataset is available [here](https://rgai.inf.u-szeged.hu/node/160).
The token-level version used for the training can be downloaded from here in the form of pickled pandas DataFrames. You can download either the split sets (train.pkl, 137 MB; test.pkl, 17 MB; dev.pkl, 17 MB) or the full dataset (szeged_fixed.pkl, 172 MB). Each row in the DataFrame contains a token, its features (these are not relevant for HEDGEhog; they were used to train the baseline CRF model, see here), its sentence ID, and its label.
Training Procedure
The following training parameters were used:
- Optimizer: AdamW
- Learning rate: 4e-5
- Num train epochs: 1
- Train batch size: 16
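Under the Simple Transformers API, such hyperparameters are typically passed as an args dictionary when constructing the model. A hedged sketch of what the setup may have looked like (the base checkpoint `bert-base-uncased` and the exact argument set are assumptions, not stated on this card; AdamW is Simple Transformers' default optimizer, so it needs no explicit setting):

```python
# Training hyperparameters as listed above, in Simple Transformers' args-dict form.
train_args = {
    "learning_rate": 4e-5,
    "num_train_epochs": 1,
    "train_batch_size": 16,
}

# Hypothetical fine-tuning call (requires the token-level training DataFrame and
# downloads a base BERT checkpoint, so it is commented out here):
# from simpletransformers.ner import NERModel
# model = NERModel('bert', 'bert-base-uncased',
#                  labels=["C", "D", "E", "I", "N"], args=train_args)
# model.train_model(train_df)
```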
Evaluation Results
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| Epistemic | 0.90 | 0.85 | 0.88 | 624 |
| Doxatic | 0.88 | 0.92 | 0.90 | 142 |
| Investigation | 0.83 | 0.86 | 0.84 | 111 |
| Condition | 0.85 | 0.87 | 0.86 | 86 |
| Certain | 1.00 | 1.00 | 1.00 | 104,751 |
| macro average | 0.89 | 0.90 | 0.89 | 105,714 |
📄 License
This project is licensed under the MIT license.
🔧 Technical Details
References
1. Szarvas, G., Vincze, V., Farkas, R., Móra, G., & Gurevych, I. (2012). Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2), 335–367.