🚀 🦔 HEDGEhog 🦔: BERT-based multi-class uncertainty cues recognition
A fine-tuned multi-class classification model that detects four different types of uncertainty cues (a.k.a. hedges) on a token level.
✨ Features
Uncertainty types
| Type | Description | Example |
|------|-------------|---------|
| Epistemic (E) | The proposition is possible, but its truth-value cannot be decided at the moment. | She **may** be already asleep. |
| Investigation (I) | The proposition is in the process of having its truth-value determined. | She **examined** the role of NF-kappaB in protein activation. |
| Doxatic (D) | The proposition expresses beliefs and hypotheses, which may be known as true or false by others. | She **believes** that the Earth is flat. |
| Condition (N) | The proposition is true or false based on the truth-value of another proposition. | **If** she gets the job, she will move to Utrecht. |
| Certain (C) | n/a | n/a |
📦 Installation
The model was fine-tuned with the Simple Transformers library, which is available from PyPI (`pip install simpletransformers`). Although Simple Transformers is built on top of Transformers, the model cannot be used directly with the Transformers pipeline and classes; doing so would produce incorrect outputs. For this reason, the inference API on this page is disabled.
💻 Usage Examples
Basic Usage
To generate predictions with the model, use the Simple Transformers library:
```python
from simpletransformers.ner import NERModel

# Load the fine-tuned HEDGEhog model (CPU inference; set use_cuda=True if a GPU is available)
model = NERModel(
    'bert',
    'jeniakim/hedgehog',
    use_cuda=False,
    labels=["C", "D", "E", "I", "N"],
)

example = "As much as I definitely enjoy solitude, I wouldn't mind perhaps spending little time with you (Björk)"
predictions, raw_outputs = model.predict([example])
```
The predictions look like this:
```python
[[{'As': 'C'},
  {'much': 'C'},
  {'as': 'C'},
  {'I': 'C'},
  {'definitely': 'C'},
  {'enjoy': 'C'},
  {'solitude,': 'C'},
  {'I': 'C'},
  {"wouldn't": 'C'},
  {'mind': 'C'},
  {'perhaps': 'E'},
  {'spending': 'C'},
  {'little': 'C'},
  {'time': 'C'},
  {'with': 'C'},
  {'you': 'C'},
  {'(Björk)': 'C'}]]
```
In other words, the token 'perhaps' is recognized as an epistemic uncertainty cue and all the other tokens are not uncertainty cues.
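If you only need the cue tokens, the per-token output can be post-processed with plain Python. A minimal sketch (the `extract_cues` helper is an illustration, not part of the library; any token labelled other than `C` is treated as a cue):

```python
def extract_cues(sentence_predictions):
    """Collect (token, label) pairs for every token whose label is not 'C' (certain).

    `sentence_predictions` is one inner list from `model.predict`, i.e. a list of
    single-entry dicts mapping a token to its predicted label.
    """
    cues = []
    for token_dict in sentence_predictions:
        for token, label in token_dict.items():
            if label != "C":
                cues.append((token, label))
    return cues

# Abridged version of the prediction shown above:
preds = [{'mind': 'C'}, {'perhaps': 'E'}, {'spending': 'C'}]
print(extract_cues(preds))  # → [('perhaps', 'E')]
```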
📚 Documentation
Training Data
HEDGEhog is trained and evaluated on the [Szeged Uncertainty Corpus](https://rgai.inf.u-szeged.hu/node/160) (Szarvas et al., 2012)¹. The original sentence-level XML version of this dataset is available [here](https://rgai.inf.u-szeged.hu/node/160).
The token-level version used for the training can be downloaded from here in the form of pickled pandas DataFrames. You can download either the split sets (train.pkl, 137 MB; test.pkl, 17 MB; dev.pkl, 17 MB) or the full dataset (szeged_fixed.pkl, 172 MB). Each row in the DataFrame contains a token, its features (these are not relevant for HEDGEhog; they were used to train the baseline CRF model, see here), its sentence ID, and its label.
Training Procedure
The following training parameters were used:
- Optimizer: AdamW
- Learning rate: 4e-5
- Num train epochs: 1
- Train batch size: 16
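Under the Simple Transformers API, such hyperparameters are typically passed as an args dictionary when constructing the model. A hedged sketch of what the setup may have looked like (the base checkpoint `bert-base-uncased` and the exact argument set are assumptions, not stated on this card; AdamW is Simple Transformers' default optimizer, so it needs no explicit setting):

```python
# Training hyperparameters as listed above, in Simple Transformers' args-dict form.
train_args = {
    "learning_rate": 4e-5,
    "num_train_epochs": 1,
    "train_batch_size": 16,
}

# Hypothetical fine-tuning call (requires the token-level training DataFrame and
# downloads a base BERT checkpoint, so it is commented out here):
# from simpletransformers.ner import NERModel
# model = NERModel('bert', 'bert-base-uncased',
#                  labels=["C", "D", "E", "I", "N"], args=train_args)
# model.train_model(train_df)
```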
Evaluation Results
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| Epistemic | 0.90 | 0.85 | 0.88 | 624 |
| Doxatic | 0.88 | 0.92 | 0.90 | 142 |
| Investigation | 0.83 | 0.86 | 0.84 | 111 |
| Condition | 0.85 | 0.87 | 0.86 | 86 |
| Certain | 1.00 | 1.00 | 1.00 | 104,751 |
| macro average | 0.89 | 0.90 | 0.89 | 105,714 |
📄 License
This project is licensed under the MIT license.
🔧 Technical Details
References
1. Szarvas, G., Vincze, V., Farkas, R., Móra, G., & Gurevych, I. (2012). Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2), 335–367.