TrueTeacher
TrueTeacher is a factual consistency evaluation model. It addresses the challenge of evaluating whether a summary is factually consistent with its source document, providing a reliable tool for research in this field.
Quick Start
TrueTeacher is a model optimized for evaluating factual consistency in summarization. The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY". It's recommended to set max_length to 2048 to accommodate the input length of common summarization datasets. The model predicts a binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).
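For example, a single document/summary pair is packed into one string before tokenization (a minimal sketch; document and summary are illustrative placeholders, not names from the released code):

```python
# Illustrative only: build the model input from a grounding document and a
# candidate summary.
document = 'the sun is shining'
summary = 'the sun is out in the sky'
model_input = f'premise: {document} hypothesis: {summary}'
```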
⨠Features
- Optimized for Summarization: Specifically designed to evaluate factual consistency in summarization tasks.
- Based on T5 - 11B: Built upon the powerful T5 - 11B architecture and fine - tuned with multiple high - quality datasets.
- Binary Prediction: Provides clear binary labels for factual consistency evaluation.
Installation
No dedicated installation is required beyond the Hugging Face transformers library (plus sentencepiece, which the T5Tokenizer relies on); the usage examples below assume both are installed.
Usage Examples
Basic Usage
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '1'),
                             ('the cat is shiny', '0')]:
  # Format the input as 'premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY'
  # and truncate to the recommended max_length of 2048 tokens.
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  # The model generates '1' (factually consistent) or '0' (factually inconsistent).
  outputs = model.generate(input_ids)
  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'result: {result} (expected: {expected})\n')
```
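Note that T5-11B is a very large checkpoint, so loading it in full float32 precision may be impractical on a single GPU. One possible workaround, sketched below using standard transformers options (this is an assumption on our setup, not covered in the original card), is to load the weights in bfloat16 and let accelerate place them across the available devices:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
# Assumes the `accelerate` package is installed; device_map='auto' spreads the
# 11B parameters across the available GPUs (or offloads to CPU if needed).
model = T5ForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map='auto')

input_ids = tokenizer(
    'premise: the sun is shining hypothesis: the sun is out in the sky',
    return_tensors='pt',
    truncation=True,
    max_length=2048).input_ids.to(model.device)
print(tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True))
```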
Advanced Usage
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
import torch

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'),
                             ('the cat is shiny', '<< 0.5')]:
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  # Run a single decoding step (the decoder starts from the pad token) to
  # obtain the logits over the first generated token.
  decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
  outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
  logits = outputs.logits
  probs = torch.softmax(logits[0], dim=-1)
  # The probability assigned to the token '1' serves as a continuous
  # factual-consistency score.
  one_token_id = tokenizer('1').input_ids[0]
  entailment_prob = probs[0, one_token_id].item()
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'score: {entailment_prob:.3f} (expected: {expected})\n')
```
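The same forward pass extends naturally to scoring many document/summary pairs at once. A minimal batched sketch, reusing the tokenizer and model loaded above (the score_batch helper is hypothetical, not part of the released code):

```python
import torch

def score_batch(premises, hypotheses, tokenizer, model, max_length=2048):
  # Hypothetical helper: returns the probability of the '1' token for each
  # premise/hypothesis pair in a single forward pass.
  texts = [f'premise: {p} hypothesis: {h}'
           for p, h in zip(premises, hypotheses)]
  enc = tokenizer(texts, return_tensors='pt', padding=True,
                  truncation=True, max_length=max_length)
  decoder_input_ids = torch.full((enc.input_ids.shape[0], 1),
                                 tokenizer.pad_token_id, dtype=torch.long)
  with torch.no_grad():
    outputs = model(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    decoder_input_ids=decoder_input_ids)
  probs = torch.softmax(outputs.logits[:, 0, :], dim=-1)
  one_token_id = tokenizer('1').input_ids[0]
  return probs[:, one_token_id].tolist()

scores = score_batch(['the sun is shining'] * 2,
                     ['the sun is out in the sky', 'the cat is shiny'],
                     tokenizer, model)
print(scores)  # the first score should be high, the second low
```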
Documentation
Model Details
The model is based on T5-11B (Raffel et al., 2020) and fine-tuned with a mixture of the following datasets:
- TrueTeacher (Gekhman et al., 2023)
- ANLI (Nie et al., 2020)

The TrueTeacher dataset contains model-generated summaries of articles from the train split of the CNN/DailyMail dataset (Hermann et al., 2015), annotated for factual consistency using FLAN-PaLM 540B (Chung et al., 2022). The summaries were generated by summarization models trained on the XSum dataset (Narayan et al., 2018).
Evaluation Results
This model achieves the following ROC AUC results on the summarization subset of the TRUE benchmark (Honovich et al., 2022):

| Dataset  | ROC AUC |
|----------|---------|
| MNBM     | 78.1    |
| QAGS-X   | 89.4    |
| FRANK    | 93.6    |
| SummEval | 88.5    |
| QAGS-C   | 89.4    |
| Average  | 87.8    |
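Scores produced by the advanced usage example above can be aggregated into such ROC AUC numbers given human consistency annotations. A minimal, hypothetical sketch using scikit-learn (an extra dependency not mentioned in this card, with placeholder data):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical evaluation sketch: gold binary consistency labels plus the
# continuous scores from the advanced usage example.
gold_labels = [1, 0, 1, 1, 0]           # placeholder annotations
scores = [0.97, 0.12, 0.88, 0.65, 0.30] # placeholder model scores
print(roc_auc_score(gold_labels, scores))
```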
Intended Use
This model is intended for research use (non-commercial) in English. The recommended use case is evaluating factual consistency in summarization.
Out-of-scope use
- Any use cases which violate the cc-by-nc-4.0 license.
- Usage in languages other than English.
Technical Details
The model is built on T5-11B and fine-tuned on a mixture of datasets, including TrueTeacher and ANLI, which strengthens its ability to judge factual consistency on real-world summarization inputs. The "premise: ... hypothesis: ..." input format and the binary '1'/'0' prediction keep evaluation simple, and the recommended max_length of 2048 accommodates the input length of common summarization datasets.
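In practice, very long grounding documents can exceed the 2048-token budget and get silently truncated; a quick, hypothetical check (not from the original card, reusing the tokenizer and variables from the usage examples):

```python
# Hypothetical length check: count tokens before scoring so that truncation
# of long grounding documents is at least visible.
text = f'premise: {premise} hypothesis: {hypothesis}'
n_tokens = len(tokenizer(text).input_ids)
if n_tokens > 2048:
  print(f'warning: input has {n_tokens} tokens; only the first 2048 will be used')
```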
License
This model is licensed under the cc-by-nc-4.0 license.
Citation
If you use this model for a research publication, please cite the TrueTeacher paper (using the bibtex entry below), as well as the ANLI, CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.
```bibtex
@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models},
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```