# Estonian Parliament Stenograms Summarization Model
This model summarizes stenograms of Estonian Parliament sessions and may also work with other Estonian texts.
## Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rristo/mlong-t5-tglobal-base-et-riigikogu-summary")
model = AutoModelForSeq2SeqLM.from_pretrained("rristo/mlong-t5-tglobal-base-et-riigikogu-summary")

text = """Varasematest uuringutest on teada, et punetav nĂ€gu vĂ”ib mĂ€rku anda erutusest nĂ€iteks aaradel ja raisakotkastel. Sestap huvitas Tours'i Ălikooli etoloog Delphine Soulet'd ja tema kolleege, kas sarnast tundemĂ€rki vĂ”ib nĂ€ha ka kodukanade (Gallus gallus domesticus) nĂ€gudel.
TöörĂŒhm filmis esmalt kuut Sussexi tĂ”ugu kana erinevates olukordades. MĂ”nes olukorras toimetasid kanad loomulikult omasoodu, teistes aga juhtisid uurijad lindude tegevust. PĂ”nevates ja autasu tĂ”otavates olukordades lasi töörĂŒhm kanadel vĂ”tta tolmuvanni vĂ”i söötis neid ussikestega. Hirmuga seotud olukordades pĂŒĂŒdsid uurijad linde kĂ€sitsi kinni.
Katsete jĂ€rel oli töörĂŒhma pĂ€ralt videosalvestistest vĂ”etud tuhandeid ĂŒksikkaadreid. Just nende analĂŒĂŒsiks loodud algoritmi toel said uurijad tĂ€pselt jĂ€lgida, kui punased olid igas olukorras kanade hari, pĂ”sed, kĂ”rvanibud ja lotid.
TöörĂŒhma sĂ”nul oli uuringu valim vĂ€ike, mistĂ”ttu vajavad tulemused kinnitamist suuremas kordusuuringus. Siiski ilmneb tulemustest, et vĂ€hem punetavad pĂ”sed ja kĂ”rvanibud vĂ”ivad viidata linnu rahulikule ja rÔÔmsale seisundile. Vastukaaluks nĂ€ib punetavam nĂ€gu mĂ€rku andvat linnu suuremast emotsionaalsest erutusest. Sinna hulka kuuluvad nii ussikeste saamisega seotud elevus kui ka hirm.
Soulet ja kolleegid tegid veel ĂŒhe katse, kus jaotasid 25 Sussexi tĂ”ugu kana kahte rĂŒhma. Uurijad kĂ€isid viie nĂ€dala jooksul 13 linnu juures, et kanu pisitasa inimese kohaoluga harjutada. ĂlejÀÀnud 12 lindu jĂ€eti viieks nĂ€dalaks kontrollrĂŒhmana omapĂ€i.
Kui siis kĂ”ik kanad viie nĂ€dala möödudes uuesti inimestega kokku puutusid, ilmnes kahe kanarĂŒhma vahel selge vahe. Uurijatega harjunud linnud pelgasid inimest vĂ€hem ja muutusid nende juuresolekul nĂ€ost vĂ€hem punaseks, kui nende ĂŒksi jĂ€etud liigikaaslased."""

def summarize(text, model, tokenizer, max_new_tokens=512, device='cuda'):
    # Truncate to the model's 2048-token maximum input length
    input_ids = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=2048
    ).input_ids
    outputs = model.generate(input_ids=input_ids.to(device), max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

DEVICE = 'cuda'
model = model.to(DEVICE)
print(summarize(text, model, tokenizer, device=DEVICE))
```
## Features
- This model is designed to summarize Estonian Parliament talks stenograms and may work with other Estonian texts with reasonable accuracy.
- It accepts input sequences of up to 2048 tokens, which is longer than many comparable models.
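Inputs longer than 2048 tokens are truncated, so a very long stenogram can instead be split into token windows and each window summarized separately. A minimal sketch of such a fixed-window chunker (the `chunk_ids` helper, window size, and overlap are illustrative assumptions, not part of the model's API):

```python
def chunk_ids(token_ids, max_len=2048, overlap=128):
    # Split a list of token ids into windows of at most max_len tokens,
    # repeating `overlap` tokens between consecutive windows so that
    # sentences cut at a boundary still appear whole in one chunk.
    step = max_len - overlap
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), step)]

# Example with dummy token ids:
ids = list(range(5000))
print([len(c) for c in chunk_ids(ids)])  # [2048, 2048, 1160]
```

Each chunk's ids can then be passed to `model.generate` and the partial summaries concatenated, at the cost of losing some cross-chunk context.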
## Installation
Installation mainly requires the Hugging Face `transformers` library, with PyTorch as the backend:

```shell
pip install transformers torch
```
## Documentation
### Model Description
This model was created as an experiment to see whether an Estonian summarization model can be trained with an input sequence length longer than 1024 tokens.
- Model type: T5
- Language(s) (NLP): Estonian
- Finetuned from model: agemagician/mlong-t5-tglobal-base. The vocabulary of the original model was reduced to keep only tokens present in the training data.
- Maximum input sequence (tokens): 2048
### Uses

#### Direct Use
The model is intended for summarizing stenograms of Estonian Parliament sessions. It may also handle other Estonian texts with reasonable accuracy.
### Bias, Risks, and Limitations
Biases from the original pre-trained model, the Estonian Parliament dataset, and GPT-3.5 (which was used to generate the training-data summaries) are probably present in the model. No extensive bias study has been conducted.
#### Recommendations
Do not use the model when highly accurate results are required: it may omit important aspects of the original text and may hallucinate content that is not there.
## Technical Details
### Training Details
#### Training Data

The training data comes from the et_parliament_stenos_summary dataset: Estonian Parliament stenograms paired with summaries generated by GPT-3.5.

#### Training Procedure
The training notebook is available here. An explanation of the process can be found here.
#### Training Hyperparameters
- Training regime: fp32
- learning_rate: 5e-5
- num_train_epochs: 12
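These hyperparameters map onto the `transformers` Seq2Seq trainer configuration roughly as sketched below. Only `learning_rate` and `num_train_epochs` come from this card; `output_dir` and the batch size are illustrative placeholders, and fp32 is simply the default (no `fp16`/`bf16` flags set).

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the reported settings; values marked "placeholder" are
# assumptions, not the values actually used for this model.
training_args = Seq2SeqTrainingArguments(
    output_dir="mlong-t5-et-riigikogu-summary",  # placeholder
    learning_rate=5e-5,              # from the card
    num_train_epochs=12,             # from the card
    per_device_train_batch_size=2,   # placeholder
)
```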
### Evaluation

#### Testing Data
The test data is from the et_parliament_stenos_summary test set, which contains stenograms not present in the training data.
#### Metrics and Results
| Metric | Score |
|---|---|
| rouge1 | 36.1651 |
| rouge2 | 15.9668 |
| rougeL | 28.339 |
| rougeLsum | 33.767 |
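For intuition, ROUGE-1 is the F1 score over overlapping unigrams between a generated summary and a reference. A minimal pure-Python illustration of that idea (the scores above were computed with a proper ROUGE implementation, whose tokenization and stemming details this sketch ignores):

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    # Unigram-overlap F1: count words shared between prediction and
    # reference, clipped to each word's count in the other text.
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 4))  # 0.6667
```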
## License
This model is licensed under the Apache-2.0 license.