🚀 Smugri-tuned NLLB-1.3b, v0.01
This is a fine-tune of NLLB-1.3b on parallel data for 29 Finno-Ugric languages.
For some of the languages it can generate specific dialects or varieties; see Supported Dialects below.
Details on the training data and other information will follow soon. Training of this model is still in progress:
there are several known problems, and overall quality has not been tested yet. So far only parallel
data has been used for training; more dialects will be added once monolingual/synthetic data is included.
🚀 Quick Start
💻 Usage Examples
Basic Usage
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/nllb1.3-smugri4-v0.01")
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/nllb1.3-smugri4-v0.01")

# The tag in angle brackets requests a dialect/variety of the target language.
input_text = "<New written Veps> This is a short example sentence."
source_lang = "eng_Latn"
target_lang = "vep_Latn"

# Set the source language before tokenizing the input.
tokenizer.src_lang = source_lang
input_tokenized = tokenizer(input_text, return_tensors="pt")

# Force the first generated token to be the target-language code.
output_raw = model.generate(**input_tokenized, forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang))
output = tokenizer.decode(output_raw[0], skip_special_tokens=True)
print(output)
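The steps above can be wrapped in a small helper. This is a minimal sketch, not part of the released code: the names `add_dialect_tag` and `translate` are hypothetical, and it assumes the dialect tag is always written as `<dialect>` followed by a space at the start of the input text, as in the example sentence above. `model` and `tokenizer` are expected to be the objects loaded in the basic example.

```python
from typing import Optional


def add_dialect_tag(text: str, dialect: Optional[str] = None) -> str:
    """Prefix text with a tag such as "<New written Veps>" (assumed format)."""
    return f"<{dialect}> {text}" if dialect else text


def translate(model, tokenizer, text, source_lang, target_lang,
              dialect=None, **generate_kwargs):
    """Translate one sentence, following the same steps as the basic example."""
    # Set the source language before tokenizing.
    tokenizer.src_lang = source_lang
    inputs = tokenizer(add_dialect_tag(text, dialect), return_tensors="pt")
    # Force the first generated token to be the target-language code.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        **generate_kwargs,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the model and tokenizer loaded as above, `translate(model, tokenizer, "This is a short example sentence.", "eng_Latn", "vep_Latn", dialect="New written Veps")` should perform the same call as the basic example. Extra keyword arguments such as `max_new_tokens` are passed through to `model.generate`.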
✨ Features
Supported Languages
| Property | Details |
| --- | --- |
| Supported Languages | est_Latn (Estonian), fin_Latn (Finnish), fkv_Latn (Kven), izh_Latn (Izhorian*), krl_Latn (Proper Karelian*), liv_Latn (Livonian), lud_Latn (Ludian*), olo_Latn (Livvi-Karelian*), vep_Latn (Veps*), vot_Latn (Votic*), vro_Latn (Võro), sje_Latn (Pite Sami), sju_Latn (Ume Sami), sma_Latn (Southern Sami), sme_Latn (Northern Sami), smj_Latn (Lule Sami), smn_Latn (Inari Sami), sms_Latn (Skolt Sami), sjd_Cyrl (Kildin Sami*), kpv_Cyrl (Komi-Zyrian), koi_Cyrl (Komi-Permyak), udm_Cyrl (Udmurt), mdf_Cyrl (Moksha), myv_Cyrl (Erzya), mhr_Cyrl (Meadow Mari), mrj_Cyrl (Hill Mari), hun_Latn (Hungarian), kca_Cyrl (Khanty*), mns_Cyrl (Mansi), eng_Latn (English), lvs_Latn (Latvian), rus_Cyrl (Russian), nor_Latn (Norwegian) |

Languages marked with an asterisk (*) are the ones with dialect/variety tokens listed under Supported Dialects.
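Because the target language is passed to generation as a token id, a mistyped code can go unnoticed. The sketch below shows one possible sanity check; the function name `is_known_language_code` is hypothetical, and it assumes that every supported language code is a token in the tokenizer vocabulary and that unknown strings map to `tokenizer.unk_token_id`.

```python
def is_known_language_code(tokenizer, lang_code: str) -> bool:
    """Return True if lang_code maps to a real token rather than the unknown token."""
    # Assumption: unsupported codes fall back to the tokenizer's unknown-token id.
    return tokenizer.convert_tokens_to_ids(lang_code) != tokenizer.unk_token_id
```

A call such as `is_known_language_code(tokenizer, "vep_Latn")` could be made before `model.generate` to catch typos in `target_lang` early.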
Supported Dialects
- for Izhorian: `alal` (Lower Luga), `soik` (Soikkola)
- for Votic: `I`, `J`, `Ja`, `K`, `Kõ`, `Ke`, `Ko`, `L`, `Li`, `Lu`, `M`, `P`, `Po`, `R`, `Ra`, `S`, `U`, `V` (explanation: https://arhiiv.eki.ee/dict/vadja/lisad/v_lyhendid.pdf)
- for Karelian Proper: `Dyorzha`, `Ilomantsi`, `Keret`, `Kestenga`, `Kontokki`, `Korbiselga`, `Maslozero`, `Myandyselga`, `New written Tver`, `New written karelian`, `Oulanga`, `Padany`, `Panozero`, `Poduzhemye`, `Porosozero`, `Reboly`, `Rugozero`, `Suistamo`, `Suoyarvi`, `Tikhtozero`, `Tikhvin`, `Tolmachi`, `Tunguda`, `Uhta`, `Valdai`, `Vesyegonsk`, `Voknavolok`, `Vychetaibola`, `Yushkozero`
- for Ludian: `Central Ludian (Munozero)`, `Mikhailovskoye`, `New written Ludian`, `Northern Ludian (Kondopoga)`, `Southern Ludian (Svjatozero)`, `Miikul` (Central Ludian)
- for Livvi-Karelian: `Impilahti`, `Kondushi`, `Kotkozero`, `Nekkula`, `New written Livvic`, `Rypushkalitsa`, `Salmi`, `Suoyarvi`, `Syamozero`, `Tulmozero`, `Vedlozero`, `Vidlitsa`
- for Veps: `Central Eastern Veps`, `Central Western Veps`, `New written Veps`, `Northern Veps`, `Southern Veps`
- for Kildin Sami: `orth1`
- for Khanty: `kazym` (Kazym), `shuryshkary` (Shuryshkar)
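Dialect tokens are plain strings, so a typo would not raise an error by itself. The sketch below shows one way to validate a token before building the input. It is illustrative only: `DIALECT_TOKENS` and `dialect_tag` are hypothetical names, the table covers just a subset of the tokens listed above, and the `<dialect>` tag format is assumed from the Basic Usage example.

```python
# Subset of the dialect tokens listed above, keyed by language code.
# Extend it with the remaining languages as needed.
DIALECT_TOKENS = {
    "izh_Latn": {"alal", "soik"},
    "vep_Latn": {
        "Central Eastern Veps",
        "Central Western Veps",
        "New written Veps",
        "Northern Veps",
        "Southern Veps",
    },
    "sjd_Cyrl": {"orth1"},
    "kca_Cyrl": {"kazym", "shuryshkary"},
}


def dialect_tag(target_lang: str, dialect: str) -> str:
    """Return the "<dialect>" tag, or raise if the pair is not in DIALECT_TOKENS."""
    if dialect not in DIALECT_TOKENS.get(target_lang, set()):
        raise ValueError(
            f"{dialect!r} is not a listed dialect token for {target_lang!r}"
        )
    return f"<{dialect}>"
```

For example, `dialect_tag("vep_Latn", "New written Veps") + " This is a short example sentence."` builds the input string used in the Basic Usage example, while a misspelled token raises a `ValueError`.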
📄 License
This project is licensed under the cc-by-4.0 license.