🚀 SpeechT5 - Russian translit
This model is a fine-tuned version of microsoft/speecht5_tts for text-to-speech, trained on the Common Voice 13 dataset.
🚀 Quick Start
This model is a fine-tuned version of microsoft/speecht5_tts on the Common Voice 13 dataset. It achieves a loss of 0.4853 on the evaluation set.
✨ Features
- The input should be Russian text in transliterated form (produced with the transliterate package).
- This is just a test for the hands-on exercise of the HF Audio Course and is not intended for actual use.
💻 Usage Examples
No usage example accompanies the published model card.
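A minimal inference sketch, assuming the checkpoint loads with the standard SpeechT5 classes in transformers and that its repository also contains the processor files. `MODEL_ID` is a hypothetical placeholder, since the card does not state the Hub repository id. Pairing the model with the `microsoft/speecht5_hifigan` vocoder is likewise an assumption based on common SpeechT5 usage, not something the card confirms.

```python
"""Hedged inference sketch for a SpeechT5 checkpoint fine-tuned on transliterated Russian."""

# Hypothetical placeholder: the card does not state the Hub repository id.
MODEL_ID = "your-username/speecht5-russian-translit"
# Assumption: the HiFi-GAN vocoder commonly paired with SpeechT5.
VOCODER_ID = "microsoft/speecht5_hifigan"


def synthesize(latin_text, speaker_embedding, model_id=MODEL_ID):
    """Return a 1-D waveform tensor (16 kHz) for already-transliterated text.

    speaker_embedding: torch tensor of shape (1, 512) holding an x-vector.
    """
    # Imported lazily so this module can be loaded without the heavy dependencies.
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained(model_id)
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_ID)

    # Tokenize the Latin-script text and generate audio through the vocoder.
    inputs = processor(text=latin_text, return_tensors="pt")
    with torch.no_grad():
        return model.generate_speech(
            inputs["input_ids"], speaker_embedding, vocoder=vocoder
        )
```

SpeechT5 needs a speaker embedding to condition the voice. One option is an x-vector taken from the `Matthijs/cmu-arctic-xvectors` dataset, reshaped to `(1, 512)`. The returned waveform can be written to disk at a 16 kHz sampling rate.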
📚 Documentation
Model description
Input should be Russian text in transliterated form (use the transliterate package). This is just a test for the hands-on exercise of the HF Audio Course and is not intended for actual use.
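The transliteration step could look like the sketch below. It assumes the `transliterate` package's `translit` function with the `"ru"` language pack and `reversed=True` (Cyrillic to Latin). The card does not say which options were used during training, so treat these as a guess. The helper names `to_latin` and `has_cyrillic` are illustrative only.

```python
def has_cyrillic(text):
    """Return True if any character falls in the basic Cyrillic Unicode block."""
    return any("\u0400" <= ch <= "\u04FF" for ch in text)


def to_latin(text):
    """Transliterate Russian Cyrillic text to Latin script.

    Assumption: the "ru" language pack with reversed=True matches what the
    model saw during training; the card does not confirm this.
    """
    # Imported lazily; install with `pip install transliterate`.
    from transliterate import translit

    return translit(text, "ru", reversed=True)


def prepare_input(text):
    """Transliterate only when the text still contains Cyrillic characters."""
    return to_latin(text) if has_cyrillic(text) else text
```

Checking for leftover Cyrillic before synthesis is a cheap guard, since untransliterated characters would fall outside what the model was trained on.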
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- training_steps: 2000
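Two of these values can be checked with simple arithmetic. The sketch below shows how the total batch size of 64 follows from the per-device batch size and gradient accumulation, and what learning rate a linear schedule with warmup yields at a given step. It assumes a single training device, which is consistent with 8 × 8 = 64, and the usual warmup-then-linear-decay shape. The function names are illustrative.

```python
LEARNING_RATE = 1e-5
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 8
WARMUP_STEPS = 400
TRAINING_STEPS = 2000


def effective_batch_size(per_device, accumulation, num_devices=1):
    """Examples consumed per optimizer step (num_devices=1 is an assumption)."""
    return per_device * accumulation * num_devices


def linear_lr(step, peak=LEARNING_RATE, warmup=WARMUP_STEPS, total=TRAINING_STEPS):
    """Learning rate at `step`: linear warmup to `peak`, then linear decay to 0."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print("total_train_batch_size:", effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
for step in (0, 200, 400, 1200, 2000):
    print(f"step {step:>4}: lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate peaks at 1e-05 at step 400 and is back to zero at step 2000.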
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.0359 | 0.6 | 50 | 0.8176 |
| 0.8866 | 1.19 | 100 | 0.6899 |
| 0.787 | 1.79 | 150 | 0.6478 |
| 0.7477 | 2.38 | 200 | 0.6233 |
| 0.6734 | 2.98 | 250 | 0.5630 |
| 0.6216 | 3.58 | 300 | 0.5429 |
| 0.593 | 4.17 | 350 | 0.5304 |
| 0.5817 | 4.77 | 400 | 0.5282 |
| 0.5734 | 5.37 | 450 | 0.5167 |
| 0.5688 | 5.96 | 500 | 0.5209 |
| 0.5662 | 6.56 | 550 | 0.5095 |
| 0.5609 | 7.15 | 600 | 0.5127 |
| 0.554 | 7.75 | 650 | 0.5041 |
| 0.5522 | 8.35 | 700 | 0.5038 |
| 0.5372 | 8.94 | 750 | 0.4984 |
| 0.5432 | 9.54 | 800 | 0.4995 |
| 0.5384 | 10.13 | 850 | 0.4971 |
| 0.5345 | 10.73 | 900 | 0.4981 |
| 0.5358 | 11.33 | 950 | 0.4942 |
| 0.5332 | 11.92 | 1000 | 0.4906 |
| 0.5334 | 12.52 | 1050 | 0.4897 |
| 0.5301 | 13.11 | 1100 | 0.4914 |
| 0.5298 | 13.71 | 1150 | 0.4894 |
| 0.524 | 14.31 | 1200 | 0.4871 |
| 0.5221 | 14.9 | 1250 | 0.4884 |
| 0.525 | 15.5 | 1300 | 0.4883 |
| 0.5232 | 16.1 | 1350 | 0.4866 |
| 0.5261 | 16.69 | 1400 | 0.4858 |
| 0.521 | 17.29 | 1450 | 0.4852 |
| 0.5225 | 17.88 | 1500 | 0.4849 |
| 0.5219 | 18.48 | 1550 | 0.4860 |
| 0.5207 | 19.08 | 1600 | 0.4839 |
| 0.5192 | 19.67 | 1650 | 0.4851 |
| 0.516 | 20.27 | 1700 | 0.4860 |
| 0.5186 | 20.86 | 1750 | 0.4811 |
| 0.5233 | 21.46 | 1800 | 0.4841 |
| 0.5145 | 22.06 | 1850 | 0.4819 |
| 0.5159 | 22.65 | 1900 | 0.4822 |
| 0.5146 | 23.25 | 1950 | 0.4831 |
| 0.5175 | 23.85 | 2000 | 0.4853 |
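The final row of the results (step 2000 at epoch 23.85) allows a rough estimate of the training set size, which the card does not state. The sketch below is a back-of-the-envelope calculation. It assumes every optimizer step consumes the full total batch of 64 examples, and it ignores the rounding in the reported epoch values, so the result is only approximate.

```python
FINAL_STEP = 2000
FINAL_EPOCH = 23.85          # epoch value reported alongside the final step
TOTAL_TRAIN_BATCH_SIZE = 64  # from the hyperparameter list

# Optimizer steps needed to pass over the training data once.
steps_per_epoch = FINAL_STEP / FINAL_EPOCH

# Approximate number of training examples seen per epoch.
examples_per_epoch = steps_per_epoch * TOTAL_TRAIN_BATCH_SIZE

print(f"steps per epoch (approx.): {steps_per_epoch:.1f}")
print(f"training examples (approx.): {examples_per_epoch:.0f}")
```

This suggests a training subset of roughly 5,400 examples, far smaller than the full Russian portion of Common Voice, which fits the card's description of the model as a course exercise.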
Framework versions
- Transformers 4.31.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.4
- Tokenizers 0.13.3
🔧 Technical Details
The model is a fine-tuned version of microsoft/speecht5_tts on the Common Voice 13 dataset. It was trained for 2000 steps with the Adam optimizer and a linear learning rate schedule with 400 warmup steps, at a peak learning rate of 1e-05 and a total batch size of 64.
📄 License
This model is released under the MIT license.