🚀 Automatic Speech Recognition Model
This model performs automatic speech recognition (ASR) on Turkish speech. It is fine-tuned from an existing checkpoint on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - TR dataset and reaches a word error rate (WER) of 0.2835 on the evaluation set at the last logged step.
🚀 Quick Start
This model is a fine-tuned version of [./checkpoint-1000](https://huggingface.co/./checkpoint-1000) on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - TR dataset.
At the last logged evaluation (step 13000), it achieves the following results on the evaluation set:
- Loss: 0.3285
- Wer: 0.2835
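
The card does not include an inference snippet, so the following is a minimal sketch of how a fine-tuned ASR checkpoint like this one is commonly loaded with the Transformers `pipeline` API. The helper name `transcribe` and the model ID in the usage comment are hypothetical placeholders, since this card does not state the published model ID. Decoding an audio file from a path also assumes `ffmpeg` is installed.

```python
def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a Hugging Face ASR checkpoint.

    `model_id` is a placeholder argument: pass the Hub ID or local path
    of this model, which the card itself does not state.
    """
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # The ASR pipeline returns a dict whose "text" key holds the transcript.
    return asr(audio_path)["text"]


# Example call (hypothetical model ID and file name):
# print(transcribe("sample_tr.wav", "<user>/<model-name>"))
```

Common Voice clips may need resampling to the sampling rate the base model expects; the pipeline handles this when given a file path.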
🔧 Technical Details
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 96
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 192
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 100.0
- mixed_precision_training: Native AMP
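
To make the relationship between these values concrete, here is a small sketch of the batch-size arithmetic and of a linear schedule with warmup. It is an illustration, not the training script: the single-device setting and the total step count (about 13,100, estimated from the log ending at step 13000 near epoch 99.24) are assumptions, and the function names are made up for this example.

```python
# Values copied from the hyperparameter list above.
LEARNING_RATE = 3e-4
TRAIN_BATCH_SIZE = 96
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 100

# Assumptions (not stated in the card): one training device, and roughly
# 13,100 optimizer steps for 100 epochs.
NUM_DEVICES = 1
TOTAL_STEPS = 13_100


def effective_batch_size(per_step: int, accum_steps: int, devices: int = 1) -> int:
    """Number of samples contributing to each optimizer update."""
    return per_step * accum_steps * devices


def linear_lr(step: int,
              base_lr: float = LEARNING_RATE,
              warmup_steps: int = WARMUP_STEPS,
              total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_DEVICES))  # 192
for s in (0, 50, 100, 6600, 13_100):
    print(s, linear_lr(s))
```

Under these assumptions, 96 samples per step with 2 accumulation steps on one device gives the listed total_train_batch_size of 192, and the learning rate peaks at 0.0003 once the 100 warmup steps complete.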
Training results

| Training Loss | Epoch | Step | Validation Loss | WER |
|---|---|---|---|---|
| 1.0671 | 2.04 | 200 | 0.3079 | 0.2752 |
| 0.6433 | 4.08 | 400 | 0.2728 | 0.2848 |
| 0.5687 | 6.12 | 600 | 0.2882 | 0.3036 |
| 0.5355 | 8.16 | 800 | 0.2778 | 0.2920 |
| 0.5116 | 10.2 | 1000 | 0.2906 | 0.3014 |
| 0.5313 | 9.16 | 1200 | 0.2984 | 0.3273 |
| 0.4996 | 10.69 | 1400 | 0.3170 | 0.3344 |
| 0.4845 | 12.21 | 1600 | 0.3202 | 0.3634 |
| 0.5092 | 13.74 | 1800 | 0.3167 | 0.3373 |
| 0.4777 | 15.27 | 2000 | 0.3292 | 0.3386 |
| 0.4651 | 16.79 | 2200 | 0.3070 | 0.3427 |
| 0.461 | 18.32 | 2400 | 0.3149 | 0.3561 |
| 0.4481 | 19.85 | 2600 | 0.3292 | 0.3441 |
| 0.4479 | 21.37 | 2800 | 0.3142 | 0.3209 |
| 0.4305 | 22.9 | 3000 | 0.3525 | 0.3547 |
| 0.4254 | 24.43 | 3200 | 0.3414 | 0.3400 |
| 0.4066 | 25.95 | 3400 | 0.3118 | 0.3207 |
| 0.4043 | 27.48 | 3600 | 0.3418 | 0.3483 |
| 0.3985 | 29.01 | 3800 | 0.3254 | 0.3166 |
| 0.3982 | 30.53 | 4000 | 0.3306 | 0.3453 |
| 0.3929 | 32.06 | 4200 | 0.3262 | 0.3229 |
| 0.378 | 33.59 | 4400 | 0.3546 | 0.3336 |
| 0.4062 | 35.11 | 4600 | 0.3174 | 0.3457 |
| 0.3648 | 36.64 | 4800 | 0.3377 | 0.3357 |
| 0.3609 | 38.17 | 5000 | 0.3346 | 0.3520 |
| 0.3483 | 39.69 | 5200 | 0.3350 | 0.3526 |
| 0.3548 | 41.22 | 5400 | 0.3330 | 0.3406 |
| 0.3446 | 42.75 | 5600 | 0.3398 | 0.3372 |
| 0.3346 | 44.27 | 5800 | 0.3449 | 0.3288 |
| 0.3309 | 45.8 | 6000 | 0.3320 | 0.3144 |
| 0.326 | 47.33 | 6200 | 0.3400 | 0.3279 |
| 0.3189 | 48.85 | 6400 | 0.3400 | 0.3150 |
| 0.3165 | 50.38 | 6600 | 0.3359 | 0.2995 |
| 0.3132 | 51.91 | 6800 | 0.3343 | 0.3096 |
| 0.3092 | 53.44 | 7000 | 0.3224 | 0.3029 |
| 0.2995 | 54.96 | 7200 | 0.3205 | 0.2985 |
| 0.304 | 56.49 | 7400 | 0.3523 | 0.3034 |
| 0.2952 | 58.02 | 7600 | 0.3289 | 0.2934 |
| 0.2875 | 59.54 | 7800 | 0.3350 | 0.3008 |
| 0.2868 | 61.07 | 8000 | 0.3537 | 0.3227 |
| 0.2875 | 62.6 | 8200 | 0.3389 | 0.2970 |
| 0.2778 | 64.12 | 8400 | 0.3370 | 0.2960 |
| 0.2706 | 65.65 | 8600 | 0.3250 | 0.2802 |
| 0.2669 | 67.18 | 8800 | 0.3351 | 0.2903 |
| 0.2615 | 68.7 | 9000 | 0.3382 | 0.2989 |
| 0.2563 | 70.23 | 9200 | 0.3312 | 0.2975 |
| 0.2546 | 71.76 | 9400 | 0.3212 | 0.3003 |
| 0.2482 | 73.28 | 9600 | 0.3337 | 0.3091 |
| 0.2504 | 74.81 | 9800 | 0.3308 | 0.3110 |
| 0.2456 | 76.34 | 10000 | 0.3157 | 0.3118 |
| 0.2363 | 77.86 | 10200 | 0.3251 | 0.3144 |
| 0.2319 | 79.39 | 10400 | 0.3253 | 0.3038 |
| 0.2266 | 80.92 | 10600 | 0.3374 | 0.3038 |
| 0.2279 | 82.44 | 10800 | 0.3268 | 0.2964 |
| 0.2231 | 83.97 | 11000 | 0.3278 | 0.2950 |
| 0.2185 | 85.5 | 11200 | 0.3462 | 0.2981 |
| 0.2245 | 87.02 | 11400 | 0.3311 | 0.2895 |
| 0.223 | 88.55 | 11600 | 0.3325 | 0.2877 |
| 0.2121 | 90.08 | 11800 | 0.3337 | 0.2828 |
| 0.2126 | 91.6 | 12000 | 0.3325 | 0.2808 |
| 0.2027 | 93.13 | 12200 | 0.3277 | 0.2820 |
| 0.2058 | 94.66 | 12400 | 0.3308 | 0.2827 |
| 0.1991 | 96.18 | 12600 | 0.3279 | 0.2820 |
| 0.1991 | 97.71 | 12800 | 0.3300 | 0.2822 |
| 0.1986 | 99.24 | 13000 | 0.3285 | 0.2835 |

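
The WER values in the training results are word error rates: word-level edits (substitutions, deletions, insertions) divided by the number of reference words. The sketch below shows that definition for a single utterance using a standard edit-distance recurrence. It is illustrative only; the card does not say which metric implementation produced the reported numbers, and the function name and example sentences are made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is dropped, so the WER is 1/4.
print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))  # 0.25
```

Read this way, a WER of 0.2835 corresponds to roughly 28 word-level errors per 100 reference words. A corpus-level score is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance values.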
Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0

| Property | Details |
|---|---|
| Model Type | Fine-tuned model for automatic speech recognition |
| Training Data | MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - TR dataset |