🚀 Multilingual Language Detection Model
This model detects 45 languages. It is fine-tuned from multilingual-e5-base on the common_language dataset and reaches an overall accuracy of 98.37% on that dataset's test split.
🚀 Quick Start
📦 Installation
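# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('Mike0307/multilingual-e5-language-detection')
# num_labels matches the 45 languages the model was fine-tuned on.
model = AutoModelForSequenceClassification.from_pretrained('Mike0307/multilingual-e5-language-detection', num_labels=45)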
💻 Usage Examples
Basic Usage
import torch
# Label names in class-index order (index 0 is "Arabic", index 44 is "Welsh").
languages = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China", "Chinese_Hongkong",
    "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi", "Dutch", "English",
    "Esperanto", "Estonian", "French", "Frisian", "Georgian", "German", "Greek",
    "Hakha_Chin", "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian", "Persian", "Polish",
    "Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha", "Slovenian",
    "Spanish", "Swedish", "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh"
]
def predict(text, model, tokenizer, device=torch.device('cpu')):
    # Move the model to the target device and disable dropout for inference.
    model.to(device)
    model.eval()
    # Pad or truncate every input to 128 tokens.
    tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors="pt")
    input_ids = tokenized['input_ids']
    attention_mask = tokenized['attention_mask']
    with torch.no_grad():
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    # Convert logits to per-language probabilities (each row sums to 1).
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    return probabilities
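def get_topk(probabilities, languages, k=3):
    # torch.topk returns the k largest probabilities and their class indices, highest first.
    topk_prob, topk_indices = torch.topk(probabilities, k)
    # Keep only the first (and here only) item in the batch.
    topk_prob = topk_prob.cpu().numpy()[0].tolist()
    topk_indices = topk_indices.cpu().numpy()[0].tolist()
    topk_labels = [languages[index] for index in topk_indices]
    return topk_prob, topk_labels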
text = "你的測試句子"
probabilities = predict(text, model, tokenizer)
topk_prob, topk_labels = get_topk(probabilities, languages)
print(topk_prob, topk_labels)
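The top-k output above always names a language, even when the model is unsure. As a minimal sketch of one way to handle that, the hypothetical helper below returns the best label only when its probability clears a cutoff. The name `pick_language` and the default `threshold=0.5` are assumptions for illustration, not part of this model or its published usage, and the example inputs are hand-written stand-ins for real `get_topk` output.

```python
def pick_language(topk_prob, topk_labels, threshold=0.5):
    """Return the top label, or None when the top probability is below `threshold`.

    Assumes both lists are ordered from most to least likely, as torch.topk
    returns them by default. `threshold` is an illustrative value only.
    """
    if not topk_prob or not topk_labels:
        return None
    best_prob, best_label = topk_prob[0], topk_labels[0]
    return best_label if best_prob >= threshold else None


# Hand-written example values standing in for real get_topk output.
labels = ["Chinese_Taiwan", "Chinese_Hongkong", "Chinese_China"]
confident = pick_language([0.97, 0.02, 0.01], labels)
uncertain = pick_language([0.41, 0.38, 0.21], labels)
print(confident, uncertain)
```

A suitable cutoff depends on the application and would need to be tuned on held-out data; closely related labels such as the three Chinese variants are where a low top probability is most likely.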
📚 Documentation
Overview
The model supports 45 languages and is fine-tuned from multilingual-e5-base on the common_language dataset. Overall accuracy is 98.37%; per-language results are shown below.
Evaluation Results
All results are computed on the test split of the common_language dataset.
| Property | Details |
|---|---|
| Model Type | Fine-tuned from multilingual-e5-base |
| Training Data | common_language dataset |

| Index | Language | Precision | Recall | F1-score | Support |
|---|---|---|---|---|---|
| 0 | Arabic | 1.00 | 1.00 | 1.00 | 151 |
| 1 | Basque | 0.99 | 1.00 | 1.00 | 111 |
| 2 | Breton | 1.00 | 0.90 | 0.95 | 252 |
| 3 | Catalan | 0.96 | 0.99 | 0.97 | 96 |
| 4 | Chinese_China | 0.98 | 1.00 | 0.99 | 100 |
| 5 | Chinese_Hongkong | 0.97 | 0.87 | 0.92 | 115 |
| 6 | Chinese_Taiwan | 0.92 | 0.98 | 0.95 | 170 |
| 7 | Chuvash | 0.98 | 1.00 | 0.99 | 137 |
| 8 | Czech | 0.98 | 1.00 | 0.99 | 128 |
| 9 | Dhivehi | 1.00 | 1.00 | 1.00 | 111 |
| 10 | Dutch | 0.99 | 1.00 | 0.99 | 144 |
| 11 | English | 0.96 | 1.00 | 0.98 | 98 |
| 12 | Esperanto | 0.98 | 0.98 | 0.98 | 107 |
| 13 | Estonian | 1.00 | 0.99 | 0.99 | 93 |
| 14 | French | 0.95 | 1.00 | 0.98 | 106 |
| 15 | Frisian | 1.00 | 0.98 | 0.99 | 117 |
| 16 | Georgian | 1.00 | 1.00 | 1.00 | 110 |
| 17 | German | 1.00 | 1.00 | 1.00 | 101 |
| 18 | Greek | 1.00 | 1.00 | 1.00 | 153 |
| 19 | Hakha_Chin | 0.99 | 1.00 | 0.99 | 202 |
| 20 | Indonesian | 0.99 | 0.99 | 0.99 | 150 |
| 21 | Interlingua | 0.96 | 0.97 | 0.96 | 182 |
| 22 | Italian | 0.99 | 0.94 | 0.96 | 100 |
| 23 | Japanese | 1.00 | 1.00 | 1.00 | 144 |
| 24 | Kabyle | 1.00 | 0.96 | 0.98 | 156 |
| 25 | Kinyarwanda | 0.97 | 1.00 | 0.99 | 103 |
| 26 | Kyrgyz | 0.98 | 1.00 | 0.99 | 129 |
| 27 | Latvian | 0.98 | 0.98 | 0.98 | 171 |
| 28 | Maltese | 0.99 | 0.98 | 0.98 | 152 |
| 29 | Mongolian | 1.00 | 1.00 | 1.00 | 112 |
| 30 | Persian | 1.00 | 1.00 | 1.00 | 123 |
| 31 | Polish | 0.91 | 0.99 | 0.95 | 128 |
| 32 | Portuguese | 0.94 | 0.99 | 0.96 | 124 |
| 33 | Romanian | 1.00 | 1.00 | 1.00 | 152 |
| 34 | Romansh_Sursilvan | 0.99 | 0.95 | 0.97 | 106 |
| 35 | Russian | 0.99 | 0.99 | 0.99 | 100 |
| 36 | Sakha | 0.99 | 1.00 | 1.00 | 105 |
| 37 | Slovenian | 0.99 | 1.00 | 1.00 | 166 |
| 38 | Spanish | 0.96 | 0.95 | 0.95 | 94 |
| 39 | Swedish | 0.99 | 1.00 | 0.99 | 190 |
| 40 | Tamil | 1.00 | 1.00 | 1.00 | 135 |
| 41 | Tatar | 1.00 | 0.96 | 0.98 | 173 |
| 42 | Turkish | 1.00 | 1.00 | 1.00 | 137 |
| 43 | Ukranian | 0.99 | 1.00 | 1.00 | 126 |
| 44 | Welsh | 0.98 | 1.00 | 0.99 | 103 |
| | macro avg | 0.98 | 0.99 | 0.98 | 5963 |
| | weighted avg | 0.98 | 0.98 | 0.98 | 5963 |
| | overall accuracy | | | 0.9837 | 5963 |
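To make the table's columns easier to relate to one another, here is a small sketch that recomputes F1 as the harmonic mean of precision and recall for three rows copied from the table above. The helper name `f1_score` is an assumption for illustration, not the evaluation script used for this model. Because the table's precision and recall are already rounded to two decimals, a recomputed F1 can differ from the reported one by 0.01 on some rows (Basque, for example).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of the table.
rows = {
    "Breton": (1.00, 0.90, 0.95),
    "Chinese_Hongkong": (0.97, 0.87, 0.92),
    "Polish": (0.91, 0.99, 0.95),
}

for language, (precision, recall, reported) in rows.items():
    computed = round(f1_score(precision, recall), 2)
    print(f"{language}: computed {computed:.2f}, reported {reported:.2f}")

# Overall accuracy times total support approximates the number of
# correctly classified test samples.
approx_correct = round(0.9837 * 5963)
print(approx_correct)
```

The rows chosen are the ones where precision and recall diverge most: low recall (Breton, Chinese_Hongkong) means samples of that language were assigned elsewhere, while low precision (Polish) means other languages were labeled as it.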
📄 License
This model is licensed under the Apache-2.0 license.