multilingual-e5-language-detection: Open-source Language Detection Model Supporting 45 Languages with 98.37% Accuracy

Multilingual E5 Language Detection

Developed by Mike0307

A language detection model supporting 45 languages, fine-tuned from multilingual-e5-base, with an overall accuracy of 98.37%.

Text Classification

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multilingual Detection #High Accuracy #45 Language Support

Downloads 570

Release Time: 1/27/2024

Model Overview

This model detects the language of a text and supports 45 languages, including Arabic, Chinese, and English. It is fine-tuned from the multilingual-e5-base model on the common-language dataset and reaches an overall accuracy of 98.37% on that dataset's test set.

Model Features

Multilingual Support

Supports detection of 45 languages, covering a wide range of language types

High Accuracy

Overall accuracy of 98.37%, with F1 scores of 0.95 or higher for every language except Chinese_Hongkong (0.92)

Fine-grained Chinese Detection

Capable of distinguishing between Chinese variants from Mainland China, Hong Kong, and Taiwan
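
When an application only needs to know that a text is Chinese, the three regional labels can be folded into one. The sketch below is an illustration, not part of the model: `merge_chinese_variants` is a hypothetical helper, and it assumes you already hold a mapping from label name to probability (for example, one built from the model's softmax output). The probabilities in the example are made up.

```python
def merge_chinese_variants(label_probs):
    """Sum the probabilities of all 'Chinese_*' labels into one 'Chinese' entry."""
    merged = {}
    for label, prob in label_probs.items():
        key = "Chinese" if label.startswith("Chinese_") else label
        merged[key] = merged.get(key, 0.0) + prob
    return merged

# Made-up probabilities, for illustration only.
example = {"Chinese_Taiwan": 0.55, "Chinese_Hongkong": 0.30, "Japanese": 0.15}
merged = merge_chinese_variants(example)
print(max(merged, key=merged.get))  # prints "Chinese"
```

Summing is a reasonable choice here because the softmax output assigns each label a share of the same total probability, so the merged value is the model's combined belief that the text is Chinese of any variant.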

Model Capabilities

Language Detection

Multilingual Text Classification

Use Cases

Content Management

Multilingual Content Classification

Automatically identifies the language of user-submitted content

Accuracy rate of 98.37%

Localization Services

Language Routing

Routes user requests to corresponding language services based on detected language
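
A minimal sketch of such routing, assuming the detector has already produced a label and a confidence score. The service names, the `ROUTES` table, the `route_request` function, and the 0.8 threshold are all hypothetical and would need to be adapted to a real deployment.

```python
# Hypothetical routing table: detected label -> downstream service name.
ROUTES = {
    "English": "support-en",
    "French": "support-fr",
    "Chinese_China": "support-zh-cn",
    "Chinese_Hongkong": "support-zh-hk",
    "Chinese_Taiwan": "support-zh-tw",
}
DEFAULT_ROUTE = "support-general"

def route_request(label, confidence, min_confidence=0.8):
    """Pick a service for a detected language, falling back when unsure."""
    if confidence < min_confidence:
        return DEFAULT_ROUTE
    # Labels without a dedicated service also use the fallback.
    return ROUTES.get(label, DEFAULT_ROUTE)

print(route_request("French", 0.97))  # prints "support-fr"
print(route_request("French", 0.41))  # prints "support-general"
print(route_request("Tamil", 0.99))   # prints "support-general"
```

The fallback route covers both low-confidence predictions and languages that have no dedicated service, so a request is never dropped because of a detection result.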

🚀 Multilingual Language Detection Model

This model is designed to detect 45 languages. It is fine-tuned on the common-language dataset using the multilingual-e5-base model. With an overall accuracy of 98.37%, it offers reliable language detection capabilities.

🚀 Quick Start

📦 Installation

# Requires the `transformers` and `torch` packages.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained('Mike0307/multilingual-e5-language-detection')
model = AutoModelForSequenceClassification.from_pretrained('Mike0307/multilingual-e5-language-detection', num_labels=45)

💻 Usage Examples

Basic Usage

import torch

# Label names in class-index order (same order as the Index column in the evaluation table).
languages = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China", "Chinese_Hongkong", 
    "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi", "Dutch", "English", 
    "Esperanto", "Estonian", "French", "Frisian", "Georgian", "German", "Greek", 
    "Hakha_Chin", "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle", 
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian", "Persian", "Polish", 
    "Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha", "Slovenian", 
    "Spanish", "Swedish", "Tamil", "Tatar", "Turkish", "Ukranian", "Welsh"
]

def predict(text, model, tokenizer, device=torch.device('cpu')):
    """Return softmax probabilities over the 45 labels, shape (batch, 45)."""
    model.to(device)
    model.eval()  # disable dropout for inference
    # Pad or truncate the input to a fixed length of 128 tokens.
    tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors="pt")
    input_ids = tokenized['input_ids'].to(device)
    attention_mask = tokenized['attention_mask'].to(device)
    with torch.no_grad():  # no gradients are needed at inference time
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    # Convert logits to probabilities over the language labels.
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    return probabilities

def get_topk(probabilities, languages, k=3):
    """Return the k highest probabilities and their labels for the first input."""
    topk_prob, topk_indices = torch.topk(probabilities, k)
    topk_prob = topk_prob.cpu().numpy()[0].tolist()
    topk_indices = topk_indices.cpu().numpy()[0].tolist()
    # Map class indices back to language names.
    topk_labels = [languages[index] for index in topk_indices]
    return topk_prob, topk_labels

text = "你的測試句子"  # Traditional Chinese for "your test sentence"
probabilities = predict(text, model, tokenizer)
topk_prob, topk_labels = get_topk(probabilities, languages)
print(topk_prob, topk_labels)

# [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
# ['Chinese_Taiwan', 'Chinese_Hongkong', 'Chinese_China']
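
The two parallel lists returned by `get_topk` are easier to act on once paired up and checked for ambiguity. The sketch below is one possible way to do that; `summarize_topk` and its `min_margin` parameter are hypothetical names introduced here, and the 0.5 margin is an arbitrary illustrative value, not a recommendation from the model's author.

```python
def summarize_topk(topk_prob, topk_labels, min_margin=0.5):
    """Pair labels with probabilities and flag close calls between the top two."""
    ranked = list(zip(topk_labels, topk_prob))
    best_label, best_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    return {
        "label": best_label,
        "probability": best_prob,
        # True when the best score does not clearly beat the runner-up.
        "ambiguous": (best_prob - runner_up_prob) < min_margin,
        "ranked": ranked,
    }

# Values copied from the example output above.
topk_prob = [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
topk_labels = ["Chinese_Taiwan", "Chinese_Hongkong", "Chinese_China"]
result = summarize_topk(topk_prob, topk_labels)
print(result["label"], result["ambiguous"])  # prints: Chinese_Taiwan False
```

A margin check like this is mainly useful for closely related labels, such as the three Chinese variants, where the top two scores can be near each other on short inputs.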

📚 Documentation

Overview

This model detects 45 languages. It is fine-tuned from the multilingual-e5-base model on the common-language dataset. The overall accuracy is 98.37%; per-language evaluation results are shown below.

Evaluation Results

The results below were measured on the test set of the common_language dataset.

Property	Details
Model Type	Fine-tuned from multilingual-e5-base
Training Data	common-language dataset

Index	Language	Precision	Recall	F1-score	Support
0	Arabic	1.00	1.00	1.00	151
1	Basque	0.99	1.00	1.00	111
2	Breton	1.00	0.90	0.95	252
3	Catalan	0.96	0.99	0.97	96
4	Chinese_China	0.98	1.00	0.99	100
5	Chinese_Hongkong	0.97	0.87	0.92	115
6	Chinese_Taiwan	0.92	0.98	0.95	170
7	Chuvash	0.98	1.00	0.99	137
8	Czech	0.98	1.00	0.99	128
9	Dhivehi	1.00	1.00	1.00	111
10	Dutch	0.99	1.00	0.99	144
11	English	0.96	1.00	0.98	98
12	Esperanto	0.98	0.98	0.98	107
13	Estonian	1.00	0.99	0.99	93
14	French	0.95	1.00	0.98	106
15	Frisian	1.00	0.98	0.99	117
16	Georgian	1.00	1.00	1.00	110
17	German	1.00	1.00	1.00	101
18	Greek	1.00	1.00	1.00	153
19	Hakha_Chin	0.99	1.00	0.99	202
20	Indonesian	0.99	0.99	0.99	150
21	Interlingua	0.96	0.97	0.96	182
22	Italian	0.99	0.94	0.96	100
23	Japanese	1.00	1.00	1.00	144
24	Kabyle	1.00	0.96	0.98	156
25	Kinyarwanda	0.97	1.00	0.99	103
26	Kyrgyz	0.98	1.00	0.99	129
27	Latvian	0.98	0.98	0.98	171
28	Maltese	0.99	0.98	0.98	152
29	Mongolian	1.00	1.00	1.00	112
30	Persian	1.00	1.00	1.00	123
31	Polish	0.91	0.99	0.95	128
32	Portuguese	0.94	0.99	0.96	124
33	Romanian	1.00	1.00	1.00	152
34	Romansh_Sursilvan	0.99	0.95	0.97	106
35	Russian	0.99	0.99	0.99	100
36	Sakha	0.99	1.00	1.00	105
37	Slovenian	0.99	1.00	1.00	166
38	Spanish	0.96	0.95	0.95	94
39	Swedish	0.99	1.00	0.99	190
40	Tamil	1.00	1.00	1.00	135
41	Tatar	1.00	0.96	0.98	173
42	Turkish	1.00	1.00	1.00	137
43	Ukranian	0.99	1.00	1.00	126
44	Welsh	0.98	1.00	0.99	103
	macro avg	0.98	0.99	0.98	5963
	weighted avg	0.98	0.98	0.98	5963
	overall accuracy			0.9837	5963
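
The F1-score column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the sketch below recomputes it for three rows of the table. The `f1_score` helper is written here for illustration; because the table's precision and recall are already rounded to two decimals, a recomputed value can differ from the published one in the last digit for other rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
rows = {
    "Breton": (1.00, 0.90),
    "Chinese_Hongkong": (0.97, 0.87),
    "Polish": (0.91, 0.99),
}
for name, (p, r) in rows.items():
    print(name, round(f1_score(p, r), 2))
# Breton 0.95
# Chinese_Hongkong 0.92
# Polish 0.95
```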

📄 License

This model is licensed under the Apache-2.0 license.
