🚀 fastText (Language Identification)
fastText is an open-source library for learning text representations and text classifiers, distributed here with a pre-trained model for language identification that can detect a wide range of languages (217 in the hosted version).
🚀 Quick Start
fastText is an open-source, free, lightweight library that lets users learn text representations and text classifiers. It runs on standard, generic hardware, and models can later be downsized to fit on mobile devices. It was introduced in the papers cited below, and its official website is https://fasttext.cc/.
This LID (Language IDentification) model predicts the language of input text. The hosted version (lid218e) was released as part of the NLLB project and can detect 217 languages. Older versions (identifying 157 languages) are available on the official fastText website.
✨ Features
- Efficient Learning: fastText is designed for efficient learning of word representations and sentence classification. It allows for quick model iteration and refinement without specialized hardware, and can train models on more than a billion words within minutes on any multicore CPU.
- Multi-language Support: It ships pre-trained word vectors for 157 different languages, trained on Common Crawl and Wikipedia.
- Versatile Usage: It can be used as a command-line tool, linked into a C++ application, or used as a library for use cases ranging from experimentation and prototyping to production.
📦 Installation
The upstream model card does not list installation steps. In practice, the Python examples below require the `fasttext` and `huggingface_hub` packages, which can be installed with pip (`pip install fasttext huggingface_hub`).
💻 Usage Examples
Basic Usage
>>> import fasttext
>>> from huggingface_hub import hf_hub_download
>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
(('__label__eng_Latn',), array([0.81148803]))
>>> model.predict("Hello, world!", k=5)
(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'),
array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
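The labels returned by predict carry a __label__ prefix followed by the language code and script. If you prefer bare codes paired with their confidences, a small helper can strip the prefix; detect_language below is an illustrative name, not part of the library.
>>> def detect_language(text, k=1):
...     # Strip the "__label__" prefix and pair each language code with its confidence score
...     labels, scores = model.predict(text, k=k)
...     return [(label.replace("__label__", ""), float(score)) for label, score in zip(labels, scores)]
>>> detect_language("Hello, world!")  # e.g. [('eng_Latn', 0.81...)] given the prediction above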
Advanced Usage
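Cosine similarity can be used to measure how close two word vectors are: identical vectors score 1, unrelated vectors score around 0, and opposed vectors score -1. The snippet below probes a few word associations learned by the model.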
>>> import numpy as np
>>> def cosine_similarity(word1, word2):
...     return np.dot(model[word1], model[word2]) / (np.linalg.norm(model[word1]) * np.linalg.norm(model[word2]))
>>> cosine_similarity("man", "boy")
0.061653383
>>> cosine_similarity("man", "ceo")
0.11989131
>>> cosine_similarity("woman", "ceo")
-0.08834904
📚 Documentation
Intended Uses & Limitations
You can use pre-trained word vectors for text classification or language identification. Refer to the tutorials and resources on the official fastText website for the tasks that interest you.
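As a sketch of the text-classification use case (illustrative only, not part of this model card; the file names are placeholders), fastText's supervised trainer expects one example per line prefixed with __label__, and pre-trained word vectors can be plugged in via the pretrainedVectors option:
>>> import fasttext
>>> # Hypothetical training file: each line looks like
>>> # "__label__positive I really enjoyed this movie"
>>> clf = fasttext.train_supervised(
...     input="train.txt",                  # placeholder path to labelled data
...     epoch=5, lr=0.5,                    # typical starting hyperparameters
...     pretrainedVectors="cc.en.300.vec",  # optional: pre-trained word vectors (dim must match)
...     dim=300,
... )
>>> clf.predict("I really enjoyed this movie")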
⚠️ Important Note
Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.
Training Data
Pre-trained word vectors for 157 languages were trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. Three new word analogy datasets for French, Hindi, and Polish are also distributed.
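For reference, roughly comparable settings can be passed to fastText's unsupervised trainer. This is only a sketch: the position-weighted CBOW variant used for the published vectors is not a standard option in the Python API, and corpus.txt is a placeholder for a tokenized corpus.
>>> import fasttext
>>> vec_model = fasttext.train_unsupervised(
...     "corpus.txt",        # placeholder: path to a tokenized text corpus
...     model="cbow",        # CBOW objective
...     dim=300,             # 300-dimensional vectors
...     minn=5, maxn=5,      # character n-grams of length 5
...     ws=5,                # context window of size 5
...     neg=10,              # 10 negative samples
... )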
Training Procedure
Tokenization
We used the Stanford word segmenter for Chinese, Mecab for Japanese, and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew, or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.
More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.
Evaluation Datasets
The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.
BibTeX Entry and Citation Info
Please cite [1] if using this code for learning word representations, or [2] if using it for text classification.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
If you use these word vectors, please cite the following paper:
[4] E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages
@inproceedings{grave2018learning,
title={Learning Word Vectors for 157 Languages},
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}
(* These authors contributed equally.)
License
The language identification model is distributed under the [Creative Commons Attribution-NonCommercial 4.0 International Public License](https://creativecommons.org/licenses/by-nc/4.0/).