🚀 fastText (Language Identification)
fastText is an open-source library for learning text representations and text classifiers, distributed here with a pre-trained model for language identification that can detect a wide range of languages (217 in the hosted version).
🚀 Quick Start
fastText is an open-source, free, lightweight library that lets users learn text representations and text classifiers. It runs on standard, generic hardware, and models can later be downsized to fit on mobile devices. It was introduced in the papers cited below, and its official website is https://fasttext.cc/.
This LID (Language IDentification) model predicts the language of input text. The hosted version (lid218e) was released as part of the NLLB project and can detect 217 languages. Older versions (identifying 157 languages) are available on the official fastText website.
✨ Features
- Efficient Learning: fastText is designed for efficient learning of word representations and sentence classification. It allows for quick model iteration and refinement without specialized hardware, and can train models on more than a billion words within minutes on any multicore CPU.
- Multi-language Support: It ships pre-trained word vectors for 157 different languages, trained on Common Crawl and Wikipedia.
- Versatile Usage: It can be used as a command-line tool, linked into a C++ application, or used as a library for use cases ranging from experimentation and prototyping to production.
📦 Installation
The upstream model card does not list installation steps. In practice, the Python examples below require the `fasttext` and `huggingface_hub` packages, which can be installed with pip (`pip install fasttext huggingface_hub`).
💻 Usage Examples
Basic Usage
>>> import fasttext
>>> from huggingface_hub import hf_hub_download
>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")
(('__label__eng_Latn',), array([0.81148803]))
>>> model.predict("Hello, world!", k=5)
(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'),
array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
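The labels returned by predict carry a __label__ prefix followed by the language code and script. If you prefer bare codes paired with their confidences, a small helper can strip the prefix; detect_language below is an illustrative name, not part of the library.
>>> def detect_language(text, k=1):
...     # Strip the "__label__" prefix and pair each language code with its confidence score
...     labels, scores = model.predict(text, k=k)
...     return [(label.replace("__label__", ""), float(score)) for label, score in zip(labels, scores)]
>>> detect_language("Hello, world!")  # e.g. [('eng_Latn', 0.81...)] given the prediction above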
Advanced Usage
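Cosine similarity can be used to measure how close two word vectors are: identical vectors score 1, unrelated vectors score around 0, and opposed vectors score -1. The snippet below probes a few word associations learned by the model.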
>>> import numpy as np
>>> def cosine_similarity(word1, word2):
...     return np.dot(model[word1], model[word2]) / (np.linalg.norm(model[word1]) * np.linalg.norm(model[word2]))
>>> cosine_similarity("man", "boy")
0.061653383
>>> cosine_similarity("man", "ceo")
0.11989131
>>> cosine_similarity("woman", "ceo")
-0.08834904
📚 Documentation
Intended Uses & Limitations
You can use pre-trained word vectors for text classification or language identification. Refer to the tutorials and resources on the official fastText website for the tasks that interest you.
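As a sketch of the text-classification use case (illustrative only, not part of this model card; the file names are placeholders), fastText's supervised trainer expects one example per line prefixed with __label__, and pre-trained word vectors can be plugged in via the pretrainedVectors option:
>>> import fasttext
>>> # Hypothetical training file: each line looks like
>>> # "__label__positive I really enjoyed this movie"
>>> clf = fasttext.train_supervised(
...     input="train.txt",                  # placeholder path to labelled data
...     epoch=5, lr=0.5,                    # typical starting hyperparameters
...     pretrainedVectors="cc.en.300.vec",  # optional: pre-trained word vectors (dim must match)
...     dim=300,
... )
>>> clf.predict("I really enjoyed this movie")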
⚠️ Important Note
Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.
Training Data
Pre-trained word vectors for 157 languages were trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. Three new word analogy datasets for French, Hindi, and Polish are also distributed.
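For reference, roughly comparable settings can be passed to fastText's unsupervised trainer. This is only a sketch: the position-weighted CBOW variant used for the published vectors is not a standard option in the Python API, and corpus.txt is a placeholder for a tokenized corpus.
>>> import fasttext
>>> vec_model = fasttext.train_unsupervised(
...     "corpus.txt",        # placeholder: path to a tokenized text corpus
...     model="cbow",        # CBOW objective
...     dim=300,             # 300-dimensional vectors
...     minn=5, maxn=5,      # character n-grams of length 5
...     ws=5,                # context window of size 5
...     neg=10,              # 10 negative samples
... )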
Training Procedure
Tokenization
We used the Stanford word segmenter for Chinese, Mecab for Japanese, and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew, or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.
More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.
Evaluation Datasets
The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.
BibTeX Entry and Citation Info
Please cite [1] if using this code for learning word representations, or [2] if using it for text classification.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
If you use these word vectors, please cite the following paper:
[4] E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages
@inproceedings{grave2018learning,
title={Learning Word Vectors for 157 Languages},
author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}
(* These authors contributed equally.)
License
The language identification model is distributed under the [Creative Commons Attribution-NonCommercial 4.0 International Public License](https://creativecommons.org/licenses/by-nc/4.0/).