fasttext-zh-vectors: open-source text processing library with free support for Chinese word vector training and text classification

Fasttext Zh Vectors

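Developed by Facebook

fastText is an open-source, free, lightweight library for learning text representations and text classifiers. It supports training Chinese word vectors and text classification tasks.

Text embedding | Chinese | #Multilingual word vectors #Lightweight text classification #Efficient feature extraction

Downloads: 355

Published: 3/19/2023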

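Model overview

The fastText library focuses on text classification and word vector learning. It processes large-scale text data quickly on ordinary hardware and provides a pretrained Chinese word vector model.

Model features

Efficient training

Trains quickly on standard hardware, even on corpora at the scale of a billion words

Subword information

Uses character n-grams to capture morphological variation and the features of rare words

Multiple usage scenarios

Provides a command-line tool, a C++ library, and programming interfaces, covering the full workflow from experimentation to production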
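To make the subword feature above more concrete, here is a minimal sketch of character n-gram extraction. It assumes the word is wrapped in `<` and `>` boundary markers before slicing, as described in the subword paper cited later on this page. The function name `char_ngrams` and its defaults are illustrative and are not part of the fastText API; the real library additionally hashes n-grams into buckets, which this sketch omits.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of `word` with lengths minn..maxn.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a rare or unseen word still shares many of these n-grams with known words, a vector can be composed for it from its parts.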

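Model capabilities

Word vector generation

Text classification

Semantic similarity computation

Language identification

Synonym discovery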

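Use cases

Natural language processing

Semantic search

Use word vectors to compute the semantic relevance between query terms and documents

Improves the semantic matching precision of search results

Text classification

Automatically classify content such as news and reviews

Quickly build multi-class text classification systems

Language analysis

Language detection

Identify the language of the input text

Supports identification of 157 languages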
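For the text classification use case above, fastText's supervised mode reads one example per line, with each label carrying the `__label__` prefix (the same prefix that appears in the prediction outputs further down this page). The sketch below formats examples that way. The helper name `to_fasttext_line` and the sample data are hypothetical, and the Chinese text is assumed to be already segmented into space-separated words.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one training example as '<prefix>label ... text' on a single line."""
    tags = " ".join(f"{prefix}{label}" for label in labels)
    # Collapse newlines and repeated whitespace so each example stays on one line.
    cleaned = " ".join(text.split())
    return f"{tags} {cleaned}"


# Hypothetical, pre-segmented examples.
examples = [
    (["sports"], "球隊 昨晚 贏得 比賽"),
    (["finance", "news"], "股市 今日\n上漲"),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Lines written to a file in this form can then be passed to `fasttext.train_supervised(input=...)`.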

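🚀 fastText (Chinese)

fastText is an open-source, free, lightweight library that lets users learn text representations and text classifiers. It runs on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices. The library was introduced in a research paper (see the citation information below) and has an official website.

✨ Key features

Efficient learning: learns word representations and performs sentence classification efficiently.
Easy to use: accessible to developers, domain experts, and students.
Multilingual support: includes pretrained models learned on Wikipedia, covering over 157 different languages.
Multiple ways to use: works as a command-line tool, linked into a C++ application, or as a library, for use cases ranging from experimentation and prototyping to production.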

💻 Usage examples

Basic usage

The following shows how to load and use the pretrained vectors:

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-zh-vectors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.words

['the', 'of', 'and', 'to', 'in', 'a', 'that', 'is', ...]

>>> len(model.words)

145940

>>> model['bread']

array([ 4.89417791e-01,  1.60882145e-01, -2.25947708e-01, -2.94273376e-01,
       -1.04577184e-01,  1.17962055e-01,  1.34821936e-01, -2.41778508e-01, ...])

Advanced usage

Query the nearest neighbors of an English word vector

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-en-nearest-neighbors", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.get_nearest_neighbors("bread", k=5)

[(0.5641006231307983, 'butter'), 
 (0.48875734210014343, 'loaf'), 
 (0.4491206705570221, 'eat'), 
 (0.42444291710853577, 'food'), 
 (0.4229326844215393, 'cheese')]

Detect the language of a given text

>>> import fasttext
>>> from huggingface_hub import hf_hub_download

>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
>>> model = fasttext.load_model(model_path)
>>> model.predict("Hello, world!")

(('__label__eng_Latn',), array([0.81148803]))

>>> model.predict("Hello, world!", k=5)

(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'), 
 array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
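The tuples returned by `predict` above are compact but awkward to consume downstream. The sketch below turns them into plain records. It assumes, based only on the outputs shown here, that each label has the form `__label__<language>_<script>`. The helper name `parse_prediction` is hypothetical and not part of fastText.

```python
def parse_prediction(labels, probabilities, prefix="__label__"):
    """Convert predict()-style (labels, probabilities) into a list of dicts."""
    results = []
    for label, prob in zip(labels, probabilities):
        # Drop the label prefix if present.
        code = label[len(prefix):] if label.startswith(prefix) else label
        # Split 'eng_Latn' into a language code and a script code.
        language, _, script = code.partition("_")
        results.append(
            {"language": language, "script": script, "probability": float(prob)}
        )
    return results


# Values copied from the k=5 output above (first two entries only).
labels = ("__label__eng_Latn", "__label__vie_Latn")
probabilities = [0.61224753, 0.21323682]
parsed = parse_prediction(labels, probabilities)
print(parsed[0])
```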

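📚 Detailed documentation

Intended uses and limitations

You can use the pretrained word vectors for text classification or language identification. See the tutorials and resources on the official website to find tasks that interest you.

Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.

Cosine similarity can be used to measure the similarity between two word vectors. If the two vectors are identical, the cosine similarity is 1; for two completely unrelated (orthogonal) vectors, it is 0; if the two vectors point in opposite directions, it is -1.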

>>> import numpy as np

>>> def cosine_similarity(word1, word2):
...     return np.dot(model[word1], model[word2]) / (np.linalg.norm(model[word1]) * np.linalg.norm(model[word2]))

>>> cosine_similarity("man", "boy")

0.061653383

>>> cosine_similarity("man", "ceo")

0.11989131

>>> cosine_similarity("woman", "ceo")

-0.08834904
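The three boundary values described in this section (1, 0, and -1) can be checked on toy vectors without loading a model. This is a sketch in plain Python: the vectors are made up for illustration, and the function takes vectors directly rather than words, unlike the session above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([3.0, 4.0], [3.0, 4.0])        # identical vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated vectors
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])  # opposite vectors
print(same, orthogonal, opposite)
```

Similarities between real word vectors, such as the ones printed above, usually fall strictly between these extremes.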

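Training data

Pretrained word vectors for 157 languages were trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. We also distribute three new word analogy datasets, for French, Hindi, and Polish.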
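As a rough illustration, the hyperparameters listed above map onto the keyword arguments of `fasttext.train_unsupervised` as follows. This is a sketch under assumptions, not the authors' training script: the position-weights variant of CBOW is not configured here, and `corpus_path` is a hypothetical path to a tokenized, one-sentence-per-line text file.

```python
# Settings taken from the paragraph above; "character n-grams of length 5"
# is expressed as minn == maxn == 5.
PAPER_SETTINGS = {
    "model": "cbow",  # plain CBOW; position weights are not set in this sketch
    "dim": 300,       # vector dimension
    "minn": 5,        # shortest character n-gram
    "maxn": 5,        # longest character n-gram
    "ws": 5,          # context window size
    "neg": 10,        # number of negative samples
}


def train_vectors(corpus_path):
    """Train word vectors on `corpus_path` with the settings above."""
    import fasttext  # imported lazily so the settings can be inspected without it

    return fasttext.train_unsupervised(corpus_path, **PAPER_SETTINGS)
```

Calling `train_vectors("corpus.txt")` would return a model object that supports the same lookups as the pretrained one, for example `model["bread"]`.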

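Training procedure

Tokenization

Chinese uses the Stanford word segmenter.
Japanese uses Mecab.
Vietnamese uses UETsegmenter.
Languages written in the Latin, Cyrillic, Hebrew, or Greek scripts use the tokenizer from the Europarl preprocessing tools.
The remaining languages use the ICU tokenizer.

More information about how these models were trained can be found in the article Learning Word Vectors for 157 Languages.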
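The tokenizer rules above amount to a small dispatch: three language-specific segmenters take priority, then the script decides. The sketch below only selects a tokenizer name and does not call any of the tools. The function `pick_tokenizer`, the language codes, and the script names are illustrative assumptions.

```python
# Languages with a dedicated segmenter, keyed by an assumed ISO 639-1 code.
LANGUAGE_TOKENIZERS = {
    "zh": "Stanford word segmenter",
    "ja": "Mecab",
    "vi": "UETsegmenter",
}

# Scripts handled by the tokenizer from the Europarl preprocessing tools.
EUROPARL_SCRIPTS = {"Latin", "Cyrillic", "Hebrew", "Greek"}


def pick_tokenizer(language, script):
    """Return the name of the tokenizer the rules above would select."""
    if language in LANGUAGE_TOKENIZERS:
        return LANGUAGE_TOKENIZERS[language]
    if script in EUROPARL_SCRIPTS:
        return "Europarl tokenizer"
    return "ICU tokenizer"


print(pick_tokenizer("zh", "Han"))         # Stanford word segmenter
print(pick_tokenizer("fr", "Latin"))       # Europarl tokenizer
print(pick_tokenizer("hi", "Devanagari"))  # ICU tokenizer
```

The language check comes first on purpose: Vietnamese is written in the Latin script but is still routed to UETsegmenter.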

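Evaluation datasets

The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.

Citation information

If you use this code to learn word representations, please cite [1]; if you use it for text classification, please cite [2].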

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

If you use these word vectors, please cite the following paper:

[4] E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

(* These authors contributed equally.)