llm-data-textbook-quality-fasttext-classifier-v2開源文本分類器

首頁

Llm Data Textbook Quality Fasttext Classifier V2

由kenhktsui開發

這是一個基於fasttext構建的教育價值分類器，用於判斷網絡文本是否具有較高的教育價值，適用於大語言模型(LLM)預訓練數據篩選。

文本分類英語開源協議:MIT #教育價值評估 #LLM數據篩選 #CPU高效

下載量 3,651

發布時間 : 5/19/2024

模型概述

該分類器可以判斷文本的教育價值水平，分為高、中、低三個等級，特別適用於LLM訓練數據的質量篩選。

模型特點

高效CPU推理

基於fasttext構建，在CPU上每秒可分類超過2000個樣本，適合即時使用

三級教育價值評估

提供高、中、低三個教育價值等級，比二元分類提供更細粒度的評估

量化模型支持

提供量化模型版本model_quantized.bin，優化推理效率

模型能力

文本分類

教育價值評估

數據質量篩選

使用案例

LLM訓練數據篩選

預訓練數據過濾

在LLM預訓練前篩選高質量教育價值的文本數據

提高訓練數據質量，改善模型性能

教育內容分析

教材內容評估

評估不同教育材料的教育價值水平

幫助識別高質量教育內容

🚀 📚llm-data-textbook-quality-fasttext-classifier-v2

本項目是一個教育價值分類器，能夠對來自網絡的文本是否具有高教育價值進行分類。它可以作為訓練大語言模型（LLM）時預訓練數據篩選的過濾器，具有較高的教育價值分類粒度。

博客 | 數據集

🚀 快速開始

更新信息

2024年7月7日：量化模型 "model_quantized.bin" 發佈。

model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model_quantized.bin"))

模型相關圖片

✨ 主要特性

教育價值分類功能

“輸入垃圾，輸出垃圾。無論語言模型的參數數量多少，其質量都取決於訓練數據的質量。”

該教育價值分類器可以對來自網絡的文本是否具有高教育價值（比教科書質量的定義更明確）進行分類。它深受論文 Textbooks Are All You Need 的啟發，該論文中開發了一個分類器來預測數據的教育價值，然後用於數據過濾。

該模型是在網絡/原始文本上進行訓練的，而不是在格式化為指令數據集的數據上（目前）。它可以在訓練大語言模型（LLM）時用作預訓練數據篩選的過濾器🤖。該模型有3個標籤，而不是2個，提供了更高的教育價值分類粒度：

高（前25%的教育價值）
中（中間25 - 75%的教育價值）
低（後25%的教育價值）

當該分類器有更多下游實驗結果時，將會有詳細的報告/論文發佈。關於該分類器的驗證，請參閱分析。該分類器已應用於各種預訓練數據集，具體情況請參閱 基準測試。

高性能計算

⚡ 該模型基於 fasttext 構建，在CPU上每秒可以對2000多個示例進行分類，因此可以在預訓練過程中即時使用。

請注意，教科書質量是高質量的一個子集。

反饋機制

💬 歡迎反饋！如果您覺得這個模型有幫助，請點贊並留下評論。我將持續致力於讓大語言模型的數據篩選變得更好、更簡單。

💻 使用示例

基礎用法

教育價值的範圍是 [0, 2]，詳細公式如下所述。

predict_educational_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
# 輸出 [1.9266871362924576]
predict_educational_value(['''"Attention Is All You Need" is a landmark[1][2] 2017 research paper authored by eight scientists working at Google, responsible for expanding 2014 attention mechanisms proposed by Bahdanau et al. into a new deep learning architecture known as the transformer. The paper is considered by some to be a founding document for modern artificial intelligence, as transformers became the main architecture of large language models.[3][4] At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but even in their paper the authors saw the potential for other tasks like question answering and for what is now called multimodal Generative AI.[5]'''])
# 輸出 [1.8226698189973831]
predict_educational_value(['''A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2]'''])
# 輸出 [1.7609568238258362]
predict_educational_value(['''In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.[1]'''])
# 輸出 [1.589950144290924]
predict_educational_value(['''The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[5] The structure of the input data is captured in the Wq and Wk weights, and the Wv weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Wq), Key (Wk), and Value (Wv)—a loose and possibly misleading analogy with relational database systems.'''])
# 輸出 [1.4657384157180786]
predict_educational_value(['''The Arsenal Football Club (commonly known as simply Arsenal) is an English professional football club based in Holloway, North London. Arsenal compete in the Premier League, the top flight of English football. In domestic football, Arsenal has won 13 league titles (including one unbeaten title), a record 14 FA Cups, two League Cups, 17 FA Community Shields, and a Football League Centenary Trophy. In European football, they have one European Cup Winners' Cup and one Inter-Cities Fairs Cup. In terms of trophies won, it is the third-most successful club in English football.[2]'''])
# 輸出 [1.1015518307685852]
predict_educational_value(['''The 2003–04 season was Arsenal Football Club's 12th season in the Premier League and their 78th consecutive season in the top flight of English football.[3][4] It began on 1 July 2003 and concluded on 30 June 2004, with competitive matches played between August and May. The club ended the Premier League campaign as champions without a single defeat – a record of 26 wins and 12 draws. Arsenal fared less well in the cups, eliminated in the FA Cup and League Cup semi-finals to Manchester United and Middlesbrough respectively, and at the quarter-final stage of the UEFA Champions League to Chelsea.'''])
# 輸出 [1.0146622359752655]
predict_educational_value(['''As both teams' first-choice kits featured a shade of red, Arsenal wore their yellow away strip, while Barcelona wore their traditional blue and maroon striped kit. Arsenal won the coin toss and Barcelona kicked off.[21] Barcelona almost immediately came under pressure when Thierry Henry shot straight at Barcelona goalkeeper Víctor Valdés, who conceded a corner. From the resulting corner Arsenal had another chance again courtesy of Henry, whose shot was again saved by Valdés. The next attack in the seventh minute resulted in Arsenal goalkeeper Jens Lehmann saving from Ludovic Giuly after he shot from a narrow angle. Four minutes later Barcelona were awarded a free-kick 35 yards from goal; Ronaldinho shot wide of the goal.'''])
# 輸出 [0.7897453680634499]

通過觀察可以發現，該模型更喜歡科學知識。它也對阿森納足球俱樂部感興趣，然而，它認為某場特定比賽的總結沒有很好的教育價值。

高級用法

from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2", "model.bin"))

def replace_newlines(text: str) -> str:
  return re.sub("\n+", " ", text)

score_dict = {
  '__label__': 0, 
  '__label__Low': 0, 
  '__label__Mid': 1,
  '__label__High': 2
}

def predict_educational_value(text_list):
  text_list = [replace_newlines(text) for text in text_list]
  pred = model.predict(text_list, k=-1)
  score_list = []
  for l, s in zip(*pred):
    score = 0
    for _l, _s in zip(l, s):
      score += score_dict[_l] * _s
    score_list.append(float(score))
  return score_list

predict_educational_value(["Hi"])
# 輸出: [3.0000010156072676e-05]

📚 詳細文檔

📊 基準測試

為了確保該分類器的有效性，將其應用於各種數據集。

教育價值 = 2分 * P(高) + 1分 * P(中) + 0分 * P(低)

該分數大致可以解釋為：

教育價值	類別
2	高
1	中
0	低

數據集	採樣	平均教育價值	類型
SciPhi/textbooks-are-all-you-need-lite	前100,000條	1.846	合成數據
nampdn-ai/tiny-orca-textbooks	前100,000條	1.673	合成數據
HuggingFaceTB/cosmopedia stanford	前100,000條	1.673	合成數據
vikp/textbook_quality_programming	前100,000條	1.663	合成數據
HuggingFaceTB/cosmopedia web_samples_v1	前100,000條	1.618	合成數據
nampdn-ai/tiny-textbooks	前100,000條	1.586	合成數據
HuggingFaceTB/cosmopedia web_samples_v2	前100,000條	1.562	合成數據
HuggingFaceTB/cosmopedia openstax	前100,000條	1.462	合成數據
HuggingFaceTB/cosmopedia wikihow	前100,000條	1.422	合成數據
HuggingFaceTB/cosmopedia khanacademy	前100,000條	1.419	合成數據
HuggingFaceTB/cosmopedia auto_math_text	前100,000條	1.347	合成數據
armanc/scientific_papers pubmed	前100,000條	1.260	真實數據
HuggingFaceTB/cosmopedia stories	前100,000條	1.154	合成數據
teknium/OpenHermes-2.5	前100,000條	1.121	合成數據
timdettmers/openassistant-guanaco	前100,000條	1.115	真實數據
open-web-math/open-web-math	前100,000條	1.089	真實數據
armanc/scientific_papers arxiv	前100,000條	1.068	真實數據
HuggingFaceFW/fineweb	前100,000條	1.056	真實數據
NousResearch/dolma-v1_7-305B*	前100,000條	1.037	真實數據
tatsu-lab/alpaca	前100,000條	1.020	合成數據
BEE-spoke-data/fineweb-100k_en-med	前100,000條	1.019	真實數據
JeanKaddour/minipile	前100,000條	0.998	真實數據
togethercomputer/RedPajama-Data-V2 en 2023-06	前100,000條	0.985	真實數據
wikipedia en 20220301	前100,000條	0.975	真實數據
Replete-AI/code_bagel	前100,000條	0.950	合成數據
allenai/c4 en	前100,000條	0.934	真實數據
mattymchen/refinedweb-3m	前100,000條	0.857	真實數據
iamtarun/python_code_instructions_18k_alpaca	前100,000條	0.849	合成數據
tiiuae/falcon-refinedweb	前100,000條	0.835	真實數據
BEE-spoke-data/FineMeme-100k	前100,000條	0.716	真實數據
neuralcatcher/hateful_memes	前100,000條	0.070	真實數據