instructor-base: an open-source text embedding model for sentence similarity and text retrieval

Instructor Base

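Developed by hkunlp

A T5-based text embedding model focused on sentence similarity and text retrieval, with strong results across MTEB benchmark tasks.

Text Embedding

Transformers

English

License: Apache-2.0

#Semantic text similarity #Multi-task text embedding #English information retrieval

Downloads: 13.22k

Release date: 12/20/2022

Model overview

This is a T5-based text embedding model that produces sentence embeddings for information retrieval, text classification, clustering, semantic similarity, and other natural language processing tasks.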

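Model features

Strong multi-task performance

Performs well across multiple MTEB benchmark tasks, including classification, clustering, and retrieval.

Efficient text embedding

Produces high-quality sentence embeddings suited to large-scale information retrieval.

Broad applicability

Supports a range of downstream NLP tasks, including similarity computation, classification, and clustering.

Model capabilities

Sentence similarity computation

Text embedding generation

Information retrieval

Text classification

Text clustering

Semantic search

Text reranking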

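Use cases

E-commerce

Product review classification

Classify the sentiment of Amazon product reviews.

Reaches 88.36% accuracy on the AmazonPolarity classification task.

Counterfactual detection

Identify counterfactual statements in Amazon product reviews.

Reaches 86.21% accuracy on the AmazonCounterfactual classification task.

Finance

Banking customer-service classification

Classify banking customer inquiries.

Reaches 77.04% accuracy on the Banking77 classification task.

Academic research

Paper clustering

Cluster arXiv and bioRxiv papers by topic.

Reaches a V-measure of 39.68 on the ArxivClusteringP2P task.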

🚀 hkunlp/instructor-base

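We introduce Instructor👨‍🏫, an instruction-finetuned text embedding model. Given only a task instruction, and without any further finetuning, it generates text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation) and any domain (e.g., science, finance). Instructor👨‍🏫 achieves state-of-the-art results on 70 diverse embedding tasks!

The model is easy to use with our customized sentence-transformer library. For more details, see our paper and project page!

**************************** Updates ****************************

01/21: We released a new checkpoint trained with hard negatives, which gives better performance.
12/21: We released our paper, code, checkpoint, and project page. Check them out!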

🚀 Quick start

📦 Installation

pip install InstructorEmbedding

💻 Usage examples

Basic usage

Compute customized embeddings:

from InstructorEmbedding import INSTRUCTOR

# Load the checkpoint by its model ID
model = INSTRUCTOR('hkunlp/instructor-base')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
# Each input is an [instruction, text] pair
embeddings = model.encode([[instruction, sentence]])
print(embeddings)

Advanced usage

Compute sentence similarity

from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter'], 
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday']]
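# Encode both lists with the model loaded in the basic example
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
# similarities[i][j] compares sentences_a[i] with sentences_b[j]
similarities = cosine_similarity(embeddings_a, embeddings_b)
print(similarities)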
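
To make explicit what the similarity matrix contains, the following is a minimal pure-Python sketch of pairwise cosine similarity. The vectors are made-up toy values rather than real model embeddings, and the helper names `cosine` and `cosine_matrix` are illustrative only; they are not part of InstructorEmbedding or scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_matrix(rows_a, rows_b):
    """Pairwise similarities: entry [i][j] compares rows_a[i] with rows_b[j]."""
    return [[cosine(u, v) for v in rows_b] for u in rows_a]

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
toy_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_b = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
toy_similarities = cosine_matrix(toy_a, toy_b)
print(toy_similarities)
```

Because cosine similarity depends only on direction, the first pair scores 1.0 even though the two vectors differ in length.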

Information retrieval

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
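# Encode the query and the corpus with the model loaded in the basic example
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
# One row of scores: the query against every corpus document
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
# Index of the document most similar to the query
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)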
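
Retrieval usually returns several candidates rather than a single best match. The sketch below ranks documents by score and keeps the top k. The scores are hypothetical and `top_k` is an illustrative helper, not a library function; with the variables from the example above, one row of the similarity matrix would play the role of `toy_scores`.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical similarity scores of one query against four documents.
toy_scores = [0.12, 0.87, 0.45, 0.80]
toy_ranking = top_k(toy_scores, k=2)
print(toy_ranking)  # [1, 3]
```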

Clustering

import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
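# Encode the titles with the model loaded in the basic example
embeddings = model.encode(sentences)
# Group the embeddings into two clusters
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
# labels_ holds one cluster ID per input sentence
cluster_assignment = clustering_model.labels_
print(cluster_assignment)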

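📚 Documentation

To compute customized embeddings for specific sentences, write the instruction using the following unified template:

Represent the domain text_type for task_objective:

domain is optional. It specifies the domain of the text, e.g., science, finance, medicine.
text_type is required. It specifies the encoding unit, e.g., sentence, document, paragraph.
task_objective is optional. It specifies the objective of the embedding, e.g., retrieving a document, classifying a sentence.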
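
The template can be assembled programmatically. The helper below is a small sketch of that idea; `build_instruction` is a hypothetical name introduced here for illustration and is not part of the InstructorEmbedding package.

```python
def build_instruction(text_type, domain=None, task_objective=None):
    """Assemble 'Represent the [domain] text_type [for task_objective]:'."""
    if not text_type:
        raise ValueError("text_type is required")
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append("for " + task_objective)
    return " ".join(parts) + ":"

print(build_instruction("title", domain="Science"))
print(build_instruction("document", domain="Wikipedia", task_objective="retrieval"))
print(build_instruction("sentence"))
```

The first two calls reproduce instructions used in the examples on this page: "Represent the Science title:" and "Represent the Wikipedia document for retrieval:".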

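📄 License

This project is released under the apache-2.0 license.

Model metrics

Property	Details
Model type	Text embedding model
Training data	Not specified

Detailed metrics for each task and dataset are listed below: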

Classification tasks

Dataset	Accuracy	AP	F1
MTEB AmazonCounterfactualClassification (en)	86.2089552238806	55.76273850794966	81.26104211414781
MTEB AmazonPolarityClassification	88.35995000000001	84.18839957309655	88.317619250081
MTEB AmazonReviewsClassification (en)	44.64	not reported	42.48663956478136
MTEB Banking77Classification	77.03571428571428	not reported	75.87384305045917
MTEB EmotionClassification	51.760000000000005	not reported	45.51690565701713
MTEB ImdbClassification	81.1744	75.44973697032414	81.09901117955782
MTEB MTOPDomainClassification (en)	93.71865025079799	not reported	93.38906173610519
MTEB MTOPIntentClassification (en)	70.2576379388965	not reported	49.20405830249464
MTEB MassiveIntentClassification (en)	67.48486886348351	not reported	64.92199176095157
MTEB MassiveScenarioClassification (en)	72.59246805648958	not reported	72.1222026389164
MTEB ToxicConversationsClassification	71.8194	14.447702451658554	55.13659412856185
MTEB TweetSentimentExtractionClassification	63.310696095076416	not reported	63.360434851097814

Retrieval tasks

Dataset	MAP@1	MAP@10	MAP@100	MAP@1000	MRR@1	MRR@10	MRR@100	MRR@1000	NDCG@1	NDCG@10	NDCG@100	NDCG@1000	Precision@1	Precision@10	Precision@100	Precision@1000	Recall@1	Recall@10	Recall@100	Recall@1000	Recall@3	Recall@5
MTEB ArguAna	27.383000000000003	43.024	44.023	44.025999999999996	28.094	43.315	44.313	44.317	27.383000000000003	52.032000000000004	56.19499999999999	56.272	27.383000000000003	8.087	0.989	0.099	27.383000000000003	80.868	98.86200000000001	99.431	51.28	65.22
MTEB CQADupstackAndroidRetrieval	33.739999999999995	46.197	47.814	47.934	41.059	52.292	52.978	53.015	41.059	52.608	57.965	59.775999999999996	41.059	9.943	1.6070000000000002	0.20500000000000002	33.739999999999995	63.888999999999996	85.832	97.475	51.953	57.498000000000005
MTEB CQADupstackEnglishRetrieval	31.169999999999998	41.455	42.716	42.847	39.427	47.818	48.519	48.558	39.427	47.181	51.737	53.74	39.427	8.847	1.425	0.189	31.169999999999998	56.971000000000004	76.31400000000001	88.93900000000001	45.208	49.923
MTEB CQADupstackGamingRetrieval	39.682	52.766000000000005	53.84100000000001	53.898	45.266	56.093	56.763	56.793000000000006	45.266	58.836	62.863	63.912	45.266	9.492	1.236	0.13699999999999998	39.682	73.233	90.335	97.452	58.562000000000005	65.569
MTEB CQADupstackGisRetrieval	26.743	34.016000000000005	35.028999999999996	35.113	28.927000000000003	36.32	37.221	37.281	28.927000000000003	38.474000000000004	43.580000000000005	45.64	28.927000000000003	5.74	0.8710000000000001	0.108	26.743	49.955	73.904	89.133	38.072	43.266
MTEB CQADupstackMathematicaRetrieval	16.928	23.549	24.887	25.018	21.02	27.898	29.018	29.099999999999998	21.02	28.277	34.54	37.719	21.02	5.361	0.9809999999999999	0.13899999999999998	16.928	38.601	65.759	88.543	25.556	30.447000000000003
MTEB CQADupstackPhysicsRetrieval	28.549000000000003	38.426	39.845000000000006	39.956	35.034	44.041000000000004	44.95	44.997	35.034	44.218	49.958000000000006	52.019000000000005	35.034	7.911	1.26	0.16	28.549000000000003	56.035999999999994	79.701	93.149	42.275	49.097
MTEB CQADupstackProgrammersRetrieval	29.391000000000002	39.48	40.727000000000004	40.835	35.959	44.726	45.531	45.582	35.959	45.303	50.683	52.818	35.959	8.241999999999999	1.274	0.163	29.391000000000002	57.364000000000004	80.683	94.918	42.263	48.634
MTEB CQADupstackRetrieval	26.791749999999997	35.75541666666667	37.00791666666667	37.12408333333333	31.744333333333337	39.9925	40.86458333333333	40.92175000000001	31.744333333333337	40.95008333333334	46.25966666666667	48.535333333333334	31.744333333333337	7.135166666666666	1.1535833333333334	0.15391666666666665	26.791749999999997	51.98625	75.30358333333334	91.05433333333333	39.39583333333333	45.05925
MTEB CQADupstackStatsRetrieval	22.219	29.162	30.049999999999997	30.144	25.153	31.814999999999998	32.573	32.645	25.153	33.099000000000004	37.768	40.331	25.153	5.183999999999999	0.8170000000000001	0.11100000000000002	22.219	42.637	64.704	83.963	32.444	36.802
MTEB CQADupstackTexRetrieval	17.427999999999997	24.029	25.119999999999997	25.257	21.129	27.750000000000004	28.666999999999998	28.754999999999995	21.129	28.203	33.44	36.61	21.129	5.055	0.909	0.13699999999999998	17.427999999999997	36.923	60.606	83.19	26.845000000000002	31.247000000000003
MTEB CQADupstackUnixRetrieval	26.457000000000004	35.228	36.475	36.585	30.784	39.133	40.11	40.169	30.784	40.358	46.119	48.428	30.784	6.800000000000001	1.083	0.13899999999999998	26.457000000000004	51.845	77.046	92.892	38.89	44.688
MTEB CQADupstackWebmastersRetrieval	29.378999999999998	37.373	39.107	39.317	35.178	42.44	43.434	43.482	35.178	42.82	48.935	51.28	35.178	7.945	1.524	0.242	29.378999999999998	52.141999999999996	79.49000000000001	93.782	39.579	45.462
MTEB CQADupstackWordpressRetrieval	19.814999999999998	27.383999999999997	28.483999999999998	28.585	21.996	29.584	30.611	30.684	21.996	32.024	37.528	40.150999999999996	21.996	5.102	0.856	0.117	19.814999999999998	44.239	69.269	89.216	31.102999999999998	38.078
MTEB ClimateFEVER	11.349	19.436	21.282999999999998	21.479	25.863000000000003	37.218	38.198	38.236	25.863000000000003	27.953	35.327	38.708999999999996	25.863000000000003	8.99	1.6889999999999998	0.232	11.349	34.581	60.178	78.88199999999999	20.041999999999998	25.458
MTEB DBPedia	7.893	15.457	20.905	22.116	57.49999999999999	65.467	66.022	66.039	45.875	33.344	36.849	44.03	57.49999999999999	25.95	7.89	1.669	7.893	20.724999999999998	42.516	65.822	12.615000000000002	15.482000000000001
MTEB FEVER	53.882	65.902	66.33	66.348	58.041	70.133	70.463	70.47	58.041	71.84700000000001	73.699	74.06700000000001	58.041	9.427000000000001	1.049	0.11	53.882	85.99	94.09100000000001	96.612	75.25	80.997
MTEB FiQA2018	19.165	31.845000000000002	33.678999999999995	33.878	38.272	47.04	47.923	47.973	38.272	39.177	45.995000000000005	49.312	38.272	10.926	1.809	0.23700000000000002	19.165	45.103	70.295	90.592	32.832	37.905
MTEB HotpotQA	32.397	44.83	45.716	45.797	64.794	71.866	72.22	72.238	64.794	54.186	57.623000000000005	59.302	64.794	11.219	1.394	0.16199999999999998	32.397	56.096999999999994	69.696	80.88499999999999	46.150999999999996	50.993
MTEB MSMARCO	19.519000000000002	31.025000000000002	32.275999999999996	32.329	20.115	31.569000000000003	32.768	32.816	20.115	37.756	43.858000000000004	45.199	20.115	6.122	0.919	0.10300000000000001	19.519000000000002	58.62500000000001	86.99	97.268	37.002	46.778
MTEB NFCorpus	5.185	11.158	14.041	15.360999999999999	44.582	53.083999999999996	53.787	53.824000000000005	42.57	31.593	29.093999999999998	37.909	43.963	23.498	7.6160000000000005	2.032	5.185	15.234	29.49	62.273999999999994	9.55	11.103
MTEB NQ	23.803	38.183	39.421	39.464	26.68	40.439	41.415	41.443999999999996	26.68	45.882	51.227999999999994	52.207	26.68	7.9750000000000005	1.0959999999999999	0.11900000000000001	23.803	67.152	90.522	97.743	45.338	55.106
MTEB QuoraRetrieval	70.473	84.452	85.101	85.115	81.19	87.324	87.434	87.435	81.21000000000001	88.19	89.44	89.526	81.21000000000001	13.417000000000002	1.537	0.157	70.473	95.367	99.616	99.996	86.936	91.557
MTEB SciFact	44.583	52.978	53.803	53.839999999999996	47.0	54.730000000000004	55.31399999999999	55.346	47.0	57.82899999999999	61.49400000000001	62.676	47.0	7.867	0.997	0.11	44.583	71.172	87.7	97.333	56.511	64.206
MTEB TRECCOVID	0.2	1.398	7.406	18.401	70.0	79.25999999999999	79.25999999999999	79.25999999999999	63.0	58.548	45.216	41.149	70.0	64.0	46.92	18.642	0.2	1.6729999999999998	10.856	38.964999999999996	0.504	0.852
MTEB Touche2020	1.6629999999999998	8.601	14.354	15.927	18.367	34.466	35.235	35.27	14.285999999999998	20.374	33.532000000000004	45.561	18.367	20.204	7.489999999999999	1.5630000000000002	1.6629999999999998	15.549	47.497	84.524	5.289	8.035
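
For orientation, the sketch below shows how the cutoff-based retrieval metrics (precision, recall, and reciprocal rank at k) are computed for a single query. The document IDs and relevance judgments are invented for illustration. Benchmark scores average such per-query values over all queries, and the figures in the table above appear to be reported as percentages, which is an assumption based on their magnitude.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant result in the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking for one query, with two relevant documents.
toy_ranked = ["d3", "d1", "d7", "d2", "d5"]
toy_relevant = {"d1", "d2"}

print(precision_at_k(toy_ranked, toy_relevant, 5))        # 0.4
print(recall_at_k(toy_ranked, toy_relevant, 3))           # 0.5
print(reciprocal_rank_at_k(toy_ranked, toy_relevant, 10)) # 0.5
```

When each query has exactly one relevant document, precision at k equals recall at k divided by k, which explains why the precision columns shrink as k grows.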

Clustering tasks

Dataset	V-measure
MTEB ArxivClusteringP2P	39.68441054431849
MTEB ArxivClusteringS2S	29.188539728343844
MTEB BiorxivClusteringP2P	32.98041170516364
MTEB BiorxivClusteringS2S	25.71652988451154
MTEB MedrxivClusteringP2P	30.887642595096825
MTEB MedrxivClusteringS2S	28.3764418784054
MTEB RedditClustering	59.25776525253911
MTEB RedditClusteringP2P	63.22135271663078
MTEB StackExchangeClustering	65.0394225901397
MTEB StackExchangeClusteringP2P	35.27954189859326
MTEB TwentyNewsgroupsClustering	51.30677907335145
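
V-measure is the harmonic mean of homogeneity (each cluster contains a single class) and completeness (each class falls in a single cluster). The pure-Python sketch below follows that standard definition on invented labels; it is an illustration, not the benchmark's evaluation code, and it yields a value between 0 and 1, whereas the table above appears to use a 0 to 100 scale.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(targets, given):
    """H(targets | given), estimated from paired label lists."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())

def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness."""
    h_true = entropy(true_labels)
    h_cluster = entropy(cluster_labels)
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - conditional_entropy(true_labels, cluster_labels) / h_true)
    completeness = 1.0 if h_cluster == 0 else (
        1.0 - conditional_entropy(cluster_labels, true_labels) / h_cluster)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Cluster IDs are arbitrary: a relabeled perfect clustering still scores 1.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
# Putting everything in one cluster is complete but not homogeneous.
print(v_measure([0, 0, 1, 1], [0, 0, 0, 0]))
```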

Reranking tasks

Dataset	MAP	MRR
MTEB AskUbuntuDupQuestions	63.173362687519784	76.18860748362133
MTEB SciDocsRR	78.82761687254882	93.46223674655047
MTEB StackOverflowDupQuestions	50.99055979974896	51.82745257193787

STS tasks

Dataset	Cosine similarity Spearman correlation
MTEB BIOSSES	82.30789953771232
MTEB SICK-R	80.25888668589654
MTEB STS12	77.02037527837669
MTEB STS13	86.58432681008449
MTEB STS14	81.31697756099051
MTEB STS15	88.18867599667057
MTEB STS16	84.87853941747623
MTEB STS17 (en-en)	89.46479925383916
MTEB STS22 (en)	66.45272113649146
MTEB STSBenchmark	86.43357313527851
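
The STS score is a Spearman rank correlation between the model's cosine similarities and human similarity ratings for the same sentence pairs. The sketch below computes it in pure Python on invented numbers; the data are hypothetical, and the result lies between -1 and 1, whereas the table above appears to use a 0 to 100 scale.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical cosine similarities and human ratings for four sentence pairs.
toy_model_scores = [0.91, 0.35, 0.62, 0.10]
toy_human_scores = [4.8, 2.0, 1.5, 0.4]
print(spearman(toy_model_scores, toy_human_scores))
```

Only the ordering of the scores matters, so the model's similarities and the human ratings do not need to share a scale.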

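Pair classification tasks

Dataset	Cosine accuracy	Cosine AP	Cosine F1	Cosine precision	Cosine recall	Dot-product accuracy	Dot-product AP	Dot-product F1	Dot-product precision	Dot-product recall	Euclidean accuracy	Euclidean AP	Euclidean F1	Euclidean precision	Euclidean recall	Manhattan accuracy	Manhattan AP	Manhattan F1	Manhattan precision	Manhattan recall	Max accuracy	Max AP	Max F1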
MTEB SprintDuplicateQuestions	99.66237623762376	90.35465126226322	82.44575936883628	81.32295719844358	83.6	99.66237623762376	90.35464287920453	82.44575936883628	81.32295719844358	83.6	99.66237623762376	90.3546512622632	82.44575936883628	81.32295719844358	83.6	99.65940594059406	90.29220174849843	82.4987605354487	81.80924287118977	83.2	99.66237623762376	90.35465126226322	82.4987605354487
MTEB TwitterSemEval2015	86.12386004649221	73.99096426215495	68.18416968442834	66.86960933536275	69.55145118733509	86.12386004649221	73.99096813038672	68.18416968442834	66.86960933536275	69.55145118733509	86.12386004649221	73.99095984980165	68.18416968442834	66.86960933536275	69.55145118733509	86.09405734040651	73.96825745608601	68.13888179729383	65.99901088031652	70.42216358839049	86.12386004649221	73.99096813038672	68.18416968442834
MTEB TwitterURLCorpus	88.99367407924865	86.19720829843081	78.39889075384951	74.5110278818144	82.71481367416075	88.99367407924865	86.19718471454047	78.39889075384951	74.5110278818144	82.71481367416075	88.99367407924865	86.1972021422436	78.39889075384951	74.5110278818144	82.71481367416075	88.95680521597392	86.16659921351506	78.39125971550081	74.82502799552073	82.31444410224823	88.99367407924865	86.19720829843081	78.39889075384951