# 🚀 piccolo-base-zh
This document presents the performance metrics of the piccolo-base-zh model on MTEB (Massive Text Embedding Benchmark) tasks: semantic textual similarity (STS), classification, clustering, reranking, retrieval, and pair classification.
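For context, all of these scores are computed from the model's dense sentence embeddings. Below is a minimal sketch of producing such embeddings with the `sentence-transformers` library; the Hub repo ID `sensenova/piccolo-base-zh`, the example sentences, and the use of normalized embeddings are illustrative assumptions, not details taken from this document.

```python
# Minimal sketch: embed two Chinese sentences and compare them by cosine
# similarity. The Hub repo ID below is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sensenova/piccolo-base-zh")  # assumed repo ID

# "The weather is great today" / "The weather is nice today"
sentences = ["今天天气真好", "今天天气不错"]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the two sentence embeddings.
print(float(util.cos_sim(embeddings[0], embeddings[1])))
```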
## 📚 Documentation

### Model Performance Metrics

All scores in the tables below are percentages.
#### 1. Semantic Textual Similarity (STS)

| Task | Dataset | Split | Cosine Similarity (Pearson) | Cosine Similarity (Spearman) | Euclidean Distance (Pearson) | Euclidean Distance (Spearman) | Manhattan Distance (Pearson) | Manhattan Distance (Spearman) |
|------|---------|-------|------|------|------|------|------|------|
| STS | MTEB AFQMC | validation | 49.17 | 51.40 | 49.86 | 51.50 | 49.75 | 51.41 |
| STS | MTEB ATEC | test | 52.39 | 52.59 | 54.99 | 52.54 | 54.95 | 52.51 |
| STS | MTEB BQ | test | 60.99 | 62.68 | 61.09 | 62.42 | 61.15 | 62.49 |
| STS | MTEB LCQMC | test | 70.03 | 75.40 | 73.93 | 75.55 | 73.88 | 75.51 |
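To make the columns above concrete: for each STS dataset, both sentences of every pair are embedded, a similarity (or negative distance) score is computed per pair, and the Pearson and Spearman correlations between those scores and the human-annotated gold ratings are reported. A toy sketch, with random embeddings standing in for real model output:

```python
# Toy illustration of the STS columns: correlate model similarity scores
# with gold human ratings. Random embeddings stand in for real output.
import numpy as np
from scipy.stats import pearsonr, spearmanr

gold = np.array([4.5, 1.0, 3.2, 0.5])  # human similarity ratings per pair

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(4, 768))  # embeddings of the first sentence of each pair
emb2 = rng.normal(size=(4, 768))  # embeddings of the second sentence of each pair

# Cosine similarity per pair.
cosine = (emb1 * emb2).sum(1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)
# Distance-based variants correlate the gold ratings with the *negative*
# distance, so that larger values still mean "more similar".
neg_euclidean = -np.linalg.norm(emb1 - emb2, axis=1)
neg_manhattan = -np.abs(emb1 - emb2).sum(1)

for name, scores in [("cosine", cosine),
                     ("euclidean", neg_euclidean),
                     ("manhattan", neg_manhattan)]:
    print(name, pearsonr(gold, scores)[0], spearmanr(gold, scores)[0])
```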
#### 2. Classification

| Task | Dataset | Split | Accuracy | F1 | AP |
|------|---------|-------|----------|----|----|
| Classification | MTEB AmazonReviewsClassification (zh) | test | 40.24 | 39.43 | - |
| Classification | MTEB IFlyTek | validation | 44.35 | 36.40 | - |
| Classification | MTEB JDReview | test | 84.26 | 78.55 | 50.55 |
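For the classification tasks, the usual MTEB-style protocol (an assumption here, not stated in this document) is to freeze the embeddings and fit a lightweight classifier such as logistic regression on top, then score its predictions. A toy sketch of computing Accuracy and F1 this way:

```python
# Toy illustration of frozen-embedding classification scoring.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 5, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 5, 50)

# Fit a lightweight classifier on top of the (frozen) embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("F1:", f1_score(y_test, pred, average="macro"))  # averaging choice is an assumption
```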
#### 3. Clustering

| Task | Dataset | Split | V-Measure |
|------|---------|-------|-----------|
| Clustering | MTEB CLSClusteringP2P | test | 38.36 |
| Clustering | MTEB CLSClusteringS2S | test | 35.65 |
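V-measure, the single clustering metric above, is the harmonic mean of homogeneity and completeness: it scores how well clusters discovered over the embeddings agree with the gold category labels. A toy sketch using k-means (the choice of `MiniBatchKMeans` and the toy data are illustrative assumptions):

```python
# Toy illustration of V-measure: cluster the embeddings, then compare the
# predicted cluster assignments to the gold labels.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 768))
gold_labels = rng.integers(0, 10, 300)

kmeans = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0)
pred_labels = kmeans.fit_predict(embeddings)

print("V-measure:", v_measure_score(gold_labels, pred_labels))
```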
#### 4. Reranking

| Task | Dataset | Split | MAP | MRR |
|------|---------|-------|-----|-----|
| Reranking | MTEB CMedQAv1 | test | 85.25 | 87.77 |
| Reranking | MTEB CMedQAv2 | test | 86.15 | 88.54 |
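In reranking, each query comes with a mix of positive and negative candidate documents; the candidates are ordered by embedding similarity to the query, and MAP and MRR summarize how highly the positives are ranked. As a worked illustration, here is a self-contained MRR over already-ranked relevance labels (toy data):

```python
# Toy illustration of MRR: the average of 1/rank of the first relevant
# candidate, over queries whose candidates are already ranked.
def mean_reciprocal_rank(rankings):
    """rankings: one list of 0/1 relevance labels per query, in ranked order."""
    total = 0.0
    for labels in rankings:
        for rank, relevant in enumerate(labels, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# First relevant hit at rank 2 for query 1, rank 1 for query 2:
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
```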
#### 5. Retrieval

All retrieval results are on the dev split; "-" marks values not reported for MTEB MMarcoRetrieval.

| Metric | MTEB CmedqaRetrieval | MTEB CovidRetrieval | MTEB DuRetrieval | MTEB EcomRetrieval | MTEB MMarcoRetrieval |
|--------|------|------|------|------|------|
| MAP@1 | 23.683 | 72.998 | 26.537 | 48.5 | 64.739 |
| MAP@10 | 35.523 | 81.271 | 81.292 | 57.898 | 74.039 |
| MAP@100 | 37.456 | 81.534 | 84.031 | 58.599 | 74.38 |
| MAP@1000 | 37.576 | 81.535 | 84.066 | 58.616 | 74.391 |
| MAP@3 | 31.584 | 80.049 | 56.571 | 55.1 | 72.074 |
| MAP@5 | 33.685 | 80.793 | 71.082 | 56.805 | 73.293 |
| MRR@1 | 36.459 | 73.13 | 91.2 | 48.5 | 66.92 |
| MRR@10 | 44.534 | 81.193 | 93.893 | 57.898 | 74.636 |
| MRR@100 | 45.6 | 81.463 | 93.955 | 58.599 | 74.94 |
| MRR@1000 | 45.647 | 81.464 | 93.957 | 58.616 | 74.95 |
| MRR@3 | 42.186 | 80.067 | 93.617 | 55.1 | 72.911 |
| MRR@5 | 43.482 | 80.741 | 93.767 | 56.805 | 73.981 |
| NDCG@1 | 36.459 | 73.34 | 91.2 | 48.5 | 66.92 |
| NDCG@10 | 42.025 | 84.503 | 88.255 | 62.876 | 77.924 |
| NDCG@100 | 49.754 | 85.643 | 90.813 | 66.002 | 79.471 |
| NDCG@1000 | 51.816 | 85.693 | 91.144 | 66.467 | 79.734 |
| NDCG@3 | 37.056 | 82.135 | 87.435 | 57.162 | 74.172 |
| NDCG@5 | 38.962 | 83.401 | 85.961 | 60.264 | 76.236 |
| Precision@1 | 36.459 | 73.34 | 91.2 | 48.5 | 66.92 |
| Precision@10 | 9.485 | 9.536 | 42.14 | 7.87 | 9.5 |
| Precision@100 | 1.567 | 1.004 | 4.817 | 0.927 | - |
| Precision@1000 | 0.183 | 0.101 | 0.489 | 0.096 | - |
| Precision@3 | 21.13 | 29.54 | 78.467 | 21.033 | - |
| Precision@5 | 15.209 | 18.398 | 65.76 | 14.14 | - |
| Recall@1 | 23.683 | 72.998 | 26.537 | 48.5 | - |
| Recall@10 | 52.191 | 94.31 | 89.262 | 78.7 | - |
| Recall@100 | 84.491 | 99.368 | 97.783 | 92.7 | - |
| Recall@1000 | 98.196 | 99.789 | 99.498 | 96.4 | - |
| Recall@3 | 37.09 | 87.935 | 58.573 | 63.1 | - |
| Recall@5 | 43.262 | 90.991 | 75.154 | 70.7 | - |
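The retrieval rows follow the standard cut-off convention: each metric @k is computed over the top-k documents retrieved per query (ranked by embedding similarity) and averaged across queries. A toy sketch of Recall@k and binary-relevance NDCG@k (the helper functions and example documents are illustrative, not MTEB's actual evaluator):

```python
# Toy illustration of Recall@k and binary-relevance NDCG@k.
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal

ranked = ["d3", "d1", "d7", "d2"]  # system ranking for one query
relevant = {"d1", "d2"}            # gold relevant documents

print("Recall@3:", recall_at_k(ranked, relevant, 3))  # 1 of 2 relevant found -> 0.5
print("NDCG@3:", ndcg_at_k(ranked, relevant, 3))
```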
#### 6. Pair Classification

All results are for the PairClassification task on MTEB Cmnli, validation split; "-" marks metrics not reported for the Max row.

| Similarity Measure | Accuracy | AP | F1 | Precision | Recall |
|--------------------|----------|----|----|-----------|--------|
| Cosine Similarity | 74.20 | 82.33 | 76.64 | 68.59 | 86.84 |
| Dot Product | 70.33 | 77.47 | 73.67 | 62.85 | 88.99 |
| Euclidean Distance | 74.78 | 82.66 | 77.18 | 71.05 | 84.48 |
| Manhattan Distance | 74.77 | 82.56 | 77.18 | 69.54 | 86.70 |
| Max | 74.78 | 82.66 | 77.18 | - | - |
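Pair classification turns each similarity or distance measure into a binary same-meaning decision by thresholding it; to the best of my understanding of the MTEB protocol, Accuracy and F1 are reported at the best threshold, AP is threshold-free, and the Max row takes the best value across all measures. A toy sketch of the best-threshold search for cosine similarity (the data and the exhaustive search are illustrative assumptions):

```python
# Toy illustration of best-threshold pair classification over cosine
# similarities; AP is computed threshold-free from the raw scores.
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

sims = np.array([0.91, 0.35, 0.78, 0.12, 0.66])  # cosine similarity per pair
labels = np.array([1, 0, 1, 0, 1])               # 1 = positive (same meaning)

best_acc, best_t = 0.0, 0.0
for t in np.unique(sims):  # exhaustive search over candidate thresholds
    acc = accuracy_score(labels, (sims >= t).astype(int))
    if acc > best_acc:
        best_acc, best_t = acc, t

print("best threshold:", best_t, "accuracy:", best_acc)
print("F1 at best threshold:", f1_score(labels, (sims >= best_t).astype(int)))
print("AP:", average_precision_score(labels, sims))
```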