🚀 jina-embeddings-v2-base-zh
jina-embeddings-v2-base-zh is a bilingual (Chinese/English) sentence-embedding model that can be used for feature extraction, sentence-similarity computation, and related downstream tasks. It has been evaluated on a broad range of MTEB/C-MTEB datasets; the detailed results are listed below.
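A minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as `jinaai/jina-embeddings-v2-base-zh` and that `trust_remote_code=True` is required for its custom architecture:

```python
# Minimal sketch: encode sentences and compare them with cosine similarity.
# The Hub model id and the trust_remote_code flag are assumptions about the deployment.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

sentences = ["How is the weather today?", "今天天气怎么样?"]
embeddings = model.encode(sentences)            # feature extraction
print(cos_sim(embeddings[0], embeddings[1]))    # sentence similarity
```

All of the benchmark tasks below build on this same encode-then-compare pattern; only the downstream scorer changes.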
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Name | jina-embeddings-v2-base-zh |
| Tags | sentence-transformers, feature-extraction, sentence-similarity, mteb, transformers, transformers.js |
| Inference | false |
| License | apache-2.0 |
| Languages Supported | en, zh |
Performance Results
The model has been evaluated on a range of tasks and datasets; the detailed results are listed below. After each task group, a short hedged code sketch illustrates how its metrics are typically computed.
1. STS (Semantic Textual Similarity) Tasks
- C-MTEB/AFQMC (Validation Split)
- cos_sim_pearson: 48.51403119231363
- cos_sim_spearman: 50.5928547846445
- euclidean_pearson: 48.750436310559074
- euclidean_spearman: 50.50950238691385
- manhattan_pearson: 48.7866189440328
- manhattan_spearman: 50.58692402017165
- C-MTEB/ATEC (Test Split)
- cos_sim_pearson: 50.25985700105725
- cos_sim_spearman: 51.28815934593989
- euclidean_pearson: 52.70329248799904
- euclidean_spearman: 50.94101139559258
- manhattan_pearson: 52.6647237400892
- manhattan_spearman: 50.922441325406176
- C-MTEB/BQ (Test Split)
- cos_sim_pearson: 65.15667035488342
- cos_sim_spearman: 66.07110142081
- euclidean_pearson: 60.447598102249714
- euclidean_spearman: 61.826575796578766
- manhattan_pearson: 60.39364279354984
- manhattan_spearman: 61.78743491223281
- mteb/sts22-crosslingual-sts (Test Split, zh Config)
- cos_sim_pearson: 66.54931957553592
- cos_sim_spearman: 69.25068863016632
- euclidean_pearson: 50.26525596106869
- euclidean_spearman: 63.83352741910006
- manhattan_pearson: 49.98798282198196
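The cos_sim_pearson and cos_sim_spearman figures are correlation coefficients between the model's cosine similarities and human-annotated similarity ratings. A sketch of that computation follows; the sentence pairs and gold ratings are illustrative placeholders, not the benchmark data:

```python
# Sketch of the STS protocol: correlate predicted cosine similarities with gold ratings.
# Pairs and gold scores below are placeholders only.
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

pairs = [("今天天气很好", "今天天气不错"),
         ("我喜欢猫", "股票市场下跌了"),
         ("他正在跑步", "他在慢跑")]
gold = [4.8, 0.3, 4.2]  # human similarity ratings (placeholders)

emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
pred = [float(cos_sim(a, b)) for a, b in zip(emb_a, emb_b)]

print("cos_sim_pearson: ", pearsonr(pred, gold)[0])
print("cos_sim_spearman:", spearmanr(pred, gold)[0])
```

The euclidean_* and manhattan_* variants report the same correlations with (negated) Euclidean or Manhattan distances in place of cosine similarity.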
2. Classification Tasks
- mteb/amazon_reviews_multi (Test Split, zh Config)
- accuracy: 34.944
- f1: 34.06478860660109
- C-MTEB/IFlyTek-classification (Validation Split)
- accuracy: 47.36437091188918
- f1: 36.60946954228577
- C-MTEB/JDReview-classification (Test Split)
- accuracy: 79.5684803001876
- ap: 42.671935929201524
- f1: 73.31912729103752
- mteb/amazon_massive_intent (Test Split, zh-CN Config)
- accuracy: 68.1977135171486
- f1: 67.23114308718404
- mteb/amazon_massive_scenario (Test Split, zh-CN Config)
- accuracy: 71.92669804976462
- f1: 72.90628475628779
- C-MTEB/MultilingualSentiment-classification (Validation Split)
- accuracy: 63.29333333333334
- f1: 63.03293854259612
- C-MTEB/OnlineShopping-classification (Test Split)
- accuracy: 87.00000000000001
- ap: 83.24372135949511
- f1: 86.95554191530607
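The classification accuracy/f1 scores come from training a lightweight classifier on frozen embeddings (MTEB typically uses a logistic-regression probe). A minimal sketch with placeholder texts and labels:

```python
# Sketch: fit a linear probe on frozen embeddings, then report accuracy / f1.
# Texts and labels are illustrative placeholders only.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

train_texts, train_labels = ["物流很快，很满意", "质量太差了", "包装精美，还会再买"], [1, 0, 1]
test_texts, test_labels = ["非常好用", "完全不推荐"], [1, 0]

clf = LogisticRegression(max_iter=1000).fit(model.encode(train_texts), train_labels)
pred = clf.predict(model.encode(test_texts))

print("accuracy:", accuracy_score(test_labels, pred))
print("f1:      ", f1_score(test_labels, pred, average="macro"))
```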
3. Clustering Tasks
- C-MTEB/CLSClusteringP2P (Test Split)
- v_measure: 39.96714175391701
- C-MTEB/CLSClusteringS2S (Test Split)
- v_measure: 38.39863566717934
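The v_measure scores compare unsupervised clusters of the embeddings against gold category labels. A sketch under the assumption of a simple k-means clusterer, with placeholder data:

```python
# Sketch: cluster the embeddings with k-means and score the clusters with V-measure.
# Texts and labels are illustrative placeholders only.
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

texts = ["国足比赛失利", "NBA总决赛今晚打响", "央行宣布降息", "股市收盘大涨"]
gold = [0, 0, 1, 1]  # sports vs. finance (placeholders)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(model.encode(texts))
print("v_measure:", v_measure_score(gold, pred))
```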
4. Reranking Tasks
- C-MTEB/CMedQAv1-reranking (Test Split)
- map: 83.63680381780644
- mrr: 86.16476190476192
- C-MTEB/CMedQAv2-reranking (Test Split)
- map: 83.74350667859487
- mrr: 86.10388888888889
- C-MTEB/Mmarco-reranking (Dev Split)
- map: 31.5372713650176
- mrr: 30.163095238095238
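Reranking sorts candidate passages by similarity to the query; map and mrr then score how highly the relevant candidates end up ranked. A sketch of the per-query computation, with placeholder candidates and relevance labels:

```python
# Sketch: rerank candidates by cosine similarity; mrr averages the reciprocal rank
# of the first relevant hit over all queries. Placeholder data only.
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

query = "婴儿发烧怎么办"
candidates = ["宝宝发热的家庭护理方法", "汽车保养的常见误区", "小儿退烧药使用注意事项"]
relevant = {0, 2}  # indices of candidates labeled relevant (placeholders)

scores = cos_sim(model.encode([query]), model.encode(candidates))[0]
ranking = np.argsort(-scores.numpy())  # best candidate first

first_hit = next(rank for rank, idx in enumerate(ranking, start=1) if int(idx) in relevant)
print("reciprocal rank:", 1.0 / first_hit)
```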
5. Retrieval Tasks
- C-MTEB/CmedqaRetrieval (Dev Split)
- map_at_1: 22.072
- map_at_10: 32.942
- map_at_100: 34.768
- map_at_1000: 34.902
- map_at_3: 29.357
- map_at_5: 31.236000000000004
- mrr_at_1: 34.259
- mrr_at_10: 41.957
- mrr_at_100: 42.982
- mrr_at_1000: 43.042
- mrr_at_3: 39.722
- mrr_at_5: 40.898
- ndcg_at_1: 34.259
- ndcg_at_10: 39.153
- ndcg_at_100: 46.493
- ndcg_at_1000: 49.01
- ndcg_at_3: 34.636
- ndcg_at_5: 36.278
- precision_at_1: 34.259
- precision_at_10: 8.815000000000001
- precision_at_100: 1.474
- precision_at_1000: 0.179
- precision_at_3: 19.73
- precision_at_5: 14.174000000000001
- recall_at_1: 22.072
- recall_at_10: 48.484
- recall_at_100: 79.035
- recall_at_1000: 96.15
- recall_at_3: 34.607
- recall_at_5: 40.064
- C-MTEB/CovidRetrieval (Dev Split)
- map_at_1: 69.178
- map_at_10: 77.523
- map_at_100: 77.793
- map_at_1000: 77.79899999999999
- map_at_3: 75.878
- map_at_5: 76.849
- mrr_at_1: 69.44200000000001
- mrr_at_10: 77.55
- mrr_at_100: 77.819
- mrr_at_1000: 77.826
- mrr_at_3: 75.957
- mrr_at_5: 76.916
- ndcg_at_1: 69.44200000000001
- ndcg_at_10: 81.217
- ndcg_at_100: 82.45
- ndcg_at_1000: 82.636
- ndcg_at_3: 77.931
- ndcg_at_5: 79.655
- precision_at_1: 69.44200000000001
- precision_at_10: 9.357
- precision_at_100: 0.993
- precision_at_1000: 0.101
- precision_at_3: 28.1
- precision_at_5: 17.724
- recall_at_1: 69.178
- recall_at_10: 92.624
- recall_at_100: 98.209
- recall_at_1000: 99.684
- recall_at_3: 83.772
- recall_at_5: 87.882
- C-MTEB/DuRetrieval (Dev Split)
- map_at_1: 25.163999999999998
- map_at_10: 76.386
- map_at_100: 79.339
- map_at_1000: 79.39500000000001
- map_at_3: 52.959
- map_at_5: 66.59
- mrr_at_1: 87.9
- mrr_at_10: 91.682
- mrr_at_100: 91.747
- mrr_at_1000: 91.751
- mrr_at_3: 91.267
- mrr_at_5: 91.527
- ndcg_at_1: 87.9
- ndcg_at_10: 84.569
- ndcg_at_100: 87.83800000000001
- ndcg_at_1000: 88.322
- ndcg_at_3: 83.473
- ndcg_at_5: 82.178
- precision_at_1: 87.9
- precision_at_10: 40.605000000000004
- precision_at_100: 4.752
- precision_at_1000: 0.488
- precision_at_3: 74.9
- precision_at_5: 62.96000000000001
- recall_at_1: 25.163999999999998
- recall_at_10: 85.97399999999999
- recall_at_100: 96.63000000000001
- recall_at_1000: 99.016
- recall_at_3: 55.611999999999995
- recall_at_5: 71.936
- C-MTEB/EcomRetrieval (Dev Split)
- map_at_1: 48.6
- map_at_10: 58.831
- map_at_100: 59.427
- map_at_1000: 59.44199999999999
- map_at_3: 56.383
- map_at_5: 57.753
- mrr_at_1: 48.6
- mrr_at_10: 58.831
- mrr_at_100: 59.427
- mrr_at_1000: 59.44199999999999
- mrr_at_3: 56.383
- mrr_at_5: 57.753
- ndcg_at_1: 48.6
- ndcg_at_10: 63.951
- ndcg_at_100: 66.72200000000001
- ndcg_at_1000: 67.13900000000001
- ndcg_at_3: 58.882
- ndcg_at_5: 61.373
- precision_at_1: 48.6
- precision_at_10: 8.01
- precision_at_100: 0.928
- precision_at_1000: 0.096
- precision_at_3: 22.033
- precision_at_5: 14.44
- recall_at_1: 48.6
- recall_at_10: 80.10000000000001
- recall_at_100: 92.80000000000001
- recall_at_1000: 96.1
- recall_at_3: 66.10000000000001
- recall_at_5: 72.2
- C-MTEB/MedicalRetrieval (Dev Split)
- map_at_1: 49.2
- map_at_10: 54.539
- map_at_100: 55.135
- map_at_1000: 55.19199999999999
- map_at_3: 53.383
- map_at_5: 54.142999999999994
- mrr_at_1: 49.2
- mrr_at_10: 54.539
- mrr_at_100: 55.135999999999996
- mrr_at_1000: 55.19199999999999
- mrr_at_3: 53.383
- mrr_at_5: 54.142999999999994
- ndcg_at_1: 49.2
- ndcg_at_10: 57.123000000000005
- ndcg_at_100: 60.21300000000001
- ndcg_at_1000: 61.915
- ndcg_at_3: 54.772
- ndcg_at_5: 56.157999999999994
- precision_at_1: 49.2
- precision_at_10: 6.52
- precision_at_100: 0.8009999999999999
- precision_at_1000: 0.094
- precision_at_3: 19.6
- precision_at_5: 12.44
- recall_at_1: 49.2
- recall_at_10: 65.2
- recall_at_100: 80.10000000000001
- recall_at_1000: 93.89999999999999
- recall_at_3: 58.8
- recall_at_5: 62.2
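Retrieval encodes the corpus once, retrieves the top-k passages per query by cosine similarity, and scores the ranked list. A sketch showing recall_at_k with placeholder data:

```python
# Sketch: corpus encoding, top-k retrieval by cosine similarity, and recall_at_k scoring.
# Corpus, query, and relevance labels are illustrative placeholders only.
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

corpus = ["感冒的常见症状包括咳嗽和发烧", "高血压患者应注意低盐饮食", "今日股市大幅波动"]
corpus_emb = model.encode(corpus)

query = "感冒有哪些症状"
relevant_ids = {0}  # gold passage ids for this query (placeholders)

scores = cos_sim(model.encode([query]), corpus_emb)[0].numpy()
top_k = [int(i) for i in np.argsort(-scores)[:2]]  # k = 2

print("recall_at_2:", len(relevant_ids & set(top_k)) / len(relevant_ids))
```

map_at_k, mrr_at_k, ndcg_at_k, and precision_at_k score the same ranked list, differing only in how they weight rank positions.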
6. PairClassification Tasks
- C-MTEB/CMNLI (Validation Split)
- cos_sim_accuracy: 76.7047504509922
- cos_sim_ap: 85.26649874800871
- cos_sim_f1: 78.13528724646915
- cos_sim_precision: 71.57587548638132
- cos_sim_recall: 86.01823708206688
- dot_accuracy: 70.13830426939266
- dot_ap: 77.01510412382171
- dot_f1: 73.56710042713817
- dot_precision: 63.955094991364426
- dot_recall: 86.57937806873977
- euclidean_accuracy: 75.53818400481059
- euclidean_ap: 84.34668448241264
- euclidean_f1: 77.51741608613047
- euclidean_precision: 70.65614777756399
- euclidean_recall: 85.85457096095394
- manhattan_accuracy: 75.49007817197835
- manhattan_ap: 84.40297506704299
- manhattan_f1: 77.63185324160932
- manhattan_precision: 70.03949595636637
- manhattan_recall: 87.07037643207856
- max_accuracy: 76.7047504509922
- max_ap: 85.26649874800871
- max_f1: 78.13528724646915
- C-MTEB/OCNLI (Validation Split)
- cos_sim_accuracy: 75.69030860855442
- cos_sim_ap: 80.6157833772759
- cos_sim_f1: 77.87524366471735
- cos_sim_precision: 72.3076923076923
- cos_sim_recall: 84.37170010559663
- dot_accuracy: 67.78559826746074
- dot_ap: 72.00871467527499
- dot_f1: 72.58722247394654
- dot_precision: 63.57142857142857
- dot_recall: 84.58289334741288
- euclidean_accuracy: 75.20303194369248
- euclidean_ap: 80.98587256415605
- euclidean_f1: 77.26396917148362
- euclidean_precision: 71.03631532329496
- euclidean_recall: 84.68848996832101
- manhattan_accuracy: 75.20303194369248
- manhattan_ap: 80.93460699513219
- manhattan_f1: 77.124773960217
- manhattan_precision: 67.43083003952569
- manhattan_recall: 90.07391763463569
- max_accuracy: 75.69030860855442
- max_ap: 80.98587256415605
- max_f1: 77.87524366471735
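Pair classification scores each sentence pair (for example, with cosine similarity) and thresholds that score to predict whether the pair matches; ap is threshold-free, while f1/precision/recall depend on the chosen threshold. A sketch with placeholder pairs, labels, and an illustrative threshold:

```python
# Sketch: score pairs with cosine similarity, report AP and a thresholded F1.
# Pairs, labels, and the 0.75 threshold are illustrative placeholders only.
from sklearn.metrics import average_precision_score, f1_score
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

pairs = [("他在北京工作", "他的工作地点是北京"),
         ("她喜欢古典音乐", "她讨厌所有音乐"),
         ("明天会下雨", "天气预报说明天有雨")]
labels = [1, 0, 1]

emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
scores = [float(cos_sim(a, b)) for a, b in zip(emb_a, emb_b)]

print("cos_sim_ap:", average_precision_score(labels, scores))
print("cos_sim_f1:", f1_score(labels, [int(s > 0.75) for s in scores]))
```

The dot_*, euclidean_*, and manhattan_* rows repeat the evaluation with those scoring functions, and the max_* rows report the best result across them.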
📄 License
This model is released under the Apache-2.0 license.