🚀 gte-multilingual-base (dense)
This is a multilingual dense embedding model that supports a wide range of languages. It has been evaluated on tasks such as Clustering, STS, Classification, PairClassification, Reranking, Retrieval, and BitextMining, with per-dataset results listed under Model Performance below.
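A dense embedding model maps each input text to a single fixed-size vector, and downstream scores (retrieval, STS, clustering) are computed from similarities between those vectors. The sketch below illustrates the common mean-pooling step that turns per-token embeddings into one sentence vector; the 768-dimensional size, the dummy data, and the pooling scheme are illustrative assumptions, not details taken from this card:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)    # (dim,)
    counts = mask.sum()                               # number of real tokens
    return summed / np.maximum(counts, 1e-9)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 768))    # 6 tokens, hypothetical 768-dim model
mask = np.array([1, 1, 1, 1, 0, 0])   # last two positions are padding
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec.shape)             # (768,)
```

With real text you would obtain `token_embeddings` and `attention_mask` from the model's tokenizer and encoder; the pooled vector is then compared to other sentence vectors via cosine similarity.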
📄 License
The model is licensed under the Apache-2.0 license.
📚 Documentation
Supported Languages
The model supports the following languages:
- af, ar, az, be, bg, bn, ca, ceb, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ky, lo, lt, lv, mk, ml, mn, mr, ms, my, ne, nl, no, pa, pl, pt, qu, ro, ru, si, sk, sl, so, sq, sr, sv, sw, ta, te, th, tl, tr, uk, ur, vi, yo, zh
Model Performance
The following table shows the performance of the gte-multilingual-base (dense)
model on different tasks and datasets:
| Task Type | Dataset Name | Metric Type | Metric Value |
| --- | --- | --- | --- |
| Clustering | PL-MTEB/8tags-clustering | v_measure | 33.67 |
| STS | C-MTEB/AFQMC | cos_sim_spearman | 43.55 |
| STS | C-MTEB/ATEC | cos_sim_spearman | 48.91 |
| Classification | PL-MTEB/allegro-reviews | accuracy | 41.69 |
| Clustering | lyon-nlp/alloprof (AlloProfClusteringP2P) | v_measure | 54.20 |
| Clustering | lyon-nlp/alloprof (AlloProfClusteringS2S) | v_measure | 44.34 |
| Reranking | lyon-nlp/mteb-fr-reranking-alloprof-s2p | map | 64.91 |
| Retrieval | lyon-nlp/alloprof | ndcg_at_10 | 53.64 |
| Classification | mteb/amazon_counterfactual | accuracy | 75.96 |
| Classification | mteb/amazon_polarity | accuracy | 80.72 |
| Classification | mteb/amazon_reviews_multi (en) | accuracy | 43.64 |
| Classification | mteb/amazon_reviews_multi (de) | accuracy | 40.11 |
| Classification | mteb/amazon_reviews_multi (es) | accuracy | 40.17 |
| Classification | mteb/amazon_reviews_multi (fr) | accuracy | 39.57 |
| Classification | mteb/amazon_reviews_multi (ja) | accuracy | 35.75 |
| Classification | mteb/amazon_reviews_multi (zh) | accuracy | 33.34 |
| Retrieval | mteb/arguana | ndcg_at_10 | 58.23 |
| Retrieval | clarin-knext/arguana-pl | ndcg_at_10 | 53.17 |
| Clustering | mteb/arxiv-clustering-p2p | v_measure | 46.02 |
| Clustering | mteb/arxiv-clustering-s2s | v_measure | 41.07 |
| Reranking | mteb/askubuntudupquestions-reranking | map | 61.88 |
| STS | mteb/biosses-sts | cos_sim_spearman | 81.21 |
| STS | C-MTEB/BQ | cos_sim_spearman | 51.72 |
| Retrieval | maastrichtlawtech/bsard | ndcg_at_10 | 26.12 |
| BitextMining | mteb/bucc-bitext-mining (de-en) | f1 | 98.62 |
| BitextMining | mteb/bucc-bitext-mining (fr-en) | f1 | 97.90 |
| BitextMining | mteb/bucc-bitext-mining (ru-en) | f1 | 97.12 |
| BitextMining | mteb/bucc-bitext-mining (zh-en) | f1 | 98.16 |
| Classification | mteb/banking77 | accuracy | 85.36 |
| Clustering | mteb/biorxiv-clustering-p2p | v_measure | 37.59 |
| Clustering | mteb/biorxiv-clustering-s2s | v_measure | 34.21 |
| Classification | PL-MTEB/cbd | accuracy | 62.52 |
| PairClassification | PL-MTEB/cdsce-pairclassification | cos_sim_ap | 74.90 |
| STS | PL-MTEB/cdscr-sts | cos_sim_spearman | 90.31 |
| Clustering | C-MTEB/CLSClusteringP2P | v_measure | 37.95 |
| Clustering | C-MTEB/CLSClusteringS2S | v_measure | 38.12 |
| Reranking | C-MTEB/CMedQAv1-reranking | map | 86.11 |
| Reranking | C-MTEB/CMedQAv2-reranking | map | 87.28 |
| Retrieval | mteb/cqadupstack-android | ndcg_at_10 | 47.10 |
| Retrieval | mteb/cqadupstack-english | ndcg_at_10 | 45.97 |
| Retrieval | mteb/cqadupstack-gaming | ndcg_at_10 | 55.61 |
| Retrieval | mteb/cqadupstack-gis | ndcg_at_10 | 36.64 |
| Retrieval | mteb/cqadupstack-mathematica | ndcg_at_10 | 30.71 |
| Retrieval | mteb/cqadupstack-physics | ndcg_at_10 | 44.52 |
| Retrieval | mteb/cqadupstack-programmers | ndcg_at_10 | 37.94 |
| Retrieval | mteb/cqadupstack | ... | ... |
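The metric names above follow MTEB conventions: `cos_sim_spearman` is the Spearman rank correlation between cosine similarities of embedding pairs and human similarity scores, and `ndcg_at_10` is normalized discounted cumulative gain over the top-10 retrieved documents. A rough, illustrative sketch of both on toy data (not MTEB's exact implementation; for instance, MTEB's Spearman handles ties, which this version does not):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(x, y):
    """Spearman rank correlation (no tie handling, for illustration)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def ndcg_at_k(relevances, k=10):
    """NDCG@k given graded relevances of the ranked result list."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# A perfect ranking (relevances already in descending order) scores 1.0;
# any inversion lowers the score.
print(ndcg_at_k([3, 2, 1, 0]))              # 1.0
print(round(ndcg_at_k([3, 0, 2], k=10), 3))
```

In the reported numbers, these per-example values are averaged over the dataset and multiplied by 100.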