text2vec-base-multilingual Open Source Model - Supports Multilingual Sentence Similarity Calculation and Feature Extraction

Text2vec Base Multilingual

Developed by shibing624

A multilingual sentence embedding model supporting Chinese, English, German, French, and other languages, focusing on sentence similarity calculation and feature extraction tasks.

Text Embedding

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multilingual Text Vectorization #Sentence Similarity Calculation #Cross-Language Semantic Matching

Downloads 128.13k

Release Time: 6/22/2023

Model Overview

This model is based on the Sentence-Transformers framework and trained on multilingual natural language inference datasets. It can convert text into high-quality vector representations, suitable for cross-language semantic similarity calculation and information retrieval tasks.
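
As a minimal sketch of the embed-then-compare workflow described above: the Hub identifier below is an assumption assembled from the developer and model names on this page, and `embed_sentences` and `cosine_similarity` are illustrative helper names, not part of any library. The model call uses the standard `SentenceTransformer(...).encode(...)` interface; the similarity function is plain Python so it runs without downloading anything.

```python
import math
from typing import List, Sequence

# Assumed Hub identifier, built from the developer and model names on this page.
MODEL_ID = "shibing624/text2vec-base-multilingual"


def embed_sentences(sentences: Sequence[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Encode sentences with sentence-transformers (downloads the model on first use)."""
    # Imported lazily so the similarity helper works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    return model.encode(list(sentences)).tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors stand in for real embeddings so the example runs offline.
    print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

With the library installed, the two pieces would be combined along the lines of `a, b = embed_sentences(["How are you?", "Wie geht es dir?"])` followed by `cosine_similarity(a, b)`.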

Model Features

Multilingual Support

Produces text embeddings for Chinese, English, German, French, Italian, Dutch, Portuguese, Polish, and Russian, the languages listed in the model metadata.

High-Performance Sentence Similarity Calculation

Evaluated on MTEB benchmarks, with reported scores such as 78.08% accuracy on Banking77Classification and a cosine-similarity Spearman correlation of 66.16 on BIOSSES.

Pre-trained Model

Pre-trained on large-scale multilingual datasets, ready to use out of the box.

Model Capabilities

Sentence similarity calculation

Text feature extraction

Cross-language semantic retrieval

Text classification

Clustering analysis

Use Cases

Information Retrieval

Cross-Language Document Retrieval

Embeds documents from different languages into a shared vector space, so that similar documents can be retrieved across languages.
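
Retrieval over a shared vector space can be sketched as a ranking by cosine similarity. The example below assumes the query and document embeddings have already been computed by the model; the toy two-dimensional vectors and the helper names `normalize` and `top_k` are illustrative only. A brute-force scan like this is a simple stand-in for the approximate-nearest-neighbour index a larger corpus would likely need.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def top_k(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, cosine similarity) pairs, best match first."""
    q = normalize(query)
    scored = []
    for idx, doc in enumerate(docs):
        d = normalize(doc)
        # For unit vectors the dot product equals the cosine similarity.
        scored.append((idx, sum(x * y for x, y in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical document embeddings; in practice these come from the model.
    documents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in top_k([1.0, 0.1], documents, k=2):
        print(index, round(score, 3))
```

Because every language is embedded into the same space, the query and the documents do not need to share a language for this ranking to apply.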

Text Classification

Multilingual Sentiment Analysis

Classifies the sentiment of multilingual text using sentence embeddings as input features.

Achieves 43.35% accuracy on MTEB EmotionClassification.
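
To illustrate classification on top of sentence embeddings, here is a small sketch that uses a nearest-centroid rule. This is a deliberately simple stand-in, not the procedure behind the score above, and the toy vectors, labels, and helper names (`fit_centroids`, `predict`) are hypothetical. Real inputs would be the model's embeddings of labelled training sentences.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def fit_centroids(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> Dict[str, List[float]]:
    """Average the embedding vectors of each class into one centroid per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))


if __name__ == "__main__":
    # Toy 2-D "embeddings" so the example runs without the model.
    train_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
    train_labels = ["positive", "positive", "negative", "negative"]
    centroids = fit_centroids(train_vectors, train_labels)
    print(predict([0.7, 0.3], centroids))
```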

Clustering Analysis

Academic Paper Clustering

Performs topic clustering on arXiv papers.

Achieves a V-measure of 32.32 on MTEB ArxivClusteringP2P.
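
For readers unfamiliar with the metric, V-measure is the harmonic mean of homogeneity (each cluster contains a single class) and completeness (each class falls in a single cluster). The sketch below computes it from scratch on a 0 to 1 scale; the score on this page appears to be the same quantity multiplied by 100, which is an assumption. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def _entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target: Sequence[Hashable], given: Sequence[Hashable]) -> float:
    """H(target | given), estimated from co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(true_labels: Sequence[Hashable], cluster_ids: Sequence[Hashable]) -> float:
    """Harmonic mean of homogeneity and completeness, in the range 0 to 1."""
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    h_classes = _entropy(true_labels)
    h_clusters = _entropy(cluster_ids)
    # By convention each component is 1.0 when its normalising entropy is zero.
    homogeneity = (
        1.0 if h_classes == 0.0
        else 1.0 - _conditional_entropy(true_labels, cluster_ids) / h_classes
    )
    completeness = (
        1.0 if h_clusters == 0.0
        else 1.0 - _conditional_entropy(cluster_ids, true_labels) / h_clusters
    )
    if homogeneity + completeness == 0.0:
        return 0.0
    return 2.0 * homogeneity * completeness / (homogeneity + completeness)


if __name__ == "__main__":
    # Cluster ids are arbitrary names, so a relabelled perfect clustering still scores 1.
    print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
```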

🚀 text2vec-base-multilingual

This is a multilingual model for sentence similarity tasks. It has been evaluated on a range of MTEB datasets covering classification, clustering, reranking, and STS; the results are listed under Model Index.

📚 Documentation

Model Information

Property	Details
Pipeline Tag	sentence-similarity
License	apache-2.0
Library Name	sentence-transformers
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers, text2vec, mteb
Datasets	shibing624/nli-zh-all
Languages	zh, en, de, fr, it, nl, pt, pl, ru
Metrics	spearmanr

Model Index

Name: text2vec-base-multilingual
Results:
- Classification Tasks:
  - MTEB AmazonCounterfactualClassification (en):
    - Accuracy: 70.97014925373134
    - AP: 33.95151328318672
    - F1: 65.14740155705596
  - MTEB AmazonCounterfactualClassification (de):
    - Accuracy: 68.69379014989293
    - AP: 79.68277579733802
    - F1: 66.54960052336921
  - MTEB AmazonPolarityClassification:
    - Accuracy: 66.103375
    - AP: 61.10087197664471
    - F1: 65.75198509894145
  - MTEB AmazonReviewsClassification (en):
    - Accuracy: 33.134
    - F1: 32.7905397597083
  - MTEB AmazonReviewsClassification (de):
    - Accuracy: 33.388
    - F1: 33.190561196873084
  - MTEB AmazonReviewsClassification (es):
    - Accuracy: 34.824
    - F1: 34.297290157740726
  - MTEB AmazonReviewsClassification (fr):
    - Accuracy: 33.45
    - F1: 33.08017234412433
  - MTEB AmazonReviewsClassification (ja):
    - Accuracy: 30.046
    - F1: 29.857141661482228
  - MTEB AmazonReviewsClassification (zh):
    - Accuracy: 32.522
    - F1: 31.854699911472174
  - MTEB Banking77Classification:
    - Accuracy: 78.08441558441558
    - F1: 77.99825264827898
  - MTEB EmotionClassification:
    - Accuracy: 43.35
    - F1: 38.80269436557695
  - MTEB ImdbClassification:
    - Accuracy: 59.348
    - AP: 55.75065220262251
    - F1: 58.72117519082607
  - MTEB MTOPDomainClassification (en):
    - Accuracy: 81.04879160966712
    - F1: 80.86889779192701
  - MTEB MTOPDomainClassification (de):
    - Accuracy: 78.59397013243168
    - F1: 77.09902761555972
  - MTEB MTOPDomainClassification (es):
    - Accuracy: 79.24282855236824
    - F1: 78.75883867079015
  - MTEB MTOPDomainClassification (fr):
    - Accuracy: 76.16661446915127
    - F1: 76.30204722831901
  - MTEB MTOPDomainClassification (hi):
    - Accuracy: 78.74506991753317
    - F1: 77.50560442779701
  - MTEB MTOPDomainClassification (th):
    - Accuracy: 77.67088607594937
    - F1: 77.21442956887493
  - MTEB MTOPIntentClassification (en):
    - Accuracy: 62.786137710898316
    - F1: 46.23474201126368
  - MTEB MTOPIntentClassification (de):
    - Accuracy: 55.285996055226825
    - F1: 37.98039513682919
  - MTEB MTOPIntentClassification (es):
    - Accuracy: 58.67911941294196
    - F1: 40.541410807124954
  - MTEB MTOPIntentClassification (fr):
    - Accuracy: 53.257124960851854
    - F1: 38.42982319259366
  - MTEB MTOPIntentClassification (hi):
    - Accuracy: 59.62352097525995
    - F1: 41.28886486568534
  - MTEB MTOPIntentClassification (th):
    - Accuracy: 58.799276672694404
    - F1: 43.68379466247341
  - MTEB MassiveIntentClassification (af):
    - Accuracy: 45.42030934767989
    - F1: 44.12201543566376
  - MTEB MassiveIntentClassification (am):
    - Accuracy: 37.67652992602556
    - F1: 35.422091900843164
  - MTEB MassiveIntentClassification (ar):
    - Accuracy: 45.02353732347007
    - F1: 41.852484084738194
  - MTEB MassiveIntentClassification (az):
    - Accuracy: 48.70880968392737
    - F1: 46.904360615435046
  - MTEB MassiveIntentClassification (bn):
    - Accuracy: 43.78950907868191
- Clustering Tasks:
  - MTEB ArxivClusteringP2P:
    - V-Measure: 32.31918856561886
  - MTEB ArxivClusteringS2S:
    - V-Measure: 25.503481615956137
  - MTEB BiorxivClusteringP2P:
    - V-Measure: 28.98583420521256
  - MTEB BiorxivClusteringS2S:
    - V-Measure: 23.195091778460892
- Reranking Task:
  - MTEB AskUbuntuDupQuestions:
    - MAP: 57.91471462820568
    - MRR: 71.82990370663501
- STS Task:
  - MTEB BIOSSES:
    - Cosine Similarity Pearson: 68.83853315193127
    - Cosine Similarity Spearman: 66.16174850417771
    - Euclidean Pearson: 56.65313897263153
    - Euclidean Spearman: 52.69156205876939
    - Manhattan Pearson: 56.97282154658304
    - Manhattan Spearman: 53.167476517261015
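
The STS figures above pair a distance or similarity measure with a correlation coefficient. As a sketch of how a Spearman score relates model similarities to human ratings, the code below ranks both lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. It returns values between -1 and 1; the numbers on this page appear to be scaled by 100, which is an assumption. The function names and sample values are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical model similarities versus human ratings for four sentence pairs.
    model_scores = [0.91, 0.35, 0.62, 0.10]
    human_scores = [4.8, 2.0, 3.1, 0.5]
    print(spearman(model_scores, human_scores))
```

Because only the ordering matters, Spearman rewards a model that ranks sentence pairs in the same order as the human ratings, even if its raw similarity values are on a different scale.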

📄 License

This project is licensed under the Apache-2.0 license.
