GreenNode-Embedding-Large-VN-V1 Open-source Model - Optimized for Vietnamese, Aiding Semantic Similarity and Retrieval Tasks

GreenNode Embedding Large VN V1

Developed by GreenNode

This is a sentence embedding model optimized for Vietnamese, capable of converting text into 1024-dimensional vectors, suitable for semantic similarity and retrieval tasks.

Text Embedding

Safetensors

Tags: Vietnamese semantic retrieval, High-dimensional vector embedding, Tabular data optimization

Downloads 785

Release Time: 4/11/2025

Model Overview

A sentence embedding model based on the XLM-RoBERTa architecture, specifically optimized for Vietnamese text, supporting tasks such as semantic similarity calculation, text retrieval, and clustering.

Model Features

Vietnamese optimization

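Trained specifically on Vietnamese text. In the evaluation tables below, the best GreenNode variant has the highest mean score on each of the three Vietnamese retrieval benchmarks, ahead of the general multilingual baselines.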

Long text support

Supports sequences up to 8192 tokens, suitable for processing longer documents.

High-performance retrieval

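Strongest on tabular retrieval: M3-GN-VN reaches a mean score of 46.23 on GreenNodeTableRetrieval, versus 40.80 for the best baseline (me5_large). Gains on the legal and question-answering benchmarks are smaller.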

Model Capabilities

Semantic text similarity calculation

Semantic search

Text clustering

Text classification

Paraphrase mining
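
As an illustration of the paraphrase mining capability listed above, the sketch below scores every pair of sentence vectors and keeps the pairs above a similarity threshold. The 2-dimensional vectors are toy stand-ins for real embeddings, the threshold of 0.9 is arbitrary, and `dot` and `mine_paraphrases` are hypothetical helper names written in plain Python. For real use, the Sentence Transformers library provides `sentence_transformers.util.paraphrase_mining`.

```python
from itertools import combinations

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mine_paraphrases(vectors, threshold=0.9):
    """Return (score, i, j) for every pair whose similarity meets the threshold.

    Assumes the vectors are already unit length, so the dot product
    equals cosine similarity.
    """
    pairs = [
        (dot(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(len(vectors)), 2)
    ]
    # Highest-scoring pairs first.
    return sorted((p for p in pairs if p[0] >= threshold), reverse=True)

# Toy unit-length vectors standing in for sentence embeddings.
vectors = [
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [0.6, 0.8],
]

found = mine_paraphrases(vectors, threshold=0.9)
for score, i, j in found:
    print(i, j, round(score, 2))
```

Comparing all pairs is quadratic in the number of sentences, so this form only suits small collections.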

Use Cases

Information retrieval

Legal document retrieval

Quickly find relevant documents in legal text databases

Mean score of 74.95 (average of MAP@5, MRR@5, NDCG@5, and Recall@5) on the ZacLegalTextRetrieval benchmark, reported for M3-GN-VN-Mixed

Tabular data retrieval

Retrieve relevant information from structured tabular data

Mean score of 46.23 (average of MAP@5, MRR@5, NDCG@5, and Recall@5) on the GreenNodeTableRetrieval benchmark, reported for M3-GN-VN

Question answering systems

Vietnamese question answering

Build retrieval components for Vietnamese question answering systems

Mean score of 56.86 (average of MAP@5, MRR@5, NDCG@5, and Recall@5) on the VieQuADRetrieval benchmark (dataset: taidng/UIT-ViQuAD2.0), reported for M3-GN-VN-Mixed

🚀 SentenceTransformer

This is a sentence-transformers model. It maps sentences and paragraphs to a 1024-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

✨ Features

Maps sentences and paragraphs to a 1024-dimensional dense vector space.
Applicable to natural language processing tasks such as semantic textual similarity and semantic search.

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub.
# "sentence_transformers_model_id" is a placeholder: replace it with this model's Hub ID.
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
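
The similarity scores above can also drive a simple semantic search: embed a query and a corpus, then rank the corpus entries by cosine similarity to the query. The sketch below shows only the ranking step, in plain Python. The 3-dimensional vectors are toy stand-ins for the 1024-dimensional output of `model.encode`, and `cosine` and `top_k` are hypothetical helper names, not part of the Sentence Transformers API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k most similar corpus vectors."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for model.encode(...) output (real vectors have 1024 dims).
corpus_vecs = [
    [0.9, 0.1, 0.0],  # e.g. a sentence about the weather
    [0.8, 0.2, 0.1],  # e.g. another sentence about the weather
    [0.0, 0.1, 0.9],  # e.g. a sentence about driving
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, corpus_vecs, k=2)
for index, score in results:
    print(index, round(score, 3))
```

With real embeddings, `model.similarity` computes the same scores for the whole corpus at once, so only the sorting step would be needed.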

📚 Documentation

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Maximum Sequence Length	8192 tokens
Output Dimensionality	1024 dimensions
Similarity Function	Cosine Similarity
Training Dataset	GreenNode/GreenNode-Table-Markdown-Retrieval
Language	Vietnamese
License	cc-by-4.0

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
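
The last two modules in the architecture above can be sketched in a few lines: with `pooling_mode_cls_token` set to `True`, the sentence vector is the encoder output at the first (CLS) position, and `Normalize()` scales it to unit length. This is a minimal illustration in plain Python on a toy 3-token, 4-dimensional input, not the library's implementation. `cls_pool` and `l2_normalize` are hypothetical helper names.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector as the sentence vector."""
    return token_embeddings[0]

def l2_normalize(vec):
    """Scale a vector to unit length, as a normalization layer would."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy encoder output for one sentence: 3 tokens x 4 dimensions
# (the real model produces 1024 dimensions per token).
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # position 0: the CLS token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 2.0, 0.0],
]

sentence_vec = l2_normalize(cls_pool(token_embeddings))
print(sentence_vec)
```

Because the output vectors have unit length, the cosine similarity between two embeddings equals their dot product.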

Evaluation

Table: Performance comparison of various models on GreenNodeTableRetrieval

Dataset: GreenNode/GreenNode-Table-Markdown-Retrieval

Model Name	MAP@5 ↑	MRR@5 ↑	NDCG@5 ↑	Recall@5 ↑	Mean ↑
Multilingual Embedding models
me5_small	33.75	33.75	35.68	41.49	36.17
me5_large	38.16	38.16	40.27	46.62	40.80
M3-Embedding	36.52	36.52	38.60	44.84	39.12
OpenAI-embedding-v3	30.61	30.61	32.57	38.46	33.06
Vietnamese Embedding models (Prior Work)
halong-embedding	32.15	32.15	34.13	40.09	34.63
sup-SimCSE-VietNamese-phobert_base	10.90	10.90	12.03	15.41	12.31
vietnamese-bi-encoder	13.61	13.61	14.63	17.68	14.89
GreenNode-Embedding (Our Work)
M3-GN-VN	41.85	41.85	44.15	57.05	46.23
M3-GN-VN-Mixed	42.08	42.08	44.33	51.06	44.89
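
For readers unfamiliar with the column headers, the sketch below computes the four metrics for a single query under binary relevance, then averages them as in the Mean column. It follows one common set of definitions (AP@k normalized by min(k, number of relevant documents)), and the source does not state which exact formulas were used, so treat it as illustrative. The document IDs and the helper name `metrics_at_k` are hypothetical. Benchmark figures are these per-query values averaged over all queries.

```python
import math

def metrics_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-relevance AP, RR, NDCG and Recall at cutoff k for one query.

    Assumes relevant_ids is non-empty.
    """
    top = ranked_ids[:k]
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in top]

    # Recall@k: fraction of the relevant documents that appear in the top k.
    recall = sum(hits) / len(relevant_ids)

    # RR@k: reciprocal rank of the first relevant hit, or 0 if there is none.
    rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)

    # AP@k: precision at each relevant position, normalized by min(k, #relevant).
    found, precision_sum = 0, 0.0
    for i, h in enumerate(hits):
        if h:
            found += 1
            precision_sum += found / (i + 1)
    ap = precision_sum / min(k, len(relevant_ids))

    # NDCG@k: discounted gain divided by the gain of an ideal ranking.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    ndcg = dcg / ideal

    return {"MAP": ap, "MRR": rr, "NDCG": ndcg, "Recall": recall}

# One query whose single relevant document is ranked second.
scores = metrics_at_k(["d7", "d2", "d9", "d4", "d1"], {"d2"}, k=5)
scores["Mean"] = sum(scores.values()) / 4
print(scores)
```

In the GreenNodeTableRetrieval table, MAP@5 and MRR@5 coincide for every model. Under these definitions that happens when each query has exactly one relevant document. Whether the dataset is built that way is an assumption here.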

Table: Performance comparison of various models on ZacLegalTextRetrieval

Dataset: GreenNode/zalo-ai-legal-text-retrieval-vn

Model Name	MAP@5 ↑	MRR@5 ↑	NDCG@5 ↑	Recall@5 ↑	Mean ↑
Multilingual Embedding models
me5_small	54.68	54.37	58.32	69.16	59.13
me5_large	60.14	59.62	64.17	76.02	64.99
M3-Embedding	69.34	68.96	73.70	86.68	74.67
OpenAI-embedding-v3	38.68	38.80	41.53	49.94	41.74
Vietnamese Embedding models (Prior Work)
halong-embedding	52.57	52.28	56.64	68.72	57.55
sup-SimCSE-VietNamese-phobert_base	25.15	25.07	27.81	35.79	28.46
vietnamese-bi-encoder	54.88	54.47	59.10	79.51	61.99
GreenNode-Embedding (Our Work)
M3-GN-VN	65.03	64.80	69.19	81.66	70.17
M3-GN-VN-Mixed	69.75	69.28	74.01	86.74	74.95

Table: Performance comparison of various models on VieQuADRetrieval

Dataset: taidng/UIT-ViQuAD2.0

Model Name	MAP@5 ↑	MRR@5 ↑	NDCG@5 ↑	Recall@5 ↑	Mean ↑
Multilingual Embedding models
me5_small	40.42	69.21	50.05	50.71	52.60
me5_large	44.18	67.81	53.04	55.86	55.22
M3-Embedding	44.08	72.28	54.07	56.01	56.61
OpenAI-embedding-v3	32.39	53.97	40.48	43.02	42.47
Vietnamese Embedding models (Prior Work)
halong-embedding	39.42	62.31	48.63	52.73	50.77
sup-SimCSE-VietNamese-phobert_base	20.45	35.99	26.73	29.59	28.19
vietnamese-bi-encoder	31.89	54.62	40.26	42.53	42.33
GreenNode-Embedding (Our Work)
M3-GN-VN	42.85	71.98	52.90	54.25	55.50
M3-GN-VN-Mixed	44.20	72.64	54.30	56.30	56.86

Table: Performance comparison of various models on GreenNodeTableRetrieval (Hit Rate)

Model Name	Hit Rate@1 ↑	Hit Rate@5 ↑	Hit Rate@10 ↑	Hit Rate@20 ↑
Multilingual Embedding models
me5_small	38.99	53.37	59.28	65.09
me5_large	43.99	59.74	65.74	71.59
bge-m3	42.15	57.00	63.05	68.96
OpenAI-embedding-v3	-	-	-	-
Vietnamese Embedding models (Prior Work)
halong-embedding	37.22	52.49	58.57	64.64
sup-SimCSE-VietNamese-phobert_base	14.00	24.74	30.32	36.44
vietnamese-bi-encoder	16.89	25.94	30.50	35.70
GreenNode-Embedding (Our Work)
M3-GN-VN	48.31	64.60	70.83	76.46
M3-GN-VN-Mixed	47.94	64.24	70.43	76.14
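
Hit Rate@k is not defined in the source. A common reading is the share of queries for which the correct table appears anywhere in the top k results. The sketch below uses that reading and assumes one gold table per query. The query and table IDs and the helper name `hit_rate_at_k` are hypothetical.

```python
def hit_rate_at_k(rankings, gold, k):
    """Share of queries whose gold document appears among the top-k results."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)

# Ranked retrieval results per query (best match first).
rankings = {
    "q1": ["t3", "t1", "t8"],
    "q2": ["t5", "t2", "t4"],
    "q3": ["t9", "t6", "t7"],
    "q4": ["t2", "t5", "t1"],
}
# The single correct table for each query.
gold = {"q1": "t3", "q2": "t4", "q3": "t0", "q4": "t5"}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(rankings, gold, k))
```

Hit rate can only stay the same or rise as k grows, which matches the left-to-right increase in every row of the table above.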

Framework Versions

Python: 3.10.14
Sentence Transformers: 3.0.1
Transformers: 4.42.4
PyTorch: 2.3.1
Accelerate: 0.33.0
Datasets: 2.20.0
Tokenizers: 0.19.1

📄 License

This model is licensed under cc-by-4.0.

