Open-source model sup-SimCSE-VietNamese-phobert-base: Vietnamese sentence embeddings, usable with both unlabeled and labeled data

Sup SimCSE VietNamese Phobert Base

Developed by VoVanPhuc

SimeCSE_Vietnamese is a Vietnamese sentence embedding model based on SimCSE, using PhoBERT as the pretrained language model, suitable for both unlabeled and labeled data.

Text Embedding

Transformers

Other #Vietnamese Sentence Embedding #Contrastive Learning #PhoBERT Pretraining

Downloads 25.51k

Release Time : 3/2/2022

Model Overview

SimeCSE_Vietnamese is a model for Vietnamese sentence embedding, optimized through contrastive learning during pretraining to generate high-quality sentence vector representations.

Model Features

Contrastive Learning Based on SimCSE

Adopts SimCSE's contrastive learning method to optimize the pretraining process and improve the quality of sentence embeddings.

Supports Unlabeled and Labeled Data

The SimCSE approach can be trained on either unlabeled data (the `unsup` variant) or labeled data (the `sup` variant, which is this model).

Pretraining Based on PhoBERT

Uses PhoBERT as the pretrained language model, fully leveraging the linguistic characteristics of Vietnamese.

Model Capabilities

Generate Vietnamese sentence embeddings

Sentence similarity calculation

Text retrieval

Use Cases

Text Similarity

Sentence Similarity Calculation

Calculate the similarity between two Vietnamese sentences.
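
As a minimal sketch of this use case, sentence embeddings are commonly compared with cosine similarity. The short vectors below are made-up stand-ins for real model outputs, and `cosine_similarity` is a hypothetical helper written for illustration, not part of the model's API.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: real embeddings come from the model).
emb_a = [0.2, 0.7, 0.1]
emb_b = [0.25, 0.65, 0.05]  # points in nearly the same direction as emb_a
emb_c = [0.9, -0.1, 0.3]    # points in a different direction

score_ab = cosine_similarity(emb_a, emb_b)
score_ac = cosine_similarity(emb_a, emb_c)
print(score_ab > score_ac)
```

Scores closer to 1 indicate vectors pointing in nearly the same direction, which is usually read as higher semantic similarity.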

Information Retrieval

Vietnamese Text Retrieval

Used to retrieve Vietnamese documents most relevant to the query sentence.
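
One way to sketch retrieval over precomputed embeddings is to L2-normalize every vector and rank documents by dot product with the query, which is equivalent to ranking by cosine similarity. The vectors and the `top_k` helper below are illustrative assumptions, not part of the model or its libraries.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) pairs for the k documents most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = normalize(doc)
        # Dot product of unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy document embeddings and query embedding.
doc_vecs = [
    [0.9, 0.3, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.2, 0.9],
]
query_vec = [0.2, 0.9, 0.1]

results = top_k(query_vec, doc_vecs, k=2)
print([idx for idx, _ in results])
```

For a large corpus, the document vectors would typically be normalized once and stored, so that only the query needs processing at search time.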

🚀 SimeCSE_Vietnamese: Simple Contrastive Learning of Sentence Embeddings with Vietnamese

SimeCSE_Vietnamese is a pre-trained model for Vietnamese sentence embeddings. The embeddings it produces can be used in natural language processing tasks such as sentence similarity calculation.

✨ Features

The pre-training approach of SimeCSE_Vietnamese is based on SimCSE, a contrastive learning method for sentence embeddings.
It encodes input sentences with a pre-trained language model, here PhoBERT.
It works with both unlabeled and labeled data.
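
The contrastive objective behind SimCSE can be sketched as an in-batch loss: each sentence embedding is pulled toward its positive and pushed away from the positives of the other sentences in the batch. The pure-Python version below is only an illustration under stated assumptions (toy 2-D vectors, a temperature of 0.05, a hypothetical `info_nce_loss` helper); it is not the training code used for this model.

```python
import math


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch contrastive loss.

    anchors[i] and positives[i] form a positive pair; every positives[j]
    with j != i acts as a negative for anchors[i].
    """
    if not anchors or len(anchors) != len(positives):
        raise ValueError("need two non-empty batches of equal size")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, pos) / temperature for pos in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        # -log(exp(logit_i) / sum_j exp(logit_j))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical toy batch: aligned pairs should give a much lower loss
# than the same vectors paired incorrectly.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives_aligned = [[0.9, 0.1], [0.1, 0.9]]
positives_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce_loss(anchors, positives_aligned)
loss_shuffled = info_nce_loss(anchors, positives_shuffled)
print(loss_aligned < loss_shuffled)
```

In the SimCSE paper, the unsupervised setting builds each positive by encoding the same sentence twice with different dropout masks, while the supervised setting takes positives from labeled sentence pairs.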

📦 Installation

Using SimeCSE_Vietnamese with `sentence-transformers`

Install sentence-transformers:
- pip install -U sentence-transformers
Install pyvi for word segmentation:
- pip install pyvi

Using SimeCSE_Vietnamese with `transformers`

Install transformers:
- pip install -U transformers
Install pyvi for word segmentation:
- pip install pyvi

💻 Usage Examples

Using SimeCSE_Vietnamese with `sentence-transformers`

Basic Usage

from sentence_transformers import SentenceTransformer
from pyvi.ViTokenizer import tokenize

model = SentenceTransformer('VoVanPhuc/sup-SimCSE-VietNamese-phobert-base')

sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
          'Nghệ sĩ làm thiện nguyện - minh bạch là việc cấp thiết.',
          'Bắc Giang tăng khả năng điều trị và xét nghiệm.',
          'HLV futsal Việt Nam tiết lộ lý do hạ Lebanon.',
          'việc quan trọng khi kêu gọi quyên góp từ thiện là phải minh bạch, giải ngân kịp thời.',
          '20% bệnh nhân Covid-19 có thể nhanh chóng trở nặng.',
          'Thái Lan thua giao hữu trước vòng loại World Cup.',
          'Cựu tuyển thủ Nguyễn Bảo Quân: May mắn ủng hộ futsal Việt Nam',
          'Chủ ki-ốt bị đâm chết trong chợ đầu mối lớn nhất Thanh Hoá.',
          'Bắn chết người trong cuộc rượt đuổi trên sông.'
          ]

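# PhoBERT expects word-segmented input, so segment each sentence with pyvi first.
sentences = [tokenize(sentence) for sentence in sentences]
# Encode the segmented sentences into one embedding vector per sentence.
embeddings = model.encode(sentences)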

Using SimeCSE_Vietnamese with `transformers`

Basic Usage

import torch
from transformers import AutoModel, AutoTokenizer
from pyvi.ViTokenizer import tokenize

PhobertTokenizer = AutoTokenizer.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")
model = AutoModel.from_pretrained("VoVanPhuc/sup-SimCSE-VietNamese-phobert-base")

sentences = ['Kẻ đánh bom đinh tồi tệ nhất nước Anh.',
          'Nghệ sĩ làm thiện nguyện - minh bạch là việc cấp thiết.',
          'Bắc Giang tăng khả năng điều trị và xét nghiệm.',
          'HLV futsal Việt Nam tiết lộ lý do hạ Lebanon.',
          'việc quan trọng khi kêu gọi quyên góp từ thiện là phải minh bạch, giải ngân kịp thời.',
          '20% bệnh nhân Covid-19 có thể nhanh chóng trở nặng.',
          'Thái Lan thua giao hữu trước vòng loại World Cup.',
          'Cựu tuyển thủ Nguyễn Bảo Quân: May mắn ủng hộ futsal Việt Nam',
          'Chủ ki-ốt bị đâm chết trong chợ đầu mối lớn nhất Thanh Hoá.',
          'Bắn chết người trong cuộc rượt đuổi trên sông.'
          ]

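# PhoBERT expects word-segmented input, so segment each sentence with pyvi first.
sentences = [tokenize(sentence) for sentence in sentences]

# Convert the segmented sentences into padded, truncated PyTorch tensors.
inputs = PhobertTokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Inference only: disable gradient tracking and take the pooled output as the sentence embedding.
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output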

📚 Documentation

Pre-trained models

Model	Parameters	Architecture
`VoVanPhuc/sup-SimCSE-VietNamese-phobert-base`	135M	base
`VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base`	135M	base

📄 License

Citation

@article{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   journal={arXiv preprint arXiv:2104.08821},
   year={2021}
}

@inproceedings{phobert,
   title={{PhoBERT: Pre-trained language models for Vietnamese}},
   author={Dat Quoc Nguyen and Anh Tuan Nguyen},
   booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
   year={2020},
   pages={1037--1042}
}
