Word Order Jina
This is a SentenceTransformer model fine-tuned from jina-embeddings-v2-base-en for generating sentence embeddings and computing semantic similarity.
Downloads: 37
Release Time: 12/3/2024
Model Overview
The model maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks like semantic text similarity, semantic search, paraphrase mining, text classification, and clustering.
Model Features
Efficient semantic encoding
Encodes sentences and paragraphs into 768-dimensional dense vectors efficiently
Multiple negatives ranking training
Trained with a multiple negatives ranking loss to sharpen the model's ability to distinguish closely related sentences
Mixed dataset training
Trained on a combination of the word_orders and negation_dataset datasets to improve its sensitivity to word order and negation
Model Capabilities
Compute sentence similarity
Generate text embeddings
Semantic search
Text classification
Text clustering
Use Cases
Information retrieval
Semantic search
Retrieve relevant documents based on the meaning of a query rather than keyword matching
Improves the relevance and accuracy of search results (a sketch follows this list)
Text analysis
Text clustering
Automatically group semantically similar documents
Helps discover thematic structures in document collections
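Semantic search in practice looks like the sketch below: encode the query and the documents with this model, then rank documents by similarity. This is a minimal illustration assuming the bwang0911/word-order-jina checkpoint introduced later in this card; the corpus sentences and variable names are made up for the example.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bwang0911/word-order-jina")

# Illustrative document collection and query (not from the model card)
corpus = [
    "The museum opens at nine and closes at five.",
    "Paint preserves wood against moisture.",
    "Trains from London to Paris run hourly.",
]
query = "How can timber be protected from water damage?"

# Encode both sides, then score every document against the query
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])
scores = model.similarity(query_embedding, corpus_embeddings)[0]

# Print documents from most to least relevant
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i].item():.3f}  {corpus[i]}")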
SentenceTransformer based on jinaai/jina-embeddings-v2-base-en
This SentenceTransformer model is fine-tuned from jinaai/jina-embeddings-v2-base-en on the word_orders and negation_dataset datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space, enabling applications such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
Quick Start
Prerequisites
First, install the sentence-transformers library:
pip install -U sentence-transformers
Run Inference
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("bwang0911/word-order-jina")
# Run inference
sentences = [
'Paint preserves wood',
'Coating protects timber',
'timber coating protects',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
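The three example sentences appear chosen to probe word-order sensitivity: the second paraphrases the first with different words, while the third reuses the second sentence's words in scrambled order. A small extension of the snippet above (reusing its similarities variable; the exact values depend on the released weights) makes that comparison explicit:

# Paraphrase pair: different words, same meaning
print(f"paraphrase      : {similarities[0, 1].item():.3f}")
# Scrambled pair: same words as sentence 2, different order
print(f"scrambled order : {similarities[1, 2].item():.3f}")
# A pure bag-of-words model would rate the scrambled pair as nearly identical;
# a word-order-aware model should keep these two comparisons clearly distinct.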
Features
- Semantic Understanding: Maps sentences and paragraphs to a 768-dimensional dense vector space for semantic analysis.
- Fine-Tuned: Fine-tuned on the word_orders and negation_dataset datasets.
- Multiple Applications: Can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | jinaai/jina-embeddings-v2-base-en |
| Maximum Sequence Length | 128 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Datasets | word_orders, negation_dataset |
| Language | en |
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: JinaBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
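Because the stack ends with a Normalize() module, the output embeddings are unit-length, so a plain dot product should agree with the cosine similarity returned by model.similarity. A quick sanity check, assuming the same checkpoint as in the usage example above:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bwang0911/word-order-jina")
emb = model.encode(["Paint preserves wood", "Coating protects timber"])

# Norms should be (approximately) 1.0 because of the Normalize() module
print(np.linalg.norm(emb, axis=1))

# For unit-length vectors, the dot product equals the cosine similarity
print(float(emb[0] @ emb[1]))
print(model.similarity(emb[:1], emb[1:])[0, 0].item())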
Training Details
Training Datasets
word_orders
- Dataset: word_orders at 99609ac
- Size: 1,002 training samples
- Columns: anchor, pos, and neg
- Approximate statistics based on the first 1000 samples:
| | anchor | pos | neg |
|---|---|---|---|
| type | string | string | string |
| details | min: 5 tokens, mean: 12.34 tokens, max: 32 tokens | min: 5 tokens, mean: 12.1 tokens, max: 30 tokens | min: 5 tokens, mean: 11.51 tokens, max: 24 tokens |
- Samples:
| anchor | pos | neg |
|---|---|---|
| The river flows from the mountains to the sea | Water travels from mountain peaks to ocean | The river flows from the sea to the mountains |
| Train departs London for Paris | Railway journey from London heading to Paris | Train departs Paris for London |
| Cargo ship sails from Shanghai to Singapore | Maritime route Shanghai to Singapore | Cargo ship sails from Singapore to Shanghai |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20, "similarity_fct": "cos_sim" }
negation_dataset
- Dataset: negation_dataset at cd02256
- Size: 10,000 training samples
- Columns: anchor, entailment, and negative
- Approximate statistics based on the first 1000 samples:
| | anchor | entailment | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 6 tokens, mean: 16.48 tokens, max: 44 tokens | min: 4 tokens, mean: 9.63 tokens, max: 31 tokens | min: 5 tokens, mean: 10.46 tokens, max: 32 tokens |
- Samples:
| anchor | entailment | negative |
|---|---|---|
| Two young girls are playing outside in a non-urban environment. | Two girls are playing outside. | Two girls are not playing outside. |
| A man with a red shirt is watching another man who is standing on top of a attached cart filled to the top. | A man is standing on top of a cart. | A man is not standing on top of a cart. |
| A man in a blue shirt driving a Segway type vehicle. | A person is riding a motorized vehicle. | A person is not riding a motorized vehicle. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20, "similarity_fct": "cos_sim" }
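Both datasets are trained with MultipleNegativesRankingLoss: for each anchor, its own positive is the correct candidate, while the other in-batch positives and the explicit negative column act as negatives. The following is a minimal sketch of that objective using the listed parameters (scale 20, cosine similarity); it illustrates the idea rather than reproducing the sentence-transformers implementation:

import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, hard_negatives, scale=20.0):
    # Cross-entropy over scaled cosine similarities: the i-th positive is the
    # correct "class" for the i-th anchor; every other candidate is a negative.
    a = F.normalize(anchors, dim=-1)
    candidates = F.normalize(torch.cat([positives, hard_negatives]), dim=-1)
    scores = a @ candidates.T * scale      # (batch, 2 * batch) similarity matrix
    labels = torch.arange(len(a))          # index of each anchor's true positive
    return F.cross_entropy(scores, labels)

# Toy call with random vectors standing in for encoded (anchor, pos, neg) triplets
batch, dim = 4, 768
print(multiple_negatives_ranking_loss(torch.randn(batch, dim),
                                       torch.randn(batch, dim),
                                       torch.randn(batch, dim)))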
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 128
- warmup_ratio: 0.1
- fp16: True
- batch_sampler: no_duplicates
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 128
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
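A comparable fine-tuning run could be set up as sketched below using the non-default hyperparameters listed above. The dataset here is a placeholder triplet; the actual run used the word_orders and negation_dataset datasets, whose repository paths are not given in this card, and the column names are illustrative.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# The Jina base model ships custom modeling code, hence trust_remote_code=True
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)

# Placeholder (anchor, positive, negative) triplets
train_dataset = Dataset.from_dict({
    "anchor": ["Train departs London for Paris"],
    "positive": ["Railway journey from London heading to Paris"],
    "negative": ["Train departs Paris for London"],
})

loss = MultipleNegativesRankingLoss(model, scale=20)

args = SentenceTransformerTrainingArguments(
    output_dir="word-order-jina",
    num_train_epochs=3,
    per_device_train_batch_size=128,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate in-batch negatives
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()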
Featured Recommended AI Models
- Jina Embeddings V3 (jinaai · 3.7M · 911): Jina Embeddings V3 is a multilingual sentence embedding model supporting over 100 languages, specializing in sentence similarity and feature extraction tasks. Text Embedding · Transformers · Supports Multiple Languages
- Ms Marco MiniLM L6 V2 (cross-encoder · 2.5M · 86 · Apache-2.0): A cross-encoder model trained on the MS Marco passage ranking task for query-passage relevance scoring in information retrieval. Text Embedding · English
- Opensearch Neural Sparse Encoding Doc V2 Distill (opensearch-project · 1.8M · 7 · Apache-2.0): A sparse retrieval model based on distillation technology, optimized for OpenSearch, supporting inference-free document encoding with improved search relevance and efficiency over V1. Text Embedding · Transformers · English
- Sapbert From PubMedBERT Fulltext (cambridgeltl · 1.7M · 49 · Apache-2.0): A biomedical entity representation model based on PubMedBERT, optimized for semantic relation capture through self-aligned pre-training. Text Embedding · English
- Gte Large (thenlper · 1.5M · 278 · MIT): GTE-Large is a powerful sentence transformer model focused on sentence similarity and text embedding tasks, excelling in multiple benchmark tests. Text Embedding · English
- Gte Base En V1.5 (Alibaba-NLP · 1.5M · 63 · Apache-2.0): GTE-base-en-v1.5 is an English sentence transformer model focused on sentence similarity tasks, excelling in multiple text embedding benchmarks. Text Embedding · Transformers · Supports Multiple Languages
- Gte Multilingual Base (Alibaba-NLP · 1.2M · 246 · Apache-2.0): GTE Multilingual Base is a multilingual sentence embedding model supporting over 50 languages, suitable for tasks like sentence similarity calculation. Text Embedding · Transformers · Supports Multiple Languages
- Polybert (kuelumbus · 1.0M · 5): polyBERT is a chemical language model designed to achieve fully machine-driven ultrafast polymer informatics. It maps PSMILES strings into 600-dimensional dense fingerprints to numerically represent polymer chemical structures. Text Embedding · Transformers
- Bert Base Turkish Cased Mean Nli Stsb Tr (emrecan · 1.0M · 40 · Apache-2.0): A sentence embedding model based on Turkish BERT, optimized for semantic similarity tasks. Text Embedding · Transformers · Other
- GIST Small Embedding V0 (avsolatorio · 945.68k · 29 · MIT): A text embedding model fine-tuned based on BAAI/bge-small-en-v1.5, trained with the MEDI dataset and MTEB classification task datasets, optimized for query encoding in retrieval tasks. Text Embedding · Safetensors · English