Word Order Jina
This is a SentenceTransformer model fine-tuned from jina-embeddings-v2-base-en for generating sentence embeddings and computing semantic similarity.
Downloads: 37
Release Time: 12/3/2024
Model Overview
The model maps sentences and paragraphs into a 768-dimensional dense vector space, suitable for tasks like semantic text similarity, semantic search, paraphrase mining, text classification, and clustering.
Model Features
Efficient semantic encoding
Encodes sentences and paragraphs into 768-dimensional dense vectors efficiently
Multiple negatives ranking training
Trained with a multiple negatives ranking loss to sharpen the model's ability to distinguish closely related sentences
Mixed dataset training
Trained on a combination of the word_orders and negation_dataset datasets to improve its sensitivity to word order and negation
Model Capabilities
Compute sentence similarity
Generate text embeddings
Semantic search
Text classification
Text clustering
Use Cases
Information retrieval
Semantic search
Retrieve relevant documents based on the meaning of a query rather than keyword matching
Improves the relevance and accuracy of search results (a sketch follows this list)
Text analysis
Text clustering
Automatically group semantically similar documents
Helps discover thematic structures in document collections
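Semantic search in practice looks like the sketch below: encode the query and the documents with this model, then rank documents by similarity. This is a minimal illustration assuming the bwang0911/word-order-jina checkpoint introduced later in this card; the corpus sentences and variable names are made up for the example.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bwang0911/word-order-jina")

# Illustrative document collection and query (not from the model card)
corpus = [
    "The museum opens at nine and closes at five.",
    "Paint preserves wood against moisture.",
    "Trains from London to Paris run hourly.",
]
query = "How can timber be protected from water damage?"

# Encode both sides, then score every document against the query
corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])
scores = model.similarity(query_embedding, corpus_embeddings)[0]

# Print documents from most to least relevant
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i].item():.3f}  {corpus[i]}")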
SentenceTransformer based on jinaai/jina-embeddings-v2-base-en
This SentenceTransformer model is fine-tuned from jinaai/jina-embeddings-v2-base-en on the word_orders and negation_dataset datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space, enabling applications such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
Quick Start
Prerequisites
First, install the sentence-transformers library:
pip install -U sentence-transformers
Run Inference
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("bwang0911/word-order-jina")
# Run inference
sentences = [
'Paint preserves wood',
'Coating protects timber',
'timber coating protects',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
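The three example sentences appear chosen to probe word-order sensitivity: the second paraphrases the first with different words, while the third reuses the second sentence's words in scrambled order. A small extension of the snippet above (reusing its similarities variable; the exact values depend on the released weights) makes that comparison explicit:

# Paraphrase pair: different words, same meaning
print(f"paraphrase      : {similarities[0, 1].item():.3f}")
# Scrambled pair: same words as sentence 2, different order
print(f"scrambled order : {similarities[1, 2].item():.3f}")
# A pure bag-of-words model would rate the scrambled pair as nearly identical;
# a word-order-aware model should keep these two comparisons clearly distinct.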
Features
- Semantic Understanding: Maps sentences and paragraphs to a 768-dimensional dense vector space for semantic analysis.
- Fine-Tuned: Fine-tuned on the word_orders and negation_dataset datasets.
- Multiple Applications: Can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | jinaai/jina-embeddings-v2-base-en |
| Maximum Sequence Length | 128 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Datasets | word_orders, negation_dataset |
| Language | en |
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: JinaBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
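Because the stack ends with a Normalize() module, the output embeddings are unit-length, so a plain dot product should agree with the cosine similarity returned by model.similarity. A quick sanity check, assuming the same checkpoint as in the usage example above:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bwang0911/word-order-jina")
emb = model.encode(["Paint preserves wood", "Coating protects timber"])

# Norms should be (approximately) 1.0 because of the Normalize() module
print(np.linalg.norm(emb, axis=1))

# For unit-length vectors, the dot product equals the cosine similarity
print(float(emb[0] @ emb[1]))
print(model.similarity(emb[:1], emb[1:])[0, 0].item())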
Training Details
Training Datasets
word_orders
- Dataset: word_orders at 99609ac
- Size: 1,002 training samples
- Columns: anchor, pos, and neg
- Approximate statistics based on the first 1000 samples:
| | anchor | pos | neg |
|---|---|---|---|
| type | string | string | string |
| details | min: 5 tokens, mean: 12.34 tokens, max: 32 tokens | min: 5 tokens, mean: 12.1 tokens, max: 30 tokens | min: 5 tokens, mean: 11.51 tokens, max: 24 tokens |
- Samples:
| anchor | pos | neg |
|---|---|---|
| The river flows from the mountains to the sea | Water travels from mountain peaks to ocean | The river flows from the sea to the mountains |
| Train departs London for Paris | Railway journey from London heading to Paris | Train departs Paris for London |
| Cargo ship sails from Shanghai to Singapore | Maritime route Shanghai to Singapore | Cargo ship sails from Singapore to Shanghai |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20, "similarity_fct": "cos_sim" }
negation_dataset
- Dataset: negation_dataset at cd02256
- Size: 10,000 training samples
- Columns: anchor, entailment, and negative
- Approximate statistics based on the first 1000 samples:
| | anchor | entailment | negative |
|---|---|---|---|
| type | string | string | string |
| details | min: 6 tokens, mean: 16.48 tokens, max: 44 tokens | min: 4 tokens, mean: 9.63 tokens, max: 31 tokens | min: 5 tokens, mean: 10.46 tokens, max: 32 tokens |
- Samples:
| anchor | entailment | negative |
|---|---|---|
| Two young girls are playing outside in a non-urban environment. | Two girls are playing outside. | Two girls are not playing outside. |
| A man with a red shirt is watching another man who is standing on top of a attached cart filled to the top. | A man is standing on top of a cart. | A man is not standing on top of a cart. |
| A man in a blue shirt driving a Segway type vehicle. | A person is riding a motorized vehicle. | A person is not riding a motorized vehicle. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20, "similarity_fct": "cos_sim" }
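Both datasets are trained with MultipleNegativesRankingLoss: for each anchor, its own positive is the correct candidate, while the other in-batch positives and the explicit negative column act as negatives. The following is a minimal sketch of that objective using the listed parameters (scale 20, cosine similarity); it illustrates the idea rather than reproducing the sentence-transformers implementation:

import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, hard_negatives, scale=20.0):
    # Cross-entropy over scaled cosine similarities: the i-th positive is the
    # correct "class" for the i-th anchor; every other candidate is a negative.
    a = F.normalize(anchors, dim=-1)
    candidates = F.normalize(torch.cat([positives, hard_negatives]), dim=-1)
    scores = a @ candidates.T * scale      # (batch, 2 * batch) similarity matrix
    labels = torch.arange(len(a))          # index of each anchor's true positive
    return F.cross_entropy(scores, labels)

# Toy call with random vectors standing in for encoded (anchor, pos, neg) triplets
batch, dim = 4, 768
print(multiple_negatives_ranking_loss(torch.randn(batch, dim),
                                       torch.randn(batch, dim),
                                       torch.randn(batch, dim)))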
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 128
- warmup_ratio: 0.1
- fp16: True
- batch_sampler: no_duplicates
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 128
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
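A comparable fine-tuning run could be set up as sketched below using the non-default hyperparameters listed above. The dataset here is a placeholder triplet; the actual run used the word_orders and negation_dataset datasets, whose repository paths are not given in this card, and the column names are illustrative.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# The Jina base model ships custom modeling code, hence trust_remote_code=True
model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en", trust_remote_code=True)

# Placeholder (anchor, positive, negative) triplets
train_dataset = Dataset.from_dict({
    "anchor": ["Train departs London for Paris"],
    "positive": ["Railway journey from London heading to Paris"],
    "negative": ["Train departs Paris for London"],
})

loss = MultipleNegativesRankingLoss(model, scale=20)

args = SentenceTransformerTrainingArguments(
    output_dir="word-order-jina",
    num_train_epochs=3,
    per_device_train_batch_size=128,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate in-batch negatives
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()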
Featured Recommended AI Models
- Jina Embeddings V3 (jinaai · 3.7M · 911): Jina Embeddings V3 is a multilingual sentence embedding model supporting over 100 languages, specializing in sentence similarity and feature extraction tasks. Text Embedding · Transformers · Supports Multiple Languages
- Ms Marco MiniLM L6 V2 (cross-encoder · 2.5M · 86 · Apache-2.0): A cross-encoder model trained on the MS Marco passage ranking task for query-passage relevance scoring in information retrieval. Text Embedding · English
- Opensearch Neural Sparse Encoding Doc V2 Distill (opensearch-project · 1.8M · 7 · Apache-2.0): A sparse retrieval model based on distillation technology, optimized for OpenSearch, supporting inference-free document encoding with improved search relevance and efficiency over V1. Text Embedding · Transformers · English
- Sapbert From PubMedBERT Fulltext (cambridgeltl · 1.7M · 49 · Apache-2.0): A biomedical entity representation model based on PubMedBERT, optimized for semantic relation capture through self-aligned pre-training. Text Embedding · English
- Gte Large (thenlper · 1.5M · 278 · MIT): GTE-Large is a powerful sentence transformer model focused on sentence similarity and text embedding tasks, excelling in multiple benchmark tests. Text Embedding · English
- Gte Base En V1.5 (Alibaba-NLP · 1.5M · 63 · Apache-2.0): GTE-base-en-v1.5 is an English sentence transformer model focused on sentence similarity tasks, excelling in multiple text embedding benchmarks. Text Embedding · Transformers · Supports Multiple Languages
- Gte Multilingual Base (Alibaba-NLP · 1.2M · 246 · Apache-2.0): GTE Multilingual Base is a multilingual sentence embedding model supporting over 50 languages, suitable for tasks like sentence similarity calculation. Text Embedding · Transformers · Supports Multiple Languages
- Polybert (kuelumbus · 1.0M · 5): polyBERT is a chemical language model designed to achieve fully machine-driven ultrafast polymer informatics. It maps PSMILES strings into 600-dimensional dense fingerprints to numerically represent polymer chemical structures. Text Embedding · Transformers
- Bert Base Turkish Cased Mean Nli Stsb Tr (emrecan · 1.0M · 40 · Apache-2.0): A sentence embedding model based on Turkish BERT, optimized for semantic similarity tasks. Text Embedding · Transformers · Other
- GIST Small Embedding V0 (avsolatorio · 945.68k · 29 · MIT): A text embedding model fine-tuned based on BAAI/bge-small-en-v1.5, trained with the MEDI dataset and MTEB classification task datasets, optimized for query encoding in retrieval tasks. Text Embedding · Safetensors · English