English-phrases-bible Open Source Model - Free Deployment for Semantic Search and Precise Text Vector Mapping

English Phrases Bible

Developed by iamholmes

A sentence embedding model based on DistilBert TAS-B. It is optimized for semantic search and maps text to a 768-dimensional dense vector space.

Text Embedding

Transformers

Open Source License: Apache-2.0 #Semantic Search Optimization #Efficient Vector Encoding #Question-Answer Matching

Downloads 28

Release Time: 4/27/2022

Model Overview

This model is the sentence-transformers implementation of the DistilBert TAS-B model. It generates semantic embeddings of sentences and paragraphs and is suited to information retrieval and semantic similarity tasks.

Model Features

Efficient Semantic Encoding

Based on the lightweight DistilBert architecture, providing efficient semantic encoding capabilities for sentences and paragraphs

Search Optimization

Specifically optimized for information retrieval and semantic search tasks

High-Dimensional Vector Space

Maps text to a 768-dimensional dense vector space, capturing rich semantic information

Model Capabilities

Sentence Embedding Generation

Semantic Similarity Calculation

Information Retrieval

Document Ranking

Use Cases

Information Retrieval

Question-Answer Systems

Achieves precise question-answer matching by calculating semantic similarity between queries and candidate answers

Effectively identifies the most semantically relevant answers to questions
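
As a minimal sketch of the matching step described above, the snippet below picks the candidate answer whose embedding has the highest dot product with the query embedding. The three-dimensional vectors and the `best_answer` helper are hypothetical stand-ins for illustration only; real embeddings from this model have 768 dimensions.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def best_answer(
    query_emb: Sequence[float],
    answers: List[str],
    answer_embs: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return (answer, score) for the candidate with the highest dot score."""
    scores = [dot(query_emb, emb) for emb in answer_embs]
    best = max(range(len(answers)), key=scores.__getitem__)
    return answers[best], scores[best]


# Toy 3-dimensional vectors (assumed values, not real model output).
query_emb = [0.9, 0.1, 0.0]
answers = [
    "Around 9 Million people live in London",
    "London is known for its financial district",
]
answer_embs = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.3]]

answer, score = best_answer(query_emb, answers, answer_embs)
print(answer, round(score, 2))
```

In a real system the toy vectors would be replaced by the output of the model's encoder for the query and for each candidate answer.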

Document Search

Used to build semantic-based document search engines

Provides more relevant results compared to keyword-based search
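
A semantic document search typically scores every document embedding against the query embedding and keeps the best few. The sketch below shows that top-k step with the standard library only. The `top_k` helper and the two-dimensional vectors are illustrative assumptions, not part of the model or of sentence-transformers.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k(
    query_emb: Sequence[float],
    doc_embs: List[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return the k (document index, dot score) pairs with the highest scores."""
    scored = (
        (i, sum(q * d for q, d in zip(query_emb, emb)))
        for i, emb in enumerate(doc_embs)
    )
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-dimensional document embeddings (assumed values, not model output).
doc_embs = [[0.1, 0.9], [0.7, 0.3], [0.9, 0.2], [0.0, 0.1]]
query_emb = [1.0, 0.0]

hits = top_k(query_emb, doc_embs, k=2)
for index, score in hits:
    print(index, score)
```

For large corpora an exhaustive scan like this becomes slow, and an approximate nearest-neighbour index is the usual replacement; the scoring idea stays the same.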

Content Recommendation

🚀 sentence-transformers/msmarco-distilbert-base-tas-b

This model maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search tasks.

🚀 Quick Start

This is a port of the DistilBert TAS-B model to the sentence-transformers framework. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.

✨ Features

Sentence and Paragraph Mapping: Maps sentences and paragraphs to a 768-dimensional dense vector space.
Semantic Search Optimization: Optimized for semantic search tasks.

📦 Installation

The model is easiest to use with sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model
model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
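
The example above ranks documents by raw dot product rather than cosine similarity. The two measures can disagree when embeddings have different lengths, because the dot product also rewards vector magnitude. The following sketch uses toy two-dimensional vectors (assumed values, not model output) to show a case where the rankings flip.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 1.0]
doc_a = [3.0, 0.0]  # long vector, 45 degrees away from the query
doc_b = [1.0, 1.0]  # shorter vector, same direction as the query

# Dot product prefers doc_a because of its larger magnitude.
print("dot:   ", dot(query, doc_a), dot(query, doc_b))

# Cosine similarity prefers doc_b because only direction counts.
print("cosine:", round(cosine(query, doc_a), 3), round(cosine(query, doc_b), 3))
```

Since the usage example scores with `util.dot_score`, it seems safest to keep the dot product as the scoring function unless an alternative has been evaluated on your own data.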

Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation (here, CLS pooling) on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: take the output of the first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings


# Sentences we want sentence embeddings for
query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")

# Encode query and docs
query_emb = encode(query)
doc_emb = encode(docs)

# Compute dot score between query and all document embeddings
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
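
The `cls_pooling` function above keeps only the first token's hidden state for each input. To make that indexing concrete, here is a small sketch that uses nested Python lists in place of tensors, so it runs without PyTorch. The tiny batch and the `mean_pooling` helper are illustrative assumptions; mean pooling is shown only for contrast and, for brevity, ignores padding masks.

```python
from typing import List

Batch = List[List[List[float]]]  # [batch][token][hidden]


def cls_pooling(last_hidden_state: Batch) -> List[List[float]]:
    """Keep the first token's vector for every sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]


def mean_pooling(last_hidden_state: Batch) -> List[List[float]]:
    """Average all token vectors per sequence (contrast only, no padding mask)."""
    return [
        [sum(column) / len(sequence) for column in zip(*sequence)]
        for sequence in last_hidden_state
    ]


# Toy batch: 2 sequences, 3 tokens each, hidden size 2 (assumed values).
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
]

cls_vectors = cls_pooling(hidden)
mean_vectors = mean_pooling(hidden)
print(cls_vectors)
print(mean_vectors)
```

With tensors, the same selection is written `last_hidden_state[:, 0]`, which is what the example above does.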

📚 Documentation

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

🔧 Technical Details

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
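
The `max_seq_length` of 512 in the configuration above caps how many tokens reach the transformer, so longer inputs are cut off. The sketch below illustrates the effect on a token list, assuming the usual BERT-style layout in which the `[CLS]` and `[SEP]` special tokens count toward the limit. The `truncate` helper is hypothetical and is not the tokenizer's actual implementation.

```python
from typing import List, Sequence


def truncate(
    tokens: Sequence[str],
    max_seq_length: int = 512,
    cls_token: str = "[CLS]",
    sep_token: str = "[SEP]",
) -> List[str]:
    """Cut content tokens so the sequence, including specials, fits the limit."""
    if max_seq_length < 2:
        raise ValueError("max_seq_length must leave room for both special tokens")
    budget = max_seq_length - 2  # reserve two slots for [CLS] and [SEP]
    return [cls_token] + list(tokens[:budget]) + [sep_token]


# A hypothetical 600-token passage: only the first 510 content tokens survive.
long_passage = ["tok"] * 600
truncated = truncate(long_passage)
print(len(truncated))

# Short inputs are left intact apart from the added special tokens.
print(truncate(["hello", "world"]))
```

In practice this means text beyond the limit does not influence the embedding, so long documents are often split into passages before encoding.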

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citing & Authors

Have a look at: DistilBert TAS-B Model
