
Dense Encoder: MS MARCO DistilBERT with word2vec 256k Vocabulary (MLM 210k, Embeddings Updated)

Developed by vocab-transformers
DistilBERT model with a word2vec-initialized 256k vocabulary, optimized for sentence similarity and information retrieval tasks
Downloads: 23
Release Time: 3/2/2022

Model Overview

This model employs an extended 256k vocabulary initialized with word2vec and is trained on the MS MARCO dataset, making it suitable for sentence embedding generation and semantic similarity calculation.
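A minimal usage sketch with the sentence-transformers library is shown below. The Hugging Face model identifier is inferred from this card's title and developer name, so verify it before relying on it.

```python
from sentence_transformers import SentenceTransformer, util

# Model ID inferred from this card; verify on Hugging Face before use.
model = SentenceTransformer(
    "vocab-transformers/dense_encoder-msmarco-distilbert-word2vec256k-MLM_210k_emb_updated"
)

sentences = [
    "How do dense retrievers work?",
    "Dense retrieval encodes queries and documents into vectors.",
]
# Encode to tensors so the similarity utilities can be applied directly.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]))
```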

Model Features

Extended vocabulary
Uses a 256k vocabulary initialized with word2vec, providing broader lexical coverage than standard BERT models.
Efficient training
Built on the DistilBERT architecture, reducing model complexity while maintaining performance.
Specialized optimization
Fine-tuned for information retrieval using MarginMSELoss on the MS MARCO dataset (see the training sketch below).
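For illustration, here is a hedged sketch of MarginMSELoss fine-tuning with sentence-transformers. The base encoder, the training triple, and the margin label are placeholders, not the author's actual training script or data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Stand-in base encoder; the card's model started from a word2vec-initialized
# 256k-vocabulary DistilBERT, which is not reproduced here.
model = SentenceTransformer("distilbert-base-uncased")

# MarginMSELoss consumes (query, positive, negative) triples whose label is the
# score margin from a cross-encoder teacher: CE(q, pos) - CE(q, neg).
train_examples = [
    InputExample(
        texts=[
            "what is a dense retriever",
            "Dense retrievers embed queries and passages into one vector space.",
            "The Eiffel Tower is located in Paris.",
        ],
        label=8.3,  # placeholder teacher margin
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MarginMSELoss(model=model)

# Legacy fit() API of sentence-transformers; newer versions also offer a Trainer.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```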

Model Capabilities

Sentence embedding generation
Semantic similarity calculation
Information retrieval
Document matching
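
These capabilities compose naturally into ad-hoc retrieval: encode the corpus once, then rank passages against each query embedding. A sketch using sentence-transformers' util.semantic_search follows; the corpus and query strings are illustrative, and the model ID is assumed as above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "vocab-transformers/dense_encoder-msmarco-distilbert-word2vec256k-MLM_210k_emb_updated"  # assumed ID
)

corpus = [
    "MS MARCO is a large-scale dataset for passage ranking.",
    "DistilBERT is a distilled, smaller version of BERT.",
    "Paris is the capital of France.",
]
# Encode the corpus once; reuse the embeddings for every incoming query.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("What is MS MARCO?", convert_to_tensor=True)
# semantic_search returns, per query, a ranked list of {"corpus_id", "score"} dicts.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f'{hit["score"]:.3f}  {corpus[hit["corpus_id"]]}')
```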

Use Cases

Information retrieval
Search engine optimization
Improving document relevance ranking for search engines
Achieves an MRR@10 of 34.91 on the MS MARCO dev set
QA systems
Matching user questions with candidate answers in knowledge bases
Achieves nDCG@10 scores of 67.56 and 68.18 on TREC-DL 2019 and 2020, respectively
Semantic analysis
Document deduplication
Identifying semantically similar documents (see the deduplication sketch below)
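
One way to implement document deduplication is sentence-transformers' paraphrase-mining utility, which returns the most similar pairs in a collection without materializing the full pairwise matrix. The documents and the 0.9 threshold below are illustrative, not values from this card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "vocab-transformers/dense_encoder-msmarco-distilbert-word2vec256k-MLM_210k_emb_updated"  # assumed ID
)

docs = [
    "The quarterly report was submitted on Friday.",
    "On Friday, the quarterly report was submitted.",
    "The cafeteria menu changes every week.",
]
# paraphrase_mining returns (score, i, j) triples sorted by descending similarity.
pairs = util.paraphrase_mining(model, docs)
for score, i, j in pairs:
    if score > 0.9:  # illustrative near-duplicate threshold
        print(f"Possible duplicates ({score:.2f}): {docs[i]!r} / {docs[j]!r}")
```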