# Large-scale pre-training

**Qwen3 8B Base** · Apache-2.0 · unsloth · 5,403 downloads · 1 like
Qwen3-8B-Base is the latest generation of the Tongyi Qianwen (Qwen) model series, with 8.2 billion parameters and support for 119 languages, suitable for a wide range of natural language processing tasks.
Tags: Large Language Model, Transformers

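A minimal sketch of loading a base (non-chat) checkpoint like this for plain text completion with transformers; the repo id `Qwen/Qwen3-8B-Base` is an assumption, so substitute the checkpoint you actually use:

```python
# Minimal text-completion sketch for a base (non-chat) causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B-Base"  # assumed repo id; swap in the checkpoint you use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # an 8B model in bf16 needs roughly 16 GB of memory
    device_map="auto",
)

inputs = tokenizer("Large-scale pre-training works because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
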
**LHM** · Apache-2.0 · 3DAIGC · 22 downloads · 21 likes
LHM is a feed-forward model that reconstructs an animatable 3D human from a single image within seconds. Trained with image reconstruction losses on a large-scale video dataset, it generalizes well across diverse real-world scenarios.
Tags: 3D Vision, English

**Izanami Wav2vec2 Large** · Other · imprt · 89 downloads · 1 like
A Japanese wav2vec 2.0 Large model pre-trained on large-scale Japanese TV broadcast audio.
Tags: Speech Recognition, Japanese

**Kushinada Hubert Large** · Apache-2.0 · imprt · 1,041 downloads · 2 likes
A Japanese HuBERT Large model pre-trained on 62,215 hours of Japanese TV broadcast audio for speech feature extraction.
Tags: Speech Recognition, Japanese

**Kushinada Hubert Base** · Apache-2.0 · imprt · 1,922 downloads · 1 like
A Japanese HuBERT Base model pre-trained on 62,215 hours of Japanese TV broadcast audio for speech feature extraction.
Tags: Speech Recognition, Japanese

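Both Kushinada checkpoints are plain HuBERT encoders, so frame-level features can be pulled out with the standard transformers classes. A minimal sketch, assuming the repo id `imprt/kushinada-hubert-base` and a local `speech.wav`:

```python
# Frame-level speech feature extraction with a HuBERT checkpoint.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

model_id = "imprt/kushinada-hubert-base"  # assumed repo id
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id).eval()

waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # HuBERT expects 16 kHz

inputs = feature_extractor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(features.shape)
```
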
**Videomaev2 Base** · OpenGVLab · 3,565 downloads · 5 likes
VideoMAEv2-Base is a self-supervised video feature extraction model that employs a dual masking mechanism, pre-trained on the UnlabeledHybrid-1M dataset.
Tags: Video Processing

**Sam2 Hiera Large.fb R1024 2pt1** · Apache-2.0 · timm · 31 downloads · 0 likes
A SAM2 model built on the HieraDet image encoder, focused on efficient image feature extraction.
Tags: Image Segmentation, Transformers

**Eva02 Enormous Patch14 Clip 224.laion2b** · MIT · timm · 38 downloads · 0 likes
EVA-CLIP is a vision-language model based on the CLIP architecture, supporting zero-shot image classification.
Tags: Text-to-Image

**Vit Huge Patch14 Clip 224.metaclip 2pt5b** · timm · 3,173 downloads · 0 likes
A dual-framework compatible vision-language model trained on the MetaCLIP-2.5B dataset, supporting zero-shot image classification.
Tags: Image Classification

**Vit Large Patch14 Clip 224.metaclip 2pt5b** · timm · 2,648 downloads · 0 likes
A dual-framework compatible vision model trained on the MetaCLIP-2.5B dataset, supporting zero-shot image classification.
Tags: Image Classification

**Vit Large Patch14 Clip 224.metaclip 400m** · timm · 294 downloads · 0 likes
A Vision Transformer model trained on the MetaCLIP-400M dataset, supporting zero-shot image classification.
Tags: Image Classification

**Vit Base Patch16 Clip 224.metaclip 2pt5b** · timm · 889 downloads · 1 like
A dual-framework compatible vision model trained on the MetaCLIP-2.5B dataset, supporting both the OpenCLIP and timm frameworks.
Tags: Image Classification

**Vit Base Patch32 Clip 224.metaclip 2pt5b** · timm · 5,571 downloads · 0 likes
A Vision Transformer model trained on the MetaCLIP-2.5B dataset, compatible with both the open_clip and timm frameworks.
Tags: Image Classification

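The MetaCLIP checkpoints above, like the other CLIP-style timm entries in this list, follow the same zero-shot classification recipe. A minimal open_clip sketch, assuming the hub id `hf-hub:timm/vit_base_patch32_clip_224.metaclip_2pt5b` and a local `cat.jpg`:

```python
# Zero-shot image classification with a CLIP-style checkpoint via open_clip.
import torch
import open_clip
from PIL import Image

tag = "hf-hub:timm/vit_base_patch32_clip_224.metaclip_2pt5b"  # assumed hub id
model, _, preprocess = open_clip.create_model_and_transforms(tag)
tokenizer = open_clip.get_tokenizer(tag)
model.eval()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)  # probability assigned to each caption
```
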
**Structtable InternVL2 1B** · Apache-2.0 · U4R · 1,833 downloads · 37 likes
A multimodal table recognition model based on InternVL2-1B that converts table images to LaTeX, HTML, or Markdown.
Tags: Image-to-Text, Safetensors, Multilingual

**Eurollm 1.7B** · Apache-2.0 · utter-project · 3,444 downloads · 65 likes
EuroLLM-1.7B is the first pre-trained model in the EuroLLM series, able to understand and generate text in many European and other related languages.
Tags: Large Language Model, Transformers, Multilingual

**Retnet 1.3B 100B** · MIT · fla-hub · 57 downloads · 1 like
A text generation model built on the RetNet (Retentive Network) architecture and trained on the SlimPajama-627B dataset.
Tags: Large Language Model, Safetensors, Multilingual

**Internvit 6B 224px** · MIT · OpenGVLab · 160 downloads · 25 likes
InternViT-6B-224px is a foundation vision model focused on image feature extraction, with about 5.9 billion parameters and support for 224x224-pixel image inputs.
Tags: Image Classification, Transformers

**Vit Bigg 14 CLIPA Datacomp1b** · Apache-2.0 · UCSC-VLAA · 623 downloads · 4 likes
A CLIPA-v2 model for zero-shot image classification, achieving efficient visual representation learning through contrastive image-text training.
Tags: Text-to-Image

**Vit Bigg 14 CLIPA 336 Datacomp1b** · Apache-2.0 · UCSC-VLAA · 259 downloads · 4 likes
A CLIPA-v2 model, an efficient contrastive image-text model focused on zero-shot image classification.
Tags: Text-to-Image

**Vit H 14 CLIPA Datacomp1b** · Apache-2.0 · UCSC-VLAA · 65 downloads · 1 like
A CLIPA-v2 model, an efficient contrastive vision-language model designed for zero-shot image classification.
Tags: Text-to-Image

**Metaclip L14 400m** · facebook · 325 downloads · 3 likes
MetaCLIP is a vision-language model trained on CommonCrawl data to build a shared image-text embedding space.
Tags: Text-to-Image, Transformers

**Metaclip B16 400m** · facebook · 51 downloads · 1 like
MetaCLIP is a vision-language model trained on CommonCrawl data to build a shared image-text embedding space.
Tags: Text-to-Image, Transformers

**Metaclip B32 Fullcc2.5b** · facebook · 413 downloads · 7 likes
A MetaCLIP model trained on 2.5 billion image-text pairs from CommonCrawl (CC) to construct a shared image-text embedding space.
Tags: Text-to-Image, Transformers

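The facebook MetaCLIP repos carry the Transformers tag, so they also load with the stock transformers CLIP classes. A minimal sketch, assuming the repo id `facebook/metaclip-b32-fullcc2.5b` and a local `cat.jpg`:

```python
# Zero-shot classification with a MetaCLIP checkpoint via transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "facebook/metaclip-b32-fullcc2.5b"  # assumed repo id
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=Image.open("cat.jpg"),
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity as probabilities
print(probs)
```
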
**Unsup Simcse Ja Large** · cl-nagoya · 59 downloads · 1 like
An unsupervised SimCSE model for Japanese, designed to produce high-quality Japanese sentence embeddings.
Tags: Text Embedding, Transformers, Japanese

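A minimal embedding sketch, assuming the repo id `cl-nagoya/unsup-simcse-ja-large` and the [CLS]-pooling convention common to SimCSE models:

```python
# Sentence embeddings from a SimCSE-style encoder via [CLS] pooling.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "cl-nagoya/unsup-simcse-ja-large"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = ["今日は良い天気ですね。", "散歩に行きましょう。"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0]  # [CLS] token embedding

# Cosine similarity between the two sentences
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim.item())
```
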
**Nucleotide Transformer V2 50m Multi Species** · InstaDeepAI · 18.72k downloads · 3 likes
The Nucleotide Transformer models are foundation language models pre-trained on whole-genome DNA sequences, integrating data from over 3,200 human genomes and 850 diverse species.
Tags: Molecular Model, Transformers

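A minimal sketch of extracting mean-pooled DNA sequence embeddings; the repo id and the `trust_remote_code=True` flag are assumptions based on how the v2 checkpoints are typically distributed:

```python
# Mean-pooled DNA sequence embeddings from a Nucleotide Transformer checkpoint.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True).eval()

sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
inputs = tokenizer(sequences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]

# Mean-pool over tokens, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```
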
**Stt En Fastconformer Ctc Large** · nvidia · 1,001 downloads · 12 likes
A large automatic speech recognition (ASR) model based on the FastConformer architecture, designed to transcribe English speech to text.
Tags: Speech Recognition, English

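This model ships through NVIDIA NeMo rather than transformers. A minimal transcription sketch, assuming the NeMo model name `stt_en_fastconformer_ctc_large`; the exact `transcribe()` signature varies slightly across NeMo versions:

```python
# English speech-to-text with a NeMo FastConformer-CTC model.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_en_fastconformer_ctc_large"
)
transcripts = asr_model.transcribe(["speech.wav"])  # 16 kHz mono WAV
print(transcripts[0])
```
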
**Sam Vit Large** · Apache-2.0 · facebook · 455.43k downloads · 28 likes
SAM is a vision model that generates high-quality object masks from input points or bounding boxes, with zero-shot transfer capability.
Tags: Image Segmentation, Transformers, Other

**Sam Vit Base** · Apache-2.0 · facebook · 635.09k downloads · 137 likes
SAM is a vision model that generates high-quality object masks from input prompts such as points or boxes, supporting zero-shot segmentation.
Tags: Image Segmentation, Transformers, Other

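The SAM checkpoints in this list (base, large, and huge) share one prompt-driven API in transformers. A minimal point-prompt sketch using `facebook/sam-vit-base` and a local `photo.jpg`:

```python
# Point-prompted mask prediction with SAM via transformers.
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

model_id = "facebook/sam-vit-base"
model = SamModel.from_pretrained(model_id).eval()
processor = SamProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
input_points = [[[450, 600]]]  # one (x, y) prompt on the object of interest

inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Upscale the predicted masks back to the original image resolution
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(masks[0].shape)
```
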
**Mgpt 13B** · MIT · ai-forever · 4,742 downloads · 49 likes
mGPT 13B is a multilingual language model supporting 61 languages across 25 language families, trained on 600 GB of text.
Tags: Large Language Model, Transformers, Multilingual

**Sam Vit Huge** · Apache-2.0 · facebook · 324.78k downloads · 163 likes
SAM is a vision model that generates high-quality object masks from input prompts, supporting zero-shot transfer to new tasks.
Tags: Image Segmentation, Transformers, Other

**Nucleotide Transformer 2.5b Multi Species** · InstaDeepAI · 2,714 downloads · 38 likes
A DNA sequence analysis model pre-trained on genomes from 850 species, supporting downstream tasks such as molecular phenotype prediction.
Tags: Molecular Model, Transformers

**FRED T5 Large** · Apache-2.0 · ai-forever · 998 downloads · 25 likes
A Russian pre-trained language model based on the T5 architecture, trained with a UL2-style mixture of 7 denoisers and supporting a variety of text generation tasks.
Tags: Large Language Model, Transformers, Other

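A minimal generation sketch; the `<LM>` denoiser prefix and GPT2-style tokenizer follow the published FRED-T5 usage, while the repo id `ai-forever/FRED-T5-large` is an assumption:

```python
# Prompting FRED-T5 with one of its denoiser prefixes.
import torch
from transformers import GPT2Tokenizer, T5ForConditionalGeneration

model_id = "ai-forever/FRED-T5-large"  # assumed repo id
tokenizer = GPT2Tokenizer.from_pretrained(model_id, eos_token="</s>")
model = T5ForConditionalGeneration.from_pretrained(model_id).eval()

# '<LM>' selects the plain language-modelling denoiser; the span-corruption
# denoisers are selected with other prefixes such as '<SC1>'.
text = "<LM>Однажды утром"
input_ids = torch.tensor([tokenizer.encode(text)])
outputs = model.generate(input_ids, max_new_tokens=32, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
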
**Deberta V1 Base** · Apache-2.0 · deepvk · 160 downloads · 8 likes
DeBERTa-base is a pre-trained bidirectional encoder for Russian, mainly used for Russian text processing tasks.
Tags: Large Language Model, Transformers, Multilingual

**CLIP Convnext Large D.laion2b S26b B102k Augreg** · MIT · laion · 80.74k downloads · 5 likes
A large-scale ConvNeXt-Large CLIP model trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval.
Tags: Text-to-Image, TensorBoard

**T5 Efficient Gc4 All German Small El32** · MIT · GermanT5 · 52 downloads · 4 likes
A T5 model trained on the large-scale cleaned German Common Crawl corpus (GC4), specialized for German natural language processing tasks.
Tags: Large Language Model, Transformers, German

**FRED T5 1.7B** · Apache-2.0 · ai-forever · 1,671 downloads · 77 likes
A Russian pre-trained language model based on the T5 architecture with 1.7 billion parameters, trained with a UL2-style mixture of 7 denoising tasks.
Tags: Large Language Model, Transformers, Other

**Ruscibert** · Apache-2.0 · ai-forever · 1,044 downloads · 7 likes
A Russian BERT model for scientific text, trained jointly by the Sber AI team and the MLSA Lab of the AI Institute at Moscow State University.
Tags: Large Language Model, Transformers, Other

**Bit 50** · Apache-2.0 · google · 9,766 downloads · 4 likes
BiT is a simple recipe for scaling up pre-training of ResNet-like architectures, yielding significant improvements in transfer learning.
Tags: Image Classification, Transformers, Other

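A minimal classification sketch with the transformers BiT port, assuming the repo id `google/bit-50` and a local `photo.jpg`:

```python
# Image classification with a BiT checkpoint via transformers.
import torch
from PIL import Image
from transformers import AutoImageProcessor, BitForImageClassification

model_id = "google/bit-50"  # assumed repo id
processor = AutoImageProcessor.from_pretrained(model_id)
model = BitForImageClassification.from_pretrained(model_id).eval()

inputs = processor(Image.open("photo.jpg").convert("RGB"), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the top logit back to its class name
label = model.config.id2label[logits.argmax(-1).item()]
print(label)
```
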
**Gpt2 Small** · MIT · ComCom · 1,032 downloads · 3 likes
GPT-2 is an autoregressive Transformer language model pre-trained on a large English corpus via self-supervised learning, and it excels at text generation.
Tags: Large Language Model, Transformers, English

**Roberta Large NER** · 51la5 · 60.39k downloads · 48 likes
A named entity recognition model based on XLM-RoBERTa-large, fine-tuned on the English CoNLL-2003 dataset.
Tags: Sequence Labeling, Multilingual

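A minimal tagging sketch via the transformers pipeline API, assuming the repo id `51la5/roberta-large-NER`:

```python
# Named entity recognition with a fine-tuned checkpoint through the pipeline API.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="51la5/roberta-large-NER",  # assumed repo id
    aggregation_strategy="simple",    # merge sub-word pieces into whole entities
)
print(ner("Ada Lovelace worked with Charles Babbage in London."))
```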