CANINE-c

Developed by Google
CANINE-c is a character-level encoding model pretrained on multilingual text, operating directly on Unicode characters without explicit tokenization.
Downloads: 191.50k
Release Time: 3/2/2022

Model Overview

CANINE-c is a multilingual text encoder trained with self-supervised learning. It operates directly on characters, so no separate tokenization step is needed. It is pretrained with masked language modeling and next sentence prediction objectives and can be fine-tuned for a range of downstream NLP tasks.

Model Features

No Tokenization
Operates directly on Unicode characters without explicit tokenizers like WordPiece or SentencePiece.
Multilingual Support
Pretrained on Wikipedia data in 104 languages, offering broad language coverage.
Character-Level Processing
Each character is converted to a Unicode code point, simplifying input preprocessing.
Autoregressive Character Loss
During pretraining, masked character spans are predicted autoregressively, which strengthens character-level prediction.
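
The character-level input described above can be illustrated with a minimal sketch in plain Python. It only shows the idea of mapping each character to its Unicode code point and back; the helper names encode_codepoints and decode_codepoints, the max_length parameter, and the padding value 0 are assumptions made for illustration and are not taken from the model's own preprocessing code.

```python
def encode_codepoints(text, max_length=None, pad_id=0):
    """Map each character to its Unicode code point.

    max_length and pad_id are hypothetical conveniences for this sketch:
    when max_length is given, the result is truncated or right-padded.
    """
    ids = [ord(ch) for ch in text]
    if max_length is not None:
        ids = ids[:max_length]
        ids += [pad_id] * (max_length - len(ids))
    return ids


def decode_codepoints(ids, pad_id=0):
    """Invert encode_codepoints, skipping the (assumed) padding value."""
    return "".join(chr(i) for i in ids if i != pad_id)


# Works the same way for any script, since no vocabulary is involved.
sample = "h\u00e9llo \u4e16\u754c"  # Latin text with an accent plus two CJK characters
ids = encode_codepoints(sample, max_length=12)
print(ids)
print(decode_codepoints(ids))
```

Because the input IDs are just code points, the same function covers every language without a vocabulary file; the trade-off is that sequences are longer than with subword tokenization.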

Model Capabilities

Multilingual Text Understanding
Character-Level Text Encoding
Masked Language Modeling
Next Sentence Prediction

Use Cases

Natural Language Processing
Sequence Classification
Can be used for text classification tasks such as sentiment analysis and topic classification.
Token Classification
Suitable for sequence labeling tasks like named entity recognition and part-of-speech tagging.
Question Answering
Can be used to build question-answering systems, processing user queries based on character-level understanding.