Olmo2 11B SuperBPE T180k
Developed by UW
An 11-billion-parameter large language model trained with the SuperBPE tokenizer, which combines superword units that span word boundaries with conventional subword tokenization.
Downloads: 29
Release Date: 2025-03-19
Model Overview
A large language model that extends the OLMo2-7B architecture to 11B parameters and is trained with the SuperBPE tokenizer, offering enhanced text comprehension and generation.
Model Features
SuperBPE Tokenizer
Introduces superword units that can span word boundaries while retaining conventional subword tokenization.
Efficient Context Processing
A 3,000-token context window, equivalent in byte-level capacity to the 4,096-token context of a standard BPE model.
Large-scale Training
Trained on 238 billion tokens with a vocabulary size of 200,000.
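To make the superword idea above concrete, here is a minimal toy sketch of tokenization with a vocabulary that includes tokens crossing word boundaries. This is not the real T180k vocabulary or the actual SuperBPE merge procedure (which is BPE-based); it uses simple greedy longest-match over an invented vocabulary purely to illustrate how a superword entry changes the segmentation.

```python
# Toy illustration of superword tokenization (NOT the real SuperBPE
# algorithm or vocabulary): a vocabulary entry may span a space, so a
# common multi-word phrase can become a single token.

def greedy_tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position;
    fall back to a single character if nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# Invented subword-only vocabulary vs. one that adds a superword unit.
subword_vocab = {"by ", "by", " the", "the", " way", "way", " "}
superword_vocab = subword_vocab | {"by the way"}  # crosses word boundaries

print(greedy_tokenize("by the way", subword_vocab))    # three subword tokens
print(greedy_tokenize("by the way", superword_vocab))  # one superword token
```

Because the superword vocabulary covers the same text in fewer tokens, a fixed token budget holds more bytes of text, which is the intuition behind the 3,000-token SuperBPE context matching a 4,096-token BPE context.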
Model Capabilities
Text generation
Natural language understanding
Use Cases
Text generation
Creative writing
Generate coherent and creative text content.
Code generation
Assist in generating programming code snippets.
Natural language processing
Text summarization
Automatically generate concise summaries of text.
Question answering systems
Build intelligent question-answering systems.