🚀 Evo-1 (Phase 1)
Evo is a biological foundation model capable of long-context modeling and design. It uses the StripedHyena architecture to model sequences at single-nucleotide, byte-level resolution with near-linear scaling of compute and memory relative to context length.
🚀 Quick Start
We identified and fixed an issue caused by an incorrect permutation of some projections, which affected generation quality. To use the fixed model revision, load the model as follows:
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "togethercomputer/evo-1-8k-base"  # or "togethercomputer/evo-1-131k-base"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    revision="1.1_fix",
)
✨ Features
- Long-context Modeling: Capable of long-context modeling and design in the biological domain.
- StripedHyena Architecture: Enables single-nucleotide, byte-level resolution sequence modeling with near-linear scaling of compute and memory.
- Intermediate Checkpoints: We release the weights of 15 intermediate checkpoints from phase 1 and phase 2 of pretraining.
📦 Installation
To use StripedHyena outside of the playground, you will need to install custom kernels. Please follow the instructions from the standalone repository.
💻 Usage Examples
Example usage is provided in the standalone repo.
📚 Documentation
About
Evo uses the StripedHyena architecture to enable modeling of sequences at single-nucleotide, byte-level resolution with near-linear scaling of compute and memory relative to context length. Evo has 7 billion parameters and is trained on OpenGenome, a prokaryotic whole-genome dataset containing ~300 billion tokens.
Evo-1 (Phase 1) is our first model in the Evo family, trained at a context length of 8k.
| Checkpoint Name | Description |
|---|---|
| evo-1-8k-base | A model pretrained with 8,192 context. We use this model as the base model for molecular-scale finetuning tasks. |
| evo-1-131k-base | A model pretrained with 131,072 context using evo-1-8k-base as the initialization. We use this model to reason about and generate sequences at the genome scale. |
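As a sketch of how one of these checkpoints might be used to reason about a sequence, the snippet below scores a DNA string with per-nucleotide log-likelihoods. The checkpoint names come from the table above; the repository id prefix, the tokenizer interface, and the `.logits` output field are assumptions about the remote code, not documented guarantees.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository id for the genome-scale checkpoint in the table above.
ckpt = "togethercomputer/evo-1-131k-base"

config = AutoConfig.from_pretrained(ckpt, trust_remote_code=True, revision="1.1_fix")
model = AutoModelForCausalLM.from_pretrained(ckpt, config=config, trust_remote_code=True, revision="1.1_fix").eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True, revision="1.1_fix")

sequence = "ATGCGTACGTTAGC"
input_ids = tokenizer(sequence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab)

# Log-likelihood of each nucleotide given the preceding context.
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_ll = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print(token_ll.sum().item())  # total log-likelihood of the sequence
```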
Model Architecture
StripedHyena is a deep signal processing, hybrid architecture composed of multi-head attention and gated convolutions arranged in Hyena blocks, improving over decoder-only Transformers.
Some highlights of the architecture:
- Efficient autoregressive generation via a recurrent mode (>500k tokens of generation on a single 80GB GPU)
- Significantly faster training and finetuning at long context (>3x at 131k)
- Improved scaling laws over state-of-the-art architectures (e.g., Transformer++) on both natural language and biological sequences.
- Robust to training beyond the compute-optimal frontier, e.g., training well beyond Chinchilla-optimal token amounts (see the preprint for details).
Parametrization for Inference and Finetuning
One of the advantages of deep signal processing models is their flexibility. Different parametrizations of convolutions can be used depending on the memory, expressivity and causality requirements of pretraining, finetuning or inference workloads.
The main classes are:
StripedHyena is a mixed precision model. Make sure to keep your `poles` and `residues` in `float32` precision, especially for longer prompts or training.
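A minimal sketch of one way to enforce this after loading. Matching the poles and residues by parameter-name substring is an assumption about the checkpoint's parameter naming, not a documented API.

```python
import torch

# Upcast convolution poles and residues to float32 while leaving the rest of the
# model in its original (lower) precision. The substring match is an assumption.
for name, param in model.named_parameters():
    if "poles" in name or "residues" in name:
        param.data = param.data.to(torch.float32)
```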
📄 License
This project is licensed under the Apache-2.0 license.
Cite
@article{nguyen2024sequence,
  author  = {Eric Nguyen and Michael Poli and Matthew G. Durrant and Brian Kang and Dhruva Katrekar and David B. Li and Liam J. Bartie and Armin W. Thomas and Samuel H. King and Garyk Brixi and Jeremy Sullivan and Madelena Y. Ng and Ashley Lewis and Aaron Lou and Stefano Ermon and Stephen A. Baccus and Tina Hernandez-Boussard and Christopher Ré and Patrick D. Hsu and Brian L. Hie},
  title   = {Sequence modeling and design from molecular to genome scale with Evo},
  journal = {Science},
  volume  = {386},
  number  = {6723},
  pages   = {eado9336},
  year    = {2024},
  doi     = {10.1126/science.ado9336},
  url     = {https://www.science.org/doi/abs/10.1126/science.ado9336}
}