GENERator-eukaryote-3b-base: An Open-Source Genome Model Trained on Eukaryotic DNA for Ultra-Long Base Pair Analysis

Generator Eukaryote 3b Base

Developed by GenerTeam

GENERator is a generative genomic foundation model with a context length of 98,000 base pairs and 3 billion parameters, trained on an extensive dataset of eukaryotic DNA.

Protein Model

Transformers

Open Source License: MIT

Tags: Long sequence generation, Cross-species genome, 98,000 base pair context

Downloads 1,599

Release Time: 2/11/2025

Model Overview

GENERator is a foundation model for genome sequence generation and analysis, with enhanced cross-species understanding and generation capabilities.

Model Features

Long context processing

Supports context lengths of up to 98,000 base pairs

Cross-species understanding

Trained on diverse eukaryotic DNA datasets with cross-species analysis capabilities

Large-scale pre-training

Pre-trained on 386 billion base pairs of DNA sequences

Model Capabilities

DNA sequence generation

Genome sequence analysis

Sequence embedding representation

Use Cases

Genome research

Gene sequence generation

Generate new DNA sequences based on input sequences

Can generate biologically plausible DNA sequence fragments

Sequence feature extraction

Obtain embedding representations of DNA sequences

Can be used for downstream analysis tasks such as gene classification or functional prediction
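
As a rough illustration of how sequence embeddings could feed a downstream classification task, the sketch below assigns an embedding to the class whose centroid is most similar by cosine similarity. It is a minimal, assumption-laden example rather than the method used by the GENERator authors: the helper names (`cosine_similarity`, `centroid`, `nearest_centroid`), the class labels, and the two-dimensional toy vectors are all hypothetical, and real embeddings from the model would have the model's hidden size.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_centroid(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine_similarity(embedding, centroids[label]))


# Toy, hand-written vectors standing in for sequence embeddings (hypothetical data).
labelled_embeddings = {
    "class_a": [[1.0, 0.0], [0.8, 0.2]],
    "class_b": [[0.0, 1.0], [0.1, 0.9]],
}
centroids = {label: centroid(vectors) for label, vectors in labelled_embeddings.items()}

prediction = nearest_centroid([0.9, 0.1], centroids)
print(prediction)
```

In practice a trained classifier on top of the embeddings would likely perform better; the nearest-centroid rule is used here only because it needs no extra dependencies.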

🚀 GENERator-eukaryote-3b-base Model

GENERator is a generative genomic foundation model with a 98k base pair context length and 3B parameters. It is trained on a large eukaryotic DNA dataset, which gives it enhanced understanding and generation capabilities across various organisms.

🚀 Quick Start

In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow GENERator with enhanced understanding and generation capabilities across various organisms.

For more technical details, please refer to our paper, GENERator: A Long-Context Generative Genomic Foundation Model. The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERator.

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config

max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Prepend the BOS token to each sequence (special tokens are not added automatically below).
sequences = [tokenizer.bos_token + sequence for sequence in sequences]

# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

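# Generate the sequences. top_k=1 with a near-zero temperature makes decoding
# effectively greedy, so the output is deterministic.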
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences
print(decoded_sequences)

# Nonsensical decoded sequences (e.g., 'AAAAAA') are expected here,
# because the input sequences are too short to provide sufficient context.
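
The tokenizer call above relies on `truncation=True` to cut off anything beyond the model's context. For sequences longer than the 98,000 base pairs quoted in this card, one option is to split them into windows before tokenizing. The sketch below is a minimal illustration under that assumption: `split_into_windows` and its `window` and `overlap` parameters are hypothetical helpers, not part of the GENERator codebase, and the authoritative limit in tokens still comes from `config.max_position_embeddings`.

```python
def split_into_windows(sequence: str, window: int = 98_000, overlap: int = 0) -> list[str]:
    """Split a DNA string into consecutive windows of at most `window` base pairs.

    `overlap` is the number of base pairs shared between neighbouring windows.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + window])
        # Stop once the current window already reaches the end of the sequence.
        if start + window >= len(sequence):
            break
    return windows


# Small demonstration with a tiny window so the result is easy to inspect.
demo = split_into_windows("ACGTACGTAC", window=4)
demo_overlap = split_into_windows("ACGTACGTAC", window=4, overlap=2)
print(demo)
print(demo_overlap)
```

Each window can then be passed through the tokenizer and model in the same way as the short example sequences, keeping `truncation=True` as a safeguard.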

Advanced Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")

config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True to automatically add special tokens,
# such as the BOS and EOS tokens, at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size)
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
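
The per-sequence loop above can also be written as a single indexing operation, and mean pooling over non-padding positions is a common alternative to taking the last token. The sketch below shows both on small dummy tensors so it runs without downloading the model. The function names `last_token_embeddings` and `mean_pool` are hypothetical, and the last-token variant assumes right padding, as in the example above. Which pooling strategy works better for GENERator embeddings is not stated in this card.

```python
import torch


def last_token_embeddings(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padding token (assumes right padding)."""
    last_indices = attention_mask.sum(dim=1) - 1
    batch_indices = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_indices, last_indices]


def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)  # Guard against division by zero.
    return summed / counts


# Dummy data: batch of 2, sequence length 3, hidden size 2.
dummy_hidden = torch.arange(12, dtype=torch.float32).reshape(2, 3, 2)
dummy_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

last_emb = last_token_embeddings(dummy_hidden, dummy_mask)
mean_emb = mean_pool(dummy_hidden, dummy_mask)
print(last_emb.shape, mean_emb.shape)
```

With real model outputs, `outputs.hidden_states[-1]` and `inputs["attention_mask"]` from the example above would take the place of the dummy tensors.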

📄 License

The project uses the MIT license.

📚 Citation

@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model}, 
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272}, 
}
