This repository provides the code necessary to reproduce the experiments presented in the paper Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences. The code is organized across the following repositories:
- Prot-xLSTM (current repository)
- DNA-xLSTM
- Chem-xLSTM
git clone https://github.com/ml-jku/Prot-xLSTM.git
cd Prot-xLSTM
conda env create -f prot_xlstm_env.yml
conda activate prot_xlstm
pip install -e .
This package also supports the use of ProtMamba and ProtTransformer++ models. If you want to enable flash attention for the transformer model, install flash-attn
separately.
Model weights and the processed dataset can be downloaded here. To reproduce the results, place the model weights in a checkpoints/
folder and copy the dataset to the data/
folder.
For an easy start with Prot-xLSTM applications, we provide two sample notebooks:
-
examples/generation.ipynb
: This notebook demonstrates how to generate and evaluate novel protein sequences based on a set context sequences. -
examples/variant_fitness.ipynb
: This notebook enables you to assess the mutational effects of amino acid substitutions on a target sequence, with the option to include context proteins as well.
configs/
: Configuration files for model training.data/
: Train, validation and test splits of the dataset.evaluation/
: Scripts to reproduce experiments and figures from the paper.evaluation/generation/
: Homology-conditioned sequence generation.evaluation/pretraining/
: Learning curves and test set performance.evaluation/proteingym/
: ProteinGym DMS substitution benchmark.
examples/
: Example notebooks for Prot-xLSTM applications.protxlstm/
: Implementation of Prot-xLSTM.
Download preprocessed data from here or download raw multiple-sequence alignment files from the OpenProteinSet dataset using:
aws s3 cp s3://openfold/uniclust30/data/a3m_files/ --recursive --no-sign-request --exclude "*" --include "*.a3m"
and preprocess data with (this takes several hours!):
python protxlstm/data.py
To train a Prot-xLSTM model, set the desired model parameters in configs/xlstm_default_config.yaml
and dataset/training parameters in configs/train_default_config.yaml
, and run:
python protxlstm/train.py
ProtMamba and Transformer++ models can be trained using the following:
python protxlstm/train.py --model_config_path=configs/mamba_default_config.yaml
python protxlstm/train.py --model_config_path=configs/llama_default_config.yaml
To evaluate a model on the test set, provide the name of the checkpoint folder (located in checkpoints/
) and the context length you want to evaluate on, and run:
python evaluation/pretraining/evaluate.py --model_name protxlstm_102M_60B --model_type xlstm --context_len 131072
To reprodruce the results on the sequence generation downstream-task, set the checkpoint path in evaluation/generation/run_sample_sequences.py
and run the script using:
python evaluation/generation/run_sample_sequences.py
To score the generated sequences, run:
python evaluation/generation/run_score_sequences.py
This will generate a dataframe for each selected protein cluster, containing all generated sequences and their relevant metrics.
To evaluate the model on the ProteinGym DMS Substitutions Benchmark, first download the DMS data directly from the ProteinGym website. Additionally, download the ColabFold MSAs used as context from this link. Place both the DMS data and ColabFold MSA files into a data/proteingym
directory, and then run:
python evaluation/proteingym/run.py
The underlying code was adapted from the ProtMamba repository, and includes original code from the xLSTM repository.
@article{schmidinger2024bio-xlstm,
title={{Bio-xLSTM}: Generative modeling, representation and in-context learning of biological and chemical sequences},
author={Niklas Schmidinger and Lisa Schneckenreiter and Philipp Seidl and Johannes Schimunek and Pieter-Jan Hoedt and Johannes Brandstetter and Andreas Mayr and Sohvi Luukkonen and Sepp Hochreiter and Günter Klambauer},
journal={arXiv},
doi = {},
year={2024},
url={}
}