ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design
Implementation of ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design by Lukia Mistryukova*, Vladimir Manuilov*, Konstantin Avchaciov*, and Peter O. Fedichev.
ProtoBind-Diff is a masked diffusion language model that generates target-specific, high-quality small-molecule ligands. Trained entirely without structural input, it enables structure-independent ligand design across the full proteome and matches structure-based methods in docking and Boltz-1 benchmarks.
If you have questions, feel free to open an issue or email us at lukiia.mistriukova@gero.ai, konstantin.avchaciov@gero.ai, vladimir.manuylov@gero.ai, or peter.fedichev@gero.ai.
You can also try out the model on Hugging Face Spaces.
This repository contains the evaluation toolkit and supporting data for ProtoBind-Diff. It is organized as follows:
We selected 12 protein targets to benchmark molecular generation quality across different models.
- One folder per model (`pocket2mol/`, `targetdiff/`, etc.) containing SMILES files of generated ligands.
- `bindingdb/` – sets of random molecules (used as reference inactive molecules).
- `bindingdb_active/` – sets of true active molecules.
- `actives_bindingdb_cl` – clustered true actives; one representative per similarity cluster.
- `CrossDocked2020/` – cleaned PDB files and corresponding ligands for the targets.
- `fasta/` – FASTA sequences of the targets.
Jupyter notebooks to reproduce the paper figures: docking/Boltz-1 score distributions, interpretation of attention maps, chemical property distributions, and UMAP-based target specificity analysis. We also added `allign.ipynb` to show differences between canonical sequences and PDB receptor sequences.
- Raw docking and Boltz-1 ipTM score tables for each method.
- Model code and training/inference scripts.
- Scripts to download and prepare data for training.
Clone the repository:
git clone https://github.com/gero-science/ProtoBind-Diff.git
You can install the project locally in editable mode:
python -m venv protodiff_env
source protodiff_env/bin/activate
pip install -e .
Then you'll be able to run:
protobind-infer   # inference
protobind-train   # training
This script generates potential binding molecules (in SMILES format) for a given protein sequence, computing the protein embeddings on the fly.
Note: The script automatically downloads the model checkpoint from Hugging Face Hub and uses the best available hardware.
You need to specify the protein sequence with `--fasta_file` or `--sequence`.
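A FASTA file consists of a header line starting with `>` followed by the amino-acid sequence. The fragment below is a purely hypothetical example, not a real target:

```
>example_target
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
```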
And you are ready to run inference:
protobind-infer --fasta_file examples/input.fasta
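After a run, the generated molecules are written to a plain-text SMILES file (by default `../outputs/generated_smiles.txt`; see the options below). Here is a minimal post-processing sketch, assuming one SMILES string per line and an RDKit installation (this helper is not part of the repository):

```python
# Sketch: keep only the generated SMILES that RDKit can parse.
# The path assumes the default --output_dir and --output settings.
from rdkit import Chem

with open("../outputs/generated_smiles.txt") as fh:
    smiles = [line.strip() for line in fh if line.strip()]

valid = [s for s in smiles if Chem.MolFromSmiles(s) is not None]
print(f"{len(valid)}/{len(smiles)} generated SMILES parse as valid molecules")
```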
View All Command-Line Options
- `--output_dir`: Specify a different output folder (default: `../outputs`).
- `--output`: Change the output filename (default: `generated_smiles.txt`).
- `--n_batches`: Set the number of generation batches (default: 5).
- `--batch_size`: Set the number of molecules generated per batch (default: 10).
- `--checkpoint_path`: Use a local model checkpoint file.
- `--tokenizer_path`: Use a local tokenizer file.
- `--model_name`: Specify the ESM model for embeddings (default: `esm2_t33_650M_UR50D`).
- `--cache`: Set a custom cache folder for downloads (default: `../cache`).
- `--sampling_steps`: Set the number of steps during sampling (default: 250).
- `--lig_max_length`: Set the max length of generated molecules (default: 170).
- `--nucleus_p`: Set the value of the nucleus sampling parameter (default: 0.9).
- `--eta`: Set the probability of remasking (default: 0.1).
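For example, to generate 200 molecules (20 batches of 10) with more sampling steps and a custom output location, you can combine the options above:

protobind-infer --fasta_file examples/input.fasta --n_batches 20 --batch_size 10 --sampling_steps 500 --output_dir ./outputs --output my_target_smiles.txt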
Training the model involves a three-step pipeline: 1) downloading the dataset, 2) pre-processing the data to generate embeddings and similarity matrices, and 3) running the training script with a configuration file.
First, download the necessary dataset files (protein/ligand pairs, tokenizer, etc.) from Hugging Face Hub.
Run the command:
python scripts/download.py --output_dir ./data/experiments/diffusion
This will create a directory containing the raw `data.csv` and other necessary files.
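As a quick, optional sanity check before preprocessing, you can inspect the download (a sketch; the exact column names depend on the released dataset):

```python
# Sketch: inspect the downloaded protein/ligand pairs.
import pandas as pd

df = pd.read_csv("data/experiments/diffusion/data.csv")
print(df.shape)             # rows (pairs) x columns
print(df.columns.tolist())  # actual schema of the release
print(df.head())
```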
Next, you need to generate protein embeddings and a molecular similarity matrix from the raw data. This script performs both tasks. Note: This is a computationally intensive step. Calculating the ESM embeddings requires a GPU and at least 64 GB of RAM (depending on the size of the `categorical_mappings.json` file).
Run the command:
python scripts/esm_and_sim_matrix.py --data_dir ./data/experiments/diffusion --out_dir ./data/experiments/diffusion
This uses the files in `--data_dir` and saves the new files: `all_prots_*.pt` (embeddings) and `all_smiles_sparse_*.npz` (Tanimoto matrix).
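For intuition, the sketch below shows one way to build a sparse Tanimoto similarity matrix from SMILES with RDKit and SciPy. The Morgan fingerprint settings and the sparsity threshold are illustrative assumptions, not necessarily what `esm_and_sim_matrix.py` uses:

```python
# Sketch: pairwise Tanimoto similarities stored as a sparse matrix.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from scipy import sparse

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
       for s in smiles]

rows, cols, vals = [], [], []
for i, fp in enumerate(fps):
    # Compare molecule i against all later molecules in one call.
    for j, sim in enumerate(DataStructs.BulkTanimotoSimilarity(fp, fps[i + 1:]),
                            start=i + 1):
        if sim > 0.3:  # drop dissimilar pairs to keep the matrix sparse
            rows.append(i); cols.append(j); vals.append(sim)

matrix = sparse.coo_matrix((vals, (rows, cols)), shape=(len(fps), len(fps)))
sparse.save_npz("all_smiles_sparse_example.npz", matrix.tocsr())
```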
You are now ready to train your model. Training is configured using `.yaml` files located in the `configs/` directory.
protobind-train -o ./experiment_dir --exp_dir_prefix experiment_name --yaml ./configs/masked_diffusion.yaml
This command uses the following arguments:

- `--yaml`: Specifies the main model and data configuration file.
- `--output_dir` (`-o` in the command above): Defines the parent directory where all experiment results will be saved.
- `--exp_dir_prefix`: Names the folder created for this run (e.g., `--output_dir ./experiments --exp_dir_prefix my_first_run` produces `./experiments/my_first_run`).
To tune the model, you can edit the parameters in `configs/masked_diffusion.yaml`. Details on all tunable hyperparameters can be found in `protobind_diff/parsers.py`.
If you use this code or the models in your research, please cite the following paper:
@article{Mistryukova2025.06.16.659955,
author = {Mistryukova, Lukia and Manuilov, Vladimir and Avchaciov, Konstantin and Fedichev, Peter O.},
title = {ProtoBind-Diff: A Structure-Free Diffusion Language Model for Protein Sequence-Conditioned Ligand Design},
year = {2025},
journal = {bioRxiv}
}
The code and model weights are released under the MIT license. See the LICENSE file for details.