This is the official code repository for the paper titled GenMol: A Drug Discovery Generalist with Discrete Diffusion (ICML 2025).
- We introduce GenMol, a model for unified and versatile molecule generation built on masked discrete diffusion that generates SAFE molecular sequences.
- We propose fragment remasking, an effective strategy for exploring chemical space using molecular fragments as the unit of exploration.
- We propose molecular context guidance (MCG), a guidance scheme for GenMol to effectively utilize molecular context information.
- We validate the efficacy and versatility of GenMol on a wide range of drug discovery tasks.
We introduce GenMol V2, trained with an extended SAFE syntax, demonstrating improved performance in de novo and fragment-constrained generation. Please refer to the section below: GenMol V2: GenMol with Extended SAFE Syntax.
Clone this repository:
git clone https://github.com/NVIDIA-Digital-Bio/genmol.git
cd genmol
Run the following command to install the dependencies:
bash env/setup.sh
Troubleshooting: ImportError: libXrender.so.1
Run the following command:
apt update && apt install -y libsm6 libxext6 && apt-get install -y libxrender-dev
Troubleshooting: ImportError: cannot import name '_CONFIG_FOR_DOC' from 'transformers.models.gpt2.modeling_gpt2'
Run the following command:
#!/bin/bash
# Use CONDA_PREFIX which points to current active environment
if [ -z "$CONDA_PREFIX" ]; then
echo "Error: No conda environment is currently active"
exit 1
fi
# Comment out all lines in the safe package __init__.py
sed -i 's/^/# /' "$CONDA_PREFIX/lib/python3.10/site-packages/safe/__init__.py"
# Import required packages
echo "from .converter import SAFEConverter, decode, encode" >> "$CONDA_PREFIX/lib/python3.10/site-packages/safe/__init__.py"
echo "Fixed safe package in environment: $CONDA_PREFIX"We provide the pretrained checkpoint. Place model.ckpt in the checkpoints directory and set the correct information in ./configs/base.yaml.
(Optional) To train GenMol from scratch, run the following command:
torchrun --nproc_per_node ${num_gpu} scripts/train.py hydra.run.dir=${save_dir} wandb.name=${exp_name}
Other hyperparameters can be adjusted in configs/base.yaml.
The training used 8 NVIDIA A100 GPUs and took ~5 hours.
We used the SAFE dataset to train GenMol. To use your own training dataset, first convert your SMILES dataset into SAFE by running the following command:
python scripts/preprocess_data.py ${input_path} ${data_path}
${input_path} is the path to the dataset file with one SMILES string per line. For example,
CCS(=O)(=O)N1CC(CC#N)(n2cc(-c3ncnc4[nH]ccc34)cn2)C1
NS(=O)(=O)c1cc2c(cc1Cl)NC(C1CC3C=CC1C3)NS2(=O)=O
...
${data_path} is the path to the processed dataset.
Then, set data in base.yaml to ${data_path}.
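To preview what the conversion produces for a single molecule, the safe package installed with the dependencies exposes encode and decode (see the __init__.py fix above). The following is a minimal sketch, not a replacement for scripts/preprocess_data.py:

```python
# Minimal sketch: preview the SMILES -> SAFE conversion for a single molecule.
# scripts/preprocess_data.py should still be used to build the actual training file.
import safe

smiles = "CCS(=O)(=O)N1CC(CC#N)(n2cc(-c3ncnc4[nH]ccc34)cn2)C1"
safe_str = safe.encode(smiles)     # fragment the molecule and emit a SAFE string
print(safe_str)

roundtrip = safe.decode(safe_str)  # convert back to SMILES as a sanity check
print(roundtrip)
```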
Run the following command to perform de novo generation:
python scripts/exps/denovo/run.py
Troubleshooting: _pickle.UnpicklingError: invalid load key, '<'
If you see this error, it is likely coming from /miniconda3/envs/genmol/lib/python3.10/site-packages/tdc/chem_utils/oracle/oracle.py, line 347, in readFragmentScores _fscores = pickle.load(f)
The root cause is a corrupted or incompletely downloaded pkl file for the SA score. The fix is to download the correct file from the official RDKit repository: https://github.com/rdkit/rdkit/tree/master/Contrib/SA_Score/fpscores.pkl.gz
Extract the downloaded file into the genmol/oracle directory.
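For reference, the download and extraction can be scripted as below. This is a sketch: the raw-file URL and the extracted filename fpscores.pkl are assumptions based on the RDKit repository layout; adjust them to match what the oracle expects.

```python
# Sketch: download fpscores.pkl.gz from the RDKit repository and extract it into
# genmol/oracle. The raw URL and target filename are assumptions; adjust if needed.
import gzip
import shutil
import urllib.request

url = "https://raw.githubusercontent.com/rdkit/rdkit/master/Contrib/SA_Score/fpscores.pkl.gz"
urllib.request.urlretrieve(url, "fpscores.pkl.gz")

with gzip.open("fpscores.pkl.gz", "rb") as fin, open("genmol/oracle/fpscores.pkl", "wb") as fout:
    shutil.copyfileobj(fin, fout)
```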
The experiment in the paper used 1 NVIDIA A100 GPU.
Run the following command to perform fragment-constrained generation:
python scripts/exps/frag/run.py
The experiment in the paper used 1 NVIDIA A100 GPU.
We provide the fragment vocabularies in the folder scripts/exps/pmo/vocab.
(Optional) Place zinc250k.csv in the data folder, then run the following command to construct the fragment vocabularies and label the molecules with property labels:
python scripts/exps/pmo/get_vocab.py
Run the following command to perform goal-directed hit generation:
python scripts/exps/pmo/run.py -o ${oracle_name}
The generated molecules will be saved in scripts/exps/pmo/main/genmol/results.
Run the following command to evaluate the result:
python scripts/exps/pmo/eval.py ${file_name}
# e.g., python scripts/exps/pmo/eval.py scripts/exps/pmo/main/genmol/results/albuterol_similarity_0.csv
The experiment in the paper used 1 NVIDIA A100 GPU and took ~2-4 hours for each task.
Run the following command to perform goal-directed lead optimization:
python scripts/exps/lead/run.py -o ${oracle_name} -i ${start_mol_idx} -d ${sim_threshold}
The generated molecules will be saved in scripts/exps/lead/results.
Run the following command to evaluate the result:
python scripts/exps/lead/eval.py ${file_name}
# e.g., python scripts/exps/lead/eval.py scripts/exps/lead/results/parp1_id0_thr0.4_0.csv
The experiment in the paper used 1 NVIDIA A100 GPU and took ~10 min for each task.
GenMol V2: GenMol with Extended SAFE Syntax (Angle Brackets for Inter-Fragment Attachment Points)
GenMol V2 introduces an extended SAFE syntax that uses angle brackets for inter-fragment attachment points. This change improves performance on specific tasks, particularly one-step linker design.
Following SAFE-GPT, GenMol performs two-step linker design in fragment-constrained generation: a molecule is generated from each of the two given fragments separately, and the two results are then combined into a single molecule. However, users may prefer one-step generation that conditions on the context of both fragments at the same time.
While GenMol shows versatile performance across tasks, it shows low validity on some of them, especially one-step linker design. We attribute this to the standard SAFE syntax, in which intra-fragment attachment points (linked atoms within the same fragment) and inter-fragment attachment points (links between two fragments) are not easy to distinguish.
To this end, we propose an extended SAFE syntax that uses angle brackets to distinguish intra-fragment attachment points from inter-fragment attachment points.
For example, a SAFE string
X1XXX1X2.X2X3XXXX3X4.X5XX5X4
has 1, 3, 5 as its intra-fragment attachment points, while 2 and 4 are inter-fragment attachment points. With the extended syntax it becomes:
X1XXX1X<1>.X<1>X1XXXX1X<2>.X1XX1X<2>
In this way, the links within a fragment (i.e., 1, 2, ...) are independent of the links crossing fragments (i.e., <1>, <2>, ...), and the model can learn how to complete a SAFE string more efficiently.
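To make the relabeling concrete, below is a toy sketch (our own illustration, not part of the GenMol code) that converts the placeholder example above into the extended syntax, assuming single-digit attachment-point indices:

```python
# Toy illustration of the extended SAFE syntax on the placeholder example above.
# Assumes single-digit attachment-point indices; the actual training pipeline
# (training.use_bracket_safe=true) handles real SAFE strings.
import re
from collections import defaultdict

def to_extended_safe(safe_str):
    fragments = safe_str.split(".")
    # Record which fragments each attachment index appears in.
    occurrences = defaultdict(set)
    for i, frag in enumerate(fragments):
        for d in re.findall(r"\d", frag):
            occurrences[d].add(i)
    # Indices shared across fragments are inter-fragment -> angle-bracket labels.
    inter = sorted(d for d, frags in occurrences.items() if len(frags) > 1)
    inter_label = {d: f"<{k}>" for k, d in enumerate(inter, start=1)}
    out = []
    for frag in fragments:
        # Intra-fragment indices are renumbered from 1 within each fragment.
        intra = sorted(set(re.findall(r"\d", frag)) - set(inter_label))
        intra_label = {d: str(k) for k, d in enumerate(intra, start=1)}
        relabel = lambda m: inter_label.get(m.group(), intra_label[m.group()])
        out.append(re.sub(r"\d", relabel, frag))
    return ".".join(out)

print(to_extended_safe("X1XXX1X2.X2X3XXXX3X4.X5XX5X4"))
# -> X1XXX1X<1>.X<1>X1XXXX1X<2>.X1XX1X<2>
```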
GenMol V2, trained with the extended SAFE syntax, shows significantly improved performance on de novo and fragment-constrained generation! On goal-directed hit generation and lead optimization, GenMol V2 performs slightly worse than GenMol. This is because GenMol performs fragment remasking in these tasks, which changes only a small part of the molecular sequence and therefore does not benefit from the extended SAFE syntax.
De novo generation:
| Model | Validity (%) | Uniqueness (%) | Quality (%) | Diversity |
|---|---|---|---|---|
| GenMol | 100.0 | 99.7 | 84.6 | 0.818 |
| GenMol V2 | 100.0 | 97.8 | 89.7 | 0.830 |
Fragment-constrained generation:
| Model | Task | Validity (%) | Uniqueness (%) | Quality (%) | Diversity | Distance |
|---|---|---|---|---|---|---|
| GenMol | Linker design (1-step) | 16.7 | 93.9 | 4.3 | 0.529 | 0.573 |
| | Linker design | 100.0 | 83.7 | 21.9 | 0.547 | 0.563 |
| | Motif extension | 82.9 | 77.5 | 30.1 | 0.617 | 0.682 |
| | Scaffold decoration | 96.6 | 82.7 | 31.8 | 0.591 | 0.651 |
| | Superstructure generation | 97.5 | 83.6 | 34.8 | 0.599 | 0.762 |
| GenMol V2 | Linker design (1-step) | 81.8 | 87.1 | 28.6 | 0.566 | 0.545 |
| | Linker design | 100.0 | 76.6 | 18.4 | 0.512 | 0.539 |
| | Motif extension | 99.4 | 84.5 | 49.0 | 0.626 | 0.659 |
| | Scaffold decoration | 99.2 | 90.5 | 39.7 | 0.571 | 0.604 |
| | Superstructure generation | 99.7 | 89.8 | 39.0 | 0.551 | 0.769 |
Goal-directed hit generation:
| Model | PMO Sum Score |
|---|---|
| GenMol | 18.362 |
| GenMol V2 | 17.943 |
Goal-directed lead optimization:
| Model | Success rate (%) |
|---|---|
| GenMol | 86.7 |
| GenMol V2 | 80.0 |
We provide the trained GenMol V2 checkpoint. Place model_v2.ckpt in the checkpoints directory and set the correct information in ./configs/base.yaml.
(Optional) To train GenMol V2 from scratch, run the following command:
torchrun --nproc_per_node ${num_gpu} scripts/train.py hydra.run.dir=${save_dir} wandb.name=${exp_name} loader.global_batch_size=1024 training.use_bracket_safe=true
The training used 8 NVIDIA A100 GPUs.
Run the following command to perform de novo generation using GenMol V2:
python scripts/exps/denovo/run.py -c scripts/exps/frag/hparams_v2.yaml
Run the following command to perform fragment-constrained generation using GenMol V2:
python scripts/exps/frag/run.py -c scripts/exps/frag/hparams_v2.yaml
Copyright © 2025, NVIDIA Corporation. All rights reserved.
The source code is made available under Apache-2.0.
The model weights are made available under the NVIDIA Open Model License.
If you find this repository and our paper useful, we kindly request that you cite our work:
@article{lee2025genmol,
title = {GenMol: A Drug Discovery Generalist with Discrete Diffusion},
author = {Lee, Seul and Kreis, Karsten and Veccham, Srimukh Prasad and Liu, Meng and Reidenbach, Danny and Peng, Yuxing and Paliwal, Saee and Nie, Weili and Vahdat, Arash},
journal = {International Conference on Machine Learning},
year = {2025}
}