Molecule-RNN is a recurrent neural network built with PyTorch to generate molecules for drug discovery. It learns the distribution of the training dataset and samples from that distribution, so the generated molecules follow a distribution similar to that of the training data.
Tokenization of SMILES
There are different ways to tokenize SMILES; three of them are implemented in this project:
Character-level tokenization, which is a naive way to tokenize SMILES. In this scheme, every character is treated as a single token, except for two-character elements such as Al and Br, which are kept as one token each.
Regular expression-based tokenization. In this scheme, each pair of square brackets together with its contents, [*], is also treated as a single token.
SELFIES tokenization. SELFIES (Self-Referencing Embedded Strings) is a 100% robust molecular string representation, meaning that every SELFIES string corresponds to a valid molecule. See details here.
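The first two schemes can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual tokenizer: the function names, the element list (Cl is added as an assumption next to the Al and Br mentioned above), and the regular expression are all hypothetical.

```python
import re

# Two-character element symbols kept whole by the character-level scheme.
# Al and Br come from the description above; Cl is an assumption.
TWO_CHAR_ELEMENTS = ("Al", "Br", "Cl")


def char_tokenize(smiles):
    """Character-level sketch: one token per character, except two-character elements."""
    tokens = []
    i = 0
    while i < len(smiles):
        pair = smiles[i:i + 2]
        if pair in TWO_CHAR_ELEMENTS:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


# Hypothetical pattern: a whole bracket atom first, then Br/Cl, then any single character.
REGEX_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def regex_tokenize(smiles):
    """Regex-based sketch: like the character-level scheme, but [...] is one token."""
    return REGEX_PATTERN.findall(smiles)


if __name__ == "__main__":
    print(char_tokenize("[NH3+]CC(=O)[O-]"))
    print(regex_tokenize("[NH3+]CC(=O)[O-]"))
    # regex output: ['[NH3+]', 'C', 'C', '(', '=', 'O', ')', '[O-]']
```

The practical difference is vocabulary size versus sequence length: the regex scheme gives shorter sequences but needs one vocabulary entry for every distinct bracket atom seen in the dataset.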
Dataset
The chembl28 dataset is used. It is under ./dataset.
Training
Set the out_dir in train.yaml as the directory where you want to store output results.
Set which_vocab and vocab_path in train.yaml to specify which tokenization scheme to use. The pre-computed vocabularies are at ./vocab.
Tweak other hyperparameters in train.yaml if you like (the default settings work).
Run the training script.
python train.py
Sampling
The trained model will be saved in the out_dir directory. We can generate molecules by sampling the trained model according to the output distribution. If the -result_dir is not specified, the out_dir in train.yaml will be used.
python sample.py -result_dir your_output_dir
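Sampling "according to the output distribution" means drawing each next token at random with the probabilities the model assigns, rather than always taking the most likely token. A minimal pure-Python sketch of that loop is shown below. It is not the code in sample.py: the `<sos>` and `<eos>` token names, the `sample_sequence` function, and the `toy_model` stand-in for the trained RNN are all assumptions made for illustration.

```python
import random


def sample_sequence(next_token_probs, start_token="<sos>", end_token="<eos>",
                    max_len=100, rng=None):
    """Draw one token sequence autoregressively.

    next_token_probs: callable mapping the tokens generated so far to a
    dict {token: probability} for the next position (the model's output).
    """
    rng = rng or random.Random()
    tokens = [start_token]
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        # Sample in proportion to the predicted probabilities (no argmax).
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def toy_model(prefix):
    """Hypothetical stand-in for the trained RNN: a fixed toy distribution."""
    if len(prefix) >= 4:  # start token plus three generated tokens
        return {"<eos>": 1.0}
    return {"C": 0.6, "O": 0.3, "<eos>": 0.1}


if __name__ == "__main__":
    print("".join(sample_sequence(toy_model, rng=random.Random(0))))
```

Because each token is drawn at random, repeated calls produce different molecules, which is what lets the sampled set reflect the distribution learned from the training data.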
The default setting yields a valid rate of over 80% for character-level and regex-based tokenization, and 99.9% for SELFIES tokenization. Here are examples of some sampled molecules:
TODOs
Beam search sampling is currently not supported because of the lengths of the sequences. Feel free to open a PR or an issue if you have ideas for searching for molecules with high probabilities. :)
Introduce reinforcement learning, which can bias the model toward molecules with desired chemical or spatial properties.
About
A recurrent neural network (RNN) that generates drug-like molecules for drug discovery.