MolRep is a Python package for fairly measuring algorithmic progress on chemical property datasets. It currently provides a complete re-evaluation of 16 state-of-the-art deep representation models over 16 benchmark property datasets.
If you find this package useful, please cite our papers (MolRep and Mol-XAI) for now:
```
@article{rao2021molrep,
  title={MolRep: A Deep Representation Learning Library for Molecular Property Prediction},
  author={Rao, Jiahua and Zheng, Shuangjia and Song, Ying and Chen, Jianwen and Li, Chengtao and Xie, Jiancong and Yang, Hui and Chen, Hongming and Yang, Yuedong},
  journal={bioRxiv},
  year={2021},
  publisher={Cold Spring Harbor Laboratory}
}

@article{rao2021quantitative,
  title={Quantitative Evaluation of Explainable Graph Neural Networks for Molecular Property Prediction},
  author={Rao, Jiahua and Zheng, Shuangjia and Yang, Yuedong},
  journal={arXiv preprint arXiv:2107.04119},
  year={2021}
}
```
We provide a script to install the environment. You will need the conda package manager, which can be installed from here.
To install the required packages, follow these instructions (tested on a Linux terminal):
- Clone the repository: `git clone https://github.com/biomed-AI/MolRep`
- `cd` into the cloned directory: `cd MolRep`
- Run the install script: `source install.sh`
Here `<your_conda_path>` is your conda path, and `<CUDA_VERSION>` is an optional argument that can be one of `cpu`, `cu92`, `cu100`, `cu101`, or `cu110`. If you do not provide a CUDA version, the script defaults to `cu110`. The script creates a virtual environment named `MolRep` with all the packages needed to run our code. Important: do NOT run this command with `bash` instead of `source`!
Data (including the explainability datasets) can be downloaded from Google Drive.
[!NEWS] The human experiments for the explainability task (molecules and results) are available Here.
| Dataset | Task | Task type | #Molecule | Splits | Metric | Reference |
|---|---|---|---|---|---|---|
| QM7 | 1 | Regression | 7160 | Stratified | MAE | Wu et al. |
| QM8 | 12 | Regression | 21786 | Random | MAE | Wu et al. |
| QM9 | 12 | Regression | 133885 | Random | MAE | Wu et al. |
| ESOL | 1 | Regression | 1128 | Random | RMSE | Wu et al. |
| FreeSolv | 1 | Regression | 642 | Random | RMSE | Wu et al. |
| Lipophilicity | 1 | Regression | 4200 | Random | RMSE | Wu et al. |
| BBBP | 1 | Classification | 2039 | Scaffold | ROC-AUC | Wu et al. |
| Tox21 | 12 | Classification | 7831 | Random | ROC-AUC | Wu et al. |
| SIDER | 27 | Classification | 1427 | Random | ROC-AUC | Wu et al. |
| ClinTox | 2 | Classification | 1478 | Random | ROC-AUC | Wu et al. |
| Liver injury | 1 | Classification | 2788 | Random | ROC-AUC | Xu et al. |
| Mutagenesis | 1 | Classification | 6511 | Random | ROC-AUC | Hansen et al. |
| hERG | 1 | Classification | 4813 | Random | ROC-AUC | Li et al. |
| MUV | 17 | Classification | 93087 | Random | PRC-AUC | Wu et al. |
| HIV | 1 | Classification | 41127 | Random | ROC-AUC | Wu et al. |
| BACE | 1 | Classification | 1513 | Random | ROC-AUC | Wu et al. |
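The metrics in the table above are the standard scikit-learn metrics. As an illustration only (the labels and predictions below are toy values, not MolRep outputs), they can be computed as:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             roc_auc_score, average_precision_score)

# Toy regression targets/predictions, purely illustrative.
y_reg_true = np.array([0.5, 1.2, -0.3])
y_reg_pred = np.array([0.4, 1.0, -0.1])
mae = mean_absolute_error(y_reg_true, y_reg_pred)           # QM7/QM8/QM9
rmse = np.sqrt(mean_squared_error(y_reg_true, y_reg_pred))  # ESOL/FreeSolv/Lipophilicity

# Toy binary labels and predicted scores.
y_cls_true = np.array([0, 1, 1, 0, 1])
y_cls_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9])
roc_auc = roc_auc_score(y_cls_true, y_cls_score)            # BBBP, Tox21, HIV, ...
prc_auc = average_precision_score(y_cls_true, y_cls_score)  # MUV
```

PRC-AUC (here via average precision) is preferred for MUV because its labels are extremely imbalanced, where ROC-AUC can be misleadingly high.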
| Methods | Descriptions | Reference |
|---|---|---|
| Mol2Vec | Mol2Vec is an unsupervised approach that learns vector representations of molecular substructures, such that chemically related substructures point in similar directions. | Jaeger et al. |
| N-Gram graph | N-gram graph is a simple unsupervised representation for molecules that first embeds the vertices in the molecular graph and then constructs a compact representation for the graph by assembling the vertex embeddings along short walks in the graph. | Liu et al. |
| FP2Vec | FP2Vec is a molecular featurizer that represents a chemical compound as a set of trainable embedding vectors and combines them with a CNN model. | Jeon et al. |
| VAE | VAE is a framework for training two neural networks (an encoder and a decoder) to learn a mapping from a high-dimensional molecular representation into a lower-dimensional space. | Kingma et al. |
| Methods | Descriptions | Reference |
|---|---|---|
| BiLSTM | BiLSTM is a bidirectional recurrent neural network (RNN) architecture for encoding sequences from compound SMILES strings. | Hochreiter et al. |
| SALSTM | SALSTM combines a self-attention mechanism with an improved BiLSTM for molecule representation. | Zheng et al. |
| Transformer | Transformer is a network based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, that encodes compound SMILES strings. | Vaswani et al. |
| MAT | MAT is a molecule attention transformer that uses inter-atomic distances and the molecular graph structure to augment the attention mechanism. | Maziarka et al. |
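The attention-based models above all build on scaled dot-product attention. A minimal NumPy sketch (toy token embeddings standing in for SMILES tokens, not the MolRep implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

# Toy embeddings: 3 "SMILES tokens" of dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(X, X, X)           # self-attention over tokens
```

Each output row is a convex combination of the value rows, so every token's representation mixes in information from every other token in the string.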
| Methods | Descriptions | Reference |
|---|---|---|
| DGCNN | DGCNN is a deep graph convolutional neural network with a SortPooling layer that sorts graph vertices in a consistent order to learn the embedding of a molecular graph. | Zhang et al. |
| GraphSAGE | GraphSAGE is a framework for inductive representation learning on molecular graphs that generates low-dimensional representations for atoms and performs sum, mean, or max-pooling neighborhood aggregation to update the atom and molecular representations. | Hamilton et al. |
| GIN | GIN is the Graph Isomorphism Network, which addresses the limitations of GraphSAGE in capturing different graph structures using the Weisfeiler-Lehman graph isomorphism test. | Xu et al. |
| ECC | ECC is an Edge-Conditioned Convolution Network that learns a different parameter for each edge label (bond type) on the molecular graph, and neighbor aggregation is weighted according to specific edge parameters. | Simonovsky et al. |
| DiffPool | DiffPool combines a differentiable graph encoder with an adaptive pooling mechanism that collapses nodes on the basis of a supervised criterion to learn the representation of molecular graphs. | Ying et al. |
| MPNN | MPNN is a message-passing graph neural network that learns the representation of a compound's molecular graph, focusing mainly on obtaining effective vertex (atom) embeddings. | Gilmer et al. |
| D-MPNN | D-MPNN is another message-passing graph neural network that passes messages along directed edges (bonds) rather than vertices, allowing it to make use of bond attributes. | Yang et al. |
| CMPNN | CMPNN is a graph neural network that improves the molecular graph embedding by strengthening the message interactions between edges (bonds) and nodes (atoms). | Song et al. |
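The graph models above share a common message-passing pattern: each atom aggregates its bonded neighbours' features and updates its own, and a pooling readout produces the molecule-level embedding. A minimal NumPy sketch (toy molecule and a fixed illustrative update rule, not the MolRep code):

```python
import numpy as np

def message_passing_step(A, H, W):
    """One message-passing round: aggregate neighbour features, then update.

    A: (n, n) adjacency matrix of the molecular graph (bonds)
    H: (n, d) atom feature matrix
    W: (d, d) weight matrix (learnable in practice; fixed here for illustration)
    """
    messages = A @ H                          # sum features of bonded neighbours
    return np.maximum(H + messages @ W, 0.0)  # residual update with a ReLU

# Toy 3-atom molecule: a chain 0-1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3)                                 # one-hot initial atom features
W = 0.5 * np.eye(3)

H1 = message_passing_step(A, H, W)
graph_embedding = H1.sum(axis=0)              # sum-pooling readout over atoms
```

D-MPNN-style models change only the aggregation: messages flow along directed bonds instead of atoms, so bond attributes can enter the message function.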
To train a model with k-fold cross-validation, run `5-fold-training_example.ipynb`.
To test a pretrained model, run `testing-example.ipynb`.
To explain the GNN model, run `Explainer_Experiments.py`.
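The 5-fold protocol in the training notebook follows the usual cross-validation pattern. A hedged sketch using scikit-learn on placeholder data (the features, labels, and estimator below are stand-ins, not MolRep models):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in molecular features and a stand-in binary property.
X = np.random.default_rng(42).normal(size=(100, 8))
y = (X[:, 0] > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # A real run would fit a MolRep model here instead.
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], proba))

print(f"5-fold ROC-AUC: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting the mean and standard deviation across folds is what makes cross-dataset comparisons between representation models fair.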
More results will be updated soon.
