Code for "Self-supervised machine learning methods for protein design improve sampling, but not the identification of high-fitness variants"
This repo contains the code for reproducing the results of the publication "Self-supervised machine learning methods for protein design improve sampling, but not the identification of high-fitness variants" (link to preprint). If you instead want to use the implemented features in your own work, an overview can be found in the Rosetta documentation here, and a tutorial is available from the Meiler Rosetta workshop 2023, "Tutorial 2: Machine Learning in Rosetta".
Running the different design protocols
Code for running the different design protocols can be found in the folder of each dataset, e.g. emi/avg03.sh. All scripts use the RosettaScripts XML files provided in the main folder, which are named after the protocols shown in the paper.
Sequences and metrics of resulting designs
The unique sequences and calculated metrics of each design protocol are available in the dataset folders ("dataset/dataset_designs.csv"), e.g. emi/emi_designs.csv.
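As a quick way to inspect one of these files outside the notebooks, the following minimal sketch reads a designs CSV with the Python standard library only. The column names are not documented in this README, so the sketch prints the header rather than assuming one; the helper names load_designs and count_by_column are hypothetical and not part of this repo.

```python
import csv
from collections import Counter
from pathlib import Path


def load_designs(csv_path):
    """Read a designs CSV into a list of dicts, one dict per design."""
    with Path(csv_path).open(newline="") as handle:
        return list(csv.DictReader(handle))


def count_by_column(rows, column):
    """Count how many designs share each value of `column`.

    `column` must be a name that actually appears in the CSV header;
    check the printed header first, since the names are not assumed here.
    """
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    # Path taken from the README example; adjust for other datasets.
    designs_path = Path("emi/emi_designs.csv")
    if designs_path.exists():
        rows = load_designs(designs_path)
        header = list(rows[0].keys()) if rows else []
        print(f"{len(rows)} designs, columns: {header}")
```

Once the header is known, count_by_column(rows, "<column name>") gives a per-value tally, for example the number of designs per protocol if such a column exists.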
Analysis of designs
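The code for analyzing the resulting designs and reproducing the figures can be found in the design_analysis.ipynb notebook. To run the Jupyter notebooks, first create a Python environment from the environment.yaml file with either conda or mamba, e.g. conda env create -f environment.yaml (with mamba, replace conda with mamba), and then activate the environment named in that file.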
The code for training and evaluating the oracle models for each dataset can be found in the model_training.ipynb notebook. The datasets used for training can be found in each dataset folder, e.g. gb1/gb1_mutations_full_data.csv. The already trained models are also available, e.g. gb1/gb1_ridge.joblib.
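A pretrained model file can be loaded with joblib, as in the minimal sketch below. This assumes joblib is installed in the environment and that the library versions match those used for training; the helper name load_oracle is hypothetical. The type of the stored object and the input featurization it expects are not described in this README, so refer to model_training.ipynb before calling the model on new sequences.

```python
from pathlib import Path

import joblib


def load_oracle(model_path):
    """Load a serialized object from a .joblib file.

    joblib.load unpickles the file, so only load files you trust.
    """
    return joblib.load(Path(model_path))


if __name__ == "__main__":
    # Path taken from the README example; adjust for other datasets.
    model_path = Path("gb1/gb1_ridge.joblib")
    if model_path.exists():
        oracle = load_oracle(model_path)
        # Print the object type instead of assuming a specific estimator.
        print(type(oracle).__name__)
```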