Generate protein sequences with protein descriptions (and protein sequence fragments)
To generate protein sequences one at a time or in batches, refer to gen_single_seq.py and gen_batch_seqs.py, respectively.
Example protein sequences and text descriptions are in the data directory. For example:
Description: FUNCTION: Component of the acetyl coenzyme A carboxylase complex. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the AccA family.
Sequence: MAVSDRKLQLLDFEKPLAELEDRIEQIRSLSEQNGVDVTDQIAQLEGRAEQLRQEIFSSLTPMQELQLARHPRRPSTLDYIHAISDEWMELHGDRRGYDDPAIVGGVGRIGGQPVLMLGHQKGRDTKDNVARNFGMPFPSGYRKAMRL...
In the generation code, the value passed as seq determines whether the process is guided solely by text or by a combination of text and sequence. Use one of the following:
seq=None, # Only protein descriptions guide the generation process
seq=tokenized_seqs['input_ids'][...,:1].to(device), # Both sequence fragments and descriptions guide the generation process
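The choice between the two seq values can be sketched in plain Python. This is only an illustration, not code from the repository: build_seq_prompt, n_prefix, and the token ids are hypothetical, and nested lists stand in for the tensor that tokenized_seqs['input_ids'] holds. The slice [..., :1] above keeps the first token of each sequence. The sketch assumes that a wider slice would pass a longer sequence fragment as the prompt.

```python
from typing import List, Optional


def build_seq_prompt(input_ids: List[List[int]], n_prefix: int) -> Optional[List[List[int]]]:
    """Return the per-sequence token prefix used as a prompt.

    Hypothetical helper: returns None when n_prefix <= 0, which mirrors
    seq=None (descriptions alone guide generation). Otherwise it keeps the
    first n_prefix tokens of every row, like input_ids[..., :n_prefix].
    """
    if n_prefix <= 0:
        return None
    return [row[:n_prefix] for row in input_ids]


# Made-up token ids for a batch of two tokenized sequences.
batch = [[0, 20, 5, 23, 8], [0, 20, 6, 11, 4]]

text_only = build_seq_prompt(batch, 0)    # corresponds to seq=None
first_token = build_seq_prompt(batch, 1)  # corresponds to [..., :1]
fragment = build_seq_prompt(batch, 3)     # assumed: a longer fragment as prompt
```

In the actual scripts the prompt is a tensor moved to the model's device with .to(device); the list version here only shows which tokens are kept.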
Train new model based on ProtDAT
You can build a custom protein text-sequence dataset that follows a specific description pattern and train a model on it using the architecture in Decoder.py.
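As a rough sketch of preparing such a dataset, the snippet below pairs descriptions with sequences and drops entries that are empty or contain characters outside the 20 standard amino acid letters. The README does not specify the dataset format or any filtering rules, so ProteinRecord, build_records, and the filtering policy are assumptions for illustration only. The description pattern is borrowed from the example above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass(frozen=True)
class ProteinRecord:
    """Hypothetical container for one text-sequence training pair."""
    description: str
    sequence: str


def build_records(pairs: Iterable[Tuple[str, str]]) -> List[ProteinRecord]:
    """Normalize (description, sequence) pairs and drop unusable ones."""
    records = []
    for description, sequence in pairs:
        description = " ".join(description.split())  # collapse whitespace
        sequence = sequence.strip().upper()
        if not description or not sequence:
            continue  # assumed policy: both fields are required
        if not set(sequence) <= STANDARD_AMINO_ACIDS:
            continue  # assumed policy: skip non-standard residues
        records.append(ProteinRecord(description, sequence))
    return records


pairs = [
    ("FUNCTION: Component of the acetyl coenzyme A carboxylase complex.  "
     "SUBCELLULAR LOCATION: Cytoplasm.", "mavsdrklqlldfekplael"),
    ("FUNCTION: Placeholder entry.", "MAX1"),  # dropped: non-standard characters
    ("", "MKT"),                               # dropped: empty description
]
records = build_records(pairs)
```

The resulting records would still need to be tokenized in whatever way the training code expects before being fed to the model.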
Citations
If you find ProtDAT useful, cite the relevant paper:
@article{guo2024protdat,
title={ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description},
author={Guo, Xiao-Yu and Li, Yi-Fan and Liu, Yuan and Pan, Xiaoyong and Shen, Hong-Bin},
journal={arXiv preprint arXiv:2412.04069},
year={2024}
}
License
Code License
The ProtDAT source code is licensed under CC BY-NC 4.0.
The ESM1b model can be found at ESM1b and is released under the MIT license.