ProtDAT

A Unified Framework for Protein Sequence Design from Any Protein Text Description

Preparation

ProtDAT is implemented in Python 3 (>=3.9). We recommend installing the dependencies in a virtual environment. First create the environment and activate it:

conda create -n ProtDAT python=3.9

conda activate ProtDAT

Next, install PyTorch by selecting the version that matches your system at https://pytorch.org/. Then install the remaining requirements:

pip install -r requirements.txt
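
With the environment active, a quick sanity check can confirm that the interpreter meets the version requirement and that PyTorch is importable. This is only a sketch and is not part of the repository; the function name environment_report is hypothetical.

```python
import importlib.util
import sys


def environment_report(min_version=(3, 9)):
    """Return a small dict describing whether the environment looks usable."""
    return {
        # ProtDAT states Python >= 3.9 as its requirement.
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when the package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


if __name__ == "__main__":
    print(environment_report())
```

If python_ok or torch_installed is False, the most likely cause is that the commands were run outside the ProtDAT environment.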

Download models and files

Before using ProtDAT, complete the following steps:

Download the ESM1b and PubMedBERT models and place them in the esm1b and pubmedbert subfolders of the model directory.
Download the ProtDAT model weight file state_dict.pth and the datasets.
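
The expected layout can be checked before running any generation script. The sketch below is illustrative only: the subfolder names come from the steps above, but this README does not say where state_dict.pth belongs, so placing it inside the model directory is an assumption, and missing_model_files is a hypothetical helper.

```python
from pathlib import Path

# Subfolder names come from the README. The location of state_dict.pth is
# NOT stated there; putting it under the model directory is an assumption.
EXPECTED = ("esm1b", "pubmedbert", "state_dict.pth")


def missing_model_files(model_dir="model"):
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_model_files()
    print("All files found." if not missing else f"Missing: {missing}")
```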

Usage

Generate protein sequences with protein descriptions (and protein sequence fragments)

To generate protein sequences one at a time or in batches, see gen_single_seq.py and gen_batch_seqs.py, respectively.

Example protein sequences and text descriptions are provided in the data directory. For example:

Description: FUNCTION: Component of the acetyl coenzyme A carboxylase complex. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the AccA family.
Sequence: MAVSDRKLQLLDFEKPLAELEDRIEQIRSLSEQNGVDVTDQIAQLEGRAEQLRQEIFSSLTPMQELQLARHPRRPSTLDYIHAISDEWMELHGDRRGYDDPAIVGGVGRIGGQPVLMLGHQKGRDTKDNVARNFGMPFPSGYRKAMRL...
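
The description in this example is a run of labelled fields (FUNCTION, SUBCELLULAR LOCATION, SIMILARITY). When preparing or inspecting descriptions, it can help to split them into those fields. The sketch below assumes every label is written in upper case and followed by a colon, as in the example; split_description is a hypothetical helper, not a function from the repository, and it would mis-split free text that itself contains an upper-case word followed by a colon.

```python
import re

# Assumption: a field label is a run of upper-case words followed by a colon,
# as in the example above (FUNCTION, SUBCELLULAR LOCATION, SIMILARITY).
_LABEL = re.compile(r"\b([A-Z][A-Z ]*[A-Z]):\s*")


def split_description(text):
    """Split a labelled description into a {label: content} dict."""
    fields = {}
    matches = list(_LABEL.finditer(text))
    for i, match in enumerate(matches):
        # A field's content runs until the next label or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[match.group(1)] = text[match.end():end].strip()
    return fields


if __name__ == "__main__":
    example = (
        "FUNCTION: Component of the acetyl coenzyme A carboxylase complex. "
        "SUBCELLULAR LOCATION: Cytoplasm. "
        "SIMILARITY: Belongs to the AccA family."
    )
    for label, content in split_description(example).items():
        print(f"{label} -> {content}")
```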

In the generation code, the seq argument determines whether generation is guided by the text description alone or by the description together with a sequence fragment:

seq=None,                                           # Only protein descriptions guide the generation process
seq=tokenized_seqs['input_ids'][...,:1].to(device), # Both sequence fragments and descriptions guide the generation process
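
The slice [...,:1] keeps only the first token of every tokenized sequence in the batch. To make that concrete without loading any model, the sketch below mimics the slice on plain nested lists. It is an illustration under assumptions: the token ids are made up, sequence_prompt is a hypothetical helper, and whether the generation scripts accept a prefix longer than one token is not confirmed by this README.

```python
def sequence_prompt(input_ids, prefix_len=None):
    """Mimic tensor[..., :prefix_len] on a batch given as nested lists.

    prefix_len=None returns None, matching seq=None (text-only guidance).
    prefix_len=1 keeps only the first token id of every sequence.
    """
    if prefix_len is None:
        return None
    if prefix_len < 1:
        raise ValueError("prefix_len must be at least 1")
    return [row[:prefix_len] for row in input_ids]


if __name__ == "__main__":
    # Made-up token ids standing in for tokenized_seqs['input_ids'].
    batch = [[0, 20, 5, 23], [0, 7, 7, 12]]
    print(sequence_prompt(batch))     # text-only guidance
    print(sequence_prompt(batch, 1))  # first token of each sequence
```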

Train new model based on ProtDAT

You can build a custom protein text-sequence dataset that follows a specific pattern and train a new model on it using the architecture in Decoder.py.
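
The pattern a custom dataset should follow is not spelled out here, so the following is only a sketch of one possible cleaning step before training. The record field names, the restriction to the 20 standard amino acid letters, and the helper build_records are all assumptions made for illustration.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def build_records(pairs):
    """Keep (description, sequence) pairs that look usable for training."""
    records = []
    for description, sequence in pairs:
        description = " ".join(description.split())  # collapse whitespace
        sequence = sequence.strip().upper()
        if not description or not sequence:
            continue  # skip pairs with an empty side
        if set(sequence) - STANDARD_AA:
            continue  # skip sequences with non-standard residue letters
        records.append({"description": description, "sequence": sequence})
    return records


if __name__ == "__main__":
    raw = [
        ("SUBCELLULAR LOCATION:  Cytoplasm.", "mavsdrklq"),
        ("", "MKV"),                        # dropped: no description
        ("FUNCTION: Unknown.", "MKB"),      # dropped: B is not standard
    ]
    print(build_records(raw))
```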

Citations

If you find ProtDAT useful, please cite the following paper:

@article{guo2024protdat,
  title={ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description},
  author={Guo, Xiao-Yu and Li, Yi-Fan and Liu, Yuan and Pan, Xiaoyong and Shen, Hong-Bin},
  journal={arXiv preprint arXiv:2412.04069},
  year={2024}
}

License

Code License

The ProtDAT source code is licensed under CC BY-NC 4.0.
The ESM1b model can be found at ESM1b and is under the MIT license.
The PubMedBERT model can be found at PubMedBERT and is under the Apache License 2.0.

Model Parameters License

The ProtDAT parameters are made available under a Creative Commons Attribution 4.0 International License.

Repository: GXY0116/ProtDAT
