Prediction of 8-state protein secondary structures by a novel deep learning architecture

Buzhong Zhang, Jinyan Li and Qiang Lü*

Published in BMC Bioinformatics (2018) 19:293

Abstract

Protein secondary structure can be regarded as a bridge that links the primary sequence and the tertiary structure. Accurate secondary structure prediction makes the analysis of structure-based properties more precise and higher in resolution. A faster and more accurate protein secondary structure prediction tool, CRRNN (eCRRNN), is provided here.

Supplementary online materials

Software:

  1. The prediction models of eCRRNN (standalone version) can be downloaded here (Q8 prediction, Q3 prediction). Please note that to run the predictor, you need to install the following software in addition to ours:

1.1 Linux

1.2 Python 2.7

1.3 Keras 2.1.4 and TensorFlow 1.13

1.4 BLAST 2.2.28 for preparing the PSSM feature set

Please follow the README in our software package to prepare the input features and run our predictor. Script files are provided to demonstrate how to run our model.
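
As a rough illustration of the PSSM preparation step, the sketch below builds a PSI-BLAST command line in Python. It is not taken from our package: the helper name `build_psiblast_command`, the file names, the database name, and the iteration count and E-value are all assumptions chosen for the example. Only the `psiblast` options themselves (`-query`, `-db`, `-num_iterations`, `-evalue`, `-out_ascii_pssm`) are standard BLAST+ options. The README in the package remains the reference for the actual settings.

```python
def build_psiblast_command(fasta_path, db_path, pssm_out_path,
                           iterations=3, evalue=0.001):
    """Return a psiblast argument list that writes an ASCII PSSM.

    ASSUMPTION: iterations and evalue are illustrative defaults,
    not the settings used by the eCRRNN package.
    """
    return [
        "psiblast",
        "-query", fasta_path,              # single-sequence FASTA file
        "-db", db_path,                    # formatted protein database
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),
        "-out_ascii_pssm", pssm_out_path,  # PSSM written as plain text
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    cmd = build_psiblast_command("example.fasta", "uniref90", "example.pssm")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.check_call` once BLAST and a formatted sequence database are installed.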

Data files:

  1. Sequences and labels of TR6614 are also provided.
    Our training sets are generated from cullpdb_pc25_res3.0_R1.0_d160826_chains12665.fasta and cullpdb_pc25_res3.0_R1.0_d150314_chains9494.fasta.
  2. Sequences and labels of TR5534 are also provided. If you use this dataset, please acknowledge Jian Zhou and Olga G. Troyanskaya and, if possible, cite:
    Zhou, Jian, and O. G. Troyanskaya. "Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction." Proceedings of the 31st International Conference on Machine Learning (ICML), (2014):745-753.
  3. The test datasets used in our experiments (CASP10, CASP11 and CASP12) and the mapping vectors are provided here.
    The mapping vectors are used to prepare your own test dataset.
    CB513 is provided in fasta and label format.
    The preprocessed CB513 dataset, which was converted from Jian Zhou's dataset, can be downloaded here. Example code is also provided here.

    The CASP data format is: sequences residues features, labels. The 21-dim features are the 20 PSSM values plus the residue.
    The PSSM column order is: A R N D C Q E G H I L K M F P S T W Y V
    The input data of eCRRNN are "sequences residues features". The input features are: 20-dim PSSM, 7-dim physical properties, 1-dim conservation score, and 22-dim protein encoding.
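
To make the feature layout concrete, here is a minimal sketch of how the four feature groups listed above could be concatenated into one vector per residue (20 + 7 + 1 + 22 = 50 values). It is an illustration, not the code in our package. The function names, the concatenation order (taken from the order of the listing), and the 22-symbol alphabet (20 standard residues plus `X` for unknown and `-` for padding) are assumptions made for the example.

```python
# Column order of the PSSM as listed above.
PSSM_ORDER = "ARNDCQEGHILKMFPSTWYV"

# ASSUMPTION: the 22 symbols are the 20 standard residues plus 'X' (unknown)
# and '-' (padding). The real alphabet is defined by the eCRRNN package.
ENCODING_ALPHABET = list(PSSM_ORDER) + ["X", "-"]


def one_hot(residue):
    """Return a 22-dim one-hot vector; unrecognised symbols map to 'X'."""
    symbol = residue if residue in ENCODING_ALPHABET else "X"
    return [1.0 if symbol == letter else 0.0 for letter in ENCODING_ALPHABET]


def residue_features(residue, pssm_row, physical, conservation):
    """Concatenate PSSM (20), physical properties (7), conservation (1)
    and residue encoding (22) into one 50-dim list.

    ASSUMPTION: the order follows the listing in the text.
    """
    if len(pssm_row) != 20:
        raise ValueError("expected 20 PSSM values")
    if len(physical) != 7:
        raise ValueError("expected 7 physical property values")
    return (list(pssm_row) + list(physical)
            + [float(conservation)] + one_hot(residue))


def sequence_features(sequence, pssm, physical, conservation):
    """Build one 50-dim vector per residue of the sequence."""
    if not (len(sequence) == len(pssm) == len(physical) == len(conservation)):
        raise ValueError("all inputs must have one entry per residue")
    return [residue_features(r, p, ph, c)
            for r, p, ph, c in zip(sequence, pssm, physical, conservation)]


if __name__ == "__main__":
    # Dummy all-zero features for a two-residue sequence.
    feats = sequence_features("MK", [[0.0] * 20] * 2, [[0.0] * 7] * 2, [0.0, 0.0])
    print(len(feats), len(feats[0]))
```

The length checks are there so that a misaligned PSSM or property table fails loudly instead of being silently truncated.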

Anyone who uses this data and code is expected to cite the following paper:
Buzhong Zhang, Jinyan Li and Qiang Lü. Prediction of 8-state protein secondary structures by a novel deep learning architecture. BMC Bioinformatics (2018) 19:293.

Thank you!

If you have any suggestions or questions, please email: bzzhang@stu.suda.edu.cn