Proteins often interact with each other and form protein complexes to carry out various biochemical activities. Accurate prediction and identification of interaction sites from protein sequences is an intensively studied problem in the field. The problem remains challenging because of two critical issues: the difficulty of choosing an appropriate learning method, and class imbalance. In this study, we propose a deep learning method to improve the imbalanced prediction of protein interaction sites. We developed a new Simplified Long Short-Term Memory (SLSTM) network to implement a deep learning architecture (named DLPred).
To deal with imbalanced classification in the deep learning model, we explore three new ideas. First, we construct the training data as a set of whole protein sequences, rather than a set of single residues, so that the sequential completeness of each protein is retained. Second, a new penalization factor is added to the loss function so that the penalty on the non-interaction site loss is effectively strengthened. Third, multi-task learning of interaction site prediction and residue solvent accessibility prediction is used to correct the model's preference for non-interaction sites.
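As an illustration of the second and third ideas, the following is a minimal sketch of a penalized, multi-task loss computed over all residues of one protein sequence. It is not the DLPred implementation: the function names, the multiplicative form of the penalty, the squared-error term for solvent accessibility, and the default values of `penalty` and `rsa_weight` are all assumptions made for this example.

```python
import math


def weighted_bce(probs, labels, penalty=2.0, eps=1e-12):
    """Mean binary cross-entropy over the residues of one sequence.

    The non-interaction (label 0) term is scaled by `penalty`.
    The multiplicative form is an assumption for illustration only.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        pos_term = -y * math.log(p)              # interaction-site residues
        neg_term = -(1 - y) * math.log(1.0 - p)  # non-interaction residues
        total += pos_term + penalty * neg_term
    return total / len(labels)


def multitask_loss(site_probs, site_labels, rsa_pred, rsa_true,
                   penalty=2.0, rsa_weight=0.5):
    """Penalized site loss plus a weighted auxiliary solvent-accessibility loss.

    `rsa_weight` and the squared-error form are hypothetical choices.
    """
    site_loss = weighted_bce(site_probs, site_labels, penalty)
    rsa_loss = sum((a - b) ** 2 for a, b in zip(rsa_pred, rsa_true)) / len(rsa_true)
    return site_loss + rsa_weight * rsa_loss


# Toy sequence of four residues, one of which is an interaction site.
site_probs = [0.9, 0.2, 0.1, 0.3]
site_labels = [1, 0, 0, 0]
rsa_pred = [0.6, 0.1, 0.2, 0.4]
rsa_true = [0.7, 0.0, 0.2, 0.5]
print(multitask_loss(site_probs, site_labels, rsa_pred, rsa_true))
```

Because the loss is averaged over every residue of a full sequence, the sketch also mirrors the first idea: each training example is a complete protein rather than an isolated residue.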
Our model is evaluated on three public datasets: Dset186, Dtestset72 and PDBtestset164. Compared with state-of-the-art methods, DLPred significantly improves predictive accuracy and AUC values while also improving the F-measure.
Supported by Qiang Lab