Accurate prediction of secondary structures and transmembrane segments is often the first step towards modeling the tertiary structure of a protein. Existing methods are either specialized in one class of proteins or developed to predict one class of 1D structural attributes (secondary structure, topology, or transmembrane segment). The Membrane Association and Secondary Structures of Proteins predictor, or MASSP, is a new method for simultaneous prediction of secondary structure, transmembrane segment, and membrane topology with no a priori assumption on the class of the input protein sequence. MASSP uses multi-tiered artificial neural networks that incorporate recent innovations in machine learning. The first tier is a multi-task, multi-layer convolutional neural network that learns patterns in image-like input position-specific scoring matrices and predicts residue-level 1D structural attributes. The second tier is a long short-term memory neural network that treats the predictions of the first tier from the perspective of natural language processing and predicts the class of the input protein sequence. We curated a non-redundant data set consisting of 54 bitopic, 241 multi-spanning TM-alpha, 77 TM-beta, and 372 soluble proteins for training and testing MASSP. For secondary structure prediction, the median Q3 of MASSP is 0.839, slightly better than the Q3 of PSIPRED (0.832) and that of SPINE-X (0.819) and substantially better than that of RaptorX-Property (0.750). The median segment overlap score (SOV) of MASSP is 0.766, an improvement of more than 8.6% over all three methods. For transmembrane topology prediction, MASSP has a performance comparable to OCTOPUS and substantially better than MEMSAT3 and TMHMM2 on TM-alpha proteins, and on TM-beta proteins, MASSP is significantly better than both BOCTOPUS2 and PRED-TMBB2.
By integrating prediction of secondary structure and transmembrane segments in a deep-learning framework, MASSP improves performance over previous methods, has broader applicability, and enables proteome scale predictions.
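For readers unfamiliar with the Q3 metric reported above, the following is a minimal sketch of how a per-protein Q3 and its median over a test set could be computed. It assumes the standard definition of Q3 as the fraction of residues whose predicted three-state label (H for helix, E for strand, C for coil) matches the observed label. The function names `q3_score` and `median_q3` and the example strings are hypothetical illustrations, not part of MASSP or its evaluation code.

```python
from statistics import median


def q3_score(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H, E, C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def median_q3(pairs) -> float:
    """Median Q3 over an iterable of (predicted, observed) label strings, one pair per protein."""
    return median(q3_score(pred, obs) for pred, obs in pairs)


# Hypothetical toy data: 11 residues, 2 of which are mispredicted.
observed = "CHHHHCCEEEC"
predicted = "CHHHCCCEEEE"
toy_q3 = q3_score(predicted, observed)  # 9 matching residues out of 11
```

Reporting the median rather than the mean, as the abstract does, keeps a few unusually hard or easy proteins from dominating the summary statistic.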