Title: Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model

Link: https://doi.org/10.1371/journal.pcbi.1005324

Abstract

Motivation

Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently, exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction.

Method

This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformations of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformations of pairwise information including the output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and the complex sequence-structure relationship and thus obtain higher-quality contact prediction regardless of how many sequence homologs are available for the proteins in question.

Results

Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even though trained mostly on soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β protein of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then.


Availability

http://raptorx.uchicago.edu/ContactMap/

Author Summary

Protein contact prediction and contact-assisted folding have made good progress due to direct evolutionary coupling analysis (DCA). However, DCA is effective only on some proteins with a very large number of sequence homologs. To further improve contact prediction, we borrow ideas from deep learning, which has recently revolutionized object recognition, speech recognition and the game of Go. Our deep learning method can model the complex sequence-structure relationship and high-order correlation (i.e., contact occurrence patterns) and thus greatly improve contact prediction accuracy. Our test results show that our method greatly outperforms the state-of-the-art methods regardless of how many sequence homologs are available for the protein in question. Ab initio folding guided by our predicted contacts may fold many more test proteins than the other contact predictors. Our contact-assisted 3D models also have much better quality than homology models built from the training proteins, especially for membrane proteins. One interesting finding is that even when trained mostly with soluble proteins, our method performs very well on membrane proteins. The recent blind CAMEO test confirms that our method can fold large proteins with a new fold and only a small number of sequence homologs.

Introduction

De novo protein structure prediction from sequence alone is one of the most challenging problems in computational biology. Recent progress has indicated that some correctly-predicted long-range contacts may allow accurate topology-level structure modeling [1] and that direct evolutionary coupling analysis (DCA) of multiple sequence alignment (MSA) may reveal some long-range native contacts for proteins and protein-protein interactions with a large number of sequence homologs [2, 3]. Therefore, contact prediction and contact-assisted protein folding have recently gained much attention in the community. However, for many proteins, especially those without many sequence homologs, the contacts predicted by the state-of-the-art predictors such as CCMpred [4], PSICOV [5], Evfold [6], plmDCA [7], Gremlin [8], MetaPSICOV [9] and CoinDCA [10] are still of low quality and insufficient for accurate contact-assisted protein folding [11, 12]. This motivates us to develop a better contact prediction method, especially for proteins without a large number of sequence homologs. In this paper we define two residues as forming a contact if they are spatially proximal in the native structure, i.e., if the Euclidean distance between their Cβ atoms is less than 8Å [13].
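This contact definition can be computed directly from coordinates. A minimal sketch (assuming the Cβ coordinates have already been extracted from a structure file; the function name is illustrative):

```python
import numpy as np

def contact_map(cb_coords, cutoff=8.0):
    """Boolean L x L contact map from (L, 3) C-beta coordinates."""
    # pairwise Euclidean distances between all residue pairs
    diff = cb_coords[:, None, :] - cb_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return dist < cutoff
```

In practice glycine has no Cβ atom, so its Cα coordinate is commonly substituted; that detail is omitted here.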

Existing contact prediction methods roughly belong to two categories: evolutionary coupling analysis (ECA) and supervised machine learning. ECA predicts contacts by identifying co-evolved residues in a protein; examples include EVfold [6], PSICOV [5], CCMpred [4], Gremlin [8], plmDCA and others [14–16]. However, ECA usually needs a large number of sequence homologs to be effective [10, 17]. Supervised machine learning predicts contacts from a variety of information, e.g., SVMSEQ [18], CMAPpro [13], PconsC2 [17], MetaPSICOV [9], PhyCMAP [19] and CoinDCA-NN [10]. Among these, PconsC2 uses a 5-layer supervised learning architecture [17]; CoinDCA-NN and MetaPSICOV employ a 2-layer neural network [9]. CMAPpro uses a neural network with more layers, but its performance saturates at about 10 layers. Some supervised methods such as MetaPSICOV and CoinDCA-NN outperform ECA on proteins without many sequence homologs, but their performance is still limited by their shallow architectures.

To further improve supervised learning methods for contact prediction, we borrow ideas from very recent breakthroughs in computer vision. In particular, we have greatly improved contact prediction by developing a brand-new deep learning model called a residual neural network [20] for contact prediction. Deep learning is a powerful machine learning technique that has revolutionized image classification [21, 22] and speech recognition [23]. In 2015, ultra-deep residual neural networks [24] demonstrated superior performance in several computer vision challenges (similar to CASP) such as image classification and object recognition [25]. If we treat a protein contact map as an image, then protein contact prediction is similar to (but not exactly the same as) pixel-level image labeling, so some techniques effective for image labeling may also work for contact prediction. However, there are some important differences between image labeling and contact prediction. First, in the computer vision community, image-level labeling (i.e., classification of a single image) has been extensively studied, but there are much fewer studies on pixel-level image labeling (i.e., classification of an individual pixel). Second, in many image classification scenarios, images are resized to a fixed value, but we cannot resize a contact map since we need to do prediction for every residue pair (equivalent to an image pixel). Third, contact prediction has much more complex input features (including both sequential and pairwise features) than image labeling. Fourth, the ratio of contacts in a protein is very small (<2%). That is, the number of positive and negative labels in contact prediction is extremely unbalanced.

In this paper we present a very deep residual neural network for contact prediction. Such a network can capture very complex sequence-contact relationships and high-order contact correlation. We train this deep neural network using a subset of proteins with solved structures and then test it on public data including the CASP [26, 27] and CAMEO [28] targets as well as many membrane proteins. Our experimental results show that our method yields much better accuracy than existing methods and also results in much more accurate contact-assisted folding. The deep learning method described here will also be useful for the prediction of protein-protein and protein-RNA interfacial contacts.

Results

Deep learning model for contact prediction

Fig 1 illustrates our deep neural network model for contact prediction [29]. Different from previous supervised learning approaches [9, 13] for contact prediction that employ only a small number of hidden layers (i.e., a shallow architecture), our deep neural network employs dozens of hidden layers. By using a very deep architecture, our model can automatically learn the complex relationship between sequence information and contacts and also model the interdependency among contacts and thus, improve contact prediction [17]. Our model consists of two major modules, each being a residual neural network. The first module conducts a series of 1-dimensional (1D) convolutional transformations of sequential features (sequence profile, predicted secondary structure and solvent accessibility). The output of this 1D convolutional network is converted to a 2-dimensional (2D) matrix by outer concatenation (an operation similar to outer product) and then fed into the 2nd module together with pairwise features (i.e., co-evolution information, pairwise contact and distance potential). The 2nd module is a 2D residual network that conducts a series of 2D convolutional transformations of its input. Finally, the output of the 2D convolutional network is fed into a logistic regression, which predicts the probability that any two residues form a contact. In addition, each convolutional layer is also preceded by a simple nonlinear transformation called the rectified linear unit [30]. Mathematically, the output of the 1D residual network is just a 2D matrix with dimension L×m, where m is the number of new features (or hidden neurons) generated by the last convolutional layer of the network. Biologically, this 1D residual network learns the sequential context of a residue. By stacking multiple convolution layers, the network can learn information in a very large sequential context.
The output of a 2D convolutional layer has dimension L×L×n, where n is the number of new features (or hidden neurons) generated by this layer for one residue pair. The 2D residual network mainly learns contact occurrence patterns or high-order residue correlation (i.e., the 2D context of a residue pair). The number of hidden neurons may vary at each layer.
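The outer-concatenation step that bridges the two modules can be sketched as follows. This is a minimal illustrative version (the paper's actual operator may combine features differently, e.g. also using the midpoint residue) that turns an L×m sequential-feature matrix into an L×L×2m pairwise tensor:

```python
import numpy as np

def outer_concat(seq_feat):
    """(L, m) sequential features -> (L, L, 2m) pairwise features,
    where entry (i, j) is the concatenation of rows i and j."""
    L, m = seq_feat.shape
    rows_i = np.repeat(seq_feat[:, None, :], L, axis=1)  # features of residue i
    rows_j = np.repeat(seq_feat[None, :, :], L, axis=0)  # features of residue j
    return np.concatenate([rows_i, rows_j], axis=-1)
```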

Fig 1. Illustration of our deep learning model for contact prediction where L is the sequence length of one protein under prediction.

https://doi.org/10.1371/journal.pcbi.1005324.g001

Our test data includes the 150 Pfam families described in [5], 105 CASP11 test proteins [31], 398 membrane proteins (S1 Table) and 76 CAMEO hard targets released from 10/17/2015 to 04/09/2016 (S2 Table). The tested methods include PSICOV [5], Evfold [6], CCMpred [4], plmDCA [7], Gremlin [8], and MetaPSICOV [9]. The former 5 methods employ pure DCA while MetaPSICOV [9] is a supervised learning method that performed the best in CASP11 [31]. All the programs are run with parameters set according to their respective papers. We cannot evaluate PconsC2 [17] since we failed to obtain any results from its web server. PconsC2 did not outperform MetaPSICOV in CASP11 [31], so it may suffice to just compare our method with MetaPSICOV.
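The top L/k long-range accuracy used in this evaluation can be computed as in the following sketch. It assumes "long-range" means a sequence separation of at least 24 residues (the standard CASP convention; the exact threshold is an assumption here):

```python
import numpy as np

def top_k_longrange_accuracy(pred, native, k=10, min_sep=24):
    """Fraction of true contacts among the top L/k predicted
    long-range pairs (sequence separation >= min_sep)."""
    L = pred.shape[0]
    pairs = [(i, j) for i in range(L) for j in range(i + min_sep, L)]
    pairs.sort(key=lambda p: pred[p], reverse=True)  # highest confidence first
    top = pairs[:max(1, L // k)]
    return sum(bool(native[i, j]) for i, j in top) / len(top)
```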

————————————————————————————————————————————————

Conclusion and Discussion

This paper has presented a new deep (supervised) learning method that can greatly improve protein contact prediction. Our method distinguishes itself from previous supervised learning methods in that we employ a concatenation of two deep residual neural networks to model the sequence-contact relationship, one for modeling of sequential features (i.e., sequence profile, predicted secondary structure and solvent accessibility) and the other for modeling of pairwise features (e.g., coevolution information). Ultra-deep residual networks are the latest breakthrough in computer vision and demonstrated the best performance in the computer vision challenge tasks (similar to CASP) in 2015. Our method is unique in that we predict all contacts of a protein simultaneously, which allows us to easily model high-order residue correlation. By contrast, existing supervised learning methods predict whether two residues form a contact independently of the other residue pairs. Our (blind) test results show that our method dramatically improves contact prediction, exceeding the current best methods (e.g., CCMpred, Evfold, PSICOV and MetaPSICOV) by a very large margin. Even without using any force fields or fragment assembly, ab initio folding using our predicted contacts as restraints can yield 3D structural models of the correct fold for many more test proteins. Further, our experimental results also show that our contact-assisted models are much better than template-based models built from the training proteins of our deep model. We expect that our contact prediction method can help reveal many more biological insights for those protein families without solved structures or close structural homologs.

Our method outperforms ECA for a couple of reasons. First, ECA predicts contacts using information only in a single protein family, while our method learns the sequence-structure relationship from thousands of protein families. Second, ECA considers only pairwise residue correlation, while our deep network architecture can capture high-order residue correlation (or contact occurrence patterns) very well. Our method uses a subset of the protein features used by MetaPSICOV, but outperforms MetaPSICOV mainly because we explicitly model contact patterns (or high-order correlation), which is enabled by predicting all contacts of a single protein simultaneously. MetaPSICOV employs a 2-stage approach. The 1st stage predicts if there is a contact between a pair of residues independently of the other residues. The 2nd stage considers the correlation between one residue pair and its neighboring pairs, but not in a very good way. In particular, the prediction errors in the 1st stage of MetaPSICOV cannot be corrected by the 2nd stage since the two stages are trained separately. By contrast, we train all 2D convolution layers simultaneously (each layer is equivalent to one stage) so that later stages can correct prediction errors in early stages. In addition, a deep network can model much higher-order correlation and thus, capture information in a much larger context.

Our deep model does not predict contact maps by simply recognizing them from PDB, as evidenced by our experimental settings and results. First, we employ a strict criterion to remove redundancy so that there are no training proteins with sequence identity >25% or BLAST E-value <0.1 with any test proteins. Second, our contact-assisted models also have better quality than homology models, so it is unlikely that our predicted contact maps are simply copied from the training proteins. Third, our deep model trained by only soluble proteins works very well on membrane proteins. By contrast, the homology models built from soluble proteins for the membrane proteins have very low quality. Their average TMscore is no more than 0.17, which is the expected TMscore of any two randomly-chosen proteins. Finally, the blind CAMEO test indicates that our method successfully folded several targets with a new fold.

Our contact prediction method also performed the best in CASP12 in terms of the F1 score calculated on top L/2 long- and medium-range contacts of 38 free-modeling targets, although back then (May-July 2016) our method was not fully implemented. F1 score is a well-established and robust metric in evaluating the performance of a prediction method. Our method outperformed the 2nd best server iFold_1 by about 7.6% in terms of the total F1 score and the 3rd best server (i.e., an improved MetaPSICOV) by about 10.0%. Our advantage is even bigger when only top L/5 long- and medium-range contacts are evaluated. iFold_1 also used a deep neural network while the new MetaPSICOV used a deeper and wider network and more input features than the old version. This CASP result further confirms that deep learning can indeed improve protein contact prediction.

We have studied the impact of different input features. First of all, the co-evolution strength produced by CCMpred is very important. Without it, the top L/10 long-range prediction accuracy may drop by 0.15 for soluble proteins and more for membrane proteins. The larger performance degradation for membrane proteins is mainly because information learned from sequential features of soluble proteins is not very useful for membrane proteins. The depth of our deep model is as important as CCMpred, as evidenced by the fact that our deep method has much better accuracy than MetaPSICOV although we use a subset of the protein features used by MetaPSICOV. Our test shows that deep models with 9 and 30 layers have top L/10 accuracy ~0.1 and ~0.03 worse than a 60-layer model, respectively. This suggests that it is very important to model contact occurrence patterns (i.e., high-order residue correlation) by a deep architecture. The pairwise contact potential and mutual information may impact the accuracy by 0.02–0.03. The secondary structure and solvent accessibility may impact the accuracy by 0.01–0.02.

An interesting finding is that although our training set contains only ~100 membrane proteins, our model works well for membrane proteins, much better than CCMpred and MetaPSICOV. Even without using any membrane proteins in our training set, our deep models have almost the same accuracy on membrane proteins as those trained with membrane proteins. This implies that the sequence-structure relationship learned by our model from non-membrane proteins can generalize well to membrane protein contact prediction. This may be because both soluble and membrane proteins share similar contact occurrence patterns in their contact maps and our deep method improves over previous methods by making good use of contact occurrence patterns. We are going to study whether we can further improve the contact prediction accuracy of membrane proteins by including many more membrane proteins in the training set.

We may further improve contact prediction accuracy by enlarging the training set. First, the latest PDB25 has more than 10,000 proteins, which can provide many more training proteins than what we are using now. Second, when removing redundancy between training and test proteins, we may relax the BLAST E-value cutoff to 0.001 or simply drop it. This will improve the top L/k (k = 1, 2, 5, 10) contact prediction accuracy by 1–3% and accordingly the quality of the resultant 3D models by 0.01–0.02 in terms of TMscore. We may also improve the 3D model quality by combining our predicted contacts with an energy function and fragment assembly. For example, we may feed our predicted contacts to Rosetta to build 3D models. Compared to CNS, Rosetta makes use of an energy function and more local structural restraints through fragment assembly and thus, shall result in much better 3D models. Finally, instead of predicting contacts, our deep learning model can actually predict the inter-residue distance distribution (i.e., distance matrix), which provides finer-grained information than contact maps and thus, shall benefit 3D structure modeling more than predicted contacts.

Our model achieves pretty good performance when using around 60–70 convolutional layers. A natural question to ask is whether we can further improve prediction accuracy by using many more convolutional layers. In computer vision, it has been shown that a 1001-layer residual neural network can yield better accuracy for image-level classification than a 100-layer network (but no result on pixel-level labeling has been reported). Currently we cannot apply more than 100 layers to our model due to insufficient memory on a GPU card (12G). We plan to overcome the memory limitation by extending our training algorithm to run on multiple GPU cards. Then we will train a model with hundreds of layers to see whether we can further improve prediction accuracy.

Method

Deep learning model details

Residual network blocks.

Our network consists of two residual neural networks, each in turn consisting of some residual blocks concatenated together. Fig 2 shows an example of a residual block consisting of 2 convolution layers and 2 activation layers. In this figure, X_l and X_{l+1} are the input and output of the block, respectively. The activation layer conducts a simple nonlinear transformation of its input without using any parameters. Here we use the ReLU activation function [30] for such a transformation. Let f(X_l) denote the result of X_l going through the two activation layers and the two convolution layers. Then, X_{l+1} is equal to X_l + f(X_l). That is, X_{l+1} is a combination of X_l and its nonlinear transformation. Since f(X_l) is equal to the difference between X_{l+1} and X_l, f is called the residual function and this network is called a residual network. In the first residual network, X_l and X_{l+1} represent sequential features and have dimension L×n_l and L×n_{l+1}, respectively, where L is the protein sequence length and n_l (n_{l+1}) can be interpreted as the number of features or hidden neurons at each position (i.e., residue). In the 2nd residual network, X_l and X_{l+1} represent pairwise features and have dimension L×L×n_l and L×L×n_{l+1}, respectively, where n_l (n_{l+1}) can be interpreted as the number of features or hidden neurons at one position (i.e., residue pair). Typically, we enforce n_l ≤ n_{l+1} since one position at a higher level is supposed to carry more information. When n_l < n_{l+1}, in calculating X_l + f(X_l) we pad zeros to X_l so that it has the same dimension as X_{l+1}.
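The recurrence X_{l+1} = X_l + f(X_l), including the zero-padding used when n_l < n_{l+1}, can be sketched in NumPy for the 1D case. The weight shapes, the "same"-padded convolution, and the layer ordering below are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, W):
    """'Same'-padded 1D convolution. x: (L, n_in); W: (win, n_in, n_out)."""
    win = W.shape[0]
    pad = win // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    L, n_out = x.shape[0], W.shape[2]
    y = np.zeros((L, n_out))
    for i in range(L):
        # each output position sees a window of `win` input positions
        y[i] = np.einsum('wi,wio->o', xp[i:i + win], W)
    return y

def residual_block(x, W1, W2):
    """X_{l+1} = X_l + f(X_l), where f = conv(relu(conv(relu(.))))."""
    f = conv1d(relu(conv1d(relu(x), W1)), W2)
    n_l, n_next = x.shape[1], f.shape[1]
    if n_l < n_next:
        # pad zeros to X_l so it matches the dimension of f(X_l)
        x = np.pad(x, ((0, 0), (0, n_next - n_l)))
    return x + f
```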

To speed up training, we also add a batch normalization layer [43] before each activation layer, which normalizes its input to have mean 0 and standard deviation 1. The filter size (i.e., window size) used by a 1D convolution layer is 17 while that used by a 2D convolution layer is 3×3 or 5×5. By stacking many residual blocks together, even if at each convolution layer we use a small window size, our network can model very long-range interdependency between input features and contacts as well as the long-range interdependency between two different residue pairs. We fix the depth (i.e., the number of convolution layers) of the 1D residual network to 6, but vary the depth of the 2D residual network. Our experimental results show that with ~60 hidden neurons at each position and ~60 convolution layers for the 2nd residual network, our model can yield pretty good performance. Note that it has been shown that for image classification a convolutional neural network with a smaller window size but many more layers usually outperforms a network with a larger window size but fewer layers. Further, a 2D convolutional neural network with a smaller window size also has a smaller number of parameters than a network with a larger window size. See https://github.com/KaimingHe/deep-residual-networks for some existing implementations of 2D residual neural networks. However, they assume an input of fixed dimension, while our network needs to take variable-length proteins as input.
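In its simplest form, the batch normalization applied before each activation is just the following (this sketch ignores the learned scale and shift parameters and running statistics that the full method [43] uses):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature channel to mean 0 and standard deviation ~1."""
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return (x - mu) / (sigma + eps)  # eps guards against division by zero
```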

Fig 2. A building block of our residual network with X_l and X_{l+1} being input and output, respectively. Each block consists of two convolution layers and two activation layers.

Our deep learning method for contact prediction is unique in at least two aspects. First, our model employs two multi-layer residual neural networks, which have not been applied to contact prediction before. Residual neural networks can pass both linear and nonlinear information from end to end (i.e., from the initial input to the final output). Second, we do contact prediction on the whole contact map by treating it as an individual image. In contrast, previous supervised learning methods separate the prediction of one residue pair from the others. By predicting all contacts of a protein simultaneously, we can easily model long-range contact correlation, high-order residue correlation, and long-range correlation between a contact and the input features.

Convolutional operation.

Existing deep learning toolkits such as Theano (http://deeplearning.net/software/theano/) and TensorFlow (https://www.tensorflow.org/) provide an API (application programming interface) for the convolutional operation, so we do not need to implement it ourselves. See http://deeplearning.net/tutorial/lenet.html and https://www.nervanasys.com/convolutional-neural-networks/ for good tutorials on convolutional networks. 1D convolutional networks have also been applied to protein sequence labeling. Roughly speaking, the 1D convolutional operation is in fact matrix-vector multiplication, and 2D convolution can be interpreted similarly. Let X and Y (with dimensions L×m and L×n, respectively) be the input and output of a 1D convolution layer. Let the window size be 2w+1 and s = (2w+1)m. The convolution operator that transforms X to Y can be represented as a 2D matrix of dimension n×s, denoted as C. C is independent of the protein length L, and each convolution layer may have a different C.
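The matrix-vector view of 1D convolution described above can be sketched directly (a toy reference implementation, not the toolkit code): at each position the zero-padded window of X is flattened into a length-s vector and multiplied by the n×s matrix C.

```python
def conv1d_as_matmul(X, C, w):
    """1D convolution as matrix-vector products.
    X: L x m input; C: n x s weights with s = (2w+1)*m; window size 2w+1.
    Positions outside [0, L) are zero-padded."""
    L, m = len(X), len(X[0])
    assert len(C[0]) == (2 * w + 1) * m
    Y = []
    for i in range(L):
        window = []                      # flatten X[i-w .. i+w] into length s
        for j in range(i - w, i + w + 1):
            window.extend(X[j] if 0 <= j < L else [0.0] * m)
        Y.append([sum(c * v for c, v in zip(row, window)) for row in C])
    return Y  # L x n output, independent of L in its parameters
```

Because C does not depend on L, the same layer applies to proteins of any length.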

Training and dealing with proteins of different lengths.

Our network can take as input variable-length proteins. We train our deep network in a minibatch mode, which is routinely used in deep learning. That is, at each training iteration, we use a minibatch of proteins to calculate gradient and update the model parameters. A minibatch may have one or several proteins. We sort all training proteins by length and group proteins of similar lengths into minibatches. Considering that most proteins have length up to 600 residues, proteins in a minibatch often have the same length. In the case that they do not, we add zero padding to shorter proteins. Our convolutional operation is protein-length independent, so two different minibatches are allowed to have different protein lengths. We have tested minibatches with only a single protein or with several proteins. Both work well. However, it is much easier to implement minibatches with only a single protein.
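The batching scheme above (sort by length, group similar lengths, zero-pad within a batch) can be sketched as follows; the one-feature-per-residue representation is a toy simplification for illustration.

```python
def make_minibatches(proteins, batch_size):
    """Sort proteins by length, group similar lengths into minibatches,
    and zero-pad shorter proteins to the batch maximum.
    Each protein is a toy list of per-residue feature values."""
    by_len = sorted(proteins, key=len)
    batches = []
    for k in range(0, len(by_len), batch_size):
        batch = by_len[k:k + batch_size]
        L = max(len(p) for p in batch)   # pad only within this batch
        batches.append([p + [0.0] * (L - len(p)) for p in batch])
    return batches
```

Different minibatches may end up with different lengths, which is fine because the convolutional operation is length-independent.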

Since our network can take variable-length proteins as input, we do not need to cut a long protein into segments when predicting contact maps. Instead, we predict all contacts of a protein chain simultaneously. There is no need for zero padding when only a single protein is predicted in a batch; zero padding is needed only when several proteins of different lengths are predicted in one batch.

Training and test data

Our test data includes the 150 Pfam families [5], 105 CASP11 test proteins, 76 hard CAMEO test proteins released in 2015 (S1 Table) and 398 membrane proteins (S2 Table). All test membrane proteins have length no more than 400 residues and any two membrane proteins share less than 40% sequence identity. For the CASP test proteins, we use the official domain definitions, but we do not parse a CAMEO or membrane protein into domains.

Our training set is a subset of PDB25 created in February 2015, in which any two proteins share less than 25% sequence identity. We exclude a protein from the training set if it satisfies one of the following conditions: (i) its sequence length is smaller than 26 or larger than 700, (ii) its resolution is worse than 2.5Å, (iii) it has domains made up of multiple protein chains, (iv) it has no DSSP information, or (v) its PDB, DSSP and ASTRAL sequences are inconsistent [48]. To remove redundancy with the test sets, we exclude any training proteins sharing >25% sequence identity or having BLAST E-value <0.1 with any test proteins. In total there are 6767 proteins in our training set, from which we have trained 7 different models. For each model, we randomly sampled ~6000 proteins from the training set to train the model and used the remaining proteins to validate the model and determine the hyper-parameters (i.e., regularization factor). The final model is the average of these 7 models.
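Conditions (i)–(iv) amount to a simple predicate over per-protein metadata; a sketch (the record keys are hypothetical, and condition (v), sequence consistency across PDB/DSSP/ASTRAL, would need the actual sequences):

```python
def keep_for_training(p):
    """Apply exclusion filters (i)-(iv) to a hypothetical metadata dict:
    keep proteins of length 26-700, resolution <= 2.5 A, single-chain
    domains, and with DSSP information available."""
    return (26 <= p["length"] <= 700
            and p["resolution"] <= 2.5
            and not p["multichain_domain"]
            and p["has_dssp"])
```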

Protein features

We use protein features similar to, but fewer than, those used by MetaPSICOV. In particular, the input features include protein sequence profile (i.e., position-specific scoring matrix), predicted 3-state secondary structure and 3-state solvent accessibility, direct co-evolutionary information generated by CCMpred, mutual information and pairwise potential [45, 46]. To derive these features, we need to generate an MSA (multiple sequence alignment). For a training protein, we run PSI-BLAST (with E-value 0.001 and 3 iterations) to search the NR (non-redundant) protein sequence database dated in October 2012 to find its sequence homologs, and then build its MSA and sequence profile and predict other features (i.e., secondary structure and solvent accessibility). The sequence profile is represented as a 2D matrix with dimension L×20, where L is the protein length. Predicted secondary structure is represented as a 2D matrix with dimension L×3 (each entry is a predicted score or probability), and so is the predicted solvent accessibility. Concatenating them together, we have a 2D matrix with dimension L×26, which is the input of our 1D residual network.
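The concatenation into the L×26 input is straightforward; a toy sketch (the 20 + 3 + 3 column split follows the text, the variable names are mine):

```python
def sequential_features(profile, ss, acc):
    """Per-residue input of the 1D residual network: 20 profile columns
    + 3 secondary-structure probabilities + 3 solvent-accessibility
    probabilities = 26 features per position."""
    assert len(profile) == len(ss) == len(acc)  # all L rows
    return [p + s + a for p, s, a in zip(profile, ss, acc)]
```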

For a test protein, we generate four different MSAs by running HHblits [39] with 3 iterations and E-value set to 0.001 or 1 against the uniprot20 HMM libraries released in November 2015 and February 2016 (two E-values × two libraries). From each individual MSA, we derive one sequence profile and employ our in-house tool RaptorX-Property [49] to predict the secondary structure and solvent accessibility accordingly. That is, for each test protein we generate 4 sets of input features and accordingly 4 different contact predictions. Then we average these 4 predictions to obtain the final contact prediction. This averaged contact prediction is about 1–2% better than that predicted from a single set of features. Although currently there are quite a few packages that can generate direct evolutionary coupling information, we only employ CCMpred to do so because it runs fast on GPU [4].
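The final step, averaging the four predicted maps element-wise, can be sketched as (toy code, not the pipeline's):

```python
def average_maps(maps):
    """Element-wise average of several L x L contact-probability maps."""
    k = float(len(maps))
    L = len(maps[0])
    return [[sum(m[i][j] for m in maps) / k for j in range(L)]
            for i in range(L)]
```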

Programs to compare and evaluation metrics

We compare our method with PSICOV [5], Evfold [6], CCMpred [4], plmDCA, Gremlin, and MetaPSICOV [9]. The first 5 methods conduct pure DCA while MetaPSICOV employs supervised learning. MetaPSICOV [9] performed the best in CASP11 [31]. CCMpred, plmDCA and Gremlin perform similarly, and better than PSICOV and Evfold. All the programs are run with parameters set according to their respective papers. We evaluate the accuracy of the top L/k (k = 10, 5, 2, 1) predicted contacts, where L is the protein sequence length. The prediction accuracy is defined as the percentage of native contacts among the top L/k predicted contacts. We also divide contacts into three groups according to the sequence distance between the two residues of a contact: a contact is short-, medium- or long-range when its sequence distance falls into [6, 11], [12, 23], and ≥24, respectively.
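These two evaluation conventions can be written down directly (a minimal sketch; the (score, i, j) tuple format is an assumption of mine):

```python
def contact_class(i, j):
    """Short-, medium- or long-range by sequence separation |i - j|."""
    d = abs(i - j)
    if 6 <= d <= 11:
        return "short"
    if 12 <= d <= 23:
        return "medium"
    if d >= 24:
        return "long"
    return None  # |i - j| < 6: not evaluated

def top_accuracy(scored_pairs, native, L, k):
    """Fraction of native contacts among the top L/k predictions.
    scored_pairs: (score, i, j) tuples; native: set of (i, j) pairs."""
    top = sorted(scored_pairs, reverse=True)[: L // k]
    return sum((i, j) in native for _, i, j in top) / float(len(top))
```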

Calculation of Meff

Meff measures the amount of homologous information in an MSA (multiple sequence alignment). It can be interpreted as the number of non-redundant sequence homologs in an MSA when 70% sequence identity is used as cutoff. To calculate Meff, we first calculate the sequence identity between any two proteins in the MSA. Let a binary variable Sij denote the similarity between two protein sequences i and j. Sij is equal to 1 if and only if the sequence identity between i and j is at least 70%. For a protein i, we calculate the sum of Sij over all the proteins (including itself) in the MSA and denote it as Si. Finally, we calculate Meff as the sum of 1/Si over all the protein sequences in this MSA.
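The Meff formula above translates directly into code (a toy sketch over aligned, equal-length sequences; the identity function is a simplification of mine):

```python
def seq_identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / float(len(a))

def calc_meff(msa, cutoff=0.7):
    """Meff = sum over sequences i of 1/S_i, where S_i counts sequences
    in the MSA (including i itself) with >= 70% identity to i."""
    total = 0.0
    for a in msa:
        Si = sum(1 for b in msa if seq_identity(a, b) >= cutoff)
        total += 1.0 / Si
    return total
```

For example, an MSA with two identical sequences and one unrelated sequence has Meff = 1/2 + 1/2 + 1 = 2, i.e., two effective (non-redundant) homologs.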

3D model construction by contact-assisted folding

We use a similar approach as described in [11] to build the 3D models of a test protein by feeding predicted contacts and secondary structure to the Crystallography & NMR System (CNS) suite [32]. We predict secondary structure using our in-house tool RaptorX-Property [49] and then convert it to distance, angle and h-bond restraints using a script in the Confold package [11]. For each test protein, we choose the top 2L predicted contacts (L is sequence length), no matter whether they are short-, medium- or long-range, and then convert them to distance restraints. That is, a pair of residues predicted to form a contact is assumed to have distance between 3.5Å and 8.0Å. In the current implementation, we do not use any force fields to help with folding. We generate twenty 3D structure models using CNS and select the top 5 models by the NOE score yielded by CNS [32]. The NOE score mainly reflects the degree of violation of the model against the input constraints (i.e., predicted secondary structure and contacts); the lower the NOE score, the more likely the model has a higher quality. When CCMpred- and MetaPSICOV-predicted contacts are used to build 3D models, we also use the secondary structure predicted by RaptorX-Property to warrant a fair comparison.
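The contact-to-restraint conversion can be sketched as follows (a toy illustration; the tuple formats are assumptions, and the real pipeline writes CNS-format restraint files):

```python
def contact_restraints(scored_pairs, L):
    """Turn the top 2L predicted contacts into distance restraints:
    each contacting residue pair (i, j) is required to lie between
    3.5 and 8.0 Angstrom. scored_pairs: (score, i, j) tuples."""
    top = sorted(scored_pairs, reverse=True)[: 2 * L]
    return [(i, j, 3.5, 8.0) for _, i, j in top]
```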

Template-based modeling (TBM) of the test proteins

To generate template-based models (TBMs) for a test protein, we first run HHblits (with the UniProt20_2016 library) to generate an HMM file for the test protein, then run HHsearch with this HMM file to search for the best templates among the 6767 training proteins of our deep learning model, and finally run MODELLER to build a TBM from each of the top 5 templates.
