CCTOP

Constrained Consensus TOPology prediction

We tested the accuracy of several prediction methods on the benchmark sets. Ten methods were then selected according to their prediction accuracy, their availability and how easily they can be integrated into a consensus method. The selected methods are: HMMTOP, Membrain, Memsat-SVM, Octopus, Philius, Phobius, Pro, Prodiv, Scampi and TMHMM.

The submitted sequence is searched against the TOPDB database with BLAST, using an e-value threshold of 10^-10. A hit is accepted only if all of the following conditions hold: i) the length of the hit is above 80% of the length of the query sequence; ii) the alignment covers all TM helices of the homologous TOPDB entry; iii) sequence similarity within the HSPs is above 40%. The topology data of the homologous protein in TOPDB are then used in the constrained prediction by mapping their sequence positions onto the query according to the positions of the HSPs.
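The three acceptance conditions above can be sketched as a simple filter. This is only an illustration: the `Hit` container and its field names are hypothetical, the alignment is simplified to a single covered range on the TOPDB entry rather than a list of HSPs, and "the length of the hit" is assumed to be stored directly in `hit_length`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    """Hypothetical container for one BLAST hit against TOPDB."""
    hit_length: int                    # length of the hit (assumed given)
    similarity: float                  # sequence similarity within the HSPs, in percent
    aligned_range: Tuple[int, int]     # inclusive range of the TOPDB entry covered by the alignment
    tm_helices: List[Tuple[int, int]]  # inclusive TM helix ranges of the TOPDB entry


def accept_hit(hit: Hit, query_length: int) -> bool:
    """Return True only if all three conditions from the text hold."""
    # i) the hit is longer than 80% of the query sequence
    long_enough = hit.hit_length > 0.8 * query_length
    # ii) every TM helix of the TOPDB entry lies inside the aligned range
    start, end = hit.aligned_range
    helices_covered = all(start <= s and e <= end for s, e in hit.tm_helices)
    # iii) sequence similarity is above 40%
    similar_enough = hit.similarity > 40.0
    return long_enough and helices_covered and similar_enough
```

For example, a hit of length 90 against a query of length 100, with 55% similarity and both of its TM helices inside the aligned range, would pass the filter, whereas the same hit with a helix starting before the aligned range would be rejected.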

The search engine of the TOPDOM homepage is used to locate those domains/motifs in the human sequences that were earlier found to occur conservatively on the same side of the membrane in transmembrane proteins (TMPs). The positions and topological localization of the results are used as constraints.

The newly developed consensus prediction algorithm is based on the probabilistic framework provided by the hidden Markov model, therefore the HMMTOP method can be utilized for this task. Briefly, the results of the ten prediction methods together with the available 3D or experimental topology data can be applied in HMMTOP as weighted constraints to obtain the constrained consensus prediction result. The weights depend on the per-protein topology or topography accuracies of the methods. The results of the i-th method are:

\(Pred_{i} = l_{1}, l_{2}, \ldots, l_{n}\), \(1\leq i\leq m\)

\(l_{j} \in \{"I","M","O","L","U"\}\), \(1\leq j\leq n\)

where \(m\geq 10\) (the ten prediction methods plus zero, one or more 3D/experimental topology constraints), \(n\) is the length of the query sequence, and the labels "I", "M", "O", "L" and "U" correspond to cytoplasmic loops, membrane-spanning segments, non-cytoplasmic loops, membrane re-entrant loops and unknown regions, respectively.

We calculated the per-protein topography (\(Acc_{Tpg}\)) and topology (\(Acc_{Top}\)) accuracies of each method on the "structure benchmark set", and used these values as weights for the constraints. \(Acc_{Tpg}\) is applied at those positions where the prediction method predicted a transmembrane segment or a re-entrant loop (label "M" or "L", respectively); otherwise \(Acc_{Top}\) is used (labels "I" and "O"). For 3D or experimental topology data, the weights are set to 20. For prediction methods, the results of a given prediction are used as constraints only if the prediction is valid, i.e. it contains at least one transmembrane region:

\[W_{i,j} = \left\{ \begin{array}{ll} Acc_{Top}(i), & \text{if } Pred_{i,j}\in \{"I","O"\} \text{ and } type(i) = \text{prediction method} \\ Acc_{Tpg}(i), & \text{if } Pred_{i,j}\in \{"M","L"\} \text{ and } type(i) = \text{prediction method} \\ 20, & \text{if } type(i) = \text{experimental result} \end{array} \right. \]\(\text{where } 1\leq i\leq m, \ 1\leq j\leq n\)
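A minimal sketch of this weighting rule follows. The function name and arguments are hypothetical. The weight of 0 for positions labelled "U" by a prediction method is an assumption, since the formula does not cover that case. Returning all zeros for an invalid prediction (one with no "M" label) is one possible reading of "used as constraints only if the prediction is valid".

```python
from typing import List

EXPERIMENTAL_WEIGHT = 20.0  # weight for 3D/experimental data, as stated in the text


def position_weights(pred: str, is_experimental: bool,
                     acc_top: float = 0.0, acc_tpg: float = 0.0) -> List[float]:
    """Return the weights W[i][j] of one source i for every position j.

    pred is the label string of the source (characters I, M, O, L, U).
    """
    if is_experimental:
        # 3D or experimental topology data: fixed weight at every position
        return [EXPERIMENTAL_WEIGHT] * len(pred)
    if "M" not in pred:
        # Invalid prediction (no transmembrane region): contributes nothing
        return [0.0] * len(pred)
    weights = []
    for label in pred:
        if label in ("I", "O"):
            weights.append(acc_top)   # topology accuracy for loop labels
        elif label in ("M", "L"):
            weights.append(acc_tpg)   # topography accuracy for membrane labels
        elif label == "U":
            weights.append(0.0)       # assumption: unknown regions carry no weight
        else:
            raise ValueError(f"unexpected label {label!r}")
    return weights
```

With `acc_top=0.8` and `acc_tpg=0.9` (made-up values), the prediction `"IIMMOO"` gives the weights `[0.8, 0.8, 0.9, 0.9, 0.8, 0.8]`.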

These weights are normalized to one at each sequence position and are used as constraints in the HMM:

\(C_{j,k} = \cfrac{\sum\limits_{i=1}^{m} W_{i,j}\cdot\Delta (k, Pred_{i,j})}{\sum\limits_{k'=1}^{N} \sum\limits_{i=1}^{m} W_{i,j}\cdot\Delta (k', Pred_{i,j})}, \ \ 1\leq j\leq n, \ 1\leq k\leq N\)

where N is the number of states in the hidden Markov model and

\[\Delta (a,b) = \left\{ \begin{array}{ll} 1, & \text{if } Label (S_{a})=b\\ 0, & \text{if } Label (S_{a})\neq b \end{array} \right. \]\(\text{where } 1\leq a\leq N, \ 1\leq b\leq \hat{N}\)

where \(\hat{N} = 4\) is the number of main states (inside, outside, membrane and loop) and \(S\) denotes the states of the hidden Markov model. If the Memsat-SVM or Octopus methods predict re-entrant loop regions, or re-entrant loop regions are used as 3D or experimental topology constraints, a modified architecture of the HMMTOP algorithm is used, which allows an extra “language rule” in the hidden Markov model.
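The normalization into per-position, per-state constraints can be illustrated as follows. This is a sketch under stated assumptions, not the HMMTOP implementation: `state_labels` is a hypothetical list giving the main label of each of the N HMM states, and positions where no source contributes any weight are assumed to be left unconstrained (a row of zeros).

```python
from typing import List, Sequence


def constraints(preds: Sequence[str],
                weights: Sequence[Sequence[float]],
                state_labels: Sequence[str]) -> List[List[float]]:
    """Return C[j][k]: the normalized constraint on state k at position j.

    preds[i] is the label string of source i, weights[i][j] its weight
    at position j, and state_labels[k] the label of HMM state k.
    """
    n = len(preds[0])
    result = []
    for j in range(n):
        # Numerator for each state k: total weight of the sources whose
        # label at position j matches the label of state k (Delta = 1)
        raw = [sum(w[j] for p, w in zip(preds, weights) if p[j] == label)
               for label in state_labels]
        total = sum(raw)  # denominator: sum over all states
        if total > 0:
            result.append([r / total for r in raw])
        else:
            # Assumption: no usable information means no constraint
            result.append([0.0] * len(raw))
    return result
```

In a toy model with one state per label (`["I", "M", "O", "L"]`), two sources predicting `"IMO"` and `"IMM"` agree on the first two positions, so the constraint there is 1 for the matching state. At the third position the weight is split between the "M" and "O" states in proportion to the weights of the two sources.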
