A correlation-based approach for predicting
protein domains from sequence
M. Lexa
Masaryk University, Faculty of Informatics,
Botanicka 68a, 60200 Brno, Czech Republic
lexa@fi.muni.cz
Protein structure prediction from sequence is an important problem that has not
yet been solved satisfactorily. Prediction is possible for some sequences by
means of homology modelling, threading, fragment assembly or ab initio
modelling. An important step of any prediction method is preprocessing of the
analyzed sequence to recognize the presence of possible domains. If domains can
be identified, the problem can be split into a set of simpler problems, each
dealing with one domain separately. Domains often contain typical regions of
sequence similarity to other proteins containing the same domain, therefore
similarity searches represent a natural approach to domain identification. This
has been used, for example, in the construction of the ProDom database of
protein domains [1]. Unfortunately, many sequences lack sequence similarity to
existing proteins, and domains cannot be identified this way. An alternative
method, based on a statistic that recognizes interdomain regions rather than the
domains themselves, predicts domain boundaries from the amino acid composition
of the analyzed sequence regardless of the presence or absence of sequence
similarity [2]. Another method predicts the boundaries of domains by
considering the hydrophobicity of amino acids and their likelihood to occur
inside or outside a typical globular domain [3].
In this contribution, I present a new approach to domain prediction. Although
it is based on sequence similarity, like some previous methods, it relies only
on very short
segments of sequence similarity that might be considered noise by many other
methods. The novelty and potential power of this method lie in the subsequent
analysis of the identified short similarities. These are considered in pairs
occurring along the analyzed sequence. Pairs that are found correlated in a
large protein database are assumed to have some common function (structural or
other) in the protein. As such, they are much more likely to exist within a
single domain, since domains are often considered to be units of elementary
functions. Regions of the analyzed sequence spanned by many such correlated
pairs are then evaluated as domain candidates. Regions with a minimal number of
spanning correlations are predicted to be domain boundaries. Compared to the
previously reported version of this approach [4], a clustering step has been
added to the analysis to enable detection of non-contiguous domains.
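The boundary-detection idea described above can be illustrated with a minimal sketch. It assumes the correlated pairs have already been found and are given as (start, end) positions in the analyzed sequence. The function names, the threshold parameter and the choice of reporting the midpoint of each low-coverage run are hypothetical and serve illustration only. The database correlation step and the clustering step are not covered.

```python
from typing import List, Tuple


def spanning_counts(seq_len: int, pairs: List[Tuple[int, int]]) -> List[int]:
    """Count, for every sequence position, how many correlated pairs span it.

    Each pair is (start, end), 0-based and inclusive (an assumed input
    format). A difference array keeps the work linear in
    seq_len + len(pairs).
    """
    diff = [0] * (seq_len + 1)
    for a, b in pairs:
        lo, hi = min(a, b), max(a, b)
        diff[lo] += 1
        diff[hi + 1] -= 1
    counts = []
    running = 0
    for pos in range(seq_len):
        running += diff[pos]
        counts.append(running)
    return counts


def predict_boundaries(counts: List[int], threshold: int = 0) -> List[int]:
    """Return the midpoint of each internal run with coverage <= threshold.

    Runs touching either end of the sequence are skipped, since they
    cannot separate two domains. The threshold is an illustrative
    assumption, not a value taken from the abstract.
    """
    boundaries = []
    n = len(counts)
    pos = 0
    while pos < n:
        if counts[pos] <= threshold:
            run_start = pos
            while pos < n and counts[pos] <= threshold:
                pos += 1
            run_end = pos - 1
            if run_start > 0 and run_end < n - 1:
                boundaries.append((run_start + run_end) // 2)
        else:
            pos += 1
    return boundaries
```

For example, with two groups of pairs covering positions 0 to 8 and 11 to 19 of a 20-residue sequence, the only uncovered internal run is positions 9 to 10, which this sketch would report as a single candidate boundary.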
The new method is being tested against a database of proteins with known domain
composition. Other methods will be compared with ours, and the results will be
presented.
1. F. Servant, C. Bru, S. Carrere, E. Courcelle, J. Gouzy, D. Peyruc, D. Kahn, ProDom: automated clustering of homologous domains, Briefings in Bioinformatics, 3, (2002), 246-251.
2. M. Dumontier, R. Yao, H.J. Feldman, C.W.V. Hogue, Armadillo: domain boundary prediction by amino acid composition, Journal of Molecular Biology, 350, (2005), 1061-1073.
3. R.A. George, K. Lin, J. Heringa, Scooby-domain: prediction of globular domains in protein sequence, Nucleic Acids Research, 33, (2005), W160-W163.
4. M. Lexa, G. Valle, Combining rapid word searches with segment-to-segment alignment for sensitive similarity detection, domain identification and structural modelling, Proceedings of the Bioinformatics Italian Society Meeting, Padova, 26-27 March 2004, (2004), A66.