A correlation-based approach for predicting protein domains from sequence

M. Lexa¹

¹Masaryk University, Faculty of Informatics, Botanicka 68a, 60200 Brno, Czech Republic

lexa@fi.muni.cz

Predicting protein structure from sequence is an important problem that has not yet been solved satisfactorily. For some sequences, prediction is possible by means of homology modelling, threading, fragment assembly or ab initio modelling. An important step in any prediction method is preprocessing the analyzed sequence to recognize possible domains. If domains can be identified, the problem can be split into a set of simpler problems, each dealing with a single domain. Domains often contain regions of sequence similarity to other proteins containing the same domain, so similarity searches are a natural approach to domain identification. This approach has been used, for example, in the construction of the ProDom database of protein domains [1]. Unfortunately, many sequences lack sequence similarity to known proteins, and their domains cannot be identified this way. An alternative method, based on a statistic that recognizes interdomain regions rather than the domains themselves, predicts domain boundaries from the amino acid composition of the analyzed sequence regardless of the presence or absence of sequence similarity [2]. Another method predicts domain boundaries by considering the hydrophobicity of amino acids and their likelihood of occurring inside or outside a typical globular domain [3].

In this contribution, I present a new approach to domain prediction. Although it is based on sequence similarity, like some previous methods, it relies only on very short segments of sequence similarity that many other methods would treat as noise. The novelty and potential power of the method lie in the subsequent analysis of the identified short similarities. These are considered in pairs occurring along the analyzed sequence. Pairs found to be correlated in a large protein database are assumed to share some common function (structural or other) in the protein. As such, they are much more likely to lie within a single domain, since domains are often considered to be units of elementary function. Regions of the analyzed sequence spanned by many such correlated pairs are then evaluated as domain candidates. Regions with a minimal number of spanning correlations are predicted to be domain boundaries. Compared to the previously reported version of this approach [4], a clustering step has been added to the analysis to enable detection of non-contiguous domains.
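The boundary-calling step described above can be illustrated with a minimal sketch. It assumes that the correlated pairs have already been found and are given as 0-based residue positions within the sequence; how correlation is measured in the database, and the clustering step for non-contiguous domains, are not covered here. The function names, the threshold and the margin are hypothetical choices made for illustration, not part of the published method.

```python
from typing import List, Tuple


def spanning_counts(seq_len: int, pairs: List[Tuple[int, int]]) -> List[int]:
    """For each residue position, count how many correlated pairs span it.

    Each pair (a, b) is assumed to hold 0-based positions inside the sequence.
    A difference array keeps the cost linear in seq_len + len(pairs).
    """
    diff = [0] * (seq_len + 1)
    for a, b in pairs:
        lo, hi = min(a, b), max(a, b)
        diff[lo] += 1       # the pair starts covering at lo
        diff[hi + 1] -= 1   # and stops covering after hi
    counts = []
    running = 0
    for i in range(seq_len):
        running += diff[i]
        counts.append(running)
    return counts


def boundary_candidates(counts: List[int], threshold: int = 0,
                        margin: int = 1) -> List[Tuple[int, int]]:
    """Return maximal runs (start, end) of positions with counts <= threshold.

    Positions within `margin` residues of either terminus are ignored,
    because low coverage there reflects the sequence ends, not a boundary.
    Both parameters are illustrative assumptions.
    """
    runs = []
    start = None
    last = len(counts) - margin
    for i in range(margin, last):
        if counts[i] <= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, last - 1))
    return runs


# Toy input: two groups of correlated pairs separated by an uncovered gap.
example_pairs = [(0, 5), (1, 7), (2, 8), (12, 18), (13, 19), (11, 17)]
example_counts = spanning_counts(20, example_pairs)
example_boundaries = boundary_candidates(example_counts)
```

In this toy input, positions 9 and 10 are spanned by no pair, so they form the only boundary candidate, while the two flanking regions would be evaluated as domain candidates. With real data a threshold of zero is probably too strict, and a local-minimum criterion on smoothed counts may be more appropriate.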

The new method is being tested against a database of proteins with known domain composition. It will be compared with other methods, and the results will be presented.

1. F. Servant, C. Bru, S. Carrere, E. Courcelle, J. Gouzy, D. Peyruc, D. Kahn, ProDom: automated clustering of homologous domains, Briefings in Bioinformatics, 3, (2002), 246-251.

2. M. Dumontier, R. Yao, H.J. Feldman, C.W.V. Hogue, Armadillo: domain boundary prediction by amino acid composition, Journal of Molecular Biology, 350, (2005), 1061-1073.

3. R.A. George, K. Lin, J. Heringa, Scooby-domain: prediction of globular domains in protein sequence, Nucleic Acids Research, 33, (2005), W160-W163.

4. M. Lexa, G. Valle, Combining rapid word searches with segment-to-segment alignment for sensitive similarity detection, domain identification and structural modelling, Proceedings of the Bioinformatics Italian Society Meeting Padova 26-27 March 2004, (2004), A66.