SECOM efficiently identifies all the sequentially homologous regions that recur within these proteins

Moreover, when the proteome data are given as the input, more information can be found. Homologous analysis of the sequences is assumed to provide evolutionary, functional, and structural information. The main difference between proteome-scale and single-protein-level domain detection is that a domain is assumed to be a recurring segment of amino acids within the proteome. Various homologous search approaches have been proposed to solve this problem. The DIVCLUS program performs allagainst-all Smith-Waterman pairwise comparisons. The resulting pairs are then merged using single linkage clustering. This method is quite sensitive but computationally expensive. The Domainer algorithm works in a similar manner. It first conducts an allagainst-all BLAST search to identify segment pairs with high degrees of homology. These segment pairs are then iteratively merged into consistent clusters. There are two main bottlenecks in the existing all-against-all alignment-based methods. First, after the pairwise alignment, irrelevant domains are clustered into the same domain by the clustering algorithms. For instance, a protein may comprise several different domains or even multiple copies of the same domain. The widely used single linkage-clustering algorithm merges these different domains into one due to the chain effect. Second, the asymptotic runtime of the most efficient method is still O, where N is the number of proteins in the inquiry dataset and m is the maximum length of the proteins in the dataset. This is too slow for the proteome-scale domain detection problem. To overcome these two bottlenecks, we propose a novel genome-scale domain detection method: SECOM, a hash SEed and COMmunity searching-based domain detection method. Given all the protein sequences from a genome. SECOM does not conduct all-againstall sequence comparisons. Instead, we assume that the domains of the input protein set have highly conserved segments. The highly conserved segments are not necessarily those sharing identical amino acids, however. They may be those with sequential similarities. SECOM identifies the highly conserved segments by using hash seeds as proposed in a recent study by Li et al.. We then formulate the domain detection SB203580 problem into a graph representation, in which each node is an input protein sequence and each edge represents the number of hash seeds shared between the two nodes. The problem is to identify all the strongly connected subgraphs. Such subgraphs, however, can overlap because a protein sequence can contain different domains. Therefore, we introduce a clique percolation algorithm to identify the strongly connected subgraphs, i.e., communities, in the graph. Each community corresponds to a domain detected by SECOM. In this way, SECOM is able to identify the overlapping domains. The runtime is nearly-linear to the size of the inputs and quadratic to the number of domains, which is a much smaller number than the size of the input.

Leave a Reply

Your email address will not be published.