The data points are first clustered according to the selected features

Regarding the second issue, a few recent studies have aimed to combine unsupervised data clustering with feature selection. In one study, the authors proposed alternating between data clustering and feature selection iteratively. In each iteration, the data points are first clustered according to the selected features, and then FDA is applied to identify a new set of features according to the cluster labels. In another study, the iterative procedure is improved by converting the original problem into a convex optimization problem. However, none of these studies is able to exploit prior knowledge about the data, which is important in microarray data analysis.

Here, we present a general framework for feature selection that overcomes the two shortcomings simultaneously. The proposed framework integrates the ontology information of the genes with their expression data to perform unsupervised data clustering, which groups similar cellular responses into clusters, and to identify the genes that are most discriminative among those clusters. Mixture regression models are first applied to cluster the multiple experimental conditions. Important genes with high correlation to each group of experimental conditions are then found by a regression model that automatically incorporates the GO information. The key genes that differentiate the groups of conditions are identified to provide insight into the differences among the biological processes. A major advantage of this method is that the functional information of the genes is easy to assimilate and update. Another major advantage is that it unifies unsupervised data clustering and supervised feature selection in a single framework. This combination allows us to identify genes relevant to multiple biological processes without having to know, a priori, which experimental condition is related to which biological process. This is important when conditions are difficult to classify or when the classification of conditions is unknown a priori. Finally, the proposed method identifies genes relevant to multiple cellular responses in parallel, which makes it an efficient high-throughput analysis.
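The alternating procedure used in the earlier studies (cluster on the selected features, then re-select features from the cluster labels) can be sketched roughly as follows. This is only an illustrative sketch under several assumptions, not the method of any of the cited studies: plain k-means stands in for the clustering step, a per-feature Fisher score (between-cluster over within-cluster variance) stands in for FDA, which is assumed here to mean Fisher discriminant analysis, and all function names and parameters (`kmeans`, `fisher_scores`, `alternate_cluster_select`, `n_keep`, `rounds`) are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster label per point.

    Assumption: a deterministic farthest-point initialization is used
    so that the sketch is reproducible.
    """
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def fisher_scores(data, labels):
    """Per-feature ratio of between-cluster to within-cluster scatter."""
    scores = []
    for j in range(len(data[0])):
        col = [row[j] for row in data]
        overall = sum(col) / len(col)
        between = within = 0.0
        for c in set(labels):
            vals = [v for v, lab in zip(col, labels) if lab == c]
            mean_c = sum(vals) / len(vals)
            between += len(vals) * (mean_c - overall) ** 2
            within += sum((v - mean_c) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))  # guard against 0 division
    return scores


def alternate_cluster_select(data, k, n_keep, rounds=5):
    """Alternate clustering and feature selection until the set is stable."""
    selected = list(range(len(data[0])))  # start from all features
    labels = []
    for _ in range(rounds):
        # Step 1: cluster using only the currently selected features.
        projected = [[row[j] for j in selected] for row in data]
        labels = kmeans(projected, k)
        # Step 2: score every feature against the new cluster labels.
        scores = fisher_scores(data, labels)
        ranked = sorted(range(len(scores)), key=lambda j: scores[j],
                        reverse=True)
        new_selected = sorted(ranked[:n_keep])
        if new_selected == selected:
            break  # feature set no longer changes
        selected = new_selected
    # labels come from the last clustering step that was run
    return selected, labels
```

Calling `alternate_cluster_select(data, k=2, n_keep=2)` on a list of numeric rows returns the indices of the retained features together with the cluster labels. Note that nothing in this loop uses prior knowledge such as gene ontology annotations, which is the limitation the text points out.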
