Comparison and evaluation of microarray feature selection methods.




Abstract Motivation: Numerous feature selection approaches have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, the classical t-statistic, moderated t-statistics and other methods. Even though these methods return gene lists that are often dissimilar, few direct comparisons of them exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to nine publicly available datasets and compare both the gene lists produced and how well these lists perform in class prediction of test datasets.
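As an illustration of why such methods can return dissimilar gene lists, the following minimal sketch ranks three toy genes by fold change and by a t-statistic. It is not the authors' R code: the function names, the toy expression values and the use of the Welch (unequal variance) form of the t-statistic are all assumptions made for this example, and the values are assumed to be on a log2 scale.

```python
import math
from statistics import mean, variance

def log_fold_change(a, b):
    """Difference of group means; assumes values are already log2-scaled."""
    return mean(a) - mean(b)

def welch_t(a, b):
    """Welch (unequal variance) t-statistic for two groups of samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes(expr, labels, score):
    """Return gene names sorted by decreasing absolute score."""
    ranked = []
    for gene, values in expr.items():
        a = [v for v, l in zip(values, labels) if l == 0]
        b = [v for v, l in zip(values, labels) if l == 1]
        ranked.append((abs(score(a, b)), gene))
    return [gene for _, gene in sorted(ranked, reverse=True)]

# Hypothetical toy data: three samples per class.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [5.0, 5.1, 4.9, 7.0, 7.1, 6.9],   # moderate shift, low noise
    "geneB": [5.0, 9.0, 1.0, 8.0, 12.0, 4.0],  # larger shift, high noise
    "geneC": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],   # no shift
}

by_fc = rank_genes(expr, labels, log_fold_change)
by_t = rank_genes(expr, labels, welch_t)
print(by_fc)  # geneB ranks first: largest mean difference
print(by_t)   # geneA ranks first: difference is large relative to its variance
```

Fold change ignores within-class variance, whereas the t-statistic scales the mean difference by it, so the two rankings disagree even on this small example.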

Results: In this study, we compare the efficiency of the following feature selection methods: significance analysis of microarrays (SAM), analysis of variance (ANOVA), empirical Bayes t-statistic, template matching, maxT, between group analysis, area under the ROC curve, the t-statistic, fold change, and a set of randomly selected genes. Each of these ten methods was applied to nine different binary (two-class) microarray datasets. Firstly, we found little consensus in the gene lists produced by the nine feature selection methods (~25%). Secondly, we evaluated the class prediction efficiency of each gene list in both jack-knife (leave-one-out) and training-and-test cross-validation using four supervised classifiers. We report that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) available substantially influence classification success. Overall, we find that area under the ROC curve and the empirical Bayes t-statistic (found in the limma package from Bioconductor) outperform the other methods examined.
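Two of the quantities mentioned above can be sketched in a few lines: a per-gene area under the ROC curve, and the agreement between several gene lists. This is a hedged illustration only. The pairwise (Mann-Whitney style) AUC and the definition of consensus as the share of genes common to all lists are assumptions for this example, and the manuscript may define both differently; all names and toy values are hypothetical.

```python
def auc_score(values, labels):
    """Area under the ROC curve for one gene in a two-class dataset.

    Computed as the fraction of (class 1, class 0) sample pairs in which
    the class 1 value is higher; ties count as half a win.
    """
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def overlap_fraction(gene_lists):
    """Fraction of genes common to all lists, relative to the shortest list."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return len(common) / min(len(genes) for genes in gene_lists)

# Hypothetical toy data.
labels = [0, 0, 0, 1, 1, 1]
perfect = auc_score([1, 2, 3, 4, 5, 6], labels)   # classes fully separated
partial = auc_score([1, 4, 2, 3, 5, 6], labels)   # one misordered pair

lists = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g3", "g5", "g6"],
    ["g3", "g2", "g7", "g8"],
]
consensus = overlap_fraction(lists)
print(perfect, partial, consensus)
```

An AUC near 0.5 indicates a gene that does not separate the classes, so genes are usually ranked by distance from 0.5 rather than by the raw value.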

Availability: All computations were performed using the open source statistical language R and Bioconductor. The R code is available on request.
Authors Jeffery IB, Higgins DG, Culhane AC
Bioinformatics, Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland.
Publication Date Submitted to Bioinformatics
Contact emails
Keywords feature selection; gene selection; microarray dataset; cancer; microarray
Supplemental Information These data are the processed tab-delimited files. The leukaemia, colon and ALL datasets were available in the Bioconductor libraries golubEsets, colonCA and ALL. The leukaemia and colon data were further processed using quantile normalisation. The data for the other datasets were downloaded from the authors' websites or GEO as raw data files (.cel files); gene expression values were called using the robust multichip average (RMA; Irizarry et al., 2003) method, and data were quantile normalised using the Bioconductor package affy. See the manuscript for further details.
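For readers unfamiliar with quantile normalisation, the idea can be sketched as follows: every sample is forced onto the same reference distribution, taken as the mean across samples at each rank. This is a simplified illustration, not the affy implementation used for these files; the function name and toy values are hypothetical, and tied values are ordered by position rather than averaged.

```python
def quantile_normalise(columns):
    """Quantile-normalise a list of equal-length sample columns.

    Simplification: ties within a column are broken by position
    instead of being assigned an averaged value.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(sc[i] for sc in sorted_cols) / len(columns)
                 for i in range(n)]
    normalised = []
    for col in columns:
        # Indices of this column's values in increasing order.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised

# Hypothetical toy data: two samples (columns), three genes each.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalise(samples)
print(result)
```

After normalisation each sample contains the same set of values, differing only in which gene receives which value.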

Description                                Data File                  Size
Normal vs. tumour colon cancer dataset     Alon et al. (1999)         2,112 KB
Acute lymphoblastic leukaemia dataset      Chiaretti et al. (2004)    27,409 KB
ALL vs. AML dataset                        Golub et al. (1999)        6,190 KB
Follicular lymphoma vs. DLBCL dataset      Shipp et al. (2002)        9,344 KB
Normal vs. tumour prostate dataset         Singh et al. (2002)        21,888 KB
Multiple myeloma bone lesion dataset       Tian et al. (2003)         37,009 KB