Background: The classification of phenotypes from microarray data has drawn much attention in the last
few years. Existing methods have mainly focused on selecting or constructing features based on either genes
or gene pairs from continuous-value gene expression data. However, few studies have sought to
identify useful features based on both genes and gene pairs from binary-value gene expression data.
Objective: In this work, we propose a new algorithm, called FSGGP, that selects both feature genes and
feature gene pairs from binary-value gene expression data to improve two-phenotype classification.
Method: For each gene or gene pair, we calculated the uncertainty coefficient, which measures how well a
phenotype is described by that gene or gene pair under a given candidate relationship; the exact relationship
between the gene or gene pair and the phenotype was then identified from the value of the uncertainty
coefficient. Next, the closeness between genes or gene pairs and phenotypes was calculated, and the genes
or gene pairs closely related to the phenotypes were selected. The redundancy among genes and gene pairs
as features was measured by cross entropy on the binary data, and redundant feature genes or gene pairs
were eliminated. The optimal feature sets were obtained by wrapper-based forward feature selection for
three classical classifiers.
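The scoring step described above can be illustrated with a minimal sketch. It uses the standard definition of the uncertainty coefficient, U(P|F) = (H(P) - H(P|F)) / H(P), on binary vectors. The candidate relationships (AND, OR, XOR), the rule of keeping the relationship with the largest coefficient, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the FSGGP implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def uncertainty_coefficient(feature, phenotype):
    """U(phenotype | feature) = (H(P) - H(P|F)) / H(P), a value in [0, 1]."""
    h_p = entropy(phenotype)
    if h_p == 0:
        return 0.0  # a constant phenotype carries no uncertainty to explain
    n = len(phenotype)
    h_cond = 0.0
    for value in set(feature):
        subset = [p for f, p in zip(feature, phenotype) if f == value]
        h_cond += len(subset) / n * entropy(subset)
    return (h_p - h_cond) / h_p


# Hypothetical candidate relationships for a gene pair (an assumption).
RELATIONS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def best_relation(gene_a, gene_b, phenotype):
    """Return the candidate relationship with the largest coefficient."""
    scores = {}
    for name, combine in RELATIONS.items():
        combined = [combine(a, b) for a, b in zip(gene_a, gene_b)]
        scores[name] = uncertainty_coefficient(combined, phenotype)
    name = max(scores, key=scores.get)
    return name, scores[name]


# Toy binary data: the phenotype equals gene_a XOR gene_b.
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
print(best_relation(gene_a, gene_b, phenotype))
print(uncertainty_coefficient(gene_a, phenotype))
```

In this toy example neither gene alone reduces the uncertainty about the phenotype, while the pair does under one of the candidate relationships, which is the kind of case that motivates scoring gene pairs in addition to single genes.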
Results: The algorithm was experimentally assessed on four public datasets. The results showed that
FSGGP achieved lower average classification error rates than four known feature selection algorithms
based on either genes or gene pairs.
Conclusion: We developed an algorithm to select both feature genes and feature gene pairs from binary-value
gene expression data, in which feature gene pairs are selected by identifying the
higher logical relationships between gene pairs and phenotypes. The comparison with four known feature
selection algorithms suggests that feature selection based on both genes and gene pairs can
achieve better performance than feature selection based on either genes or gene pairs alone, and that the
identification of higher logical relationships is an effective approach for selecting feature gene pairs.