Bmc Bioinformatics
-
Supertree methods combine phylogenies with overlapping sets of taxa into a larger one. Topological conflicts frequently arise among source trees for methodological or biological reasons, such as long branch attraction, lateral gene transfers, gene duplication/loss or deep gene coalescence. When topological conflicts occur among source trees, liberal methods infer supertrees containing the most frequent alternative, while veto methods infer supertrees not contradicting any source tree, i.e. discard all conflicting resolutions. When the source trees host a significant number of topological conflicts or have a small taxon overlap, supertree methods of both kinds can propose poorly resolved, hence uninformative, supertrees. ⋯ The paper introduces and tests two new methodologies, PhySIC_IST and STC, that demonstrate the interest in inferring non-plenary supertrees as well as preprocessing the source trees. An implementation of the methods is available at: http://www.atgc-montpellier.fr/physic_ist/.
-
Multicategory Support Vector Machines (MC-SVM) are powerful classification systems with excellent performance in a variety of data classification problems. Since the process of generating models in traditional multicategory support vector machines for large datasets is very computationally intensive, there is a need to improve the performance using high performance computing techniques. ⋯ The experiments show that the PMC-SVM can significantly improve the performance of classification of microarray data without loss of accuracy, compared with previous work.
-
Comparative Study
Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data.
Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, classical t-statistic and moderated t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of these exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets, and compare, both the gene lists produced and how these perform in class prediction of test datasets. ⋯ We report that the choice of feature selection method, the number of genes in the genelist, the number of cases (samples) and the noise in the dataset, substantially influence classification success. Recommendations are made for choice of feature selection. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performs well when datasets had low numbers of samples or high levels of noise. The Empirical bayes t-statistic performed well across a range of sample sizes.
-
Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data. ⋯ We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.