Abstract
One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive examples are heavily outnumbered by the negative examples, which often leads to sub-optimal prediction models with a high negative recognition rate (Specificity = SP) and a low positive recognition rate (Sensitivity = SE). When class imbalance learning methods are applied, the SE usually increases at the expense of some reduction in the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods should be to increase the SE as much as possible while keeping the reduction in SP as low as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. To overcome this problem, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric achieves a lower rate of reduction of SP than the existing performance metrics when the SE is increased through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal when learning from imbalanced bioinformatics datasets.
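To make the SE/SP trade-off described above concrete, the following is a minimal sketch of how the two rates are computed from confusion-matrix counts, together with the standard geometric mean of SE and SP, which is assumed here to stand in for the existing measures the abstract refers to. The function names and all counts are hypothetical illustrations, not data from the paper. The AGm formula itself is not stated in this abstract, so it is not sketched here.

```python
import math


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (SE, SP) from confusion-matrix counts.

    SE = TP / (TP + FN): positive recognition rate.
    SP = TN / (TN + FP): negative recognition rate.
    """
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return se, sp


def geometric_mean(se, sp):
    """Standard G-mean of SE and SP (assumed here as an existing measure)."""
    return math.sqrt(se * sp)


# Hypothetical imbalanced test set: 100 positives, 1000 negatives.
# Model trained without any imbalance handling: high SP, low SE.
before_se, before_sp = sensitivity_specificity(tp=40, fn=60, tn=990, fp=10)

# Same hypothetical test set after applying a class imbalance learning
# method: SE rises, SP drops by some amount.
after_se, after_sp = sensitivity_specificity(tp=80, fn=20, tn=900, fp=100)

print(f"before: SE={before_se:.2f} SP={before_sp:.2f} "
      f"Gm={geometric_mean(before_se, before_sp):.3f}")
print(f"after:  SE={after_se:.2f} SP={after_sp:.2f} "
      f"Gm={geometric_mean(after_se, after_sp):.3f}")
```

In this made-up case the SE doubles while the SP falls by nine percentage points; with ten times as many negatives as positives, that drop corresponds to 90 additional false positives, which is why the abstract stresses limiting the reduction in SP.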
Original language | English |
---|---|
Article number | 1250003 |
Number of pages | 23 |
Journal | Journal of Bioinformatics and Computational Biology |
Volume | 10 |
Issue number | 4 |
Publication status | Published - Aug 2012 |
Externally published | Yes |
Keywords
- Class imbalance dataset learning
- imbalanced bioinformatics datasets
- performance measures
- Support Vector Machines
ASJC Scopus subject areas
- Biochemistry
- Molecular Biology
- Computer Science Applications