Adjusted geometric-mean: A novel performance measure for imbalanced bioinformatics datasets learning

Rukshan Batuwita, Vasile Palade

Research output: Contribution to journal › Article › peer-review

51 Citations (Scopus)

Abstract

One common and challenging problem faced by many bioinformatics applications, such as promoter recognition, splice site prediction, RNA gene prediction, drug discovery and protein classification, is the imbalance of the available datasets. In most of these applications, the positive data examples are largely outnumbered by the negative data examples, which often leads to sub-optimal prediction models with a high negative recognition rate (specificity, SP) and a low positive recognition rate (sensitivity, SE). When class imbalance learning methods are applied, the SE usually increases at the expense of some reduction in the SP. In this paper, we point out that in these data-imbalanced bioinformatics applications, the goal of applying class imbalance learning methods is to increase the SE as much as possible while keeping the reduction in SP as small as possible. We explain that the existing performance measures used in class imbalance learning can still produce sub-optimal models with respect to this classification goal. To overcome these problems, we introduce a new performance measure called Adjusted Geometric-mean (AGm). The experimental results obtained on ten real-world imbalanced bioinformatics datasets demonstrate that the AGm metric can achieve a lower rate of reduction of SP than the existing performance metrics when increasing the SE through class imbalance learning methods. This characteristic of the AGm metric makes it more suitable for achieving the proposed classification goal when learning from imbalanced bioinformatics datasets.
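
As an illustration of the quantities named in the abstract, the following minimal Python sketch computes SE and SP from confusion-matrix counts, together with the plain geometric mean of the two, which is one commonly used measure in class imbalance learning. The function names and all counts are hypothetical and chosen only for illustration; the AGm formula itself is not stated in the abstract and is therefore not reproduced here.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Positive recognition rate (SE): TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tn: int, fp: int) -> float:
    """Negative recognition rate (SP): TN / (TN + FP)."""
    total = tn + fp
    return tn / total if total else 0.0


def g_mean(se: float, sp: float) -> float:
    """Plain geometric mean of SE and SP (not the paper's AGm)."""
    return math.sqrt(se * sp)


def summarize(tp: int, fn: int, tn: int, fp: int) -> tuple:
    """Return (SE, SP, G-mean) for one set of confusion-matrix counts."""
    se = sensitivity(tp, fn)
    sp = specificity(tn, fp)
    return se, sp, g_mean(se, sp)


if __name__ == "__main__":
    # Hypothetical imbalanced dataset: 50 positives, 950 negatives.
    # "baseline" mimics a model biased toward the negative class;
    # "rebalanced" mimics the same task after class imbalance learning.
    scenarios = {
        "baseline": dict(tp=10, fn=40, tn=940, fp=10),
        "rebalanced": dict(tp=40, fn=10, tn=855, fp=95),
    }
    for name, counts in scenarios.items():
        se, sp, gm = summarize(**counts)
        print(f"{name}: SE={se:.3f} SP={sp:.3f} G-mean={gm:.3f}")
```

In these made-up counts, the rebalanced model gains SE by giving up some SP, which is the trade-off the abstract describes.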

Original language: English
Article number: 1250003
Number of pages: 23
Journal: Journal of Bioinformatics and Computational Biology
Volume: 10
Issue number: 4
Publication status: Published - Aug 2012
Externally published: Yes

Keywords

  • Class imbalance dataset learning
  • imbalanced bioinformatics datasets
  • performance measures
  • Support Vector Machines

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
