Efficient resampling methods for training support vector machines with imbalanced datasets

Rukshan Batuwita, Vasile Palade

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding

38 Citations (Scopus)

Abstract

Random undersampling and oversampling are simple but well-known resampling methods applied to solve the problem of class imbalance. In this paper we show that the random oversampling method can produce better classification results than the random undersampling method, since oversampling can increase the minority class recognition rate while sacrificing less of the majority class recognition rate than undersampling does. However, random oversampling would considerably increase the computational cost of SVM training due to the addition of new training examples. In this paper we present an investigation carried out to develop efficient resampling methods that can produce classification results comparable to those of random oversampling, but using less data. The main idea of the proposed methods is to first select the most informative examples, located close to the class boundary region, by using the separating hyperplane found by training an SVM model on the original imbalanced dataset, and then use only those examples in resampling. We demonstrate that it is possible to obtain classification results comparable to the random oversampling results through two sets of efficient resampling methods which use 50% and 75% less data, respectively, compared to the sizes of the datasets generated by the random oversampling method.
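The boundary-selection idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hyperplane parameters `w` and `b` are assumed to come from an SVM already trained on the original imbalanced dataset, the fraction of examples kept is a free parameter, and the resampling step is simplified to plain random oversampling of the minority class among the selected examples.

```python
import numpy as np

def boundary_resample(X, y, w, b, keep_frac=0.5, rng=None):
    """Keep the keep_frac of examples closest to the separating
    hyperplane w.x + b = 0, then randomly oversample the minority
    class among the kept examples until the classes balance."""
    rng = np.random.default_rng(rng)
    # distance of each example to the hyperplane: |w.x + b| / ||w||
    dist = np.abs(X @ w + b) / np.linalg.norm(w)
    keep = np.argsort(dist)[: int(np.ceil(keep_frac * len(y)))]
    Xs, ys = X[keep], y[keep]
    # identify the minority class among the kept examples
    classes, counts = np.unique(ys, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    if deficit > 0:
        # duplicate random minority examples to close the gap
        idx = np.flatnonzero(ys == minority)
        extra = rng.choice(idx, size=deficit, replace=True)
        Xs = np.vstack([Xs, Xs[extra]])
        ys = np.concatenate([ys, ys[extra]])
    return Xs, ys
```

The point of the selection step is that examples far from the hyperplane rarely become support vectors, so dropping them before oversampling shrinks the training set (and SVM training cost) with little effect on the learned decision boundary.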

Original language: English
Title of host publication: 2010 IEEE World Congress on Computational Intelligence, WCCI 2010 - 2010 International Joint Conference on Neural Networks, IJCNN 2010
Publisher: IEEE
ISBN (Print): 9781424469178
DOIs: 10.1109/IJCNN.2010.5596787
Publication status: Published - 2010
Externally published: Yes
Event: 2010 6th IEEE World Congress on Computational Intelligence, WCCI 2010 - 2010 International Joint Conference on Neural Networks, IJCNN 2010 - Barcelona, Spain
Duration: 18 Jul 2010 - 23 Jul 2010

Conference

Conference: 2010 6th IEEE World Congress on Computational Intelligence, WCCI 2010 - 2010 International Joint Conference on Neural Networks, IJCNN 2010
Country: Spain
City: Barcelona
Period: 18/07/10 - 23/07/10

Keywords

  • Support vector machines
  • Training
  • Testing
  • Computational modeling
  • Machine learning
  • Digital signal processing
  • Computational efficiency

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

Batuwita, R., & Palade, V. (2010). Efficient resampling methods for training support vector machines with imbalanced datasets. In 2010 IEEE World Congress on Computational Intelligence, WCCI 2010 - 2010 International Joint Conference on Neural Networks, IJCNN 2010 [5596787] IEEE. https://doi.org/10.1109/IJCNN.2010.5596787
