A computational pipeline for data augmentation towards the improvement of disease classification and risk stratification models: A case study in two clinical domains

Vasileios C Pezoulas, Grigoris I Grigoriadis, George Gkois, Nikolaos S Tachos, Tim Smole, Zoran Bosnić, Matej Pičulin, Iacopo Olivotto, Fausto Barlocco, Marko Robnik-Šikonja, Djordje G Jakovljevic, Andreas Goules, Athanasios G Tzioufas, Dimitrios I Fotiadis

    Research output: Contribution to journal › Article › peer-review


    Abstract

    Virtual population generation is an emerging field in data science with numerous healthcare applications, notably the augmentation of clinical research databases that lack sufficient population size. However, the impact of data augmentation on the development of artificial intelligence (AI) models that address unmet clinical needs has not yet been investigated. In this work, we assess, for the first time in the literature, whether aggregating real with virtual patient data can improve the performance of existing risk stratification and disease classification models in two rare clinical domains, namely primary Sjögren's Syndrome (pSS) and hypertrophic cardiomyopathy (HCM). To do so, multivariate approaches, such as the multivariate normal distribution (MVND), and straightforward ones, such as Bayesian networks, artificial neural networks (ANNs), and tree ensembles, are compared in terms of their ability to generate high-quality virtual data. Both boosting and bagging algorithms, such as gradient boosting trees (XGBoost), AdaBoost, and Random Forests (RFs), were trained on the augmented data to evaluate the performance improvement for lymphoma classification and HCM risk stratification. Our results revealed the favorable performance of the tree ensemble generators in both domains, yielding virtual data with a goodness-of-fit of 0.021 and a KL-divergence of 0.029 in pSS, and 0.029 and 0.027, respectively, in HCM. Applying XGBoost to the augmented data increased accuracy by 10.9%, sensitivity by 10.7%, and specificity by 11.5% for lymphoma classification, and accuracy by 16.1%, sensitivity by 16.9%, and specificity by 13.7% for HCM risk stratification.
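
To make the augmentation idea concrete, the following is a minimal sketch of one of the generators named in the abstract, the multivariate normal distribution (MVND), together with a KL-divergence check between real and virtual data. It is not the authors' pipeline: the toy two-feature "real" cohort, the function names (`fit_mvnd`, `cholesky`, `sample_mvnd`, `kl_divergence`), the histogram-based KL estimator, and the bin count are all assumptions made for illustration. The tree ensemble generators that the abstract reports as performing best are not shown.

```python
import math
import random


def fit_mvnd(rows):
    """Estimate the mean vector and sample covariance matrix of the real data."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a):
    """Lower-triangular factor L with L @ L.T == a (a must be positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvnd(mean, cov, n, rng):
    """Draw n virtual records as x = mean + L z, with z ~ N(0, I)."""
    low = cholesky(cov)
    d = len(mean)
    out = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        out.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)])
    return out


def kl_divergence(real, virtual, bins=10, eps=1e-9):
    """Histogram-based estimate of KL(real || virtual) for a single feature."""
    lo, hi = min(real + virtual), max(real + virtual)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values coincide

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [c / len(values) + eps for c in counts]

    p, q = hist(real), hist(virtual)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical toy cohort: two correlated numeric features, 200 "real" records.
rng = random.Random(0)
real = []
for _ in range(200):
    x = rng.gauss(50.0, 10.0)
    real.append([x, 0.5 * x + rng.gauss(0.0, 5.0)])

mean, cov = fit_mvnd(real)
virtual = sample_mvnd(mean, cov, 400, rng)
augmented = real + virtual  # real and virtual records pooled for model training

for j in range(2):
    kl = kl_divergence([r[j] for r in real], [v[j] for v in virtual])
    print(f"feature {j}: KL(real || virtual) = {kl:.3f}")
```

In a setting like the one described above, the `augmented` records would then be passed to a classifier such as XGBoost, AdaBoost, or a Random Forest, and its accuracy, sensitivity, and specificity compared against a model trained on `real` alone. A lower KL-divergence indicates that the virtual feature distribution lies closer to the real one.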

    Original language: English
    Article number: 104520
    Number of pages: 12
    Journal: Computers in Biology and Medicine
    Volume: 134
    Early online date: 6 Jun 2021
    Publication status: Published - Jul 2021

    Bibliographical note

    This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

    Funder

    This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 777204.

    Keywords

    • Artificial intelligence
    • Data augmentation
    • HCM risk stratification
    • Lymphoma classification
    • Virtual population generation

    ASJC Scopus subject areas

    • Computer Science Applications
    • Health Informatics
