Overcoming “Big Data” Barriers in Machine Learning Techniques for the Real-Life Applications

Ireneusz Czarnowski, Piotr Jędrzejowicz, Kuo-Ming Chao, Tülay Yildirim

Research output: Contribution to journal › Article


Abstract

Data analysis, regardless of whether the data are expected to explain a quantitative (as in regression) or a categorical (as in classification) model, often requires overcoming various barriers. These include unbalanced datasets, faulty measurement results, and incomplete data.
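Two of these barriers, class imbalance and incomplete data, can be made concrete with a minimal sketch. The toy dataset and the mean-imputation strategy below are illustrative assumptions, not taken from the article:

```python
from collections import Counter

# Hypothetical toy dataset: each row is (feature, label); None marks a missing value.
rows = [(1.0, "pos"), (None, "neg"), (2.0, "neg"), (4.0, "neg"), (3.0, "neg"), (None, "pos")]

# Barrier 1: class imbalance -- measure the ratio of majority to minority class counts.
counts = Counter(label for _, label in rows)
imbalance_ratio = max(counts.values()) / min(counts.values())

# Barrier 2: incomplete data -- fill missing feature values with the mean of observed ones.
observed = [x for x, _ in rows if x is not None]
mean = sum(observed) / len(observed)
imputed = [(x if x is not None else mean, label) for x, label in rows]

print(imbalance_ratio)  # 2.0 (four "neg" rows vs two "pos" rows)
print(imputed[1][0])    # 2.5 (mean of 1.0, 2.0, 4.0, 3.0)
```

In practice the imputation strategy (mean, median, model-based) depends on why the values are missing; the sketch only shows the shape of the problem.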

A special group of barriers is typical of what is nowadays referred to as “Big Data.” The term is used to characterize problems where the available datasets are too large to be dealt with easily using traditional machine learning tools and approaches. It is now generally accepted that dealing with huge and complex datasets poses many processing challenges, opens a range of research and technological problems, and calls for new approaches.

Big Data is often characterized by the well-known 5V properties:
(i) Volume: typically a huge amount of data,
(ii) Velocity: the speed at which data are generated, including their dynamics and evolution over time,
(iii) Variety: multiple, heterogeneous, and complex data representations,
(iv) Veracity: the uncertainty of data and the lack of quality assurance,
(v) Value: the potential business value that Big Data analysis could offer.

A Big Data environment is often a distributed one, with distributed data sources. These sources can be heterogeneous, differing in various respects, including storage technologies and representation methods.

The challenges of Big Data involve not only overcoming the 5V properties but also developing techniques for data capture, transformation, integration, and modelling. Other important issues concern the privacy, security, governance, and ethical aspects of Big Data analysis.

Current advances in dealing with Big Data problems, albeit in many cases spectacular, are far from satisfactory for real-life applications. This is especially true in the numerous domains where machine learning tasks are crucial to obtaining knowledge of different processes and properties, such as bioinformatics, text mining, or security. Unfortunately, the majority of current algorithms become ineffective when the problem becomes very large, since the underlying combinatorial optimization problems are, as a rule, computationally difficult. A variety of methods and tools are excellent at solving small and medium-size machine learning tasks but become unsatisfactory when dealing with large ones.
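One common workaround for this scale barrier is data reduction: train on a subsample of instances instead of the full dataset. The sketch below, with synthetic data and a nearest-centroid "model" chosen purely for illustration (it is not a method from the article), shows that on well-separated data a 1% uniform sample can be enough:

```python
import random

random.seed(7)

# Synthetic two-class data: class 0 clusters near 0.0, class 1 near 10.0.
full_data = [(random.gauss(0.0, 1.0), 0) for _ in range(5000)] + \
            [(random.gauss(10.0, 1.0), 1) for _ in range(5000)]

def train_centroids(data):
    # Nearest-centroid "model": the per-class mean of the single feature.
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    # Assign x to the class whose centroid is closest.
    return min(centroids, key=lambda y: abs(x - centroids[y]))

# Instance selection by uniform sampling: train on 100 of the 10,000 instances.
sample = random.sample(full_data, 100)
centroids = train_centroids(sample)

accuracy = sum(predict(centroids, x) == y for x, y in full_data) / len(full_data)
print(round(accuracy, 2))
```

Uniform sampling is the simplest instance-selection strategy; on harder, overlapping, or imbalanced data, more careful selection or stratification would be needed.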

Current hot topics in the quest to improve the effectiveness of machine learning techniques include the search for compact knowledge representation methods and better tools for knowledge discovery and integration. Machine learning may also profit from integrating collective intelligence techniques, applying evolutionary and bio-inspired techniques, and further exploring deep and extreme learning techniques.
Original language: English
Number of pages: 3
Journal: Complexity
Volume: 2018
DOI: 10.1155/2018/1234390
Publication status: Published - 30 Dec 2018

Fingerprint

Learning systems
Combinatorial optimization
Knowledge representation
Bioinformatics
Big data
Quality assurance
Data mining
Profitability
Processing
Industry

Bibliographical note

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite this

Overcoming “Big Data” Barriers in Machine Learning Techniques for the Real-Life Applications. / Czarnowski, Ireneusz; Jędrzejowicz, Piotr; Chao, Kuo-Ming; Yildirim, Tülay.

In: Complexity, Vol. 2018, 30.12.2018.

@article{907bd7944a174eccbb29d513bbc91d21,
title = "Overcoming “Big Data” Barriers in Machine Learning Techniques for the Real-Life Applications",
author = "Ireneusz Czarnowski and Piotr Jędrzejowicz and Kuo-Ming Chao and T{\"u}lay Yildirim",
note = "This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.",
year = "2018",
month = "12",
day = "30",
doi = "10.1155/2018/1234390",
language = "English",
volume = "2018",
journal = "Complexity",
issn = "1076-2787",
publisher = "Wiley",

}
