Abstract
Data analysis, regardless of whether the data are expected to explain a quantitative model (as in regression) or a categorical one (as in classification), often requires overcoming various barriers. These include unbalanced datasets, faulty measurement results, and incomplete data.
A special group of barriers is typical for what is nowadays referred to as “Big Data.” The term is used to characterize problems where the available datasets are too large to be handled easily with traditional machine learning tools and approaches. It is now generally accepted that dealing with huge and complex datasets poses many processing challenges, opens a range of research and technological problems, and calls for new approaches.
Big Data is often characterized by the well-known 5V properties:
(i) Volume: typically a huge amount of data,
(ii) Velocity: the speed at which data are generated, including their dynamics and evolution in time,
(iii) Variety: multiple, heterogeneous, and complex data representations,
(iv) Veracity: the uncertainty of data and the lack of quality assurance,
(v) Value: the potential business value that Big Data analysis could offer.
The Big Data environment is often a distributed one, with distributed data sources. These sources can be heterogeneous, differing in various respects, including storage technologies and representation methods.
The challenges of Big Data involve not only overcoming the 5V properties but also developing techniques for data capturing, transformation, integration, and modelling. Other important issues concern the privacy, security, governance, and ethical aspects of Big Data analysis.
Current advances in dealing with Big Data problems, albeit in many cases spectacular, are far from satisfactory for real-life applications. This is especially true in the numerous domains where machine learning tasks are crucial to obtaining knowledge of different processes and properties, such as bioinformatics, text mining, or security. Unfortunately, the majority of current algorithms become ineffective when the problem grows very large, since the underlying combinatorial optimization problems are, as a rule, computationally difficult. A variety of methods and tools exist that are excellent at solving small and medium-size machine learning tasks but become unsatisfactory when dealing with large ones.
Current hot topics in the quest to improve the effectiveness of machine learning techniques include the search for compact knowledge representation methods and better tools for knowledge discovery and integration. Machine learning may also profit from integrating collective intelligence techniques, applying evolutionary and bio-inspired techniques, and further exploring deep and extreme learning techniques.
Original language | English
---|---
Number of pages | 3
Journal | Complexity
Volume | 2018
DOIs | |
Publication status | Published - 30 Dec 2018