CatBoost for Big Data: an Interdisciplinary Review

2020 ◽  
Vol 7 (1) ◽  
Author(s):  
John T. Hancock ◽  
Taghi M. Khoshgoftaar

Abstract Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to learn best practices from studies that cast CatBoost in a positive light, as well as from studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree-based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
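As a hedged illustration of the two points the abstract emphasizes, native handling of categorical features and sensitivity to hyper-parameters, the following sketch trains CatBoost with a small grid search; the dataset columns and parameter values are assumptions for illustration, not taken from the review.

```python
# Minimal CatBoost sketch: categorical features passed natively, plus a small
# hyper-parameter search. Column names and parameter grid are illustrative.
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical heterogeneous dataset mixing categorical and numeric columns.
df = pd.DataFrame({
    "device":  ["mobile", "desktop", "mobile", "tablet"] * 50,
    "country": ["US", "DE", "IN", "US"] * 50,
    "visits":  range(200),
    "label":   [0, 1, 0, 1] * 50,
})
X, y = df.drop(columns="label"), df["label"]
cat_features = ["device", "country"]  # passed directly, no one-hot encoding needed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyper-parameter tuning matters for CatBoost; a small grid as an example.
grid = {"depth": [4, 6], "learning_rate": [0.03, 0.1], "iterations": [200]}
search = GridSearchCV(
    CatBoostClassifier(cat_features=cat_features, verbose=0),
    grid, cv=3, scoring="roc_auc",
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```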


2020 ◽  
pp. 214-244
Author(s):  
Prithish Banerjee ◽  
Mark Vere Culp ◽  
Kenneth Joseph Ryan ◽  
George Michailidis

This chapter presents some popular graph-based semi-supervised approaches. These techniques apply to classification and regression problems and can be extended to big-data problems using recently developed anchor-graph enhancements. The background necessary for understanding this chapter includes linear algebra and optimization; no prior knowledge of machine learning methods is necessary. An empirical demonstration of these methods on real benchmark data sets is also provided.
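For readers who want a concrete starting point, the sketch below shows graph-based semi-supervised classification with scikit-learn's LabelSpreading on synthetic data; it is a generic label-propagation example, not the anchor-graph methods the chapter develops.

```python
# Graph-based semi-supervised classification sketch using label spreading.
# Unlabelled points are marked with -1 and receive labels propagated over
# a k-nearest-neighbour similarity graph. Synthetic data, illustrative only.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
y_partial = np.full_like(y, -1)
labelled = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_partial[labelled] = y[labelled]          # only 10 points keep their labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print("accuracy on all points:", (model.transduction_ == y).mean())
```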


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Noura AlNuaimi ◽  
Mohammad Mehedy Masud ◽  
Mohamed Adel Serhani ◽  
Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting a relevant subset of features from highly dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to settings in which features arrive consecutively over time; the number of features is not known in advance, whereas the number of instances is fixed. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to solve this problem. This paper presents an exhaustive and methodical introduction to these techniques. This study reviews the traditional feature-selection algorithms and then scrutinizes the current streaming-feature-selection algorithms to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.
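A minimal sketch of the streaming setting the survey describes follows: features arrive one at a time and each is kept or discarded on arrival. The mutual-information threshold used as the acceptance test is an illustrative stand-in, not any specific algorithm from the survey.

```python
# Streaming feature selection sketch: features arrive one by one and are
# accepted only if they appear relevant to the target. The mutual-information
# threshold is an illustrative criterion, not a specific published algorithm.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
n_samples = 500
y = rng.randint(0, 2, n_samples)

selected = []            # indices of features kept so far
threshold = 0.01         # assumed relevance cut-off

def feature_stream(n_features=50):
    """Yield one feature column at a time, simulating streaming arrival."""
    for j in range(n_features):
        informative = rng.rand() < 0.3
        col = y + rng.normal(scale=0.5, size=n_samples) if informative \
              else rng.normal(size=n_samples)
        yield j, col.reshape(-1, 1)

for j, col in feature_stream():
    score = mutual_info_classif(col, y, random_state=0)[0]
    if score > threshold:
        selected.append(j)   # keep the feature; a real method might also
                             # re-check redundancy against 'selected' here

print("kept", len(selected), "of 50 streamed features")
```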


Digital technology has been changing rapidly in recent years, and with this change the number of data systems, sources, and formats has grown exponentially. The process of extracting data from these multiple source systems and transforming it to suit various analytics processes is therefore becoming increasingly important. For Big Data, this transformation is particularly challenging because data generation is a continuous process. In this paper, we extract data from various heterogeneous sources on the web and transform it into a form widely used in data warehousing, so that it caters to the analytical needs of the machine learning community.
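A small sketch of the extract-and-transform step described above, assuming JSON and CSV as example source formats and using pandas to flatten them into one warehouse-style table; the records and field names are placeholders, not from the paper.

```python
# Extract-transform sketch: pull heterogeneous records from different source
# formats and normalise them into one flat, warehouse-friendly table.
# Records and field names are placeholders.
import json
import pandas as pd

# Source 1: JSON export (e.g. from a web API), nested structure.
json_records = json.loads('[{"id": 1, "user": {"name": "a"}, "amount": 10.5},'
                          ' {"id": 2, "user": {"name": "b"}, "amount": 7.0}]')
df_json = pd.json_normalize(json_records)          # flattens user.name

# Source 2: CSV-style export with a different schema.
df_csv = pd.DataFrame({"id": [3, 4], "user.name": ["c", "d"], "amount": [3, 9]})

# Transform: align schemas, enforce types, and add load metadata.
combined = pd.concat([df_json, df_csv], ignore_index=True)
combined = combined.rename(columns={"user.name": "user_name"})
combined["amount"] = combined["amount"].astype(float)
combined["load_ts"] = pd.Timestamp.now(tz="UTC")

print(combined)   # flat table ready to load into a warehouse fact table
```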


Author(s):  
Yi-Qi Hu ◽  
Yang Yu ◽  
Wei-Wei Tu ◽  
Qiang Yang ◽  
Yuqiang Chen ◽  
...  

Automatic machine learning (AutoML) aims at automatically choosing the best configuration for machine learning tasks. However, a configuration evaluation can be very time-consuming, particularly on learning tasks with large datasets. This limitation usually restrains derivative-free optimization from releasing its full power for a fine configuration search using many evaluations. To alleviate this limitation, in this paper we propose a derivative-free optimization framework for AutoML using multi-fidelity evaluations. It uses many low-fidelity evaluations on small data subsets and very few high-fidelity evaluations on the full dataset. However, the low-fidelity evaluations can be badly biased and need to be corrected at very low cost. We thus propose the Transfer Series Expansion (TSE), which learns the low-fidelity correction predictor efficiently by linearly combining a set of base predictors. The base predictors can be obtained cheaply from down-scaled and previously experienced tasks. Experimental results on real-world AutoML problems verify that the proposed framework can significantly accelerate derivative-free configuration search by making use of multi-fidelity evaluations.
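The correction idea the abstract describes, predicting high-fidelity scores as a linear combination of base predictors fitted from only a few full evaluations, can be sketched as follows; the scores are synthetic and this is a simplified reading of TSE, not the authors' implementation.

```python
# Sketch of the multi-fidelity correction idea: learn a predictor that maps
# cheap low-fidelity scores to full-data scores as a linear combination of
# base predictors. Synthetic data; a simplified reading of the TSE idea.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_configs = 40

# Low-fidelity scores: accuracy of each configuration on a small data subset.
low_fid = rng.uniform(0.6, 0.9, size=n_configs)

# Base predictors obtained cheaply from down-scaled / previously seen tasks;
# here each is an assumed fixed transformation of the low-fidelity score.
base_preds = np.column_stack([
    low_fid,                       # identity
    low_fid ** 2,                  # simple nonlinear base predictor
    np.clip(low_fid + 0.05, 0, 1),
])

# A handful of expensive high-fidelity evaluations on the full dataset.
true_high = np.clip(low_fid + 0.08 + rng.normal(0, 0.01, n_configs), 0, 1)
budget = 8                                  # only 8 full evaluations allowed
idx = rng.choice(n_configs, budget, replace=False)

# Learn the linear combination of base predictors from the few pairs we have.
corrector = LinearRegression().fit(base_preds[idx], true_high[idx])
predicted_high = corrector.predict(base_preds)
print("mean abs. correction error:", np.abs(predicted_high - true_high).mean())
```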


Author(s):  
Sai Hanuman Akundi ◽  
Soujanya R ◽  
Madhuri PM

In recent years, vast quantities of data have been generated by a wide range of medical applications, and multiple organizations worldwide have produced this type of data; together, these heterogeneous data are called big data. Data characterized by large volume, high velocity, and wide variety are what the term big data denotes. The healthcare sector, renowned for generating large amounts of heterogeneous data, has faced the need to handle such data from many different sources. Big Data analysis can support better decisions in the health system by adapting some of the current machine learning algorithms. When we have a large amount of data from which we want to predict outcomes or identify patterns, machine learning is the way forward. In this article, a brief overview of Big Data, its functionality, and the methods of Big Data analytics is presented; these play an important role in, and significantly affect, healthcare information technology. Within this paper we present a comparative study of machine learning algorithms. We need to make effective use of the current machine learning algorithms to anticipate accurate outcomes in healthcare.
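As an illustration of the kind of comparative study the article mentions, the sketch below cross-validates a few standard classifiers on a public clinical dataset; the chosen models and dataset are common examples and not necessarily those the authors compare.

```python
# Comparative-study sketch: evaluate several standard classifiers with
# cross-validation on a public dataset. The models and dataset are common
# examples, not necessarily those used in the article.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "random forest":       RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```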


2018 ◽  
Author(s):  
Dan McQuillan

Machine learning is a form of knowledge production native to the era of big data. It is at the core of social media platforms and everyday interactions. It is also being rapidly adopted for research and discovery across academia, business and government. This paper will explore the way the affordances of machine learning itself, and the forms of social apparatus that it becomes a part of, will potentially erode ethics and draw us into a drone-like perspective. Unconstrained machine learning enables and delimits our knowledge of the world in particular ways: the abstractions and operations of machine learning produce a ‘view from above’ whose consequences for both ethics and legality parallel the dilemmas of drone warfare. The family of machine learning methods is not somehow inherently bad or dangerous, nor does implementing them signal any intent to cause harm. Nevertheless, the machine learning assemblage produces a targeting gaze whose algorithms obfuscate the legality of its judgements, and whose iterations threaten to create both specific injustices and broader states of exception. Given the urgent need to provide some kind of balance before machine learning becomes embedded everywhere, this paper proposes people’s councils as a way to contest machinic judgements and reassert openness and discourse.


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Wen Xiao ◽  
Juan Hu

Clustering is one of the most important unsupervised machine learning tasks and is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, classical clustering algorithms cannot meet the requirements of clustering big data. Spark is one of the most popular parallel processing platforms for big data, and many researchers have proposed parallel clustering algorithms based on it. In this paper, the existing Spark-based parallel clustering algorithms are classified and summarized, the parallel design framework of each kind of algorithm is discussed, and, after comparing the different kinds of algorithms, directions for future research are discussed.
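A minimal example of the kind of Spark-based parallel clustering the survey covers, using MLlib's distributed k-means; the local session and synthetic points are assumptions for illustration.

```python
# Minimal PySpark k-means sketch: MLlib's distributed implementation of one
# classical clustering algorithm the survey covers. Local session and
# synthetic points are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.master("local[*]").appName("kmeans-demo").getOrCreate()

# Small synthetic dataset; in a real big-data setting this would be a
# distributed DataFrame read from HDFS, S3, or another cluster store.
df = spark.createDataFrame(
    [(0.0, 0.1), (0.2, 0.0), (9.8, 10.1), (10.0, 9.9)],
    ["x", "y"],
)
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=1, featuresCol="features").fit(features)
print(model.clusterCenters())

spark.stop()
```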

