CatBoost for Big Data: an Interdisciplinary Review

2020 ◽  
Vol 7 (1) ◽  
Author(s):  
John T. Hancock ◽  
Taghi M. Khoshgoftaar

Abstract Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and to learn best practices from studies that cast CatBoost in a positive light, as well as from studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree-based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding that helps clarify the proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
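As a hedged illustration of the two points the abstract emphasizes, native handling of categorical features and sensitivity to hyper-parameters, the following sketch trains CatBoost with a small grid search; the dataset columns and parameter values are assumptions for illustration, not taken from the review.

```python
# Minimal CatBoost sketch: categorical features passed natively, plus a small
# hyper-parameter search. Column names and parameter grid are illustrative.
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical heterogeneous dataset mixing categorical and numeric columns.
df = pd.DataFrame({
    "device":  ["mobile", "desktop", "mobile", "tablet"] * 50,
    "country": ["US", "DE", "IN", "US"] * 50,
    "visits":  range(200),
    "label":   [0, 1, 0, 1] * 50,
})
X, y = df.drop(columns="label"), df["label"]
cat_features = ["device", "country"]  # passed directly, no one-hot encoding needed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyper-parameter tuning matters for CatBoost; a small grid as an example.
grid = {"depth": [4, 6], "learning_rate": [0.03, 0.1], "iterations": [200]}
search = GridSearchCV(
    CatBoostClassifier(cat_features=cat_features, verbose=0),
    grid, cv=3, scoring="roc_auc",
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```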


2020 ◽  
pp. 214-244
Author(s):  
Prithish Banerjee ◽  
Mark Vere Culp ◽  
Kenneth Joseph Ryan ◽  
George Michailidis

This chapter presents some popular graph-based semi-supervised approaches. These techniques apply to classification and regression problems and can be extended to big-data problems using recently developed anchor-graph enhancements. The background necessary for understanding this chapter includes linear algebra and optimization; no prior knowledge of machine learning methods is necessary. An empirical demonstration of these methods on real benchmark data sets is also provided.
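For readers who want a concrete starting point, the sketch below shows graph-based semi-supervised classification with scikit-learn's LabelSpreading on synthetic data; it is a generic label-propagation example, not the anchor-graph methods the chapter develops.

```python
# Graph-based semi-supervised classification sketch using label spreading.
# Unlabelled points are marked with -1 and receive labels propagated over
# a k-nearest-neighbour similarity graph. Synthetic data, illustrative only.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
y_partial = np.full_like(y, -1)
labelled = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_partial[labelled] = y[labelled]          # only 10 points keep their labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y_partial)
print("accuracy on all points:", (model.transduction_ == y).mean())
```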


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Noura AlNuaimi ◽  
Mohammad Mehedy Masud ◽  
Mohamed Adel Serhani ◽  
Nazar Zaki

Organizations in many domains generate a considerable amount of heterogeneous data every day. Such data can be processed to enhance these organizations’ decisions in real time. However, storing and processing large and varied datasets (known as big data) in real time is challenging. In machine learning, streaming feature selection has long been considered a superior technique for selecting a relevant subset of features from highly dimensional data and thus reducing learning complexity. In the relevant literature, streaming feature selection refers to settings in which features arrive consecutively over time; the number of features is not known in advance, whereas the number of instances is fixed. Many scholars in the field have proposed streaming-feature-selection algorithms in attempts to solve this problem. This paper presents an exhaustive and methodical introduction to these techniques. This study reviews the traditional feature-selection algorithms and then scrutinizes the current streaming-feature-selection algorithms to determine their strengths and weaknesses. The survey also sheds light on the ongoing challenges in big-data research.
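A minimal sketch of the streaming setting the survey describes follows: features arrive one at a time and each is kept or discarded on arrival. The mutual-information threshold used as the acceptance test is an illustrative stand-in, not any specific algorithm from the survey.

```python
# Streaming feature selection sketch: features arrive one by one and are
# accepted only if they appear relevant to the target. The mutual-information
# threshold is an illustrative criterion, not a specific published algorithm.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
n_samples = 500
y = rng.randint(0, 2, n_samples)

selected = []            # indices of features kept so far
threshold = 0.01         # assumed relevance cut-off

def feature_stream(n_features=50):
    """Yield one feature column at a time, simulating streaming arrival."""
    for j in range(n_features):
        informative = rng.rand() < 0.3
        col = y + rng.normal(scale=0.5, size=n_samples) if informative \
              else rng.normal(size=n_samples)
        yield j, col.reshape(-1, 1)

for j, col in feature_stream():
    score = mutual_info_classif(col, y, random_state=0)[0]
    if score > threshold:
        selected.append(j)   # keep the feature; a real method might also
                             # re-check redundancy against 'selected' here

print("kept", len(selected), "of 50 streamed features")
```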


Digital technology has been changing rapidly in recent years, and with this change the number of data systems, sources, and formats has grown exponentially. The process of extracting data from these multiple source systems and transforming it to suit various analytics processes is therefore becoming increasingly important. For Big Data, this transformation is particularly challenging because data generation is a continuous process. In this paper, we extract data from various heterogeneous sources on the web and transform it into a form widely used in data warehousing, so that it caters to the analytical needs of the machine learning community.
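A small sketch of the extract-and-transform step described above, assuming JSON and CSV as example source formats and using pandas to flatten them into one warehouse-style table; the records and field names are placeholders, not from the paper.

```python
# Extract-transform sketch: pull heterogeneous records from different source
# formats and normalise them into one flat, warehouse-friendly table.
# Records and field names are placeholders.
import json
import pandas as pd

# Source 1: JSON export (e.g. from a web API), nested structure.
json_records = json.loads('[{"id": 1, "user": {"name": "a"}, "amount": 10.5},'
                          ' {"id": 2, "user": {"name": "b"}, "amount": 7.0}]')
df_json = pd.json_normalize(json_records)          # flattens user.name

# Source 2: CSV-style export with a different schema.
df_csv = pd.DataFrame({"id": [3, 4], "user.name": ["c", "d"], "amount": [3, 9]})

# Transform: align schemas, enforce types, and add load metadata.
combined = pd.concat([df_json, df_csv], ignore_index=True)
combined = combined.rename(columns={"user.name": "user_name"})
combined["amount"] = combined["amount"].astype(float)
combined["load_ts"] = pd.Timestamp.now(tz="UTC")

print(combined)   # flat table ready to load into a warehouse fact table
```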


Author(s):  
Yi-Qi Hu ◽  
Yang Yu ◽  
Wei-Wei Tu ◽  
Qiang Yang ◽  
Yuqiang Chen ◽  
...  

Automatic machine learning (AutoML) aims at automatically choosing the best configuration for machine learning tasks. However, a configuration evaluation can be very time-consuming, particularly on learning tasks with large datasets. This limitation usually restrains derivative-free optimization from releasing its full power for a fine configuration search using many evaluations. To alleviate this limitation, in this paper we propose a derivative-free optimization framework for AutoML using multi-fidelity evaluations. It uses many low-fidelity evaluations on small data subsets and very few high-fidelity evaluations on the full dataset. However, the low-fidelity evaluations can be badly biased and need to be corrected at very low cost. We thus propose the Transfer Series Expansion (TSE), which learns the low-fidelity correction predictor efficiently by linearly combining a set of base predictors. The base predictors can be obtained cheaply from down-scaled and previously experienced tasks. Experimental results on real-world AutoML problems verify that the proposed framework can significantly accelerate derivative-free configuration search by making use of multi-fidelity evaluations.
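The correction idea the abstract describes, predicting high-fidelity scores as a linear combination of base predictors fitted from only a few full evaluations, can be sketched as follows; the scores are synthetic and this is a simplified reading of TSE, not the authors' implementation.

```python
# Sketch of the multi-fidelity correction idea: learn a predictor that maps
# cheap low-fidelity scores to full-data scores as a linear combination of
# base predictors. Synthetic data; a simplified reading of the TSE idea.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_configs = 40

# Low-fidelity scores: accuracy of each configuration on a small data subset.
low_fid = rng.uniform(0.6, 0.9, size=n_configs)

# Base predictors obtained cheaply from down-scaled / previously seen tasks;
# here each is an assumed fixed transformation of the low-fidelity score.
base_preds = np.column_stack([
    low_fid,                       # identity
    low_fid ** 2,                  # simple nonlinear base predictor
    np.clip(low_fid + 0.05, 0, 1),
])

# A handful of expensive high-fidelity evaluations on the full dataset.
true_high = np.clip(low_fid + 0.08 + rng.normal(0, 0.01, n_configs), 0, 1)
budget = 8                                  # only 8 full evaluations allowed
idx = rng.choice(n_configs, budget, replace=False)

# Learn the linear combination of base predictors from the few pairs we have.
corrector = LinearRegression().fit(base_preds[idx], true_high[idx])
predicted_high = corrector.predict(base_preds)
print("mean abs. correction error:", np.abs(predicted_high - true_high).mean())
```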


Author(s):  
Sai Hanuman Akundi ◽  
Soujanya R ◽  
Madhuri PM

In recent years, vast quantities of data have been generated by a wide range of medical applications, and multiple organizations worldwide have produced this type of data; together, these heterogeneous data are called big data. Data characterized by large volume, high velocity, and wide variety are what the term big data denotes. The healthcare sector, renowned for generating large amounts of heterogeneous data, has faced the need to handle such data from many different sources. Big Data analysis can support better decisions in the health system by adapting some of the current machine learning algorithms. When we have a large amount of data from which we want to predict outcomes or identify patterns, machine learning is the way forward. In this article, a brief overview of Big Data, its functionality, and the methods of Big Data analytics is presented; these play an important role in, and significantly affect, healthcare information technology. Within this paper we present a comparative study of machine learning algorithms. We need to make effective use of the current machine learning algorithms to anticipate accurate outcomes in healthcare.
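As an illustration of the kind of comparative study the article mentions, the sketch below cross-validates a few standard classifiers on a public clinical dataset; the chosen models and dataset are common examples and not necessarily those the authors compare.

```python
# Comparative-study sketch: evaluate several standard classifiers with
# cross-validation on a public dataset. The models and dataset are common
# examples, not necessarily those used in the article.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree":       DecisionTreeClassifier(random_state=0),
    "random forest":       RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```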


2018 ◽  
Author(s):  
Dan McQuillan

Machine learning is a form of knowledge production native to the era of big data. It is at the core of social media platforms and everyday interactions. It is also being rapidly adopted for research and discovery across academia, business and government. This paper will explore the way the affordances of machine learning itself, and the forms of social apparatus that it becomes a part of, will potentially erode ethics and draw us into a drone-like perspective. Unconstrained machine learning enables and delimits our knowledge of the world in particular ways: the abstractions and operations of machine learning produce a ‘view from above’ whose consequences for both ethics and legality parallel the dilemmas of drone warfare. The family of machine learning methods is not somehow inherently bad or dangerous, nor does implementing them signal any intent to cause harm. Nevertheless, the machine learning assemblage produces a targeting gaze whose algorithms obfuscate the legality of its judgements, and whose iterations threaten to create both specific injustices and broader states of exception. Given the urgent need to provide some kind of balance before machine learning becomes embedded everywhere, this paper proposes people’s councils as a way to contest machinic judgements and reassert openness and discourse.


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Wen Xiao ◽  
Juan Hu

Clustering is one of the most important unsupervised machine learning tasks and is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, classical clustering algorithms cannot meet the requirements of clustering big data. Spark is one of the most popular parallel processing platforms for big data, and many researchers have proposed parallel clustering algorithms based on it. In this paper, the existing Spark-based parallel clustering algorithms are classified and summarized, the parallel design framework of each kind of algorithm is discussed, and, after comparing the different kinds of algorithms, directions for future research are discussed.
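A minimal example of the kind of Spark-based parallel clustering the survey covers, using MLlib's distributed k-means; the local session and synthetic points are assumptions for illustration.

```python
# Minimal PySpark k-means sketch: MLlib's distributed implementation of one
# classical clustering algorithm the survey covers. Local session and
# synthetic points are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.master("local[*]").appName("kmeans-demo").getOrCreate()

# Small synthetic dataset; in a real big-data setting this would be a
# distributed DataFrame read from HDFS, S3, or another cluster store.
df = spark.createDataFrame(
    [(0.0, 0.1), (0.2, 0.0), (9.8, 10.1), (10.0, 9.9)],
    ["x", "y"],
)
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=1, featuresCol="features").fit(features)
print(model.clusterCenters())

spark.stop()
```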

