Unified Framework for Control of Machine Learning Tasks Towards Effective and Efficient Processing of Big Data

CatBoost for Big Data: an Interdisciplinary Review

10.21203/rs.3.rs-54646/v1 ◽

2020 ◽

Author(s):

John Hancock ◽

Taghi M Khoshgoftaar

Keyword(s):

Machine Learning ◽

Big Data ◽

Interdisciplinary Approach ◽

Parameter Tuning ◽

Heterogeneous Data ◽

Ensemble Techniques ◽

Learning Tasks ◽

The Family ◽

Classification And Regression ◽

Boosted Decision Trees

Abstract Gradient Boosted Decision Trees (GBDT's) are a powerful tool for classification and regression tasks in Big Data, Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT's in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have ellCcessfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that .55 CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.

Download Full-text

CatBoost for Big Data: an Interdisciplinary Review

10.21203/rs.3.rs-54646/v2 ◽

2020 ◽

Author(s):

John Hancock ◽

Taghi M Khoshgoftaar

Keyword(s):

Machine Learning ◽

Big Data ◽

Interdisciplinary Approach ◽

Parameter Tuning ◽

Heterogeneous Data ◽

Ensemble Techniques ◽

Learning Tasks ◽

The Family ◽

Classification And Regression ◽

Boosted Decision Trees

Abstract Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classiﬁcation and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them eﬀectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s eﬀectiveness and shortcomings in classiﬁcation and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the ﬁrst survey that studies all works related to CatBoost in a single publication.

Download Full-text

A Survey of Parallel Clustering Algorithms Based on Spark

Scientific Programming ◽

10.1155/2020/8884926 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Wen Xiao ◽

Juan Hu

Keyword(s):

Machine Learning ◽

Image Processing ◽

Big Data ◽

Information Retrieval ◽

Social Network ◽

Clustering Algorithms ◽

Future Research ◽

Parallel Design ◽

Learning Tasks ◽

Parallel Clustering

Clustering is one of the most important unsupervised machine learning tasks, which is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, the classical clustering algorithms cannot meet the requirements of clustering for big data. Spark is one of the most popular parallel processing platforms for big data, and many researchers have proposed many parallel clustering algorithms based on Spark. In this paper, the existing parallel clustering algorithms based on Spark are classified and summarized, the parallel design framework of each kind of algorithms is discussed, and after comparing different kinds of algorithms, the direction of the future research is discussed.

Download Full-text

CatBoost for big data: an interdisciplinary review

Journal Of Big Data ◽

10.1186/s40537-020-00369-8 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

John T. Hancock ◽

Taghi M. Khoshgoftaar

Keyword(s):

Machine Learning ◽

Big Data ◽

Interdisciplinary Approach ◽

Parameter Tuning ◽

Heterogeneous Data ◽

Ensemble Techniques ◽

Learning Tasks ◽

The Family ◽

Classification And Regression ◽

Boosted Decision Trees

Abstract Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.

Download Full-text

Advanced Interpretable Machine Learning Methods for Clinical NGS Big Data of Complex Hereditary Diseases

10.3389/978-2-88966-274-6 ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Big Data ◽

Hereditary Diseases ◽

Learning Methods ◽

Machine Learning Methods ◽

Interpretable Machine Learning

Download Full-text

Bipolar Disorder and Oxidative Stress Injury Mechanism - Clinical Big Data Analysis Based on Machine Learning

Case Medical Research ◽

10.31525/ct1-nct03949218 ◽

2019 ◽

Author(s):

Keyword(s):

Oxidative Stress ◽

Machine Learning ◽

Bipolar Disorder ◽

Big Data ◽

Data Analysis ◽

Big Data Analysis ◽

Injury Mechanism ◽

Stress Injury ◽

Oxidative Stress Injury ◽

And Oxidative Stress

Download Full-text

The Cross-Sectional Pricing of Corporate Bonds Using Big Data and Machine Learning

SSRN Electronic Journal ◽

10.2139/ssrn.3686164 ◽

2020 ◽

Cited By ~ 2

Author(s):

Turan G. Bali ◽

Amit Goyal ◽

Dashan Huang ◽

Fuwei Jiang ◽

Quan Wen

Keyword(s):

Machine Learning ◽

Big Data ◽

Corporate Bonds ◽

Cross Sectional ◽

The Cross

Download Full-text

Data Driven Smart Proxy for CFD Application of Big Data Analytics & Machine Learning in Computational Fluid Dynamics, Report Two: Model Building at the Cell Level

10.2172/1431303 ◽

2018 ◽

Cited By ~ 1

Author(s):

A. Ansari ◽

S. Mohaghegh ◽

M. Shahnam ◽

J. F. Dietiker ◽

T. Li

Keyword(s):

Machine Learning ◽

Fluid Dynamics ◽

Computational Fluid Dynamics ◽

Big Data ◽

Data Analytics ◽

Model Building ◽

Big Data Analytics ◽

Data Driven ◽

Cell Level

Download Full-text

Identifying Cancer Targets Based on Machine Learning Methods via Chou’s 5-steps Rule and General Pseudo Components

Current Topics in Medicinal Chemistry ◽

10.2174/1568026619666191016155543 ◽

2019 ◽

Vol 19 (25) ◽

pp. 2301-2317 ◽

Cited By ~ 2

Author(s):

Ruirui Liang ◽

Jiayang Xie ◽

Chi Zhang ◽

Mengying Zhang ◽

Hai Huang ◽

...

Keyword(s):

Machine Learning ◽

Growth Rate ◽

Big Data ◽

Human Genome Project ◽

Genome Project ◽

Support Vector ◽

Successful Implementation ◽

Learning Methods ◽

Machine Learning Methods ◽

Vector Machines

In recent years, the successful implementation of human genome project has made people realize that genetic, environmental and lifestyle factors should be combined together to study cancer due to the complexity and various forms of the disease. The increasing availability and growth rate of ‘big data’ derived from various omics, opens a new window for study and therapy of cancer. In this paper, we will introduce the application of machine learning methods in handling cancer big data including the use of artificial neural networks, support vector machines, ensemble learning and naïve Bayes classifiers.

Download Full-text

Machine Learning Algorithms for Short-Term Load Forecast in Residential Buildings Using Smart Meters, Sensors and Big Data Solutions

IEEE Access ◽

10.1109/access.2019.2958383 ◽

2019 ◽

Vol 7 ◽

pp. 177874-177889 ◽

Cited By ~ 10

Author(s):

Simona-Vasilica Oprea ◽

Adela Bara

Keyword(s):

Machine Learning ◽

Big Data ◽

Residential Buildings ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Short Term ◽

Smart Meters ◽

Load Forecast

Download Full-text