A Non-Deterministic Strategy for Searching Optimal Number of Trees Hyperparameter in Random Forest

Data Analysis Using Representation Theory and Clustering Algorithms

WSEAS TRANSACTIONS ON COMPUTERS ◽

10.37394/23205.2020.19.38 ◽

2021 ◽

Vol 19 ◽

pp. 310-320

Author(s):

Suboh Alkhushayni ◽

Taeyoung Choi ◽

Du’a Alzaleq

Keyword(s):

Data Analysis ◽

Random Forest ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Categorical Variables ◽

Common Disease ◽

Agglomerative Hierarchical Clustering ◽

Data Set

This work aims to expand knowledge in the area of data analysis through both persistent homology and representations of directed graphs. Specifically, we examined how homology cluster groups can be analyzed using agglomerative hierarchical clustering algorithms and methods. Additionally, the Wine data set, which is available in RStudio, was analyzed using various clustering algorithms such as hierarchical clustering, K-Means clustering, and PAM clustering. The goal of the analysis was to find out which clustering method is appropriate for a given numerical data set. By testing the data, we tried to find the agglomerative hierarchical clustering method that would be the optimal clustering algorithm among these three: the K-Means, PAM, and Random Forest methods. By comparing each model's accuracy value with cultivar coefficients, we came to the conclusion that K-Means methods are the most helpful when working with numerical variables. On the other hand, PAM clustering and Gower with Random Forest are the most beneficial approaches when working with categorical variables. With the proper analysis, all these tests can determine the optimal number of clustering groups for a given data set. Through this project, our method can be applied to several industrial areas, such as clinical, business, and others. For example, in a clinical setting, patients can be grouped by common disease, required therapy, and other factors. Additionally, in the business area, people can expect to obtain several clustered groups based on marginal profit, marginal cost, or other economic indicators.

Machine Learning-Based Nicotine Addiction Prediction Models for Youth E-Cigarette and Waterpipe (Hookah) Users

Journal of Clinical Medicine ◽

10.3390/jcm10050972 ◽

2021 ◽

Vol 10 (5) ◽

pp. 972

Author(s):

Jeeyae Choi ◽

Hee-Tae Jung ◽

Anastasiya Ferrell ◽

Seoyoon Woo ◽

Linda Haddad

Keyword(s):

Machine Learning ◽

Random Forest ◽

Prediction Models ◽

Public Awareness ◽

Confusion Matrix ◽

Harmful Effect ◽

Nicotine Addiction ◽

Optimal Number ◽

Machine Learning Algorithms ◽

Predictor Variables

Despite the harmful effects on health, e-cigarette and hookah smoking among youth in the U.S. has increased. Developing tailored e-cigarette and hookah cessation programs for youth is imperative. The aim of this study was to identify predictor variables, such as social, mental, and environmental determinants, that cause nicotine addiction in youth e-cigarette or hookah users and to build nicotine addiction prediction models using machine learning algorithms. A total of 6511 participants were identified as ever having used e-cigarettes or hookah from the National Youth Tobacco Survey (2019) datasets. Prediction models were built by Random Forest with ReliefF and Least Absolute Shrinkage and Selection Operator (LASSO). ReliefF identified important predictor variables, and the Davies–Bouldin clustering evaluation index selected the optimal number of predictors for Random Forest. A total of 193 predictor variables were included in the final analysis. Performance of the prediction models was measured by Root Mean Square Error (RMSE) and Confusion Matrix. The results suggested high prediction performance. The identified predictor variables were aligned with previous research. The novel predictors found, such as ‘witnessed e-cigarette use in their household’ and ‘perception of their tobacco use’, could be used in public awareness efforts, in targeted e-cigarette and hookah education for youth, and by policymakers.

Feature selection for improving Indian spoken language identification in utterance duration mismatch condition

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i5.3173 ◽

2021 ◽

Vol 10 (5) ◽

pp. 2578-2587

Author(s):

Aarti Bakshi ◽

Sunil Kumar Kopparapu

Keyword(s):

Random Forest ◽

Spoken Language ◽

Relative Increase ◽

Optimal Number ◽

Language Identification ◽

Training Data ◽

Support Vector ◽

Indian Languages ◽

Artificial Neural Network (ANN) ◽

Mismatch Condition

In spoken language identification (SLID) systems, the test data may be of considerably shorter duration than the training data, which is known as the duration mismatch condition. Duration normalized features are used to identify a spoken language among nine Indian languages in duration mismatch conditions. Random forest-based importance vectors of 1582 OpenSMILE features are calculated for each utterance in datasets of different durations. The feature importance vectors are normalized within each dataset and then across the different duration datasets. The optimal number of duration normalized features is selected to maximize SLID system accuracy. Three classifiers are used, artificial neural network (ANN), support vector machine (SVM), and random forest (RF), together with their fusion, whose weights are optimized using logistic regression. The speech material comprised utterances of 30 seconds each, extracted from the All India Radio dataset with nine Indian languages. Seven new datasets of smaller utterance durations were generated by carefully splitting each utterance. Experimental results showed that the 150 most important duration normalized features were optimal, with a relative increase in accuracy of 18-80% for mismatch conditions. The accuracy decreased with increased duration mismatch.
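
The selection step described above can be pictured as normalizing each dataset's importance vector, combining the vectors across duration datasets, and keeping the k highest-ranked features. The sketch below is only an illustration under assumptions: min-max normalization and simple averaging are stand-ins (the abstract does not say which normalization or combination rule was used), and the feature names and scores are invented.

```python
def min_max_normalize(scores):
    """Rescale a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no feature stands out.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def top_k_across_datasets(names, importances_by_dataset, k):
    """Rank features by mean normalized importance and keep the top k.

    importances_by_dataset holds one importance vector per duration dataset.
    Averaging the normalized vectors is an assumption made for this sketch.
    """
    normalized = [min_max_normalize(v) for v in importances_by_dataset]
    mean_scores = [sum(col) / len(col) for col in zip(*normalized)]
    ranked = sorted(range(len(names)), key=lambda i: mean_scores[i], reverse=True)
    return [names[i] for i in ranked[:k]]


# Invented example: four features scored on two duration datasets.
feature_names = ["f0_mean", "mfcc_1", "jitter", "shimmer"]
importances = [
    [0.10, 0.40, 0.05, 0.25],  # hypothetical long-utterance dataset
    [0.30, 0.50, 0.10, 0.20],  # hypothetical short-utterance dataset
]
selected = top_k_across_datasets(feature_names, importances, k=2)
print(selected)  # ['mfcc_1', 'shimmer']
```

In the study itself, k would be chosen by scanning candidate values and keeping the one that maximizes SLID accuracy.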

Data analysis using representation theory and clustering algorithms

International Journal of Engineering & Technology ◽

10.14419/ijet.v9i4.31234 ◽

2020 ◽

Vol 9 (4) ◽

pp. 887

Author(s):

Suboh Alkhushayni ◽

Taeyoung Choi ◽

Du'a Alzaleq

Keyword(s):

Data Analysis ◽

Random Forest ◽

Hierarchical Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Categorical Variables ◽

Common Disease ◽

Agglomerative Hierarchical Clustering ◽

Data Set

This work aims to expand knowledge in the area of data analysis through persistent homology and representations of directed graphs. Specifically, we examined how homology cluster groups can be analyzed using agglomerative hierarchical clustering algorithms and methods. Additionally, the Wine data set, which is available in RStudio, was analyzed using various clustering algorithms such as hierarchical clustering, K-Means clustering, and PAM clustering. The goal of the analysis was to find out which clustering method is appropriate for a given numerical dataset. By testing the data, we tried to find the agglomerative hierarchical clustering method that would be the optimal clustering algorithm among these three: the K-Means, PAM, and Random Forest methods. By comparing each model's accuracy value with cultivar coefficients, we concluded that K-Means methods are the most helpful when working with numerical variables. On the other hand, PAM clustering and Gower with Random Forest are the most beneficial approaches when using categorical variables. With the proper analysis, these tests can determine the optimal number of clustering groups for a given data set. Through this project, our method can be applied to several industrial areas, such as clinical, business, and others. For example, in a clinical setting, patients can be grouped by common disease, required therapy, and other factors. Additionally, for the business area, people can expect to obtain several clustered groups based on marginal profit, marginal cost, or other economic indicators.

Detection of Electricity Theft Behavior Based on Improved Synthetic Minority Oversampling Technique and Random Forest Classifier

Energies ◽

10.3390/en13082039 ◽

2020 ◽

Vol 13 (8) ◽

pp. 2039 ◽

Cited By ~ 6

Author(s):

Zhengwei Qu ◽

Hongwen Li ◽

Yunjing Wang ◽

Jiaxi Zhang ◽

Ahmed Abu-Siada ◽

...

Keyword(s):

Random Forest ◽

Smart Grids ◽

Clustering Algorithm ◽

Optimal Number ◽

Cluster Center ◽

Complex Data ◽

Electricity Theft ◽

Positive Data ◽

Data Error ◽

Detection Technologies

Effective detection of electricity theft is essential to maintain power system reliability. With the development of smart grids, traditional electricity theft detection technologies have become ineffective in dealing with the increasingly complex data on the users’ side. To improve the auditing efficiency of grid enterprises, a new electricity theft detection method based on an improved synthetic minority oversampling technique (SMOTE) and an improved random forest (RF) method is proposed in this paper. The data of normal users and electricity theft users were classified as positive data (PD) and negative data (ND), respectively. In practice, the number of ND was far less than that of PD, which made the dataset composed of these two types of data unbalanced. An improved SMOTE based on the K-means clustering algorithm (K-SMOTE) was first presented to balance the dataset. The cluster center of ND was determined by the K-means method. Then, the ND were interpolated by SMOTE on the basis of the cluster center to balance the entire dataset. Finally, the RF classifier was trained with the balanced dataset, and the optimal number of decision trees in RF was decided according to the convergence of the out-of-bag data error (OOB error). Electricity theft behaviors on the user side were detected by the trained RF classifier.
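
One simple reading of "decided according to the convergence of OOB error" is to take the smallest forest size after which the OOB error curve no longer moves by more than a tolerance. The sketch below assumes the OOB errors have already been measured for increasing tree counts; the function name, the tolerance, and the sample numbers are hypothetical and are not taken from the paper.

```python
def trees_at_convergence(tree_counts, oob_errors, tol=0.002):
    """Return the smallest tree count whose OOB error is within `tol`
    of every OOB error measured at larger tree counts.

    tree_counts must be sorted in increasing order, with oob_errors[i]
    being the error measured for tree_counts[i].
    """
    if not tree_counts or len(tree_counts) != len(oob_errors):
        raise ValueError("tree_counts and oob_errors must be non-empty and equal length")
    for i, count in enumerate(tree_counts):
        # The last point trivially qualifies, so this loop always returns.
        if all(abs(later - oob_errors[i]) <= tol for later in oob_errors[i:]):
            return count


# Invented OOB error curve that flattens out as trees are added.
counts = [10, 50, 100, 200, 400, 800]
errors = [0.120, 0.085, 0.071, 0.0640, 0.0635, 0.0632]
chosen = trees_at_convergence(counts, errors)
print(chosen)  # 200
```

A tighter tolerance pushes the chosen size upward; adding trees beyond the convergence point mostly costs training time without reducing the error further.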

Agricultural Irrigation Area Prediction Based on Improved Random Forest Model

10.21203/rs.3.rs-156767/v1 ◽

2021 ◽

Author(s):

Guangda Gao ◽

Maofa Wang ◽

Hongliang Huang ◽

Weiyu Tang

Keyword(s):

Random Forest ◽

Prediction Models ◽

Absolute Error ◽

Mean Value ◽

Optimal Number ◽

Random Forest Model ◽

Irrigation Area ◽

Forest Model ◽

Grid Search Method ◽

The World

Food security is a major problem of common concern in the world, and the prediction of irrigation area can help solve food and agricultural problems. In this paper, world data on grain production and irrigation area are analyzed. An improved Random Forest Regression model is proposed and applied to the prediction of irrigation area. Based on the ordinary Random Forest and Limit Tree Regression algorithms, an improved random forest prediction model for irrigation area in China is proposed. First, the arithmetic mean value (AMM) of mean square error (MSE) and mean absolute error (MAE) was used as the evaluation index for the improved impurity function and the irrigation area prediction effect. Then, the grid search method was used to determine the optimal number of decision trees (70 trees and 30 trees, respectively) in ordinary random forest and limit tree regression, and a new improved random forest model was established. The model was then compared with other prediction models, and 10-fold cross validation shows the rationality of the model. Finally, the error analysis of the improved Random Forest model shows that the prediction error is small. The model is expected to be applied in the annual analysis of irrigation area in China.
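
The evaluation index named above, the arithmetic mean of MSE and MAE, is simple to compute, and a grid search then amounts to keeping the candidate setting with the lowest index. The sketch below is a minimal illustration: the helper names are hypothetical, and the predictions attached to each candidate tree count are made-up numbers standing in for the output of fitted models.

```python
def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def amm(y_true, y_pred):
    """Arithmetic mean of MSE and MAE, used here as the evaluation index."""
    return (mse(y_true, y_pred) + mae(y_true, y_pred)) / 2


def best_setting(y_true, predictions_by_setting):
    """Return the candidate setting whose predictions give the lowest AMM."""
    return min(predictions_by_setting,
               key=lambda s: amm(y_true, predictions_by_setting[s]))


# Made-up validation targets and predictions for two candidate tree counts.
targets = [2.0, 4.0, 6.0]
candidates = {
    50: [2.5, 4.5, 5.0],
    100: [2.0, 4.5, 6.0],
}
best = best_setting(targets, candidates)
print(best)  # 100
```

In a real grid search each candidate's predictions would come from a model trained with that number of trees, typically scored on held-out folds.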

Optimal Experimental Design With Nesting of Persons in Organizations

Zeitschrift für Psychologie ◽

10.1027/2151-2604/a000143 ◽

2013 ◽

Vol 221 (3) ◽

pp. 145-159 ◽

Cited By ~ 1

Author(s):

Gerard J. P. van Breukelen

Keyword(s):

Repeated Measures ◽

Optimal Number ◽

Health Centers ◽

Randomized Experiments ◽

Optimal Sample ◽

Treatment Conditions ◽

Sampling Cost ◽

Nested Designs ◽

Simple Equations ◽

Number Of Individuals

This paper introduces optimal design of randomized experiments where individuals are nested within organizations, such as schools, health centers, or companies. The focus is on nested designs with two levels (organization, individual) and two treatment conditions (treated, control), with treatment assignment to organizations, or to individuals within organizations. For each type of assignment, a multilevel model is first presented for the analysis of a quantitative dependent variable or outcome. Simple equations are then given for the optimal sample size per level (number of organizations, number of individuals) as a function of the sampling cost and outcome variance at each level, with realistic examples. Next, it is explained how the equations can be applied if the dependent variable is dichotomous, or if there are covariates in the model, or if the effects of two treatment factors are studied in a factorial nested design, or if the dependent variable is repeatedly measured. Designs with three levels of nesting and the optimal number of repeated measures are briefly discussed, and the paper ends with a short discussion of robust design.
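
As an illustration of the kind of simple equation the abstract refers to, the sketch below uses the standard textbook result for a two-level design with treatment assigned to organizations: under a fixed budget, the number of individuals per organization that minimizes the variance of the treatment effect estimate is sqrt((cost per organization / cost per individual) * (individual-level variance / organization-level variance)). The function and parameter names are illustrative, the input values are invented, and the paper's own equations may be stated differently.

```python
import math


def optimal_two_level_design(budget, cost_org, cost_ind, var_org, var_ind):
    """Optimal sample sizes for a two-level design with organization-level
    treatment assignment, under the budget constraint
    budget = n_orgs * (cost_org + cost_ind * n_per_org).

    Returns (n_per_org, n_orgs) as real numbers; in practice both are
    rounded to integers.
    """
    if min(budget, cost_org, cost_ind, var_org, var_ind) <= 0:
        raise ValueError("all inputs must be positive")
    # Individuals per organization: grows with the cost ratio and with the
    # share of outcome variance located at the individual level.
    n_per_org = math.sqrt((cost_org / cost_ind) * (var_ind / var_org))
    # Number of organizations the budget then allows.
    n_orgs = budget / (cost_org + cost_ind * n_per_org)
    return n_per_org, n_orgs


# Invented example: an organization costs 25 times as much to sample as an
# individual, and 10% of the outcome variance is between organizations.
n_per_org, n_orgs = optimal_two_level_design(
    budget=10000, cost_org=100, cost_ind=4, var_org=1.0, var_ind=9.0
)
print(n_per_org, n_orgs)  # 15.0 62.5
```

The result shows the usual trade-off: when organizations are expensive relative to individuals, or when little variance lies between organizations, it pays to sample more individuals per organization and fewer organizations.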

Optimal number of disc clock tracks for block-oriented rotating associative processors

IEE Proceedings E Computers and Digital Techniques ◽

10.1049/ip-e.1989.0073 ◽

1989 ◽

Vol 136 (6) ◽

pp. 535

Author(s):

B. Parhami

Keyword(s):

Optimal Number

Implementation of data mining as a support of business application strategy

Journal of Applied Information, Communication and Technology ◽

10.33555/ejaict.v5i1.49 ◽

2018 ◽

Vol 5 (1) ◽

pp. 47-55

Author(s):

Florensia Unggul Damayanti

Keyword(s):

Data Mining ◽

Random Forest ◽

Business Strategy ◽

Input Parameter ◽

Data Mining Algorithm ◽

Complex Data ◽

Business Decision ◽

Marketing Department ◽

Business Application ◽

Complex Data Sets

Data mining helps industries make intelligent decisions on complex problems. Data mining algorithms can be applied to data in order to forecast, identify patterns, make rules and recommendations, analyze sequences in complex data sets, and retrieve fresh insights. Advances in technology and the variety of available data mining techniques give industries the opportunity to explore their data, gain valuable information from it, and use that information to support business decision making. This paper implements classification data mining in order to retrieve knowledge from customer databases to support the marketing department in planning a strategy for predicting plan premiums. The dataset is decomposed into conceptual analytics to identify data characteristics that can be used as input parameters of the data mining model. Business decision and application is characterized by processing step, processing characteristic, and processing outcome (Seng, J.L., Chen T.C. 2010). This paper sets up a data mining experiment based on the J48 and Random Forest classifiers and sheds light on the performance of J48 compared with Random Forest in the context of a dataset from the insurance industry. The experimental results concern the classification accuracy and efficiency of J48 and Random Forest, and also identify the attributes most useful for predicting plan premiums in the context of strategic planning to support business strategy.

An analysis on the optimal number of options in multiple-choice items of the National Assessment of Educational Achievement

Foreign Languages Education ◽

10.15334/fle.2014.21.2.107 ◽

2014 ◽

Vol 21 (2) ◽

pp. 107-128

Author(s):

Young-Ju Lee ◽

Keyword(s):

Educational Achievement ◽

Multiple Choice ◽

Optimal Number ◽

National Assessment ◽

Multiple Choice Items