A Spheriform Quantization Method Based on Sub-Region Inherent Dimension

Quantization is a significant preprocessing task for operations such as learning and classification in machine learning and data mining, since many mining and learning methods in these fields require that the testing data set contain partitioned features. In this paper, we propose a spheriform quantization method based on sub-region inherent dimension, which determines the number and size of the quantized intervals in a data-driven way. The method assumes that a quantized cluster of points can be contained in a lower intrinsic m-dimensional spheriform space of expected radius. The sample points in the spheriform can be obtained by adaptively selecting the neighborhood at the initial observation based on sub-region inherent dimension. Experimental results and analysis on UCI real data sets demonstrate that, with the C4.5 decision tree as the classifier, our method achieves significantly higher classification accuracy than traditional quantization methods.

Partition Real Data in Decision Tree Using Statistical Criterion

Applied Mechanics and Materials, 2013, Vol 380-384, pp. 1469-1472. DOI: 10.4028/www.scientific.net/amm.380-384.1469

Author(s): Gui Jun Shan

Keyword(s): Machine Learning, Data Mining, Decision Tree, Classification Accuracy, Real Data, Statistical Criterion, Partition Scheme, C4.5 Decision Tree, Tree Algorithms, Partition Method

Partition methods for real data play an extremely important role in decision tree algorithms in data mining and machine learning, because these algorithms require attribute values to be discrete. In this paper, we propose a novel partition method for real data in decision trees using a statistical criterion. The method constructs a statistical criterion to find accurate merging intervals. In addition, we present a heuristic partition algorithm that achieves a desired partition result, with the aim of improving the performance of decision tree algorithms. Empirical experiments on UCI real data show that the new algorithm generates a better partition scheme, yielding higher classification accuracy for the C4.5 decision tree than existing algorithms.
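
The abstract does not state which statistical criterion is used. As an illustration only, the sketch below substitutes the chi-square statistic of the well-known ChiMerge approach: it starts with one interval per distinct value and repeatedly merges the adjacent pair whose class distributions are most similar. The function names and the `max_intervals` stopping rule are assumptions for this example, not the paper's algorithm.

```python
from collections import Counter

def chi2_adjacent(a, b, classes):
    """Chi-square statistic comparing the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat

def merge_intervals(values, labels, max_intervals):
    """Bottom-up merging; returns the lower bound of each final interval."""
    classes = sorted(set(labels))
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    # Start with one interval per distinct value, in ascending order.
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [
            chi2_adjacent(intervals[i][1], intervals[i + 1][1], classes)
            for i in range(len(intervals) - 1)
        ]
        i = scores.index(min(scores))  # most similar adjacent pair
        merged = intervals[i][1] + intervals[i + 1][1]
        intervals[i:i + 2] = [(intervals[i][0], merged)]
    return [lower for lower, _ in intervals]

cuts = merge_intervals([1, 2, 3, 10, 11, 12], list("aaabbb"), max_intervals=2)
```

In this toy input the two classes occupy disjoint value ranges, so the merging keeps a single boundary between them.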

AI Testing: Ensuring a Good Data Split Between Data Sets (Training and Test) using K-means Clustering and Decision Tree Analysis

International Journal on Soft Computing, 2021, Vol 12 (1), pp. 1-11. DOI: 10.5121/ijsc.2021.12101

Author(s): Kishore Sugali, Chris Sprunger, Venkata N Inukollu

Keyword(s): Decision Tree, Software Testing, Training Data, Data Sets, Full Data, Data Set, Full Dataset, Development Methodology, Testing Data, Long Time

Artificial Intelligence and Machine Learning have been around for a long time. In recent years, there has been a surge in popularity for applications integrating AI and ML technology. As with traditional development, software testing is a critical component of a successful AI/ML application. The development methodology used in AI/ML differs significantly from traditional development, and these differences give rise to various software testing challenges. The emphasis of this paper is on the challenge of effectively splitting the data into training and testing data sets. By applying a k-means clustering strategy to the data set followed by a decision tree, we can significantly increase the likelihood that the training data set represents the domain of the full dataset, and thus avoid training a model that is likely to fail because it has only learned a subset of the full data domain.
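
One way to read the splitting idea is as a split stratified by cluster. The sketch below assumes cluster ids have already been assigned to each sample (for example by k-means) and samples each cluster separately, so that every cluster contributes to the training set. The function name, `test_fraction`, and `seed` are hypothetical, and the decision tree step described in the paper is not reproduced here.

```python
import random
from collections import defaultdict

def cluster_stratified_split(cluster_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling within each cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    train, test = [], []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        rng.shuffle(members)
        # Always leave at least one member of each cluster for training.
        n_test = min(int(len(members) * test_fraction), len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)

# Hypothetical cluster assignment: 10 samples in cluster 0, 5 in 1, 1 in 2.
ids = [0] * 10 + [1] * 5 + [2]
train_idx, test_idx = cluster_stratified_split(ids)
```

Even the single-member cluster ends up in the training set, which is the property the abstract is after: the model sees every region of the data domain.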

Diagnosis of Various Thyroid Ailments using Data Mining Classification Techniques

International Journal of Scientific Research in Computer Science Engineering and Information Technology, 2019, pp. 131-136. DOI: 10.32628/cseit195119

Author(s): Umar Sidiq, Syed Mutahar Aaqib, Rafi Ahmad Khan

Keyword(s): Data Mining, Decision Tree, Research Work, Support Vector, Data Sets, Data Mining Technique, K Nearest Neighbors, Data Set, Classification Techniques, Using Data

Classification is one of the most important supervised learning data mining techniques used to classify predefined data sets. It is mainly used in the healthcare sector for decision making, diagnosis systems, and giving better treatment to patients. In this work, the data set is taken from a recognized lab in Kashmir. The entire research work is carried out with ANACONDA3-5.2.0, an open source platform, under the Windows 10 environment. An experimental study is carried out using classification techniques such as k nearest neighbors, support vector machine, decision tree and naïve Bayes. The decision tree obtained the highest accuracy, 98.89%, among the classification techniques.

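PENERAPAN DECISION TREE C4.5 SEBAGAI SELEKSI FITUR DAN SUPPORT VECTOR MACHINE (SVM) UNTUK DIAGNOSA KANKER PAYUDARA [Application of the C4.5 Decision Tree as Feature Selection and Support Vector Machine (SVM) for Breast Cancer Diagnosis]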

Jurnal Informatika, 2019, Vol 19 (1), pp. 54-61. DOI: 10.30873/ji.v19i1.1442

Author(s): Pakarti Riswanto, RZ. Abdul Aziz, Sriyanto

Keyword(s): Breast Cancer, Machine Learning, Data Mining, Decision Tree, Cancer Cells, Support Vector, Data Set, Advantages And Disadvantages, New Findings, Tree Classifier

In the field of medicine, data mining plays an important and evolving role that can change the perspective of doctors, practitioners and health researchers in the process of detecting breast cancer in a patient. It has two classification applications: diagnosis, which distinguishes between benign tumors and malignant cancer, and prognosis, which estimates the possibility that cancer cells will reappear in patients who have been operated on. Data mining aims to describe new findings in the dataset and explain a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and related knowledge from the database. Classification with data mining can be done using several methods, namely Decision Tree, K-Nearest Neighbor, Naive Bayes, ID3, CART, Linear Discriminant Analysis, etc., each of which has advantages and disadvantages. In this study, the author focuses on classification using the Support Vector Machine and Decision Tree algorithms. The study analyzes the Breast Cancer Wisconsin Original data set obtained from the UCI Machine Learning Repository to classify breast cancer malignancies. The author combines the Decision Tree classifier algorithm, which has a good ability to process large databases, as a feature selection step with an SVM method that is suitable and relevant for analyzing and diagnosing breast cancer patients because it has accurate results for existing problems and several bases. Keywords: Data Mining, diagnosis, Decision Tree, SVM Method

Utilizing the Genetic Algorithm to Pruning the C4.5 Decision Tree Algorithm

Asian Journal of Applied Sciences, 2021, Vol 9 (1). DOI: 10.24203/ajas.v9i1.6503

Author(s): Maad M. Mijwil, Rana A. Abttan

Keyword(s): Machine Learning, Genetic Algorithm, Decision Tree, Learning Algorithm, Machine Learning Algorithms, Continuous Data, Decision Tree Algorithm, Data Set, Confidence Factor, C4.5 Decision Tree

A decision tree (DT) is one of the most popular machine learning algorithms; it divides data repeatedly to form groups or classes. It is a supervised learning algorithm that can be used on discrete or continuous data for classification or regression. The most traditional classifier of this kind is the C4.5 decision tree, which is the focus of this research. This classifier has the advantage of building a vast data set and does not stop until it reaches the desired goal. The problem with this classifier is that it produces unnecessary nodes and branches, which lead to overfitting. This overfitting can negatively affect the classification process. In this context, the authors suggest utilizing a genetic algorithm to prune the effect of overfitting. The study uses four datasets, IRIS, Car Evaluation, GLASS, and WINE, collected from the UC Irvine (UCI) machine learning repository. The experimental results have confirmed the effectiveness of the genetic algorithm in pruning the effect of overfitting on the four datasets and in optimizing the confidence factor (CF) of the C4.5 decision tree. The proposed method has reached about 92% accuracy in this work.

PseUdeep: RNA Pseudouridine Site Identification with Deep Learning Algorithm

Frontiers in Genetics, 2021, Vol 12. DOI: 10.3389/fgene.2021.773882

Author(s): Jujuan Zhuang, Danyang Liu, Meng Lin, Wenjing Qiu, Jinyang Liu, ...

Keyword(s): Machine Learning, Deep Learning, Learning Algorithm, Machine Learning Algorithms, Data Sets, Biological Processes, Rna Sequences, Frequency Pattern, Data Set, Testing Data

Background: Pseudouridine (Ψ) is a common ribonucleotide modification that plays a significant role in many biological processes. The identification of Ψ modification sites is of great significance for research on disease mechanisms and biological processes, and machine learning algorithms are desirable for this task because laboratory techniques are expensive and time-consuming. Results: In this work, we propose a deep learning framework, called PseUdeep, to identify Ψ sites of three species: H. sapiens, S. cerevisiae, and M. musculus. In this method, three encoding methods are used to extract the features of RNA sequences: one-hot encoding, K-tuple nucleotide frequency pattern, and position-specific nucleotide composition. The three feature matrices are convoluted twice and fed into the capsule neural network and bidirectional gated recurrent unit network with a self-attention mechanism for classification. Conclusion: Compared with other state-of-the-art methods, our model achieves the highest prediction accuracy on the independent testing data set S-200, where the accuracy improves by 12.38%, and on the independent testing data set H-200, where the accuracy improves by 0.68%. Moreover, the dimensions of the features we derive from the RNA sequences are only 109, 109, and 119 in H. sapiens, M. musculus, and S. cerevisiae, which is much smaller than those used in the traditional algorithms. On evaluation via tenfold cross-validation and two independent testing data sets, PseUdeep outperforms the best traditional machine learning model available. PseUdeep source code and data sets are available at https://github.com/dan111262/PseUdeep.
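
Two of the three encodings named in the abstract can be sketched generically. The code below is not taken from the PseUdeep repository: the function names, the choice of `k=2`, and the `ACGU` alphabet order are assumptions, and position-specific nucleotide composition is left out because the abstract does not define it.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # assumed alphabet order for this sketch

def one_hot(seq):
    """Encode each nucleotide as a 4-element indicator vector."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq]

def k_tuple_frequency(seq, k=2):
    """Relative frequency of every length-k nucleotide tuple in seq."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = max(len(seq) - k + 1, 0)
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing other symbols
            counts[kmer] += 1
    return {kmer: (counts[kmer] / windows if windows else 0.0) for kmer in kmers}

encoded = one_hot("AU")
freqs = k_tuple_frequency("AUAU", k=2)
```

For a sequence of length n, `one_hot` yields an n-by-4 matrix, while `k_tuple_frequency` yields a fixed-length vector of 4^k entries regardless of n.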

Analysis for Clients Churn of Credit Cards in Model Construction in Banking Industry

Proceedings of Business and Economic Studies, 2020, Vol 3 (2). DOI: 10.26689/pbes.v3i2.1165

Author(s): Jianyao Liu

Keyword(s): Data Mining, Decision Tree, Financial Market, Banking Industry, Credit Cards, Training Data, Tree Model, Data Set, Mining Technology, Testing Data

Data mining technology has become increasingly important in economics and the financial market. To help banks predict customer behavior, namely whether existing customers will continue to use their credit cards, we use data mining technology to construct a convenient and effective model, a decision tree. With our decision tree model, which classifies customers step by step according to different features, banks are able to predict customer behavior well. The main steps of our experiment include collecting statistics from the bank, using Min-Max normalization to preprocess the data set, employing the training data set to construct our model, examining the model with the testing data set, and analyzing the results.
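
Min-Max normalization rescales each feature to the range [0, 1] using x' = (x - min) / (max - min). A minimal sketch follows; the function name and the handling of constant columns are choices made for this example, not details from the paper.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: map every value to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

scaled = min_max_normalize([10, 20, 30])
```

The minimum and maximum are normally taken from the training data set and reused when scaling the testing data set, so that both are mapped with the same parameters.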

A Data Mining Approach for Cardiovascular Diagnosis

Open Computer Science, 2017, Vol 7 (1), pp. 36-40. DOI: 10.1515/comp-2017-0007. Cited By ~ 3

Author(s): Joana Pereira, Hugo Peixoto, José Machado, António Abelha

Keyword(s): Machine Learning, Data Mining, Data Sets, Complex Data, Data Set, Data Mining Approach, Degree Of Disability, Cardio Vascular, Cardiovascular Diagnosis

The large amounts of data generated by healthcare transactions are too complex and voluminous to be processed and analysed by traditional methods. Data mining can improve decision-making by discovering patterns and trends in large amounts of complex data. In the healthcare industry specifically, data mining can be used to decrease costs by increasing efficiency, improve patient quality of life, and perhaps most importantly, save the lives of more patients. The main goal of this project is to apply data mining techniques to predict the degree of disability that patients will present when they leave hospitalization. The clinical data that compose the data set were obtained from a single hospital and contain information about patients who were hospitalized in the Cardiovascular Disease (CVD) unit in 2016 after suffering a cardiovascular accident. The project uses the Waikato Environment for Knowledge Analysis (WEKA) machine learning workbench, since it allows users to quickly try out and compare different machine learning methods on new data sets.

Soil Quality Analysis and Crop Fertility Prediction

International Journal for Research in Engineering Application & Management, 2020, pp. 189-193. DOI: 10.35291/2454-9150.2020.0280

Keyword(s): Data Mining, Decision Tree, Large Data, Soil Classification, Quality Analysis, Soil Parameters, Data Sets, Data Set, Nutrient Analysis, Tree Algorithms

Data mining is a technique used to retrieve information for the analysis and discovery of hidden trends in large data sets. It extends to numerous areas such as education, banking, marketing, retail, communications and agriculture. Agriculture is the backbone of a country's economy and an important source of livelihood. It depends primarily on the weather, geology, soil and biology. Agricultural mining is a technology that can contribute information for the growth of agriculture. The current study presents various data mining techniques and their role in soil fertility and nutrient analysis. The decision tree is a well-known data mining classification approach. C4.5 and ID3 are two widely used decision tree algorithms for classification. C4.5, ID3 and the proposed classifier have been trained on the soil sample data set, taking into account the optimal soil parameters pH (hydrogen power), EC (electrical conductivity) and ESP (exchangeable sodium percentage). The model is evaluated using a collection of soil sample test results. Soil classification is the division of soil into classes or groups, each having similar characteristics and likely similar behavior. It allows the farmer to know the type of soil and to cultivate crops based on the soil type.

Decision Tree Regressions Estimate Liquid Holdup in Two-Phase Gas/Liquid Flows

Journal of Petroleum Technology, 2021, Vol 73 (11), pp. 75-76. DOI: 10.2118/1121-0075-jpt

Author(s): Chris Carpenter

Keyword(s): Machine Learning, Decision Tree, Multiphase Flows, Data Sets, Liquid Holdup, Two Phase, Data Set, Feature Importance, Data Points, Liquid Flows

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 203448, “Decision-Tree Regressions for Estimating Liquid Holdup in Two-Phase Gas/Liquid Flows,” by Meshal Almashan, SPE, Yoshiaki Narusue, and Hiroyuki Morikawa, University of Tokyo, prepared for the 2020 Abu Dhabi International Petroleum Exhibition and Conference, Abu Dhabi, held virtually 9–12 November. The paper has not been peer reviewed.

In the authors’ study, a machine-learning predictive model, boosted decision tree regression (BDTR), is trained, tested, and evaluated in predicting liquid holdup (HL) in multiphase flows in oil and gas wells. Results show that the proposed BDTR model outperforms the best empirical correlations and the fuzzy-logic model used in estimating HL in gas/liquid multiphase flows. Using the BDTR model with its interpretable representation, the heuristic feature importance of the input features used in building the model can be determined clearly.

Introduction

Machine-learning approaches in predicting HL in multiphase flows have been recently studied to improve prediction accuracy compared with existing empirical correlations. However, these approaches ignore the heuristic feature importance of the input parameters to the predicted HL values. The heuristic feature importance can help provide better insight into the issues associated with HL studies, such as the liquid-loading phenomenon. To the best of the authors’ knowledge, the present study is the first work that shows how decision-forest regression predictive models can predict HL accurately.

Data Acquisition

The performance and the predictive power of a machine-learning model rely greatly on the quality and completeness of the data set used in building the model. The data sets used in training and testing the predictive model are experimental and were collected from the literature (111 data points). Air/kerosene and air/water mixtures were used in obtaining the 111 experimental data points. In this study, this data set is divided into three different subsets: training, validation, and testing. The data sets consist of the properties of HL, the superficial gas velocity (Vsg), the superficial liquid velocity (Vsl), pressure, and temperature (T). The statistical measures of the data sets are shown in Table 1 of the complete paper.
