PCA based Regression Decision Tree Classification for Somatic Mutations

Analyzing cancer and normal data to predict somatic mutation occurrences plays an important role, but several challenges persist in detecting somatic mutations, chiefly the complexity of handling large volumes of data while classifying with good accuracy. In many situations the dataset contains redundant and less significant features, which need to be removed to improve classification performance. Feature selection techniques are useful for dimensionality reduction. PCA is adopted in this paper as one such technique to identify significant attributes. A novel technique, a PCA based regression decision tree, is proposed for classifying somatic mutation data. The performance of this classification process for detecting somatic mutations is compared with existing algorithms, and satisfactory results are obtained with the proposed model.
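
As a rough illustration of the dimensionality reduction step described above, the sketch below finds the first principal component of a small numeric table using power iteration on the covariance matrix. It is not the authors' method: the regression decision tree stage is omitted, and the function name `first_principal_component` and the toy data are assumptions made for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained-variance ratio) of the first PC."""
    n, d = len(rows), len(rows[0])
    # Center each feature (column) at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by cov converges to its
    # dominant eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue (variance along v).
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Toy data (invented): the second feature is roughly twice the first,
# the third is nearly constant noise.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
    [5.0, 10.1, 0.5],
]
direction, ratio = first_principal_component(rows)
print(direction, ratio)
```

On this toy table almost all of the variance lies along one direction, which is the situation in which keeping only the leading components discards little information.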

Author(s):  
Dhilsath Fathima.M ◽  
S. Justin Samuel ◽  
R. Hari Haran

Aim: This work develops an improved and robust machine learning model for predicting Myocardial Infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning based computer-aided analysis system for early and accurate prediction of MI, using the Framingham Heart Study dataset for validation and evaluation. The proposed model is intended to support medical professionals in predicting myocardial infarction proficiently. Methods: The proposed model uses mean imputation to fill in missing values in the data set, then applies principal component analysis (PCA) to extract the optimal features and enhance classifier performance. After PCA, the reduced feature set is partitioned into training and testing datasets: 70% of the data is used to train four well-known classifiers (support vector machine, k-nearest neighbor, logistic regression and decision tree), and the remaining 30% is used to evaluate the models using a confusion matrix, classifier accuracy, precision, sensitivity, F1-score and the AUC-ROC curve. Results: The classifier outputs were evaluated using these performance measures. Logistic regression provided higher accuracy than the k-NN, SVM and decision tree classifiers, and PCA performed well as a feature extraction method for enhancing the proposed model. From these analyses, we conclude that logistic regression has a good mean accuracy and standard deviation of accuracy compared with the other three algorithms. The AUC-ROC curves in Figures 4 and 5 show that logistic regression achieves a good AUC-ROC score, around 70%, compared to the k-NN and decision tree algorithms.
Conclusion: From the result analysis, we infer that the proposed machine learning model can act as an optimal decision-making system for predicting acute myocardial infarction at an earlier stage than existing machine learning based prediction models. It is capable of predicting the presence of acute myocardial infarction in humans from heart disease risk factors, helping to decide when to start lifestyle modification and medical treatment to prevent heart disease.
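
The evaluation measures named in the Methods (accuracy, precision, sensitivity, F1-score) all follow from the four cells of a binary confusion matrix. A minimal sketch, assuming a positive class of "MI present"; the counts below are invented for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the predicted positives, how many were truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity (recall): of the true positives, how many were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

# Hypothetical counts on a 100-record test split.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

Sensitivity matters most in a screening setting such as this one, since a false negative corresponds to a missed case.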


2021 ◽  
pp. 1063293X2110160
Author(s):  
Dinesh Morkonda Gunasekaran ◽  
Prabha Dhandayudam

Breast cancer is now commonly diagnosed in women. Feature selection is an important step in constructing a classification framework. We propose a Multi Filter Union (MFU) feature selection method for breast cancer data sets. The method combines a random forest algorithm and a logistic regression (LG) algorithm in a union model to select important features from the dataset. The performance of the data analysis is evaluated using the optimal feature subset of the selected dataset. Experiments are carried out on the Wisconsin Diagnostic Breast Cancer data set and then on a real data set from a women's health care center. The proposed approach shows high performance and efficiency compared with existing feature selection algorithms.
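
One plausible reading of a "union" of filters is that each scoring method nominates its top-ranked features and the final subset is the set union of those nominations. The sketch below shows only that combination step. The score dictionaries are invented stand-ins for random forest importances and logistic regression coefficient magnitudes, and `multi_filter_union` is a hypothetical name, not the authors' implementation.

```python
def top_k(scores, k):
    """Names of the k highest-scoring features."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {name for name, _ in ranked[:k]}

def multi_filter_union(score_sets, k):
    """Union of the top-k features nominated by each filter."""
    selected = set()
    for scores in score_sets:
        selected |= top_k(scores, k)
    return sorted(selected)

# Hypothetical per-feature scores from two filters.
rf_importance = {"radius": 0.30, "texture": 0.10,
                 "perimeter": 0.25, "smoothness": 0.05}
lr_coef_magnitude = {"radius": 1.2, "texture": 0.9,
                     "perimeter": 0.2, "smoothness": 0.4}

selected = multi_filter_union([rf_importance, lr_coef_magnitude], k=2)
print(selected)
```

A union keeps any feature that at least one filter considers important, so it tends to select more features than an intersection would.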


2021 ◽  
pp. 2796-2812
Author(s):  
Nishath Ansari

Feature selection, a method of dimensionality reduction, consists of choosing appropriate feature subsets from the full set of features. This paper presents a point-by-point review of feature selection, the issues it raises, and the techniques used to evaluate it. The discussion begins with a straightforward approach to handling features and selection issues based on meta-heuristic strategies, which help in obtaining the best feature subsets. The paper then discusses nature-inspired models and the computations performed with them to handle feature selection in complex and massive data. It also discusses algorithms such as the genetic algorithm (GA), the Non-Dominated Sorting Genetic Algorithm (NSGA-II), Particle Swarm Optimization (PSO), and some other meta-heuristic strategies. A comparison of these algorithms has been performed; the results show that feature selection benefits machine learning algorithms by improving their performance. This paper also presents various real-world applications of feature selection.


2014 ◽  
Vol 998-999 ◽  
pp. 1357-1361
Author(s):  
Qing Song Tang ◽  
Jian Ying He

An electronic document is represented as a data table by applying characterization and discretization to its type, size, MD5 value, etc. A duplicate removal model for documents is constructed using an information entropy based decision tree classification technique. Simple experiments show that the proposed model is feasible to a certain degree and can remove duplicate documents to a certain extent.
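
The first step described above, turning each document into a table row of type, discretized size and MD5 value, might look like the following sketch. The size thresholds, the bucket names and the `document_row` helper are assumptions for illustration. The decision tree stage is not shown, and comparing rows directly is a simplification of the paper's classifier.

```python
import hashlib

def size_bucket(n_bytes):
    """Discretize a byte count into a coarse category (assumed thresholds)."""
    if n_bytes < 1024:
        return "small"
    if n_bytes < 1024 * 1024:
        return "medium"
    return "large"

def document_row(name, content):
    """Characterize a document as a row of discrete attributes."""
    doc_type = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    return {
        "type": doc_type,
        "size": size_bucket(len(content)),
        "md5": hashlib.md5(content).hexdigest(),
    }

# Hypothetical documents: two with identical content, one different.
docs = {
    "a.txt": b"hello",
    "copy_of_a.txt": b"hello",
    "b.txt": b"world",
}
rows = {name: document_row(name, data) for name, data in docs.items()}
for name, row in rows.items():
    print(name, row)
```

Rows with the same type, size bucket and MD5 value are candidates for duplicate removal.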


Author(s):  
Rafid Sagban ◽  
Haydar A. Marhoon ◽  
Raaid Alubady

Rule-based classification in the field of health care using artificial intelligence provides solutions to decision-making problems involving different domains. An important challenge is providing access to good and fast health facilities. Cervical cancer is one of the most frequent causes of death in females. The diagnostic methods for cervical cancer used in health centers are costly and time-consuming. In this paper, the bat algorithm for feature selection and an ant colony optimization-based classification algorithm were applied to a cervical cancer data set obtained from the repository of the University of California, Irvine, to analyze the disease based on optimal features. The proposed algorithm outperforms other methods in terms of comprehensibility and obtains better results in terms of classification accuracy.


2020 ◽  
Vol 3 (1) ◽  
pp. 40-54
Author(s):  
Ikong Ifongki

Data mining is a series of processes for extracting added value from a data set in the form of knowledge that was not previously known through manual analysis. Data mining techniques are expected to surface knowledge that was previously hidden in the data warehouse, turning it into valuable information. C4.5 is a widely used decision tree classification algorithm because of its main advantages over other algorithms: it produces decision trees that are easy to interpret, has an acceptable level of accuracy, and can handle both discrete and numeric attributes efficiently. Like other decision tree techniques, the output of the C4.5 algorithm is a decision tree, a structure that divides a large data set into smaller sets of records by applying a series of decision rules, so that with each division the members of the resulting sets become more similar to each other. This case study examines the factors affecting coffee sales by processing 106 records out of 1087 coffee sales records at PT. JPW Indonesia. The sampled data are calculated manually using Microsoft Excel and with the RapidMiner software. The results of the C4.5 calculation show that the Quantity and Price attributes greatly affect coffee sales, so sales at PT. JPW Indonesia are still often unstable.
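
C4.5 chooses the attribute to split on by gain ratio: the information gain of the split divided by the entropy of the split itself. The sketch below computes that quantity for discrete attributes only. The attribute names and records are toy values invented for this example and are unrelated to the coffee sales data in the study.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attribute, labels):
    """Gain ratio of splitting `labels` by a discrete `attribute`."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels after the split.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(attribute)
    return info_gain / split_info if split_info else 0.0

# Toy records (invented): attribute_a predicts the label, attribute_b does not.
labels = ["up", "up", "down", "down"]
attribute_a = ["low", "low", "high", "high"]
attribute_b = ["x", "y", "x", "y"]

ratio_a = gain_ratio(attribute_a, labels)
ratio_b = gain_ratio(attribute_b, labels)
print(ratio_a, ratio_b)
```

The attribute with the highest gain ratio becomes the split at the current node, and the procedure repeats on each resulting subset.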


Author(s):  
Abhishek Bhattacharya ◽  
Radha Tamal Goswami ◽  
Kuntal Mukherjee ◽  
Nhu Gia Nguyen

Each Android application requests a set of permissions at installation time, and these can be used as features for permission-based identification of Android malware. Recently, ensemble feature selection techniques have received increasing attention over conventional techniques in different applications. In this work, a cluster based voted ensemble feature selection technique combining five base wrapper approaches from R libraries is proposed for identifying the most prominent set of features in the predictive modeling of Android malware. The proposed method preserves both desirable properties of an ensemble feature selector: accuracy and diversity. Moreover, five different data partitioning ratios are considered, and their impact on the predictive model is measured using the coefficient of determination (r-square) and root mean square error. The proposed strategy produces significantly better outcomes in terms of the number of selected features and classification accuracy.
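
The two measures used above to compare partitioning ratios can be computed directly from true and predicted values. A minimal sketch using the standard definitions; the numbers are invented and do not come from the paper.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined when all true values are equal (SS_tot is zero).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical true values and model predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

error = rmse(y_true, y_pred)
fit = r_squared(y_true, y_pred)
print(error, fit)
```

A lower RMSE and an r-square closer to 1 both indicate a better fit, so the two measures usually agree when ranking partitioning ratios on the same data.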


2012 ◽  
Vol 95 (3) ◽  
pp. 636-651
Author(s):  
Mohammad Goodarzi ◽  
Bieke Dejaegher ◽  
Yvan Vander Heyden

A quantitative structure-activity relationship (QSAR) relates quantitative chemical structure attributes (molecular descriptors) to a biological activity. QSAR studies have become attractive in drug discovery and development because their application can save substantial time and human resources. Several parameters are important for the prediction ability of a QSAR model. On the one hand, different statistical methods may be applied to check the linear or nonlinear behavior of a data set. On the other hand, feature selection techniques are applied to decrease the model complexity, to decrease the overfitting/overtraining risk, and to select the most important descriptors from the often more than 1000 calculated. The selected descriptors are then linked to a biological activity of the corresponding compound by means of a mathematical model. Different modeling techniques can be applied, some of which explicitly require a feature selection. A QSAR model can be useful in the design of new compounds with improved potency in the class under study. Only molecules with a predicted interesting activity will be synthesized. In the feature selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus attention, while ignoring the rest. Up to now, many feature selection techniques, such as genetic algorithms, forward selection, backward elimination, stepwise regression, and simulated annealing, have been used extensively. Swarm intelligence optimizations, such as ant colony optimization and particle swarm optimization, which are feature selection techniques usually modeled on animal and insect behavior (for example, finding the shortest path between a food source and the nest), have recently also been used in QSAR studies. This review paper provides an overview of different feature selection techniques applied in QSAR modeling.
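
Of the techniques listed above, forward selection is the simplest to sketch: start with no descriptors and greedily add whichever one improves a model score the most, stopping when nothing helps. In the sketch below `toy_score` is a hypothetical stand-in for a real validation score (for example, cross-validated model performance), and the descriptor names and weights are invented.

```python
def forward_selection(features, score, max_features):
    """Greedy forward selection driven by a subset-scoring function."""
    selected = []
    best_score = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Score every one-feature extension of the current subset.
        scored = [(score(selected + [f]), f) for f in candidates]
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical usefulness of each descriptor, with a small penalty per
# descriptor to mimic the cost of added model complexity.
USEFUL = {"logP": 0.5, "mol_weight": 0.3}

def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.05 * len(subset)

descriptors = ["logP", "mol_weight", "noise_1", "noise_2"]
chosen, final_score = forward_selection(descriptors, toy_score, max_features=4)
print(chosen, final_score)
```

Backward elimination is the mirror image: start with all descriptors and greedily remove the one whose removal improves the score the most.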

