An Efficient Cancer Classification Model Using Microarray and High-Dimensional Data

2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Hanaa Fathi ◽  
Hussain AlSalman ◽  
Abdu Gumaei ◽  
Ibrahim I. M. Manhrawy ◽  
Abdelazim G. Hussien ◽  
...  

Cancer is one of the leading causes of death worldwide. One of the most effective tools for cancer diagnosis, prognosis, and treatment is expression profiling based on gene microarrays. For each data point (sample), gene expression data usually covers tens of thousands of genes. As a result, the data is large-scale, high-dimensional, and highly redundant. The classification of gene expression profiles is considered an NP-hard problem. Feature (gene) selection is one of the most effective methods for handling this problem. This paper presents a hybrid cancer classification approach that combines several machine learning techniques: Pearson’s correlation coefficient as a correlation-based feature selector and reducer, a Decision Tree classifier that is easy to interpret and does not require parameters, and Grid Search CV (cross-validation) to optimize the maximum depth hyperparameter. Seven standard microarray cancer datasets are used to evaluate the model. To identify which features are the most informative and relevant under the proposed model, various performance measures are employed, including classification accuracy, specificity, sensitivity, F1-score, and AUC. According to the results, the proposed strategy greatly decreases the number of genes required for classification, selects the most informative features, and increases classification accuracy.
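
The correlation-based selection step described in this abstract can be illustrated with a minimal sketch. It assumes that each gene is ranked by the absolute Pearson correlation between its expression values and the class label, which is one common reading of correlation-based selection and may differ from the authors' procedure. The function names, the toy data, and the cutoff `k` are hypothetical, and the Decision Tree and Grid Search CV stages are not shown.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_top_genes(samples, labels, k):
    """Return the indices of the k genes with the largest |r| against the labels."""
    n_genes = len(samples[0])
    scores = []
    for j in range(n_genes):
        column = [row[j] for row in samples]
        scores.append((abs(pearson(column, labels)), j))
    # Highest correlation first; ties are broken by gene index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]

# Synthetic toy data (not from the paper): 6 samples x 4 "genes".
samples = [
    [1.0, 5.0, 0.2, 3.0],
    [1.1, 4.0, 0.9, 3.0],
    [0.9, 6.0, 0.1, 3.0],
    [3.0, 5.5, 0.8, 3.0],
    [3.2, 4.5, 0.3, 3.0],
    [2.9, 5.0, 0.7, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
selected = select_top_genes(samples, labels, k=2)
print(selected)
```

On this toy input the first gene separates the two classes almost perfectly and the last gene is constant, so the ranking keeps the former and discards the latter. The reduced matrix would then be passed to the classifier.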

2021 ◽  
Vol 29 ◽  
pp. 287-295
Author(s):  
Zhiming Zhou ◽  
Haihui Huang ◽  
Yong Liang

BACKGROUND: In genome research, it is particularly important to identify molecular biomarkers or signaling pathways related to phenotypes. The logistic regression model is a powerful discrimination method that offers a clear statistical explanation and yields classification probabilities for the class labels. However, on its own it cannot perform biomarker selection. OBJECTIVE: The aim of this paper is to give the model efficient gene selection capability. METHODS: In this paper, we propose a new penalized logsum network-based regularization logistic regression model for gene selection and cancer classification. RESULTS: Experimental results on simulated data sets show that our method is effective in the analysis of high-dimensional data. For a large data set, the proposed method achieved AUC values of 89.66% (training) and 90.02% (testing), which are, on average, 5.17% (training) and 4.49% (testing) better than mainstream methods. CONCLUSIONS: The proposed method can be considered a promising tool for gene selection and cancer classification of high-dimensional biological data.
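
For orientation, a log-sum penalized, network-regularized logistic regression objective is often written in the general form below. This is a sketch under assumptions, not the authors' exact formulation: the tuning parameters $\lambda_1, \lambda_2$, the small constant $\varepsilon > 0$, and the graph Laplacian $L$ of a gene network are notation introduced here, with $y_i \in \{0, 1\}$ the class label and $x_i$ the expression vector of sample $i$.

```latex
\min_{\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i\, x_i^{\top}\beta
  - \log\bigl(1 + e^{x_i^{\top}\beta}\bigr) \Bigr]
\;+\; \lambda_1 \sum_{j=1}^{p} \log\bigl(|\beta_j| + \varepsilon\bigr)
\;+\; \lambda_2\, \beta^{\top} L\, \beta
```

The first term is the negative logistic log-likelihood, the log-sum term pushes coefficients toward exact zeros (gene selection), and the quadratic Laplacian term encourages genes that are connected in the network to receive similar coefficients.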


This work derives methodologies to detect heart problems at an early stage and to notify patients so that they can improve their health. To address this problem, we use machine learning techniques to predict the incidence of heart disease at an early stage. For prediction, we use parameters such as age, sex, height, weight, case history, smoking, and alcohol consumption, together with tests such as blood pressure, cholesterol, diabetes, ECG, and ECHO. Many machine learning algorithms can be applied to this problem, including K-Nearest Neighbour, Support Vector classifier, Decision Tree classifier, Logistic Regression, and Random Forest classifier. Using these parameters and algorithms, we predict whether or not the patient has heart disease and recommend ways for the patient to improve his/her health.


Online discussion forums and blogs are vibrant platforms where cancer patients express their views in the form of stories. These stories sometimes become a source of inspiration for patients who are anxiously searching for similar cases. This paper proposes a method that uses natural language processing and machine learning to analyze unstructured texts accumulated from patients’ reviews and stories. The proposed methodology aims to identify behavior, emotions, side-effects, decisions, and demographics associated with cancer victims. The pre-processing phase of our work involves extraction of web text, followed by text cleaning in which some special characters and symbols are omitted, and finally tagging of the texts using NLTK’s (Natural Language Toolkit) POS (Parts of Speech) Tagger. The post-processing phase trains seven machine learning classifiers (see Table 6). The Decision Tree classifier shows the highest precision (0.83) among the classifiers, while the Area Under the receiver operating characteristic Curve (AUC) is highest for the Support Vector Machine (SVM) classifier (0.98).
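
The text-cleaning stage of the pre-processing phase might look roughly like the sketch below. It uses only the Python standard library; the exact characters the authors remove are not stated, so the regular expressions, the function names, and the sample story are assumptions made for illustration.

```python
import html
import re

def clean_text(raw):
    """Strip HTML tags and special symbols, collapse whitespace, lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    unescaped = html.unescape(no_tags)  # e.g. "&amp;" becomes "&"
    no_symbols = re.sub(r"[^A-Za-z0-9\s.,!?'-]", " ", unescaped)
    return re.sub(r"\s+", " ", no_symbols).strip().lower()

def tokenize(text):
    """Split cleaned text into word tokens, keeping contractions together."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)

# Synthetic example text, not taken from the study's corpus.
story = "<p>After chemo I felt *very* tired &amp; anxious... but I didn't give up!</p>"
cleaned = clean_text(story)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```

The resulting token list is the kind of input that could then be handed to NLTK's `pos_tag` function for part-of-speech tagging, a step omitted here because it requires the NLTK tagger models to be installed.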


2021 ◽  
pp. 1-11
Author(s):  
Jesús Miguel García-Gorrostieta ◽  
Aurelio López-López ◽  
Samuel González-López ◽  
Adrián Pastor López-Monroy

Writing an academic thesis is a complex task that requires the author to be skilled in argumentation. The goal of the academic author is to communicate clear ideas and to convince the reader of the presented claims. However, few students are good arguers, and this is a skill that takes time to master. In this paper, we present an exploration of lexical features used to model the automatic detection of argumentative paragraphs using machine learning techniques. We present a novel proposal that combines the information in the complete paragraph with the detection of argumentative segments in order to improve the detection of argumentative paragraphs. We propose two approaches: a more descriptive one, which uses a decision tree classifier with indicators and lexical features, and a more efficient one, which uses an SVM classifier with lexical features and a Document Occurrence Representation (DOR). Both approaches consider the detection of argumentative segments to ensure that a paragraph detected as argumentative does indeed contain segments with argumentation. We achieved encouraging results for both approaches.


2021 ◽  
Author(s):  
A B Pawar ◽  
M A Jawale ◽  
Ravi Kumar Tirandasu ◽  
Saiprasad Potharaju

High dimensionality is a serious issue in the preprocessing stage of data mining. A large number of features in a dataset leads to several complications when classifying an unknown instance. The initial data space may contain redundant and irrelevant features, which leads to high memory consumption and confuses the learning model built from those features. It is always advisable to select the best features and then generate the classification model for better accuracy. In this research, we propose a novel feature selection approach, Symmetrical Uncertainty and Correlation Coefficient (SU-CCE), for reducing the high-dimensional feature space and increasing classification accuracy. The experiment is performed on a colon cancer microarray dataset that has 2000 features, from which the proposed method derives the 38 best features. To measure the strength of the proposed method, the top 38 features extracted by four traditional filter-based methods are compared using various classifiers. After careful investigation of the results, the proposed approach is found to be competitive with most of the traditional methods.
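
Symmetrical uncertainty is a standard information-theoretic filter score, defined as SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)). The sketch below computes it for discrete variables. How the authors combine it with the correlation coefficient in SU-CCE is not described in the abstract, so only the SU part is shown. The toy values are synthetic, and continuous microarray values would first need to be discretized, which is an assumption not covered here.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x = entropy(xs)
    h_y = entropy(ys)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy     # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)

# Synthetic toy example: one feature mirrors the label, the other does not.
labels = ["tumor", "tumor", "normal", "normal"]
informative = ["high", "high", "low", "low"]
noise = ["high", "low", "high", "low"]

su_informative = symmetrical_uncertainty(informative, labels)
su_noise = symmetrical_uncertainty(noise, labels)
print(su_informative, su_noise)
```

A feature that determines the label scores 1 and a feature that is independent of the label scores 0, so ranking features by SU and keeping the top entries gives a simple filter-based selector.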


Author(s):  
V. Jinubala ◽  
P. Jeyakumar

Aims: To classify rice pest data based on weather attributes using a machine learning approach, a decision tree classifier, and to validate the performance results against other existing techniques through comparison. Design: Rice pest classification using the C5.0 algorithm. Methodology: We collected rice pest data from crop fields in various regions of the state of Maharashtra, India. The dataset contains the name of the region (Taluk), period (week), pest data, temperature, rainfall, and relative humidity. The data were collected from 39 taluks within four districts in different weeks of 2013-2014. Weather information plays a vital role in this rice pest data analysis because pest infestation varies with the weather in all the regions. The pests considered in this research are Yellow Stem borer, Gall midge, Leaf folder, and Planthopper. The collected dataset is given as input to the classifier, with 75% of the data used for training and 25% used for testing. Results: The proposed C5.0 algorithm performed better in classifying the rice pest dataset based on weather attributes. The C5.0 algorithm achieved 88.99% accuracy, 78.81% sensitivity, and 89.11% specificity, which are higher than those of the other techniques. Compared with the other methods, the C5.0 algorithm improved accuracy by 1.3 to 8.5%, sensitivity by 2.4 to 9%, and specificity by 0.8 to 7.8%. Conclusion: Early detection of pests and pest-based diseases is essential to avoid major crop losses. The proposed classification model is designed to classify the level of pest infestation based on weather attributes, since the level of infestation caused by rice pests varies with weather conditions. The C5.0 algorithm classified the rice pest data based on the weather attributes in the dataset.
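
The 75%/25% split and the three reported measures can be sketched as follows. This is a simplified illustration, not the study's code: it treats the task as binary (one class designated positive), whereas the study classifies infestation levels. The function names, the random seed, and the example predictions are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def binary_metrics(actual, predicted, positive=1):
    """Accuracy, sensitivity (recall of positives), specificity (recall of negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# 100 placeholder records split 75/25.
train, test = train_test_split(list(range(100)))

# Synthetic labels and predictions for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(actual, predicted)
print(len(train), len(test), metrics)
```

Sensitivity and specificity are reported separately because accuracy alone can look high when one class dominates the data.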


2011 ◽  
Vol 467-469 ◽  
pp. 2123-2128
Author(s):  
Zhi Yuan Zeng ◽  
Bo Li ◽  
Xiao Jun Tan ◽  
Jian Zhong Zhou

In this paper, we propose a vehicle classification algorithm based on the cloud model. The cloud model is a new theory that can express the relationship between randomness and fuzziness. Vehicle features, such as vehicle size, shape information, contour information, and edge information, are extracted for the cloud model. Each vehicle class is expressed through cloud model parameters, such as Ex (expectation) and En (entropy), over multi-dimensional features. A cloud classification model is then employed to judge the optimal class for each vehicle. Furthermore, attribute similarity is introduced to determine the weight of each feature in classification, and a decision tree classifier is utilized for classification. Evaluations of the algorithm on video image series show that the cloud model ensures promising and stable performance in recognizing these vehicle classes, and that the algorithm achieves both accuracy and real-time operation.
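
As a rough illustration of how cloud model parameters can drive a class decision, the sketch below evaluates the certainty degree of a normal cloud, exp(-(x - Ex)² / (2·En'²)), where En' is drawn around En with spread He (hyper-entropy, the usual third parameter of the normal cloud model). It then picks the class with the largest weighted certainty. The feature choice, the (Ex, En) values, the weights, and the function names are hypothetical, and the attribute-similarity weighting and decision tree stage of the paper are not reproduced.

```python
import math
import random

def cloud_membership(x, ex, en, he=0.0, rng=None):
    """Certainty degree of x under a normal cloud with parameters (Ex, En, He)."""
    # With He = 0 the cloud is deterministic; otherwise En is perturbed randomly.
    en_prime = en if he == 0.0 else (rng or random).gauss(en, he)
    if en_prime == 0:
        return 1.0 if x == ex else 0.0
    return math.exp(-((x - ex) ** 2) / (2.0 * en_prime ** 2))

def classify(features, class_clouds, weights):
    """Pick the class with the largest weighted certainty over all features."""
    best_class, best_score = None, -1.0
    for name, clouds in class_clouds.items():
        score = sum(
            w * cloud_membership(x, ex, en)
            for x, (ex, en), w in zip(features, clouds, weights)
        )
        if score > best_score:
            best_class, best_score = name, score
    return best_class

# Hypothetical (Ex, En) per feature: vehicle length and height in metres.
class_clouds = {
    "car": [(4.5, 0.5), (1.5, 0.2)],
    "truck": [(9.0, 1.5), (3.2, 0.4)],
}
weights = [0.6, 0.4]  # assumed feature weights that sum to 1

label = classify([4.8, 1.6], class_clouds, weights)
print(label)
```

A measurement close to a class's expectation Ex receives a certainty near 1, and the entropy En controls how quickly that certainty falls off with distance.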


2016 ◽  
Vol 2016 ◽  
pp. 1-21 ◽  
Author(s):  
Taimur Bakhshi ◽  
Bogdan Ghita

Traffic classification utilizing flow measurement enables operators to perform essential network management. Flow accounting methods such as NetFlow are, however, considered inadequate for classification, requiring additional packet-level information, host behaviour analysis, and specialized hardware, which limits their practical adoption. This paper aims to overcome these challenges by proposing a two-phased machine learning classification mechanism with NetFlow as input. The individual flow classes are derived per application through k-means and are further used to train a C5.0 decision tree classifier. As part of validation, the initial unsupervised phase used flow records of fifteen popular Internet applications that were collected and independently subjected to k-means clustering to determine the unique flow classes generated per application. The derived flow classes were afterwards used to train and test a supervised C5.0-based decision tree. The resulting classifier reported an average accuracy of 92.37% on approximately 3.4 million test cases, increasing to 96.67% with adaptive boosting. The classifier specificity factor, which accounted for differentiating content-specific from supplementary flows, ranged between 98.37% and 99.57%. Furthermore, the computational performance and accuracy of the proposed methodology in comparison with similar machine learning techniques lead us to recommend its extension to other applications in achieving highly granular real-time traffic classification.
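
The unsupervised first phase can be sketched with a plain Lloyd's k-means over per-flow features. The features used here (packets per flow and kilobytes per flow), the toy records, and the seeding of centroids with the first k points are assumptions made for illustration, not the paper's configuration. The supervised C5.0 phase is not shown.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Synthetic flow records for one application: (packets, kilobytes).
flows = [
    (10, 0.5), (200, 150.0), (12, 0.6),
    (11, 0.4), (210, 160.0), (190, 140.0),
]
assignment, centroids = kmeans(flows, k=2)
print(assignment)
print(centroids)
```

Each cluster index then acts as a derived flow class (for example, short signalling flows versus bulk content flows), and the labelled flows become training data for the supervised decision tree.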

