Data mining methods for gene selection on the basis of gene expression arrays

2014 ◽  
Vol 24 (3) ◽  
pp. 657-668 ◽  
Author(s):  
Michał Muszyński ◽  
Stanisław Osowski

Abstract The paper presents data mining methods applied to gene selection for the recognition of a particular type of prostate cancer on the basis of gene expression arrays. Several gene selection methods, including the Fisher method, correlation of a gene with a class, application of the support vector machine, and statistical hypothesis testing, are compared on the basis of clustering measures. The results of these individual selection methods are combined to identify the most often selected genes, which form the required pattern best associated with the cancerous cases. The resulting pattern of selected genes is treated as the input data to the classifier, which performs the final recognition of the patterns. The numerical results of distinguishing prostate cancer from normal (reference) cases using the selected genes and the support vector machine confirm the good performance of the proposed gene selection approach.
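As a rough illustration of the Fisher method named in the abstract, the sketch below ranks genes by a two-class Fisher discriminant measure. The exact formula used in the paper is not given here, so the form |m1 - m2| / (s1 + s2) is an assumption, and the gene names and expression values are invented toy data.

```python
from statistics import mean, pstdev


def fisher_score(values, labels):
    """Assumed two-class Fisher measure for one gene: |m1 - m2| / (s1 + s2)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if mean(pos) != mean(neg) else 0.0
    return abs(mean(pos) - mean(neg)) / spread


def rank_genes(expression, labels, top_k):
    """Return the top_k gene names ordered by decreasing Fisher score."""
    scores = {gene: fisher_score(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes measured on six samples.
expression = {
    "gene_a": [5.1, 4.9, 5.0, 1.0, 1.2, 0.9],
    "gene_b": [2.0, 3.5, 1.0, 2.2, 3.1, 1.4],
    "gene_c": [0.5, 0.7, 0.6, 0.9, 1.1, 1.0],
}
labels = [1, 1, 1, 0, 0, 0]  # 1 = cancer, 0 = normal (illustrative)
top_genes = rank_genes(expression, labels, top_k=2)
```

Genes whose class means are far apart relative to their within-class spread score highest, which is why gene_a outranks the noisy gene_b in this toy set.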

Author(s):  
Triantafyllos Paparountas ◽  
Maria Nefeli Nikolaidou-Katsaridou ◽  
Gabriella Rustici ◽  
Vasilis Aidinis

Microarray technology enables high-throughput parallel gene expression analysis, and its use has grown exponentially thanks to the development of a variety of applications for expression, genetic, and epigenetic studies. A wealth of data is now available from public repositories, providing unprecedented opportunities for meta-analysis approaches, which could generate new biological information unrelated to the original scope of individual studies. This study provides a guideline for identifying the biological significance of statistically selected, differentially expressed genes derived from gene expression arrays, and suggests further analysis pathways. The authors review the prerequisites for data mining and meta-analysis, summarize the conceptual methods for deriving biological information from microarray data, and suggest software for each category of data mining or meta-analysis.


Automatic lung cancer classification remains a challenging problem for researchers because gene expression data are noisy and high-dimensional and sample sizes are small. To address these problems, an enhanced gene selection algorithm and a multiclass classifier are developed. In this research, lung cancer-related gene expression data (GEO IDs: GSE10245, GSE19804, GSE7670, GSE10072, and GSE6044) were collected from the Gene Expression Omnibus (GEO) repository. Gene selection was then carried out with an enhanced reliefF algorithm to select the optimal genes. In the enhanced reliefF gene selection algorithm, the earthmover distance measure and a firefly optimizer were used instead of the Manhattan distance measure to identify the nearest miss and nearest hit instances, which significantly lessens the "curse of dimensionality" issue. The optimal genes were given as input to a Multiclass Support Vector Machine (MSVM) classifier to classify the sub-classes of lung cancer. The experimental section showed that the proposed system improved classification accuracy by 3-10% relative to existing systems, evaluated in terms of accuracy, False Positive Rate (FPR), error rate, and True Positive Rate (TPR).
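For orientation, the nearest-hit and nearest-miss idea can be sketched with the basic Relief update and the Manhattan distance that the abstract says the enhanced method replaces. This is not the enhanced algorithm (no earthmover distance, no firefly optimizer, one neighbour instead of ReliefF's k neighbours); it assumes two classes with at least two samples each, and the data are invented.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def relief_weights(samples, labels):
    """Basic Relief feature weights, visiting every sample once (deterministic)."""
    n, n_feat = len(samples), len(samples[0])
    # Per-feature ranges scale differences to [0, 1]; constant features use 1.0.
    ranges = []
    for f in range(n_feat):
        column = [s[f] for s in samples]
        ranges.append((max(column) - min(column)) or 1.0)
    weights = [0.0] * n_feat
    for i, (x, y) in enumerate(zip(samples, labels)):
        others = [j for j in range(n) if j != i]
        # Nearest hit: closest sample of the same class.
        hit = min((j for j in others if labels[j] == y),
                  key=lambda j: manhattan(x, samples[j]))
        # Nearest miss: closest sample of the other class.
        miss = min((j for j in others if labels[j] != y),
                   key=lambda j: manhattan(x, samples[j]))
        for f in range(n_feat):
            diff_miss = abs(x[f] - samples[miss][f])
            diff_hit = abs(x[f] - samples[hit][f])
            weights[f] += (diff_miss - diff_hit) / (ranges[f] * n)
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
samples = [[1.0, 0.3], [1.2, 0.9], [0.9, 0.5],
           [3.0, 0.4], [3.1, 0.8], [2.9, 0.6]]
labels = [0, 0, 0, 1, 1, 1]
weights = relief_weights(samples, labels)
```

A feature gains weight when it differs more from the nearest miss than from the nearest hit, so the discriminative feature ends up with the larger weight.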


2019 ◽  
Vol 1 (92) ◽  
pp. 65-70
Author(s):  
G.V. Marchuk ◽  
V.L. Levkivskyy ◽  
S.S. Kaliberda

The article focuses on data mining methods such as linear and polynomial regression and the support vector machine. Their successful application rests on the fact that data mining methods and technologies support the study of data and the discovery of hidden patterns in them. The analysis helps identify various features and data parameters, and it is therefore a powerful tool at the stage of building forecasting models.


2020 ◽  
Vol 18 (11) ◽  
pp. 01-13
Author(s):  
Dr.M. Kalaiarasu ◽  
Dr.J. Anitha

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by weakened social skills, impaired verbal and non-verbal interaction, and repetitive behavior. The prevalence of ASD has increased in the past few years, and its root cause cannot yet be determined. Gene expression in ASD is analyzed with classification methods. For the selection of genes in ASD, statistical filters and a wrapper-based Geometric Binary Particle Swarm Optimization-Support Vector Machine (GBPSO-SVM) algorithm have recently been implemented. However, GBPSO provides lower accuracy when the number of dataset samples is large, and it cannot be applied directly to multiple-output systems. To overcome this issue, a Modified Cuckoo Search-Support Vector Machine (MCS-SVM) based wrapper feature selection algorithm is proposed, which improves the accuracy of the classifier in ASD. This work consists of three major steps: (i) preprocessing, (ii) gene selection, and (iii) classification. First, in preprocessing, entries with mean or median ratios close to unity are removed from the original gene dataset, which reduces it from 54,613 to 9,454 entries. Second, gene selection is performed using statistical filters and the wrapper algorithm. Statistical filter methods, namely the Wilcoxon Rank Sum test (WRS), the Class Correlation (COR) function, and the Two-sample T-test (TT), were applied in parallel with ten-fold cross-validation to select the most discriminatory genes. In the wrapper algorithm, Modified Cuckoo Search (MCS) is proposed for gene selection. This step reduces the number of genes in the dataset by removing genes. Finally, an SVM classifier is applied to the combined gene subsets for classification. The autism microarray dataset used in the analysis was downloaded from the benchmark public repository Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (NCBI). The classification methods are evaluated in terms of precision, recall, F-measure, and accuracy. The proposed MCS-SVM classifier achieves the highest accuracy when compared with the Linear Regression (LR) and GBPSO-SVM classifiers.
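One of the statistical filters mentioned above, the two-sample t-test, can be sketched as a simple per-gene filter. The abstract does not say which variant was used, so Welch's unequal-variance statistic is an assumption here, as are the threshold, the probe names, and the expression values.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch two-sample t statistic (unequal variances assumed)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        # No variation in either group: treat as uninformative.
        return 0.0
    return (mean(group_a) - mean(group_b)) / se


def filter_genes(expression, labels, threshold):
    """Keep genes whose absolute t statistic reaches the threshold."""
    kept = []
    for gene, values in expression.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        if abs(welch_t(cases, controls)) >= threshold:
            kept.append(gene)
    return kept


# Hypothetical toy data: 1 = ASD, 0 = control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "probe_1": [8.0, 8.4, 7.9, 8.2, 5.1, 5.3, 4.8, 5.0],
    "probe_2": [6.0, 6.5, 5.8, 6.2, 6.1, 6.4, 5.9, 6.3],
}
kept = filter_genes(expression, labels, threshold=2.0)
```

In a cross-validated setting, a filter like this would be rerun on each training fold so that the held-out samples do not influence which genes are kept.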


2012 ◽  
Vol 60 (3) ◽  
pp. 461-470 ◽  
Author(s):  
A. Wiliński ◽  
S. Osowski

Abstract The paper presents an ensemble of data mining methods for discovering the most important genes and gene sequences generated by gene expression arrays, responsible for the recognition of a particular type of cancer. The analyzed methods include the correlation of the feature with a class, application of statistical hypothesis testing, the Fisher measure of discrimination, and application of the linear Support Vector Machine for characterizing the discrimination ability of the features. In the first step of ranking we apply each method individually, choosing the genes most often selected in the cross-validation of the available data set. In the next step we combine the results of the different selection methods and once again choose the genes appearing most frequently in the selected sets. On this basis we form the final ranking of the genes. The most important genes form the input information delivered to the Support Vector Machine (SVM) classifier, responsible for the final recognition of tumor from non-tumor data. Different ways of checking the correctness of the proposed ranking procedure have been applied. The first relies on mapping the distribution of the selected genes onto the two-coordinate system formed by the two most important principal components of the PCA transformation and applying cluster quality measures. The other depicts the results graphically by presenting the gene expressions as pixel intensities for the available data. The final confirmation of the quality of the proposed ranking method is provided by the classification results for recognizing cancer cases from non-cancer (normal) ones, obtained using the Gaussian kernel SVM. The results of selecting the most significant genes used by the SVM to distinguish prostate cancer cases from normal cases have confirmed good accuracy. The presented methodology is of potential use for practical application in bioinformatics.
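The combination step described above, choosing the genes that appear most frequently across the individual selections, can be sketched as a simple vote count. The method names, gene identifiers, and the alphabetical tie-break are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def combine_selections(selections, top_k):
    """Rank genes by how many selection methods chose them (one vote per method)."""
    votes = Counter(gene for genes in selections.values() for gene in set(genes))
    # Sort by descending vote count; break ties alphabetically for determinism.
    ranked = sorted(votes, key=lambda gene: (-votes[gene], gene))
    return ranked[:top_k]


# Hypothetical gene lists returned by four selection methods.
selections = {
    "fisher": ["g1", "g2", "g3"],
    "correlation": ["g2", "g3", "g4"],
    "svm": ["g3", "g2", "g5"],
    "hypothesis_test": ["g3", "g1", "g6"],
}
consensus = combine_selections(selections, top_k=3)
```

The same counting can be applied one level down, across cross-validation folds of a single method, before the per-method lists are merged.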


Author(s):  
Sarangam Kodati ◽  
Jeeva Selvaraj

Data mining is the most widely known approach to knowledge discovery from data (KDD). Machine learning is used to enable a program to analyze data, recognize correlations, and make use of insights to solve problems, enrich data, and make predictions. The chapter highlights the need for more research into the use of robust data mining methods to help healthcare specialists in the diagnosis of heart disease and other debilitating conditions. Heart disease is the leading cause of death in the world. Nearly 47% of deaths are caused by heart disease. The authors use algorithms including random forest, naïve Bayes, and the support vector machine to analyze heart disease. Accuracy at the prediction stage is high when a greater number of attributes is used. The goal is to perform predictive analysis of heart disease using data mining and to show which methods are effective and efficient.


2019 ◽  
Vol 15 (2) ◽  
pp. 275-280
Author(s):  
Agus Setiyono ◽  
Hilman F Pardede

It is now common for a cellphone to receive spam messages. The large number of received messages makes it difficult for a human to classify them as spam or not spam. One way to overcome this problem is to use data mining for automatic classification. In this paper, we investigate various data mining techniques, namely Support Vector Machine, Multinomial Naïve Bayes, and Decision Tree, for automatic spam detection. Our experimental results show that the Support Vector Machine is the best of the three evaluated algorithms. Support Vector Machine achieves 98.33% accuracy, while Multinomial Naïve Bayes achieves 98.13% and Decision Tree 97.10%.

