Predicting RNA-seq data using genetic algorithm and ensemble classification algorithms

Author(s):  
Micheal Olaolu Arowolo ◽  
Marion O. Adebiyi ◽  
Ayodele A. Adebiyi ◽  
Olatunji J. Okesola

<p>Malaria parasites undergo uncertain, inconsistent life-stage changes as they breed across the environments of their mosquito vectors. Thousands of different parasite transcriptomes exist. Ribonucleic acid sequencing (RNA-seq), a prevalent technique for measuring gene expression, has improved the identification of genetic queries. Computing RNA-seq gene expression transcript data requires enhancement with analytical machine learning procedures. Numerous learning approaches have been adopted for analyzing biological data and improving model performance. In this study, a genetic algorithm dimensionality reduction technique is proposed to extract relevant information from a high-dimensional RNA-seq dataset, and classification is performed with ensemble classification algorithms. The experiment is performed using a mosquito Anopheles gambiae dataset, yielding classification accuracies of 81.7% and 88.3%.</p>
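The abstract above does not give the details of its genetic algorithm, so the following is only a minimal sketch of genetic-algorithm feature selection in general. Everything in it is an assumption made for illustration: the helper names (`make_toy_data`, `nearest_centroid_accuracy`, `ga_select`), the synthetic data, the nearest-centroid fitness function, and the truncation selection, one-point crossover, and bit-flip mutation operators. It is not the authors' method and does not use their dataset.

```python
import random


def make_toy_data(n_samples=20, n_features=8, seed=1):
    """Synthetic stand-in for expression data: only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 4.0 * label
        row[1] += 4.0 * label
        X.append(row)
        y.append(label)
    return X, y


def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except sample i.
        sums, counts = {}, {}
        for k in range(len(X)):
            if k == i:
                continue
            s = sums.setdefault(y[k], [0.0] * len(cols))
            for c, j in enumerate(cols):
                s[c] += X[k][j]
            counts[y[k]] = counts.get(y[k], 0) + 1
        best_label, best_dist = None, float("inf")
        for label, s in sums.items():
            dist = sum((X[i][j] - s[c] / counts[label]) ** 2
                       for c, j in enumerate(cols))
            if dist < best_dist:
                best_label, best_dist = label, dist
        correct += best_label == y[i]
    return correct / len(X)


def ga_select(X, y, pop_size=20, generations=15, mutation_rate=0.1, seed=0):
    """Evolve a 0/1 feature mask; returns the fittest mask in the final population."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        # Accuracy minus a small penalty per selected feature (assumed trade-off).
        return nearest_centroid_accuracy(X, y, mask) - 0.01 * sum(mask)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


if __name__ == "__main__":
    X, y = make_toy_data()
    mask = ga_select(X, y)
    print("selected features:", [j for j, keep in enumerate(mask) if keep])
```

The per-feature penalty is one simple way to push the search toward small subsets; a real study would tune it or replace it with a cross-validated score from the classifier that is actually used downstream.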

Author(s):  
Micheal Olaolu Arowolo ◽  
Marion O. Adebiyi ◽  
Ayodele A. Adebiyi ◽  
Charity Aremu

Malaria parasites undergo pronounced life-stage variations as they develop across multiple environments of the mosquito vector. There are transcriptomes of several thousand different parasites. Ribonucleic acid sequencing (RNA-seq) is a prevalent gene expression tool that leads to a better understanding of genetic interrogations. RNA-seq measures the transcription of expressed genes. RNA-seq data necessitate procedural enhancements in machine learning techniques. Researchers have suggested various learning approaches for the study of biological data. This study uses the Independent Component Analysis (ICA) feature extraction algorithm to uncover latent components in a high-dimensional RNA-seq vector dataset and estimates its classification performance; an ensemble classification algorithm is used to carry out the experiment. The study is tested on an RNA-seq mosquito Anopheles gambiae dataset. The experiment achieved a classification accuracy of 93.3%.
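The abstract above does not say which ensemble method was used. As one common possibility, the sketch below shows bagging: several weak learners are trained on bootstrap resamples and combined by majority vote. The decision-stump base learner and the function names (`fit_stump`, `fit_bagging`, `predict_bagging`) are assumptions for illustration, and the ICA feature extraction step is not shown.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Return (feature, threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


def fit_bagging(X, y, n_estimators=11, seed=0):
    """Train one stump per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagging(stumps, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
    y = [0, 0, 0, 1, 1, 1]
    stumps = fit_bagging(X, y)
    print(predict_bagging(stumps, [-1.0]), predict_bagging(stumps, [20.0]))
```

An odd number of estimators avoids tied votes in the two-class case.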


2021 ◽  
Vol 50 (9) ◽  
pp. 2579-2589
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi

RNA-Seq data are utilized for biological applications and decision making for classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for extracting relevant information from a given dataset. In this study, a novel optimized dimensionality reduction algorithm is proposed, combining an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses a decision tree on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to extract relevant features from the high-dimensional input feature space. A feature ranking and earlier experience are used. The performance of the model is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.


2021 ◽  
Vol 10 (2) ◽  
pp. 1071-1079
Author(s):  
Marion O. Adebiyi ◽  
Micheal O. Arowolo ◽  
Oludayo Olugbara

Malaria parasites undergo unpredictable, variable life stages as they spread across many environments of the mosquito vector. There are transcriptomes of a thousand distinct species. Ribonucleic acid sequencing (RNA-seq) is a ubiquitous gene expression strategy that contributes to the improvement of genetic survey recognition. RNA-seq measures gene expression transcript data, which calls for methodological enhancements to machine learning procedures. Scientists have suggested many learning approaches for the study of biological data. An enhanced optimized Genetic Algorithm feature selection technique is used in this analysis to obtain relevant information from a high-dimensional Anopheles gambiae dataset and to test its classification using SVM-Kernel algorithms. The efficacy of this approach is tested, and the experiment obtained accuracies of 93% and 96%, respectively.


2020 ◽  
Vol 21 (8) ◽  
pp. 2748 ◽  
Author(s):  
Ruth Barral-Arca ◽  
Alberto Gómez-Carballa ◽  
Miriam Cebey-López ◽  
María José Currás-Tuala ◽  
Sara Pischedda ◽  
...  

There is a growing interest in unraveling gene expression mechanisms leading to viral host invasion and infection progression. Current findings reveal that long non-coding RNAs (lncRNAs) are implicated in the regulation of the immune system by influencing gene expression through a wide range of mechanisms. By mining whole-transcriptome shotgun sequencing (RNA-seq) data using machine learning approaches, we detected two lncRNAs (ENSG00000254680 and ENSG00000273149) that are downregulated in a wide range of viral infections and different cell types, including blood mononuclear cells, umbilical vein endothelial cells, and dermal fibroblasts. The efficiency of these two lncRNAs was positively validated in different viral phenotypic scenarios. These two lncRNAs showed a strong downregulation in virus-infected patients when compared to healthy control transcriptomes, indicating that these biomarkers are promising targets for infection diagnosis. To the best of our knowledge, this is the very first study using host lncRNA biomarkers for the diagnosis of human viral infections.


2021 ◽  
Vol 18 (17) ◽  
Author(s):  
Micheal Olaolu AROWOLO ◽  
Marion Olubunmi ADEBIYI ◽  
Chiebuka Timothy NNODIM ◽  
Sulaiman Olaniyi ABDULSALAM ◽  
Ayodele Ariyo ADEBIYI

As mosquito parasites breed across many parts of sub-Saharan Africa, infected cells undergo an unpredictable and erratic life period. Millions of individual parasites have gene expressions. Ribonucleic acid sequencing (RNA-seq) is a popular transcriptional technique that has improved the detection of major genetic probes. RNA-seq analysis generally requires computational improvements to machine learning techniques, since it computes interpretations of gene expressions. For this study, an adaptive genetic algorithm (A-GA) combined with recursive feature elimination (RFE) (A-GA-RFE) was used as the feature selection algorithm to detect important information in a high-dimensional gene expression malaria vector RNA-seq dataset. Support Vector Machine (SVM) kernels were used as the classification algorithms to evaluate its predictive performance. The feasibility of this study was confirmed using an RNA-seq dataset from the mosquito Anopheles gambiae. The technique achieved accuracy rates of 98.3% and 96.7%, respectively.

Highlights: dimensionality reduction method based on feature selection; classification using a support vector machine; classification of a malaria vector dataset using an adaptive GA-RFE-SVM.
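Recursive feature elimination, named in the abstract above, repeatedly fits a model, drops the feature with the smallest absolute weight, and refits on what remains. The sketch below illustrates only that loop. As a stated assumption, a small linear classifier trained with hinge loss by subgradient descent stands in for the SVM, the adaptive genetic algorithm stage is omitted, and the function names and hyperparameters are hypothetical.

```python
def fit_linear_hinge(X, y, epochs=50, lr=0.1, lam=0.01):
    """Train a linear classifier with hinge loss and L2 shrinkage.

    Labels must be -1 or +1. Returns the weight vector and the bias.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, row)) + b)
            if margin < 1:
                # Margin violated: step toward classifying this sample correctly.
                w = [wi + lr * (label * xi - lam * wi) for wi, xi in zip(w, row)]
                b += lr * label
            else:
                # Margin satisfied: apply only the regularization shrinkage.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def rfe(X, y, n_keep):
    """Return the original column indices of the n_keep surviving features."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = fit_linear_hinge(X_sub, y)
        # Drop the feature whose weight has the smallest magnitude, then refit.
        weakest = min(range(len(remaining)), key=lambda k: abs(w[k]))
        del remaining[weakest]
    return remaining


if __name__ == "__main__":
    # Feature 0 separates the classes, feature 1 is noise, feature 2 is constant.
    X = [[2.0, 0.5, 0.0], [2.0, -0.5, 0.0], [-2.0, 0.5, 0.0], [-2.0, -0.5, 0.0]]
    y = [1, 1, -1, -1]
    print(rfe(X, y, n_keep=1))
```

Refitting after every removal is what distinguishes RFE from a one-shot ranking: the weights of the surviving features can change once a correlated feature is gone.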


2018 ◽  
Author(s):  
Chi Tung Choy ◽  
Chi Hang Wong ◽  
Stephen Lam Chan

Artificial neural networks (ANNs) have been utilized for classification and prediction tasks with remarkable accuracy. However, their implications for unsupervised data mining using molecular data are under-explored. We adopted a method of unsupervised ANN, namely word embedding, to extract biologically relevant information from a TCGA gene expression dataset. Ground truth relationships, such as the cancer types of the input samples and the semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list. This method is feasible for mining large volumes of biological data and would be a valuable tool for discovering novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference.


2015 ◽  
Author(s):  
Jeffrey A Thompson ◽  
Jie Tan ◽  
Casey S Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
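Two of the baseline transformations named in the abstract above, the log2 transformation and quantile normalization, are simple enough to sketch directly. The code below is an illustration under stated assumptions and not the authors' TDM package: the `+ 1` pseudocount in `log2_transform`, the samples-as-rows layout, and the tie handling (ties broken by column position) are choices made here. TDM itself is not implemented.

```python
import math


def log2_transform(matrix):
    """Apply log2(x + 1) to every value; the +1 pseudocount (assumed) keeps zeros finite."""
    return [[math.log2(v + 1.0) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Force every sample (row) to share the same distribution of values.

    Each value is replaced by the mean, across samples, of the values holding
    the same rank. Ties are broken by column position in this sketch.
    """
    n_genes = len(matrix[0])
    sorted_rows = [sorted(row) for row in matrix]
    rank_means = [sum(r[k] for r in sorted_rows) / len(matrix)
                  for k in range(n_genes)]
    result = []
    for row in matrix:
        order = sorted(range(n_genes), key=lambda j: row[j])
        out = [0.0] * n_genes
        for rank, j in enumerate(order):
            out[j] = rank_means[rank]
        result.append(out)
    return result


if __name__ == "__main__":
    counts = [[5, 2, 3], [4, 1, 6]]
    print(quantile_normalize(counts))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
    print(log2_transform([[0, 1, 3]]))  # [[0.0, 1.0, 2.0]]
```

After quantile normalization both samples contain exactly the values 1.5, 3.5, and 5.5, arranged according to each sample's original gene ranking.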


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1621 ◽  
Author(s):  
Jeffrey A. Thompson ◽  
Jie Tan ◽  
Casey S. Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


2020 ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to extract relevant features from the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.

