Predicting RNA-seq data using genetic algorithm and ensemble classification algorithms

Malaria parasites undergo marked life-stage variation as they develop in multiple environments within the mosquito vector. Transcriptomes are available for several thousand different parasites. Ribonucleic acid sequencing (RNA-seq) is a prevalent gene expression tool that supports a better understanding of genetic questions. RNA-seq measures the transcription of expressed genes, and RNA-seq data call for procedural enhancements in machine learning techniques. Researchers have suggested various learning approaches for the study of biological data. This study uses the Independent Component Analysis (ICA) feature extraction algorithm to recover latent components from a high-dimensional RNA-seq vector dataset and estimates its classification performance. An ensemble classification algorithm is used to carry out the experiment. The approach is tested on an RNA-seq dataset of the mosquito Anopheles gambiae. The experiment achieved a classification accuracy of 93.3%.

Enhanced Dimensionality Reduction Methods for Classifying Malaria Vector Dataset using Decision Tree

Sains Malaysiana ◽

10.17576/jsm-2021-5009-07 ◽

2021 ◽

Vol 50 (9) ◽

pp. 2579-2589

Author(s):

Micheal Olaolu Arowolo ◽

Marion Olubunmi Adebiyi ◽

Ayodele Ariyo Adebiyi

Keyword(s):

Gene Expression ◽

Decision Tree ◽

Dimensionality Reduction ◽

Principal Component ◽

Feature Space ◽

Relevant Information ◽

Component Analysis ◽

RNA-Seq ◽

Reduction Methods ◽

Mosquito Anopheles Gambiae

RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for extracting relevant information from a given dataset. In this study, a novel optimized dimensionality reduction algorithm is proposed by combining an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. A decision tree classifier is applied to the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. Feature ranking and earlier experience are used. The performance of the model is evaluated and validated using classification accuracy and compared with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.

A genetic algorithm for prediction of RNA-seq malaria vector gene expression data classification using SVM kernels

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i2.2769 ◽

2021 ◽

Vol 10 (2) ◽

pp. 1071-1079

Author(s):

Marion O. Adebiyi ◽

Micheal O. Arowolo ◽

Oludayo Olugbara

Keyword(s):

Gene Expression ◽

Genetic Algorithm ◽

Distinct Species ◽

Relevant Information ◽

High Dimensional ◽

Expression Data ◽

RNA-Seq ◽

Mosquito Vectors ◽

Feature Selection Technique ◽

Genetic Survey

Malaria parasites pass through unpredictable, variable life stages as they spread across the many environments of the mosquito vector. There are transcriptomes of a thousand distinct species. Ribonucleic acid sequencing (RNA-seq) is a ubiquitous gene expression strategy that contributes to improved genetic surveys. RNA-seq measures gene expression transcript data, whose analysis requires methodological enhancements to machine learning procedures. Scientists have suggested many learning approaches for the study of biological data. In this analysis, an enhanced optimized Genetic Algorithm feature selection technique is used to obtain relevant information from a high-dimensional Anopheles gambiae dataset, and its classification is tested using SVM-Kernel algorithms. The efficacy of the approach is tested, and the experiment obtained accuracies of 93% and 96%, respectively.
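As an illustration of genetic-algorithm feature selection in general, the following is a minimal sketch and not the authors' implementation. It assumes a binary feature mask as the chromosome, tournament selection, one-point crossover, bit-flip mutation, and elitism. A nearest-centroid classifier stands in for the SVM kernels named in the abstract, and the toy data, the per-feature penalty, and all function names (`holdout_accuracy`, `ga_select`, `score`) are hypothetical.

```python
import random

def holdout_accuracy(train, test, mask):
    """Accuracy of a nearest-centroid classifier restricted to the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    centroids = {}
    for label in sorted({lab for _, lab in train}):
        rows = [x for x, lab in train if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]

    def sq_dist(x, centre):
        return sum((x[j] - m) ** 2 for j, m in zip(cols, centre))

    hits = sum(min(centroids, key=lambda c: sq_dist(x, centroids[c])) == lab
               for x, lab in test)
    return hits / len(test)

def ga_select(score, n_features, pop_size=20, generations=15,
              mutation_rate=0.1, seed=0):
    """Return the best feature mask found by a simple generational GA."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        def pick():
            # Tournament selection of size 2.
            a, b = rng.sample(population, 2)
            return a if score(a) >= score(b) else b

        children = [best[:]]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=score)
    return best

# Toy data (hypothetical): 40 samples, 8 features; only features 0 and 1
# carry class signal, the rest are Gaussian noise.
data_rng = random.Random(42)
labels = [i % 2 for i in range(40)]
samples = [([(3.0 * lab if j < 2 else 0.0) + data_rng.gauss(0, 1)
             for j in range(8)], lab) for lab in labels]
train, test = samples[:20], samples[20:]

def score(mask, penalty=0.01):
    """Hold-out accuracy minus a small cost per selected feature (an assumption)."""
    return holdout_accuracy(train, test, mask) - penalty * sum(mask)

best_mask = ga_select(score, n_features=8)
print(best_mask, round(score(best_mask), 3))
```

Because of elitism, the returned mask never scores worse than the best mask in the initial random population. The penalty term is one simple way to favour smaller feature subsets.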

RNA-Seq Data-Mining Allows the Discovery of Two Long Non-Coding RNA Biomarkers of Viral Infection in Humans

International Journal of Molecular Sciences ◽

10.3390/ijms21082748 ◽

2020 ◽

Vol 21 (8) ◽

pp. 2748 ◽

Cited By ~ 1

Author(s):

Ruth Barral-Arca ◽

Alberto Gómez-Carballa ◽

Miriam Cebey-López ◽

María José Currás-Tuala ◽

Sara Pischedda ◽

...

Keyword(s):

Gene Expression ◽

Viral Infections ◽

Umbilical Vein ◽

Cell Types ◽

Dermal Fibroblasts ◽

Learning Approaches ◽

RNA-Seq ◽

Wide Range ◽

Healthy Control ◽

Umbilical Vein Endothelial Cells

There is a growing interest in unraveling gene expression mechanisms leading to viral host invasion and infection progression. Current findings reveal that long non-coding RNAs (lncRNAs) are implicated in the regulation of the immune system by influencing gene expression through a wide range of mechanisms. By mining whole-transcriptome shotgun sequencing (RNA-seq) data using machine learning approaches, we detected two lncRNAs (ENSG00000254680 and ENSG00000273149) that are downregulated in a wide range of viral infections and different cell types, including blood mononuclear cells, umbilical vein endothelial cells, and dermal fibroblasts. The efficiency of these two lncRNAs was positively validated in different viral phenotypic scenarios. These two lncRNAs showed a strong downregulation in virus-infected patients when compared to healthy control transcriptomes, indicating that these biomarkers are promising targets for infection diagnosis. To the best of our knowledge, this is the very first study using host lncRNA biomarkers for the diagnosis of human viral infections.

An Adaptive Genetic Algorithm with Recursive Feature Elimination Approach for Predicting Malaria Vector Gene Expression Data Classification using Support Vector Machine Kernels

Walailak Journal of Science and Technology (WJST) ◽

10.48048/wjst.2021.9849 ◽

2021 ◽

Vol 18 (17) ◽

Author(s):

Micheal Olaolu AROWOLO ◽

Marion Olubunmi ADEBIYI ◽

Chiebuka Timothy NNODIM ◽

Sulaiman Olaniyi ABDULSALAM ◽

Ayodele Ariyo ADEBIYI

Keyword(s):

Gene Expression ◽

Genetic Algorithm ◽

Support Vector Machine ◽

Feature Selection ◽

Malaria Vector ◽

Recursive Feature Elimination ◽

Support Vector ◽

Adaptive Genetic Algorithm ◽

RNA-Seq ◽

Gene Expressions

As mosquito-borne parasites breed across many parts of sub-Saharan Africa, infected cells go through an unpredictable and erratic life period. Millions of individual parasites have gene expressions. Ribonucleic acid sequencing (RNA-seq) is a popular transcriptional technique that has improved the detection of major genetic probes. RNA-seq analysis generally requires computational improvements in machine learning techniques, since it computes interpretations of gene expression. In this study, an adaptive genetic algorithm (A-GA) combined with recursive feature elimination (RFE), denoted A-GA-RFE, was used as a feature selection algorithm to detect important information in a high-dimensional malaria vector RNA-seq gene expression dataset. Support Vector Machine (SVM) kernels were used as the classification algorithms to evaluate its predictive performance. The feasibility of the approach was confirmed using an RNA-seq dataset from the mosquito Anopheles gambiae. The technique achieved accuracy rates of 98.3% and 96.7%, respectively.

Highlights: dimensionality reduction method based on feature selection; classification using a support vector machine; classification of a malaria vector dataset using an adaptive GA-RFE-SVM.
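The recursive feature elimination step can be sketched as follows. This is a generic illustration under stated assumptions, not the A-GA-RFE pipeline from the paper: a logistic regression fitted by batch gradient descent stands in for the SVM, there is no genetic algorithm stage, and the function names (`fit_logistic`, `rfe`) and the toy matrix are hypothetical. At each round the model is refitted on the surviving features and the feature with the smallest absolute weight is dropped.

```python
import math

def fit_logistic(X, y, cols, lr=0.5, epochs=200):
    """Fit logistic regression on the given columns by batch gradient descent.

    Returns one weight per column in `cols` (the bias is fitted but not returned).
    """
    w = [0.0] * len(cols)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(cols)
        grad_b = 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * row[j] for wj, j in zip(w, cols))
            err = 1.0 / (1.0 + math.exp(-z)) - target  # predicted prob. minus label
            for k, j in enumerate(cols):
                grad_w[k] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w

def rfe(X, y, n_keep):
    """Recursively drop the feature with the smallest |weight| until n_keep remain."""
    cols = list(range(len(X[0])))
    while len(cols) > n_keep:
        w = fit_logistic(X, y, cols)
        weakest = min(range(len(cols)), key=lambda k: abs(w[k]))
        del cols[weakest]
    return cols

# Toy data (hypothetical): feature 0 separates the classes strongly,
# feature 1 is a weaker copy of it, feature 2 is unrelated to the label.
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [[-1.0, -0.3,  0.5], [-1.0, -0.3, -0.5], [-1.0, -0.3,  0.5], [-1.0, -0.3, -0.5],
     [ 1.0,  0.3,  0.5], [ 1.0,  0.3, -0.5], [ 1.0,  0.3,  0.5], [ 1.0,  0.3, -0.5]]

top_two = rfe(X, y, n_keep=2)
top_one = rfe(X, y, n_keep=1)
print(top_two, top_one)
```

Ranking by absolute weight is only meaningful when features are on comparable scales, so real expression data would normally be standardized first.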

Infer related genes from large scale gene expression dataset with embedding

10.1101/362848 ◽

2018 ◽

Author(s):

Chi Tung Choy ◽

Chi Hang Wong ◽

Stephen Lam Chan

Keyword(s):

Gene Expression ◽

Large Scale ◽

Gene List ◽

Ground Truth ◽

Relevant Information ◽

Molecular Data ◽

Biological Data ◽

Gene Expression Dataset ◽

Biologically Relevant ◽

Unsupervised Data Mining

Artificial neural networks (ANNs) have been utilized for classification and prediction tasks with remarkable accuracy. However, their use for unsupervised data mining of molecular data is under-explored. We adopted an unsupervised ANN method, namely word embedding, to extract biologically relevant information from the TCGA gene expression dataset. Ground-truth relationships, such as the cancer types of the input samples and the semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list. This method is feasible for mining large volumes of biological data and would be a valuable tool for discovering novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference.

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

RNA-Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
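The two baseline transformations named in the abstract, a log2 transformation and quantile normalization, can be sketched in a few lines. This is a minimal pure-Python illustration, not the TDM method and not the authors' R package. The function names, the pseudocount of 1, and the toy genes-by-samples matrix are assumptions, and ties are broken by position rather than averaged.

```python
import math

def log2_transform(matrix, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every value (the pseudocount is an assumption)."""
    return [[math.log2(x + pseudocount) for x in row] for row in matrix]

def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples matrix.

    Each value is replaced by the mean, across samples, of the values that
    hold the same rank within their own sample. Ties are broken by position.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all samples, for each rank k.
    rank_means = [sum(col[k] for col in sorted_cols) / n_samples
                  for k in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result

# Toy matrix (hypothetical): 4 genes (rows) x 3 samples (columns).
counts = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(counts)
logged = log2_transform(counts)
print(normalized)
```

After quantile normalization every sample has the same set of values, so their distributions match exactly. Only the assignment of those values to genes differs between samples.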

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

PeerJ ◽

10.7717/peerj.1621 ◽

2016 ◽

Vol 4 ◽

pp. e1621 ◽

Cited By ~ 42

Author(s):

Jeffrey A. Thompson ◽

Jie Tan ◽

Casey S. Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

RNA-Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460v1 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

RNA-Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.

Optimized Hybrid Heuristic Based Dimensionality Reduction Methods for Malaria Vector Using KNN Classifier

10.21203/rs.3.rs-107396/v1 ◽

2020 ◽

Author(s):

Micheal Olaolu Arowolo ◽

Marion Olubunmi Adebiyi ◽

Ayodele Ariyo Adebiyi ◽

Oludayo Olugbara

Keyword(s):

Gene Expression ◽

Dimensionality Reduction ◽

Principal Component ◽

Feature Space ◽

Component Analysis ◽

RNA-Seq ◽

KNN Classifier ◽

Data Dimensionality Reduction ◽

Reduction Methods ◽

Mosquito Anopheles Gambiae

RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. A KNN classifier is applied to the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy and compared with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
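For readers unfamiliar with the classifier, a k-nearest-neighbours (KNN) prediction on already-reduced features can be sketched as below. This is a generic illustration, assuming Euclidean distance and a simple majority vote with k = 3. It is not the configuration used in the study, and the function name, the two-dimensional toy points, and the "low"/"high" labels are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy reduced feature space (hypothetical): two components per sample.
train_X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
           [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
train_y = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_X, train_y, [1.0, 1.0]))  # low
print(knn_predict(train_X, train_y, [4.0, 4.0]))  # high
```

KNN compares raw distances, which is one reason dimensionality reduction is commonly applied before it on high-dimensional expression data.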
