An ICA-ensemble learning approaches for prediction of RNA-seq malaria vector gene expression data classification

Author(s):  
Micheal Olaolu Arowolo ◽  
Marion O. Adebiyi ◽  
Ayodele A. Adebiyi ◽  
Charity Aremu

Malaria parasites undergo marked life-stage variations as they develop across the multiple environments of the mosquito vector. Transcriptomes exist for several thousand different parasites. Ribonucleic acid sequencing (RNA-seq) is a prevalent gene expression tool that leads to a better understanding of genetic questions. RNA-seq measures the transcription of expressed genes. RNA-seq data call for procedural enhancements in machine learning techniques. Researchers have suggested various learning approaches for the study of biological data. This study applies the independent component analysis (ICA) feature extraction algorithm to recover latent components from a high-dimensional RNA-seq vector dataset and estimates its classification performance; an ensemble classification algorithm is used to carry out the experiment. The study is tested on an RNA-seq dataset of the mosquito Anopheles gambiae. The experiment achieved a classification accuracy of 93.3%.
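The idea behind ICA feature extraction can be illustrated on a toy problem. The sketch below is not the pipeline used in the study: it assumes only two observed signals, a hypothetical mixing of two uniform sources, and a simple search for the rotation of the whitened data that maximizes non-Gaussianity (squared excess kurtosis). All function names are illustrative.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def correlation(a, b):
    """Pearson correlation between two equally long signals."""
    ma, mb = mean(a), mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den


def whiten(x, y):
    """Center two signals, decorrelate them, and scale each to unit variance."""
    n = len(x)
    mx, my = mean(x), mean(y)
    x = [a - mx for a in x]
    y = [b - my for b in y]
    cxx = sum(a * a for a in x) / n
    cyy = sum(b * b for b in y) / n
    cxy = sum(a * b for a, b in zip(x, y)) / n
    # Closed-form rotation that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    u = [c * a + s * b for a, b in zip(x, y)]
    v = [-s * a + c * b for a, b in zip(x, y)]
    su = math.sqrt(sum(a * a for a in u) / n)
    sv = math.sqrt(sum(b * b for b in v) / n)
    return [a / su for a in u], [b / sv for b in v]


def excess_kurtosis(v):
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return sum(a ** 4 for a in v) / len(v) - 3.0


def ica_2d(x, y, steps=180):
    """Scan rotations of the whitened data and keep the least Gaussian one."""
    u, v = whiten(x, y)
    best = None
    for k in range(steps):
        t = (math.pi / 2) * k / steps
        c, s = math.cos(t), math.sin(t)
        p = [c * a + s * b for a, b in zip(u, v)]
        q = [-s * a + c * b for a, b in zip(u, v)]
        score = excess_kurtosis(p) ** 2 + excess_kurtosis(q) ** 2
        if best is None or score > best[0]:
            best = (score, p, q)
    return best[1], best[2]


# Hypothetical data: two independent uniform sources, linearly mixed.
rng = random.Random(0)
s1 = [rng.uniform(-1, 1) for _ in range(2000)]
s2 = [rng.uniform(-1, 1) for _ in range(2000)]
x = [0.8 * a + 0.3 * b for a, b in zip(s1, s2)]
y = [0.2 * a + 0.7 * b for a, b in zip(s1, s2)]
comp_a, comp_b = ica_2d(x, y)
```

Each recovered component should track one of the original sources up to sign and ordering, which `correlation(comp_a, s1)` and its siblings can confirm. On real RNA-seq data with thousands of genes, an established ICA implementation would be the practical choice.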

Author(s):  
Micheal Olaolu Arowolo ◽  
Marion O. Adebiyi ◽  
Ayodele A. Adebiyi ◽  
Olatunji J. Okesola

Malaria parasites show uncertain, inconsistent life-stage development as they breed through the environments of their mosquito vectors. Thousands of different parasite transcriptomes exist. Ribonucleic acid sequencing (RNA-seq), a prevalent technique for gene expression, has improved the identification of genetic questions. Computing on RNA-seq gene expression transcripts requires enhancement with analytical machine learning procedures. Numerous learning approaches have been adopted for analyzing biological data and enhancing performance. In this study, a genetic algorithm dimensionality reduction technique is proposed to fetch relevant information from a high-dimensional RNA-seq dataset, and classification uses ensemble classification algorithms. The experiment is performed on a mosquito Anopheles gambiae dataset, with classification accuracies of 81.7% and 88.3%.
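A genetic algorithm for dimensionality reduction can be sketched as a search over feature masks. The version below is a minimal illustration under stated assumptions, not the authors' algorithm: the fitness function (nearest-centroid training accuracy minus a small per-feature penalty), the population settings, and the synthetic data are all hypothetical.

```python
import random


def fitness(mask, rows, labels):
    """Nearest-centroid training accuracy on the kept features, minus a size penalty."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    cents = {}
    for c in set(labels):
        members = [r for r, l in zip(rows, labels) if l == c]
        cents[c] = [sum(r[j] for r in members) / len(members) for j in cols]
    hits = 0
    for r, l in zip(rows, labels):
        pred = min(
            cents,
            key=lambda c: sum((r[j] - m) ** 2 for j, m in zip(cols, cents[c])),
        )
        hits += pred == l
    return hits / len(rows) - 0.01 * len(cols)


def genetic_select(rows, labels, pop_size=20, generations=15, p_mut=0.1, seed=0):
    """Evolve boolean feature masks; return the best mask and its fitness history."""
    rng = random.Random(seed)
    n = len(rows[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, rows, labels), reverse=True)
        history.append(fitness(pop[0], rows, labels))
        elite = pop[: pop_size // 2]  # elitism: the top half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = elite + children
    best = max(pop, key=lambda m: fitness(m, rows, labels))
    history.append(fitness(best, rows, labels))
    return best, history


# Hypothetical data: feature 0 carries the class signal, the other five are noise.
rng = random.Random(1)
labels = [i % 2 for i in range(40)]
rows = [
    [3.0 * l + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(5)]
    for l in labels
]
best_mask, history = genetic_select(rows, labels)
```

Because the top half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. Scoring masks on training accuracy is a shortcut for brevity; a held-out or cross-validated score would be a safer fitness on real data.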


2020 ◽  
Vol 21 (8) ◽  
pp. 2748 ◽  
Author(s):  
Ruth Barral-Arca ◽  
Alberto Gómez-Carballa ◽  
Miriam Cebey-López ◽  
María José Currás-Tuala ◽  
Sara Pischedda ◽  
...  

There is a growing interest in unraveling gene expression mechanisms leading to viral host invasion and infection progression. Current findings reveal that long non-coding RNAs (lncRNAs) are implicated in the regulation of the immune system by influencing gene expression through a wide range of mechanisms. By mining whole-transcriptome shotgun sequencing (RNA-seq) data using machine learning approaches, we detected two lncRNAs (ENSG00000254680 and ENSG00000273149) that are downregulated in a wide range of viral infections and different cell types, including blood mononuclear cells, umbilical vein endothelial cells, and dermal fibroblasts. The efficiency of these two lncRNAs was positively validated in different viral phenotypic scenarios. These two lncRNAs showed a strong downregulation in virus-infected patients when compared to healthy control transcriptomes, indicating that these biomarkers are promising targets for infection diagnosis. To the best of our knowledge, this is the very first study using host lncRNA biomarkers for the diagnosis of human viral infections.


2020 ◽  
Author(s):  
Aristidis G. Vrahatis ◽  
Sotiris Tasoulis ◽  
Spiros Georgakopoulos ◽  
Vassilis Plagianakos

Biomedical data are now generated at an exponential rate, creating datasets of ultra-high dimensionality and complexity. This revolution, caused by recent advances in biotechnology, has led to big-data and data-driven computational approaches. An indicative example is the emerging single-cell RNA-sequencing (scRNA-seq) technology, which isolates and measures individual cells. Although scRNA-seq has revolutionized the biotechnology domain, the computational analysis of such data is a major challenge because of their ultra-high dimensionality and complexity. Following this direction, in this work we study the properties, effectiveness and generalization of the recently proposed MRPV algorithm for single-cell RNA-seq data. MRPV is an ensemble classification technique utilizing multiple ultra-low-dimensional random projected spaces. A given classifier determines the class of each sample in every independent space, while a majority voting scheme defines the predominant class. We show that random projection ensembles offer a platform not only for analysis with low computational time but also for enhancing classification performance. The developed methodologies were applied to four real high-dimensional biomedical datasets from single-cell RNA-seq studies and compared against well-known and similar classification tools. Experimental results showed that, based on simple tools, we can create a computationally fast, simple, yet effective approach for single-cell RNA-seq data with ultra-high dimensionality.
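The scheme described above, one classifier per random low-dimensional space followed by a majority vote, can be sketched as follows. This is an illustration of the general idea, not the MRPV implementation: the nearest-centroid base classifier, the Gaussian projection matrices, the parameter defaults, and the synthetic data are assumptions.

```python
import random
from collections import Counter


def project(row, matrix):
    """Map a high-dimensional row into a low-dimensional random space."""
    return [sum(w * x for w, x in zip(weights, row)) for weights in matrix]


def fit_centroids(rows, labels):
    """Base classifier (assumed here): one mean vector per class."""
    cents = {}
    for c in set(labels):
        members = [r for r, l in zip(rows, labels) if l == c]
        cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return cents


def nearest(cents, row):
    """Return the class whose centroid is closest to the row."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, cents[c])))


def majority_vote(votes):
    """Return the most frequent label among the votes."""
    return Counter(votes).most_common(1)[0][0]


def rp_ensemble_predict(train, labels, test, n_spaces=15, dim=3, seed=0):
    """Classify each test row in several random spaces, then vote."""
    rng = random.Random(seed)
    n_features = len(train[0])
    votes = [[] for _ in test]
    for _ in range(n_spaces):
        matrix = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(dim)]
        cents = fit_centroids([project(r, matrix) for r in train], labels)
        for i, row in enumerate(test):
            votes[i].append(nearest(cents, project(row, matrix)))
    return [majority_vote(v) for v in votes]


# Hypothetical data: 200 features, two classes separated by a mean shift.
data_rng = random.Random(42)


def sample(label):
    return [4.0 * label + data_rng.gauss(0, 1) for _ in range(200)]


train_y = [i % 2 for i in range(40)]
train_x = [sample(l) for l in train_y]
test_y = [0, 1, 0, 1]
test_x = [sample(l) for l in test_y]
pred = rp_ensemble_predict(train_x, train_y, test_x)
```

An odd number of spaces avoids tied votes in the two-class case. Each projection reduces 200 features to 3, so the base classifier stays cheap even when the original dimensionality is very high.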


BMC Genomics ◽  
2011 ◽  
Vol 12 (1) ◽  
Author(s):  
Mariangela Bonizzoni ◽  
W Augustine Dunn ◽  
Corey L Campbell ◽  
Ken E Olson ◽  
Michelle T Dimon ◽  
...  

2015 ◽  
Author(s):  
Jeffrey A Thompson ◽  
Jie Tan ◽  
Casey S Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
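Two of the baseline transformations named above, the log2 transformation and quantile normalization, are simple enough to sketch directly. TDM itself is not reproduced here; the authors' R package is the reference for it. The `+ 1` pseudocount and the index-order tie handling are assumptions made for this illustration.

```python
import math


def log2_transform(matrix):
    """Apply log2(x + 1) to every value; the +1 (assumed) keeps zero counts finite."""
    return [[math.log2(v + 1) for v in row] for row in matrix]


def quantile_normalize(samples):
    """Force every sample (a list of per-gene values) onto one shared distribution."""
    n_genes = len(samples[0])
    ordered = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(o[k] for o in ordered) / len(samples) for k in range(n_genes)]
    result = []
    for s in samples:
        # Rank genes within the sample; ties are broken by gene index.
        order = sorted(range(n_genes), key=lambda g: s[g])
        normalized = [0.0] * n_genes
        for rank, gene in enumerate(order):
            normalized[gene] = reference[rank]
        result.append(normalized)
    return result


# Hypothetical expression values: three samples, four genes each.
samples = [[5, 2, 3, 4], [4, 1, 3, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(samples)
logged = log2_transform([[0, 1, 3]])
```

After quantile normalization every sample contains the same set of values, arranged according to each gene's rank within that sample, which is what lets a model trained on one platform see comparably distributed input from another.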


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e1621 ◽  
Author(s):  
Jeffrey A. Thompson ◽  
Jie Tan ◽  
Casey S. Greene

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


2020 ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimensionality of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. A KNN classifier is applied to the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm fetches relevant features from the high-dimensional input feature space, and a fast feature-ranking algorithm selects among them. The model's performance is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
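The final classification step, KNN on an already reduced feature set, can be sketched in a few lines. The two-dimensional feature values and the class labels "A" and "B" below are hypothetical placeholders for whatever the reduction stage produces; the choice of k = 3 and squared Euclidean distance is likewise an assumption.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training rows."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reduced features (two components per sample) and class labels.
train_x = [[0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [2.0, 2.1], [2.2, 1.9], [1.8, 2.3]]
train_y = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_x, train_y, [0.2, 0.2])
near_b = knn_predict(train_x, train_y, [2.0, 2.0])
```

Distance-based classifiers such as KNN tend to benefit from dimensionality reduction, since distances become less informative as the number of irrelevant features grows.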


2021 ◽  
Vol 50 (9) ◽  
pp. 2579-2589
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi

RNA-Seq data are used in biological applications and in decision making for the classification of genes. Much recent work has focused on reducing the dimensionality of RNA-Seq data, and dimensionality reduction approaches have been proposed for fetching relevant information from a given dataset. In this study, a novel optimized dimensionality reduction algorithm is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. A decision tree classifier is applied to the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm fetches relevant features from the high-dimensional input feature space, using feature ranking and earlier experience. The model's performance is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.


Diabetes is considered one of the most serious chronic diseases; it has a severe impact on human health and is a leading cause of mortality worldwide. Early prediction of diabetes can help clinicians provide a better diagnosis to patients. Recently, computer-aided diagnosis systems have gained attention due to significant growth in data mining and machine learning. Several approaches based on machine learning techniques exist, but poor classification performance and computational complexity make them difficult to use in real-time applications. Ensemble classification approaches have reported a noteworthy improvement in diabetes classification, but reaching the desired accuracy remains a challenge. Hence, in this work we introduce a combined hybrid approach called ENN, an ensemble-based neural network approach for diabetes classification. In this approach, features are first selected using a neighboring search technique; the selected features are then processed through a feature ranking model to generate an efficient feature subset for better classification accuracy. Finally, these features are learned and classified using a neural network classifier. The experimental study shows that the proposed approach achieves better accuracy than existing techniques.

