An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features

2017 ◽  
Vol 13 (8) ◽  
pp. 1584-1596 ◽  
Author(s):  
Sutanu Nandi ◽  
Abhishek Subramanian ◽  
Ram Rup Sarkar

We propose an integrated machine learning process to predict gene essentiality in Escherichia coli K-12 MG1655 metabolism that outperforms known methods.

2021 ◽  
Vol 22 (10) ◽  
pp. 5056
Author(s):  
Tulio L. Campos ◽  
Pasi K. Korhonen ◽  
Neil D. Young

Experimental studies of Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular and cellular processes in metazoans at large. Since the publication of their genomes, functional genomic investigations have identified genes that are essential or non-essential for survival in each species. Recently, a range of features linked to gene essentiality have been inferred using a machine learning (ML)-based approach, allowing essentiality predictions within a species. Nevertheless, predictions between species are still elusive. Here, we undertake a comprehensive study using ML to discover and validate features of essential genes common to both C. elegans and D. melanogaster. We demonstrate that the cross-species prediction of gene essentiality is possible using a subset of features linked to nucleotide/protein sequences, protein orthology and subcellular localisation, single-cell RNA-seq, and histone methylation markers. Complementary analyses showed that essential genes are enriched for transcription and translation functions and are preferentially located away from heterochromatin regions of C. elegans and D. melanogaster chromosomes. The present work should enable the cross-prediction of essential genes between model and non-model metazoans.


Author(s):  
Hirotada Mori ◽  
Tomoya Baba ◽  
Katsushi Yokoyama ◽  
Rikiya Takeuchi ◽  
Wataru Nomura ◽  
...  

Microbiology ◽  
2014 ◽  
Vol 160 (11) ◽  
pp. 2341-2351 ◽  
Author(s):  
Mario Juhas ◽  
Daniel R. Reuß ◽  
Bingyao Zhu ◽  
Fabian M. Commichau

Investigation of essential genes, besides contributing to understanding the fundamental principles of life, has numerous practical applications. Essential genes can be exploited as building blocks of a tightly controlled cell ‘chassis’. Bacillus subtilis and Escherichia coli K-12 are both well-characterized model bacteria used as hosts for a plethora of biotechnological applications. Determination of the essential genes that constitute the B. subtilis and E. coli minimal genomes is therefore of the highest importance. Recent advances have led to the modification of the original B. subtilis and E. coli essential gene sets identified 10 years ago. Furthermore, significant progress has been made in the area of genome minimization of both model bacteria. This review provides an update, with particular emphasis on the current essential gene sets and their comparison with the original gene sets identified 10 years ago. Special attention is focused on the genome reduction analyses in B. subtilis and E. coli and the construction of minimal cell factories for industrial applications.


2013 ◽  
Vol 88 (4) ◽  
pp. 233-240 ◽  
Author(s):  
Han Tek Yong ◽  
Natsuko Yamamoto ◽  
Rikiya Takeuchi ◽  
Yi-Ju Hsieh ◽  
Tom M. Conrad ◽  
...  

2017 ◽  
Vol 13 (3) ◽  
pp. 577-584 ◽  
Author(s):  
Yongming Yu ◽  
Licai Yang ◽  
Zhiping Liu ◽  
Chuansheng Zhu

Predicting bacterial essential genes using only fractal features.


2017 ◽  
Author(s):  
Emily C. A. Goodall ◽  
Ashley Robinson ◽  
Iain G. Johnston ◽  
Sara Jabbari ◽  
Keith A. Turner ◽  
...  

ABSTRACT: Transposon-Directed Insertion-site Sequencing (TraDIS) is a high-throughput method coupling transposon mutagenesis with short-fragment DNA sequencing. It is commonly used to identify essential genes. Single gene deletion libraries are considered the gold standard for identifying essential genes. Currently, the TraDIS method has not been benchmarked against such libraries and therefore it remains unclear whether the two methodologies are comparable. To address this, a high density transposon library was constructed in Escherichia coli K-12. Essential genes predicted from sequencing of this library were compared to existing essential gene databases. To decrease false positive identification of essential gene candidates, statistical data analysis included corrections for both gene length and genome length. Through this analysis new essential genes and genes previously incorrectly designated as essential were identified. We show that manual analysis of TraDIS data reveals novel features that would not have been detected by statistical analysis alone. Examples include short essential regions within genes, orientation-dependent effects and fine resolution identification of genome and protein features. Recognition of these insertion profiles in transposon mutagenesis datasets will assist genome annotation of less well characterized genomes and provides new insights into bacterial physiology and biochemistry.

IMPORTANCE: Incentives to define lists of genes that are essential for bacterial survival include the identification of potential targets for antibacterial drug development, genes required for rapid growth for exploitation in biotechnology, and discovery of new biochemical pathways. To identify essential genes in E. coli, we constructed a very high density transposon mutant library. Initial automated analysis of the resulting data revealed many discrepancies when compared to the literature. We now report more extensive statistical analysis supported by both literature searches and detailed inspection of high density TraDIS sequencing data for each putative essential gene for the model laboratory organism, Escherichia coli. This paper is important because it provides a better understanding of the essential genes of E. coli, reveals the limitations of relying on automated analysis alone and provides a new standard for the analysis of TraDIS data.
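
The abstract above mentions corrections for gene length and genome length. One simple way to express such a correction is a normalized insertion index: the insertion density inside a gene divided by the genome-wide insertion density. The sketch below is illustrative only and is not the authors' statistical procedure; the `Gene` class, the `insertion_index` function and the toy coordinates are hypothetical names and values introduced for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record with 1-based, inclusive coordinates."""
    name: str
    start: int
    end: int


def insertion_index(gene: Gene, insertion_sites: Sequence[int], genome_length: int) -> float:
    """Insertion density in the gene relative to the genome-wide density.

    A value near 1 means the gene is hit about as often as the genome
    average; a value near 0 means few insertions are tolerated, which is
    what one would expect for an essential gene candidate.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    total = len(insertion_sites)
    if total == 0:
        raise ValueError("at least one insertion site is required")
    gene_length = gene.end - gene.start + 1
    if gene_length <= 0:
        raise ValueError("gene end must not precede gene start")
    in_gene = sum(1 for pos in insertion_sites if gene.start <= pos <= gene.end)
    # (in_gene / gene_length) / (total / genome_length), rearranged so that
    # only a single floating-point division is performed.
    return (in_gene * genome_length) / (gene_length * total)


# Toy data: 10 insertions, all inside the first 100 bp of a 1000 bp genome.
sites = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
tolerant = Gene("geneA", 1, 100)     # densely hit
candidate = Gene("geneB", 501, 600)  # never hit
index_tolerant = insertion_index(tolerant, sites, 1000)
index_candidate = insertion_index(candidate, sites, 1000)
```

In this toy case the densely hit gene scores well above 1 and the untouched gene scores 0. As the abstract stresses, a low index alone is not conclusive: short essential regions within genes and orientation-dependent effects require inspection of the insertion profile itself.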


2020 ◽  
Author(s):  
Xue Zhang ◽  
Weijia Xiao ◽  
Wangxin Xiao

ABSTRACT: Essential genes are necessary to the survival or reproduction of a living organism. The prediction and analysis of gene essentiality can advance our understanding of basic life and human diseases, and further boost the development of new drugs. Wet lab methods for identifying essential genes are often costly, time consuming, and laborious. As a complement, computational methods have been proposed to predict essential genes by integrating multiple biological data sources. Most of these methods are evaluated on model organisms. However, prediction methods for human essential genes are still limited and the relationship between human gene essentiality and different biological information still needs to be explored. In addition, exploring suitable deep learning techniques to overcome the limitations of traditional machine learning methods and improve the prediction accuracy is also important and interesting. We propose a deep learning based method, DeepSF, to predict human essential genes. DeepSF integrates sequence features derived from DNA and protein sequence data with features extracted or learned from different types of functional data, such as gene ontology, protein complex, protein domain, and protein-protein interaction network. More than 200 features are extracted or learned from these biological data and integrated to train a cost-sensitive deep neural network using multiple deep learning techniques. The experimental results of 10-fold cross validation show that DeepSF can accurately predict human gene essentiality with an average AUC of 95.17%, an area under the precision-recall curve (auPRC) of 92.21%, an accuracy of 91.59%, and an F1 measure of about 78.71%. In addition, the comparison experimental results show that DeepSF significantly outperforms several popular traditional machine learning models (SVM, Random Forest, and Adaboost), and performs slightly better than a recent deep learning model (DeepHE).
We have demonstrated that the proposed method, DeepSF, is effective for predicting human essential genes. Deep learning techniques are promising at both feature learning and classification levels for the task of essential gene prediction.
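
The abstract describes the network as cost-sensitive but gives no details. A common way to make training cost-sensitive when essential genes are the minority class is to weight each example's loss by the inverse frequency of its class. The sketch below shows that idea with a class-weighted log loss in plain Python. It is an assumption about what "cost-sensitive" could look like, not DeepSF's actual loss; `class_weights` and `weighted_log_loss` are hypothetical helper names.

```python
import math
from typing import Dict, Sequence


def class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Inverse-frequency weights, scaled so a balanced dataset gives 1.0 each."""
    n = len(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}


def weighted_log_loss(labels: Sequence[int], probs: Sequence[float],
                      weights: Dict[int, float], eps: float = 1e-12) -> float:
    """Weighted binary cross-entropy, normalized by the sum of the weights."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Toy example: one essential gene (label 1) among four genes.
labels = [1, 0, 0, 0]
weights = class_weights(labels)
uninformative = weighted_log_loss(labels, [0.5, 0.5, 0.5, 0.5], weights)
misses_essential = weighted_log_loss(labels, [0.1, 0.1, 0.1, 0.1], weights)
```

With these weights, a model that assigns low probability to every gene is penalized more heavily than an uninformative one, even though it gets three of four labels right; an unweighted loss would reward it.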


2013 ◽  
Vol 196 (5) ◽  
pp. 982-988 ◽  
Author(s):  
A. Mackie ◽  
S. Paley ◽  
I. M. Keseler ◽  
A. Shearer ◽  
I. T. Paulsen ◽  
...  

PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0242943
Author(s):  
Sutanu Nandi ◽  
Piyali Ganguli ◽  
Ram Rup Sarkar

Essential gene prediction helps to identify the minimal set of genes indispensable for the survival of an organism. Machine learning (ML) algorithms have been useful for the prediction of gene essentiality. However, currently available ML pipelines perform poorly for organisms with limited experimental data. The objective is the development of a new ML pipeline to help in the annotation of essential genes of less explored disease-causing organisms for which minimal experimental data is available. The proposed strategy combines an unsupervised feature selection technique, dimension reduction using the Kamada-Kawai algorithm, and a semi-supervised ML algorithm employing the Laplacian Support Vector Machine (LapSVM) for prediction of essential and non-essential genes from genome-scale metabolic networks using a very limited labeled dataset. A novel scoring technique, the Semi-Supervised Model Selection Score, equivalent to area under the ROC curve (auROC), has been proposed for the selection of the best model when supervised performance metrics calculation is difficult due to lack of data. The unsupervised feature selection followed by dimension reduction helped to observe a distinct circular pattern in the clustering of essential and non-essential genes. LapSVM then created a curve that dissected this circle for the classification and prediction of essential genes with high accuracy (auROC > 0.85) even with 1% labeled data for model training. After successful validation of this ML pipeline on both Eukaryotes and Prokaryotes, which showed high accuracy even when the labeled dataset is very limited, this strategy is used for the prediction of essential genes of organisms with inadequate experimentally known data, such as Leishmania sp. Using a graph-based semi-supervised machine learning scheme, a novel integrative approach has been proposed for essential gene prediction that shows universality in application to both Prokaryotes and Eukaryotes with limited labeled data.
The essential genes predicted using the pipeline provide an important lead for the prediction of gene essentiality and identification of novel therapeutic targets for antibiotic and vaccine development against disease-causing parasites.
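
The abstract reports performance as auROC. For readers unfamiliar with the metric, it can be computed as the fraction of (essential, non-essential) gene pairs in which the essential gene receives the higher score, with ties counting as half. The sketch below implements that standard pairwise definition in plain Python; it does not reproduce the paper's Semi-Supervised Model Selection Score, and the `auroc` function and toy scores are hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for essential, 0 for non-essential.
    scores: classifier outputs, higher meaning "more likely essential".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # essential gene ranked above non-essential
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example: two essential and two non-essential genes.
labels = [1, 1, 0, 0]
partly_separated = auroc(labels, [0.9, 0.4, 0.5, 0.1])  # one of four pairs misordered
fully_separated = auroc(labels, [0.9, 0.8, 0.2, 0.1])   # every pair ordered correctly
```

The pairwise loop is quadratic in the number of genes, which is fine for illustration; a rank-based computation would be preferable for genome-scale data.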

