Prediction of condition-specific regulatory genes using machine learning

2020 ◽  
Vol 48 (11) ◽  
pp. e62-e62 ◽  
Author(s):  
Qi Song ◽  
Jiyoung Lee ◽  
Shamima Akter ◽  
Matthew Rogers ◽  
Ruth Grene ◽  
...  

Abstract Recent advances in genomic technologies have generated data on large-scale protein–DNA interactions and open chromatin regions for many eukaryotic species. How to identify condition-specific functions of transcription factors using these data has become a major challenge in genomic research. To solve this problem, we have developed a method called ConSReg, which provides a novel approach to integrate regulatory genomic data into predictive machine learning models of key regulatory genes. Using Arabidopsis as a model system, we tested our approach to identify regulatory genes in data sets from single cell gene expression and from abiotic stress treatments. Our results showed that ConSReg accurately predicted transcription factors that regulate differentially expressed genes with an average auROC of 0.84, which is 23.5–25% better than enrichment-based approaches. To further validate the performance of ConSReg, we analyzed an independent data set related to plant nitrogen responses. ConSReg provided better rankings of the correct transcription factors in 61.7% of cases, which is three times better than other plant tools. We applied ConSReg to Arabidopsis single cell RNA-seq data, successfully identifying candidate regulatory genes that control cell wall formation. Our methods provide a new approach to define candidate regulatory genes using integrated genomic data in plants.
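The auROC figure quoted above summarizes how well a ranking of candidate transcription factors separates true regulators from non-regulators. As a minimal sketch of how such a score can be computed (not ConSReg's own code; the function name `auroc` and the toy scores are illustrative assumptions), the rank-sum identity gives the area under the ROC curve directly:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    scores: predicted scores, higher means "more likely positive".
    labels: 1 for positive examples, 0 for negative examples.
    Hypothetical helper for illustration; not taken from ConSReg.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Walk through groups of tied scores and give each group its average
    # 1-based rank, accumulating the ranks that belong to positives.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        n_pos_in_group = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * n_pos_in_group
        i = j

    # U statistic for the positive class, normalized by the number of
    # positive-negative pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy example: two true regulators (label 1) and two non-regulators.
    print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

A value of 0.5 corresponds to a random ranking and 1.0 to a perfect one, which is the scale on which the reported average of 0.84 should be read.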

Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 332
Author(s):  
Ernest Kwame Ampomah ◽  
Zhiguang Qin ◽  
Gabriel Nyame

Forecasting the direction and trend of stock prices is an important task that helps investors make prudent financial decisions in the stock market. Investment in the stock market carries substantial risk, and minimizing prediction error reduces that risk. Machine learning (ML) models typically perform better than statistical and econometric models, and ensemble ML models have been shown in the literature to outperform single ML models. In this work, we compare the effectiveness of tree-based ensemble ML models (Random Forest (RF), XGBoost Classifier (XG), Bagging Classifier (BC), AdaBoost Classifier (Ada), Extra Trees Classifier (ET), and Voting Classifier (VC)) in forecasting the direction of stock price movement. Eight stock data sets from three stock exchanges (NYSE, NASDAQ, and NSE) are randomly collected and used for the study. Each data set is split into a training set and a test set. Ten-fold cross-validation accuracy is used to evaluate the ML models on the training set. In addition, the ML models are evaluated on the test set using accuracy, precision, recall, F1-score, specificity, and area under the receiver operating characteristic curve (AUC-ROC). Kendall's W test of concordance is used to rank the performance of the tree-based ML algorithms. For the training set, the AdaBoost model performed better than the rest of the models. For the test set, the accuracy, precision, F1-score, and AUC metrics yielded results significant enough to rank the models, and the Extra Trees classifier outperformed the other models in all the rankings.
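Most of the test-set metrics listed above follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions, assuming labels are coded 1 for an upward price move and 0 otherwise; the function name `classification_metrics` and the toy labels are illustrative and do not come from the study.

```python
def classification_metrics(y_true, y_pred):
    """Compute standard binary classification metrics.

    y_true, y_pred: sequences of 0/1 labels of equal length.
    Hypothetical helper for illustration; the label coding is an assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)          # also called sensitivity
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    for name, value in classification_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.2f}")
```

AUC-ROC is not included here because it depends on predicted scores rather than hard 0/1 predictions.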


2016 ◽  
Author(s):  
David Felix Lamparter ◽  
Daniel Marbach ◽  
Rico Rueedi ◽  
Sven Bergmann ◽  
Zoltan Kutalik

To better understand genome regulation, it is important to uncover the role of transcription factors in establishing and maintaining chromatin structure. Here we present a data-driven approach to systematically characterize transcription factors that are relevant for this process. Our method uses a linear mixed model to combine data sets of transcription factor binding motif enrichments in open chromatin with gene expression across the same set of cell lines. Applying this approach to the ENCODE data set, we confirm known transcription factors and implicate numerous novel ones in the establishment or maintenance of open chromatin.


2019 ◽  
Author(s):  
Merce Montoliu-Nerin ◽  
Marisol Sánchez-García ◽  
Claudia Bergin ◽  
Manfred Grabherr ◽  
Barbara Ellis ◽  
...  

Summary A large proportion of Earth's biodiversity consists of organisms that cannot be cultured, have cryptic life-cycles and/or live submerged within their substrates [1–4]. Genomic data are key to unraveling both their identity and function [5]. The development of metagenomic methods [6,7] and the advent of single cell sequencing [8–10] have revolutionized the study of life and function of cryptic organisms by removing the need for large and pure biological material and allowing generation of genomic data from complex or limited environmental samples. Genome assemblies from metagenomic data have so far been restricted to organisms with small genomes, such as bacteria [11], archaea [12] and certain eukaryotes [13]. On the other hand, single cell technologies have allowed the targeting of unicellular organisms, attaining a better resolution than metagenomics [8,9,14–16]; moreover, they have allowed the genomic study of cells from complex organisms one cell at a time [17,18]. However, single cell genomics is not easily applied to multicellular organisms formed by consortia of diverse taxa, and the generation of specific workflows for sequencing and data analysis is needed to expand genomic research to the entire tree of life, including sponges [19], lichens [3,20], intracellular parasites [21,22], and plant endophytes [23,24]. Among the most important plant endophytes are the obligate mutualistic symbionts, arbuscular mycorrhizal (AM) fungi, which pose an additional challenge with their multinucleate coenocytic mycelia [25]. Here, the development of a novel single nuclei sequencing and assembly workflow is reported. This workflow allows, for the first time, the generation of reference genome assemblies from large-scale, unbiasedly sorted and sequenced AM fungal nuclei, circumventing tedious, and often impossible, culturing efforts.
This method opens infinite possibilities for studies of evolution and adaptation in these important plant symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.


2019 ◽  
Author(s):  
Nita Vangeepuram ◽  
Bian Liu ◽  
Po-hsiang Chiu ◽  
Linhua Wang ◽  
Gaurav Pandey

Abstract Type 2 diabetes has become alarmingly prevalent among youth in recent years. However, simple questionnaire-based screening tools to reliably identify diabetes risk and prevent the adverse effects of this serious disease are only available for adults, not for youth. As a first step in developing such a tool, we used a large-scale dataset from the National Health and Nutrition Examination Survey (NHANES) to examine the performance of a well-known adult diabetes risk self-assessment screener and published pediatric clinical screening guidelines in identifying youth with pre-diabetes/diabetes (pre-DM/DM) based on American Diabetes Association diagnostic biomarkers. We assessed the agreement between the adult screener/pediatric screening guidelines and biomarker diagnostic criteria by conducting comparisons using the overall data set and sub-datasets stratified by sex, race/ethnicity, and age. While the pediatric guidelines performed better than the adult screener in identifying youth with pre-DM/DM (sensitivity 43.1% vs 7.2%), both are inadequate for general deployment among youth. There were also notable differences in the performance of the pediatric guidelines across subgroups based on age, sex and race/ethnicity. In an effort to improve pre-DM/DM screening, we also evaluated data-driven machine learning-based classification algorithms, several of which performed slightly but statistically significantly better than the pediatric screening guidelines.


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 377-378
Author(s):  
Ghader Manafiazar ◽  
Mohammad Riazi ◽  
John A Basarab ◽  
Changxi Li ◽  
Paul Stothard ◽  
...  

Abstract The objective of this study was to explore the potential of Machine Learning (ML) algorithms to predict residual feed intake (RFI) classification group (high or low RFI) and individual RFI using performance records and genomic information. A total of 4145 animals from research and commercial herds with RFI performance records were included in the study, of which 3899 had genomic information (genotyped using the Illumina Bovine 50k SNP BeadChip). Different libraries based on R and Python, including Lazy Predict, Scikit-learn, PyCaret, and H2O Flow, were used to test various ML models. Genomic information was subjected to quality control by removing SNPs with an allele frequency less than 0.05 or with a call rate lower than 0.95. A total of 42,689 SNPs remained for further analysis and accounted for 34% of phenotypic variation (heritability of 0.34±0.07) in RFI. Different numbers of SNPs were selected based on their contribution to phenotypic variation (500, 1K, 5K, 10K, and 15K SNPs) and then included in the ML models. The GLM Stacked Ensemble model with 15K SNPs performed better than the other models in predicting RFI classification group (R2 = 0.54). Regardless of the number of SNPs included in the model, GLM Stacked Ensemble performed better than the other models in predicting individual RFI. This model's performance improved with increasing numbers of SNPs (MAE = 0.39 for 500 SNPs; 0.31 for 15K SNPs). In the test data set, increasing the number of SNPs did not change the performance of the model, which had an MAE of 0.39. The results demonstrate the potential for ML to improve predictions of feed efficiency compared with genomic analysis in beef cattle without measuring feed intake.
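The SNP quality-control step described above can be sketched as a simple per-SNP filter. This is an illustration under stated assumptions, not the authors' pipeline: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with `None` for a missing call, "allele frequency" is interpreted as minor allele frequency, and the names `snp_passes_qc` and `filter_snps` are hypothetical.

```python
def snp_passes_qc(genotypes, min_maf=0.05, min_call_rate=0.95):
    """Return True if a SNP meets the allele-frequency and call-rate thresholds.

    genotypes: per-animal allele counts (0, 1 or 2), None for a missing call.
    The coding and the minor-allele interpretation are assumptions.
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or not called:
        return False

    # Fraction of animals with a non-missing genotype for this SNP.
    call_rate = len(called) / len(genotypes)

    # Frequency of the counted allele among called genotypes (2 alleles each),
    # folded so that it always refers to the rarer allele.
    allele_freq = sum(called) / (2 * len(called))
    maf = min(allele_freq, 1.0 - allele_freq)

    return call_rate >= min_call_rate and maf >= min_maf


def filter_snps(genotypes_by_snp, min_maf=0.05, min_call_rate=0.95):
    """Keep the names of SNPs that pass quality control, in input order."""
    return [
        name
        for name, genotypes in genotypes_by_snp.items()
        if snp_passes_qc(genotypes, min_maf, min_call_rate)
    ]


if __name__ == "__main__":
    # Toy data for 20 animals; SNP names are made up for the example.
    toy = {
        "snp_common": [0, 1, 2, 1, 0, 1, 2, 1, 0, 1] * 2,   # MAF 0.45, all called
        "snp_rare": [0] * 19 + [1],                          # MAF 0.025
        "snp_missing": [1] * 18 + [None] * 2,                # call rate 0.90
    }
    print(filter_snps(toy))
```

In practice this kind of filtering is usually done with dedicated genotype tools; the sketch only makes the two thresholds from the abstract concrete.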


2015 ◽  
Vol 82 (4) ◽  
pp. 992-1003 ◽  
Author(s):  
Eric D. Becraft ◽  
Jeremy A. Dodsworth ◽  
Senthil K. Murugapiran ◽  
J. Ingemar Ohlsson ◽  
Brandon R. Briggs ◽  
...  

ABSTRACT The vast majority of microbial life remains uncatalogued due to the inability to cultivate these organisms in the laboratory. This “microbial dark matter” represents a substantial portion of the tree of life and of the populations that contribute to chemical cycling in many ecosystems. In this work, we leveraged an existing single-cell genomic data set representing the candidate bacterial phylum “Calescamantes” (EM19) to calibrate machine learning algorithms and define metagenomic bins directly from pyrosequencing reads derived from Great Boiling Spring in the U.S. Great Basin. Compared to other assembly-based methods, taxonomic binning with a read-based machine learning approach yielded final assemblies with the highest predicted genome completeness of any method tested. Read-first binning subsequently was used to extract Calescamantes bins from all metagenomes with abundant Calescamantes populations, including metagenomes from Octopus Spring and Bison Pool in Yellowstone National Park and Gongxiaoshe Spring in Yunnan Province, China. Metabolic reconstruction suggests that Calescamantes are heterotrophic, facultative anaerobes, which can utilize oxidized nitrogen sources as terminal electron acceptors for respiration in the absence of oxygen and use proteins as their primary carbon source. Despite their phylogenetic divergence, the geographically separate Calescamantes populations were highly similar in their predicted metabolic capabilities and core gene content, respiring O2 or oxidized nitrogen species for energy conservation in distant but chemically similar hot springs.


2003 ◽  
Vol 804 ◽  
Author(s):  
Gregory A. Landrum ◽  
Julie Penzotti ◽  
Santosh Putta

ABSTRACT Standard machine-learning algorithms were used to build models capable of predicting the molecular weights of polymers generated by a homogeneous catalyst. Using descriptors calculated from only the two-dimensional structures of the ligands, the average accuracy of the models on an external validation data set was approximately 70%. Because the models show no bias and perform significantly better than equivalent models built using randomized data, we conclude that they learned useful rules and did not overfit the data.


2019 ◽  
Author(s):  
Steffen Albrecht ◽  
Tommaso Andreani ◽  
Miguel A. Andrade-Navarro ◽  
Jean-Fred Fontaine

Abstract Single-cell ChIP-seq analysis is challenging due to data sparsity. We present SIMPA (https://github.com/salbrec/SIMPA), a single-cell ChIP-seq data imputation method leveraging predictive information within bulk ENCODE data to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Machine learning models trained for each single cell, each target, and each genomic region enable drastic improvement in cell type clustering and gene identification.


Author(s):  
Ann Rose Bright ◽  
Siebe van Genesen ◽  
Qingqing Li ◽  
Simon J. van Heeringen ◽  
Alexia Grasso ◽  
...  

ABSTRACT During gastrulation, mesoderm is induced in pluripotent cells, concomitant with dorsal-ventral patterning and establishment of the dorsal axis. How transcription factors operate within the constraints of chromatin accessibility to mediate these processes is not well understood. We applied chromatin accessibility and single cell transcriptome analyses to explore the emergence of heterogeneity and underlying gene-regulatory mechanisms during early gastrulation in Xenopus. ATAC-sequencing of pluripotent animal cap cells revealed a state of open chromatin at transcriptionally inactive lineage-restricted genes, whereas chromatin accessibility in dorsal marginal zone cells more closely reflected the transcriptional activity of genes. We characterized single cell trajectories in animal cap and dorsal marginal zone in early gastrula embryos, and inferred the activity of transcription factors in single cell clusters by integrating chromatin accessibility and single cell RNA-sequencing. We tested the activity of organizer-expressed transcription factors in mesoderm-competent animal cap cells and found combinatorial effects of these factors on organizer gene expression. In particular, the combination of Foxb1 and Eomes induced a gene expression profile that mimicked those observed in head and trunk organizer single cell clusters. In addition, genes induced by Eomes, Otx2 or the Irx3-Otx2 combination were enriched for promoters with maternally regulated H3K4me3 modifications, whereas promoters selectively induced by Lhx8 were marked more frequently by zygotically controlled H3K4me3. Our results show that combinatorial activity of zygotically expressed transcription factors acts on maternally regulated accessible chromatin to induce organizer gene expression.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
K. C. Kishan ◽  
Sridevi K. Subramanya ◽  
Rui Li ◽  
Feng Cui

Abstract Background: Most transcription factors (TFs) compete with nucleosomes to gain access to their cognate binding sites. Recent studies have identified several TF-nucleosome interaction modes including end binding (EB), oriented binding, periodic binding, dyad binding, groove binding, and gyre spanning. However, there are substantial experimental challenges in measuring nucleosome binding modes for thousands of TFs in different species. Results: We present a computational prediction of the binding modes based on TF protein sequences. With a nested cross-validation procedure, our model outperforms several fine-tuned off-the-shelf machine learning (ML) methods in the multi-label classification task. Our binary classifier for the EB mode performs better than these ML methods, with the area under the precision-recall curve reaching 75%. The end preference of most TFs is consistent with low nucleosome occupancy around their binding sites in GM12878 cells. The nucleosome occupancy data are used as an alternative dataset to confirm the superiority of our EB classifier. Conclusions: We develop the first ML-based approach for efficient and comprehensive analysis of nucleosome binding modes of TFs.

