Stochastic semi-supervised learning to prioritise genes from high-throughput genomic screens

2019 ◽  
Author(s):  
Dimitrios Vitsios ◽  
Slavé Petrovski

Access to large-scale genomics datasets has increased the utility of hypothesis-free genome-wide analyses that result in candidate lists of genes. Often these analyses highlight several gene signals that might contribute to pathogenesis but are insufficiently powered to reach experiment-wide significance. This often triggers a process of laborious evaluation of highly-ranked genes through manual inspection of various public knowledge resources to triage those considered sufficiently interesting for deeper investigation. Here, we introduce a novel multi-dimensional, multi-step machine learning framework to objectively and more holistically assess biological relevance of genes to disease studies, by relying on a plethora of gene-associated annotations. We developed mantis-ml to serve as an automated machine learning (AutoML) framework, following a stochastic semi-supervised learning approach to rank known and novel disease-associated genes through iterative training and prediction sessions of random balanced datasets across the protein-coding exome (n=18,626 genes). We applied this framework to a range of disease-specific areas and as a generic disease likelihood estimator, achieving an average Area Under Curve (AUC) prediction performance of 0.85. Critically, to demonstrate applied utility on exome-wide association studies, we overlapped mantis-ml disease-specific predictions with data from published cohort-level association studies. We retrieved statistically significant enrichment of high mantis-ml predictions among the top-ranked genes from hypothesis-free cohort-level statistics (p<0.05), suggesting the capture of true prioritisation signals. We believe that mantis-ml is a novel, easy-to-use tool to support objectively triaging gene discovery and overall enhancing our understanding of complex genotype-phenotype associations.
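The "iterative training and prediction sessions of random balanced datasets" described above can be illustrated with a minimal sketch. This is not the mantis-ml implementation: the nearest-centroid scorer, the two-fold split, the function name `stochastic_pu_scores`, and the toy gene names and features are all assumptions chosen to keep the example dependency-free. In each iteration, the known disease genes are paired with an equally sized random draw from the unlabeled genes, a simple model is trained on one half, and the held-out half is scored; per-gene scores are then averaged across iterations.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stochastic_pu_scores(features, known_positives, n_iter=200, seed=0):
    """Average held-out scores over many random balanced datasets.

    features: dict mapping gene name -> list of numeric features.
    known_positives: genes labeled as disease-associated; all other
    genes are treated as unlabeled (hypothetical toy setup).
    """
    rng = random.Random(seed)
    positives = sorted(known_positives)
    pos_set = set(positives)
    unlabeled = sorted(g for g in features if g not in pos_set)
    per_gene = {g: [] for g in features}
    for _ in range(n_iter):
        # Balanced dataset: all positives plus an equal-sized random draw
        # of unlabeled genes that are provisionally treated as negatives.
        sampled = rng.sample(unlabeled, min(len(positives), len(unlabeled)))
        balanced = [(g, 1) for g in positives] + [(g, 0) for g in sampled]
        rng.shuffle(balanced)
        half = len(balanced) // 2
        folds = (balanced[:half], balanced[half:])
        # Two-fold cross-validation: train on one half, score the other.
        for train, test in (folds, folds[::-1]):
            pos_rows = [features[g] for g, y in train if y == 1]
            neg_rows = [features[g] for g, y in train if y == 0]
            if not pos_rows or not neg_rows:
                continue  # cannot fit both centroids on this fold
            c_pos, c_neg = centroid(pos_rows), centroid(neg_rows)
            for g, _ in test:
                d_pos = sq_dist(features[g], c_pos)
                d_neg = sq_dist(features[g], c_neg)
                total = d_pos + d_neg
                # Score in [0, 1]; higher means closer to the positives.
                per_gene[g].append(d_neg / total if total else 0.5)
    return {g: mean(s) for g, s in per_gene.items() if s}


# Toy data: 5 known positives, 2 unlabeled genes that resemble them,
# and 23 unlabeled background genes.
toy = {}
for i in range(5):
    toy[f"known_{i}"] = [1.0, 1.0 + 0.01 * i]
for i in range(2):
    toy[f"hidden_{i}"] = [1.0, 0.95 + 0.01 * i]
for i in range(23):
    toy[f"bg_{i}"] = [0.0, 0.01 * i]

scores = stochastic_pu_scores(toy, [f"known_{i}" for i in range(5)])
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because unlabeled genes are only provisionally treated as negatives and are resampled every iteration, an unlabeled gene that resembles the known positives should still accumulate a high average score, which is the property that lets this kind of scheme rank novel candidates.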

2014 ◽  
Author(s):  
Daniel S Himmelstein ◽  
Sergio E Baranzini

The first decade of Genome Wide Association Studies (GWAS) has uncovered a wealth of disease-associated variants. Two important derivations will be the translation of this information into a multiscale understanding of pathogenic variants, and leveraging existing data to increase the power of existing and future studies through prioritization. We explore edge prediction on heterogeneous networks (graphs with multiple node and edge types) for accomplishing both tasks. First we constructed a network with 18 node types (genes, diseases, tissues, pathophysiologies, and 14 MSigDB (molecular signatures database) collections) and 19 edge types from high-throughput publicly-available resources. From this network composed of 40,343 nodes and 1,608,168 edges, we extracted features that describe the topology between specific genes and diseases. Next, we trained a model from GWAS associations and predicted the probability of association between each protein-coding gene and each of 29 well-studied complex diseases. The model, which achieved 132-fold enrichment in precision at 10% recall, outperformed any individual domain, highlighting the benefit of integrative approaches. We identified pleiotropy, transcriptional signatures of perturbations, pathways, and protein interactions as fundamental mechanisms explaining pathogenesis. Our method successfully predicted the results (with AUROC = 0.79) from a withheld multiple sclerosis (MS) GWAS despite starting with only 13 previously associated genes. Finally, we combined our network predictions with statistical evidence of association to propose four novel MS genes, three of which (JAK2, REL, RUNX3) validated on the masked GWAS. Furthermore, our predictions provide biological support highlighting REL as the causal gene within its gene-rich locus. Users can browse all predictions online (http://het.io).
Heterogeneous network edge prediction effectively prioritized genetic associations and provides a powerful new approach for data integration across multiple domains.
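For readers less familiar with the AUROC figure quoted above, the following is a minimal sketch of how the metric can be computed from predicted scores and binary labels. It uses the pairwise (Mann-Whitney) formulation; the function name `auroc` and the example numbers are illustrative assumptions, not data or code from the study.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for four gene-disease pairs.
example = auroc([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0])  # 3 of 4 pairs ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so an AUROC of 0.79 means a truly associated gene outranks a non-associated one in roughly 79% of such pairs.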


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
K. T. Schütt ◽  
M. Gastegger ◽  
A. Tkatchenko ◽  
K.-R. Müller ◽  
R. J. Maurer

Machine learning advances chemistry and materials science by enabling large-scale exploration of chemical space based on quantum chemical calculations. While these models supply fast and accurate predictions of atomistic chemical properties, they do not explicitly capture the electronic degrees of freedom of a molecule, which limits their applicability for reactive chemistry and chemical analysis. Here we present a deep learning framework for the prediction of the quantum mechanical wavefunction in a local basis of atomic orbitals from which all other ground-state properties can be derived. This approach retains full access to the electronic structure via the wavefunction at force-field-like efficiency and captures quantum mechanics in an analytically differentiable representation. On several examples, we demonstrate that this opens promising avenues to perform inverse design of molecular structures for targeting electronic property optimisation and a clear path towards increased synergy of machine learning and quantum chemistry.


2021 ◽  
Vol 21 ◽  
Author(s):  
Han Yu ◽  
Zi-Ang Shen ◽  
Yuan-Ke Zhou ◽  
Pu-Feng Du

Long non-coding RNAs (lncRNAs) are a type of RNA with little or no protein-coding ability. Their length is more than 200 nucleotides. A large number of studies have indicated that lncRNAs play a significant role in various biological processes, including chromatin organization, epigenetic programming, transcriptional regulation, post-transcriptional processing, and circadian mechanisms at the cellular level. Since lncRNAs perform many of their functions through interactions with proteins, identifying lncRNA-protein interactions is crucial to understanding the molecular functions of lncRNAs. However, because experimental methods are costly and time-consuming, a variety of computational methods have emerged. Recently, many effective and novel machine learning methods have been developed. In general, these methods fall into two categories: semi-supervised learning methods and supervised learning methods. The latter category can be further classified into deep learning-based methods, ensemble learning-based methods, and hybrid methods. In this paper, we focus on supervised learning methods. We summarize the state-of-the-art methods for predicting lncRNA-protein interactions. Furthermore, we compare the performance and characteristics of the different methods. Considering the limits of the existing models, we analyze the open problems and discuss future research directions.


Author(s):  
Aijaz Ahmad Malik ◽  
Warot Chotpatiwetchkul ◽  
Chuleeporn Phanus-umporn ◽  
Chanin Nantasenamat ◽  
Phasit Charoenkwan ◽  
...  

2019 ◽  
Author(s):  
Heba Z. Sailem ◽  
Jens Rittscher ◽  
Lucas Pelkmans

Characterising context-dependent gene functions is crucial for understanding the genetic bases of health and disease. To date, inference of gene functions from large-scale genetic perturbation screens is based on ad-hoc analysis pipelines involving unsupervised clustering and functional enrichment. We present Knowledge-Driven Machine Learning (KDML), a framework that systematically predicts multiple functions for a given gene based on the similarity of its perturbation phenotype to those with known function. As proof of concept, we test KDML on three datasets describing phenotypes at the molecular, cellular and population levels, and show that it outperforms traditional analysis pipelines. In particular, KDML identified an abnormal multicellular organisation phenotype associated with the depletion of olfactory receptors and TGFβ and WNT signalling genes in colorectal cancer cells. We validate these predictions in colorectal cancer patients and show that olfactory receptor expression is predictive of worse patient outcome. These results highlight KDML as a systematic framework for discovering novel scale-crossing and clinically relevant gene functions. KDML is highly generalizable and applicable to various large-scale genetic perturbation screens.


Informatics ◽  
2021 ◽  
Vol 8 (3) ◽  
pp. 59
Author(s):  
Alexander Chowdhury ◽  
Jacob Rosenthal ◽  
Jonathan Waring ◽  
Renato Umeton

Machine learning has become an increasingly ubiquitous technology, as big data continues to inform and influence everyday life and decision-making. Currently, in medicine and healthcare, as well as in most other industries, the two most prevalent machine learning paradigms are supervised learning and transfer learning. Both practices rely on large-scale, manually annotated datasets to train increasingly complex models. However, the requirement that data be manually labeled leaves an excess of unused, unlabeled data available in both public and private data repositories. Self-supervised learning (SSL) is a growing area of machine learning that can take advantage of unlabeled data. In contrast to other machine learning paradigms, SSL algorithms create artificial supervisory signals from unlabeled data and pretrain algorithms on these signals. The aim of this review is two-fold. First, we provide a formal definition of SSL, divide SSL algorithms into their four unique subsets, and review the state of the art published in each of those subsets between 2014 and 2020. Second, this work surveys recent SSL algorithms published in healthcare, in order to provide medical experts with a clearer picture of how they can integrate SSL into their research, with the objective of leveraging unlabeled data.
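The idea of creating "artificial supervisory signals from unlabeled data" can be made concrete with a small sketch of one common pretext task, masked-token prediction. The function name `make_masked_examples`, the `[MASK]` placeholder, and the toy sequence are illustrative assumptions and are not taken from the review.

```python
def make_masked_examples(tokens, mask="[MASK]"):
    """Turn one unlabeled sequence into (input, target) training pairs.

    Each pair hides a single token; the hidden token becomes the label,
    so no manual annotation is needed.
    """
    return [
        (tokens[:i] + [mask] + tokens[i + 1:], tokens[i])
        for i in range(len(tokens))
    ]


# A hypothetical unlabeled sequence yields three supervised examples.
pairs = make_masked_examples(["A", "C", "G"])
```

A model pretrained to recover the hidden tokens can then be fine-tuned on a much smaller labeled dataset, which is the usual motivation for this style of pretraining.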


2019 ◽  
Author(s):  
Ryan L. Collins ◽  
Harrison Brand ◽  
Konrad J. Karczewski ◽  
Xuefang Zhao ◽  
Jessica Alföldi ◽  
...  

Structural variants (SVs) rearrange large segments of the genome and can have profound consequences for evolution and human diseases. As national biobanks, disease association studies, and clinical genetic testing grow increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) have become integral for interpreting genetic variation. To date, no large-scale reference maps of SVs exist from high-coverage sequencing comparable to those available for point mutations in protein-coding genes. Here, we constructed a reference atlas of SVs across 14,891 genomes from diverse global populations (54% non-European) as a component of gnomAD. We discovered a rich landscape of 433,371 distinct SVs, including 5,295 multi-breakpoint complex SVs across 11 mutational subclasses, and examples of localized chromosome shattering, as in chromothripsis. The average individual harbored 7,439 SVs, which accounted for 25-29% of all rare protein-truncating events per genome. We found strong correlations between constraint against damaging point mutations and rare SVs that both disrupt and duplicate protein-coding sequence, suggesting intolerance to reciprocal dosage alterations for a subset of tightly regulated genes. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than any effect on noncoding SVs. Finally, we benchmarked carrier rates for medically relevant SVs, finding very large (≥1Mb) rare SVs in 3.8% of genomes (~1:26 individuals) and clinically reportable incidental SVs in 0.18% of genomes (~1:556 individuals). These data have been integrated directly into the gnomAD browser (https://gnomad.broadinstitute.org) and will have broad utility for population genetics, disease association, and diagnostic screening.
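The "~1:N individuals" figures follow directly from the quoted percentages. As a quick arithmetic check (the helper name `one_in_n` is an illustrative assumption):

```python
def one_in_n(percent):
    """Convert a carrier rate in percent to the nearest '1 in N' figure."""
    return round(100 / percent)


large_rare_sv = one_in_n(3.8)    # very large (>=1Mb) rare SVs
incidental_sv = one_in_n(0.18)   # clinically reportable incidental SVs
```

These evaluate to 26 and 556 respectively, matching the abstract.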


2011 ◽  
Vol 271-273 ◽  
pp. 1451-1454
Author(s):  
Gang Zhang ◽  
Jian Yin ◽  
Liang Lun Cheng ◽  
Chun Ru Wang

Teaching quality is a key metric for evaluating teaching effectiveness and ability in colleges. In much of the previous literature, this metric is evaluated solely through the subjective judgment of a few experts based on their experience, which leads to false, biased, or unstable results. Moreover, purely human evaluation is expensive and difficult to extend to large scale. With the application of information technology, much of the information in college teaching is recorded and stored electronically, which lays the foundation for computer-aided analysis. In this paper, we perform teaching quality evaluation within a machine learning framework, focusing on learning from and modeling the electronic information associated with teaching quality, to obtain a stable model that describes the underlying principles of teaching quality. An Artificial Neural Network (ANN) is selected as the main model in this work. Experimental results on real data sets covering 4 subjects / 8 semesters show the effectiveness of the proposed method.


2020 ◽  
Vol 6 (30) ◽  
pp. eaay2922
Author(s):  
Aleksandra Nivina ◽  
Maj Svea Grieb ◽  
Céline Loot ◽  
David Bikard ◽  
Jean Cury ◽  
...  

Recombination systems are widely used as bioengineering tools, but their sites have to be highly similar to a consensus sequence or to each other. To develop a recombination system free of these constraints, we turned toward attC sites from the bacterial integron system: single-stranded DNA hairpins specifically recombined by the integrase. Here, we present an algorithm that generates synthetic attC sites with conserved structural features and minimal sequence-level constraints. We demonstrate that all generated sites are functional, their recombination efficiency can reach 60%, and they can be embedded into protein coding sequences. To improve recombination of less efficient sites, we applied large-scale mutagenesis and library enrichment coupled to next-generation sequencing and machine learning. Our results validated the efficiency of this approach and allowed us to refine synthetic attC design principles. They can be embedded into virtually any sequence and constitute a unique example of a structure-specific DNA recombination system.

