Classification of non-coding variants with high pathogenic impact

Whole genome sequencing (WGS) is increasingly used to diagnose medical conditions of genetic origin. While both coding and non-coding DNA variants contribute to a wide range of diseases, most patients who receive a WGS-based diagnosis today harbour a protein-coding mutation. Functional interpretation and prioritization of non-coding variants represent a persistent challenge, and disease-causing non-coding variants remain largely unidentified. Depending on the disease, WGS fails to identify a candidate variant in 20-80% of patients, severely limiting the usefulness of sequencing for personalised medicine. Here we present FINSURF, a machine-learning approach to predict the functional impact of non-coding variants in regulatory regions. FINSURF outperforms state-of-the-art methods, owing to control optimisation during training. In addition to ranking candidate variants, FINSURF also delivers diagnostic information on functional consequences of mutations. We applied FINSURF to a diverse set of 30 diseases with described causative non-coding mutations, and correctly identified the disease-causative non-coding variant among the top ten hits in 22 cases. FINSURF is implemented as an online server as well as custom browser tracks, and provides a quick and efficient solution to prioritize candidate non-coding variants in realistic clinical settings.
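The evaluation reported in this abstract (the causative variant ranked among the top ten hits in 22 of 30 diseases) is a top-k recall. A minimal Python sketch of that metric follows, assuming each case yields the 1-based rank of its known causative variant. The function name and the ranks are invented for illustration and are not FINSURF output.

```python
def top_k_recall(causal_ranks, k=10):
    """Fraction of cases whose known causative variant is ranked within the top k."""
    if not causal_ranks:
        raise ValueError("need at least one case")
    hits = sum(1 for rank in causal_ranks if rank <= k)
    return hits / len(causal_ranks)


# Hypothetical ranks (not real results): 22 cases inside the top ten, 8 outside.
example_ranks = [1] * 22 + [50] * 8
recall = top_k_recall(example_ranks, k=10)
print(f"top-10 recall: {recall:.2f}")
```

With 22 hits out of 30 cases the recall is 22/30, or about 73%.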


Functional Interpretation of Genetic Variants Using Deep Learning Predicts Impact on Epigenome

10.1101/389056 ◽

2018 ◽

Cited By ~ 1

Author(s):

Gabriel E. Hoffman ◽

Eric E. Schadt ◽

Panos Roussos

Keyword(s):

Deep Learning ◽

Dna Sequence ◽

Genetic Variants ◽

Disease Risk ◽

Functional Interpretation ◽

Protein Coding ◽

Functional Consequence ◽

Risk Variants ◽

Causal Variants ◽

Coding Variants

Abstract: Identifying causal variants underlying disease risk and adoption of personalized medicine are currently limited by the challenge of interpreting the functional consequences of genetic variants. Predicting the functional effects of disease-associated protein-coding variants is increasingly routine. Yet the vast majority of risk variants are non-coding, and predicting their functional consequences and prioritizing variants for functional validation remain a major challenge. Here we develop a deep learning model to accurately predict locus-specific signals from four epigenetic assays using only DNA sequence as input. Given the predicted epigenetic signal from DNA sequence for the reference and alternative alleles at a given locus, we generate a score of the predicted epigenetic consequences for 438 million variants. These impact scores are assay-specific, are predictive of allele-specific transcription factor binding and are enriched for variants associated with gene expression and disease risk. Nucleotide-level functional consequence scores for non-coding variants can refine the mechanism of known causal variants, identify novel risk variants and prioritize downstream experiments.
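The scoring idea in this abstract, comparing the predicted signal for the reference and alternative alleles at a locus, can be sketched in Python. This is an illustration under stated assumptions, not the authors' method: the abstract does not give the scoring formula, so a simple difference (alternative minus reference) is assumed, and `toy_signal` is a hypothetical stand-in for the trained deep learning model.

```python
def toy_signal(sequence):
    """Hypothetical stand-in for a trained model: GC fraction used as the 'signal'."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)


def impact_score(ref_window, offset, alt_base, predict=toy_signal):
    """Predicted signal for the alternative allele minus that for the reference.

    The difference is an assumed scoring rule chosen for illustration.
    """
    if not 0 <= offset < len(ref_window):
        raise IndexError("offset outside the sequence window")
    # Substitute the alternative base at the variant position.
    alt_window = ref_window[:offset] + alt_base + ref_window[offset + 1:]
    return predict(alt_window) - predict(ref_window)


# A>G substitution at position 4 of an invented 10-bp window.
score = impact_score("ATATATATAT", offset=4, alt_base="G")
print(f"impact score: {score:+.2f}")
```

Any callable that maps a sequence window to a number can be passed as `predict`, so a real model could replace the toy function without changing the scoring code.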


GARFIELD - GWAS Analysis of Regulatory or Functional Information Enrichment with LD correction

10.1101/085738 ◽

2016 ◽

Cited By ~ 17

Author(s):

Valentina Iotchkova ◽

Graham R.S. Ritchie ◽

Matthias Geihs ◽

Sandro Morganella ◽

Josine L. Min ◽

...

Keyword(s):

Association Studies ◽

R Package ◽

Genome Wide Association Studies ◽

Protein Coding ◽

Functional Annotations ◽

Novel Approach ◽

Genome Wide ◽

Functional Consequences ◽

Genomic Regions ◽

Coding Variants

Loci discovered by genome-wide association studies (GWAS) predominantly map outside protein-coding genes. The interpretation of functional consequences of non-coding variants can be greatly enhanced by catalogs of regulatory genomic regions in cell lines and primary tissues. However, robust and readily applicable methods are still lacking to systematically evaluate the contribution of these regions to genetic variation implicated in diseases or quantitative traits. Here we propose a novel approach that leverages GWAS findings with regulatory or functional annotations to classify features relevant to a phenotype of interest. Within our framework, we account for major sources of confounding that current methods do not address. We further assess enrichment statistics for 27 GWAS traits within regulatory regions from the ENCODE and Roadmap projects. We characterise unique enrichment patterns for traits and annotations, driving novel biological insights. The method is implemented as standalone software and an R package to facilitate its application by the research community.


AGFusion: annotate and visualize gene fusions

10.1101/080903 ◽

2016 ◽

Cited By ~ 7

Author(s):

Charlie Murphy ◽

Olivier Elemento

Keyword(s):

Fusion Proteins ◽

Homo Sapiens ◽

Domain Architecture ◽

Protein Domain ◽

Gene Fusions ◽

Functional Interpretation ◽

Protein Coding ◽

Link Type ◽

Functional Consequences ◽

Python Package

Summary: The discovery of novel gene fusions in tumor samples has rapidly accelerated with the rise of next-generation sequencing. A growing number of tools enable discovery of gene fusions from RNA-seq data. However, it is likely that not all gene fusions are driving tumors. Assessing the potential functional consequences of a fusion is critical to understanding its driver role. It is also challenging, as gene fusions are described by chromosomal breakpoint coordinates that need to be translated into an actual amino acid fusion sequence and predicted domain architecture of the fusion proteins. Currently there are no easy-to-use tools that can automatically reconstruct and visualize fusion proteins from genomic breakpoints. To facilitate the functional interpretation of gene fusions, we developed AGFusion, available as an online web tool that can be readily used by non-computational researchers as well as a Python package that can be built into computational pipelines. With minimal input from the user, AGFusion predicts the cDNA, CDS, and protein sequences of all gene fusion products based on all combinations of gene isoforms. For protein-coding fusions, AGFusion can annotate and visualize the protein domain architecture. AGFusion currently supports Homo sapiens (genome builds GRCh37 and GRCh38) and Mus musculus (genome build GRCm38), and new genomes can easily be added. Availability: The AGFusion Python package is freely available at https://github.com/murphycj/AGFusion under the MIT license. The AGFusion web app is available at http://agfusion.info


Classification of Tumor Samples from Expression Data Using Decision Trunks

Cancer Informatics ◽

10.4137/cin.s10356 ◽

2013 ◽

Vol 12 ◽

pp. CIN.S10356 ◽

Cited By ~ 3

Author(s):

Benjamin Ulfenborg ◽

Karin Klinga-Levan ◽

Björn Olsson

Keyword(s):

Decision Trees ◽

Comprehensive Evaluation ◽

Expression Data ◽

Current State ◽

Wide Range ◽

Machine Learning Approach ◽

Human Decision ◽

Classification Tasks ◽

Testing Practices

We present a novel machine learning approach for the classification of cancer samples using expression data. We refer to the method as “decision trunks,” since it is loosely based on decision trees, but contains several modifications designed to achieve an algorithm that: (1) produces smaller and more easily interpretable classifiers than decision trees; (2) is more robust in varying application scenarios; and (3) achieves higher classification accuracy. The decision trunk algorithm has been implemented and tested on 26 classification tasks, covering a wide range of cancer forms, experimental methods, and classification scenarios. This comprehensive evaluation indicates that the proposed algorithm performs at least as well as current state-of-the-art algorithms in terms of accuracy, while producing classifiers that include on average only 2–3 markers. We suggest that the resulting decision trunks have clear advantages over other classifiers due to their transparency, interpretability, and correspondence with human decision-making and clinical testing practices.


Significant abundance of cis configurations of mutations in diploid human genomes

10.1101/221085 ◽

2017 ◽

Cited By ~ 2

Author(s):

Margret R. Hoehe ◽

Ralf Herwig ◽

Qing Mao ◽

Brock A. Peters ◽

Radoje Drmanac ◽

...

Keyword(s):

Genetic Variation ◽

Genome Project ◽

Personal Genome ◽

Functional Interpretation ◽

Protein Coding ◽

Specific Distribution ◽

Gene Sets ◽

Human Genomes ◽

Significant Enrichment ◽

Coding Variants

Abstract: To fully understand human genetic variation, one must assess the specific distribution of variants between the two chromosomal homologues of genes, and any functional units of interest, as the phase of variants can significantly impact gene function and phenotype. To this end, we have systematically analyzed 18,121 autosomal protein-coding genes in 1,092 statistically phased genomes from the 1000 Genomes Project, and an unprecedented number of 184 experimentally phased genomes from the Personal Genome Project. Here we show that mutations predicted to functionally alter the protein, and coding variants as a whole, are not randomly distributed between the two homologues of a gene, but occur significantly more frequently in cis than in trans configurations, with cis/trans ratios of ∼60:40. Significant cis-abundance was observed in virtually all individual genomes in all populations. Nearly all variable genes exhibited either cis or trans configurations of protein-altering mutations in significant excess, allowing distinction of cis- and trans-abundant genes. These common patterns of phase were largely constituted by a shared, global set of phase-sensitive genes. We show significant enrichment of this global set with gene sets indicating its involvement in adaptation and evolution. Moreover, cis- and trans-abundant genes were found to be functionally distinguishable, and exhibited strikingly different distributional patterns of protein-altering mutations. This work establishes common patterns of phase as key characteristics of diploid human exomes and provides evidence for their potential functional significance. Thus, it highlights the importance of phase for the interpretation of protein-coding genetic variation, challenging the current conceptual and functional interpretation of autosomal genes.
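The cis/trans distinction in this abstract can be illustrated with a small Python sketch. It assumes phased heterozygous genotypes written in VCF-style notation (`0|1` or `1|0`), where two variants are in cis if their alternative alleles sit on the same homologue and in trans otherwise. The function names and example pairs are invented for illustration, and the sketch works per variant pair, a simplification of the gene-level analysis the authors describe.

```python
def phase_configuration(gt_a, gt_b):
    """Classify two phased heterozygous genotypes ('0|1' or '1|0') as 'cis' or 'trans'."""
    hap_a = gt_a.split("|")
    hap_b = gt_b.split("|")
    for gt in (hap_a, hap_b):
        if sorted(gt) != ["0", "1"]:
            raise ValueError("expected a phased heterozygous genotype such as '0|1'")
    # Same orientation means both alternative alleles lie on the same homologue.
    return "cis" if hap_a == hap_b else "trans"


def cis_fraction(pairs):
    """Share of variant pairs found in cis configuration."""
    configs = [phase_configuration(a, b) for a, b in pairs]
    return configs.count("cis") / len(configs)


# Invented example pairs (not data from the study): three in cis, two in trans.
example_pairs = [
    ("0|1", "0|1"),
    ("1|0", "1|0"),
    ("0|1", "1|0"),
    ("1|0", "1|0"),
    ("1|0", "0|1"),
]
fraction = cis_fraction(example_pairs)
print(f"cis fraction: {fraction:.2f}")
```

Unphased genotypes such as `0/1` are rejected on purpose, since the configuration cannot be determined without phase.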


MichelaNglo: sculpting protein views on web pages without coding

Bioinformatics ◽

10.1093/bioinformatics/btaa104 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3268-3270 ◽

Cited By ~ 5

Author(s):

Matteo P Ferla ◽

Alistair T Pagnamenta ◽

David Damerell ◽

Jenny C Taylor ◽

Brian D Marsden

Keyword(s):

Digital Media ◽

Structural Information ◽

Web Pages ◽

Protein Coding ◽

Web Based ◽

Functional Consequences ◽

Static Images ◽

Interactive 3D ◽

Coding Variants ◽

Do So

Motivation: The sharing of macromolecular structural information online by scientists is predominantly performed via 2D static images, since the embedding of interactive 3D structures in webpages is non-trivial. Whilst the technologies to do so exist, they are often only implementable with significant web coding experience. Results: Michelaɴɢʟo is an accessible and open-source web-based application that supports the generation, customization and sharing of interactive 3D macromolecular visualizations for digital media without requiring programming skills. A PyMOL file, PDB file, PDB identifier code or protein/gene name can be provided to form the basis of visualizations using the NGL JavaScript library. Hyperlinks that control the view can be added to text within the page. Protein-coding variants can be highlighted to support interpretation of their potential functional consequences. The resulting visualizations and text can be customized and shared, as well as embedded within existing websites by following instructions and using a self-contained download. Michelaɴɢʟo allows researchers to move away from static images and instead engage, describe and explain their protein to a wider audience in a more interactive fashion. Availability and implementation: Michelaɴɢʟo is hosted at michelanglo.sgc.ox.ac.uk. The Python code is freely available at https://github.com/thesgc/MichelaNGLo, along with documentation about its implementation.


A Brief Survey on Text Classification Using Various Machine Learning Techniques

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v8i1.521 ◽

2018 ◽

Vol 8 (1) ◽

pp. 14

Author(s):

Padmavathi .S ◽

M. Chidambaram

Keyword(s):

Machine Learning ◽

Text Classification ◽

Fixed Number ◽

Machine Learning Techniques ◽

Online Information ◽

Rule Based ◽

Learning Techniques ◽

Machine Learning Approach ◽

Rule Based Approach

Text classification has grown more significant in managing and organizing text data due to the tremendous growth of online information. It assigns documents to a fixed number of predefined categories. The rule-based approach and the machine learning approach are the two ways of performing text classification. In the rule-based approach, documents are classified using manually defined rules. In the machine learning approach, classification rules, or a classifier, are derived automatically from example documents; this approach offers higher recall and faster processing. This paper surveys text classification using different machine learning techniques.


A Machine Learning Approach to Study Glycosidase Activities from Bifidobacterium

Microorganisms ◽

10.3390/microorganisms9051034 ◽

2021 ◽

Vol 9 (5) ◽

pp. 1034

Author(s):

Carlos Sabater ◽

Lorena Ruiz ◽

Abelardo Margolles

Keyword(s):

Machine Learning ◽

Supervised Classification ◽

Machine Learning Algorithms ◽

Learning Approach ◽

Human Milk Oligosaccharides ◽

Future Studies ◽

High Fiber ◽

Machine Learning Approach ◽

Prebiotic Oligosaccharides

This study aimed to recover metagenome-assembled genomes (MAGs) from human fecal samples to characterize the glycosidase profiles of Bifidobacterium species exposed to different prebiotic oligosaccharides (galacto-oligosaccharides, fructo-oligosaccharides and human milk oligosaccharides, HMOs) as well as high-fiber diets. A total of 1806 MAGs were recovered from 487 infant and adult metagenomes. Unsupervised and supervised classification of glycosidases encoded in MAGs using machine-learning algorithms allowed establishing characteristic hydrolytic profiles for B. adolescentis, B. bifidum, B. breve, B. longum and B. pseudocatenulatum, yielding classification rates above 90%. Glycosidase families GH5_44, GH32, and GH110 were characteristic of B. bifidum. The presence or absence of GH1, GH2, GH5 and GH20 was characteristic of B. adolescentis, B. breve and B. pseudocatenulatum, while families GH1 and GH30 were relevant in MAGs from B. longum. These characteristic profiles allowed discriminating bifidobacteria regardless of prebiotic exposure. Correlation analysis of glycosidase activities suggests strong associations between glycosidase families comprising HMO-degrading enzymes, which are often found in MAGs from the same species. The mathematical models proposed here may contribute to a better understanding of the carbohydrate metabolism of some common bifidobacteria species and could be extrapolated to other microorganisms of interest in future studies.


Comparative Genomics: Insights on the Pathogenicity and Lifestyle of Rhizoctonia solani

International Journal of Molecular Sciences ◽

10.3390/ijms22042183 ◽

2021 ◽

Vol 22 (4) ◽

pp. 2183

Author(s):

Nurhani Mat Razali ◽

Siti Norvahida Hisham ◽

Ilakiya Sharanee Kumar ◽

Rohit Nandan Shukla ◽

Melvin Lee ◽

...

Keyword(s):

Comparative Genomics ◽

Rhizoctonia Solani ◽

Abiotic Factors ◽

Biotic Factor ◽

Protein Coding ◽

Sustainable Food ◽

Repeat Elements ◽

Gene Sets ◽

Core Genes

Proper management of agricultural disease is important to ensure sustainable food security. Staple food crops like rice, wheat, cereals, and other cash crops hold great export value for countries. Ensuring proper supply is critical; hence any biotic or abiotic factors contributing to the shortfall in yield of these crops should be alleviated. Rhizoctonia solani is a major biotic factor that results in yield losses in many agriculturally important crops. This paper focuses on genome informatics of our Malaysian Draft R. solani AG1-IA, and the comparative genomics (inter- and intra- AG) with four AGs including China AG1-IA (AG1-IA_KB317705.1), AG1-IB, AG3, and AG8. The genomic content of repeat elements, transposable elements (TEs), syntenic genomic blocks, functions of protein-coding genes as well as core orthologous genic information that underlies R. solani’s pathogenicity strategy were investigated. Our analyses show that all studied AGs have low content and varying profiles of TEs. All AGs were dominant for Class I TE, much like other basidiomycete pathogens. All AGs demonstrate dominance in Glycoside Hydrolase protein-coding gene assignments suggesting its importance in infiltration and infection of host. Our profiling also provides a basis for further investigation on lack of correlation observed between number of pathogenicity and enzyme-related genes with host range. Despite being grouped within the same AG with China AG1-IA, our Draft AG1-IA exhibits differences in terms of protein-coding gene proportions and classifications. This implies that strains from similar AG do not necessarily have to retain similar proportions and classification of TE but must have the necessary arsenal to enable successful infiltration and colonization of host. In a larger perspective, all the studied AGs essentially share core genes that are generally involved in adhesion, penetration, and host colonization. However, the different infiltration strategies will depend on the level of host resilience where this is clearly exhibited by the gene sets encoded for the process of infiltration, infection, and protection from host.


Modes of Interaction in Naturally Occurring Medical Encounters With General Practitioners: The “One in a Million” Study

Qualitative Health Research ◽

10.1177/1049732321993790 ◽

2021 ◽

pp. 104973232199379

Author(s):

Olaug S. Lian ◽

Sarah Nettleton ◽

Åge Wifstad ◽

Christopher Dowrick

Keyword(s):

General Practitioners ◽

Narrative Analysis ◽

Mode Competition ◽

General Applicability ◽

Naturally Occurring ◽

Wide Range ◽

The One ◽

Narrative Mode ◽

Modes Of Interaction

In this article, we qualitatively explore the manner and style in which medical encounters between patients and general practitioners (GPs) are mutually conducted, as exhibited in situ in 10 consultations sourced from the One in a Million: Primary Care Consultations Archive in England. Our main objectives are to identify interactional modes, to develop a classification of these modes, and to uncover how modes emerge and shift both within and between consultations. Deploying an interactional perspective and a thematic and narrative analysis of consultation transcripts, we identified five distinctive interactional modes: question and answer (Q&A) mode, lecture mode, probabilistic mode, competition mode, and narrative mode. Most modes are GP-led. Mode shifts within consultations generally map on to the chronology of the medical encounter. Patient-led narrative modes are initiated by patients themselves, which demonstrates agency. Our classification of modes derives from complete naturally occurring consultations, covering a wide range of symptoms, and may have general applicability.
