How Machine Learning and Statistical Models Advance Molecular Diagnostics of Rare Disorders Via Analysis of RNA Sequencing Data

Although individually uncommon, rare diseases collectively affect approximately 350 million people worldwide. Currently, nearly 6,000 distinct rare disorders with a known molecular basis have been described, yet establishing a specific diagnosis based on the clinical phenotype is challenging. Increasing integration of whole exome sequencing into routine diagnostics of rare diseases is improving diagnostic rates. Nevertheless, about half of the patients do not receive a genetic diagnosis due to the challenges of variant detection and interpretation. In recent years, RNA sequencing has increasingly been used as a complementary diagnostic tool providing functional data. Initially, arbitrary thresholds were applied to call aberrant expression, aberrant splicing, and mono-allelic expression. As RNA sequencing was applied to the search for a molecular diagnosis, robust statistical models on normalized read counts were implemented, allowing the detection of significant outliers corrected for multiple testing. More recently, machine learning methods have been developed to improve the normalization of RNA sequencing read count data by taking confounders into account. Together, these methods have increased the power and sensitivity of detection and interpretation of pathogenic variants, leading to diagnostic rates of 10–35% in rare diseases. In this review, we provide an overview of the methods used for RNA sequencing and illustrate how these can improve the diagnostic yield of rare diseases.
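
The idea of calling significant outliers on normalized read counts, corrected for multiple testing, can be illustrated with a minimal sketch. It is not the method of any specific tool from the review: the counts-per-million normalization, the per-gene z-score with a normal approximation, the per-sample Benjamini-Hochberg correction, and the function names `benjamini_hochberg` and `expression_outliers` are all assumptions made for illustration.

```python
import math
import numpy as np


def benjamini_hochberg(p):
    """Return Benjamini-Hochberg adjusted p-values for a 1-D array."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


def expression_outliers(counts, alpha=0.05):
    """Flag (sample, gene) pairs with outlying expression.

    counts: array of shape (n_samples, n_genes) with raw read counts.
    Returns a boolean array of the same shape.
    """
    counts = np.asarray(counts, dtype=float)
    # Assumed normalization: counts per million, then log2.
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    logx = np.log2(cpm + 1.0)
    # Z-score of each sample relative to all samples, per gene.
    mu = logx.mean(axis=0)
    sd = logx.std(axis=0, ddof=1)
    sd = np.where(sd > 0, sd, 1.0)  # constant genes get z = 0
    z = (logx - mu) / sd
    # Two-sided p-value under a normal approximation.
    p = np.vectorize(math.erfc)(np.abs(z) / math.sqrt(2.0))
    # Correct across genes within each sample (an assumption).
    adjusted = np.vstack([benjamini_hochberg(row) for row in p])
    return adjusted < alpha
```

A fixed z-score cut-off would correspond to the arbitrary thresholds mentioned in the abstract; the adjusted p-value is what makes the call depend on the number of genes tested. Controlling for confounders, as the machine learning methods in the review do, is deliberately left out of this sketch.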

The Role of RNA-Sequencing as a New Genetic Diagnosis Tool

Current Genetic Medicine Reports ◽

10.1007/s40142-021-00199-x ◽

2021 ◽

Vol 9 (2) ◽

pp. 13-21

Author(s):

Philippa D. K. Curry ◽

Krystyna L. Broda ◽

Christopher J. Carroll

Keyword(s):

RNA Sequencing ◽

Rare Diseases ◽

Diagnostic Yield ◽

Neuromuscular Disorders ◽

Genetic Diagnosis ◽

RNA Seq ◽

Novel Approach ◽

Whole Exome ◽

Diagnosis Tool

Abstract Purpose of Review: Whole exome sequencing (WES) and whole-genome sequencing (WGS) are frontline approaches for the genetic diagnosis of rare diseases. However, WES/WGS fails in up to 75% of cases. Transcriptomics via RNA-sequencing (RNA-Seq) is a novel approach that aims to increase the diagnostic yield in rare diseases. Recent Findings: Recent publications focus on the success of RNA-Seq for increasing diagnosis rates in WES/WGS-negative patients in up to 36% of cases, across a range of different diseases, sample sizes, and tissue types. Summary: RNA-Seq is beneficial for aiding prioritisation of causative variants currently not detected or often overlooked by WES/WGS alone. An improvement in diagnostic yields has been demonstrated using multiple source tissues, with muscle and fibroblasts being the most representative, but the more accessible blood still demonstrating diagnostic success, particularly in neuromuscular disorders. The introduction of RNA-Seq to the genetic diagnosis toolbox promises to be a useful complementary tool to WES/WGS for improving genetic diagnosis in patients with rare disease.

Detection of pathogenic splicing events from RNA-sequencing data using dasper

10.1101/2021.03.29.437534 ◽

2021 ◽

Author(s):

David Zhang ◽

Regina H. Reynolds ◽

Sonia Garcia-Ruiz ◽

Emil K Gustavsson ◽

Sid Sethi ◽

...

Keyword(s):

RNA Sequencing ◽

Diagnostic Yield ◽

Genetic Diagnosis ◽

Fibroblast Cell ◽

Sequencing Data ◽

Sequencing Technologies ◽

Pathogenic Variants ◽

Splicing Variants ◽

Gene Filter ◽

Disease Associations

Abstract Although next-generation sequencing technologies have accelerated the discovery of novel gene-to-disease associations, many patients with suspected Mendelian diseases still leave the clinic without a genetic diagnosis. An estimated one third of these patients will have disorders caused by mutations impacting splicing. RNA-sequencing has been shown to be a promising diagnostic tool; however, few methods have been developed to integrate RNA-sequencing data into the diagnostic pipeline. Here, we introduce dasper, an R/Bioconductor package that improves upon existing tools for detecting aberrant splicing by using machine learning to incorporate disruptions in exon-exon junction counts as well as coverage. dasper is designed for diagnostics, providing a rank-based report of how aberrant each splicing event looks, as well as including visualization functionality to facilitate interpretation. We validate dasper using 16 patient-derived fibroblast cell lines harbouring pathogenic variants known to impact splicing. We find that dasper is able to detect pathogenic splicing events with greater accuracy than existing LeafCutterMD or z-score approaches. Furthermore, by only applying a broad OMIM gene filter (without any variant-level filters), dasper is able to detect pathogenic splicing events within the top 10 most aberrant identified for each patient. Since using publicly available control data minimises costs associated with incorporating RNA-sequencing into diagnostic pipelines, we also investigate the use of 504 GTEx fibroblast samples as controls. We find that dasper leverages publicly available data effectively, ranking pathogenic splicing events in the top 25. Thus, we believe dasper can increase diagnostic yield for pathogenic splicing variants and enable the efficient implementation of RNA-sequencing for diagnostics in clinical laboratories.

FC 011 KIDNEYNETWORK: USING KIDNEY DERIVED GENE EXPRESSION DATA TO PREDICT AND PRIORITIZE NOVEL GENES INVOLVED IN KIDNEY DISEASE

Nephrology Dialysis Transplantation ◽

10.1093/ndt/gfab131.001 ◽

2021 ◽

Vol 36 (Supplement_1) ◽

Author(s):

Floranne Boulogne ◽

Laura Claus ◽

Henry Wiersma ◽

Roy Oelen ◽

Floor Schukking ◽

...

Keyword(s):

Gene Expression ◽

Kidney Disease ◽

Candidate Gene ◽

Exome Sequencing ◽

RNA Sequencing ◽

Expression Patterns ◽

Genetic Diagnosis ◽

Specific Gene ◽

Sequencing Data ◽

Exome Sequencing Data

Abstract Background and Aims: Genetic testing in patients with suspected hereditary kidney disease does not always reveal the genetic cause for the patient's disorder. Potentially pathogenic variants can reside in genes that are not known to be involved in kidney disease, which makes it difficult to prioritize and interpret the relevance of these variants. As such, there is a clear need for methods that predict the phenotypic consequences of gene expression in a way that is as unbiased as possible. To help identify candidate genes we have developed KidneyNetwork, in which tissue-specific expression is utilized to predict kidney-specific gene functions. Method: We combined gene co-expression in 878 publicly available kidney RNA-sequencing samples with the co-expression of a multi-tissue RNA-sequencing dataset of 31,499 samples to build KidneyNetwork. The expression patterns were used to predict which genes have a kidney-related function, and which (disease) phenotypes might be caused when these genes are mutated. By integrating the information from the HPO database, in which known phenotypic consequences of disease genes are annotated, with the gene co-expression network we obtained prediction scores for each gene per HPO term. As proof of principle, we applied KidneyNetwork to prioritize variants in exome-sequencing data from 13 kidney disease patients without a genetic diagnosis. Results: We assessed the prediction performance of KidneyNetwork by comparing it to GeneNetwork, a multi-tissue co-expression network we previously developed. In KidneyNetwork, we observe a significantly improved prediction accuracy of kidney-related HPO-terms, as well as an increase in the total number of significantly predicted kidney-related HPO-terms (figure 1). To examine its clinical utility, we applied KidneyNetwork to 13 patients with a suspected hereditary kidney disease without a genetic diagnosis. Based on the HPO terms “Renal cyst” and “Hepatic cysts”, combined with a list of potentially damaging variants in one of the undiagnosed patients with mild ADPKD/PCLD, we identified ALG6 as a new candidate gene. ALG6 bears a high resemblance to other genes implicated in this phenotype in recent years. Through the 100,000 Genomes Project and collaborators we identified three additional patients with kidney and/or liver cysts carrying a suspected deleterious variant in ALG6. Conclusion: We present KidneyNetwork, a kidney-specific co-expression network that accurately predicts which genes have kidney-specific functions and may result in kidney disease. Gene-phenotype associations of genes unknown for kidney-related phenotypes can be predicted by KidneyNetwork. We show the added value of KidneyNetwork by applying it to exome sequencing data of kidney disease patients without a molecular diagnosis and consequently we propose ALG6 as a promising candidate gene. KidneyNetwork can be applied to clinically unsolved kidney disease cases, but it can also be used by researchers to gain insight into individual genes to better understand kidney physiology and pathophysiology. Acknowledgments: This research was made possible through access to the data and findings generated by the 100,000 Genomes Project; http://www.genomicsengland.co.uk.
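
The general principle of deriving a per-gene prediction score for a phenotype term from co-expression can be sketched as a simple guilt-by-association calculation. This is an illustrative simplification, not the KidneyNetwork method, whose scoring procedure is not described in the abstract: the function name `coexpression_scores`, the use of Pearson correlation, and the averaging over annotated genes are all assumptions.

```python
import numpy as np


def coexpression_scores(expr, annotated_genes):
    """Score every gene for one phenotype term by co-expression.

    expr: array of shape (n_genes, n_samples) with normalized expression.
    annotated_genes: indices of genes already annotated to the term
        (at least two, so every gene has a partner other than itself).
    Returns one score per gene: the mean Pearson correlation with the
    annotated genes, excluding the gene itself.
    """
    expr = np.asarray(expr, dtype=float)
    annotated = np.asarray(annotated_genes)
    corr = np.corrcoef(expr)  # gene-by-gene correlation matrix
    scores = np.empty(expr.shape[0])
    for gene in range(expr.shape[0]):
        partners = annotated[annotated != gene]
        scores[gene] = corr[gene, partners].mean()
    return scores
```

Under this toy scheme, an unannotated gene that is strongly co-expressed with the genes known for a term such as "Renal cyst" receives a high score and becomes a candidate, which mirrors the prioritization logic described above without reproducing its details.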

MLSeq: Machine learning interface for RNA-sequencing data

Computer Methods and Programs in Biomedicine ◽

10.1016/j.cmpb.2019.04.007 ◽

2019 ◽

Vol 175 ◽

pp. 223-231 ◽

Cited By ~ 8

Author(s):

Dincer Goksuluk ◽

Gokmen Zararsiz ◽

Selcuk Korkmaz ◽

Vahap Eldem ◽

Gozde Erturk Zararsiz ◽

...

Keyword(s):

Machine Learning ◽

RNA Sequencing ◽

Sequencing Data

Missing Value Imputation for RNA-Sequencing Data Using Statistical Models: A Comparative Study

Journal of Statistical Theory and Applications ◽

10.2991/jsta.2016.15.3.3 ◽

2016 ◽

Vol 15 (3) ◽

pp. 221 ◽

Cited By ~ 1

Author(s):

Taban Baghfalaki ◽

Mojtaba Ganjali ◽

Damon Berridge

Keyword(s):

Comparative Study ◽

RNA Sequencing ◽

Statistical Models ◽

Sequencing Data ◽

Missing Value ◽

Missing Value Imputation

Automated annotation of rare-cell types from single-cell RNA-sequencing data through synthetic oversampling

10.1101/2021.01.20.427486 ◽

2021 ◽

Author(s):

Saptarshi Bej ◽

Anne-Marie Galow ◽

Robert David ◽

Markus Wolfien ◽

Olaf Wolkenhauer

Keyword(s):

Machine Learning ◽

Single Cell ◽

RNA Sequencing ◽

Cell Types ◽

Classification Problem ◽

Use Case ◽

Cell Capture ◽

Sequencing Data ◽

Rare Cells ◽

The Impact

Abstract The research landscape of single-cell and single-nuclei RNA sequencing is evolving rapidly, and one area that is enabled by this technology is the detection of rare cells. An automated, unbiased and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it will usually be necessary to generate other datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as an input to generate synthetic cells to then identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. We demonstrate the effectiveness of the method for two independent use cases, each consisting of two published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8,635). This use case was designed to take a larger imbalance ratio (∼1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (∼1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single cell capture procedures and the impact of “less” rare-cell types. For validation purposes, all datasets have also been analyzed in a traditional manner using common data analysis approaches, such as the Seurat3 workflow. Our algorithm identifies rare-cell populations with a high accuracy and low false positive detection rate. A striking benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code base is publicly available at FairdomHub (https://fairdomhub.org/assays/1368) and can easily be transferred to train other customized approaches.
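
How synthetic oversampling can enlarge a rare-cell class may be sketched as follows. This is a simplified illustration loosely modelled on the name of the LoRAS algorithm (noisy "shadow" copies of neighbouring minority cells, combined with random convex weights); it is not the sc-SynO code, and the function name `loras_like_oversample` and every parameter and default value are assumptions.

```python
import numpy as np


def loras_like_oversample(minority, n_new, k=5, n_shadow=10,
                          sigma=0.05, n_affine=3, seed=0):
    """Generate n_new synthetic minority samples.

    minority: array of shape (n_cells, n_features) for the rare class.
    k: neighbourhood size (each cell counts as its own neighbour).
    n_shadow: noisy shadow samples drawn per neighbourhood.
    sigma: standard deviation of the Gaussian noise.
    n_affine: shadow samples combined per synthetic cell
        (must not exceed n_shadow).
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_cells, n_features = minority.shape
    k = min(k, n_cells)
    # Pairwise Euclidean distances within the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, n_features))
    for i in range(n_new):
        centre = rng.integers(n_cells)
        hood = minority[neighbours[centre]]
        # Shadow samples: neighbourhood cells plus small Gaussian noise.
        picks = rng.integers(len(hood), size=n_shadow)
        noise = rng.normal(0.0, sigma, size=(n_shadow, n_features))
        shadows = hood[picks] + noise
        # Random convex combination of a few shadow samples.
        chosen = shadows[rng.choice(n_shadow, size=n_affine, replace=False)]
        weights = rng.dirichlet(np.ones(n_affine))
        synthetic[i] = weights @ chosen
    return synthetic
```

With 17 rare nuclei among 8,635, as in the first use case, generating enough synthetic cells to bring the two classes closer to balance before training a classifier is the kind of imbalance correction the abstract describes. How many cells to generate and which classifier to train are left open here.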

Authentication of Differential Gene Expression in Oral Squamous Cell Carcinoma using Machine Learning Applications

10.21203/rs.3.rs-128045/v1 ◽

2020 ◽

Author(s):

Rian Pratama ◽

Jae Joon Hwang ◽

Ji Hye Lee ◽

Giltae Song ◽

Hae Ryoun Park

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Squamous Cell Carcinoma ◽

Oral Squamous Cell Carcinoma ◽

Cell Carcinoma ◽

RNA Sequencing ◽

Differential Gene Expression ◽

Sequencing Data ◽

Machine Learning Applications ◽

Differential Gene

Abstract Background: Recently, the possibility of tumour classification based on genetic data has been investigated. However, genetic datasets are difficult to handle because of their massive size and complexity of manipulation. In the present study, we examined the diagnostic performance of machine learning applications using imaging-based classifications of oral squamous cell carcinoma (OSCC) gene sets. Methods: RNA sequencing data from SCC tissues from various sites, including oral, non-oral head and neck, oesophageal, and cervical regions, were downloaded from The Cancer Genome Atlas (TCGA). The feature genes were extracted through a convolutional neural network (CNN) and machine learning, and the performance of each analysis was compared. Results: The ability of the machine learning analysis to classify OSCC tumours was excellent. However, the tool exhibited poorer performance in discriminating histopathologically dissimilar cancers derived from the same type of tissue than in differentiating cancers of the same histopathologic type with different tissue origins, revealing that the differential gene expression pattern is a more important factor than the histopathologic features for differentiating cancer types. Conclusion: The CNN-based diagnostic model and the visualisation methods using RNA sequencing data were useful for correctly categorising OSCC. The analysis showed differentially expressed genes in multiwise comparisons of various types of SCCs, such as KCNA10, FOSL2, and PRDM16, and extracted leader genes from pairwise comparisons were FGF20, DLC1, and ZNF705D.

A Machine Learning Approach to Prostate Cancer Risk Classification Through Use of RNA Sequencing Data

Lecture Notes in Computer Science - Big Data – BigData 2019 ◽

10.1007/978-3-030-23551-2_5 ◽

2019 ◽

pp. 65-79

Author(s):

Matthew Casey ◽

Baldwin Chen ◽

Jonathan Zhou ◽

Nianjun Zhou

Keyword(s):

Prostate Cancer ◽

Machine Learning ◽

Cancer Risk ◽

RNA Sequencing ◽

Prostate Cancer Risk ◽

Risk Classification ◽

Learning Approach ◽

Sequencing Data ◽

Machine Learning Approach

Machine learning applied to whole‐blood RNA‐sequencing data uncovers distinct subsets of patients with systemic lupus erythematosus

Clinical & Translational Immunology ◽

10.1002/cti2.1093 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 4

Author(s):

William A Figgett ◽

Katherine Monaghan ◽

Milica Ng ◽

Monther Alhamdoosh ◽

Eugene Maraskovsky ◽

...

Keyword(s):

Machine Learning ◽

Systemic Lupus Erythematosus ◽

RNA Sequencing ◽

Lupus Erythematosus ◽

Whole Blood ◽

Sequencing Data ◽

Systemic Lupus

The altered transcriptome of pediatric myelodysplastic syndrome revealed by RNA sequencing

Journal of Hematology & Oncology ◽

10.1186/s13045-020-00974-3 ◽

2020 ◽

Vol 13 (1) ◽

Author(s):

Lorena Zubovic ◽

Silvano Piazza ◽

Toma Tebaldi ◽

Luca Cozzuto ◽

Giuliana Palazzo ◽

...

Keyword(s):

Cell Cycle ◽

Myelodysplastic Syndrome ◽

RNA Sequencing ◽

Adaptive Immunity ◽

Differentially Expressed Genes ◽

Ribosome Biogenesis ◽

Therapeutic Targets ◽

Differentially Expressed ◽

Sequencing Data ◽

Aberrant Expression

Abstract Pediatric myelodysplastic syndrome (PMDS) is a very rare and still poorly characterized disorder. In this work, we identified novel potential targets of PMDS by determining genes with aberrant expression, which can be correlated with PMDS pathogenesis. We identified 291 differentially expressed genes (DEGs) in PMDS patients, comprising genes involved in the regulation of apoptosis and the cell cycle, ribosome biogenesis, inflammation and adaptive immunity. Ten selected DEGs were then validated, confirming the sequencing data. These DEGs will potentially represent new molecular biomarkers and therapeutic targets for PMDS.
