Genotype-free individual genome reconstruction of Multiparental Population Models by RNA sequencing data

Abstract Multi-parent populations (MPPs), genetically segregating model systems derived from two or more inbred founder strains, are widely used in biomedical and agricultural research. Gene expression profiling by direct RNA sequencing (RNA-Seq) is commonly applied to MPPs to investigate gene expression regulation and to identify candidate genes. In genetically diverse populations, including most MPPs, quantification of gene expression is improved when the RNA-Seq reads are aligned to individualized transcriptomes that incorporate known polymorphic loci. However, the process of constructing and analyzing individual genomes can be computationally demanding and error-prone. We propose a new approach, genome reconstruction by RNA-Seq (GBRS), that relies on simultaneous alignment of RNA-Seq reads to the founder strain transcriptomes. GBRS can reconstruct the diploid genome of each individual and quantify both total and allele-specific gene expression. We demonstrate that GBRS performs as well as methods that rely on high-density genotyping arrays to reconstruct the founder haplotype mosaic of MPP individuals. Using GBRS in addition to other genotyping methods provides quality control for detecting sample mix-ups and improves power to detect expression quantitative trait loci. GBRS software is freely available at https://github.com/churchill-lab/gbrs.

FC 011KIDNEYNETWORK: USING KIDNEY DERIVED GENE EXPRESSION DATA TO PREDICT AND PRIORITIZE NOVEL GENES INVOLVED IN KIDNEY DISEASE

Nephrology Dialysis Transplantation ◽

10.1093/ndt/gfab131.001 ◽

2021 ◽

Vol 36 (Supplement_1) ◽

Author(s):

Floranne Boulogne ◽

Laura Claus ◽

Henry Wiersma ◽

Roy Oelen ◽

Floor Schukking ◽

...

Keyword(s):

Gene Expression ◽

Kidney Disease ◽

Candidate Gene ◽

Exome Sequencing ◽

Rna Sequencing ◽

Expression Patterns ◽

Genetic Diagnosis ◽

Specific Gene ◽

Sequencing Data ◽

Exome Sequencing Data

Abstract Background and Aims Genetic testing in patients with suspected hereditary kidney disease does not always reveal the genetic cause for the patient's disorder. Potentially pathogenic variants can reside in genes that are not known to be involved in kidney disease, which makes it difficult to prioritize and interpret the relevance of these variants. As such, there is a clear need for methods that predict the phenotypic consequences of gene expression in a way that is as unbiased as possible. To help identify candidate genes we have developed KidneyNetwork, in which tissue-specific expression is utilized to predict kidney-specific gene functions. Method We combined gene co-expression in 878 publicly available kidney RNA-sequencing samples with the co-expression of a multi-tissue RNA-sequencing dataset of 31,499 samples to build KidneyNetwork. The expression patterns were used to predict which genes have a kidney-related function, and which (disease) phenotypes might be caused when these genes are mutated. By integrating the information from the HPO database, in which known phenotypic consequences of disease genes are annotated, with the gene co-expression network we obtained prediction scores for each gene per HPO term. As proof of principle, we applied KidneyNetwork to prioritize variants in exome-sequencing data from 13 kidney disease patients without a genetic diagnosis. Results We assessed the prediction performance of KidneyNetwork by comparing it to GeneNetwork, a multi-tissue co-expression network we previously developed. In KidneyNetwork, we observe a significantly improved prediction accuracy of kidney-related HPO-terms, as well as an increase in the total number of significantly predicted kidney-related HPO-terms (figure 1). To examine its clinical utility, we applied KidneyNetwork to 13 patients with a suspected hereditary kidney disease without a genetic diagnosis. 
Based on the HPO terms “Renal cyst” and “Hepatic cysts”, combined with a list of potentially damaging variants in one of the undiagnosed patients with mild ADPKD/PCLD, we identified ALG6 as a new candidate gene. ALG6 bears a high resemblance to other genes implicated in this phenotype in recent years. Through the 100,000 Genomes Project and collaborators we identified three additional patients with kidney and/or liver cysts carrying a suspected deleterious variant in ALG6. Conclusion We present KidneyNetwork, a kidney-specific co-expression network that accurately predicts which genes have kidney-specific functions and may, when mutated, result in kidney disease. KidneyNetwork can predict gene-phenotype associations for genes not previously linked to kidney-related phenotypes. We show the added value of KidneyNetwork by applying it to exome sequencing data of kidney disease patients without a molecular diagnosis, and consequently we propose ALG6 as a promising candidate gene. KidneyNetwork can be applied to clinically unsolved kidney disease cases, but it can also be used by researchers to gain insight into individual genes to better understand kidney physiology and pathophysiology. Acknowledgments This research was made possible through access to the data and findings generated by the 100,000 Genomes Project; http://www.genomicsengland.co.uk.
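
The abstract above builds on gene co-expression across many RNA-sequencing samples. As a rough illustration of the general idea only, the following sketch computes pairwise Pearson correlations between gene expression vectors. The gene names and values are hypothetical, and the abstract does not state which correlation measure or preprocessing KidneyNetwork actually uses, so this should not be read as its method.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A gene with constant expression carries no co-expression signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpression(expr):
    """Return {(gene_a, gene_b): correlation} for every unordered gene pair.

    `expr` maps a gene name to its expression values across the same samples.
    """
    genes = sorted(expr)
    return {
        (g, h): pearson(expr[g], expr[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


# Hypothetical toy data: three genes measured in three samples.
toy = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [2.0, 4.0, 6.0],
    "GENE_C": [3.0, 2.0, 1.0],
}
network = coexpression(toy)
```

In this toy input GENE_A and GENE_B rise together while GENE_C falls, so the first pair is positively and the other pairs negatively correlated. A real network would be built from hundreds or thousands of samples after normalization.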

Importance of experimental information (metadata) for archived sequence data: case of specific gene bias due to lag time between sample harvest and RNA protection in RNA sequencing

PeerJ ◽

10.7717/peerj.11875 ◽

2021 ◽

Vol 9 ◽

pp. e11875

Author(s):

Tomoko Matsuda

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Time Course ◽

Sequence Data ◽

Specific Gene ◽

Time Interval ◽

Short Time Interval ◽

Rna Seq ◽

Lysis Buffer ◽

Rna Protection

Large volumes of high-throughput sequencing data have been submitted to the Sequence Read Archive (SRA). The lack of experimental metadata associated with the data makes reuse and understanding data quality very difficult. In the case of RNA sequencing (RNA-Seq), which reveals the presence and quantity of RNA in a biological sample at a given moment, it is necessary to consider that gene expression responds over a short time interval (several seconds to a few minutes) in many organisms. Therefore, to isolate RNA that accurately reflects the transcriptome at the point of harvest, raw biological samples should be processed as soon as possible by freezing in liquid nitrogen, immersing in RNA stabilization reagent, or lysing and homogenizing in RNA lysis buffer containing guanidine thiocyanate. As the number of samples handled simultaneously increases, the time until the RNA is protected can increase. Here, to evaluate the effect of different lag times in RNA protection on RNA-Seq data, we harvested CHO-S cells after 3, 5, 6, and 7 days of cultivation, added RNA lysis buffer in a time course of 15, 30, 45, and 60 min after harvest, and conducted RNA-Seq. These RNA samples showed high RNA integrity number (RIN) values indicating non-degraded RNA, and sequence data from libraries prepared with these RNA samples was of high quality according to FastQC. We observed that, at the same cultivation day, global trends of gene expression were similar across the time course of addition of RNA lysis buffer; however, the expression of some genes was significantly different between the time-course samples of the same cultivation day; most of these differentially expressed genes were related to apoptosis. We conclude that the time lag between sample harvest and RNA protection influences gene expression of specific genes. 
It is, therefore, necessary to know not only RIN values of RNA and the quality of the sequence data but also how the experiment was performed when acquiring RNA-Seq data from the database.

Abstract 4689: Subclone-specific evolution of tumor phenotypes – A framework to study subclone-specific gene expression from a combination of bulk DNA and single cell RNA sequencing data

10.1158/1538-7445.sabcs18-4689 ◽

2019 ◽

Author(s):

Yi Qiao ◽

Xiaomeng Huang ◽

Samuel Brady ◽

Andrea Bild ◽

David Bowtell ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Specific Gene ◽

Sequencing Data ◽

Specific Gene Expression ◽

Single Cell Rna Sequencing ◽

Tumor Phenotypes

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies as well as agricultural and clinical applications. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Transcriptome diversity is a systematic source of bias in RNA-sequencing data

10.1101/2021.04.27.441712 ◽

2021 ◽

Author(s):

Pablo E. García-Nieto ◽

Ban Wang ◽

Hunter B. Fraser

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Systematic Bias ◽

Simple Explanation ◽

Rna Seq ◽

Sequencing Data ◽

Biological Variables ◽

Systematic Effects ◽

Standard Practices ◽

Transcriptome Diversity

Abstract Background: RNA sequencing has been widely used as an essential tool to probe gene expression. While standard practices have been established to analyze RNA-seq data, it is still challenging to detect and remove artifactual signals. Several factors such as sex, age, and sequencing technology have been found to bias these estimates. Probabilistic estimation of expression residuals (PEER) has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors. Results: Here we show that transcriptome diversity, a simple metric based on Shannon entropy, explains a large portion of variability in gene expression and is a major factor detected by PEER. We then show that transcriptome diversity has significant associations with multiple technical and biological variables across diverse organisms and datasets. This prevalent confounding factor provides a simple explanation for a major source of systematic biases in gene expression estimates. Conclusions: Our results show that transcriptome diversity is a metric that captures a systematic bias in RNA-seq and is the strongest known factor encoded in PEER covariates.
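
The abstract describes transcriptome diversity as a simple metric based on Shannon entropy. A minimal sketch of such a metric for one sample is given below, assuming that each gene's expression is first converted to a proportion of the sample total. The normalization by the logarithm of the number of genes is an assumption made here for illustration; the abstract does not say how the authors scale the value.

```python
import math


def shannon_entropy(expression):
    """Shannon entropy (natural log) of one sample's expression profile.

    `expression` is a list of non-negative expression values, one per gene.
    """
    total = sum(expression)
    if total <= 0:
        raise ValueError("sample has no expression signal")
    entropy = 0.0
    for value in expression:
        if value > 0:  # terms with p = 0 contribute nothing
            p = value / total
            entropy -= p * math.log(p)
    return entropy


def transcriptome_diversity(expression):
    """Entropy scaled to [0, 1] by its maximum, log(number of genes).

    The scaling is an illustrative assumption, not taken from the abstract.
    """
    n_genes = len(expression)
    if n_genes < 2:
        return 0.0
    return shannon_entropy(expression) / math.log(n_genes)


# Hypothetical samples: one evenly expressed, one dominated by a single gene.
even_sample = [10.0, 10.0, 10.0, 10.0]
skewed_sample = [1000.0, 0.0, 0.0, 0.0]
```

A sample in which all genes are equally expressed reaches the maximum diversity, while a sample dominated by one transcript scores near zero. Comparing this single number across samples is what would allow it to be tested as a covariate.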

SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

Summary SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information Supplementary data are available at bioRχiv online.

Global Gene Expression and Mutation Signatures Are Preserved in PDX Models of Pediatric AML and Aid Discovery of Targeted Therapy for Cases with CBFA2T3/GLIS2 Rearrangement

Blood ◽

10.1182/blood-2019-129225 ◽

2019 ◽

Vol 134 (Supplement_1) ◽

pp. 3766-3766

Author(s):

Mark Wunderlich ◽

Jing Chen ◽

Eric O'Brien ◽

Nicole Manning ◽

Christina Sexton ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Targeted Therapies ◽

Global Gene Expression ◽

Specific Gene ◽

Sequencing Data ◽

Patient Sample ◽

Nsg Mice ◽

Pediatric Aml ◽

Pdx Models

Therapies for pediatric acute myeloid leukemia (AML) remain unsatisfactory and generally do not incorporate molecularly-targeted agents aside from FLT3 inhibitors outside of the relapse setting. Patient-derived xenograft (PDX) models of AML are increasingly accessible for the preclinical evaluation of targeted therapies, though the degree to which these systems recapitulate the disease state as found in patients has not been well defined for AML. Gene expression profiling of patient blasts has been successfully used to discriminate distinct subtypes of AML, to uncover sub-type specific vulnerabilities, and to predict response to therapy and outcomes. We sought to systematically examine PDX models of pediatric AML for their ability to replicate global gene expression patterns and preserve mutational signatures found in patients. In addition, we conducted in-depth bioinformatic analyses of samples with the cryptic CBFA2T3-GLIS2 fusion generated by the inv(16)(p13.3q24.3) for identification of potential novel targeted therapies. We performed detailed analyses of RNA sequencing data from a diverse series of 24 pediatric AML PDX models established from samples obtained from patients with relapse and refractory disease. Initially we compared our PDX data against 49 selected relapse and refractory patient sample data files found in the NCI TARGET dataset of pediatric AML. When applying unsupervised hierarchical clustering to the PDX samples, we found that clustering was associated with MLL status. Clustering of the combined sets of samples by MLL status showed integration of samples according to mutation profile, regardless of data source (PDX or patient). The expression levels of all detectable transcripts were highly conserved between PDX and patient MLL-r samples. Separate analysis of each dataset yielded MLL specific gene lists that included a subset of overlapping genes which may point to a unique relapse and refractory pediatric MLL-r signature. 
This list contains several interesting new targets for further study. A subset of 12 PDX models were compared directly to the matched patient sample from which they were established. This analysis revealed strong similarity, with each PDX most closely related to its matched patient sample, suggesting retention of sample-specific gene expression in immune deficient mice. We set up our PDX models in NSG mice with transgenic expression of human myelo-supportive cytokines SCF, GM-CSF, and IL-3 in order to promote the most efficient and robust engraftment of precious patient material. In order to detect any skewing effects due to the host mouse strain, we compared NSGS PDX RNA sequencing data to 10 matched NSG PDX models. This comparison revealed consistent differences in only 9 transcripts, which were almost entirely related to increased JAK/STAT signaling and macrophage activation pathways in NSGS mice relative to NSG mice. Interestingly, during this analysis we observed a distinct PCA-driven clustering of a pair of PDX samples with previously clinically unidentified driver mutations. Reanalysis of the RNA sequencing data revealed evidence of a cryptic GLIS2 rearrangement (found in ~1% of pediatric AML cases) as the driver mutation, which was subsequently confirmed by RT-PCR in both samples. The unique CBFA2T3/GLIS2 RNA signature was mined to guide the composition of a focused 75-molecule in vitro drug screen against ex vivo PDX samples with an emphasis on the SHH, WNT, and BCL2 pathways. This screen identified the Wnt-C59 PORCN inhibitor as having specific activity against CBFA2T3/GLIS2+ AMLs. Further testing of C-59 in combinatorial studies revealed enhanced effects with the addition of the BCL2 inhibitor, venetoclax. In vivo experiments are currently underway to determine the pre-clinical efficacy of this novel combination. In summary, we found highly significant fidelity of gene expression in PDX models of relapse and refractory pediatric AML. 
Analysis of this dataset has led to several insights, including potential targeted therapies, highlighting how this system could be a valuable tool for discovery of novel targeted therapies, especially for very rare, distinct subtypes of disease. Disclosures Perentesis: Kurome Therapeutics: Consultancy.

SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references

Briefings in Bioinformatics ◽

10.1093/bib/bbz166 ◽

2020 ◽

Cited By ~ 13

Author(s):

Meichen Dong ◽

Aatish Thennavan ◽

Eugene Urrutia ◽

Yun Li ◽

Charles M Perou ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Mixed Cell ◽

Single Cell Rna Sequencing

Abstract Recent advances in single-cell RNA sequencing (scRNA-seq) enable characterization of transcriptomic profiles with single-cell resolution and circumvent averaging artifacts associated with traditional bulk RNA sequencing (RNA-seq) data. Here, we propose SCDC, a deconvolution method for bulk RNA-seq that leverages cell-type specific gene expression profiles from multiple scRNA-seq reference datasets. SCDC adopts an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets that are produced in different laboratories and at different times, implicitly addressing the problem of batch-effect confounding. SCDC is benchmarked against existing methods using both in silico generated pseudo-bulk samples and experimentally mixed cell lines, whose known cell-type compositions serve as ground truths. We show that SCDC outperforms existing methods with improved accuracy of cell-type decomposition under both settings. To illustrate how the ENSEMBLE framework performs in complex tissues under different scenarios, we further apply our method to a human pancreatic islet dataset and a mouse mammary gland dataset. SCDC returns results that are more consistent with experimental designs and that reproduce more significant associations between cell-type proportions and measured phenotypes.
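
The abstract says that SCDC integrates deconvolution results obtained from several scRNA-seq references. One simple way to picture such an integration is a weighted average of the per-reference cell-type proportion estimates, sketched below. The weights are taken as given here, which is an assumption: the abstract does not describe how SCDC chooses them, and the function and cell-type names are hypothetical.

```python
def ensemble_proportions(estimates, weights):
    """Combine per-reference cell-type proportions by a weighted average.

    `estimates` is a list of dicts, one per reference dataset, each mapping
    the same cell types to estimated proportions that sum to 1.
    `weights` holds one non-negative weight per reference (assumed given).
    """
    if len(estimates) != len(weights) or not estimates:
        raise ValueError("need one weight per reference estimate")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    cell_types = list(estimates[0])
    combined = {
        ct: sum(w * est[ct] for w, est in zip(weights, estimates)) / total_weight
        for ct in cell_types
    }
    # Renormalize to guard against rounding drift in the inputs.
    scale = sum(combined.values())
    return {ct: value / scale for ct, value in combined.items()}


# Hypothetical estimates for one bulk sample from two reference datasets.
from_reference_1 = {"alpha": 0.6, "beta": 0.4}
from_reference_2 = {"alpha": 0.4, "beta": 0.6}

equal = ensemble_proportions([from_reference_1, from_reference_2], [1.0, 1.0])
favor_first = ensemble_proportions([from_reference_1, from_reference_2], [3.0, 1.0])
```

With equal weights the two references are averaged; giving the first reference three times the weight pulls the combined estimate toward it.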

Machine Learning-Assisted Identification of Factors Contributing to the Technical Variability Between Bulk and Single-Cell RNA-Seq Experiments

10.21203/rs.3.rs-1247889/v1 ◽

2022 ◽

Author(s):

Sofya Lipnitskaya ◽

Yang Shen ◽

Stefan Legewie ◽

Holger Klein ◽

Kolja Becker

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Quantitative Difference ◽

Rna Seq ◽

Sequencing Data ◽

Factors Affecting ◽

Expression Variability ◽

Technical Variability

Abstract Background: Recent studies in the area of transcriptomics performed on single-cell and population levels reveal noticeable variability in gene expression measurements provided by different RNA sequencing technologies. Due to the increased noise and complexity of single-cell RNA-Seq (scRNA-Seq) data compared with bulk experiments, there is a substantial number of variably-expressed genes and so-called dropouts, challenging the subsequent computational analysis and potentially leading to false positive discoveries. In order to investigate factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data that had undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements as well as dropout events are not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detection of factors contributing to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths to be the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: Here, we investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed us to identify factors affecting quantitative differences in gene expression measurements as well as the presence of dropouts. 
Hereby, the derived knowledge can be employed further in order to improve the interpretation of RNA-Seq data and identify genes that can be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.

Strategies for cellular deconvolution in human brain RNA sequencing data

10.1101/2020.01.19.910976 ◽

2020 ◽

Cited By ~ 1

Author(s):

Olukayode A. Sosina ◽

Matthew N Tran ◽

Kristen R Maynard ◽

Ran Tao ◽

Margaret A. Taub ◽

...

Keyword(s):

Gene Expression ◽

Cell Size ◽

Expression Profiles ◽

Brain Regions ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Sequencing Data ◽

Target User ◽

Size Estimates

Abstract Statistical deconvolution strategies have emerged over the past decade to estimate the proportion of various cell populations in homogenate tissue sources like brain using gene expression data. Here we show that several existing deconvolution algorithms estimate the RNA composition of homogenate tissue, which relates to the amount of RNA attributable to each cell type, and not the cellular composition, which relates to the underlying fraction of cells. Incorporating “cell size” parameters into RNA-based deconvolution algorithms can successfully recover cellular fractions in homogenate brain RNA-seq data. Lastly, we show that using both cell sizes and cell type-specific gene expression profiles from brain regions other than the target/user-provided bulk tissue RNA-seq dataset consistently results in biased cell fractions. We report several independently constructed cell size estimates as a community resource and extend the MuSiC framework to accommodate these cell size estimates (https://github.com/xuranw/MuSiC/).
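
The distinction between RNA composition and cellular composition can be made concrete with a small calculation. The sketch below assumes that the RNA contributed by a cell type is proportional to its cell fraction multiplied by a per-cell "size" (average RNA content), so dividing each RNA fraction by the size and renormalizing recovers cell fractions. This proportionality, the function name, and the numbers are illustrative assumptions; the abstract does not give the exact formulation used in the extended MuSiC framework.

```python
def rna_to_cell_fractions(rna_fractions, cell_sizes):
    """Convert RNA fractions to cell fractions using per-cell-type sizes.

    Assumes rna_fraction[ct] is proportional to cell_fraction[ct] * size[ct],
    where size is a relative measure of RNA content per cell.
    """
    weights = {}
    for cell_type, rna_fraction in rna_fractions.items():
        size = cell_sizes[cell_type]
        if size <= 0:
            raise ValueError(f"cell size for {cell_type} must be positive")
        weights[cell_type] = rna_fraction / size
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("RNA fractions must contain a positive value")
    return {cell_type: w / total for cell_type, w in weights.items()}


# Hypothetical example: neurons carry four times as much RNA per cell as glia.
rna = {"neuron": 0.8, "glia": 0.2}
sizes = {"neuron": 4.0, "glia": 1.0}
cells = rna_to_cell_fractions(rna, sizes)
```

In this example neurons account for 80% of the RNA but only half of the cells, which is the kind of gap between RNA fraction and cell fraction that the abstract describes.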
