Rigorous Benchmarking of HLA Callers for RNA Sequencing Data

SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

10.1101/344242 ◽

2018 ◽

Cited By ~ 2

Author(s):

Xianwen Ren ◽

Liangtao Zheng ◽

Zemin Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Large Scale ◽

Random Projection ◽

Rna Seq ◽

Sequencing Data ◽

Computational Framework ◽

Human Blood Cells ◽

Single Cell Rna Sequencing ◽

Data Volume

Abstract Clustering is a prevalent means of analyzing single cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose a new clustering framework based on random projection and feature construction for large scale single-cell RNA sequencing data, which greatly improves clustering accuracy, robustness and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while using only 66% of the memory of the widely used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold, depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

arcasHLA: high resolution HLA typing from RNA seq

10.1101/479824 ◽

2018 ◽

Cited By ~ 1

Author(s):

Rose Orenbuch ◽

Ioan Filip ◽

Devon Comito ◽

Jeffrey Shaman ◽

Itsik Pe'er ◽

...

Keyword(s):

Rna Sequencing ◽

Mhc Class Ii ◽

Critical Role ◽

Human Leukocyte ◽

Hla Typing ◽

Sequencing Data ◽

Leukocyte Antigen ◽

Hla Genes ◽

Biological Dataset ◽

High Level

The human leukocyte antigen (HLA) locus makes up the major histocompatibility complex (MHC) and plays a critical role in host response to disease, including cancers and autoimmune disorders. In the clinical setting, HLA typing is necessary for determining tissue compatibility. Recent improvements in the quality and accessibility of next-generation sequencing have made HLA typing from standard short-read data practical. However, this task remains challenging given the high level of polymorphism and homology between the HLA genes. HLA typing from RNA sequencing is further complicated by post-transcriptional splicing and bias due to amplification. Here, we present arcasHLA: a fast and accurate in silico tool that infers HLA genotypes from RNA sequencing data. Our tool outperforms established tools on the gold-standard benchmark dataset for HLA typing in terms of both accuracy and speed, with an accuracy rate of 100% at two-field precision for MHC class I genes, and over 99.7% for MHC class II. Importantly, arcasHLA takes as its input pre-aligned BAM files, and outputs three-field resolution for all HLA genes in less than 2 minutes. Finally, we evaluate the performance of our tool on a new biological dataset of 447 single-end total RNA samples from nasopharyngeal swabs, and establish the applicability of arcasHLA in metatranscriptome studies. arcasHLA is available at https://github.com/RabadanLab/arcasHLA.
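To make "accuracy at two-field precision" concrete, the sketch below truncates HLA allele names to a chosen number of colon-separated fields and counts how many called alleles agree with a reference genotype. It is only an illustration of the general idea, not the evaluation procedure used by arcasHLA; the function names (`truncate_allele`, `matching_alleles`, `accuracy`) and the example genotypes are hypothetical, and expression suffixes such as `N` are not handled.

```python
from collections import Counter


def truncate_allele(allele, fields=2):
    """Keep the first `fields` colon-separated fields, e.g. 'A*02:01:01:01' -> 'A*02:01'."""
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])


def matching_alleles(called, truth, fields=2):
    """Count alleles shared by two genotypes, ignoring allele order."""
    called_counts = Counter(truncate_allele(a, fields) for a in called)
    truth_counts = Counter(truncate_allele(a, fields) for a in truth)
    # Multiset intersection keeps the minimum count of each allele.
    return sum((called_counts & truth_counts).values())


def accuracy(calls, truths, fields=2):
    """Fraction of reference alleles recovered across all samples."""
    matched = sum(matching_alleles(c, t, fields) for c, t in zip(calls, truths))
    total = sum(len(t) for t in truths)
    return matched / total


# Hypothetical genotypes for one gene in two samples (illustrative values only).
calls = [("A*02:01:01", "A*24:02:01"), ("A*01:01:01", "A*03:01:01")]
truths = [("A*24:02", "A*02:01"), ("A*01:01", "A*11:01")]

two_field_accuracy = accuracy(calls, truths, fields=2)
print(two_field_accuracy)
```

Comparing genotypes as multisets means a call is not penalized for listing the two alleles of a gene in a different order than the reference.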

Comprehensive analysis of RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues

10.1101/053041 ◽

2016 ◽

Cited By ~ 1

Author(s):

Serghei Mangul ◽

Harry Taegyun Yang ◽

Nicolas Strauli ◽

Franziska Gruhl ◽

Hagit T. Porath ◽

...

Keyword(s):

B Cell ◽

Rna Sequencing ◽

Single Cells ◽

Cell Receptor ◽

Tissue Expression ◽

Read Length ◽

Rna Seq ◽

Disease Etiology ◽

Rna Molecules ◽

Sequencing Technologies

Abstract High-throughput RNA sequencing technologies have provided invaluable research opportunities across distinct scientific domains by producing quantitative readouts of the transcriptional activity of both entire cellular populations and single cells. The majority of RNA-Seq analyses begin by mapping each experimentally produced sequence (i.e., read) to a set of annotated reference sequences for the organism of interest. For both biological and technical reasons, a significant fraction of reads remains unmapped. In this work, we develop Read Origin Protocol (ROP) to discover the source of all reads originating from complex RNA molecules, recombinant T and B cell receptors, and microbial communities. We applied ROP to 8,641 samples across 630 individuals from 54 tissues. A fraction of RNA-Seq data (n=86) was obtained in-house; the remaining data was obtained from the Genotype-Tissue Expression (GTEx v6) project. To generalize the reported number of accounted reads, we also performed ROP analysis on thousands of different, randomly selected, and publicly available RNA-Seq samples in the Sequence Read Archive (SRA). Our approach can account for 99.9% of 1 trillion reads of various read lengths across the merged dataset (n=10641). Using in-house RNA-Seq data, we show that immune profiles of asthmatic individuals are significantly different from the profiles of control individuals, with decreased average per sample T and B cell receptor diversity. We also show that immune diversity is inversely correlated with microbial load. Our results demonstrate the potential of ROP to exploit unmapped reads in order to better understand the functional mechanisms underlying connections between the immune system, microbiome, human gene expression, and disease etiology. ROP is freely available at https://github.com/smangul1/rop and currently supports human and mouse RNA-Seq reads.

RsQTL: correlation of expressed SNVs with splicing using RNA-sequencing data

10.1101/840504 ◽

2019 ◽

Cited By ~ 1

Author(s):

Justin Sein ◽

Liam F. Spurr ◽

Pavlos Bousounis ◽

N M Prashant ◽

Hongyu Liu ◽

...

Keyword(s):

Rna Sequencing ◽

Tissue Expression ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Dynamic Nature ◽

Dynamic Variation ◽

Exon Junctions ◽

Variant Allele Fraction ◽

Allele Fraction

Summary RsQTL is a tool for identification of splicing quantitative trait loci (sQTLs) from RNA-sequencing (RNA-seq) data by correlating the variant allele fraction at expressed SNV loci in the transcriptome (VAF_RNA) with the proportion of molecules spanning local exon-exon junctions at loci with differential intron excision (percent spliced in, PSI). We exemplify the method on sets of RNA-seq data from human tissues obtained through the Genotype-Tissue Expression Project (GTEx). RsQTL does not require matched DNA and can identify a subset of expressed sQTL loci. Due to the dynamic nature of VAF_RNA, RsQTL is applicable for the assessment of conditional and dynamic variation-splicing relationships. Availability and implementation https://github.com/HorvathLab/[email protected] or [email protected] Supplementary Information RsQTL_Supplementary_Data.zip
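The core quantity described in this summary, a correlation between the variant allele fraction at an expressed SNV and the proportion of reads supporting a nearby junction, can be sketched as follows. This is a minimal illustration using a plain Pearson correlation, not RsQTL's actual statistical model or code; the function names, the read-count layout and the numbers are all hypothetical.

```python
from math import sqrt


def vaf(alt_reads, ref_reads):
    """Variant allele fraction: variant-supporting reads over all reads at the SNV."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover the SNV")
    return alt_reads / total


def junction_fraction(junction_reads, other_reads):
    """Proportion of reads at the locus that span the junction of interest."""
    total = junction_reads + other_reads
    if total == 0:
        raise ValueError("no reads cover the locus")
    return junction_reads / total


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample read counts:
# (alt reads, ref reads, junction-spanning reads, other reads at the locus)
samples = [(2, 18, 5, 45), (10, 10, 25, 25), (18, 2, 44, 6)]

vafs = [vaf(alt, ref) for alt, ref, _, _ in samples]
fractions = [junction_fraction(junc, other) for _, _, junc, other in samples]
r = pearson(vafs, fractions)
print(f"VAF vs junction fraction: r = {r:.3f}")
```

Because both quantities are computed from RNA-seq reads alone, a sketch like this needs no matched DNA genotypes, which mirrors the point made in the summary.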

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

arcasHLA: high-resolution HLA typing from RNAseq

Bioinformatics ◽

10.1093/bioinformatics/btz474 ◽

2019 ◽

Vol 36 (1) ◽

pp. 33-40 ◽

Cited By ~ 11

Author(s):

Rose Orenbuch ◽

Ioan Filip ◽

Devon Comito ◽

Jeffrey Shaman ◽

Itsik Pe’er ◽

...

Keyword(s):

Rna Sequencing ◽

Critical Role ◽

Human Leukocyte ◽

Hla Typing ◽

Supplementary Information ◽

Sequencing Data ◽

Leukocyte Antigen ◽

Biological Dataset ◽

Hla Genotypes ◽

High Level

Abstract Motivation The human leukocyte antigen (HLA) locus plays a critical role in tissue compatibility and regulates the host response to many diseases, including cancers and autoimmune disorders. Recent improvements in the quality and accessibility of next-generation sequencing have made HLA typing from standard short-read data practical. However, this task remains challenging given the high level of polymorphism and homology between HLA genes. HLA typing from RNA sequencing is further complicated by post-transcriptional modifications and bias due to amplification. Results Here, we present arcasHLA: a fast and accurate in silico tool that infers HLA genotypes from RNA-sequencing data. Our tool outperforms established tools on the gold-standard benchmark dataset for HLA typing in terms of both accuracy and speed, with an accuracy rate of 100% at two-field resolution for Class I genes, and over 99.7% for Class II. Furthermore, we evaluate the performance of our tool on a new biological dataset of 447 single-end total RNA samples from nasopharyngeal swabs, and establish the applicability of arcasHLA in metatranscriptome studies. Availability and implementation arcasHLA is available at https://github.com/RabadanLab/arcasHLA. Supplementary information Supplementary data are available at Bioinformatics online.

RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-072018-021255 ◽

2019 ◽

Vol 2 (1) ◽

pp. 139-173 ◽

Cited By ~ 23

Author(s):

Koen Van den Berge ◽

Katharina M. Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Data Sets ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283v2 ◽

2018 ◽

Cited By ~ 1

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Enabling cross-study analysis of RNA-Sequencing data

10.1101/110734 ◽

2017 ◽

Cited By ~ 4

Author(s):

Qingguo Wang ◽

Joshua Armenia ◽

Chao Zhang ◽

Alexander V. Penson ◽

Ed Reznik ◽

...

Keyword(s):

Large Scale ◽

Tissue Expression ◽

Underlying Disease ◽

The Cancer Genome Atlas ◽

Rna Seq ◽

Sequencing Data ◽

Whole Transcriptome Sequencing ◽

Cancer Genome Atlas ◽

Next Generation Sequencing Ngs ◽

Different Sources

Abstract Driven by the recent advances of next generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, a multitude of large-scale studies were conducted recently that have resulted in an unprecedented volume of whole transcriptome sequencing (RNA-seq) data. While these data offer new opportunities to identify the mechanisms underlying disease, the comparison of data from different sources poses a great challenge, due to differences in sample and data processing. Here, we present a pipeline that processes and unifies RNA-seq data from different studies, which includes uniform realignment and gene expression quantification as well as batch effect removal. We find that uniform alignment and quantification are not sufficient when combining RNA-seq data from different sources and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from the Genotype Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA) and have successfully corrected for study-specific biases, enabling comparative analysis across studies. The normalized data are available for download via GitHub (at https://github.com/mskcc/RNAseqDB).