Effects of duplicated mapped read PCR artifacts on RNA-seq differential expression analysis based on qRNA-seq

2018 ◽  
Author(s):  
Anna C. Salzberg ◽  
Jiafen Hu ◽  
Elizabeth J. Conroy ◽  
Nancy M. Cladel ◽  
Robert M. Brucklacher ◽  
...  

Abstract Best practices for handling duplicated mapped reads in RNA-seq analyses have long been discussed, but a gold standard method has yet to be established, because such duplicates could originate from valid biological transcripts or could be PCR-related artifacts. Here we used the NEXTflex™ qRNA-Seq™ (aka Molecular Indexing™) technology to identify PCR duplicates via the random attachment of unique molecular labels to each cDNA molecule prior to PCR amplification. We found that up to 64.3% of the single end and 19.3% of the mouse paired end duplicates originated from valid biological transcripts rather than PCR artifacts. For single end reads, either removing or retaining all duplicates resulted in a substantial number of false positives (up to 47.0%) and false negatives (up to 12.1%) in the sets of significantly differentially expressed genes. For paired end reads, only the alignment retaining all duplicates resulted in a substantial number of false positives. This is the first effort to evaluate the performance of qRNA-seq using ‘real-world’ biomedical samples, and we found that PCR duplicate identification provided minor benefits for paired end reads but greatly improved the sensitivity and specificity in the determination of the significantly differentially expressed genes for single end reads.
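The distinction drawn in this abstract can be sketched in code: reads that share a mapping position are computational duplicates, but only those that also share a molecular label are treated as PCR duplicates. The following is a minimal Python sketch under stated assumptions. The `Read` record, its fields, and the function name `split_duplicates` are hypothetical and are not part of the NEXTflex kit or of any published pipeline; a real tool would also need to tolerate sequencing errors in the labels and handle paired end fragments.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical minimal read record; real alignments carry far more fields.
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str  # molecular label attached before PCR amplification


def split_duplicates(reads):
    """Classify position-duplicates as PCR or biological duplicates.

    Returns (kept, pcr_duplicates, biological_duplicates), where `kept`
    holds one read per distinct (position, label) combination.
    """
    by_position = defaultdict(list)
    for read in reads:
        by_position[(read.chrom, read.pos, read.strand)].append(read)

    kept, pcr_dups, bio_dups = [], [], []
    for group in by_position.values():
        seen_umis = set()
        for i, read in enumerate(group):
            if read.umi in seen_umis:
                # Same position and same label: assumed to be a PCR copy.
                pcr_dups.append(read)
                continue
            seen_umis.add(read.umi)
            kept.append(read)
            if i > 0:
                # Same position, new label: a distinct original molecule.
                bio_dups.append(read)
    return kept, pcr_dups, bio_dups
```

Removing all position-duplicates would discard both the second and third categories, whereas retaining everything would keep the PCR copies; the label-aware split keeps only the reads in `kept`.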


2014 ◽  
Author(s):  
Zong Hong Zhang ◽  
Dhanisha J. Jhaveri ◽  
Vikki M. Marshall ◽  
Denis C. Bauer ◽  
Janette Edson ◽  
...  

Recent advances in next-generation sequencing technology allow high-throughput cDNA sequencing (RNA-Seq) to be widely applied in transcriptomic studies, in particular for detecting differentially expressed genes between groups. Many software packages have been developed for the identification of differentially expressed genes (DEGs) between treatment groups based on RNA-Seq data. However, there is a lack of consensus on how to approach an optimal study design and choice of suitable software for the analysis. In this comparative study we evaluate the performance of three of the most frequently used software tools: Cufflinks-Cuffdiff2, DESeq and edgeR. A number of important parameters of RNA-Seq technology were taken into consideration, including the number of replicates, sequencing depth, and balanced vs. unbalanced sequencing depth within and between groups. We benchmarked results relative to sets of DEGs identified through either quantitative RT-PCR or microarray. We observed that edgeR performs slightly better than DESeq and Cuffdiff2 in terms of the ability to uncover true positives. Overall, DESeq or taking the intersection of DEGs from two or more tools is recommended if the number of false positives is a major concern in the study. In other circumstances, edgeR is slightly preferable for differential expression analysis at the expense of potentially introducing more false positives.
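The recommendation to take the intersection of DEGs from two or more tools can be illustrated with a short sketch. The function name `consensus_degs`, the `min_tools` parameter, and the gene lists below are hypothetical and made up for illustration; they are not taken from the study.

```python
from collections import Counter


def consensus_degs(calls_by_tool, min_tools=2):
    """Return the genes reported as DEGs by at least `min_tools` tools.

    `calls_by_tool` maps a tool name to the collection of gene IDs it called.
    """
    votes = Counter()
    for genes in calls_by_tool.values():
        votes.update(set(genes))  # one vote per tool, even if a gene repeats
    return {gene for gene, n in votes.items() if n >= min_tools}


# Made-up gene lists, for illustration only.
calls = {
    "edgeR": {"geneA", "geneB", "geneC", "geneD"},
    "DESeq": {"geneA", "geneB"},
    "Cuffdiff2": {"geneB", "geneC"},
}
print(sorted(consensus_degs(calls, min_tools=2)))
print(sorted(consensus_degs(calls, min_tools=3)))
```

Raising `min_tools` trades sensitivity for fewer false positives, which mirrors the trade-off the abstract describes between edgeR alone and a consensus of tools.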



2015 ◽  
Author(s):  
Swati Parekh ◽  
Christoph Ziegenhain ◽  
Beate Vieth ◽  
Wolfgang Enard ◽  
Ines Hellmann

Background: Currently, quantitative RNA-Seq methods are pushed to work with increasingly small starting amounts of RNA that require PCR amplification to generate libraries. However, it is unclear how much noise or bias amplification introduces and how this affects the precision and accuracy of RNA quantification. To assess the effects of amplification, reads that originated from the same RNA molecule (PCR-duplicates) need to be identified. Computationally, read duplicates are defined via their mapping position, which does not distinguish PCR- from natural duplicates that are bound to occur for highly transcribed RNAs. Hence, it is unclear how to treat duplicate reads and how important it is to reduce PCR amplification experimentally. Here, we generate and analyse RNA-Seq datasets that were prepared with three different protocols (Smart-Seq, TruSeq and UMI-seq). We find that a large fraction of computationally identified read duplicates can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates does not improve accuracy, power or false discovery rates, but can actually worsen them. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and false discovery rate are only mildly improved. However, we do find that power does improve with fewer PCR amplification cycles across datasets and that early barcoding of samples and hence PCR amplification in one reaction can restore this loss of power. Conclusions: Computational removal of read duplicates is not recommended for differential expression analysis. However, the pooling of samples as made possible by the early barcoding of the UMI-protocol leads to an appreciable increase in the power to detect differentially expressed genes.
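Why natural duplicates are "bound to occur" for highly transcribed RNAs can be seen from a simple sampling argument. The sketch below assumes, purely for illustration, that single end reads start uniformly and independently at one of `n_positions` possible sites on a transcript; this uniform model and the function name `expected_duplicate_fraction` are assumptions, not the authors' method. Under that model the expected number of distinct start positions among N reads is P(1 - (1 - 1/P)^N), and every read beyond the first at a position would be flagged as a duplicate even without any PCR amplification.

```python
def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads flagged as position-duplicates.

    Assumes reads start uniformly and independently at `n_positions` sites,
    with no PCR amplification at all (an idealized model).
    """
    if n_reads <= 0 or n_positions <= 0:
        raise ValueError("n_reads and n_positions must be positive")
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - expected_distinct / n_reads


# The duplicate fraction grows with coverage of the same transcript.
for n in (100, 1000, 10000):
    print(n, round(expected_duplicate_fraction(n, n_positions=1000), 3))
```

Fragmentation bias, which the abstract also names, would concentrate reads on fewer positions and push the duplicate fraction higher than this uniform model predicts.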



2012 ◽  
Vol 2012 ◽  
pp. 1-8 ◽  
Author(s):  
Rashi Gupta ◽  
Isha Dewan ◽  
Richa Bharti ◽  
Alok Bhattacharya

RNA-Seq is increasingly being used for gene expression profiling. In this approach, next-generation sequencing (NGS) platforms are used for sequencing. Owing to their highly parallel nature, millions of reads are generated in a short time and at low cost. Analysis of the data is therefore a major challenge, and the development of statistical and computational methods is essential for drawing meaningful conclusions from such large datasets. Here, we assessed three different types of normalization (transcript parts per million, trimmed mean of M values, quantile normalization) and evaluated whether normalized data reduce technical variability across replicates. In addition, we proposed two novel methods for detecting differentially expressed genes between two biological conditions: (i) a likelihood ratio method, and (ii) a Bayesian method. Our proposed methods were tested on three real datasets. They performed at least as well as, and often better than, existing methods for the analysis of differential expression.
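Of the three normalizations named above, quantile normalization is the simplest to show in a few lines. The sketch below is a generic textbook version written in plain Python, not the authors' implementation; the function name `quantile_normalize` and the list-of-rows input layout are assumptions made for illustration.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Each sample (column) is forced onto the same distribution: the mean of
    the sorted columns. Ties are broken by row order here for simplicity;
    fuller implementations average the values of tied ranks.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: mean across samples at each rank.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[r] for col in sorted_cols) / n_samples for r in range(n_genes)
    ]

    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])  # stable sort
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized
```

After normalization every sample has an identical set of values, so any remaining differences between replicates lie only in which genes occupy which ranks.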



2020 ◽  
Author(s):  
Chanwoo Kim ◽  
Hanbin Lee ◽  
Juhee Jeong ◽  
Keehoon Jung ◽  
Buhm Han

Abstract A common approach to analyzing single-cell RNA-sequencing data is to cluster cells first and then identify differentially expressed genes based on the clustering result. However, clustering has an innate uncertainty and can be imperfect, undermining the reliability of differential expression analysis results. To overcome this challenge, we present MarcoPolo, a clustering-free approach to exploring differentially expressed genes. To find informative genes without clustering, MarcoPolo exploits the bimodality of gene expression to learn the group information of the cells with respect to the expression level directly from given data. Using simulations and real data analyses, we showed that our method puts biologically informative genes at higher ranks more accurately and robustly than other existing methods. As our method provides information on how cells can be grouped for each gene, it can help identify cell types that are not separated well in the standard clustering process. Our method can also be used as a feature selection method to improve the robustness against changes in the number of genes used in clustering.



2018 ◽  
Author(s):  
Almas Jabeen ◽  
Nadeem Ahmad ◽  
Khalid Raza

Zika virus (ZIKV) is considered an emerging viral threat because of its link to conditions such as microcephaly and Guillain-Barré syndrome in humans. In this study, we adopted an RNA-seq analysis pipeline to quantify RNA-seq data into read counts and identify differentially expressed genes (DEGs). Our analysis uncovers significant DEGs that may be involved in altered biological processes. We report the list of significant DEGs, three of which are highly differentially expressed. In addition, our analysis predicts moderate and low DEGs whose differential expression was induced by ZIKV infection.





Viruses ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 244 ◽  
Author(s):  
Antonio Victor Campos Coelho ◽  
Rossella Gratton ◽  
João Paulo Britto de Melo ◽  
José Leandro Andrade-Santos ◽  
Rafael Lima Guimarães ◽  
...  

HIV-1 infection elicits complex dynamics in the expression of various host genes. High-throughput sequencing has added a substantial amount of information regarding HIV-1 infection and pathogenesis. RNA sequencing (RNA-Seq) is currently the tool of choice to investigate gene expression in a wide range of experimental settings. This study aims to perform a meta-analysis of RNA-Seq expression profiles in samples of HIV-1 infected CD4+ T cells compared to uninfected cells to assess consistently differentially expressed genes in the context of HIV-1 infection. We selected two studies (22 samples: 15 experimentally infected and 7 mock-infected). We found 208 differentially expressed genes in infected cells when compared to uninfected/mock-infected cells. This result had moderate overlap when compared to previous studies of HIV-1 infection transcriptomics, but we identified 64 genes already known to interact with HIV-1 according to the HIV-1 Human Interaction Database. A gene ontology (GO) analysis revealed enrichment of several pathways involved in immune response, cell adhesion, cell migration, inflammation, apoptosis, Wnt, Notch and ERK/MAPK signaling.



2019 ◽  
Vol 32 (5) ◽  
pp. 515-526 ◽  
Author(s):  
William E. Fry ◽  
Sean P. Patev ◽  
Kevin L. Myers ◽  
Kan Bao ◽  
Zhangjun Fei

Sporangia of Phytophthora infestans from pure cultures on agar plates are typically used in lab studies, whereas sporangia from leaflet lesions drive natural infections and epidemics. Multiple assays were performed to determine if sporangia from these two sources are equivalent. Sporangia from plate cultures showed much lower rates of indirect germination and produced much less disease in field and moist-chamber tests. This difference in aggressiveness was observed whether the sporangia had been previously incubated at 4°C (to induce indirect germination) or at 21°C (to prevent indirect germination). Furthermore, lesions caused by sporangia from plates produced much less sporulation. RNA-Seq analysis revealed that thousands of the >17,000 P. infestans genes with a RPKM (reads per kilobase of exon model per million mapped reads) >1 were differentially expressed in sporangia obtained from plate cultures of two independent field isolates compared with sporangia of those isolates from leaflet lesions. Among the significant differentially expressed genes (DEGs), putative RxLR effectors were overrepresented, with almost half of the 355 effectors with RPKM >1 being up- or downregulated. DEGs of both isolates include nine flagellar-associated genes, and all were down-regulated in plate sporangia. Ten elicitin genes were also detected as DEGs in both isolates, and nine (including INF1) were up-regulated in plate sporangia. These results corroborate previous observations that sporangia produced from plates and leaflets sometimes yield different experimental results and suggest hypotheses for potential mechanisms. We caution that use of plate sporangia in assays may not always produce results reflective of natural infections and epidemics.
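The RPKM measure used as the expression threshold above follows directly from its expansion (reads per kilobase of exon model per million mapped reads). A minimal sketch of the arithmetic is given here; the function names and the example numbers are hypothetical and are not data from the study.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) gives the 1e9 factor.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def expressed_genes(counts, lengths, total_mapped_reads, threshold=1.0):
    """Return gene IDs whose RPKM exceeds `threshold` (RPKM > 1 by default)."""
    return [
        gene
        for gene, n in counts.items()
        if rpkm(n, lengths[gene], total_mapped_reads) > threshold
    ]


# Made-up example: 500 reads on a 2,000 bp exon model, 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))
```

With these illustrative inputs the value is 500 × 10^9 / (2,000 × 20,000,000) = 12.5, so the gene would pass an RPKM > 1 filter.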



2021 ◽  
Author(s):  
Chengang Guo ◽  
Zhimin Wei ◽  
Wei Lyu ◽  
Yanlou Geng

Abstract Quinoa saponins have complex and diverse physiological activities. However, the key regulatory genes for quinoa saponin metabolism are not yet well studied. The purpose of this study was to explore genes closely related to quinoa saponin metabolism. The significantly differentially expressed genes in yellow quinoa were first screened based on RNA-seq technology. Then, the key genes for saponin metabolism were selected by gene set enrichment analysis (GSEA) and principal component analysis (PCA). Finally, the specificity of the key genes was verified by hierarchical clustering. The differential analysis yielded 1654 differentially expressed genes after removal of pseudogenes. Of these, 142 were long non-coding genes and 1512 were protein-coding genes. Based on GSEA, 116 key candidate genes were found to be significantly correlated with quinoa saponin metabolism. Through PCA dimension reduction, 57 key genes were finally obtained. Hierarchical cluster analysis further demonstrated that these key genes can clearly separate the four groups of samples. The present results could provide references for the breeding of sweet quinoa and would be helpful for the rational utilization of quinoa saponins.


