A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data

2019 ◽  
Vol 21 (4) ◽  
pp. 1261-1276 ◽  
Author(s):  
Moliang Chen ◽  
Guoli Ji ◽  
Hongjuan Fu ◽  
Qianmin Lin ◽  
Congting Ye ◽  
...  

Abstract Alternative polyadenylation (APA) has been implicated in post-transcriptional regulation through its effects on mRNA abundance, stability, localization and translation, and it contributes considerably to transcriptome diversity and gene expression regulation. RNA-seq has become a routine approach for transcriptome profiling, generating unprecedented data that could be used to identify and quantify APA site usage. A number of computational approaches for identifying APA sites and/or dynamic APA events from RNA-seq data have emerged in the literature, which provide valuable yet preliminary results that should be refined to yield credible guidelines for the scientific community. In this review, we provide a comprehensive overview of the status of currently available computational approaches. We also conducted an objective benchmarking analysis using RNA-seq data sets from different species (human, mouse and Arabidopsis) and simulated data sets to present a systematic evaluation of 11 representative methods. Our benchmarking study showed that the overall performance of all tools investigated is moderate, indicating that there is still a lot of scope to improve the prediction of APA sites or dynamic APA events from RNA-seq data. In particular, prediction results from individual tools differ considerably, and only a limited number of predicted APA sites or genes are common among different tools. Accordingly, we give some advice on how to assess the reliability of the obtained results. We also propose practical recommendations on the appropriate method for diverse scenarios and discuss implications and future directions relevant to profiling APA from RNA-seq data.

2014 ◽  
Author(s):  
Daehwan Kim ◽  
Ben Langmead ◽  
Steven Salzberg

HISAT is a new, highly efficient system for alignment of sequences from RNA sequencing experiments that achieves dramatically faster performance than previous methods. HISAT uses a new indexing scheme, hierarchical indexing, which is based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index. Hierarchical indexing employs two types of indexes for alignment: (1) a whole-genome FM index to anchor each alignment, and (2) numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp. The algorithm includes several customized alignment strategies specifically designed for mapping RNA-seq reads across multiple exons. In tests on a variety of real and simulated data sets, we show that HISAT is the fastest system currently available, approximately 50 times faster than TopHat2 and 12 times faster than GSNAP, with equal or better accuracy than any other method. Despite its very large number of indexes, HISAT requires only 4.3 Gigabytes of memory to align reads to the human genome. HISAT supports genomes of any size, including those larger than 4 billion bases. HISAT is available as free, open-source software from http://www.ccb.jhu.edu/software/hisat.
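
For readers unfamiliar with the FM index mentioned above, the following is a minimal Python sketch of the general technique: build a Burrows-Wheeler transform of a short text and count pattern occurrences by backward search. It is not HISAT's implementation and does not model hierarchical or local indexes; the function names (`bwt`, `build_fm_index`, `count_occurrences`) and the toy sequence are assumptions chosen purely for illustration.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations.

    Assumes text ends with a unique sentinel '$' that sorts before
    every other character (true for ASCII letters).
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def build_fm_index(text):
    """Build the two tables used by FM-index backward search (toy version)."""
    last = bwt(text)
    counts = Counter(last)
    # C[c]: number of characters in the text that sort strictly before c
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)


def count_occurrences(pattern, C, occ, n):
    """Count exact matches of pattern by processing it right to left."""
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Hypothetical toy "genome"; real indexes use compressed tables, not full arrays.
C, occ, n = build_fm_index("GATTACAGATTA$")
print(count_occurrences("GATTA", C, occ, n))
```

The sketch stores the full occurrence table for clarity; practical aligners sample it to save memory, which is one reason a real index for the human genome can fit in a few gigabytes.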


Author(s):  
Krzysztof J Szkop ◽  
David S Moss ◽  
Irene Nobeli

Abstract Motivation We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Krzysztof J. Szkop ◽  
David S. Moss ◽  
Irene Nobeli

Abstract Summary We present flexiMAP (flexible Modeling of Alternative PolyAdenylation), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. We show, using simulated data, that flexiMAP is very specific and outperforms existing methods in sensitivity, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognised caveats of existing methods. Availability The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3238619. Contact Irene Nobeli, [email protected]


2014 ◽  
Author(s):  
Nuno A Fonseca ◽  
John A Marioni ◽  
Alvis Brazma

Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found, with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And how close are the inferred expression levels to the 'true' expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g., read length and sequencing depth). In the absence of knowledge of the 'ground truth' in real RNA-seq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNA-seq data, it still allows us to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.


2019 ◽  
Vol 21 (6) ◽  
pp. 2052-2065 ◽  
Author(s):  
Arfa Mehmood ◽  
Asta Laiho ◽  
Mikko S Venäläinen ◽  
Aidan J McGlinchey ◽  
Ning Wang ◽  
...  

Abstract Differential splicing (DS) is a post-transcriptional biological process with critical, wide-ranging effects on a plethora of cellular activities and disease processes. To date, a number of computational approaches have been developed to identify and quantify differentially spliced genes from RNA-seq data, but a comprehensive intercomparison and appraisal of these approaches is currently lacking. In this study, we systematically evaluated 10 DS analysis tools for consistency and reproducibility, precision, recall and false discovery rate, agreement upon reported differentially spliced genes and functional enrichment. The tools were selected to represent the three different methodological categories: exon-based (DEXSeq, edgeR, JunctionSeq, limma), isoform-based (cuffdiff2, DiffSplice) and event-based methods (dSpliceType, MAJIQ, rMATS, SUPPA). Overall, all the exon-based methods and two event-based methods (MAJIQ and rMATS) scored well on the selected measures. Of the 10 tools tested, the exon-based methods performed generally better than the isoform-based and event-based methods. However, overall, the different data analysis tools performed strikingly differently across different data sets or numbers of samples.


2019 ◽  
Vol 45 (9) ◽  
pp. 1183-1198
Author(s):  
Gaurav S. Chauhan ◽  
Pradip Banerjee

Purpose Recent papers on target capital structure show that debt ratios seem to vary widely in space and time, implying that the functional specifications of target debt ratios are of little empirical use. Further, target behavior cannot be adjudged correctly using debt ratios, as they could revert due to mechanical reasons. The purpose of this paper is to develop an alternative testing strategy to test the target capital structure. Design/methodology/approach The authors treat a major “shock” to the debt ratios as an event and interpret a subsequent reversion as a movement toward a mean or target debt ratio. By doing this, the authors no longer need to identify target debt ratios as a function of firm-specific variables or any other rigid functional form. Findings Similar to the broad empirical evidence in developed economies, there is no perceptible and systematic mean reversion by Indian firms. However, unlike in developed countries, proportionate usage of debt to finance firms’ marginal financing deficits is extensive; equity is used rather sparingly. Research limitations/implications The trade-off theory could be convincingly refuted at least for the emerging market of India. The paper stimulates further research on the reasons for the specific financing behavior of emerging market firms. Practical implications The results show that firms’ financing choices depend not only on their own firm-specific variables but also on the financial markets in which they operate. Originality/value This study attempts to assess mean reversion in debt ratios in a unique but reassuring manner. The results are confirmed by extensive calibration of the testing strategy using simulated data sets.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Zhengjie Chen ◽  
Dengguo Tang ◽  
Jixing Ni ◽  
Peng Li ◽  
Le Wang ◽  
...  

Abstract Background Maize is one of the most important field crops in the world. Most of the key agronomic traits, including yield traits and plant architecture traits, are quantitative. Fine mapping of genes/quantitative trait loci (QTL) influencing a key trait is essential for marker-assisted selection (MAS) in maize breeding. However, SNP markers with high density and high polymorphism are lacking, especially kompetitive allele specific PCR (KASP) SNP markers that can be used for automatic genotyping. To date, a large volume of sequencing data has been produced by next generation sequencing technology, which provides a good pool of SNP loci for the development of SNP markers. In this study, we carried out a multi-step screening method to identify KASP SNP markers based on the RNA-Seq data sets of 368 maize inbred lines. Results A total of 2,948,985 SNPs were identified in the high-throughput RNA-Seq data sets, with an average density of 1.4 SNP/kb. Of these, 71,311 KASP SNP markers (an average density of 34 KASP SNP/Mb) were developed based on strict criteria: unique genomic region, bi-allelic, polymorphism information content (PIC) value ≥0.4, and conserved primer sequences, and were mapped on 16,161 genes. These 16,161 genes were annotated to 52 gene ontology (GO) terms, including most of the primary and secondary metabolic pathways. Subsequently, 50 KASP SNP markers with PIC values ranging from 0.14 to 0.5 in the 368 RNA-Seq data sets and with polymorphism between the maize inbred lines 1212 and B73 in in silico analysis were selected to experimentally validate the accuracy and polymorphism of the SNPs; of these, 46 SNPs (92.00%) showed polymorphism between the maize inbred lines 1212 and B73. Moreover, these 46 polymorphic SNPs were used to genotype the other 20 maize inbred lines, with all 46 SNPs showing polymorphism in the 20 maize inbred lines, and the PIC value of each SNP was 0.11 to 0.50 with an average of 0.35. The results suggested that the KASP SNP markers developed in this study were accurate and polymorphic. Conclusions These high-density polymorphic KASP SNP markers will be a valuable resource for map-based cloning of QTL/genes and marker-assisted selection in maize. Furthermore, the method used to develop SNP markers in maize can also be applied in other species.

