Detecting differentially expressed genes by smoothing effect of gene length on variance estimation

2015 ◽  
Vol 13 (06) ◽  
pp. 1542004 ◽  
Author(s):  
Jinyang Tang ◽  
Fei Wang

Next-generation sequencing technologies are widely used in genome research, and RNA sequencing (RNA-Seq) is becoming the main application for gene expression profiling. A large number of computational methods have been developed for detecting differentially expressed (DE) genes in RNA-Seq data. However, most existing algorithms tend to call long genes as DE, and short DE genes are rarely detected. In this work, we examine the influence of gene length on RNA-Seq data analysis, in particular its effect on variance estimation of RNA-Seq read counts, which is important for the statistical tests used to identify DE genes. We propose a balanced method that detects short DE genes with significance by smoothing a gene length factor. Computational experiments indicate that our method performs well. Software available: http://www.iipl.fudan.edu.cn/lenseq/
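One common explanation for the length bias described in this abstract is that, at a fixed expression level, expected read counts grow with gene length, so count-based tests gain power on long genes. The sketch below illustrates this under a simple Poisson assumption. It is not the authors' method, and `expected_z`, `depth`, and the rate values are hypothetical names and numbers chosen for illustration.

```python
import math


def expected_z(length_kb, depth, rate_a, rate_b):
    """Approximate z-statistic for comparing two Poisson counts.

    Assumption: the expected count in each condition is
    depth * length_kb * rate. The statistic (mu_a - mu_b) / sqrt(mu_a + mu_b)
    is evaluated at those expected counts.
    """
    mu_a = depth * length_kb * rate_a
    mu_b = depth * length_kb * rate_b
    return (mu_a - mu_b) / math.sqrt(mu_a + mu_b)


# Same two-fold change in per-kb expression, two different gene lengths.
short_gene = expected_z(length_kb=0.5, depth=10.0, rate_a=2.0, rate_b=1.0)
long_gene = expected_z(length_kb=8.0, depth=10.0, rate_a=2.0, rate_b=1.0)

# The statistic scales with sqrt(length): sqrt(8.0 / 0.5) = 4.
print(short_gene, long_gene, long_gene / short_gene)
```

Under this assumed model the same fold change produces a statistic four times larger for the 8 kb gene than for the 0.5 kb gene, which is one way to see why length-aware variance estimation could matter for short genes.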

2010 ◽  
Vol 28 (1) ◽  
pp. E6 ◽  
Author(s):  
Paul A. Northcott ◽  
James T. Rutka ◽  
Michael D. Taylor

Advances in the field of genomics have recently enabled the unprecedented characterization of the cancer genome, providing novel insight into the molecular mechanisms underlying malignancies in humans. The application of high-resolution microarray platforms to the study of medulloblastoma has revealed new oncogenes and tumor suppressors and has implicated changes in DNA copy number, gene expression, and methylation state in its etiology. Additionally, the integration of medulloblastoma genomics with patient clinical data has confirmed molecular markers of prognostic significance and highlighted the potential utility of molecular disease stratification. The advent of next-generation sequencing technologies promises to greatly transform our understanding of medulloblastoma pathogenesis in the next few years, permitting comprehensive analyses of all aspects of the genome and increasing the likelihood that genomic medicine will become part of the routine diagnosis and treatment of medulloblastoma.


2021 ◽  
Author(s):  
Sandi Yen ◽  
Jethro S. Johnson

Abstract: The gut microbiome is a major determinant of host health, yet it is only in the last 2 decades that the advent of next-generation sequencing has enabled it to be studied at a genomic level. Shotgun sequencing is beginning to provide insight into the prokaryotic as well as eukaryotic and viral components of the gut community, revealing not just their taxonomy, but also the functions encoded by their collective metagenome. This revolution in understanding is being driven by continued development of sequencing technologies and in consequence necessitates reciprocal development of computational approaches that can adapt to the evolving nature of sequence datasets. In this review, we provide an overview of current bioinformatic strategies for handling metagenomic sequence data and discuss their strengths and limitations. We then go on to discuss key technological developments that have the potential to once again revolutionise the way we are able to view and hence understand the microbiome.


2019 ◽  
Author(s):  
Eric Prince ◽  
Todd C. Hankinson

Abstract: High-throughput data are commonplace in biomedical research, as seen with technologies such as single-cell RNA sequencing (scRNA-seq) and other next-generation sequencing technologies. As these techniques are increasingly used, it is critical to have analysis tools that can identify meaningful complex relationships between variables (i.e., in the case of scRNA-seq: genes) without human bias. Moreover, it is equally important that both linear and non-linear (i.e., one-to-many) variable relationships be considered when contrasting datasets. HD Spot is a deep learning-based framework that generates an optimal interpretable classifier for a given high-throughput dataset using a simple genetic algorithm as well as an autoencoder-to-classifier transfer learning approach. Using four unique publicly available scRNA-seq datasets with published ground truth, we demonstrate the robustness of HD Spot and its ability to identify ontologically accurate gene lists for a given data subset. HD Spot serves as a bioinformatic tool that allows novice and advanced analysts to gain complex insight into their respective datasets, enabling the development of novel hypotheses.


2017 ◽  
Author(s):  
Stefan Wyder ◽  
Michael T. Raissig ◽  
Ueli Grossniklaus

Abstract: Genomic imprinting leads to different expression levels of maternally and paternally derived alleles. In recent years, major progress has been made in identifying novel imprinted candidate genes in plants, owing to affordable next-generation sequencing technologies. However, reports on sequencing the transcriptome of hybrid F1 seed tissues strongly disagree about how many and which genes are imprinted. This raises questions about the relative impact of biological, environmental, technical, and analytic differences or biases. Here, we adopt a statistical approach, frequently used in RNA-seq data analysis, which properly models count overdispersion and considers replicate information of reciprocal crosses. We show that our statistical pipeline outperforms other methods in identifying imprinted genes in simulated and real data. Accordingly, reanalysis of genome-wide imprinting studies in Arabidopsis and maize shows that, at least for the Arabidopsis dataset, an increased agreement across datasets can be observed. For maize, however, consistent reanalysis did not yield a larger overlap between the datasets. This suggests that the discrepancy across publications might be partially due to different analysis pipelines but that technical, biological, and environmental factors underlie much of the discrepancy between datasets. Finally, we show that the set of genes that can be characterized regarding allelic bias by all studies with minimal confidence is small (~8,000/27,416 genes for Arabidopsis and ~12,000/39,469 for maize). In conclusion, we propose to use biologically replicated reciprocal crosses, high sequence coverage, and a generalized linear model approach to identify differentially expressed alleles in developing seeds.


Author(s):  
Baokun Sui ◽  
Dong Chen ◽  
Wei Liu ◽  
Bin Tian ◽  
Lei Lv ◽  
...  

Rabies is a lethal disease caused by Rabies lyssavirus, commonly known as rabies virus (RABV), and results in nearly 100% mortality once clinical symptoms occur in humans and animals. Long non-coding RNAs (lncRNAs) have been reported to be associated with viral infection. However, the role of lncRNAs in RABV infection remains elusive. In this study, we performed global transcriptome analysis of both lncRNA and mRNA expression profiles in wild-type (WT) and lab-attenuated RABV-infected mouse brains using next-generation sequencing. The differentially expressed lncRNAs and mRNAs were analysed using the edgeR package. We identified 1422 differentially expressed lncRNAs and 4475 differentially expressed mRNAs by comparing WT and lab-attenuated RABV-infected brains. We then predicted the enriched biological pathways using the Gene Ontology (GO) and Kyoto Encyclopaedia of Genes and Genomes (KEGG) databases, based on the differentially expressed lncRNAs and mRNAs. Our analysis revealed relationships between lncRNAs and RABV-infection-associated immune response and ion transport-related pathways, which provide fresh insight into the potential role of lncRNAs in immune evasion and neuron injury induced by WT RABV.


2019 ◽  
Author(s):  
Bozena Mika-Gospodorz ◽  
Suparat Giengkam ◽  
Alexander J. Westermann ◽  
Jantana Wongsantichon ◽  
Willow Kion-Crosby ◽  
...  

Summary: Emerging and neglected diseases pose challenges as their biology is frequently poorly understood, and genetic tools often do not exist to manipulate the responsible pathogen. Organism-agnostic sequencing technologies offer a promising approach to understand the molecular processes underlying these diseases. Here we apply dual RNA-seq to Orientia tsutsugamushi (Ot), an obligate intracellular bacterium and the causative agent of the vector-borne human disease scrub typhus. Half the Ot genome is composed of repetitive DNA, and there is minimal collinearity in gene order between strains. Integrating RNA-seq, comparative genomics, proteomics, and machine learning, we investigated the transcriptional architecture of Ot, including operon structure and non-coding RNAs, and found evidence for widespread post-transcriptional antisense regulation. We compared the host response to two clinical isolates and identified distinct immune response networks that are up-regulated in response to each strain, leading to predictions of relative virulence which were confirmed in a mouse infection model. Thus, dual RNA-seq can provide insight into the biology and host-pathogen interactions of a poorly characterized and genetically intractable organism such as Ot.


Cells ◽  
2021 ◽  
Vol 10 (11) ◽  
pp. 2961
Author(s):  
He Zhu ◽  
Jian Song ◽  
Nikhilesh Dhar ◽  
Ying Shan ◽  
Xi-Yue Ma ◽  
...  

Cotton is an important economic crop worldwide. Verticillium wilt (VW), caused by Verticillium dahliae (V. dahliae), is a serious disease in cotton, resulting in massive yield losses and a decline in fiber quality. Breeding resistant cotton cultivars is an efficient but elaborate method to improve the resistance of cotton against V. dahliae infection. However, the functional mechanism of several excellent VW-resistant cotton cultivars is poorly understood at present. In our current study, we carried out RNA-seq to discover the differentially expressed genes (DEGs) in the roots of the susceptible cotton Gossypium hirsutum cultivar Junmian 1 (J1) and the resistant cotton G. hirsutum cultivar Liaomian 38 (L38) upon Vd991 inoculation at two time points, compared with the mock-inoculated control plants. The potential function of DEGs uniquely expressed in J1 and L38 was also analyzed by GO enrichment and KEGG pathway associations. Most DEGs were assigned to resistance-related functions. In addition, resistance gene analogues (RGAs) were identified and analyzed for their role in the heightened resistance of the L38 cultivar against the devastating Vd991. In summary, we analyzed the regulatory network of genes in the resistant cotton cultivar L38 during V. dahliae infection, providing a novel and comprehensive insight into VW resistance in cotton.


2020 ◽  
Author(s):  
Estefania Mancini ◽  
Andres Rabinovich ◽  
Javier Iserte ◽  
Marcelo Yanovsky ◽  
Ariel Chernomoretz

Abstract: Genome-wide analysis of alternative splicing has been a very active field of research since the early days of next-generation sequencing (NGS) technologies. Since then, ever-growing data availability and the development of increasingly sophisticated analysis methods have uncovered the complexity of the general splicing repertoire. However, independently of the quantification methodology considered, changes in variant concentration profiles can very often be hard to disentangle. To tackle this problem we present ASpli2, a computational suite implemented in R that allows the identification of changes in both annotated and novel alternative splicing events, and can deal with complex experimental designs. Our analysis workflow relies on the analysis of differential usage of subgenic features in combination with a junction-based description of local splicing changes. Analyzing simulated and real data, we found that the consolidation of these signals resulted in a robust proxy of the occurrence of splicing alterations. While junction-based signals allowed us to uncover annotated as well as non-annotated events, bin-associated signals notably increased recall capabilities at a very competitive performance in terms of precision.


2014 ◽  
Author(s):  
Emmanuel Dimont ◽  
Jiantao Shi ◽  
Rory Kirchner ◽  
Winston Hide

Summary: Next-generation sequencing platforms for measuring digital expression such as RNA-Seq are displacing traditional microarray-based methods in biological experiments. The detection of differentially expressed genes between groups of biological conditions has led to the development of numerous bioinformatics tools, but so far few exploit the expanded dynamic range afforded by the new technologies. We present edgeRun, an R package that implements an unconditional exact test that is a more powerful version of the exact test in edgeR. This increase in power is especially pronounced for experiments with as few as 2 replicates per condition, for genes with low total expression, and for genes with a large biological coefficient of variation. In comparison with a panel of other tools, edgeRun consistently captures functionally similar differentially expressed genes. Availability: The package is freely available under the MIT license from CRAN (http://cran.r-project.org/web/packages/edgeRun). Contact: [email protected]


2020 ◽  
Author(s):  
Diana Lobo ◽  
Raquel Godinho ◽  
John Archer

Abstract

Background: In recent decades, the evolution of RNA-Seq has yielded archived datasets that have the potential to provide unprecedented inter-study insight into transcriptome evolution, once background noise has been reduced. Here we present a method to quantify intra-condition variation and to remove reference-based transcripts associated with highly variable read counts, prior to differential expression analysis. The method utilizes variation within pairwise distances between normalized read counts for each transcript across all included samples of a given condition. As a case study, we demonstrate our approach at an inter- and intra-study level using RNA-seq data from brain samples of dogs, wolves, and two strains of fox (aggressive and tame) prior to performing differential expression analysis to identify common genes associated with tame behaviour.

Results: By applying our method, the distribution of the gene-wise dispersion estimates improved and the number of outliers detected in differential expression analysis decreased. Several genes that initially were differentially expressed in the non-filtered datasets were removed due to high intra-condition variation. Additionally, by optimizing the detection of differentially expressed transcripts, the overall number increased between dogs vs wolves and tame vs aggressive foxes when compared to the non-filtered datasets. Using these filtered sets, we found common overexpressed genes in dogs and tame foxes, including those involved in brain development, neurotransmission and immunity, factors known to be involved in domestication.

Conclusions: We present a method to quantify and remove intra-condition variation from RNA-seq count data and demonstrate its use in improving the distribution of gene-wise dispersion estimates and, ultimately, in reducing the number of false positives in differential gene expression analysis. We provide the method as a freely available tool to aid studies using RNA-seq in calculating and characterizing the variation present within data prior to performing differential expression analysis. Additionally, we identify candidate genes involved in selection for tameness, which seems to have played a crucial role in canine domestication.
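The filtering idea in this abstract can be sketched roughly as follows. The abstract does not specify the exact statistic, so the score used here (mean absolute pairwise distance between replicates, divided by the mean normalized count) and the threshold value are assumptions, and `intra_condition_score` and `filter_transcripts` are hypothetical names rather than functions of the published tool.

```python
from itertools import combinations
from statistics import mean


def intra_condition_score(counts):
    """Assumed variation score for one transcript within one condition.

    counts: normalized read counts, one per sample (at least two samples).
    Returns the mean absolute pairwise distance scaled by the mean count.
    """
    distances = [abs(a - b) for a, b in combinations(counts, 2)]
    centre = mean(counts)
    return mean(distances) / centre if centre > 0 else 0.0


def filter_transcripts(table, threshold):
    """Keep transcripts whose assumed score does not exceed the threshold."""
    return {
        name: counts
        for name, counts in table.items()
        if intra_condition_score(counts) <= threshold
    }


# Toy normalized counts for three replicates of a single condition.
table = {
    "tx_stable": [100.0, 105.0, 95.0],
    "tx_noisy": [10.0, 200.0, 90.0],
}
kept = filter_transcripts(table, threshold=0.5)
print(sorted(kept))
```

In this toy input only `tx_stable` passes the filter; the retained table would then be handed to whichever differential expression tool is in use.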

