Transcriptional Landscape and Splicing Efficiency in Arabidopsis Mitochondria

Cells ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 2054
Author(s):  
Laura E. Garcia ◽  
M. Virginia Sanchez-Puerta

Plant mitochondrial transcription is initiated from multiple promoters without an apparent motif, which precludes their identification in other species based on sequence comparisons. Even though coding regions take up only a small fraction of plant mitochondrial genomes, deep RNAseq studies uncovered that these genomes are fully or nearly fully transcribed with significantly different RNA read depth across the genome. Transcriptomic analysis can be a powerful tool to understand the transcription process in diverse angiosperms, including the identification of potential promoters and co-transcribed genes or to study the efficiency of intron splicing. In this work, we analyzed the transcriptional landscape of the Arabidopsis mitochondrial genome (mtDNA) based on large-scale RNA sequencing data to evaluate the use of RNAseq to study those aspects of the transcription process. We found that about 98% of the Arabidopsis mtDNA is transcribed with highly different RNA read depth, which was elevated in known genes. The location of a sharp increase in RNA read depth upstream of genes matched the experimentally identified promoters. The continuously high RNA read depth across two adjacent genes agreed with the known co-transcribed units in Arabidopsis mitochondria. Most intron-containing genes showed a high splicing efficiency with no differences between cis and trans-spliced introns or between genes with distinct splicing mechanisms. Deep RNAseq analyses of diverse plant species will be valuable to recognize general and lineage-specific characteristics related to the mitochondrial transcription process.
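The two quantities this abstract reports, the transcribed fraction of the genome and per-intron splicing efficiency, can be illustrated with a minimal sketch. The function names, the depth threshold, and the definition of efficiency as spliced / (spliced + unspliced) junction reads are assumptions made for illustration; the abstract does not state how the authors computed either value.

```python
from typing import Sequence


def transcribed_fraction(depth: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of genome positions whose read depth is at least min_depth."""
    if not depth:
        raise ValueError("depth must not be empty")
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)


def splicing_efficiency(spliced_reads: int, unspliced_reads: int) -> float:
    """Share of reads spanning an intron boundary that are spliced.

    Assumed definition: spliced / (spliced + unspliced).
    """
    total = spliced_reads + unspliced_reads
    if total == 0:
        raise ValueError("no reads span this intron")
    return spliced_reads / total


if __name__ == "__main__":
    # Hypothetical per-base read depth for a ten-position toy genome.
    toy_depth = [0, 3, 12, 40, 38, 5, 0, 1, 9, 22]
    print(transcribed_fraction(toy_depth))
    # Hypothetical junction read counts for a single intron.
    print(splicing_efficiency(90, 10))
```

In practice the per-base depth would come from an alignment tool's depth output, and a stricter `min_depth` would guard against counting positions supported by a single stray read.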

2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Michael D. Linderman ◽  
Davin Chia ◽  
Forrest Wallace ◽  
Frank A. Nothaft

Abstract Background: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. Results: DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. Conclusions: We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.


2019 ◽  
Vol 35 (19) ◽  
pp. 3855-3856 ◽  
Author(s):  
Emma A Fox ◽  
Alison E Wright ◽  
Matteo Fumagalli ◽  
Filipe G Vieira

Abstract Motivation: Linkage disequilibrium (LD) measures the correlation between genetic loci and is highly informative for association mapping and population genetics. As many studies rely on called genotypes for estimating LD, their results can be affected by data uncertainty, especially when employing a low read depth sequencing strategy. Furthermore, there is a manifest lack of tools for the analysis of large-scale, low-depth and short-read sequencing data from non-model organisms with limited sample sizes. Results: ngsLD addresses these issues by estimating LD directly from genotype likelihoods in a fast, reliable and user-friendly implementation. This method makes use of the full information available from sequencing data and provides accurate estimates of linkage disequilibrium patterns compared with approaches based on genotype calling. We conducted a case study to investigate how LD decays over physical distance in two avian species. Availability and implementation: The methods presented in this work were implemented in C/C++ and are freely available for non-commercial use from https://github.com/fgvieira/ngsLD. Supplementary information: Supplementary data are available at Bioinformatics online.
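To make the idea of measuring LD without hard genotype calls concrete, the sketch below converts genotype likelihoods into expected allele dosages and takes the squared Pearson correlation between two sites. This is a deliberate simplification and not ngsLD's estimator, which the abstract does not describe in detail; the flat genotype prior, the type alias, and the function names are all assumptions.

```python
from typing import Sequence, Tuple

# Likelihood of the read data given 0, 1 or 2 copies of the alternate allele.
Likelihoods = Tuple[float, float, float]


def expected_dosage(gl: Likelihoods) -> float:
    """Expected alternate-allele count under a flat genotype prior (assumed)."""
    total = sum(gl)
    if total <= 0:
        raise ValueError("likelihoods must have a positive sum")
    return (gl[1] + 2.0 * gl[2]) / total


def r_squared(site_a: Sequence[Likelihoods], site_b: Sequence[Likelihoods]) -> float:
    """Squared Pearson correlation of expected dosages at two sites."""
    if len(site_a) != len(site_b) or len(site_a) < 2:
        raise ValueError("need the same individuals (at least two) at both sites")
    x = [expected_dosage(g) for g in site_a]
    y = [expected_dosage(g) for g in site_b]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a site shows no variation across individuals")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical likelihoods for three individuals at two sites.
    site_1 = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1), (0.0, 0.2, 0.8)]
    site_2 = [(0.8, 0.2, 0.0), (0.2, 0.7, 0.1), (0.0, 0.1, 0.9)]
    print(r_squared(site_1, site_2))
```

The point of working with dosages is that uncertain genotypes at low depth contribute fractional evidence instead of being forced into a possibly wrong call.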


2017 ◽  
Author(s):  
Hui Yang ◽  
Gary Chen ◽  
Leandro Lima ◽  
Han Fang ◽  
Laura Jimenez ◽  
...  

ABSTRACT BACKGROUND: Whole-genome sequencing (WGS) data may be used to identify copy number variations (CNVs). Existing CNV detection methods mostly rely on read depth or alignment characteristics (paired-end distance and split reads) to infer gains/losses, but they neglect allelic intensity ratios and cannot quantify copy numbers. Additionally, most CNV callers are not scalable to handle a large number of WGS samples. METHODS: To facilitate large-scale and rapid CNV detection from WGS data, we developed a Dynamic Programming Imputation (DPI) based algorithm called HadoopCNV, which infers copy number changes through both allelic frequency and read depth information. Our implementation is built on the Hadoop framework, enabling multiple compute nodes to work in parallel. RESULTS: Compared to two widely used tools, CNVnator and LUMPY, HadoopCNV has similar or better performance on both simulated data sets and real data on the NA12878 individual. Additionally, analysis on a 10-member pedigree showed that HadoopCNV has a Mendelian precision that is similar to or better than that of other tools. Furthermore, HadoopCNV can accurately infer loss of heterozygosity (LOH), while other tools cannot. HadoopCNV requires only 1.6 hours for a human genome with 30X coverage, on a 32-node cluster, with a linear relationship between speed improvement and the number of nodes. We further developed a method to combine HadoopCNV and LUMPY results, and demonstrated that the combination resulted in better performance than any individual tool. CONCLUSIONS: The combination of high-resolution, allele-specific read depth from WGS data and the Hadoop framework can result in efficient and accurate detection of CNVs.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yanan Ren ◽  
Ting-You Wang ◽  
Leah C. Anderton ◽  
Qi Cao ◽  
Rendong Yang

Abstract Background: Long non-coding RNAs (lncRNAs) are a growing focus in cancer research. Deciphering pathways influenced by lncRNAs is important to understand their role in cancer. Although knock-down or overexpression of lncRNAs followed by gene expression profiling in cancer cell lines are established approaches to address this problem, these experimental data are not available for a majority of the annotated lncRNAs. Results: As a surrogate, we present lncGSEA, a convenient tool to predict the lncRNA-associated pathways through Gene Set Enrichment Analysis of gene expression profiles from large-scale cancer patient samples. We demonstrate that lncGSEA is able to recapitulate lncRNA-associated pathways supported by literature and experimental validations in multiple cancer types. Conclusions: LncGSEA allows researchers to infer lncRNA regulatory pathways directly from clinical samples in oncology. LncGSEA is written in R, and is freely accessible at https://github.com/ylab-hi/lncGSEA.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Smritikana Dutta ◽  
Anwesha Deb ◽  
Prasun Biswas ◽  
Sukanya Chakraborty ◽  
Suman Guha ◽  
...  

Abstract Bamboos, members of the family Poaceae, present many interesting features with respect to their fast and extended vegetative growth, unusual yet divergent flowering time across species, and the impact of sudden, large-scale flowering on forest ecology. However, few studies have been conducted at the molecular level to characterize important genes that regulate vegetative and flowering habit in bamboo. In this study, two bamboo FD genes, BtFD1 and BtFD2, which are members of the florigen activation complex (FAC), have been identified by sequence and phylogenetic analyses. Sequence comparisons identified one important amino acid, which was located in the DNA-binding basic region and was altered between BtFD1 and BtFD2 (Ala146 of BtFD1 vs. Leu100 of BtFD2). Electrophoretic mobility shift assay revealed that this alteration resulted in ten times higher binding efficiency of BtFD1 than BtFD2 to its target ACGT motif present at the promoter of the APETALA1 gene. Expression analyses in different tissues and seasons indicated the involvement of BtFD1 in flower and vegetative development, while BtFD2 was expressed at very low levels throughout all the tissues and conditions studied. Finally, a tenfold increase of the AtAP1 transcript level in p35S::BtFD1 Arabidopsis plants compared to wild type confirms a positive regulatory role of BtFD1 towards flowering. However, constitutive expression of BtFD1 led to dwarfism and an apparent reduction in the length of the flowering stalk and the number of flowers per plant, whereas no visible phenotype was observed for BtFD2 overexpression. This signifies that timely expression of BtFD1 may be critical to perform its programmed developmental role in planta.


2010 ◽  
Vol 26 (17) ◽  
pp. 2101-2108 ◽  
Author(s):  
Jiří Macas ◽  
Pavel Neumann ◽  
Petr Novák ◽  
Jiming Jiang

Abstract Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies. Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
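The alignment-free idea in this abstract, comparing read sets through their k-mer frequency spectra, can be sketched in a few lines. The helper names and the choice of half the L1 distance as a divergence measure are assumptions for illustration; the abstract does not specify which divergence statistic the authors used.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer made only of A, C, G and T across all reads."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_frequencies(counts: Mapping[str, int]) -> Dict[str, float]:
    """Normalize raw counts so that the frequencies sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers were counted")
    return {kmer: n / total for kmer, n in counts.items()}


def spectrum_distance(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Half the L1 distance between two spectra: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


if __name__ == "__main__":
    # Hypothetical toy read sets standing in for two sequencing datasets.
    set_a = kmer_frequencies(kmer_counts(["ACGTACGT", "CGTACG"], k=3))
    set_b = kmer_frequencies(kmer_counts(["ACGTACGA", "TTTACG"], k=3))
    print(spectrum_distance(set_a, set_b))
```

Because no alignment is needed, the cost grows linearly with the total read length, which is what makes this kind of comparison practical for large repeat datasets.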


2020 ◽  
Author(s):  
Benedict Hew ◽  
Qiao Wen Tan ◽  
William Goh ◽  
Jonathan Wei Xiong Ng ◽  
Kenny Koh ◽  
...  

Abstract Bacterial resistance to antibiotics is a growing problem that is projected to cause more deaths than cancer in 2050. Consequently, novel antibiotics are urgently needed. Since more than half of the available antibiotics target the bacterial ribosomes, proteins that are involved in protein synthesis are prime targets for the development of novel antibiotics. However, experimental identification of these potential antibiotic target proteins can be labor-intensive and challenging, as these proteins are likely to be poorly characterized and specific to few bacteria. In order to identify these novel proteins, we established a Large-Scale Transcriptomic Analysis Pipeline in Crowd (LSTrAP-Crowd), where 285 individuals processed 26 terabytes of RNA-sequencing data of the 17 most notorious bacterial pathogens. In total, the crowd processed 26,269 RNA-seq experiments and used the data to construct gene co-expression networks, which were used to identify more than a hundred uncharacterized genes that were transcriptionally associated with protein synthesis. We provide the identity of these genes together with the processed gene expression data. The data can be used to identify other vulnerabilities of bacteria, while our approach demonstrates how the processing of gene expression data can be easily crowdsourced.
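A gene co-expression network of the kind mentioned in this abstract can be sketched by linking genes whose expression profiles are strongly correlated across experiments. The Pearson measure, the 0.9 threshold, and the gene identifiers below are assumptions for illustration; the abstract does not say which similarity measure or cutoff LSTrAP-Crowd uses.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a profile is constant; correlation is undefined")
    return sxy / sqrt(sxx * syy)


def coexpression_edges(
    expression: Dict[str, Sequence[float]], threshold: float = 0.9
) -> List[Tuple[str, str]]:
    """Return gene pairs whose correlation is at least the threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        if pearson(expression[gene_a], expression[gene_b]) >= threshold:
            edges.append((gene_a, gene_b))
    return edges


if __name__ == "__main__":
    # Hypothetical expression values across four RNA-seq experiments.
    toy_expression = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
        "geneD": [1.0, 3.0, 2.0, 4.0],
    }
    print(coexpression_edges(toy_expression))
```

An uncharacterized gene that ends up connected mainly to known protein-synthesis genes in such a network becomes a candidate for involvement in the same process, which is the guilt-by-association reasoning the abstract relies on.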


Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, and naturally even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in the studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.


2021 ◽  
Author(s):  
Wen Feng ◽  
Lei Zhou ◽  
Pengju Zhao ◽  
Heng Du ◽  
Chenguang Diao ◽  
...  

As the warthog (Phacochoerus africanus) has innate immunity against African swine fever (ASF), it is critical to understand the evolutionary novelty of the warthog to explain its specific ASF resistance. Here, we present two new complete genomes, one of a warthog and one of a Kenyan domestic pig, as fundamental genomic references to decode the genetic mechanism of ASF tolerance. Our results indicated that multiple genomic variations, including gene losses and independent contraction and expansion of specific gene families, likely moulded the warthog genome to adapt to its environment. Importantly, the analysis of presence and absence of genomic sequences revealed that the warthog genome lacks a DNA sequence of the lactate dehydrogenase B (LDHB) gene on chromosome 2 compared to the reference genome. Overexpression and siRNA knockdown of LDHB indicated its inhibitory effect on the replication of ASFV. Combined with large-scale sequencing data of 123 pigs from all over the world, the contraction and expansion of TRIM gene families revealed that TRIM family genes in the warthog genome were potentially responsible for its tolerance to ASF. Our results will help further improve the understanding of genetic resistance to ASF in pigs.

