L1EM: a tool for accurate locus specific LINE-1 RNA quantification

Bioinformatics ◽

10.1093/bioinformatics/btz724 ◽

2019 ◽

Cited By ~ 3

Author(s):

Wilson McKerrow ◽

David Fenyö

Keyword(s):

Expectation Maximization Algorithm ◽

Simulated Data ◽

Cellular Damage ◽

Supplementary Information ◽

Genomic Locus ◽

Protein Coding ◽

Disease States ◽

Rna Quantification ◽

Long Read ◽

Specific Line

Abstract Motivation: LINE-1 elements are retrotransposons that are capable of copying their sequence to new genomic loci. LINE-1 derepression is associated with a number of disease states, and has the potential to cause significant cellular damage. Because LINE-1 elements are repetitive, it is difficult to quantify LINE-1 RNA at specific loci and to separate transcripts with protein coding capability from other sources of LINE-1 RNA. Results: We provide a tool, L1EM, that uses the expectation maximization algorithm to quantify LINE-1 RNA at each genomic locus, separating transcripts that are capable of generating retrotransposition from those that are not. We show the accuracy of L1EM on simulated data and against long read sequencing from HEK cells. Availability and implementation: L1EM is written in Python. The source code, along with the necessary annotations, is available at https://github.com/FenyoLab/L1EM and distributed under GPLv3. Supplementary information: Supplementary data are available at Bioinformatics online.
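
The expectation maximization idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not L1EM's code: the input format (one dictionary of locus-to-alignment-likelihood per read), the function name `em_abundance`, and the fixed iteration count are all assumptions made for illustration. The E-step shares each ambiguous read among its candidate loci in proportion to the current abundance estimates, and the M-step re-estimates abundances from those fractional counts.

```python
from collections import defaultdict


def em_abundance(read_alignments, n_iter=100):
    """Estimate relative abundance per locus from multi-mapped reads.

    read_alignments: list with one dict per read, mapping a locus name to
    the likelihood of that read given the locus (hypothetical input format).
    """
    loci = sorted({locus for aln in read_alignments for locus in aln})
    if not loci:
        return {}
    # Start from a uniform abundance estimate.
    theta = {locus: 1.0 / len(loci) for locus in loci}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read among its candidate loci.
        for aln in read_alignments:
            weights = {locus: theta[locus] * p for locus, p in aln.items()}
            total = sum(weights.values())
            if total == 0:
                continue  # read is uninformative under the current estimate
            for locus, w in weights.items():
                counts[locus] += w / total
        # M-step: abundances are the normalized fractional read counts.
        n = sum(counts.values())
        if n == 0:
            break
        theta = {locus: counts[locus] / n for locus in loci}
    return theta


# Toy data: 3 reads unique to locus A, 1 unique to B, 4 ambiguous.
reads = [{"A": 1.0}] * 3 + [{"B": 1.0}] + [{"A": 1.0, "B": 1.0}] * 4
estimate = em_abundance(reads)
print({locus: round(value, 3) for locus, value in estimate.items()})
```

In this toy case the ambiguous reads are pulled toward the locus with more uniquely mapped reads, and the estimate for A approaches 0.75.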

L1EM: A tool for accurate locus specific LINE-1 RNA quantification

10.1101/714014 ◽

2019 ◽

Author(s):

Wilson McKerrow ◽

David Fenyö

Keyword(s):

Expectation Maximization ◽

Expectation Maximization Algorithm ◽

Simulated Data ◽

Cellular Damage ◽

Genomic Locus ◽

Protein Coding ◽

Disease States ◽

Rna Quantification ◽

Long Read ◽

Specific Line

Abstract Motivation: LINE-1 elements are retrotransposons that are capable of copying their sequence to new genomic loci. LINE-1 derepression is associated with a number of disease states, and has the potential to cause significant cellular damage. Because LINE-1 elements are repetitive, it is difficult to quantify RNA at specific LINE-1 loci and to separate transcripts with protein coding capability from other sources of LINE-1 RNA. Results: We provide a tool, L1-EM, that uses the expectation maximization algorithm to quantify LINE-1 RNA at each genomic locus, separating transcripts that are capable of generating retrotransposition from those that are not. We show the accuracy of L1-EM on simulated data and against long read sequencing from HEK cells. Availability: L1-EM is written in Python. The source code, along with the necessary annotations, is available at https://github.com/FenyoLab/L1EM and distributed under [email protected], [email protected]

SVIM: structural variant identification using mapped long reads

Bioinformatics ◽

10.1093/bioinformatics/btz041 ◽

2019 ◽

Vol 35 (17) ◽

pp. 2907-2915 ◽

Cited By ~ 32

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Single Molecule ◽

Simulated Data ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

Structural Variants ◽

Human Phenotype ◽

Structural Variant ◽

Pacific Biosciences ◽

Sequencing Technologies ◽

Long Read

Abstract Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than single-nucleotide polymorphisms or small insertions and deletions. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from Pacific Biosciences and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index. Supplementary information: Supplementary data are available at Bioinformatics online.
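
To give a feel for what clustering and combining variant signatures can mean, here is a small Python sketch. It is a deliberately simple stand-in and not SVIM's actual procedure: the tuple layout `(sv_type, chrom, start, end)`, the `max_gap` threshold, and the median-based summary are all assumptions chosen for illustration. Signatures of the same type on the same chromosome are grouped when consecutive start positions lie close together, and each group is then reduced to one call with a support count.

```python
from itertools import groupby
from statistics import median


def cluster_signatures(signatures, max_gap=50):
    """Group (sv_type, chrom, start, end) tuples with nearby start positions."""
    ordered = sorted(signatures, key=lambda s: (s[0], s[1], s[2]))
    clusters = []
    # Only signatures of the same type on the same chromosome may be merged.
    for _, group in groupby(ordered, key=lambda s: (s[0], s[1])):
        current = []
        for sig in group:
            # Open a new cluster when the gap to the previous start is too large.
            if current and sig[2] - current[-1][2] > max_gap:
                clusters.append(current)
                current = []
            current.append(sig)
        clusters.append(current)
    return clusters


def summarize(cluster):
    """Combine one cluster into a single candidate call with read support."""
    return {
        "type": cluster[0][0],
        "chrom": cluster[0][1],
        "start": median(s[2] for s in cluster),
        "end": median(s[3] for s in cluster),
        "support": len(cluster),
    }


signatures = [
    ("DEL", "chr1", 1000, 1200),
    ("DEL", "chr1", 1010, 1195),
    ("DEL", "chr1", 5000, 5300),
    ("INS", "chr1", 1005, 1005),
]
for cluster in cluster_signatures(signatures):
    print(summarize(cluster))
```

The two nearby deletions collapse into one call supported by two signatures, while the distant deletion and the insertion remain separate.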

MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs

Bioinformatics ◽

10.1093/bioinformatics/btaa045 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2690-2696

Author(s):

Jarkko Toivonen ◽

Pratyush K Das ◽

Jussi Taipale ◽

Esko Ukkonen

Keyword(s):

Markov Models ◽

Expectation Maximization Algorithm ◽

Software Tool ◽

Specific Weight ◽

Training Data ◽

Supplementary Information ◽

Markov Modeling ◽

Binding Motifs ◽

The Difference ◽

Probability Matrices

Abstract Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. Availability and implementation: Software implementation is available from https://github.com/jttoivon/moder2. Supplementary information: Supplementary data are available at Bioinformatics online.
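
The difference between a PPM and a first-order Markov model (ADM) can be made concrete with a short Python sketch. This is not MODER2's implementation: the function names, the pseudocount value, and the assumption of pre-aligned, equal-length binding sites are illustrative choices. A PPM scores each position independently, whereas the ADM conditions each nucleotide on the one before it.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_ppm(sites, pseudo=0.01):
    """One independent nucleotide distribution per position."""
    total = len(sites) + pseudo * len(ALPHABET)
    ppm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        ppm.append({b: (counts[b] + pseudo) / total for b in ALPHABET})
    return ppm


def fit_adm(sites, pseudo=0.01):
    """Initial distribution plus one transition table per adjacent pair."""
    first = fit_ppm([site[0] for site in sites], pseudo)[0]
    transitions = []
    for i in range(1, len(sites[0])):
        pairs = Counter((site[i - 1], site[i]) for site in sites)
        table = {}
        for a in ALPHABET:
            row = sum(pairs[(a, b)] for b in ALPHABET) + pseudo * len(ALPHABET)
            table[a] = {b: (pairs[(a, b)] + pseudo) / row for b in ALPHABET}
        transitions.append(table)
    return first, transitions


def loglik_ppm(ppm, seq):
    """Log-likelihood under position independence."""
    return sum(math.log(column[b]) for column, b in zip(ppm, seq))


def loglik_adm(adm, seq):
    """Log-likelihood where each base depends on its left neighbour."""
    first, transitions = adm
    score = math.log(first[seq[0]])
    for i, table in enumerate(transitions, start=1):
        score += math.log(table[seq[i - 1]][seq[i]])
    return score


# Toy training set with a strong dependency: A is followed by C, G by T.
sites = ["AC", "GT"] * 50
ppm, adm = fit_ppm(sites), fit_adm(sites)
for seq in ("AC", "AT"):
    print(seq, round(loglik_ppm(ppm, seq), 3), round(loglik_adm(adm, seq), 3))
```

On this toy set the PPM cannot tell the observed site AC from the unobserved AT, while the ADM assigns AT a much lower likelihood.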

Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Genome Biology ◽

10.1186/s13059-021-02369-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Robin-Lee Troskie ◽

Yohaann Jafrani ◽

Tim R. Mercer ◽

Adam D. Ewing ◽

Geoffrey J. Faulkner ◽

...

Keyword(s):

Cultured Cells ◽

Open Reading Frames ◽

Cdna Sequencing ◽

Protein Coding ◽

Dynamic Component ◽

Gene Copies ◽

Long Read ◽

Normal Human ◽

Reading Frames ◽

Transcriptional Landscape

Abstract Pseudogenes are gene copies presumed to be mainly functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we use CRISPR-Cas9 to delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

Genome sequences of human cytomegalovirus strain TB40/E variants propagated in fibroblasts and epithelial cells

Virology Journal ◽

10.1186/s12985-021-01583-3 ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Ahmed Al Qaffas ◽

Salvatore Camiolo ◽

Mai Vo ◽

Alexis Aguiar ◽

Amine Ourahmane ◽

...

Keyword(s):

Epithelial Cells ◽

Human Cytomegalovirus ◽

Viral Entry ◽

Sequence Data ◽

Laboratory Strain ◽

Serial Passage ◽

Wild Type Virus ◽

Protein Coding ◽

Genetic Changes ◽

Long Read

Abstract The advent of whole genome sequencing has revealed that common laboratory strains of human cytomegalovirus (HCMV) have major genetic deficiencies resulting from serial passage in fibroblasts. In particular, tropism for epithelial and endothelial cells is lost due to mutations disrupting genes UL128, UL130, or UL131A, which encode subunits of a virion-associated pentameric complex (PC) important for viral entry into these cells but not for entry into fibroblasts. The endothelial cell-adapted strain TB40/E has a relatively intact genome and has emerged as a laboratory strain that closely resembles wild-type virus. However, several heterogeneous TB40/E stocks and cloned variants exist that display a range of sequence and tropism properties. Here, we report the use of PacBio sequencing to elucidate the genetic changes that occurred, both at the consensus level and within subpopulations, upon passaging a TB40/E stock on ARPE-19 epithelial cells. The long-read data also facilitated examination of the linkage between mutations. Consistent with inefficient ARPE-19 cell entry, at least 83% of viral genomes present before adaptation contained changes impacting PC subunits. In contrast, and consistent with the importance of the PC for entry into endothelial and epithelial cells, genomes after adaptation lacked these or additional mutations impacting PC subunits. The sequence data also revealed six single noncoding substitutions in the inverted repeat regions, single nonsynonymous substitutions in genes UL26, UL69, US28, and UL122, and a frameshift truncating gene UL141. Among the changes affecting protein-coding regions, only the one in UL122 was strongly selected. This change, resulting in a D390H substitution in the encoded protein IE2, has been previously implicated in rendering another viral protein, UL84, essential for viral replication in fibroblasts. This finding suggests that IE2, and perhaps its interactions with UL84, have important functions unique to HCMV replication in epithelial cells.

Amynthas corticis genome reveals molecular mechanisms behind global distribution

Communications Biology ◽

10.1038/s42003-021-01659-4 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Xing Wang ◽

Yi Zhang ◽

Yufeng Zhang ◽

Mingming Kang ◽

Yuanbo Li ◽

...

Keyword(s):

Genome Assembly ◽

Molecular Mechanisms ◽

Gene Families ◽

The Body ◽

Gene Family Evolution ◽

Complex Environments ◽

Protein Coding ◽

Itraq Analysis ◽

Rdna Sequencing ◽

Long Read

Abstract Earthworms (Annelida: Crassiclitellata) are widely distributed around the world due to their ancient origin as well as adaptation and invasion after introduction into new habitats over the past few centuries. Herein, we report a 1.2 Gb complete genome assembly of the earthworm Amynthas corticis based on a strategy combining third-generation long-read sequencing and Hi-C mapping. A total of 29,256 protein-coding genes are annotated in this genome. Analysis of resequencing data indicates that this earthworm is a triploid species. Furthermore, gene family evolution analysis shows that comprehensive expansion of gene families in the Amynthas corticis genome has produced more defensive functions compared with other species in Annelida. Quantitative proteomic iTRAQ analysis shows that expression of 147 proteins changed in the body of Amynthas corticis, and 16S rDNA sequencing shows that abundance of 28 microorganisms changed in the gut of Amynthas corticis, when the earthworm was incubated with pathogenic Escherichia coli O157:H7. Our genome assembly provides abundant and valuable resources for the earthworm research community, serving as a first step toward uncovering the mysteries of this species, and may provide molecular level indicators of its powerful defensive functions, adaptation to complex environments and invasion ability.

The draft genome sequence of the grove snail Cepaea nemoralis

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkaa071 ◽

2021 ◽

Vol 11 (2) ◽

Author(s):

Suzanne V Saenko ◽

Dick S J Groenenberg ◽

Angus Davison ◽

Menno Schilthuizen

Keyword(s):

Genome Sequence ◽

Draft Genome ◽

Agricultural Pests ◽

Shell Color ◽

Protein Coding ◽

Cepaea Nemoralis ◽

Edible Species ◽

Wide Range ◽

In Captivity ◽

Long Read

Abstract Studies on the shell color and banding polymorphism of the grove snail Cepaea nemoralis and the sister taxon Cepaea hortensis have provided compelling evidence for the fundamental role of natural selection in promoting and maintaining intraspecific variation. More recently, Cepaea has been the focus of citizen science projects on shell color evolution in relation to climate change and urbanization. C. nemoralis is particularly useful for studies on the genetics of shell polymorphism and the evolution of “supergenes,” as well as evo-devo studies of shell biomineralization, because it is relatively easily maintained in captivity. However, an absence of genomic resources for C. nemoralis has generally hindered detailed genetic and molecular investigations. We therefore generated ∼23× coverage long-read data for the ∼3.5 Gb genome, and produced a draft assembly composed of 28,537 contigs with an N50 length of 333 kb. Genome completeness, estimated by BUSCO using the metazoa dataset, was 91%. Repetitive regions cover over 77% of the genome. A total of 43,519 protein-coding genes were predicted in the assembled genome, and 97.3% of these were functionally annotated from either sequence homology or protein signature searches. This first assembled and annotated genome sequence for a helicoid snail, a large group that includes edible species, agricultural pests, and parasite hosts, will be a core resource for identifying the loci that determine the shell polymorphism, as well as in a wide range of analyses in evolutionary and developmental biology, and snail biology in general.
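
For readers unfamiliar with the N50 statistic quoted for this assembly, the following Python sketch shows one common way to compute it: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative only and are not taken from the paper's data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half the
        # assembly size defines the N50.
        if running >= half:
            return length
    raise ValueError("n50() requires at least one contig length")


# Toy contig lengths (arbitrary units), total size 32.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Here the two longest contigs already cover half of the total length, so the N50 is 8.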

Dual Isoform Sequencing Reveals a Multifaceted Transcriptional Architecture of a Prototype Baculovirus

10.21203/rs.3.rs-637036/v1 ◽

2021 ◽

Author(s):

Gábor Torma ◽

Dóra Tombácz ◽

Norbert Moldován ◽

Ádám Fülöp ◽

István Prazsák ◽

...

Keyword(s):

Protein Coding ◽

Rna Molecules ◽

Non Coding Rna ◽

Oxford Nanopore ◽

The Pacific ◽

Viral Genes ◽

Long Read ◽

Oxford Nanopore Technologies ◽

Overlapping Transcripts

Abstract In this study, we used two long-read sequencing (LRS) techniques, Sequel from Pacific Biosciences and MinION from Oxford Nanopore Technologies, for the transcriptional characterization of a prototype baculovirus, Autographa californica multiple nucleopolyhedrovirus. LRS is able to read full-length RNA molecules, and thereby to distinguish between transcript isoforms, mono- and polycistronic RNAs, and overlapping transcripts. Altogether, we detected 875 transcripts, of which 759 are novel and 116 have been annotated previously. These RNA molecules include 41 novel putative protein-coding transcripts (each containing 5’-truncated in-frame ORFs), 14 monocistronic transcripts, 99 multicistronic RNAs, 101 non-coding RNAs, and 504 length isoforms. We also detected RNA methylation in 12 viral genes and RNA hyper-editing in the longer 5’-UTR transcript isoform of the ORF 19 gene.

Genome sequence resource of Phomopsis longicolla strain YC2-1, a fungal pathogen causing Phomopsis stem blight in soybean

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-12-20-0340-a ◽

2021 ◽

Author(s):

Xiaolin Zhao ◽

Zhichao Zhang ◽

Sujiao Zheng ◽

Wenwu Ye ◽

Xiaobo Zheng ◽

...

Keyword(s):

Genome Assembly ◽

Stem Canker ◽

Quality Data ◽

Phomopsis Longicolla ◽

Protein Coding ◽

Stem Blight ◽

A Genome ◽

Long Read ◽

Genomic Resource ◽

Blight Disease

The Diaporthe-Phomopsis disease complex causes considerable yield losses in soybean production worldwide. As one of the major pathogens, Phomopsis longicolla T. W. Hobbs (syn. Diaporthe longicolla) is not only the primary agent of Phomopsis seed decay, but also one of the agents of Phomopsis pod and stem blight, and Phomopsis stem canker. We performed both PacBio long-read sequencing and Illumina short-read sequencing, and obtained a genome assembly for the P. longicolla strain YC2-1, which was isolated from a soybean stem with Phomopsis stem blight disease. The 63.1 Mb genome assembly contains 87 scaffolds, with a minimum, maximum, and N50 scaffold length of 20 kb, 4.6 Mb, and 1.5 Mb, respectively, and a total of 17,407 protein-coding genes. The high-quality data expand the genomic resources for P. longicolla and will provide a solid foundation for a better understanding of its genetic diversity and pathogenic mechanisms.

Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

10.1101/847558 ◽

2019 ◽

Author(s):

Ryan Bracewell ◽

Anita Tran ◽

Kamalakar Chatla ◽

Doris Bachtrog

Keyword(s):

X Chromosome ◽

Genome Assembly ◽

De Novo ◽

Pericentromeric Region ◽

Species Group ◽

Chromosome 15 ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Chromosome Level

Abstract The Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193 Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromeres, which are largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes, and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata, and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our finding suggests that a metacentric ancestral X fused to a telocentric Muller D, creating the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.
