riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions

10.1101/159798 ◽

2017 ◽

Author(s):

Nicholas R. Waters ◽

Florence Abram ◽

Fiona Brennan ◽

Ashleigh Holmes ◽

Leighton Pritchard

Keyword(s):

De Novo ◽

Bacterial Genome ◽

Genomic Context ◽

Inherent Difficulty ◽

Short Reads ◽

Genomic Architecture ◽

A Genome ◽

Ribosomal Operons ◽

Flanking Regions ◽

Bacterial Genome Sequencing

The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ≈10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit conservation in the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly-sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.
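As a rough illustration of the "rDNA plus unique flanking regions" idea described above, the following Python sketch widens a reference rDNA locus by a fixed flank on each side. It is not riboSeed's implementation. The function name `flanked_region`, the 0-based half-open coordinates, the flank width, and the treatment of the sequence as linear (no wrap-around at the origin) are all assumptions made for illustration.

```python
def flanked_region(start, end, flank, genome_length):
    """Widen an rDNA locus [start, end) by `flank` bp on each side.

    Coordinates are 0-based, half-open, on a sequence treated as linear;
    the result is clamped to [0, genome_length]. All names and conventions
    here are hypothetical, not taken from riboSeed.
    """
    if flank < 0:
        raise ValueError("flank must be non-negative")
    if not 0 <= start < end <= genome_length:
        raise ValueError("locus must lie within the sequence")
    return max(0, start - flank), min(genome_length, end + flank)


# Example: a 5 kb locus widened by a hypothetical 1 kb flank.
region = flanked_region(5000, 10000, 1000, 12000)
```

Reads mapping to such a widened region would carry both the repeated rDNA sequence and the locus-specific flanks, which is what lets each copy be placed unambiguously.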

De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer

Genome Research ◽

10.1101/gr.072033.107 ◽

2008 ◽

Vol 18 (5) ◽

pp. 802-809 ◽

Cited By ~ 408

Author(s):

D. Hernandez ◽

P. Francois ◽

L. Farinelli ◽

M. Osteras ◽

J. Schrenzel

Keyword(s):

Genome Sequencing ◽

De Novo ◽

Bacterial Genome ◽

Desktop Computer ◽

Short Reads ◽

Bacterial Genome Sequencing

Resolving the Full Spectrum of Human Genome Variation using Linked-Reads

10.1101/230946 ◽

2017 ◽

Cited By ~ 8

Author(s):

Patrick Marks ◽

Sarah Garcia ◽

Alvaro Martinez Barrio ◽

Kamila Belhocine ◽

Jorge Bernate ◽

...

Keyword(s):

Human Genome ◽

Large Scale ◽

De Novo ◽

Simultaneous Detection ◽

Whole Genome ◽

Structural Variations ◽

Full Spectrum ◽

Short Read ◽

Short Reads ◽

A Genome

Abstract: Large-scale population-based analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole genome sequencing. However, standard short-read approaches, used primarily due to accuracy, throughput and costs, fail to give a complete picture of a genome. They struggle to identify large, balanced structural events, cannot access repetitive regions of the genome and fail to resolve the human genome into its two haplotypes. Here we describe an approach that retains long-range information while harnessing the advantages of short reads. Starting from only ∼1 ng of DNA, we produce barcoded short-read libraries. The use of novel informatic approaches allows for the barcoded short reads to be associated with the long molecules of origin, producing a novel datatype known as ‘Linked-Reads’. This approach allows for simultaneous detection of small and large variants from a single Linked-Read library. We have previously demonstrated the utility of whole genome Linked-Reads (lrWGS) for performing diploid, de novo assembly of individual genomes (Weisenfeld et al. 2017). In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. We demonstrate the ability of Linked-Reads to reconstruct megabase-scale haplotypes and to recover parts of the genome that are typically inaccessible to short reads, including phenotypically important genes such as STRC, SMN1 and SMN2. We demonstrate the ability of both lrWGS and Linked-Read Whole Exome Sequencing (lrWES) to identify complex structural variations, including balanced events, single exon deletions, and single exon duplications. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.

Multi-hit autism genomic architecture evidenced from consanguineous families with involvement of FEZF2 and mutations in high-risk genes

10.1101/759480 ◽

2019 ◽

Author(s):

Mounia Bensaid ◽

Yann Loe-Mie ◽

Aude-Marie Lepagnol-Bestel ◽

Wenqi Han ◽

Gabriel Santpere ◽

...

Keyword(s):

High Risk ◽

Cortical Neurons ◽

De Novo ◽

Deep Layer ◽

Autism Spectrum ◽

Projection Neurons ◽

Cortical Projection ◽

Linkage Region ◽

Genomic Architecture ◽

A Genome

Abstract: Autism Spectrum Disorders (ASDs) are a heterogeneous collection of neurodevelopmental disorders with a strong genetic basis. Recent studies identified that a single hit of either a de novo or transmitted gene-disrupting, or likely gene-disrupting, mutation in a subset of 65 strongly associated genes can be sufficient to generate an ASD phenotype. We took advantage of consanguineous families with an ASD proband to evaluate this model. By a genome-wide homozygosity mapping of ten families with eleven children displaying ASD, we identified a linkage region of 133 kb in five families at the 3p14.2 locus that includes FEZF2 with a LOD score of 5.8, suggesting a founder effect. Sequencing FEZF2 revealed a common deletion of four codons. However, the damaging FEZF2 mutation did not appear to be sufficient to induce the disease as non-affected parents also carry the mutation and, similarly, Fezf2 knockout mouse embryos electroporated with the mutant human FEZF2 construct did not display any obvious defects in the corticospinal tract, a pathway whose development depends on FEZF2. We extended the genetic analysis of these five FEZF2-linked families versus five FEZF2 non-linked families by studying de novo and transmitted copy number variation (CNV) and performing Whole Exome Sequencing (WES). We identified damaging mutations in the subset of 65 genes strongly associated with ASD whose co-expression analysis suggests an impact on the prefrontal cortex during the mid-fetal periods. From these results, we propose that both FEZF2 deletion and multiple hits in the repertoire of these 65 genes are necessary to generate an ASD phenotype.

Significance Statement: The human neocortex is a highly organized laminar structure with neuron positioning and identity of deep-layer cortical neurons that depend on key transcription factors, such as FEZF2, SATB2, TSHZ3 and TBR1. These genes have a specific spatio-temporal pattern of expression in human midfetal deep cortical projection neurons and display mutations in patients with Autism Spectrum Disorder (ASD). Here, we identified a linkage region involving the FEZF2 gene in five consanguineous families with an ASD proband. For these FEZF2-allele linked probands, we identified a four-codon deletion in FEZF2 and damaging mutations in other high-risk ASD genes that exhibit regional and cell type–specific convergence in neocortical deep-layer excitatory neurons, suggesting a multi-hit genomic architecture of ASD in these consanguineous families.

Unicycler: resolving bacterial genome assemblies from short and long sequencing reads

10.1101/096412 ◽

2016 ◽

Cited By ~ 1

Author(s):

Ryan R. Wick ◽

Louise M. Judd ◽

Claire L. Gorrie ◽

Kathryn E. Holt

Keyword(s):

Dna Sequencing ◽

De Novo ◽

Bacterial Genome ◽

Read Depth ◽

Resolving Power ◽

Short Reads ◽

Sequencing Platform ◽

Long Reads ◽

Combining Data ◽

Genome Assemblies

Abstract: The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce more complete genome assemblies, but the sequencing is more expensive and error prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate “hybrid” assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler utilises a novel semi-global aligner, which is used to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.

An Unbiased Predictive Model to Detect DNA Methylation Propensity of CpG Islands in the Human Genome

Current Bioinformatics ◽

10.2174/1574893615999200724145835 ◽

2020 ◽

Vol 15 ◽

Author(s):

Dicle Yalcin ◽

Hasan H. Otu

Keyword(s):

Model Building ◽

De Novo ◽

Cpg Islands ◽

Treatment Strategies ◽

Area Under The Curve ◽

Global Methylation ◽

Sequence Features ◽

A Genome ◽

Combined Features ◽

Epigenetic Repression

Background: Epigenetic repression mechanisms play an important role in gene regulation, specifically in cancer development. In many cases, local DNA sequence features are shown to contribute to a CpG island’s (CGI) susceptibility or resistance to methylation. Objective: To develop unbiased machine learning models–individually and combined for different biological features–that predict the methylation propensity of a CGI. Methods: We developed our model consisting of CGI sequence features on a dataset of 75 sequences (28 prone, 47 resistant) representing a genome-wide methylation structure. We tested our model on two independent datasets that are chromosome (132 sequences) and disease (70 sequences) specific. Results: We provided improvements in prediction accuracy over previous models. Our results indicate that combined features better predict the methylation propensity of a CGI (area under the curve (AUC) ~0.81). Our global methylation classifier performs well on independent datasets, reaching an AUC of ~0.82 for the complete model and an AUC of ~0.88 for the model using select sequences that better represent their classes in the training set. We report certain de novo motifs and transcription factor binding site (TFBS) motifs that are consistently better at separating prone and resistant CGIs. Conclusion: Predictive models for the methylation propensity of CGIs lead to a better understanding of disease mechanisms and can be used to classify genes based on their tendency to contain methylation-prone CGIs, which may lead to preventative treatment strategies. MATLAB and Python™ scripts used for model building, prediction, and downstream analyses are available at https://github.com/dicleyalcin/methylProp_predictor.
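For readers unfamiliar with the AUC figures quoted above, the sketch below shows one standard way to compute the area under the ROC curve for a two-class problem: as the fraction of (prone, resistant) score pairs that the classifier ranks correctly, with ties counted as half. This is a generic illustration, not the authors' code. The function name `pairwise_auc` and the example scores are made up.

```python
def pairwise_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive example
    (e.g. a methylation-prone CGI) scores higher than a randomly chosen
    negative one (a resistant CGI); ties contribute 0.5.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores: three of the four pairs are ordered correctly.
example_auc = pairwise_auc([0.9, 0.3], [0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values of about 0.81 to 0.88 indicate that most prone/resistant pairs are ordered correctly.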

Serodiagnosis and Bacterial Genome of Helicobacter pylori Infection

Toxins ◽

10.3390/toxins13070467 ◽

2021 ◽

Vol 13 (7) ◽

pp. 467

Author(s):

Aina Ichihara ◽

Hinako Ojima ◽

Kazuyoshi Gotoh ◽

Osamu Matsushita ◽

Susumu Take ◽

...

Keyword(s):

Helicobacter Pylori ◽

Antibody Titer ◽

Bacterial Genome ◽

Serum Antibody ◽

Gene Mutations ◽

Bacterial Genomes ◽

Western Blots ◽

A Genome ◽

Vaca Gene ◽

H Pylori

The infection caused by Helicobacter pylori is associated with several diseases, including gastric cancer. Several methods for the diagnosis of H. pylori infection exist, including endoscopy, the urea breath test, the fecal antigen test, and the serum antibody titer test, which is often used because it is simple and highly sensitive. In this context, this study aims to find the association between different antibody reactivities and the organization of bacterial genomes. Next-generation sequencing was performed to determine the genome sequences of four antigen strains with different reactivity. The search was performed on the common genes, with the homology analysis conducted using a genome ring and dot plot analysis. The two antigens of the highly reactive strains showed a high gene homology, and Western blots for CagA and VacA also showed high expression levels of proteins. In the poorly responsive antigen strains, an inversion was found around the vacA gene in the genome. The structure of bacterial genomes might contribute to the poor reactivity exhibited by the antibodies of patients. In the future, an accurate serodiagnosis could be performed by using a strain with few gene mutations as the antigen for the H. pylori antibody titer test.

The Location of Substitutions and Bacterial Genome Arrangements

Genome Biology and Evolution ◽

10.1093/gbe/evaa260 ◽

2020 ◽

Author(s):

Daniella F Lato ◽

G Brian Golding

Keyword(s):

Sinorhizobium Meliloti ◽

Bacterial Genome ◽

Evolutionary Analysis ◽

Origin Of Replication ◽

Ancestral Reconstruction ◽

Molecular Change ◽

E Coli ◽

A Genome ◽

The Impact ◽

Molecular Evolutionary Analysis

Abstract: Increasing evidence supports the notion that different regions of a genome have unique rates of molecular change. This variation is particularly evident in bacterial genomes, where previous studies have reported that gene expression and essentiality tend to decrease, while substitution rates usually increase, with increasing distance from the origin of replication. Genomic reorganizations such as rearrangements occur frequently in bacteria and allow for the introduction and restructuring of genetic content, creating gradients of molecular traits along genomes. Here, we explore the interplay of these phenomena by mapping substitutions to the genomes of Escherichia coli, Bacillus subtilis, Streptomyces, and Sinorhizobium meliloti, quantifying how many substitutions have occurred at each position in the genome. Preceding work indicates that substitution rate significantly increases with distance from the origin. Using a larger sample size and accounting for genome rearrangements through ancestral reconstruction, our analysis demonstrates that the correlation between the number of substitutions and distance from the origin of replication is often significant but small and inconsistent in direction. Some replicons had a significantly decreasing trend (E. coli and the chromosome of S. meliloti), while others showed the opposite significant trend (B. subtilis, Streptomyces, pSymA and pSymB in S. meliloti). dN, dS and ω were examined across all genes and there was no significant correlation between those values and distance from the origin. This study highlights the impact that genomic rearrangements and location have on molecular trends in some bacteria, illustrating the importance of considering spatial trends in molecular evolutionary analysis. Assuming that molecular trends are exclusively in one direction can be problematic.
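The notion of "distance from the origin of replication" used above can be made concrete with a small sketch. It assumes a circular replicon replicated bidirectionally, so that the relevant distance is the shorter of the two arcs between a position and the origin. The study's own coordinate handling may differ, and the function name `origin_distance` and the example coordinates are hypothetical.

```python
def origin_distance(position, origin, replicon_length):
    """Shortest distance (bp) from `position` to `origin` on a circular replicon.

    Assumes 0-based coordinates in [0, replicon_length) and bidirectional
    replication, so the shorter arc around the circle is used.
    """
    if replicon_length <= 0:
        raise ValueError("replicon_length must be positive")
    if not (0 <= position < replicon_length and 0 <= origin < replicon_length):
        raise ValueError("coordinates must lie on the replicon")
    direct = abs(position - origin)
    return min(direct, replicon_length - direct)


# On a hypothetical 100 bp circle with the origin at 90, position 10 is 20 bp
# away (going through the coordinate wrap-around), not 80 bp.
d = origin_distance(10, 90, 100)
```

Per-position substitution counts can then be regressed against this distance to test for a spatial trend of the kind the abstract describes.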

Genomic Characterization Provides an Insight into the Pathogenicity of the Poplar Canker Bacterium Lonsdalea populi

Genes ◽

10.3390/genes12020246 ◽

2021 ◽

Vol 12 (2) ◽

pp. 246

Author(s):

Xiaomeng Chen ◽

Rui Li ◽

Yonglin Wang ◽

Aining Li

Keyword(s):

Genome Sequence ◽

Extracellular Enzymes ◽

De Novo ◽

Whole Genome Sequence ◽

Hybrid Poplars ◽

A Genome ◽

Conserved Genes ◽

Genomic Characterization ◽

Molecular Bases ◽

Insight Into

An emerging poplar canker caused by the gram-negative bacterium Lonsdalea populi has led to high mortality of hybrid poplars Populus × euramericana in China and Europe. The molecular bases of pathogenicity and bark adaptation of L. populi have become a focus of recent research. This study revealed the whole genome sequence and identified putative virulence factors of L. populi. A high-quality L. populi genome sequence was assembled de novo, with a genome size of 3,859,707 bp, containing approximately 3434 genes and 107 RNAs (75 tRNA, 22 rRNA, and 10 ncRNA). The L. populi genome contained 380 virulence-associated genes, mainly encoding adhesion, extracellular enzymes, secretory systems, and two-component transduction systems. The genome had 110 carbohydrate-active enzyme (CAZy)-coding genes and putative secreted proteins. The antibiotic-resistance database annotation indicated that L. populi was resistant to penicillin, fluoroquinolone, and kasugamycin. Comparative genomic analysis found that L. populi exhibited the highest homology with the L. britannica genome and that L. populi encompassed 1905 specific genes, 1769 dispensable genes, and 1381 conserved genes, suggesting high evolutionary diversity and genomic plasticity. Moreover, the pan genome analysis revealed that the N-5-1 genome is an open genome. These findings provide important resources for understanding the molecular basis of the pathogenicity and biology of L. populi and the poplar-bacterium interaction.

Sequencing an F1 hybrid of Silurus asotus and S. meridionalis enabled the assembly of high-quality parental genomes

Scientific Reports ◽

10.1038/s41598-021-93257-x ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Weitao Chen ◽

Ming Zou ◽

Yuefei Li ◽

Shuli Zhu ◽

Xinhui Li ◽

...

Keyword(s):

De Novo ◽

Parental Species ◽

F1 Hybrids ◽

Pelteobagrus Fulvidraco ◽

F1 Hybrid ◽

Short Reads ◽

Final Assembly ◽

Genome Complexity ◽

Hybrid Genome ◽

Silurus Asotus

Abstract: Genome complexity, such as heterozygosity, may heavily influence de novo assembly. Sequencing somatic cells of F1 hybrids, which harbor two sets of genetic material from the paternal and maternal species, may avoid the need for allele discrimination during assembly. However, the feasibility of this strategy needs further assessment. We sequenced and assembled the genome of an F1 hybrid between Silurus asotus and S. meridionalis using the Sequel II platform and Hi-C scaffolding technologies. More than 300 Gb of raw data were generated, and the final assembly obtained 2344 scaffolds composed of 3017 contigs. The N50 length of scaffolds and contigs was 28.55 Mb and 7.49 Mb, respectively. Based on the mapping results of short reads generated for the paternal and maternal species, each of the 29 chromosomes originating from S. asotus and S. meridionalis was recognized. We recovered nearly 94% and 96% of the total length of S. asotus and S. meridionalis, respectively. BUSCO assessments and mapping analyses suggested that both genomes had high completeness and accuracy. Further analyses demonstrated the high collinearity between S. asotus, S. meridionalis, and the related Pelteobagrus fulvidraco. Comparison of the two genomes with that assembled only using the short reads from non-hybrid parental species detected a small portion of sequences that may be incorrectly assigned to the different species. We suppose that at least part of these situations may have resulted from mitotic recombination. The strategy of sequencing the F1 hybrid genome can recover the vast majority of the parental genomes and may improve the assembly of complex genomes.
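The N50 values reported above follow the usual assembly-statistics definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal Python sketch of that definition, with a made-up list of lengths and a hypothetical function name `n50`, is:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches at least half of the total; the length of the last
    sequence added is the N50.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length


# Toy assembly totalling 20 units: 6 + 5 = 11 covers half, so N50 is 5.
toy_n50 = n50([2, 3, 4, 5, 6])
```

A larger N50 at the same total length indicates a more contiguous assembly, which is why the scaffold N50 (28.55 Mb) exceeds the contig N50 (7.49 Mb) after Hi-C scaffolding.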

riboCIRC: a comprehensive database of translatable circRNAs

Genome Biology ◽

10.1186/s13059-021-02300-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Huihui Li ◽

Mingzhe Xie ◽

Yan Wang ◽

Ludong Yang ◽

Zhi Xie ◽

...

Keyword(s):

De Novo ◽

Species Conservation ◽

Structure And Function ◽

Research Community ◽

Genome Browser ◽

Valuable Resource ◽

Sequence Structure ◽

A Genome ◽

Context Specific ◽

And Function

Abstract: riboCIRC is a translatome data-oriented circRNA database specifically designed for hosting, exploring, analyzing, and visualizing translatable circRNAs from multiple species. The database provides a comprehensive repository of computationally predicted ribosome-associated circRNAs; a manually curated collection of experimentally verified translated circRNAs; an evaluation of cross-species conservation of translatable circRNAs; a systematic de novo annotation of putative circRNA-encoded peptides, including sequence, structure, and function; and a genome browser to visualize the context-specific occupant footprints of circRNAs. It represents a valuable resource for the circRNA research community and is publicly available at http://www.ribocirc.com.