Long-read assembly of a Great Dane genome highlights the contribution of GC-rich sequence and mobile elements to canine genomes

Technological advances have allowed improvements in genome reference sequence assemblies. Here, we combined long- and short-read sequence resources to assemble the genome of a female Great Dane dog. This assembly has improved continuity compared to the existing Boxer-derived (CanFam3.1) reference genome. Annotation of the Great Dane assembly identified 22,182 protein-coding gene models and 7,049 long noncoding RNAs, including 49 protein-coding genes not present in the CanFam3.1 reference. The Great Dane assembly spans the majority of sequence gaps in the CanFam3.1 reference and illustrates that 2,151 gaps overlap the transcription start site of a predicted protein-coding gene. Moreover, a subset of the resolved gaps, which have an 80.95% median GC content, localize to transcription start sites and recombination hotspots more often than expected by chance, suggesting the stable canine recombinational landscape has shaped genome architecture. Alignment of the Great Dane and CanFam3.1 assemblies identified 16,834 deletions and 15,621 insertions, as well as 2,665 deletions and 3,493 insertions located on secondary contigs. These structural variants are dominated by retrotransposon insertion/deletion polymorphisms and include 16,221 dimorphic canine short interspersed elements (SINECs) and 1,121 dimorphic long interspersed element-1 sequences (LINE-1_Cfs). Analysis of sequences flanking the 3′ end of LINE-1_Cfs (i.e., LINE-1_Cf 3′-transductions) suggests multiple retrotransposition-competent LINE-1_Cfs segregate among dog populations. Consistent with this conclusion, we demonstrate that a canine LINE-1_Cf element with intact open reading frames can retrotranspose its own RNA and that of a SINEC_Cf consensus sequence in cultured human cells, implicating ongoing retrotransposon activity as a driver of canine genetic variation.

Long-read assembly of a Great Dane genome highlights the contribution of GC-rich sequence and mobile elements to canine genomes

10.1101/2020.07.31.231761 ◽

2020 ◽

Author(s):

Julia V. Halo ◽

Amanda L. Pendleton ◽

Feichen Shen ◽

Aurélien J. Doucet ◽

Thomas Derrien ◽

...

Keyword(s):

Consensus Sequence ◽

GC Content ◽

Human Cells ◽

Transcription Start ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Cultured Human Cells ◽

Retrotransposon Activity ◽

Great Dane

Abstract Technological advances have allowed improvements in genome reference sequence assemblies. Here, we combined long- and short-read sequence resources to assemble the genome of a female Great Dane dog. This assembly has improved continuity compared to the existing Boxer-derived (CanFam3.1) reference genome. Annotation of the Great Dane assembly identified 22,182 protein-coding gene models and 7,049 long non-coding RNAs, including 49 protein-coding genes not present in the CanFam3.1 reference. The Great Dane assembly spans the majority of sequence gaps in the CanFam3.1 reference and illustrates that 2,151 gaps overlap the transcription start site of a predicted protein-coding gene. Moreover, a subset of the resolved gaps, which have an 80.95% median GC content, localize to transcription start sites and recombination hotspots more often than expected by chance, suggesting the stable canine recombinational landscape has shaped genome architecture. Alignment of the Great Dane and CanFam3.1 assemblies identified 16,834 deletions and 15,621 insertions, as well as 2,665 deletions and 3,493 insertions located on secondary contigs. These structural variants are dominated by retrotransposon insertion/deletion polymorphisms and include 16,221 dimorphic canine short interspersed elements (SINECs) and 1,121 dimorphic long interspersed element-1 sequences (LINE-1_Cfs). Analysis of sequences flanking the 3′ end of LINE-1_Cfs (i.e., LINE-1_Cf 3′-transductions) suggests multiple retrotransposition-competent LINE-1_Cfs segregate among dog populations. Consistent with this conclusion, we demonstrate that a canine LINE-1_Cf element with intact open reading frames can retrotranspose its own RNA and that of a SINEC_Cf consensus sequence in cultured human cells, implicating ongoing retrotransposon activity as a driver of canine genetic variation. Significance Advancements in long-read DNA sequencing technologies provide more comprehensive views of genomes.
We used long-read sequences to assemble a Great Dane dog genome that provides several improvements over the existing reference derived from a Boxer dog. Assembly comparisons revealed that gaps in the Boxer assembly often occur at the beginning of protein-coding genes and have a high-GC content, which likely reflects limitations of previous technologies in resolving GC-rich sequences. Dimorphic LINE-1 and SINEC retrotransposon sequences represent the predominant differences between the Great Dane and Boxer assemblies. Proof-of-principle experiments demonstrated that expression of a canine LINE-1 could promote the retrotransposition of itself and a SINEC_Cf consensus sequence in cultured human cells. Thus, ongoing retrotransposon activity may contribute to canine genetic diversity.
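The "median GC content" figure quoted for the resolved gaps can be illustrated with a small calculation. The following is only a minimal sketch: the function names and the toy sequences are hypothetical, and the abstract does not describe how the authors computed their statistic (for example, how ambiguous bases were handled). Here, bases other than A, C, G, and T are simply ignored.

```python
from statistics import median


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    total = gc + at
    return gc / total if total else 0.0


def median_gc_percent(seqs) -> float:
    """Median of the per-sequence GC fractions, expressed as a percentage."""
    return 100.0 * median(gc_fraction(s) for s in seqs)


# Toy input (hypothetical sequences, not data from the study).
toy_gaps = ["GGCC", "GCGA", "ATAT"]
print(median_gc_percent(toy_gaps))
```

Taking the median of per-sequence values, rather than pooling all bases, keeps a few very long sequences from dominating the summary.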

Mitochondrial genome evolution of placozoans: gene rearrangements and repeat expansions

Genome Biology and Evolution ◽

10.1093/gbe/evaa213 ◽

2020 ◽

Author(s):

Hideyuki Miyazawa ◽

Hans-Jürgen Osigus ◽

Sarah Rolfes ◽

Kai Kamm ◽

Bernd Schierwater ◽

...

Keyword(s):

Mitochondrial Genome ◽

GC Content ◽

Distribution Patterns ◽

Sister Group ◽

Open Reading Frames ◽

Mitochondrial Genomes ◽

Inverted Repeats ◽

Protein Coding ◽

Group I ◽

Phylogenomic Analyses

Abstract Placozoans, non-bilaterian animals with the simplest known metazoan bauplan, are currently classified into 20 haplotypes belonging to three genera, Polyplacotoma, Trichoplax, and Hoilungia. The latter two comprise two and five clades, respectively. In Trichoplax and Hoilungia, previous studies on six haplotypes belonging to four different clades have shown that their mtDNA are circular chromosomes of 32-43 kbp in size, which encode 12 protein-coding genes, 24 tRNAs, and 2 rRNAs. These mitochondrial genomes (mitogenomes) also show unique features rarely seen in other metazoans, including open reading frames (ORFs) of unknown function, and group I and II introns. Here, we report seven new mitogenomes, covering the five previously described haplotypes H2, H17, H19, H9, and H11, as well as two new haplotypes, H23 (clade III) and H24 (clade VII). The overall gene content is shared between all placozoan mitochondrial genomes, but genome sizes, gene orders, and several exon-intron boundaries vary among clades. Phylogenomic analyses strongly support a tree topology different from previous 16S rRNA analyses, with clade VI as the sister group to all other Hoilungia clades. We found small inverted repeats in all 13 mitochondrial genomes of the Trichoplax and Hoilungia genera and evaluated their distribution patterns among haplotypes. Since P. mediterranea (H0), the sister to the remaining haplotypes, has a small mitochondrial genome with few small inverted repeats and ORFs, we hypothesized that the proliferation of inverted repeats and ORFs substantially contributed to the observed increase in the size and GC content of the Trichoplax and Hoilungia mitochondrial genomes.

An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics

10.1101/153213 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ulrich Omasits ◽

Adithi R. Varadarajan ◽

Michael Schmid ◽

Sandra Goetze ◽

Damianos Melidis ◽

...

Keyword(s):

Gene Prediction ◽

Bartonella Henselae ◽

Prokaryotic Genome ◽

GC Content ◽

Laboratory Strain ◽

Open Reading Frames ◽

General Applicability ◽

Protein Coding ◽

Prokaryotic Genomes ◽

Coding Potential

Abstract Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy towards accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics dataset against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and variants identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin, and release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli as well as the software to generate such proteogenomics search databases for any prokaryote.

RIFRAF: a frame-resolving consensus algorithm

10.1101/227520 ◽

2017 ◽

Author(s):

Kemal Eren ◽

Ben Murrell

Keyword(s):

Search Algorithm ◽

Consensus Sequence ◽

Consensus Algorithm ◽

Reference Sequence ◽

Protein Coding ◽

Independent Sequence ◽

Reading Frame ◽

Sequencing Errors ◽

Novel Structure ◽

Long Read

Abstract Motivation: Protein coding genes can be studied using long-read next generation sequencing. However, high rates of indel sequencing errors are problematic, corrupting the reading frame. Even the consensus of multiple independent sequence reads retains indel errors. To solve this problem, we introduce RIFRAF, a sequence consensus algorithm that takes a set of error-prone reads and a reference sequence and infers an accurate in-frame consensus. RIFRAF uses a novel structure, analogous to a two-layer hidden Markov model: the consensus is optimized to maximize alignment scores with both the set of noisy reads and with a reference. The template-to-reads component of the model encodes the preponderance of indels, and is sensitive to the per-base quality scores, giving greater weight to more accurate bases. The reference-to-template component of the model penalizes frame-destroying indels. A local search algorithm proceeds in stages to find the best consensus sequence for both objectives. Results: Using Pacific Biosciences SMRT sequences of NL4-3 env, we compare our approach to other consensus and frame correction methods. RIFRAF consistently finds a consensus sequence that is more accurate and in-frame, especially with small numbers of reads. It was able to perfectly reconstruct over 80% of consensus sequences from as few as three reads, whereas the best alternative required twice as many. RIFRAF is able to achieve these results and keep the consensus in-frame even with a distantly related reference sequence. Moreover, unlike other frame correction methods, RIFRAF can detect and keep true indels while removing erroneous ones. Availability: RIFRAF is implemented in Julia, and source code is publicly available at https://github.com/MurrellGroup/[email protected]
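To make the notion of a "frame-destroying" indel concrete, the sketch below shows how a single-base insertion shifts every downstream codon, whereas an indel whose length is a multiple of three leaves the frame intact. This is an illustration only, written in Python rather than Julia, and it is not the RIFRAF algorithm; the helper names and the toy sequence are hypothetical.

```python
def codons(seq: str) -> list:
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def preserves_frame(indel_length: int) -> bool:
    """An indel keeps the downstream reading frame only if its length is a multiple of 3."""
    return indel_length % 3 == 0


# Toy template (hypothetical): start codon, two sense codons, stop codon.
template = "ATGGCCAAATGA"
# Simulate a one-base insertion error after the fourth base.
with_insertion = template[:4] + "T" + template[4:]

print(codons(template))        # codons of the intact template
print(codons(with_insertion))  # downstream codons shift and the stop codon is lost
```

A consensus method that only maximizes agreement with noisy reads has no reason to avoid such shifts, which is why the abstract describes an additional reference-based component that penalizes them.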

Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome

Genome Biology ◽

10.1186/s13059-021-02369-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Robin-Lee Troskie ◽

Yohaann Jafrani ◽

Tim R. Mercer ◽

Adam D. Ewing ◽

Geoffrey J. Faulkner ◽

...

Keyword(s):

Cultured Cells ◽

Open Reading Frames ◽

cDNA Sequencing ◽

Protein Coding ◽

Dynamic Component ◽

Gene Copies ◽

Long Read ◽

Normal Human ◽

Reading Frames ◽

Transcriptional Landscape

Abstract Pseudogenes are gene copies presumed to mainly be functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we CRISPR-Cas9 delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.

Genome Sequence of the Asian Honeybee in Pakistan Sheds Light on Its Phylogenetic Relationship with Other Honeybees

Insects ◽

10.3390/insects12070652 ◽

2021 ◽

Vol 12 (7) ◽

pp. 652

Author(s):

Hongwei Tan ◽

Muhammad Naeem ◽

Hussain Ali ◽

Muhammad Shakeel ◽

Haiou Kuang ◽

...

Keyword(s):

Phylogenetic Relationship ◽

Genome Sequence ◽

Apis Cerana ◽

GC Content ◽

Protein Domain ◽

Pollination Services ◽

Protein Coding ◽

Close Relationship ◽

Genome Scale ◽

Asian Honeybee

In Pakistan, Apis cerana, the Asian honeybee, has been used for honey production and pollination services. However, its genomic makeup and phylogenetic relationship with those in other countries are still unknown. We collected A. cerana samples from the main cerana-keeping region in Pakistan and performed whole genome sequencing. A total of 28 Gb of Illumina shotgun reads were generated, which were used to assemble the genome. The obtained genome assembly had a total length of 214 Mb, with a GC content of 32.77%. The assembly had a scaffold N50 of 2.85 Mb and a BUSCO completeness score of 99%, suggesting a remarkably complete genome sequence for A. cerana in Pakistan. A MAKER pipeline was employed to annotate the genome sequence, and a total of 11,864 protein-coding genes were identified. Of them, 6750 genes were assigned at least one GO term, and 8813 genes were annotated with at least one protein domain. Genome-scale phylogeny analysis indicated an unexpectedly close relationship between A. cerana in Pakistan and those in China, suggesting a potential human introduction of the species between the two countries. Our results will facilitate the genetic improvement and conservation of A. cerana in Pakistan.
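For readers unfamiliar with the scaffold N50 statistic cited above, a minimal sketch follows. It uses the common definition: the length of the scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. The function name and the toy lengths are hypothetical, and the abstract does not state which tool the authors used to compute their value.

```python
def n50(lengths) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


# Toy scaffold lengths (hypothetical): total 20, so half the assembly is 10.
print(n50([8, 5, 4, 3]))
```

A higher N50 indicates that more of the assembly sits in long scaffolds, which is why it is reported alongside total length and BUSCO completeness.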

Disrupting upstream translation in mRNAs is associated with human disease

Nature Communications ◽

10.1038/s41467-021-21812-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

David S. M. Lee ◽

Joseph Park ◽

Andrew Kromer ◽

Aris Baras ◽

Daniel J. Rader ◽

...

Keyword(s):

Protein Expression ◽

Biological Significance ◽

Ribosome Profiling ◽

Open Reading Frames ◽

Protein Coding ◽

Stop Codons ◽

Human Genes ◽

Strong Negative Selection ◽

Disease Associations ◽

Reading Frames

Abstract Ribosome profiling has uncovered pervasive translation in non-canonical open reading frames; however, the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5′ UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to protein-coding missense variants. Using these variants, we map and validate gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest translation-disrupting mechanisms relating uORF variation to reduced protein expression, and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.
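As a rough illustration of what an upstream open reading frame is, the sketch below scans a 5′ UTR sequence for an ATG followed by an in-frame stop codon. It is a deliberately simplified, hypothetical helper: it only reports uORFs that terminate within the UTR, ignores non-ATG start codons, and does not reflect how the study defined translated uORFs from ribosome-profiling data.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr: str) -> list:
    """Return (start, end) index pairs of ATG...stop ORFs lying fully inside the UTR.

    `start` is the index of the A in ATG; `end` is one past the stop codon,
    so utr[start:end] is the full uORF sequence.
    """
    utr = utr.upper()
    found = []
    for start in range(len(utr) - 2):
        if utr[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                found.append((start, pos + 3))
                break
    return found


# Toy 5' UTR (hypothetical sequence): one ATG with an in-frame TAA.
toy_utr = "CCATGGCTTAAGG"
print(find_uorfs(toy_utr))
```

A variant that creates a new in-frame stop codon in such a region would shorten the uORF, which is the class of change the abstract reports as being under strong negative selection.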

Transposase-Mediated Excision, Conjugative Transfer, and Diversity of ICE6013 Elements in Staphylococcus aureus

Journal of Bacteriology ◽

10.1128/jb.00629-16 ◽

2017 ◽

Vol 199 (8) ◽

Cited By ~ 11

Author(s):

Emily A. Sansevere ◽

Xiao Luo ◽

Joo Youn Park ◽

Sunghyun Yoon ◽

Keun Seok Seo ◽

...

Keyword(s):

Staphylococcus Aureus ◽

Sequence Analysis ◽

Integration Site ◽

Consensus Sequence ◽

Open Reading Frames ◽

Conjugative Plasmid ◽

Site Preference ◽

Conjugative Transfer ◽

Content Type ◽

Integrative Conjugative Elements

ABSTRACT ICE6013 represents one of two families of integrative conjugative elements (ICEs) identified in the pan-genome of the human and animal pathogen Staphylococcus aureus. Here we investigated the excision and conjugation functions of ICE6013 and further characterized the diversity of this element. ICE6013 excision was not significantly affected by growth, temperature, pH, or UV exposure and did not depend on recA. The IS30-like DDE transposase (Tpase; encoded by orf1 and orf2) of ICE6013 must be uninterrupted for excision to occur, whereas disrupting three of the other open reading frames (ORFs) on the element significantly affects the level of excision. We demonstrate that ICE6013 conjugatively transfers to different S. aureus backgrounds at frequencies approaching that of the conjugative plasmid pGO1. We found that excision is required for conjugation, that not all S. aureus backgrounds are successful recipients, and that transconjugants acquire the ability to transfer ICE6013. Sequencing of chromosomal integration sites in serially passaged transconjugants revealed a significant integration site preference for a 15-bp AT-rich palindromic consensus sequence, which surrounds the 3-bp target site that is duplicated upon integration. A sequence analysis of ICE6013 from different host strains of S. aureus and from eight other species of staphylococci identified seven divergent subfamilies of ICE6013 that include sequences previously classified as a transposon, a plasmid, and various ICEs. In summary, these results indicate that the IS30-like Tpase functions as the ICE6013 recombinase and that ICE6013 represents a diverse family of mobile genetic elements that mediate conjugation in staphylococci. IMPORTANCE Integrative conjugative elements (ICEs) encode the abilities to integrate into and excise from bacterial chromosomes and plasmids and mediate conjugation between bacteria. As agents of horizontal gene transfer, ICEs may affect bacterial evolution. 
ICE6013 represents one of two known families of ICEs in the pathogen Staphylococcus aureus, but its core functions of excision and conjugation are not well studied. Here, we show that ICE6013 depends on its IS30-like DDE transposase for excision, which is unique among ICEs, and we demonstrate the conjugative transfer and integration site preference of ICE6013. A sequence analysis revealed that ICE6013 has diverged into seven subfamilies that are dispersed among staphylococci.

Functional Screening of a Metagenomic Library Reveals Operons Responsible for Enhanced Intestinal Colonization by Gut Commensal Microbes

Applied and Environmental Microbiology ◽

10.1128/aem.00581-13 ◽

2013 ◽

Vol 79 (12) ◽

pp. 3829-3838 ◽

Cited By ~ 15

Author(s):

Mi Young Yoon ◽

Kang-Mu Lee ◽

Yujin Yoon ◽

Junhyeok Go ◽

Yongjin Park ◽

...

Keyword(s):

BAC Library ◽

Metagenomic Library ◽

Artificial Chromosome ◽

Open Reading Frames ◽

Functional Screening ◽

Sequence Alignments ◽

Protein Coding ◽

Content Type ◽

Usage Analysis ◽

Functional Screens

ABSTRACT Evidence suggests that gut microbes colonize the mammalian intestine through propagation as an adhesive microbial community. A bacterial artificial chromosome (BAC) library of murine bowel microbiota DNA in the surrogate host Escherichia coli DH10B was screened for enhanced adherence capability. Two out of 5,472 DH10B clones, 10G6 and 25G1, exhibited enhanced capabilities to adhere to inanimate surfaces in functional screens. DNA segments inserted into the 10G6 and 25G1 clones were 52 and 41 kb and included 47 and 41 protein-coding open reading frames (ORFs), respectively. DNA sequence alignments, tetranucleotide frequency, and codon usage analysis strongly suggest that these two DNA fragments are derived from species belonging to the genus Bacteroides. Consistent with this finding, a large portion of the predicted gene products were highly homologous to those of Bacteroides spp. Transposon mutagenesis and subsequent experiments that involved heterologous expression identified two operons associated with enhanced adherence. E. coli strains transformed with the 10a or 25b operon adhered to the surface of intestinal epithelium and colonized the mouse intestine more vigorously than did the control strain. This study has revealed the genetic determinants of unknown commensals (probably resembling Bacteroides species) that enhance the ability of the bacteria to colonize the murine bowel.

Genome Sequence of Pseudomonas sp. Strain LAP_36, A Rhizosphere Bacterium Isolated from King George Island, Antarctica

Microbiology Resource Announcements ◽

10.1128/mra.00731-21 ◽

2021 ◽

Vol 10 (48) ◽

Author(s):

Sarah Mederos da Silveira ◽

Sheila da Silva ◽

Andrew Macrae ◽

Rommel T. J. Ramos ◽

Fabrício A. Araújo ◽

...

Keyword(s):

Genome Sequence ◽

Draft Genome ◽

GC Content ◽

South Shetland Islands ◽

Deschampsia Antarctica ◽

King George Island ◽

Protein Coding ◽

Content Type ◽

Shetland Islands ◽

Rhizosphere Bacterium

Pseudomonas sp. strain LAP_36 was isolated from rhizosphere soil from Deschampsia antarctica on King George Island, South Shetland Islands, Antarctica. Here, we report on its draft genome sequence, which consists of 8,794,771 bp with 60.0% GC content and 8,011 protein-coding genes.
