Recovery of non-reference sequences missing from the human reference genome

Mapping Intimacies ◽

10.21203/rs.2.11742/v1 ◽

2019 ◽

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Protein Coding Genes ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract: The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6,113 NRS adding up to 12.8 Mb. Besides 1,571 insertions, we detected 3,041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1,143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Our study enriched the spectrum of human genetic variations.
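The 90% identity threshold used to define alternate alleles can be illustrated with a minimal sketch. The function names and the naive column-by-column identity measure over pre-aligned sequences are assumptions made for illustration; the abstract does not describe how identity was actually computed.

```python
def percent_identity(ref: str, alt: str) -> float:
    """Naive percent identity of two pre-aligned, equal-length sequences.

    Gap columns ('-') never count as matches. Hypothetical helper, not the
    authors' method.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to equal length")
    if not ref:
        return 0.0
    matches = sum(1 for r, a in zip(ref, alt) if r == a and r != "-")
    return 100.0 * matches / len(ref)


def is_alternate_allele(identity_pct: float, threshold: float = 90.0) -> bool:
    """Apply the abstract's rule: less than 90% identity with the reference."""
    return identity_pct < threshold


# Toy sequences, invented for the example.
reference = "ACGTACGTAC"
candidate = "ACGTACTTAA"  # differs from the reference at two positions
identity = percent_identity(reference, candidate)
print(identity, is_alternate_allele(identity))
```

A sequence with no detectable identity to the reference allele would score 0 under this measure and so would also fall below the threshold, matching the "(or no) identity" case.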

Recovery of non-reference sequences missing from the human reference genome

BMC Genomics ◽

10.1186/s12864-019-6107-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 3

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract: Background: The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6113 NRS adding up to 12.8 Mb. Besides 1571 insertions, we detected 3041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggested that the origin of alternate alleles was associated with tandem repeats. Our results enriched the spectrum of genetic variations in the human genome.

Recovery of non-reference sequences missing from the human reference genome

10.21203/rs.2.11742/v2 ◽

2019 ◽

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract: Background: The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6,113 NRS adding up to 12.8 Mb. Besides 1,571 insertions, we detected 3,041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1,143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggested that the origin of alternate alleles was associated with tandem repeats. Our results enriched the spectrum of genetic variations in the human genome.

Recovery of non-reference sequences missing from the human reference genome

10.21203/rs.2.11742/v3 ◽

2019 ◽

Author(s):

Ran Li ◽

Xiaomeng Tian ◽

Peng Yang ◽

Yingzhi Fan ◽

Ming Li ◽

...

Keyword(s):

Human Genome ◽

Tandem Repeats ◽

Reference Genome ◽

De Novo ◽

Precise Location ◽

Protein Coding ◽

Human Reference Genome ◽

Mhc Haplotype ◽

Reference Sequences ◽

Flanking Regions

Abstract: Background: The non-reference sequences (NRS) represent structural variations in the human genome with potential functional significance. However, besides the known insertions, it is currently unknown whether other types of structural variations with NRS exist. Results: Here, we compared 31 human de novo assemblies with the current reference genome to identify the NRS and their locations. We resolved the precise location of 6,113 NRS adding up to 12.8 Mb. Besides 1,571 insertions, we detected 3,041 alternate alleles, which were defined as having less than 90% (or no) identity with the reference alleles. These alternate alleles overlapped with 1,143 protein-coding genes, including a putative novel MHC haplotype. Further, we demonstrated that the alternate alleles and their flanking regions had a high content of tandem repeats, indicating that their origin was associated with tandem repeats. Conclusions: Our study detected a large number of NRS, including many alternate alleles that were previously uncharacterized. We suggested that the origin of alternate alleles was associated with tandem repeats. Our results enriched the spectrum of genetic variations in the human genome.

Assembly and Annotation of an Ashkenazi Human Reference Genome

10.1101/2020.03.18.997395 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alaina Shumate ◽

Aleksey V. Zimin ◽

Rachel M. Sherman ◽

Daniela Puiu ◽

Justin M. Wagner ◽

...

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Gene Families ◽

Gene Content ◽

Specific Reference ◽

Protein Coding ◽

Human Reference Genome ◽

Protein Coding Genes ◽

Reference Genomes ◽

Similar Gene

Abstract: Here we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. 40 of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. 11 genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.
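The figures quoted in this abstract imply two simple derived quantities, worked through in the short sketch below. The variable names are illustrative only; the input numbers are taken directly from the abstract.

```python
# Nucleotide counts reported in the abstract.
ash1_nt = 2_973_118_650
grch38_nt = 2_937_639_212

# Extra sequence in Ash1 relative to GRCh38.
extra_nt = ash1_nt - grch38_nt  # 35,479,438 nucleotides (about 35.5 Mb)

# Protein-coding gene counts reported in the abstract.
genes_total = 20_157
genes_near_identical = 19_563  # >99% identical to GRCh38 counterparts

# Genes that are not >99% identical, and the near-identical share.
genes_remaining = genes_total - genes_near_identical  # 594 genes
near_identical_pct = 100 * genes_near_identical / genes_total

print(f"Extra nucleotides in Ash1: {extra_nt:,}")
print(f"Genes not >99% identical: {genes_remaining}")
print(f"Share >99% identical: {near_identical_pct:.1f}%")
```

In other words, roughly 97% of the annotated protein-coding genes are near-identical to their GRCh38 counterparts, and "the remaining genes" refers to about 600 genes.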

From de novo to ‘de nono’: The majority of novel protein coding genes identified with phylostratigraphy are old genes or recent duplicates

Genome Biology and Evolution ◽

10.1093/gbe/evy231 ◽

2018 ◽

Cited By ~ 2

Author(s):

Claudio Casola

Keyword(s):

De Novo ◽

Protein Coding ◽

Protein Coding Genes ◽

Novel Protein

Chromosome-level assembly of Drosophila bifasciata reveals important karyotypic transition of the X chromosome

10.1101/847558 ◽

2019 ◽

Author(s):

Ryan Bracewell ◽

Anita Tran ◽

Kamalakar Chatla ◽

Doris Bachtrog

Keyword(s):

X Chromosome ◽

Genome Assembly ◽

De Novo ◽

Pericentromeric Region ◽

Species Group ◽

Chromosome 15 ◽

Protein Coding ◽

Protein Coding Genes ◽

Long Read ◽

Chromosome Level

Abstract: The Drosophila obscura species group is one of the most studied clades of Drosophila and harbors multiple distinct karyotypes. Here we present a de novo genome assembly and annotation of D. bifasciata, a species which represents an important subgroup for which no high-quality chromosome-level genome assembly currently exists. We combined long-read sequencing (Nanopore) and Hi-C scaffolding to achieve a highly contiguous genome assembly approximately 193 Mb in size, with repetitive elements constituting 30.1% of the total length. Drosophila bifasciata harbors four large metacentric chromosomes and the small dot, and our assembly contains each chromosome in a single scaffold, including the highly repetitive pericentromeres, which were largely composed of Jockey and Gypsy transposable elements. We annotated a total of 12,821 protein-coding genes, and comparisons of synteny with D. athabasca orthologs show that the large metacentric pericentromeric regions of multiple chromosomes are conserved between these species. Importantly, Muller A (X chromosome) was found to be metacentric in D. bifasciata, and the pericentromeric region appears homologous to the pericentromeric region of the fused Muller A-AD (XL and XR) of pseudoobscura/affinis subgroup species. Our finding suggests that a metacentric ancestral X fused to a telocentric Muller D and created the large neo-X (Muller A-AD) chromosome ∼15 MYA. We also confirm the fusion of Muller C and D in D. bifasciata and show that it likely involved a centromere-centromere fusion.

Draft genome assembly data of Anoxybacillus sp. strain MB8 isolated from Tattapani hot springs, India

10.1101/2021.06.09.447659 ◽

2021 ◽

Author(s):

Vishnu Prasoodanan P K ◽

Shruti S. Menon ◽

Rituja Saxena ◽

Prashant Waiker ◽

Vineet K Sharma

Keyword(s):

Hot Springs ◽

De Novo ◽

Draft Genome ◽

Gc Content ◽

Central India ◽

Glycoside Hydrolases ◽

Rrna Gene ◽

Aerobic Bacterium ◽

Protein Coding ◽

Protein Coding Genes

Discovery of novel thermophiles has shown promising applications in the field of biotechnology. Due to their thermal stability, they can survive harsh industrial processes, which makes them important to characterize and study. Members of Anoxybacillus are alkaline-tolerant thermophiles and have been extensively isolated from manure, dairy-processing plants, and geothermal hot springs. This article reports the assembled data of an aerobic bacterium, Anoxybacillus sp. strain MB8, isolated from the Tattapani hot springs in Central India, whose 16S rRNA gene shares an identity of 97% (99% coverage) with Anoxybacillus kamchatkensis strain G10. The de novo assembly of the genome of Anoxybacillus sp. strain MB8 comprises 2,898,780 bp (in 190 contigs) with a GC content of 41.8%, and its annotation includes 2,976 protein-coding genes, 1 rRNA operon, 73 tRNAs, 1 tm-RNA and 10 CRISPR arrays. The predicted protein-coding genes have been classified into 21 eggNOG categories. The KEGG Automated Annotation Server (KAAS) analysis indicated the presence of an assimilatory sulfate reduction pathway, a nitrate-reducing pathway, and genes for glycoside hydrolases (GHs) and glycoside transferases (GTs). GHs and GTs have widespread applications in the baking and food industry for bread manufacturing, and in the paper, detergent and cosmetic industries. Hence, Anoxybacillus sp. strain MB8 holds the potential to be screened and characterized for such commercially relevant enzymes.

Integrating healthcare and research genetic data empowers the discovery of 28 novel developmental disorders

10.1101/797787 ◽

2019 ◽

Cited By ~ 14

Author(s):

Joanna Kaplanis ◽

Kaitlin E. Samocha ◽

Laurens Wiel ◽

Zhancheng Zhang ◽

Kevin J. Arvai ◽

...

Keyword(s):

Developmental Disorders ◽

De Novo ◽

Genetic Data ◽

Statistical Test ◽

Integrated Healthcare ◽

Protein Coding ◽

Protein Coding Genes ◽

Clinical Diagnostic ◽

Simulation Based

Summary: De novo mutations (DNMs) in protein-coding genes are a well-established cause of developmental disorders (DD). However, known DD-associated genes only account for a minority of the observed excess of such DNMs. To identify novel DD-associated genes, we integrated healthcare and research exome sequences on 31,058 DD parent-offspring trios, and developed a simulation-based statistical test to identify gene-specific enrichments of DNMs. We identified 285 significantly DD-associated genes, including 28 not previously robustly associated with DDs. Despite detecting more DD-associated genes than in any previous study, much of the excess of DNMs of protein-coding genes remains unaccounted for. Modelling suggests that over 1,000 novel DD-associated genes await discovery, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of dominant DDs.

Phylogenetic relationships and taxonomic position of genus Hyperacrius (Rodentia: Arvicolinae) from Kashmir based on evidences from analysis of mitochondrial genome and study of skull morphology

PeerJ ◽

10.7717/peerj.10364 ◽

2020 ◽

Vol 8 ◽

pp. e10364

Author(s):

Natalia I. Abramson ◽

Fedor N. Golenishchev ◽

Semen Yu. Bodrov ◽

Olga V. Bondareva ◽

Evgeny A. Genelt-Yanovskiy ◽

...

Keyword(s):

Mitochondrial Genome ◽

De Novo ◽

Phylogenetic Analyses ◽

Complete Mitochondrial Genome ◽

Morphological Characters ◽

Molecular Data ◽

Phylogenetic Position ◽

Skull Morphology ◽

Protein Coding ◽

Protein Coding Genes

In this article, we present the nearly complete mitochondrial genome of the Subalpine Kashmir vole Hyperacrius fertilis (Arvicolinae, Cricetidae, Rodentia), assembled using data from Illumina next-generation sequencing (NGS) of the DNA from a century-old museum specimen. De novo assembly consisted of 16,341 bp and included all mitogenome protein-coding genes as well as 12S and 16S RNAs, tRNAs and D-loop. Using the alignment of protein-coding genes of 14 previously published Arvicolini tribe mitogenomes, seven Clethrionomyini mitogenomes, and also Ondatra and Dicrostonyx outgroups, we conducted phylogenetic reconstructions based on a dataset of 13 protein-coding genes (PCGs) under maximum likelihood and Bayesian inference. Phylogenetic analyses robustly supported the phylogenetic position of this species within the tribe Arvicolini. Among the Arvicolini, Hyperacrius represents one of the early-diverged lineages. This result of phylogenetic analysis altered the conventional view on phylogenetic relatedness between Hyperacrius and Alticola and prompted the revision of morphological characters underlying the former assumption. Morphological analysis performed here confirmed molecular data and provided additional evidence for taxonomic replacement of the genus Hyperacrius from the tribe Clethrionomyini to the tribe Arvicolini.

A Chromosome-Scale Genome Assembly Resource for Myriosclerotinia sulcatula Infecting Sedge Grass (Carex sp.)

Molecular Plant-Microbe Interactions ◽

10.1094/mpmi-03-20-0060-a ◽

2020 ◽

Vol 33 (7) ◽

pp. 880-883

Author(s):

Stefan Kusch ◽

Heba M. M. Ibrahim ◽

Catherine Zanchetta ◽

Celine Lopez-Roques ◽

Cecile Donnadieu ◽

...

Keyword(s):

Host Range ◽

Sclerotinia Sclerotiorum ◽

Genome Assembly ◽

Plant Pathogens ◽

Reference Genome ◽

Close Relative ◽

High Quality ◽

Protein Coding ◽

Protein Coding Genes ◽

Reference Genome Assembly

The fungus Myriosclerotinia sulcatula is a close relative of the notorious polyphagous plant pathogens Botrytis cinerea and Sclerotinia sclerotiorum but exhibits a host range restricted to plants from the Carex genus (Cyperaceae family). To date, there are no genomic resources available for fungi in the Myriosclerotinia genus. Here, we present a chromosome-scale reference genome assembly for M. sulcatula. The assembly contains 24 contigs with a total length of 43.53 Mbp, with scaffold N50 of 2,649.7 kbp and N90 of 1,133.1 kbp. BRAKER-predicted gene models were manually curated using WebApollo, resulting in 11,275 protein-coding genes that we functionally annotated. We provide a high-quality reference genome assembly and annotation for M. sulcatula as a resource for studying evolution and pathogenicity in fungi from the Sclerotiniaceae family.
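For readers unfamiliar with the N50 and N90 statistics quoted above, the following is a minimal sketch of how such values are conventionally computed. The function name and the toy lengths are assumptions for illustration; the abstract does not list the individual contig or scaffold lengths of the M. sulcatula assembly.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx statistic for a list of sequence lengths.

    Nxx is the length L such that sequences of length >= L together
    cover at least `fraction` of the total assembly length
    (fraction=0.5 gives N50, fraction=0.9 gives N90).
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return min(lengths)  # only reached through floating-point edge cases


# Toy lengths (arbitrary units), not taken from the paper.
toy_lengths = [5, 4, 3, 2, 1]
n50 = nxx(toy_lengths, 0.5)
n90 = nxx(toy_lengths, 0.9)
print(n50, n90)
```

Because N90 requires covering more of the assembly, it can never exceed N50, which is consistent with the reported values of 2,649.7 kbp and 1,133.1 kbp.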
