False gene and chromosome losses affected by assembly and sequence errors

2021 ◽  
Author(s):  
Juwan Kim ◽  
Chul Lee ◽  
Byung June Ko ◽  
DongAhn Yoo ◽  
Sohyoung Won ◽  
...  

Many genome assemblies have been found to be incomplete and to contain mis-assemblies. The Vertebrate Genomes Project (VGP) has been producing assemblies with an emphasis on being as complete and error-free as possible, utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. Here we evaluate these new vertebrate genome assemblies relative to the previous references for the same species, including a mammal (platypus), two birds (zebra finch, Anna's hummingbird), and a fish (climbing perch). We found that 3 to 11% of genomic sequence was entirely missing in the previous reference assemblies, including nearly entire GC-rich and repeat-rich microchromosomes with high gene density. Genome-wide, between 25 and 60% of the genes were either completely or partially missing in the previous assemblies, and this was in part due to a bias in GC-rich 5'-proximal promoters and 5' exon regions. Our findings reveal novel regulatory landscapes and protein-coding sequences that have been greatly underestimated in previous assemblies and are now present in the VGP assemblies.

2016 ◽  
Author(s):  
Derek M. Bickhart ◽  
Benjamin D. Rosen ◽  
Sergey Koren ◽  
Brian L. Sayre ◽  
Alex R. Hastie ◽  
...  

The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms have resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate the current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most contiguous de novo mammalian assembly to date, with chromosome-length scaffolds and only 663 gaps. Our assembly represents a >250-fold improvement in contiguity compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, supporting the most complete repeat family and immune gene complex representation ever produced for a ruminant species.


2020 ◽  
Author(s):  
Yun Sun ◽  
Dongdong Zhang ◽  
Jianzhi Shi ◽  
Guisen Chen ◽  
Ying Wu ◽  
...  

Cromileptes altivelas, which belongs to Serranidae in the order Perciformes, is widely distributed throughout the tropical waters of the Indo-West Pacific regions. Owing to its excellent food quality and abundant nutrients, it has become a popular marine food fish with high market value. Here, we report a chromosome-level genome assembly and annotation of the humpback grouper genome using more than 103X PacBio long reads and high-throughput chromosome conformation capture (Hi-C) technologies. The contig N50 of the assembly is 4.14 Mbp, the final assembly is 1.07 Gb with a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. The high-quality genome assembly also showed high gene completeness, with 27,067 protein-coding genes and 3,710 ncRNAs. This highly accurate genome assembly and annotation will not only provide an essential genome resource for C. altivelas breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.
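As a point of reference for the contig and scaffold N50 values quoted in this abstract, the following is a minimal sketch of how N50 is conventionally computed from a list of sequence lengths. The function name `n50` and the toy lengths are illustrative assumptions; this is not the authors' pipeline, and real assemblies would be read from a FASTA file.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    half = sum(ordered) / 2                  # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical toy assembly (arbitrary units), total length 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70: 80 + 70 = 150 reaches half of 300
```

Because N50 is normalized by the assembly's own total length, it rewards long sequences but says nothing about missing sequence, so it is usually read alongside a completeness measure.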


2020 ◽  
Vol 3 (1) ◽  
Author(s):  
Ryoichi Yano ◽  
Tohru Ariizumi ◽  
Satoko Nonaka ◽  
Yoichi Kawazu ◽  
Silin Zhong ◽  
...  

Melon exhibits substantial natural variation especially in fruit ripening physiology, including both climacteric (ethylene-producing) and non-climacteric types. However, genomic mechanisms underlying such variation are not yet fully understood. Here, we report an Oxford Nanopore-based high-grade genome reference in the semi-climacteric cultivar Harukei-3 (378 Mb + 33,829 protein-coding genes), with an update of tissue-wide RNA-seq atlas in the Melonet-DB database. Comparison between Harukei-3 and DHL92, the first published melon genome, enabled identification of 24,758 one-to-one orthologue gene pairs, whereas others were candidates of copy number variation or presence/absence polymorphisms (PAPs). Further comparison based on 10 melon genome assemblies identified genome-wide PAPs of 415 retrotransposon Gag-like sequences. Of these, 160 showed fruit ripening-inducible expression, with 59.4% of the neighboring genes showing similar expression patterns (r > 0.8). Our results suggest that retrotransposons contributed to the modification of gene expression during diversification of melon genomes, and may affect fruit ripening-inducible gene expression.


2021 ◽  
Vol 12 ◽  
Author(s):  
Choon Meng Tan ◽  
Yu-Chen Lin ◽  
Jian-Rong Li ◽  
Yuan-Yu Chien ◽  
Chien-Jui Wang ◽  
...  

Phytoplasmas are uncultivated plant-pathogenic bacteria with agricultural importance. Those belonging to the 16SrII group, represented by ‘Candidatus P. aurantifolia’, have a wide range of plant hosts and cause significant yield losses in valuable crops, such as pear, sweet potato, peanut, and soybean. In this study, a method that combines immunoprecipitation-based enrichment and MinION long-read DNA sequencing was developed to solve the challenge of phytoplasma genome studies. This approach produced long reads with high mapping rates and high genomic coverage that can be combined with Illumina reads to produce complete genome assemblies with high accuracy. We applied this method to strain NCHU2014 and determined its complete genome sequence, which consists of one circular chromosome of 635,584 bp and one plasmid of 4,224 bp. Although ‘Ca. P. aurantifolia’ NCHU2014 has a small chromosome with only 471 protein-coding genes, it contains 33 transporter genes and 27 putative effector genes, which may contribute to obtaining nutrients from hosts and manipulating host development for its survival and multiplication. Two effectors, the homologs of SAP11 and SAP54/PHYL1 identified in ‘Ca. P. aurantifolia’ NCHU2014, have biochemical activity in destabilizing host transcription factors, which can explain the disease symptoms observed in infected plants. Taken together, this study provides the first complete genome available for the 16SrII phytoplasmas and contributes to the understanding of phytoplasma pathogenicity.


2017 ◽  
Author(s):  
Emily J. Shields ◽  
Roberto Bonasio

Ants are an emerging model system for neuroepigenetics, as embryos with virtually identical genomes develop into different adult castes that display strikingly different physiology, morphology, and behavior. Although a number of ant genomes have been sequenced to date, their draft quality is an obstacle to sophisticated analyses of epigenetic gene regulation. Using long reads generated with Pacific Biosciences single-molecule real-time sequencing, we have reassembled de novo high-quality genomes for two ant species: Camponotus floridanus and Harpegnathos saltator. The long reads allowed us to span large repetitive regions and join sequences previously found in separate scaffolds, leading to comprehensive and accurate protein-coding annotations that facilitated the identification of a Gp-9-like gene as differentially expressed in Harpegnathos castes. The new assemblies also enabled us to annotate long non-coding RNAs for the first time in ants, revealing several that were specifically expressed during Harpegnathos development and in the brains of different castes. These upgraded genomes, along with the new coding and non-coding annotations, will aid future efforts to identify epigenetic mechanisms of phenotypic and behavioral plasticity in ants.


2006 ◽  
Vol 73 ◽  
pp. 59-66 ◽  
Author(s):  
Nick Gilbert ◽  
Wendy A. Bickmore

It has generally been assumed that transcriptionally active genes are in an ‘open’ chromatin structure and that silent genes have a ‘closed’ chromatin structure. Here we re-assess this axiom in the light of genome-wide studies of chromatin fibre structure. Using a combination of sucrose gradient sedimentation and genomic microarrays of the human genome, we argue that open chromatin fibres originate from regions of high gene density, whether or not those genes are transcriptionally active.


Author(s):  
Alex Di Genova ◽  
Elena Buena-Atienza ◽  
Stephan Ossowski ◽  
Marie-France Sagot

Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24–80.64 Mb), few assembly errors (contig NGA50: 11.8–59.59 Mb), good consensus quality (QV: 27.84–42.88) and high gene completeness (BUSCO complete: 94.6–95.2%), while consuming low computational resources (CPU hours: 187–1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).
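To make the NG50 and QV figures in this abstract easier to interpret, here is a small sketch under stated assumptions: NG50 is taken as the N50-style statistic computed against an estimated genome size rather than the assembly total, and QV is taken as the standard Phred-scaled consensus error rate. The function names and toy values are illustrative, and this is not WENGAN's own evaluation code. NGA50, which is computed on blocks aligned to a reference, is not implemented here.

```python
import math


def ng50(lengths, genome_size):
    """Length L such that sequences >= L cover half of genome_size.

    Returns None if the assembly covers less than half of the genome.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):  # longest first
        running += length
        if running >= half:
            return length
    return None


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Hypothetical toy contigs against an assumed genome size of 400 units.
print(ng50([80, 70, 50, 40, 30, 20, 10], genome_size=400))

# Error rates implied by the QV range quoted in the abstract.
for qv in (27.84, 42.88):
    print(qv, qv_to_error_rate(qv))
```

Under this definition a QV of 30 corresponds to one expected error per 1,000 bases, so the quoted range spans roughly one error per several hundred bases to one per tens of thousands.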


2018 ◽  
Author(s):  
Cristina Sisu ◽  
Paul Muir ◽  
Adam Frankish ◽  
Ian Fiddes ◽  
Mark Diekhans ◽  
...  

Pseudogenes are ideal markers of genome remodeling. In turn, the mouse is an ideal platform for studying them, particularly with the availability of developmental transcriptional data and the sequencing of 18 strains. Here, we present a comprehensive genome-wide annotation of the pseudogenes in the mouse reference genome and associated strains. We compiled this by combining manual curation of over 10,000 pseudogenes with results from automatic annotation pipelines. Also, by comparing the human and mouse, we annotated 165 unitary pseudogenes in mouse, and 303 unitaries in human. We make all our annotation available through mouse.pseudogene.org. The overall mouse pseudogene repertoire (in the reference and strains) is similar to human in terms of overall size, biotype distribution (~80% processed/~20% duplicated) and top family composition (with many GAPDH and ribosomal pseudogenes). However, notable differences arise in the pseudogene age distribution, with multiple retro-transpositional bursts in mouse evolutionary history and only one in human. Furthermore, in each strain about a fifth of the pseudogenes are unique, reflecting strain-specific functions and evolution. Additionally, we find that ~15% of the pseudogenes are transcribed, a fraction similar to that for human, and that pseudogene transcription exhibits greater tissue and strain specificity compared to protein-coding genes. Finally, we show that highly transcribed parent genes tend to give rise to processed pseudogenes.


2020 ◽  
Vol 27 ◽  
Author(s):  
Giulia De Riso ◽  
Sergio Cocozza

Epigenetics is a field of biological sciences focused on the study of reversible, heritable changes in gene function not due to modifications of the genomic sequence. These changes are the result of a complex cross-talk between several molecular mechanisms, which is in turn orchestrated by genetic and environmental factors. The epigenetic profile captures the unique regulatory landscape and the exposure to environmental stimuli of an individual. It thus constitutes a valuable reservoir of information for personalized medicine, which is aimed at customizing health-care interventions based on the unique characteristics of each individual. Nowadays, the complex milieu of epigenomic marks can be studied at the genome-wide level thanks to massive, high-throughput technologies. This new experimental approach is opening up new and interesting knowledge perspectives. However, the analysis of these complex omic data raises important analytic issues. Artificial Intelligence, and in particular Machine Learning (ML), is emerging as a powerful resource to decipher epigenomic data. In this review, we will first describe the most used ML approaches in epigenomics. We will then recapitulate some of the recent applications of ML to epigenomic analysis. Finally, we will provide some examples of how the ML approach to epigenetic data can be useful for personalized medicine.

