Systematic Discovery of Conservation States for Single-Nucleotide Annotation of the Human Genome

Mapping Intimacies ◽

10.1101/262097 ◽

2018 ◽

Author(s):

Adriana Sperlea ◽

Jason Ernst

Keyword(s):

Human Genome ◽

Sequence Alignment ◽

De Novo ◽

Sequence Data ◽

Biological Significance ◽

Evolutionary Constraint ◽

Single Nucleotide ◽

DNA Sequence Alignment ◽

Genome Annotations ◽

Nucleotide Resolution

Abstract: Comparative genomics sequence data is an important source of information for interpreting genomes. Genome-wide annotations based on this data have largely focused on univariate scores or binary calls of evolutionary constraint. Here we present a complementary whole genome annotation approach, ConsHMM, which applies a multivariate hidden Markov model to learn de novo different ‘conservation states’ based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multiple species DNA sequence alignment. We applied ConsHMM to a 100-way vertebrate sequence alignment to annotate the human genome at single nucleotide resolution into 100 different conservation states. These states have distinct enrichments for other genomic information including gene annotations, chromatin states, and repeat families, which were used to characterize their biological significance. Conservation states have greater or complementary predictive information than standard constraint based measures for a variety of genome annotations. Bases in constrained elements have distinct heritability enrichments depending on the conservation state assignment, demonstrating their relevance to analyzing phenotypic associated variation. The conservation states also highlight differences in the conservation patterns of bases prioritized by a number of scores used for variant prioritization. The ConsHMM method and conservation state annotations provide a valuable resource for interpreting genomes and genetic variation.
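The abstract describes observations built from "which species align to and match a reference genome". A minimal sketch of one way such per-base observations could be encoded is given below. The three labels, the function names, and the use of "-" as the only gap symbol are assumptions made for illustration, not details taken from the ConsHMM software; fitting the hidden Markov model itself is not shown.

```python
from typing import Dict, List

# Hypothetical labels for the three per-species outcomes at one reference base.
MATCH = "match"          # species aligns and has the same base as the reference
MISMATCH = "mismatch"    # species aligns but has a different base
UNALIGNED = "unaligned"  # species has a gap at this position


def encode_column(ref_base: str, other_bases: Dict[str, str]) -> Dict[str, str]:
    """Label each non-reference species at a single alignment column."""
    labels = {}
    for species, base in other_bases.items():
        if base == "-":  # assumption: "-" is the only gap symbol
            labels[species] = UNALIGNED
        elif base.upper() == ref_base.upper():
            labels[species] = MATCH
        else:
            labels[species] = MISMATCH
    return labels


def encode_alignment(reference: str, others: Dict[str, str]) -> List[Dict[str, str]]:
    """Encode every reference position; all rows must have the same length."""
    for species, row in others.items():
        if len(row) != len(reference):
            raise ValueError(f"row for {species} has a different length")
    return [
        encode_column(reference[i], {s: row[i] for s, row in others.items()})
        for i in range(len(reference))
    ]


if __name__ == "__main__":
    toy = encode_alignment("ACGT", {"chimp": "ACGA", "mouse": "A-GT"})
    for position, labels in enumerate(toy):
        print(position, labels)
```

Each position yields one label per species, so a sequence of these label vectors is the kind of multivariate input a hidden Markov model could segment into states.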

ConsHMM Atlas: conservation state annotations for major genomes and human genetic variation

10.1101/2020.03.01.955443 ◽

2020 ◽

Cited By ~ 1

Author(s):

Adriana Arneson ◽

Brooke Felsheim ◽

Jennifer Chien ◽

Jason Ernst

Keyword(s):

Genetic Variation ◽

Sequence Alignment ◽

Human Genetic Variation ◽

Web Interface ◽

Single Nucleotide ◽

DNA Sequence Alignment ◽

Single Genome ◽

Nucleotide Mutation ◽

Allele Specific ◽

Genome Annotations

Abstract: ConsHMM is a method recently introduced to annotate genomes into conservation states, which are defined based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multi-species DNA sequence alignment. Previously, ConsHMM was only applied to a single genome for one multi-species sequence alignment. Here we apply ConsHMM to produce 22 additional genome annotations covering human and seven other organisms for a variety of multi-species alignments. Additionally, we have extended ConsHMM to generate allele specific annotations, which we used to produce conservation state annotations for every possible single nucleotide mutation in the human genome. Finally, we provide a web interface to interactively visualize parameters and annotation enrichments for ConsHMM models. These annotations and visualizations comprise the ConsHMM Atlas, which we expect will be a valuable resource for analyzing a variety of major genomes and genetic variation.

ConsHMM Atlas: conservation state annotations for major genomes and human genetic variation

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa104 ◽

2020 ◽

Vol 2 (4) ◽

Author(s):

Adriana Arneson ◽

Brooke Felsheim ◽

Jennifer Chien ◽

Jason Ernst

Keyword(s):

Genetic Variation ◽

Sequence Alignment ◽

Human Genetic Variation ◽

Web Interface ◽

Single Nucleotide ◽

DNA Sequence Alignment ◽

Single Genome ◽

Nucleotide Mutation ◽

Allele Specific ◽

Genome Annotations

Abstract: ConsHMM is a method recently introduced to annotate genomes into conservation states, which are defined based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multi-species DNA sequence alignment. Previously, ConsHMM was only applied to a single genome for one multi-species sequence alignment. Here, we apply ConsHMM to produce 22 additional genome annotations covering human and seven other organisms for a variety of multi-species alignments. Additionally, we extend ConsHMM to generate allele-specific annotations, which we use to produce conservation state annotations for every possible single-nucleotide mutation in the human genome. Finally, we provide a web interface to interactively visualize parameters and annotation enrichments for ConsHMM models. These annotations and visualizations comprise the ConsHMM Atlas, which we expect will be a valuable resource for analyzing a variety of major genomes and genetic variation.

Increased Frequency of De Novo Copy Number Variants in Congenital Heart Disease by Integrative Analysis of Single Nucleotide Polymorphism Array and Exome Sequence Data

Circulation Research ◽

10.1161/circresaha.115.304458 ◽

2014 ◽

Vol 115 (10) ◽

pp. 884-896 ◽

Cited By ~ 146

Author(s):

Joseph T. Glessner ◽

Alexander G. Bick ◽

Kaoru Ito ◽

Jason G. Homsy ◽

Laura Rodriguez-Murillo ◽

...

Keyword(s):

Copy Number ◽

Congenital Heart ◽

De Novo ◽

Sequence Data ◽

Single Nucleotide Polymorphism Array ◽

Copy Number Variants ◽

Nucleotide Polymorphism ◽

Single Nucleotide ◽

Exome Sequence Data ◽

Exome Sequence

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline PEPPER-Margin-DeepVariant that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole genome-scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read based genotyping fails. We show that our pipeline can provide highly-contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% to 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance than the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads

10.1101/635037 ◽

2019 ◽

Cited By ~ 7

Author(s):

Mitchell R. Vollger ◽

Glennis A. Logsdon ◽

Peter A. Audano ◽

Arvis Sulovari ◽

David Porubsky ◽

...

Keyword(s):

Human Genome ◽

Single Molecule ◽

Tandem Repeats ◽

De Novo ◽

Sequence Data ◽

Gene Annotation ◽

Hydatidiform Mole ◽

High Fidelity ◽

Human Genomes ◽

Long Read

Abstract: The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective stand-alone technology for de novo assembly of human genomes.
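For readers unfamiliar with the NG50 statistic quoted above, the sketch below computes it under the usual definition: the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of an assumed genome (or region) size. The function name and the toy contig lengths are illustrative; only the 480.6 kbp and 191.5 kbp figures come from the abstract.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Return NG50, or 0 if the contigs cover less than half of genome_size."""
    target = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


if __name__ == "__main__":
    # Toy contigs against an assumed genome size of 200 units.
    print(ng50([50, 40, 30, 20, 10], 200))

    # Fold change between the two NG50 values reported in the abstract.
    print(round(480.6 / 191.5, 1))
```

Using the assembly's own total length in place of `genome_size` gives the related N50 statistic, which is why the two can differ for an incomplete assembly.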

Discovery, genotyping and characterization of structural variation and novel sequence at single nucleotide resolution from de novo genome assemblies on a population scale

GigaScience ◽

10.1186/s13742-015-0103-4 ◽

2015 ◽

Vol 4 (1) ◽

Cited By ~ 11

Author(s):

Siyang Liu ◽

Shujia Huang ◽

Junhua Rao ◽

Weijian Ye ◽

...

Keyword(s):

Structural Variation ◽

De Novo ◽

Single Nucleotide ◽

Population Scale ◽

Nucleotide Resolution ◽

Genome Assemblies ◽

Single Nucleotide Resolution

Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly

Nature Biotechnology ◽

10.1038/nbt.1904 ◽

2011 ◽

Vol 29 (8) ◽

pp. 723-730 ◽

Cited By ~ 88

Author(s):

Yingrui Li ◽

Hancheng Zheng ◽

Ruibang Luo ◽

Honglong Wu ◽

Hongmei Zhu ◽

...

Keyword(s):

De Novo Assembly ◽

Structural Variation ◽

De Novo ◽

Whole Genome ◽

Single Nucleotide ◽

Human Genomes ◽

Nucleotide Resolution ◽

Single Nucleotide Resolution

Alignment by numbers: sequence assembly using compressed numerical representations

10.1101/011940 ◽

2014 ◽

Cited By ~ 2

Author(s):

Avraam Tapinos ◽

Bede Constantinides ◽

Douglas B Kell ◽

David L Robertson

Keyword(s):

Dimensionality Reduction ◽

Sequence Alignment ◽

De Novo ◽

Sequence Data ◽

Sequence Assembly ◽

Viral Population ◽

Sequential Data ◽

Data Intensive ◽

Reduction Methods ◽

Feature Selection Approach

Motivation: DNA sequencing instruments are enabling genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and interpret sequence data. Established methods for computational sequence analysis generally use nucleotide-level resolution of sequences, and while such approaches can be very accurate, increasingly ambitious and data-intensive analyses are rendering them impractical for applications such as genome and metagenome assembly. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction methods are routinely used to reduce the computational burden of analyses. We therefore seek to address the question of whether it is possible to improve the efficiency of sequence alignment by applying dimensionality reduction methods to numerically represented nucleotide sequences. Results: To explore the applicability of signal transformation and dimensionality reduction methods to sequence assembly, we implemented a short read aligner and evaluated its performance against simulated high diversity viral sequences alongside four existing aligners. Using our sequence transformation and feature selection approach, alignment time was reduced by up to 14-fold compared to uncompressed sequences and without reducing alignment accuracy. Despite using highly compressed sequence transformations, our implementation yielded alignments of similar overall accuracy to existing aligners, outperforming all other tools tested at high levels of sequence variation. Our approach was also applied to the de novo assembly of a simulated diverse viral population. Our results demonstrate that full sequence resolution is not a prerequisite of accurate sequence alignment and that analytical performance can be retained and even enhanced through appropriate dimensionality reduction of sequences.
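The idea of reducing the dimensionality of a numerically represented sequence can be illustrated with a small sketch. The abstract does not name the specific transformations used, so the integer mapping and the piecewise aggregate approximation (PAA, a standard time-series reduction that averages consecutive segments) shown here are illustrative choices, not the authors' implementation.

```python
from typing import List

# Arbitrary illustrative mapping from nucleotides to numbers (an assumption).
BASE_VALUES = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def to_numeric(sequence: str) -> List[float]:
    """Convert a nucleotide string to a numeric signal; fails on other symbols."""
    return [BASE_VALUES[base] for base in sequence.upper()]


def paa(signal: List[float], n_segments: int) -> List[float]:
    """Piecewise aggregate approximation: the mean of each of n_segments chunks."""
    n = len(signal)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(signal)")
    reduced = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        chunk = signal[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced


if __name__ == "__main__":
    signal = to_numeric("AACCGGTT")
    print(paa(signal, 4))  # 8 values reduced to 4 segment means
```

Comparing reads in the reduced space trades per-base detail for fewer values per sequence, which is the kind of trade-off the abstract evaluates.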

Prediction of replication time zones at single nucleotide resolution in the human genome

FEBS Letters ◽

10.1016/j.febslet.2008.06.008 ◽

2008 ◽

Vol 582 (16) ◽

pp. 2441-2444 ◽

Cited By ~ 1

Author(s):

Feng Gao ◽

Chun-Ting Zhang

Keyword(s):

Human Genome ◽

Single Nucleotide ◽

Replication Time ◽

Time Zones ◽

Nucleotide Resolution ◽

Single Nucleotide Resolution

Nanopore sequencing and assembly of a human genome with ultra-long reads

10.1101/128835 ◽

2017 ◽

Cited By ~ 51

Author(s):

Miten Jain ◽

S Koren ◽

J Quick ◽

AC Rand ◽

TA Sasani ◽

...

Keyword(s):

Human Genome ◽

Cancer Progression ◽

De Novo ◽

Sequence Data ◽

Point Of Care ◽

Genetic Diseases ◽

Nanopore Sequencing ◽

Repeat Structure ◽

Long Reads ◽

Amazon Web Services

Abstract: Nanopore sequencing is a promising technique for genome sequencing due to its portability, ability to sequence long reads from single molecules, and to simultaneously assay DNA methylation. However until recently nanopore sequencing has been mainly applied to small genomes, due to the limited output attainable. We present nanopore sequencing and assembly of the GM12878 Utah/Ceph human reference genome generated using the Oxford Nanopore MinION and R9.4 version chemistry. We generated 91.2 Gb of sequence data (∼30× theoretical coverage) from 39 flowcells. De novo assembly yielded a highly complete and contiguous assembly (NG50 ∼3Mb). We observed considerable variability in homopolymeric tract resolution between different basecallers. The data permitted sensitive detection of both large structural variants and epigenetic modifications. Further we developed a new approach exploiting the long-read capability of this system and found that adding an additional 5×-coverage of ‘ultra-long’ reads (read N50 of 99.7kb) more than doubled the assembly contiguity. Modelling the repeat structure of the human genome predicts extraordinarily contiguous assemblies may be possible using nanopore reads alone. Portable de novo sequencing of human genomes may be important for rapid point-of-care diagnosis of rare genetic diseases and cancer, and monitoring of cancer progression. The complete dataset including raw signal is available as an Amazon Web Services Open Dataset at: https://github.com/nanopore-wgs-consortium/NA12878.
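The "∼30× theoretical coverage" figure follows from dividing total sequenced bases by genome size. The sketch below reproduces that arithmetic; the 3.1 Gb haploid genome size is an assumed round figure and is not stated in the abstract.

```python
def theoretical_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth if the sequenced bases were spread evenly over the genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    sequenced = 91.2e9        # 91.2 Gb, from the abstract
    assumed_genome = 3.1e9    # assumption: approximate human genome size
    print(f"{theoretical_coverage(sequenced, assumed_genome):.1f}x")
```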
