Using reference-free compressed data structures to analyse sequencing reads from thousands of human genomes

Mapping Intimacies ◽

10.1101/060186 ◽

2016 ◽

Cited By ~ 1

Author(s):

Dirk D. Dolle ◽

Zhicheng Liu ◽

Matthew Cotten ◽

Jared T. Simpson ◽

Zamin Iqbal ◽

...

Keyword(s):

Data Structures ◽

De Novo ◽

Sequencing Data ◽

T Lymphotropic Virus ◽

Viral Genomes ◽

1000 Genomes ◽

Base Position ◽

Human Genomes ◽

Compressed Data Structures ◽

Burrows Wheeler Transform

AbstractWe are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project. A key feature is that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out non-reference queries to search for the presence of all known viral genomes, and discover human T-lymphotropic virus 1 integrations in six samples in a recognised epidemiological distribution.

State of the art de novo assembly of human genomes from massively parallel sequencing data

Human Genomics ◽

10.1186/1479-7364-4-4-271 ◽

2010 ◽

Vol 4 (4) ◽

pp. 271 ◽

Cited By ~ 49

Author(s):

Yingrui Li ◽

Yujie Hu ◽

Lars Bolund ◽

Jun Wang

Keyword(s):

De Novo Assembly ◽

De Novo ◽

State Of The Art ◽

Massively Parallel Sequencing ◽

Massively Parallel ◽

Sequencing Data ◽

Parallel Sequencing ◽

Human Genomes

Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST) is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCodeTM WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

Efficient de novo assembly of large genomes using compressed data structures

Genome Research ◽

10.1101/gr.126953.111 ◽

2011 ◽

Vol 22 (3) ◽

pp. 549-556 ◽

Cited By ~ 463

Author(s):

J. T. Simpson ◽

R. Durbin

Keyword(s):

Data Structures ◽

De Novo Assembly ◽

De Novo ◽

Compressed Data Structures ◽

Compressed Data ◽

Large Genomes

NovoGraph: Human genome graph construction from multiple long-read de novo assemblies

F1000Research ◽

10.12688/f1000research.15895.2 ◽

2018 ◽

Vol 7 ◽

pp. 1391

Author(s):

Evan Biederstedt ◽

Jeffrey C. Oliver ◽

Nancy F. Hansen ◽

Aarti Jajoo ◽

Nathan Dunn ◽

...

Keyword(s):

Human Genome ◽

De Novo ◽

Wide Spectrum ◽

Third Party ◽

Sequencing Data ◽

Multiple Sequence ◽

Human Genomes ◽

A Genome ◽

Long Read ◽

Genome Graph

Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped.

An integrated personal and population-based Egyptian genome reference

Nature Communications ◽

10.1038/s41467-020-17964-1 ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Inken Wohlers ◽

Axel Künstner ◽

Matthias Munz ◽

Michael Olbrich ◽

Anke Fähnrich ◽

...

Keyword(s):

Genetic Variation ◽

De Novo ◽

Disease Risk ◽

Population Based ◽

European Ancestry ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Human Genomes ◽

Genome Wide ◽

Polygenic Scores

Abstract A small number of de novo assembled human genomes have been reported to date, and few have been complemented with population-based genetic variation, which is particularly important for North Africa, a region underrepresented in current genome-wide references. Here, we combine long- and short-read whole-genome sequencing data with recent assembly approaches into a de novo assembly of an Egyptian genome. The assembly demonstrates well-balanced quality metrics and is complemented with variant phasing via linked reads into haploblocks, which we associate with gene expression changes in blood. To construct an Egyptian genome reference, we identify genome-wide genetic variation within a cohort of 110 Egyptian individuals. We show that differences in allele frequencies and linkage disequilibrium between Egyptians and Europeans may compromise the transferability of European ancestry-based genetic disease risk and polygenic scores, substantiating the need for multi-ethnic genome references. Thus, the Egyptian genome reference will be a valuable resource for precision medicine.

Algorithms for the compression of genomic big data

10.7287/peerj.preprints.2176 ◽

2016 ◽

Author(s):

Nicola Prezza ◽

Alberto Policriti

Keyword(s):

Data Structures ◽

Dynamic Data Structures ◽

Dynamic Data ◽

Working Space ◽

Input Size ◽

Desktop Computers ◽

Compressed Data Structures ◽

Text Collections ◽

Two Measures ◽

Burrows Wheeler Transform

Motivations. Building the Burrows-Wheeler transform (BWT) and computing the Lempel-Ziv parsing (LZ77) of huge collections of genomes is becoming an important task in bioinformatic analyses as these datasets often need to be compressed and indexed prior to analysis. Given that the sizes of such datasets often exceed RAM capacity of common machines however, standard algorithms cannot be used to solve this problem as they require a working space at least linear in the input size. One way to solve this problem is to exploit the intrinsic compressibility of such datasets: two genomes from the same species share most of their information (often more than 99%), so families of genomes can be considerably compressed. A solution to the above problem could therefore be that of designing algorithms working in compressed working space, i.e. algorithms that stream the input from disk and require in RAM a space that is proportional to the size of the compressed text. Methods. In this work we present algorithms and data structures to compress and index text in compressed working space. These results build upon compressed dynamic data structures, a sub-field of compressed data structures research that is lately receiving a lot of attention. We focus on two measures of compressibility: the empirical entropy H of the text and the number r of equal-letter runs in the BWT of the text. We show how to build the BWT and LZ77 using only O(Hn) and (rlog n) working space, n being the size of the collection. For the case of repetitive text collections (such as sets of genomes from the same species), this considerably improves the working space required by state-of-the art algorithms in the literature. The algorthms and data structures here discussed have all been implemented in a public C++ library, available at github.com/nicolaprezza/DYNAMIC. The library includes dynamic gap-encoded bitvectors, run-length encoded (RLE) strings, and RLE FM-indexes. Results. We conclude with an overview of the experimental results that we obtained running our algorithms on highly repetitive genomic datasets. As expected, our solutions require only a small fraction of the working space used by solutions working in non-compressed space, making it feasible to compute BWT and LZ77 of huge collections of genomes even on desktop computers with small amounts of RAM available. As a downside of using complex dynamic data structures however, running times are still not practical so improvements such as parallelization may be needed in order to make these solutions fully practical.

Jasmine: Population-scale structural variant comparison and analysis

10.1101/2021.05.27.445886 ◽

2021 ◽

Author(s):

Melanie Kirsche ◽

Gautam Prabhu ◽

Rachel Sherman ◽

Bohan Ni ◽

Sergey Aganezov ◽

...

Keyword(s):

De Novo ◽

Population Analysis ◽

Accurate Method ◽

Structural Variants ◽

Sequencing Data ◽

1000 Genomes ◽

Dna And Rna ◽

Long Reads ◽

Long Read ◽

Proximity Graph

The increasing availability of long-reads is revolutionizing studies of structural variants (SVs). However, because SVs vary across individuals and are discovered through imprecise read technologies and methods, they can be difficult to compare. Addressing this, we present Jasmine (https://github.com/mkirsche/Jasmine), a fast and accurate method for SV refinement, comparison, and population analysis. Using an SV proximity graph, Jasmine outperforms five widely-used comparison methods, including reducing the rate of Mendelian discordance in trio datasets by more than five-fold, and reveals a set of high confidence de novo SVs confirmed by multiple long-read technologies. We also present a harmonized callset of 205,192 SVs from 31 samples of diverse ancestry sequenced with long reads. We genotype these SVs in 444 short read samples from the 1000 Genomes Project with both DNA and RNA sequencing data and assess their widespread impact on gene expression, including within several medically relevant genes.

WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

10.1101/840447 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alex Di Genova ◽

Elena Buena-Atienza ◽

Stephan Ossowski ◽

Marie-France Sagot

Keyword(s):

De Novo ◽

Computational Cost ◽

Sequence Information ◽

Sequencing Data ◽

High Quality ◽

Sequencing Technologies ◽

Human Genomes ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

The continuous improvement of long-read sequencing technologies along with the development of ad-doc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50:16.67-62.06 Mb), few assembly errors (contig NGA50:10.9-45.91 Mb), good consensus quality (QV:27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours:153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50:45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50:57.88 Mb). Providing highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan

Efficient hybrid de novo assembly of human genomes with WENGAN

Nature Biotechnology ◽

10.1038/s41587-020-00747-w ◽

2020 ◽

Author(s):

Alex Di Genova ◽

Elena Buena-Atienza ◽

Stephan Ossowski ◽

Marie-France Sagot

Keyword(s):

De Novo Assembly ◽

De Novo ◽

Consensus Sequence ◽

Computational Cost ◽

Sequencing Data ◽

Human Genomes ◽

Long Reads ◽

High Gene ◽

Computational Resources ◽

Genome Assemblies

AbstractGenerating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24–80.64 Mb), few assembly errors (contig NGA50: 11.8–59.59 Mb), good consensus quality (QV: 27.84–42.88) and high gene completeness (BUSCO complete: 94.6–95.2%), while consuming low computational resources (CPU hours: 187–1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).

The quiescent X, the replicative Y and the Autosomes

10.1101/351288 ◽

2018 ◽

Cited By ~ 2

Author(s):

Guillaume Achaz ◽

Serge Gangloff ◽

Benoit Arcangioli

Keyword(s):

De Novo ◽

Yeast Cells ◽

Simple Pattern ◽

Maternal Lineage ◽

1000 Genomes Project ◽

Phase 3 ◽

X Chromosomes ◽

1000 Genomes ◽

Y Chromosomes ◽

Human Genomes

AbstractFrom the analysis of the mutation spectrum in the 2,504 sequenced human genomes from the 1000 genomes project (phase 3), we show that sexual chromosomes (X and Y) exhibit a different proportion of indel mutations than autosomes (A), ranking them X>A>Y. We further show that X chromosomes exhibit a higher ratio of deletion/insertion when compared to autosomes. This simple pattern shows that the recent report that non-dividing quiescent yeast cells accumulate relatively more indels (and particularly deletions) than replicating ones also applies to metazoan cells, including humans. Indeed, the X chromosomes display more indels than the autosomes, having spent more time in quiescent oocytes, whereas the Y chromosomes are solely present in the replicating spermatocytes. From the proportion of indels, we have inferred that de novo mutations arising in the maternal lineage are twice more likely to be indels than mutations from the paternal lineage. Our observation, consistent with a recent trio analysis of the spectrum of mutations inherited from the maternal lineage, is likely a major component in our understanding of the origin of anisogamy.