ntEdit: scalable genome sequence polishing

2019 ◽  
Vol 35 (21) ◽  
pp. 4430-4432 ◽  
Author(s):  
René L Warren ◽  
Lauren Coombe ◽  
Hamid Mohamadi ◽  
Jessica Zhang ◽  
Barry Jaquish ◽  
...  

Abstract. Motivation: In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results: We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 min, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and the whole genome. ntEdit scaled linearly, executing in 30–40 min on those sequences. We show how ntEdit ran in <2 h 20 min to improve upon long- and linked-read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frameshifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo-haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation: https://github.com/bcgsc/ntedit. Supplementary information: Supplementary data are available at Bioinformatics online.
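The core idea behind Bloom filter-based polishing can be sketched in a few lines: load read k-mers into a Bloom filter, then flag draft positions whose overlapping k-mers all lack read support. This is a hedged toy, not ntEdit's actual implementation or API; the real tool builds its filters with companion utilities and additionally tries alternative bases/indels and re-checks k-mer support before editing.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over strings, for illustration only; production
    polishers use far more cache- and memory-efficient filters."""

    def __init__(self, size_bits=1 << 20, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def flag_suspect_positions(draft, read_kmer_filter, k):
    """Flag draft positions where every overlapping k-mer is absent from
    the read k-mer filter -- candidates for a base edit."""
    absent = [km not in read_kmer_filter for km in kmers(draft, k)]
    suspects = []
    for pos in range(len(draft)):
        cover = absent[max(0, pos - k + 1):pos + 1]
        if cover and all(cover):
            suspects.append(pos)
    return suspects

k = 5
bf = BloomFilter()
for read in ["ACGTACGTAC", "CGTACGTACG"]:
    for km in kmers(read, k):
        bf.add(km)

draft = "ACGTACGTTC"  # one substitution relative to the reads
suspects = flag_suspect_positions(draft, bf, k)  # flags the error-affected tail
```

Because a Bloom filter stores only bits rather than the k-mers themselves, memory stays bounded regardless of read volume, which is what lets this style of approach scale to gigabase genomes.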



Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Abstract. Motivation: Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results: We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole-genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and implementation: TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information: Supplementary data are available at Bioinformatics online.
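The underlying signal can be illustrated with a toy calculation: in a proliferating population, early-replicating bins are present at higher copy number, so read depth along a chromosome tracks replication timing once normalized. This sketch is an assumption-laden simplification of that idea, not TIGER's actual pipeline (which includes GC correction, filtering, and segmentation); all function names here are hypothetical.

```python
import math
import statistics

def smooth(values, window=3):
    """Simple centered moving average; TIGER itself applies more careful
    filtering than this."""
    half = window // 2
    return [statistics.mean(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def relative_timing(binned_depth):
    """Log2 ratio of each bin's read depth to the chromosome mean, smoothed.
    Values above zero suggest earlier replication (higher copy number in a
    proliferating sample); values below zero suggest later replication."""
    mean_depth = statistics.mean(binned_depth)
    ratios = [math.log2(d / mean_depth) for d in binned_depth]
    return smooth(ratios)

# Toy read counts per fixed-size bin along one chromosome arm.
depth = [120, 118, 110, 95, 90, 92, 105, 115]
timing = relative_timing(depth)  # high at the ends, dips in the middle
```

The key point, as in the abstract, is that no replication-specific assay is needed: ordinary whole-genome sequence coverage from proliferating cells already carries the timing signal.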


2017 ◽  
Vol 114 (27) ◽  
pp. E5379-E5388 ◽  
Author(s):  
Jaebum Kim ◽  
Marta Farré ◽  
Loretta Auvil ◽  
Boris Capitanu ◽  
Denis M. Larkin ◽  
...  

Whole-genome assemblies of 19 placental mammals and two outgroup species were used to reconstruct the order and orientation of syntenic fragments in chromosomes of the eutherian ancestor and six other descendant ancestors leading to human. For ancestral chromosome reconstructions, we developed an algorithm (DESCHRAMBLER) that probabilistically determines the adjacencies of syntenic fragments using chromosome-scale and fragmented genome assemblies. The reconstructed chromosomes of the eutherian, boreoeutherian, and euarchontoglires ancestor each included >80% of the entire length of the human genome, whereas reconstructed chromosomes of the most recent common ancestor of simians, catarrhini, great apes, and humans and chimpanzees included >90% of human genome sequence. These high-coverage reconstructions permitted reliable identification of chromosomal rearrangements over ∼105 My of eutherian evolution. Orangutan was found to have eight chromosomes that were completely conserved in homologous sequence order and orientation with the eutherian ancestor, the largest number for any species. Ruminant artiodactyls had the highest frequency of intrachromosomal rearrangements, and interchromosomal rearrangements dominated in murid rodents. A total of 162 chromosomal breakpoints in evolution of the eutherian ancestral genome to the human genome were identified; however, the rate of rearrangements was significantly lower (0.80/My) during the first ∼60 My of eutherian evolution, then increased to greater than 2.0/My along the five primate lineages studied. Our results significantly expand knowledge of eutherian genome evolution and will facilitate greater understanding of the role of chromosome rearrangements in adaptation, speciation, and the etiology of inherited and spontaneously occurring diseases.
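The adjacency step described above can be caricatured in a few lines: each candidate adjacency between two syntenic fragments is scored by the summed weights of the descendant genomes supporting it, and the best-supported neighbor wins. This is only a hedged toy of that idea; DESCHRAMBLER's published algorithm is substantially more involved, and the fragment names and weighting scheme below are invented for illustration.

```python
from collections import defaultdict

def best_adjacencies(observations):
    """observations: (fragment_a, fragment_b, weight) tuples, where the
    weight might reflect the phylogenetic closeness of the genome in which
    that adjacency is observed. Returns, per fragment, the neighbor with
    the highest summed support."""
    support = defaultdict(float)
    for a, b, w in observations:
        support[(a, b)] += w
    best = {}
    for (a, b), w in support.items():
        if a not in best or w > best[a][1]:
            best[a] = (b, w)
    return best

# Hypothetical syntenic fragments SF1..SF3 observed adjacent in three genomes.
obs = [
    ("SF1", "SF2", 0.9),  # supported by a close relative
    ("SF1", "SF3", 0.4),  # supported by a distant outgroup
    ("SF1", "SF2", 0.5),  # independent support from another genome
]
adj = best_adjacencies(obs)  # SF1's reconstructed neighbor is SF2
```

Weighting support by phylogenetic distance is what lets a probabilistic reconstruction favor adjacencies conserved in the lineages closest to the targeted ancestor.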


Blood ◽  
2010 ◽  
Vol 116 (21) ◽  
pp. SCI-16-SCI-16
Author(s):  
Eric D. Green

Abstract SCI-16. The Human Genome Project's completion of the human genome sequence in 2003 was a landmark scientific achievement of historic significance. It also signified a critical transition for the field of genomics, as the new foundation of genomic knowledge started to be used in powerful ways by researchers and clinicians to tackle increasingly complex problems in biomedicine. To exploit the opportunities provided by the human genome sequence and to ensure the productive growth of genomics as one of the most vital biomedical disciplines of the 21st century, the National Human Genome Research Institute (NHGRI) is pursuing a broad vision for genomics research beyond the Human Genome Project. This vision includes facilitating and supporting the highest-priority research areas that interconnect genomics to biology, to health, and to society. Current efforts in genomics research are focused on using genomic data, technologies, and insights to acquire a deeper understanding of biology and to uncover the genetic basis of human disease. Some of the most profound advances are being catalyzed by revolutionary new DNA sequencing technologies; these methods are already producing prodigious amounts of DNA sequence data, including from large numbers of individual patients. Such a capability, coupled with better associations between genetic diseases and specific regions of the human genome, is accelerating our understanding of the genetic basis of complex genetic disorders and of drug response. Together, these developments will usher in the era of genomic medicine. Disclosures: No relevant conflicts of interest to declare.


2016 ◽  
Author(s):  
Daniel Mapleson ◽  
Gonzalo Garcia Accinelli ◽  
George Kettleborough ◽  
Jonathan Wright ◽  
Bernardo J. Clavijo

Abstract. Motivation: De novo assembly of whole-genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilised by assemblers, provides useful insights that can inform the assembly process and result in better assemblies. Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies. Availability: KAT is available under the GPLv3 license at https://github.com/TGAC/KAT. Contact: [email protected]. Supplementary information: Supplementary Information (SI) is available at Bioinformatics online. In addition, the software documentation is available online at http://kat.readthedocs.io/en/latest/.
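The pairwise reads-vs-assembly comparison mentioned above boils down to cross-tabulating each distinct k-mer's multiplicity in the reads against its copy number in the assembly. The sketch below illustrates only that idea with hypothetical helper names; KAT itself is a C++ toolkit with fast hashed k-mer counting and rich plotting, not this Python code.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every k-mer across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def spectra_comparison(read_kmers, asm_kmers):
    """Map (read multiplicity, assembly copy number) -> number of distinct
    k-mers. Read k-mers absent from the assembly (copy 0) often indicate
    sequencing errors or missing content; assembly copies >1 suggest repeats
    or artefactual duplication."""
    matrix = Counter()
    for kmer, mult in read_kmers.items():
        matrix[(mult, asm_kmers.get(kmer, 0))] += 1
    return matrix

reads = ["ACGTACGT", "ACGTACGA"]   # the final base of read 2 is an "error"
assembly = ["ACGTACGT"]
k = 4
m = spectra_comparison(count_kmers(reads, k), count_kmers(assembly, k))
# m[(1, 0)] counts low-multiplicity read k-mers missing from the assembly.
```

Plotting this matrix as a stacked spectrum is what makes error, heterozygosity and repeat content visible at a glance, which is the QC signal the abstract describes.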


2009 ◽  
Vol 10 (9) ◽  
pp. R94 ◽  
Author(s):  
Scott DiGuistini ◽  
Nancy Y Liao ◽  
Darren Platt ◽  
Gordon Robertson ◽  
Michael Seidel ◽  
...  

2021 ◽  
Vol 26 ◽  
pp. e983
Author(s):  
Susanne Hollmann ◽  
Babette Regierer ◽  
Teresa K Attwood ◽  
Andreas Gisel ◽  
Jacques Van Helden ◽  
...  

The completion of the human genome sequence triggered worldwide efforts to unravel the secrets hidden in its deceptively simple code. Numerous bioinformatics projects were undertaken to hunt for genes; predict their protein products, function and post-translational modifications; analyse protein-protein interactions; and so on. Many novel analytic and predictive computer programmes fully optimised for manipulating human genome sequence data have been developed, whereas considerably less effort has been invested in exploring the many thousands of other available genomes, from unicellular organisms to plants and non-human animals. Nevertheless, a detailed understanding of these organisms can have a significant impact on human health and well-being. New advances in genome sequencing technologies, bioinformatics, automation, artificial intelligence, etc., enable us to extend the reach of genomic research to all organisms. To this end, gathering, developing and implementing new bioinformatics solutions (usually in the form of software) is pivotal. A helpful model, often used by the bioinformatics community, is the so-called hackathon: an event at which stakeholders from different disciplines work together creatively to solve a problem. Over its course, the consortium of the EU-funded project AllBio (Broadening the Bioinformatics Infrastructure to cellular, animal and plant science) conducted many successful hackathons with researchers from different Life Science areas. Based on this experience, the authors present below a step-by-step, standardised workflow explaining how to organise a bioinformatics hackathon to develop software solutions to biological problems.

