Compression with unified and accessible byte blocks to enhance management and analyses of UKBB-scale genotypes

Abstract Whole-genome sequencing projects of millions of persons contain enormous genotypes, entailing a huge memory burden and time overhead during computation. Here, we introduce Genotype Blocking Compressor (GBC), a method for rapidly compressing large-scale genotypes into a fast-accessible and highly parallelizable format. We demonstrate that GBC has a competitive compression ratio to help save storage space. Furthermore, GBC is the fastest method to access and manage compressed large-scale genotype files (sorting, merging, splitting, etc.). Our results indicate that GBC can help resolve the fundamental problem of time- and space-consuming computation with large-scale genotypes, and conventional analysis would be substantially enhanced if integrated with GBC to access genotypes. Therefore, GBC's advanced data structure and algorithms will accelerate future population-based biomedical research involving big genomics data.

Download Full-text

1888. A Nationwide Outbreak of Invasive Pneumococcal Disease (IPD) Caused by a Novel Streptococcus Pneumoniae Serotype 2 (SP2) Clone in the PCV13 Era, in Israel

Open Forum Infectious Diseases ◽

10.1093/ofid/ofz359.118 ◽

2019 ◽

Vol 6 (Supplement_2) ◽

pp. S54-S54

Author(s):

Ron Dagan ◽

Shalom Ben-Shimol ◽

Rachel Benisty ◽

Gili Regev-Yochay ◽

Merav Ron ◽

...

Keyword(s):

Population Structure ◽

Jewish Population ◽

Pneumococcal Disease ◽

Large Scale ◽

Evolutionary Dynamics ◽

Fold Increase ◽

Population Based ◽

Whole Genome ◽

Single Strain ◽

The Uk

Abstract Background IPD caused by Sp2 (non-PCV13 serotype) is relatively rare. However, Sp2 has a high potential for causing IPD including meningitis. Large-scale outbreaks of Sp2 IPD are rare and were not reported post-PCV implementation. We describe Sp2 IPD outbreak in Israel, in the PCV13 era, caused by a novel clone. Additionally, we analyzed the population structure and evolutionary dynamics of Sp2 during 2006–2018. Methods An ongoing, population-based, nationwide active surveillance, conducted since July 2009. PCV7/PCV13 were implemented in Israel in July 2009 and November 2010, respectively. All isolates were tested for antimicrobial susceptibility, PFGE, MLST and whole-genome sequencing (WGS). Results. Overall, 173 Sp2 IPD cases were identified; all isolates were analyzed by MLST (Figure 1). During 2016–2017, Sp2 caused 7.6% of all-IPD, a 7-fold increase compared with 2006–2015, and ranked second (after serotype 12F causing 12%) among IPD isolates. During 2006–2015, 98% (40/41) Sp2 IPD were caused by the previously reported global ST-1504 clone. The outbreak was caused by a novel clone ST-13578, not previously reported (Figure 2). WGS analysis confirmed that ST-13578 was related, but genetically distinct from ST-1504, observed exclusively before the outbreak. A single strain of clone ST-74 previously globally reported was identified in 2017–2018. An additional case was identified in an adult in the UK, following a family visit from Israel. The ST-13578 clone was identified only in the Jewish population and was mainly distributed in 3 of the 7 Israeli districts. All tested strains were penicillin-susceptible (MIC < 0.06 μg/mL). Conclusion To the best of our knowledge, this is the first widespread Sp2 outbreak since PCV13 introduction worldwide, caused by a novel clone ST-13578. The outbreak is still ongoing, although a declining trend was noted since 2017. Disclosures All Authors: No reported Disclosures.

Download Full-text

Repeatability and reproducibility assessment in a large-scale population-based microbiota study: case study on human milk microbiota

10.1101/2020.04.20.052035 ◽

2020 ◽

Author(s):

Shirin Moossavi ◽

Kelsey Fehr ◽

Theo J. Moraes ◽

Ehsan Khafipour ◽

Meghan B. Azad

Keyword(s):

Quality Control ◽

Data Structure ◽

Large Scale ◽

Population Based ◽

Microbiome Composition ◽

Scale Population ◽

Microbiome Research ◽

Control Procedures ◽

Repeatability And Reproducibility ◽

Downstream Analysis

AbstractBackgroundQuality control including assessment of batch variabilities and confirmation of repeatability and reproducibility are integral component of high throughput omics studies including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established.ResultsIn this study we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: 1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; 2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and 3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms, and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our reslults in each batch before merging them for downstream analysis.ConclusionThis study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.

Download Full-text

Repeatability and reproducibility assessment in a large-scale population-based microbiota study: case study on human milk microbiota

Microbiome ◽

10.1186/s40168-020-00998-4 ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Shirin Moossavi ◽

Kelsey Fehr ◽

Ehsan Khafipour ◽

Meghan B. Azad

Keyword(s):

Quality Control ◽

Data Structure ◽

Large Scale ◽

Population Based ◽

Microbiome Composition ◽

Scale Population ◽

Microbiome Research ◽

Control Procedures ◽

Repeatability And Reproducibility ◽

Downstream Analysis

Abstract Background Quality control including assessment of batch variabilities and confirmation of repeatability and reproducibility are integral component of high throughput omics studies including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. Results In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. Conclusion This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.

Download Full-text

Efficient Merging of Genome Profile Alignments

10.1101/309047 ◽

2018 ◽

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Data Sets ◽

Whole Genome ◽

Multiple Sequence ◽

Construction Methods ◽

Current Implementation ◽

Whole Genomes

AbstractMotivationWhole-genome alignment methods show insufficient scalability towards the generation of large-scale whole-genome alignments (WGAs). Profile alignment-based approaches revolutionized the fields of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles.ResultsHere, we present GPA, an approach that aligns the profiles of WGAs and is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined to a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profiles alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve (Darling et al., 2010) and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial data sets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.AvailabilityGPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of [email protected]

Download Full-text

Efficient merging of genome profile alignments

Bioinformatics ◽

10.1093/bioinformatics/btz377 ◽

2019 ◽

Vol 35 (14) ◽

pp. i71-i80

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Supplementary Information ◽

Whole Genome ◽

Multiple Sequence ◽

Genome Profile ◽

Construction Methods ◽

Profile Alignment

Abstract Motivation Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized the fields of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which make the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles. Results Here, we present genome profile alignment, an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined to a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profiles alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial datasets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool. Availability and implementation GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

0306 Exploring the feasibility of using copy number variants as genetic markers through large-scale whole genome sequencing experiments

Journal of Animal Science ◽

10.2527/jam2016-0306 ◽

2016 ◽

Vol 94 (suppl_5) ◽

pp. 146-146

Author(s):

D. M. Bickhart ◽

L. Xu ◽

J. L. Hutchison ◽

J. B. Cole ◽

D. J. Null ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genetic Markers ◽

Genome Sequencing ◽

Copy Number ◽

Large Scale ◽

Copy Number Variants ◽

Whole Genome

Download Full-text

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Download Full-text

Reaction-based Enumeration, Active Learning, and Free Energy Calculations to Rapidly Explore Synthetically Tractable Chemical Space and Optimize Potency of Cyclin Dependent Kinase 2 Inhibitors

10.26434/chemrxiv.7841270.v2 ◽

2019 ◽

Author(s):

Kyle Konze ◽

Pieter Bos ◽

Markus Dahlgren ◽

Karl Leswing ◽

Ivan Tubert-Brohman ◽

...

Keyword(s):

Free Energy ◽

Drug Discovery ◽

Active Learning ◽

Large Scale ◽

Chemical Space ◽

Population Based ◽

Free Energy Calculations ◽

Computational Technique ◽

Cyclin Dependent Kinase ◽

Energy Calculations

We report a new computational technique, PathFinder, that uses retrosynthetic analysis followed by combinatorial synthesis to generate novel compounds in synthetically accessible chemical space. Coupling PathFinder with active learning and cloud-based free energy calculations allows for large-scale potency predictions of compounds on a timescale that impacts drug discovery. The process is further accelerated by using a combination of population-based statistics and active learning techniques. Using this approach, we rapidly optimized R-groups and core hops for inhibitors of cyclin-dependent kinase 2. We explored greater than 300 thousand ideas and identified 35 ligands with diverse commercially available R-groups and a predicted IC<sub>50</sub> < 100 nM, and four unique cores with a predicted IC<sub>50</sub> < 100 nM. The rapid turnaround time, and scale of chemical exploration, suggests that this is a useful approach to accelerate the discovery of novel chemical matter in drug discovery campaigns.

Download Full-text