trioPhaser: using Mendelian inheritance logic to improve genomic phasing of trios

Abstract Background: When analyzing DNA sequence data of an individual, knowing which nucleotide was inherited from each parent can help identify certain types of DNA variants. Mendelian inheritance logic can be used to accurately phase (haplotype) the majority (67–83%) of an individual's heterozygous nucleotide positions when genotypes are available for both parents (a trio). However, when all members of a trio are heterozygous at a position, Mendelian inheritance logic cannot phase that position. For such positions, a computational phasing algorithm can be used. Existing phasing algorithms use a haplotype reference panel, sequencing reads, and/or parental genotypes to phase an individual; however, they are limited in that they can only phase certain types of variants, require a specific genotype build, require large amounts of storage capacity, and/or require long run times. We created trioPhaser to address these challenges. Results: trioPhaser uses gVCF files from an individual and their parents as initial input and outputs a phased VCF file. Input trio data are first phased using Mendelian inheritance logic. Then, the positions that cannot be phased using inheritance information alone are phased by the SHAPEIT4 phasing algorithm. Using whole-genome sequencing data of 52 trios, we show that trioPhaser, on average, increases the total number of phased positions by 21.0% and 10.5%, respectively, when compared to the number of positions that SHAPEIT4 or Mendelian inheritance logic can phase when either is used alone. In addition, we show that the accuracy of the phased calls output by trioPhaser is similar to that of linked-read and read-backed phasing. Conclusion: trioPhaser is a containerized software tool that uses both Mendelian inheritance logic and SHAPEIT4 to phase trios when gVCF files are available. By implementing both phasing methods, it phases more variant positions than either method can phase alone.
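The Mendelian inheritance step described in this abstract can be illustrated with a minimal sketch. It is not trioPhaser's actual code: the function name `phase_by_inheritance` and the tuple-based genotype representation are assumptions made for illustration. The function tries both parent-of-origin assignments for the child's two alleles and keeps an assignment only when exactly one is consistent with the parental genotypes, so a position where all three trio members are heterozygous stays unphased.

```python
from typing import Optional, Tuple

Genotype = Tuple[str, str]  # unphased pair of alleles, e.g. ("A", "G")


def phase_by_inheritance(
    child: Genotype, father: Genotype, mother: Genotype
) -> Optional[Tuple[str, str]]:
    """Return (paternal_allele, maternal_allele) for the child, or None.

    Hypothetical helper: None means the position cannot be phased by
    inheritance alone (ambiguous, or inconsistent with the parents).
    """
    first, second = child
    # Both ways of assigning the child's alleles to the two parents.
    candidates = {(first, second), (second, first)}
    # Keep assignments where each allele is present in the matching parent.
    consistent = {
        (paternal, maternal)
        for paternal, maternal in candidates
        if paternal in father and maternal in mother
    }
    if len(consistent) == 1:
        return next(iter(consistent))
    return None  # zero matches (Mendelian violation) or two (ambiguous)


# Father is homozygous A/A, so the child's A must be paternal.
print(phase_by_inheritance(("A", "G"), ("A", "A"), ("A", "G")))
# All three heterozygous: both assignments fit, so no phase is returned.
print(phase_by_inheritance(("A", "G"), ("A", "G"), ("A", "G")))
```

In a combined workflow of the kind the abstract describes, positions for which such a function returns None would be the ones handed to a statistical phaser.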


MTBseq: a comprehensive pipeline for whole genome sequence analysis of Mycobacterium tuberculosis complex isolates

PeerJ ◽

10.7717/peerj.5895 ◽

2018 ◽

Vol 6 ◽

pp. e5895 ◽

Cited By ~ 35

Author(s):

Thomas Andreas Kohl ◽

Christian Utpatel ◽

Viola Schleusener ◽

Maria Rosaria De Filippo ◽

Patrick Beckert ◽

...

Keyword(s):

Antibiotic Resistance ◽

Mycobacterium Tuberculosis ◽

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Phylogenomic Analysis ◽

Whole Genome ◽

Sequencing Data ◽

Desktop Computer

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance at the highest resolution, up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference-mapping-based workflow, MTBseq reports detected variant positions annotated with known associations to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable and expandable, and it can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation are available from https://github.com/ngs-fzb/MTBseq_source.
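As a rough illustration of how groups of related isolates might be derived from an alignment of SNP positions, the sketch below computes pairwise SNP distances and links isolates whose distance is at or below a threshold (single-linkage grouping). This is an assumed approach for illustration, not MTBseq's implementation; the function names, the toy sequences, and the threshold value are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences carry different bases.

    Positions with a non-ACGT character (gap, N) in either sequence are skipped.
    """
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )


def group_isolates(alignment: Dict[str, str], max_distance: int) -> List[List[str]]:
    """Single-linkage grouping: link any pair within max_distance SNPs."""
    parent = {name: name for name in alignment}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(alignment, 2):
        if snp_distance(alignment[a], alignment[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in alignment:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy SNP alignment (hypothetical isolates and sequences).
toy_alignment = {
    "iso1": "ACGTAC",
    "iso2": "ACGTAT",
    "iso3": "TTTTTT",
    "iso4": "ACGAAT",
}
print(group_isolates(toy_alignment, max_distance=1))
```

With single linkage, iso1 and iso4 end up in the same group through iso2 even though they differ from each other at two positions; a stricter grouping rule would behave differently.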


gplas: a comprehensive tool for plasmid analysis using short-read graphs

Bioinformatics ◽

10.1093/bioinformatics/btaa233 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3874-3876 ◽

Cited By ~ 1

Author(s):

Sergio Arredondo-Alonso ◽

Martin Bootsma ◽

Yaïr Hein ◽

Malbert R C Rogers ◽

Jukka Corander ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bacterial Genome ◽

Workflow Management ◽

Supplementary Information ◽

Whole Genome Sequencing Data ◽

Network Partitioning ◽

Sequencing Data ◽

Genetic Traits ◽

Short Read

Abstract Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and network partitioning based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data. Availability and implementation: Gplas is written in R and Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/gplas.git. Supplementary information: Supplementary data are available at Bioinformatics online.


Discordant bioinformatic predictions of antimicrobial resistance from whole-genome sequencing data of bacterial isolates: An inter-laboratory study

10.1101/793885 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ronan M. Doyle ◽

Denise M. O’Sullivan ◽

Sean D. Aller ◽

Sebastian Bruchmann ◽

Taane Clark ◽

...

Keyword(s):

Antimicrobial Resistance ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Laboratory Study ◽

Clinical Microbiology ◽

Sequence Data ◽

Clinical Samples ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Abstract Background: Antimicrobial resistance (AMR) poses a threat to public health. Clinical microbiology laboratories typically rely on culturing bacteria for antimicrobial susceptibility testing (AST). As the implementation costs and technical barriers fall, whole-genome sequencing (WGS) has emerged as a ‘one-stop’ test for epidemiological and predictive AST results. Few published comparisons exist for the myriad analytical pipelines used for predicting AMR. To address this, we performed an inter-laboratory study providing sets of participating researchers with identical short-read WGS data sequenced from clinical isolates, allowing us to assess the reproducibility of the bioinformatic prediction of AMR between participants and identify problem cases and factors that lead to discordant results. Methods: We produced ten WGS datasets of varying quality from cultured carbapenem-resistant organisms obtained from clinical samples sequenced on either an Illumina NextSeq or HiSeq instrument. Nine participating teams (‘participants’) were provided these sequence data without any other contextual information. Each participant used their own pipeline to determine the species, the presence of resistance-associated genes, and to predict susceptibility or resistance to amikacin, gentamicin, ciprofloxacin and cefotaxime. Results: Individual participants predicted different numbers of AMR-associated genes and different gene variants from the same clinical samples. The quality of the sequence data, choice of bioinformatic pipeline and interpretation of the results all contributed to discordance between participants. Although much of the inaccurate gene variant annotation did not affect genotypic resistance predictions, we observed low specificity when compared to phenotypic AST results, but this improved in samples with higher read depths. Had the results been used to predict AST and guide treatment, a different antibiotic would have been recommended for each isolate by at least one participant. Conclusions: We found that participants produced discordant predictions from identical WGS data. These challenges, at the final analytical stage of using WGS to predict AMR, suggest the need for refinements when using this technology in clinical settings. Comprehensive public resistance sequence databases and standardisation in the comparisons between genotype and resistance phenotypes will be fundamental before AST prediction using WGS can be successfully implemented in standard clinical microbiology laboratories.


Whole genome sequencing data of multiple individuals of Pakistani descent

Scientific Data ◽

10.1038/s41597-020-00664-2 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Shahid Y. Khan ◽

Muhammad Ali ◽

Mei-Chong W. Lee ◽

Zhiwei Ma ◽

Pooja Biswas ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Asian Populations ◽

Ethnic Populations ◽

Novel Variants ◽

Intergenic Regions

Abstract Here we report whole genome sequencing of four individuals (H3, H4, H5, and H6) from a family of Pakistani descent. Whole genome sequencing yielded 1084.92, 894.73, 1068.62, and 1005.77 million mapped reads corresponding to 162.73, 134.21, 160.29, and 150.86 Gb sequence data and 52.49x, 43.29x, 51.70x, and 48.66x average coverage for H3, H4, H5, and H6, respectively. We identified 3,529,659, 3,478,495, 3,407,895, and 3,426,862 variants in the genomes of H3, H4, H5, and H6, respectively, including 1,668,024 variants common in the four genomes. Further, we identified 42,422, 39,824, 28,599, and 35,206 novel variants in the genomes of H3, H4, H5, and H6, respectively. A major fraction of the variants identified in the four genomes reside within the intergenic regions of the genome. Single nucleotide polymorphism (SNP) genotype based comparative analysis with ethnic populations of 1000 Genomes database linked the ancestry of all four genomes with the South Asian populations, which was further supported by mitochondria based haplogroup analysis. In conclusion, we report whole genome sequencing of four individuals of Pakistani descent.
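The relationship between the mapped-read counts, sequence yield, and average coverage reported above can be checked with a short calculation. The read length of 150 bp and the genome size of 3.1 Gb used below are assumptions inferred from the reported figures; neither value is stated in the abstract.

```python
# Assumptions (not stated in the abstract, inferred from its numbers):
READ_LENGTH_BP = 150      # bases per mapped read
GENOME_SIZE_BP = 3.1e9    # approximate human genome size used as denominator

# Mapped reads in millions, as reported for each individual.
mapped_reads_millions = {
    "H3": 1084.92,
    "H4": 894.73,
    "H5": 1068.62,
    "H6": 1005.77,
}


def sequenced_gigabases(reads_millions: float, read_length_bp: int = READ_LENGTH_BP) -> float:
    """Total sequenced bases in Gb: reads x read length."""
    return reads_millions * 1e6 * read_length_bp / 1e9


def average_coverage(
    reads_millions: float,
    read_length_bp: int = READ_LENGTH_BP,
    genome_size_bp: float = GENOME_SIZE_BP,
) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return reads_millions * 1e6 * read_length_bp / genome_size_bp


for sample, reads in mapped_reads_millions.items():
    print(
        f"{sample}: {sequenced_gigabases(reads):.2f} Gb, "
        f"{average_coverage(reads):.2f}x"
    )
```

Under these assumptions the computed values agree with the reported yields and coverages to within rounding.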


gplas: a comprehensive tool for plasmid analysis using short-read graphs

10.1101/835900 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sergio Arredondo-Alonso ◽

Martin Bootsma ◽

Yaïr Hein ◽

Malbert R.C. Rogers ◽

Jukka Corander ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bacterial Genome ◽

Workflow Management ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Genetic Traits ◽

Short Read ◽

Sequence Composition ◽

Short Read Sequence

Abstract Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects, but the reconstruction of plasmids from these data faces severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and clustering based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data. Availability and implementation: Gplas is written in R and Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/[email protected]


EMBL2checklists: A Python package to facilitate the user-friendly submission of plant DNA barcoding sequences to ENA

10.1101/435644 ◽

2018 ◽

Author(s):

Michael Gruenstaeudl ◽

Yannick Hartmaring

Keyword(s):

DNA Barcoding ◽

DNA Sequence ◽

DNA Sequences ◽

Sequence Data ◽

Software Tool ◽

Plant DNA ◽

DNA Sequence Data ◽

User Friendly ◽

Common Plant ◽

Python Package

Abstract Background: The submission of DNA sequences to public sequence databases is an essential but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, few software tools are available that facilitate the preparation of sequence data for database submissions, especially for sequences generated via plant DNA barcoding. Current submission procedures can be complex and prohibitively time-consuming for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant DNA barcoding. Methods: A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called “checklists”) for a subsequent upload to the public sequence database of the European Nucleotide Archive (ENA). The software tool, titled “EMBL2checklists”, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates output that can be uploaded via the interactive Webin submission system of ENA. Results: EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common plant DNA barcoding sequences into easily editable spreadsheets that require no further processing other than their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical interface as well as an efficient command-line interface. The utility of the software is illustrated by its application in the submission of DNA sequences of two recent plant phylogenetic investigations and one fungal metagenomic study. Discussion: EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant biologists without bioinformatics expertise to generate submission-ready checklists from common plant DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.


Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences

Genome Biology ◽

10.1186/s13059-021-02447-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Readman Chiu ◽

Indhu-Shree Rajan-Babu ◽

Jan M. Friedman ◽

Inanc Birol

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Tandem Repeat ◽

Neurological Disorders ◽

Software Tool ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Long Read ◽

Repeat Expansions

Abstract Tandem repeat (TR) expansion is the underlying cause of over 40 neurological disorders. Long-read sequencing offers an exciting avenue over conventional technologies for detecting TR expansions. Here, we present Straglr, a robust software tool for both targeted genotyping and novel expansion detection from long-read alignments. We benchmark Straglr using various simulations, targeted genotyping data of cell lines carrying expansions of known diseases, and whole genome sequencing data with chromosome-scale assembly. Our results suggest that Straglr may be useful for investigating disease-associated TR expansions using long-read sequencing.


Validation of genotyping of gastrointestinal stromal tumor in Japan

Journal of Clinical Oncology ◽

10.1200/jco.2009.27.15_suppl.e21502 ◽

2009 ◽

Vol 27 (15_suppl) ◽

pp. e21502-e21502

Author(s):

T. Takahashi ◽

T. Nishida ◽

S. Sakurai ◽

T. Kanda ◽

A. Sawaki ◽

...

Keyword(s):

Special Reference ◽

DNA Extraction ◽

DNA Sequence ◽

Validation Study ◽

Sequence Data ◽

Extraction Methods ◽

Reference Sequence ◽

Sequencing Data ◽

DNA Sequence Data ◽

PCR Method

e21502 Background: Most gastrointestinal stromal tumors (GIST) have activating mutations in the KIT or PDGFRA gene. Genotyping of GIST is important in its diagnosis and treatment. Methods of genotyping using genomic DNA extracted from paraffin-embedded specimens are diverse and not standardized. We performed a validation study of genotyping, using as reference the sequencing data obtained from cDNA from fresh GIST samples. Methods: Three DNA extraction methods (QIAamp, DEXPAT, or original) and four PCR methods (Ex Taq, AmpliTaq condition-1, AmpliTaq condition-2, or QIAGEN Taq) were compared using 20 paraffin-embedded specimens, with reference to sequencing data obtained from cDNA from the corresponding 20 fresh GIST samples. After DNA extraction, KIT exons 9, 11, 13 and 17, and PDGFRA exons 12 and 18 were amplified by each PCR method using specific primers and directly sequenced. Results: In the evaluation of PCR methods, the protocol with Ex Taq showed 100% DNA amplification and sequence agreement, the protocol with QIAGEN Taq showed 99% agreement, the protocol with AmpliTaq condition-2 showed 86% agreement, and the protocol with AmpliTaq condition-1 showed much less amplification and higher disagreement. For DNA extraction, the protocol with QIAamp showed the best DNA extraction, and its DNA sequence data were consistent with the reference sequence in 98% of cases; DNA sequences obtained using DEXPAT showed 33% consistency, and 89% of DNA sequence data obtained with the original method agreed with the reference data. Some modifications improved DNA amplification, but inconsistent sequence data also increased, probably due to PCR mis-amplification. Conclusions: The DNA extraction methods yielded different quantities of DNA, and the four PCR methods differed in quality. Using this validation study, a standard genotyping method in Japan was established. No significant financial relationships to disclose.


Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters

BioMed Research International ◽

10.1155/2019/7074387 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Maleeha Najam ◽

Raihan Ur Rasool ◽

Hafiz Farooq Ahmad ◽

Usman Ashraf ◽

Asad Waqar Malik

Keyword(s):

DNA Sequencing ◽

DNA Sequence ◽

Pattern Matching ◽

DNA Sequences ◽

Sequence Data ◽

Bloom Filters ◽

Sequencing Data ◽

DNA Sequence Data ◽

Efficient Data ◽

Improved Accuracy

Storing and processing large DNA sequences has always been a major problem because of the increasing volume of DNA sequence data. A number of solutions have been proposed, but they require significant computation and memory. Therefore, an efficient storage and pattern matching solution is required for DNA sequencing data. Bloom filters (BFs) are an efficient data structure that is mostly used in bioinformatics for the classification of DNA sequences. In this paper, we explore further uses of BFs beyond classification. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all the locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of any disease. This paper serves as a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. We expect that further optimization of the proposed solution can bring remarkable results, as this paper presents a proof-of-concept implementation for a given set of data using the proposed MBF technique. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
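To make the underlying data structure concrete, here is a minimal sketch of a single, plain Bloom filter over the k-mers of a DNA sequence, used as a cheap pre-check before an exact scan that reports pattern locations and repeat count. It is not the Multiple Bloom Filters technique of the paper, whose details are not given in the abstract; the class and function names, the filter size, the number of hash functions, and the k-mer length are all assumptions chosen for illustration.

```python
import hashlib
from typing import List


class BloomFilter:
    """Plain Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> List[int]:
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def build_kmer_filter(sequence: str, k: int) -> BloomFilter:
    """Insert every k-mer of the sequence into a Bloom filter."""
    bloom = BloomFilter()
    for i in range(len(sequence) - k + 1):
        bloom.add(sequence[i : i + k])
    return bloom


def find_pattern(sequence: str, pattern: str, bloom: BloomFilter, k: int) -> List[int]:
    """Return all start positions of pattern (overlaps included)."""
    # Pre-check: if any k-mer of the pattern is absent, the pattern cannot occur.
    if any(pattern[i : i + k] not in bloom for i in range(len(pattern) - k + 1)):
        return []
    # Exact scan confirms the match, so Bloom false positives cannot leak through.
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


dna = "ACGTACGTTTACG"
kmer_filter = build_kmer_filter(dna, k=3)
hits = find_pattern(dna, "ACG", kmer_filter, k=3)
print(hits, len(hits))
```

The exact scan after the filter check is a deliberate design choice in this sketch: a Bloom filter alone can only say that a k-mer is possibly present, so positions and counts still have to come from a verifying pass.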


Poking COVID-19: insights on genomic constraints among immune-related genes between Qatari and Italian populations

10.1101/2021.10.04.21264507 ◽

2021 ◽

Author(s):

Hamdi Mbarek ◽

Massimiliano Cocca ◽

Yasser Al Sarraj ◽

Chadi Saad ◽

Massimo Mezzavilla ◽

...

Keyword(s):

Innate Immunity ◽

Sequence Data ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Host Pathogen Interaction ◽

Whole Exome ◽

Exome Sequence Data ◽

Host Pathogen ◽

Immune Related Genes ◽

Exome Sequence

Abstract Host genomic information, specifically genomic variations, may characterize susceptibility to disease and identify people with a higher risk of harm, leading to better targeting of care and vaccination. Italy was the epicentre for the spread of COVID-19 in Europe and the first country to go into a national lockdown, and it has one of the highest COVID-19-associated mortality rates. Qatar, on the other hand, has a very low mortality rate. In this study, we compared whole-genome sequencing data of 14,398 adult Qatari nationals with those of 925 Italian individuals. We also included in the comparison whole-exome sequence data from 189 Italian laboratory-confirmed COVID-19 cases. We focused our study on a curated list of 3,619 candidate genes involved in innate immunity and host-pathogen interaction. Two population-gene metric scores, the Delta Singleton-Cohort variant score (DSC) and the Sum Singleton-Cohort variant score (SSC), were applied to estimate the presence of selective constraints in the Qatari population and in the Italian cohorts. Results based on the DSC and SSC metrics demonstrated a different selective pressure on three genes (MUC5AC, ABCA7, FLNA) between the Qatari and Italian populations. This study highlighted the genetic differences between Qatari and Italian populations and identified a subset of genes involved in innate immunity and host-pathogen interaction.
