ConFindr: rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data

PeerJ ◽

10.7717/peerj.6995 ◽

2019 ◽

Vol 7 ◽

pp. e6995 ◽

Cited By ~ 14

Author(s):

Andrew J. Low ◽

Adam G. Koziol ◽

Paul A. Manninger ◽

Burton Blais ◽

Catherine D. Carrillo

Keyword(s):

Sequence Data ◽

Single Copy ◽

Whole Genome Sequence ◽

Whole Genome ◽

Computational Pipeline ◽

Ribosomal Protein Genes ◽

Multiple Alleles ◽

Quality Issue ◽

Unrelated Strain ◽

Higher Sensitivity

Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-based findings; however, existing tools do not readily identify contamination from closely-related organisms. To address this gap, we have developed a computational pipeline, ConFindr, for detection of intraspecies contamination. ConFindr determines the presence of contaminating sequences based on the identification of multiple alleles of core, single-copy, ribosomal-protein genes in raw sequencing reads. The performance of this tool was assessed using simulated and lab-generated Illumina short-read WGS data with varying levels of contamination (0–20% of reads) and varying genetic distance between the designated target and contaminant strains. Intraspecies and cross-species contamination was reliably detected in datasets containing 5% or more reads from a second, unrelated strain. ConFindr detected intraspecies contamination with higher sensitivity than existing tools, while also being able to automatically detect cross-species contamination with similar sensitivity. The implementation of ConFindr in quality-control pipelines will help to improve the reliability of WGS databases as well as the accuracy of downstream analyses. ConFindr is written in Python, and is freely available under the MIT License at github.com/OLC-Bioinformatics/ConFindr.

ConFindr: Rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data

10.7287/peerj.preprints.27499 ◽

2019 ◽

Author(s):

Andrew J Low ◽

Adam G Koziol ◽

Paul A Manninger ◽

Burton W Blais ◽

Catherine D Carrillo

Keyword(s):

Sequence Data ◽

Single Copy ◽

Whole Genome Sequence ◽

Whole Genome ◽

Computational Pipeline ◽

Ribosomal Protein Genes ◽

Multiple Alleles ◽

Quality Issue ◽

Unrelated Strain ◽

Higher Sensitivity

Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-based findings; however, existing tools do not readily identify contamination from closely-related organisms. To address this gap, we have developed a computational pipeline, ConFindr, for detection of intraspecies contamination. ConFindr determines the presence of contaminating sequences based on the identification of multiple alleles of core, single-copy, ribosomal-protein genes in raw sequencing reads. The performance of this tool was assessed using simulated and lab-generated Illumina short-read WGS data with varying levels of contamination (0-20% of reads) and varying genetic distance between the designated target and contaminant strains. Intraspecies and cross-species contamination was reliably detected in datasets containing 5% or more reads from a second, unrelated strain. ConFindr detected intraspecies contamination with higher sensitivity than existing tools, while also being able to automatically detect cross-species contamination with similar sensitivity. The implementation of ConFindr in quality-control pipelines will help to improve the reliability of WGS databases as well as the accuracy of downstream analyses. ConFindr is written in Python, and is freely available under the MIT License at github.com/OLC-Bioinformatics/ConFindr.

ConFindr: Rapid detection of intraspecies and cross-species contamination in bacterial whole-genome sequence data

10.7287/peerj.preprints.27499v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Andrew J Low ◽

Adam G Koziol ◽

Paul A Manninger ◽

Burton W Blais ◽

Catherine D Carrillo

Keyword(s):

Sequence Data ◽

Single Copy ◽

Whole Genome Sequence ◽

Whole Genome ◽

Computational Pipeline ◽

Ribosomal Protein Genes ◽

Multiple Alleles ◽

Quality Issue ◽

Unrelated Strain ◽

Higher Sensitivity

Whole-genome sequencing (WGS) of bacterial pathogens is currently widely used to support public-health investigations. The ability to assess WGS data quality is critical to underpin the reliability of downstream analyses. Sequence contamination is a quality issue that could potentially impact WGS-based findings; however, existing tools do not readily identify contamination from closely-related organisms. To address this gap, we have developed a computational pipeline, ConFindr, for detection of intraspecies contamination. ConFindr determines the presence of contaminating sequences based on the identification of multiple alleles of core, single-copy, ribosomal-protein genes in raw sequencing reads. The performance of this tool was assessed using simulated and lab-generated Illumina short-read WGS data with varying levels of contamination (0-20% of reads) and varying genetic distance between the designated target and contaminant strains. Intraspecies and cross-species contamination was reliably detected in datasets containing 5% or more reads from a second, unrelated strain. ConFindr detected intraspecies contamination with higher sensitivity than existing tools, while also being able to automatically detect cross-species contamination with similar sensitivity. The implementation of ConFindr in quality-control pipelines will help to improve the reliability of WGS databases as well as the accuracy of downstream analyses. ConFindr is written in Python, and is freely available under the MIT License at github.com/OLC-Bioinformatics/ConFindr.

Whole-genome analysis of Malawian Plasmodium falciparum isolates identifies potential targets of allele-specific immunity to clinical malaria

10.1101/2020.09.16.20196253 ◽

2020 ◽

Author(s):

Zalak Shah ◽

Myo T Naung ◽

Kara A Moser ◽

Matthew Adams ◽

Andrea G Buchwald ◽

...

Keyword(s):

Plasmodium Falciparum ◽

Sequence Data ◽

Clinical Malaria ◽

Whole Genome Sequence ◽

Whole Genome ◽

Whole Genome Analysis ◽

Multiple Alleles ◽

Vaccine Candidates ◽

Genome Wide ◽

Allele Specific

Individuals acquire immunity to clinical malaria after repeated Plasmodium falciparum infections. This immunity to disease is thought to reflect the acquisition of a repertoire of responses to multiple alleles in diverse parasite antigens. In previous studies, we identified polymorphic sites within individual antigens that are associated with parasite immune evasion by examining antigen allele dynamics in individuals followed longitudinally. Here we expand this approach by analyzing genome-wide polymorphisms using whole genome sequence data from 140 parasite isolates representing malaria cases from a longitudinal study in Malawi and identify 25 genes that encode likely targets of naturally acquired immunity and that should be further characterized for their potential as vaccine candidates.

Whole-genome analysis of Malawian Plasmodium falciparum isolates identifies possible targets of allele-specific immunity to clinical malaria

PLoS Genetics ◽

10.1371/journal.pgen.1009576 ◽

2021 ◽

Vol 17 (5) ◽

pp. e1009576

Author(s):

Zalak Shah ◽

Myo T. Naung ◽

Kara A. Moser ◽

Matthew Adams ◽

Andrea G. Buchwald ◽

...

Keyword(s):

Plasmodium Falciparum ◽

Sequence Data ◽

Clinical Malaria ◽

Whole Genome Sequence ◽

Whole Genome ◽

Whole Genome Analysis ◽

Multiple Alleles ◽

Vaccine Candidates ◽

Genome Wide ◽

Allele Specific

Individuals acquire immunity to clinical malaria after repeated Plasmodium falciparum infections. Immunity to disease is thought to reflect the acquisition of a repertoire of responses to multiple alleles in diverse parasite antigens. In previous studies, we identified polymorphic sites within individual antigens that are associated with parasite immune evasion by examining antigen allele dynamics in individuals followed longitudinally. Here we expand this approach by analyzing genome-wide polymorphisms using whole genome sequence data from 140 parasite isolates representing malaria cases from a longitudinal study in Malawi and identify 25 genes that encode possible targets of naturally acquired immunity that should be validated immunologically and further characterized for their potential as vaccine candidates.

Faculty Opinions recommendation of Optimal algorithms for haplotype assembly from whole-genome sequence data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.13339986.14707085 ◽

2011 ◽

Author(s):

Alejandro Schaffer

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Optimal Algorithms ◽

Genome Sequence Data ◽

Haplotype Assembly

TIGER: inferring DNA replication timing from whole-genome sequence data

Bioinformatics ◽

10.1093/bioinformatics/btab166 ◽

2021 ◽

Cited By ~ 1

Author(s):

Amnon Koren ◽

Dashiell J Massey ◽

Alexa N Bracci

Keyword(s):

Dna Replication ◽

Genome Sequence ◽

Genomic Dna ◽

Sequence Data ◽

Replication Timing ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Genome Sequence Data ◽

Dna Replication Timing

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online

Whole genome sequence data of Bacillus australimaris strain B28A, isolated from Marine Water in India

Data in Brief ◽

10.1016/j.dib.2021.107240 ◽

2021 ◽

pp. 107240

Author(s):

Wael Ali Mohammed Hadi ◽

Boby T Edwin ◽

A Jayakumaran Nair

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Marine Water ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data

Whole genome sequence data of Mycobacterium tuberculosis XDR strain, isolated from patient in Kazakhstan

Data in Brief ◽

10.1016/j.dib.2020.106416 ◽

2020 ◽

Vol 33 ◽

pp. 106416

Author(s):

Asset Daniyarov ◽

Askhat Molkenov ◽

Saule Rakhimova ◽

Ainur Akhmetova ◽

Zhannur Nurkina ◽

...

Keyword(s):

Mycobacterium Tuberculosis ◽

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data

Elucidating the genetic basis of an oligogenic birth defect using whole genome sequence data in a non-model organism, Bubalus bubalis

Scientific Reports ◽

10.1038/srep39719 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 10

Author(s):

Lynsey K. Whitacre ◽

Jesse L. Hoff ◽

Robert D. Schnabel ◽

Sara Albarella ◽

Francesca Ciotola ◽

...

Keyword(s):

Genome Sequence ◽

Birth Defect ◽

Genetic Basis ◽

Sequence Data ◽

Model Organism ◽

Bubalus Bubalis ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequence Data

Whole genome characterization of strains belonging to the Ralstonia solanacearum species complex and in silico analysis of TaqMan assays for detection in this heterogenous species complex

European Journal of Plant Pathology ◽

10.1007/s10658-020-02190-8 ◽

2021 ◽

Author(s):

Viola Kurm ◽

Ilse Houwers ◽

Claudia E. Coipan ◽

Peter Bonants ◽

Cees Waalwijk ◽

...

Keyword(s):

Ralstonia Solanacearum ◽

In Silico ◽

Species Complex ◽

Sequence Data ◽

In Silico Analysis ◽

Whole Genome Sequence ◽

Whole Genome ◽

Genome Sequences ◽

Pcr Assays

AbstractIdentification and classification of members of the Ralstonia solanacearum species complex (RSSC) is challenging due to the heterogeneity of this complex. Whole genome sequence data of 225 strains were used to classify strains based on average nucleotide identity (ANI) and multilocus sequence analysis (MLSA). Based on the ANI score (>95%), 191 out of 192(99.5%) RSSC strains could be grouped into the three species R. solanacearum, R. pseudosolanacearum, and R. syzygii, and into the four phylotypes within the RSSC (I,II, III, and IV). R. solanacearum phylotype II could be split in two groups (IIA and IIB), from which IIB clustered in three subgroups (IIBa, IIBb and IIBc). This division by ANI was in accordance with MLSA. The IIB subgroups found by ANI and MLSA also differed in the number of SNPs in the primer and probe sites of various assays. An in-silico analysis of eight TaqMan and 11 conventional PCR assays was performed using the whole genome sequences. Based on this analysis several cases of potential false positives or false negatives can be expected upon the use of these assays for their intended target organisms. Two TaqMan assays and two PCR assays targeting the 16S rDNA sequence should be able to detect all phylotypes of the RSSC. We conclude that the increasing availability of whole genome sequences is not only useful for classification of strains, but also shows potential for selection and evaluation of clade specific nucleic acid-based amplification methods within the RSSC.