Genome-scale profiling reveals noncoding loci carry higher proportions of concordant data

Molecular Biology and Evolution ◽

10.1093/molbev/msab026 ◽

2021 ◽

Author(s):

Robert Literman ◽

Rachel Schwartz

Keyword(s):

Sequence Data ◽

Phylogenetic Signal ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Coding Sequences ◽

Evolutionary Forces ◽

Tree Inference ◽

Intergenic Regions

Abstract Many evolutionary relationships remain controversial despite whole-genome sequencing data. These controversies arise in part due to challenges associated with accurately modeling the complex phylogenetic signal coming from genomic regions experiencing distinct evolutionary forces. Here we examine how different regions of the genome support or contradict well-established hypotheses among three mammal groups using millions of orthologous parsimony-informative biallelic sites [PIBS] distributed across primate, rodent, and Pecora genomes. We compared PIBS concordance percentages among locus types (e.g. coding sequences, introns, intergenic regions), and contrasted PIBS utility over evolutionary timescales. Sites derived from noncoding sequences provided more data and proportionally more concordant sites compared with those from coding sequences [CDS] in all clades. CDS PIBS were also predominant drivers of tree incongruence in two cases of topological conflict. PIBS derived from most locus types provided surprisingly consistent support for splitting events spread across the timescales we examined, although we find evidence that CDS and intronic PIBS may, respectively and to a limited degree, inform disproportionately about older and younger splits. In this era of accessible whole genome sequence data, these results (1) suggest benefits to more intentionally focusing on noncoding loci as robust data for tree inference, and (2) reinforce the importance of accurate modeling, especially when using CDS data.
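As a rough illustration of the site-level bookkeeping the abstract describes, the sketch below classifies alignment columns as parsimony-informative biallelic sites (exactly two alleles, each carried by at least two taxa) and scores each one as concordant when the bipartition it induces matches a split in a reference tree. The function names, the toy taxa, and the assumption of no missing data at scored sites are illustrative only; the authors' actual pipeline and filters are not given in the abstract and may differ.

```python
VALID_BASES = {"A", "C", "G", "T"}


def site_split(taxa, column):
    """Return the bipartition induced by a PIBS, or None if the column is not one.

    A column qualifies when exactly two alleles are present and each allele
    is carried by at least two taxa. Gaps and ambiguous calls are ignored.
    """
    groups = {}
    for taxon, base in zip(taxa, column):
        base = base.upper()
        if base in VALID_BASES:
            groups.setdefault(base, set()).add(taxon)
    if len(groups) != 2 or any(len(members) < 2 for members in groups.values()):
        return None
    return frozenset(frozenset(members) for members in groups.values())


def concordance_percentage(taxa, columns, reference_splits):
    """Percentage of PIBS whose bipartition appears among the reference splits.

    Assumption: every scored column has a call for every taxon, so the
    induced bipartition can be compared to the reference splits directly.
    """
    splits = [site_split(taxa, column) for column in columns]
    pibs = [split for split in splits if split is not None]
    if not pibs:
        return 0.0
    concordant = sum(split in reference_splits for split in pibs)
    return 100.0 * concordant / len(pibs)


# Hypothetical four-taxon example with one accepted split.
TAXA = ["human", "chimp", "mouse", "rat"]
REFERENCE_SPLITS = {
    frozenset({frozenset({"human", "chimp"}), frozenset({"mouse", "rat"})})
}
```

In this toy setting the column `AAGG` is a concordant PIBS, `AGAG` is a discordant PIBS, and `AAAG` is not a PIBS at all because one allele is carried by a single taxon.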

MTBseq: a comprehensive pipeline for whole genome sequence analysis of Mycobacterium tuberculosis complex isolates

PeerJ ◽

10.7717/peerj.5895 ◽

2018 ◽

Vol 6 ◽

pp. e5895 ◽

Cited By ~ 35

Author(s):

Thomas Andreas Kohl ◽

Christian Utpatel ◽

Viola Schleusener ◽

Maria Rosaria De Filippo ◽

Patrick Beckert ◽

...

Keyword(s):

Antibiotic Resistance ◽

Mycobacterium Tuberculosis ◽

Genome Sequence ◽

Sequence Data ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Phylogenomic Analysis ◽

Whole Genome ◽

Sequencing Data ◽

Desktop Computer

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance at the highest resolution, up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference-mapping-based workflow, MTBseq reports detected variant positions annotated with known associations to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable and expandable, and can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation are available from https://github.com/ngs-fzb/MTBseq_source.

Whole genome sequencing data of multiple individuals of Pakistani descent

Scientific Data ◽

10.1038/s41597-020-00664-2 ◽

2020 ◽

Vol 7 (1) ◽

Author(s):

Shahid Y. Khan ◽

Muhammad Ali ◽

Mei-Chong W. Lee ◽

Zhiwei Ma ◽

Pooja Biswas ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Asian Populations ◽

Ethnic Populations ◽

Novel Variants ◽

Intergenic Regions

Abstract Here we report whole genome sequencing of four individuals (H3, H4, H5, and H6) from a family of Pakistani descent. Whole genome sequencing yielded 1084.92, 894.73, 1068.62, and 1005.77 million mapped reads corresponding to 162.73, 134.21, 160.29, and 150.86 Gb sequence data and 52.49x, 43.29x, 51.70x, and 48.66x average coverage for H3, H4, H5, and H6, respectively. We identified 3,529,659, 3,478,495, 3,407,895, and 3,426,862 variants in the genomes of H3, H4, H5, and H6, respectively, including 1,668,024 variants common in the four genomes. Further, we identified 42,422, 39,824, 28,599, and 35,206 novel variants in the genomes of H3, H4, H5, and H6, respectively. A major fraction of the variants identified in the four genomes reside within the intergenic regions of the genome. Single nucleotide polymorphism (SNP) genotype based comparative analysis with ethnic populations of 1000 Genomes database linked the ancestry of all four genomes with the South Asian populations, which was further supported by mitochondria based haplogroup analysis. In conclusion, we report whole genome sequencing of four individuals of Pakistani descent.
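The read counts, sequence yields, and coverage figures in this abstract are linked by simple arithmetic, sketched below. The 150 bp read length and the 3.1 Gb genome size are assumptions inferred from the reported numbers (yield divided by reads, and yield divided by coverage); neither is stated in the abstract, and the helper names are hypothetical.

```python
# Assumption: read length inferred from reported yield / mapped reads.
READ_LENGTH_BP = 150
# Assumption: approximate human genome size inferred from yield / coverage.
GENOME_SIZE_GB = 3.1

# Mapped reads (millions) as reported in the abstract.
MAPPED_READS_MILLIONS = {
    "H3": 1084.92,
    "H4": 894.73,
    "H5": 1068.62,
    "H6": 1005.77,
}


def sequence_yield_gb(reads_millions, read_length_bp=READ_LENGTH_BP):
    """Total sequenced bases in gigabases: reads x read length."""
    return reads_millions * 1e6 * read_length_bp / 1e9


def mean_coverage(yield_gb, genome_size_gb=GENOME_SIZE_GB):
    """Average depth of coverage: sequenced bases / genome size."""
    return yield_gb / genome_size_gb


if __name__ == "__main__":
    for sample, reads in MAPPED_READS_MILLIONS.items():
        gb = sequence_yield_gb(reads)
        print(f"{sample}: {gb:.2f} Gb, {mean_coverage(gb):.2f}x")
```

Under these assumptions the computed yields and coverages agree with the reported values to within rounding, which suggests the reported coverage was derived the same way.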

KoVariome: Korean National Standard Reference Variome database of whole genomes with comprehensive SNV, indel, CNV, and SV analyses

10.1101/187096 ◽

2017 ◽

Author(s):

Jungeun Kim ◽

Jessica A. Weber ◽

Sungwoong Jho ◽

Jinho Jang ◽

JeHoon Jun ◽

...

Keyword(s):

Sequence Data ◽

Copy Number Variations ◽

Genetic Variations ◽

Korean Population ◽

National Standard ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Personal Genome ◽

Whole Genome ◽

Sequencing Data

Abstract High-coverage whole-genome sequencing data of a single ethnicity can provide a useful catalogue of population-specific genetic variations. Herein, we report a comprehensive analysis of the Korean population, and present the Korean National Standard Reference Variome (KoVariome). As a part of the Korean Personal Genome Project (KPGP), we constructed the KoVariome database using 5.5 terabases of whole genome sequence data from 50 healthy Korean individuals with an average coverage depth of 31×. In total, KoVariome includes 12.7M single-nucleotide variants (SNVs), 1.7M short insertions and deletions (indels), 4K structural variations (SVs), and 3.6K copy number variations (CNVs). Among them, 2.4M (19%) SNVs and 0.4M (24%) indels were identified as novel. We also discovered selective enrichment of 3.8M SNVs and 0.5M indels in Korean individuals, which were used to filter out 1,271 coding-SNVs not originally removed from the 1,000 Genomes Project data when prioritizing disease-causing variants. CNV analyses revealed gene losses related to bone mineral densities and duplicated genes involved in brain development and fat reduction. Finally, KoVariome health records were used to identify novel disease-causing variants in the Korean population, demonstrating the value of high-quality ethnic variation databases for the accurate interpretation of individual genomes and the precise characterization of genetic variations.

Accurate Phasing of Pedigree Genotypes Using Whole Genome Sequence Data

10.1101/148510 ◽

2017 ◽

Author(s):

A.N. Blackburn ◽

M.Z. Kos ◽

N.B. Blackburn ◽

J.M. Peralta ◽

P. Stevens ◽

...

Keyword(s):

Error Rate ◽

Sequence Data ◽

Software Implementation ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Genotype Data ◽

Whole Genome ◽

Genotyping Error ◽

Sequencing Data ◽

Missing Genotypes

Abstract Phasing, the process of predicting haplotypes from genotype data, is an important undertaking in genetics and an ongoing area of research. Phasing methods, and associated software, designed specifically for pedigrees are urgently needed. Here we present a new method for phasing genotypes from whole genome sequencing data in pedigrees: PULSAR (Phasing Using Lineage Specific Alleles / Rare variants). The method is built upon the idea that alleles that are specific to a single founding chromosome within a pedigree, which we refer to as lineage-specific alleles, are highly informative for identifying haplotypes that are identical-by-descent between individuals within a pedigree. Through extensive simulation we assess the performance of PULSAR in a variety of pedigree sizes and structures, and we explore the effects of genotyping errors and presence of non-sequenced individuals on its performance. If the genotyping error rate is sufficiently low, PULSAR can phase > 99.9% of heterozygous genotypes with a switch error rate below 1 × 10⁻⁴ in pedigrees where all individuals are sequenced. We demonstrate that the method is highly accurate and consistently outperforms the long-range phasing approach used for comparison in our benchmarking. The method also holds promise for fixing genotype errors or imputing missing genotypes. The software implementation of this method is freely available.

Genome-scale profiling reveals higher proportions of phylogenetic signal in non-coding data

10.1101/712646 ◽

2019 ◽

Author(s):

Robert Literman ◽

Rachel S. Schwartz

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

Phylogenetic Signal ◽

Whole Genome Sequence ◽

Whole Genome ◽

Phylogenetic Information ◽

Genome Sequence Data ◽

Evolutionary Forces ◽

Limited Degree ◽

Genomic Regions

Abstract Accurate estimates of species relationships are integral to our understanding of evolution, yet many relationships remain controversial despite whole-genome sequence data. These controversies are due in part to complex patterns of phylogenetic and non-phylogenetic signal coming from regions of the genome experiencing distinct evolutionary forces, which can be difficult to disentangle. Here we profile the amounts and proportions of phylogenetic and non-phylogenetic signal derived from loci spread across mammalian genomes. We identified orthologous sequences from primates, rodents, and Pecora, annotated sites as one or more of nine locus types (e.g. coding, intronic, intergenic), and profiled the phylogenetic information contained within locus types across evolutionary timescales associated with each clade. In all cases, non-coding loci provided more overall signal and a higher proportion of phylogenetic signal compared to coding loci. This suggests potential benefits of shifting away from primarily targeting genes or coding regions for phylogenetic studies, particularly in this era of accessible whole genome sequence data. In contrast to long-held assumptions about the phylogenetic utility of more variable genomic regions, most locus types provided relatively consistent phylogenetic information across timescales, although we find evidence that coding and intronic regions may, respectively and to a limited degree, inform disproportionately about older and younger splits. As part of this work we also validate the SISRS pipeline as an annotation-free ortholog discovery pipeline capable of identifying millions of phylogenetically informative sites directly from raw sequencing reads.

Evaluation of parameters affecting performance and reliability of machine learning-based antibiotic susceptibility testing from whole genome sequencing data

10.1101/607127 ◽

2019 ◽

Author(s):

Allison L. Hicks ◽

Nicole Wheeler ◽

Leonor Sánchez-Busó ◽

Jennifer L. Rakeman ◽

Simon R. Harris ◽

...

Keyword(s):

Machine Learning ◽

Antibiotic Resistance ◽

Antibiotic Susceptibility ◽

Sequence Data ◽

Model Performance ◽

Outcome Data ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Abstract Prediction of antibiotic resistance phenotypes from whole genome sequencing data by machine learning methods has been proposed as a promising platform for the development of sequence-based diagnostics. However, there has been no systematic evaluation of factors that may influence performance of such models, how they might apply to and vary across clinical populations, and what the implications might be in the clinical setting. Here, we performed a meta-analysis of seven large Neisseria gonorrhoeae datasets, as well as Klebsiella pneumoniae and Acinetobacter baumannii datasets, with whole genome sequence data and antibiotic susceptibility phenotypes using set covering machine classification, random forest classification, and random forest regression models to predict resistance phenotypes from genotype. We demonstrate how model performance varies by drug, dataset, resistance metric, and species, reflecting the complexities of generating clinically relevant conclusions from machine learning-derived models. Our findings underscore the importance of incorporating relevant biological and epidemiological knowledge into model design and assessment and suggest that doing so can inform tailored modeling for individual drugs, pathogens, and clinical populations. We further suggest that continued comprehensive sampling and incorporation of up-to-date whole genome sequence data, resistance phenotypes, and treatment outcome data into model training will be crucial to the clinical utility and sustainability of machine learning-based molecular diagnostics.

Author Summary Machine learning-based prediction of antibiotic resistance from bacterial genome sequences represents a promising tool to rapidly determine the antibiotic susceptibility profile of clinical isolates and reduce the morbidity and mortality resulting from inappropriate and ineffective treatment. However, while there has been much focus on demonstrating the diagnostic potential of these modeling approaches, there has been little assessment of potential caveats and prerequisites associated with implementing predictive models of drug resistance in the clinical setting. Our results highlight significant biological and technical challenges facing the application of machine learning-based prediction of antibiotic resistance as a diagnostic tool. By outlining specific factors affecting model performance, our findings provide a framework for future work on modeling drug resistance and underscore the necessity of continued comprehensive sampling and reporting of treatment outcome data for building reliable and sustainable diagnostics.

Ethnically diverse urban transmission networks of Neisseria gonorrhoeae without evidence of HIV serosorting

Sexually Transmitted Infections ◽

10.1136/sextrans-2019-054025 ◽

2019 ◽

Vol 96 (2) ◽

pp. 106-109

Author(s):

Jayshree Dave ◽

John Paul ◽

Thomas Joshua Pasvol ◽

Andy Williams ◽

Fiona Warburton ◽

...

Keyword(s):

Neisseria Gonorrhoeae ◽

Ethnic Groups ◽

Antimicrobial Susceptibility ◽

Sequence Data ◽

Small Sample ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequencing Data ◽

Transmission Networks ◽

Hiv Serosorting

Objective: We aimed to characterise gonorrhoea transmission patterns in a diverse urban population by linking genomic, epidemiological and antimicrobial susceptibility data.

Methods: Neisseria gonorrhoeae isolates from patients attending sexual health clinics at Barts Health NHS Trust, London, UK, during an 11-month period underwent whole-genome sequencing and antimicrobial susceptibility testing. We combined laboratory and patient data to investigate the transmission network structure.

Results: One hundred and fifty-eight isolates from 158 patients were available with associated descriptive data. One hundred and twenty-nine (82%) patients identified as male and 25 (16%) as female; four (3%) records lacked gender information. Self-described ethnicities were: 51 (32%) English/Welsh/Scottish; 33 (21%) white, other; 23 (15%) black British/black African/black, other; 12 (8%) Caribbean; 9 (6%) South Asian; 6 (4%) mixed ethnicity; and 10 (6%) other; data were missing for 14 (9%). Self-reported sexual orientations were 82 (52%) men who have sex with men (MSM); 49 (31%) heterosexual; 2 (1%) bisexual; data were missing for 25 individuals. Twenty-two (14%) patients were HIV positive. Whole-genome sequence data were generated for 151 isolates, which linked 75 (50%) patients to at least one other case. Using sequencing data, we found no evidence of transmission networks related to specific ethnic groups (p=0.64) or of HIV serosorting (p=0.35). Of 82 MSM/bisexual patients with sequencing data, 45 (55%) belonged to clusters of ≥2 cases, compared with 16/44 (36%) heterosexuals with sequencing data (p=0.06).

Conclusion: We demonstrate links between 50% of patients in transmission networks using a relatively small sample in a large cosmopolitan city. We found no evidence of HIV serosorting. Our results do not support assortative selectivity as an explanation for differences in gonorrhoea incidence between ethnic groups.

Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽

2021 ◽

Author(s):

Jiru Han ◽

Jacob E Munro ◽

Anthony Kocoski ◽

Alyssa E Barry ◽

Melanie Bahlo

Keyword(s):

Genetic Diversity ◽

Large Scale ◽

Tandem Repeats ◽

Plasmodium Species ◽

Whole Genome Sequence ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Genome Wide ◽

Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax published whole-genome sequence data from samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) were used to study Plasmodium genetic diversity, population structures and genomic signatures of selection and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax have been made available in an interactive web-based R Shiny application PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).

Discordant bioinformatic predictions of antimicrobial resistance from whole-genome sequencing data of bacterial isolates: An inter-laboratory study

10.1101/793885 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ronan M. Doyle ◽

Denise M. O’Sullivan ◽

Sean D. Aller ◽

Sebastian Bruchmann ◽

Taane Clark ◽

...

Keyword(s):

Antimicrobial Resistance ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Laboratory Study ◽

Clinical Microbiology ◽

Sequence Data ◽

Clinical Samples ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Abstract

Background: Antimicrobial resistance (AMR) poses a threat to public health. Clinical microbiology laboratories typically rely on culturing bacteria for antimicrobial susceptibility testing (AST). As the implementation costs and technical barriers fall, whole-genome sequencing (WGS) has emerged as a ‘one-stop’ test for epidemiological and predictive AST results. Few published comparisons exist for the myriad analytical pipelines used for predicting AMR. To address this, we performed an inter-laboratory study providing sets of participating researchers with identical short-read WGS data sequenced from clinical isolates, allowing us to assess the reproducibility of the bioinformatic prediction of AMR between participants and identify problem cases and factors that lead to discordant results.

Methods: We produced ten WGS datasets of varying quality from cultured carbapenem-resistant organisms obtained from clinical samples sequenced on either an Illumina NextSeq or HiSeq instrument. Nine participating teams (‘participants’) were provided these sequence data without any other contextual information. Each participant used their own pipeline to determine the species, the presence of resistance-associated genes, and to predict susceptibility or resistance to amikacin, gentamicin, ciprofloxacin and cefotaxime.

Results: Individual participants predicted different numbers of AMR-associated genes and different gene variants from the same clinical samples. The quality of the sequence data, choice of bioinformatic pipeline and interpretation of the results all contributed to discordance between participants. Although much of the inaccurate gene variant annotation did not affect genotypic resistance predictions, we observed low specificity when compared to phenotypic AST results, but this improved in samples with higher read depths. Had the results been used to predict AST and guide treatment, a different antibiotic would have been recommended for each isolate by at least one participant.

Conclusions: We found that participants produced discordant predictions from identical WGS data. These challenges, at the final analytical stage of using WGS to predict AMR, suggest the need for refinements when using this technology in clinical settings. Comprehensive public resistance sequence databases and standardisation in the comparisons between genotype and resistance phenotypes will be fundamental before AST prediction using WGS can be successfully implemented in standard clinical microbiology laboratories.

Population genomics of East Asian ethnic groups

Hereditas ◽

10.1186/s41065-020-00162-w ◽

2020 ◽

Vol 157 (1) ◽

Author(s):

Ziqing Pan ◽

Shuhua Xu

Keyword(s):

Genetic Diversity ◽

East Asia ◽

Ethnic Groups ◽

Population Genomics ◽

Sequence Data ◽

East Asian ◽

Whole Genome Sequence ◽

Whole Genome ◽

Evolutionary Forces ◽

Asian Populations

Abstract East Asia constitutes one-fifth of the global population and exhibits substantial genetic diversity. However, genetic investigations on populations in this region have been largely under-represented compared with European populations. Nonetheless, the last decade has seen considerable efforts and progress in genome-wide genotyping and whole-genome sequencing of the East-Asian ethnic groups. Here, we review the recent studies in terms of ancestral origin, population relationship, genetic differentiation, and admixture of major East-Asian groups, such as the Chinese, Korean, and Japanese populations. We mainly focus on insights from the whole-genome sequence data and also include the recent progress based on mitochondrial DNA (mtDNA) and Y chromosome data. We further discuss the evolutionary forces driving genetic diversity in East-Asian populations, and provide our perspectives for future directions on population genetics studies, particularly on underrepresented indigenous groups in East Asia.
