Tiling the genome into consistently named subsequences enables precision medicine and machine learning with millions of complex individual data-sets

2015 ◽  
Author(s):  
Sarah Guthrie ◽  
Abram Connelly ◽  
Peter Amstutz ◽  
Adam F. Berrey ◽  
Nicolas Cesar ◽  
...  

The scientific and medical community is reaching an era of inexpensive whole genome sequencing, opening the possibility of precision medicine for millions of individuals. Here we present tiling: a flexible representation of whole genome sequences that supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We partitioned the genome into 10,655,006 tiles: overlapping, variable-length sequences that begin and end with unique 24-base tags. We tiled and annotated 680 public whole genome sequences from the 1000 Genomes Project Consortium (1KG) and Harvard Personal Genome Project (PGP) using ClinVar database information. These genomes cover 14.13 billion tile sequences (4.087 trillion high quality bases and 0.4321 trillion low quality bases) and 251 phenotypes spanning ICD-9 code ranges 140-289, 320-629, and 680-759. We used these data to build a Global Alliance for Genomics and Health Beacon and graph database. We performed principal component analysis (PCA) on the 680 public whole genomes, and by projecting the tiled genomes onto their first two principal components, we replicated the 1KG principal component separation by population ethnicity codes. Interestingly, we found that the PGP self-reported ethnicities cluster consistently with 1KG ethnicity codes. We built a set of support-vector ABO blood-type classifiers using 75 PGP participants who had both a whole genome sequence and a self-reported blood type. Our classifier predicts A antigen presence to within 1% of the current state of the art for in silico A antigen prediction. Finally, we found six PGP participants with previously undiscovered pathogenic BRCA variants and, using our tiling, gave them simple, consistent names, which can be easily and independently re-derived. Given the near-future requirements of genomics research and precision medicine, we propose the adoption of tiling and invite all interested individuals and groups to view, rerun, copy, and modify these analyses at https://curover.se/su92l-j7d0g-swtofxa2rct8495
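To make the tiling idea in the abstract above concrete, here is a minimal Python sketch of cutting a sequence into overlapping tiles that each begin and end with a tag. It is an illustration under stated assumptions, not the authors' implementation: the function name `tile_sequence`, the toy sequences, and the tag list are hypothetical, and the tags are shortened to 4 bases for readability (the abstract describes 24-base tags).

```python
def tile_sequence(seq, tags):
    """Cut seq into overlapping tiles bounded by consecutive tags.

    Hypothetical simplification: every tag has the same length, occurs
    exactly once in seq, and the tags are listed in sequence order.
    """
    tag_len = len(tags[0])
    positions = []
    for tag in tags:
        first = seq.find(tag)
        # Reject tags that are missing or that occur more than once.
        if first == -1 or seq.find(tag, first + 1) != -1:
            raise ValueError(f"tag {tag!r} must occur exactly once")
        positions.append(first)
    if positions != sorted(positions):
        raise ValueError("tags must be listed in sequence order")

    # Each tile runs from the start of one tag to the end of the next,
    # so neighbouring tiles overlap on the tag they share.
    return [seq[start:end + tag_len]
            for start, end in zip(positions, positions[1:])]


tags = ["ACGT", "GACC", "ATCC"]           # toy 4-base tags
reference = "ACGTTTGACCAGGATCCTTAGC"      # toy sequence
variant = "ACGTTAGACCAGGATCCTTAGC"        # same sequence with one substitution

reference_tiles = tile_sequence(reference, tags)
variant_tiles = tile_sequence(variant, tags)
```

In this toy case the substitution changes only the first tile; the second tile is identical in both sequences, which is the property that lets a variant be named by the tile it falls in.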


2017 ◽  
Vol 5 (28) ◽  
Author(s):  
Sara Jones ◽  
Raji Prasad ◽  
Anjana S. Nair ◽  
Sanjai Dharmaseelan ◽  
Remya Usha ◽  
...  

ABSTRACT We report here the whole-genome sequence of six clinical isolates of influenza A(H1N1)pdm09, isolated from Kerala, India. Amino acid analysis of all gene segments from the A(H1N1)pdm09 isolates obtained in 2014 and 2015 identified several new mutations compared to the 2009 A(H1N1) pandemic strain.


Author(s):  
Viola Kurm ◽  
Ilse Houwers ◽  
Claudia E. Coipan ◽  
Peter Bonants ◽  
Cees Waalwijk ◽  
...  

Identification and classification of members of the Ralstonia solanacearum species complex (RSSC) is challenging due to the heterogeneity of this complex. Whole genome sequence data of 225 strains were used to classify strains based on average nucleotide identity (ANI) and multilocus sequence analysis (MLSA). Based on the ANI score (>95%), 191 out of 192 (99.5%) RSSC strains could be grouped into the three species R. solanacearum, R. pseudosolanacearum, and R. syzygii, and into the four phylotypes within the RSSC (I, II, III, and IV). R. solanacearum phylotype II could be split into two groups (IIA and IIB), of which IIB clustered in three subgroups (IIBa, IIBb and IIBc). This division by ANI was in accordance with MLSA. The IIB subgroups found by ANI and MLSA also differed in the number of SNPs in the primer and probe sites of various assays. An in silico analysis of eight TaqMan and 11 conventional PCR assays was performed using the whole genome sequences. Based on this analysis, several cases of potential false positives or false negatives can be expected upon the use of these assays for their intended target organisms. Two TaqMan assays and two PCR assays targeting the 16S rDNA sequence should be able to detect all phylotypes of the RSSC. We conclude that the increasing availability of whole genome sequences is not only useful for classification of strains, but also shows potential for selection and evaluation of clade-specific nucleic acid-based amplification methods within the RSSC.


2020 ◽  
Author(s):  
Zhong Peng ◽  
Junyang Liu ◽  
Wan Liang ◽  
Fei Wang ◽  
Li Wang ◽  
...  

Abstract Background: Different typing systems including capsular genotyping, lipopolysaccharide (LPS) genotyping, multilocus sequence typing (MLST), and virulence genotyping based on the detection of different virulence factor-encoding gene (VFG) profiles have been applied to characterize Pasteurella multocida strains from different host species. However, these methods require much time and effort in laboratories. In particular, it is difficult to address the biology of P. multocida from host species by relying on only one of these methods. Recently, we found that assigning P. multocida strains according to the combination of their capsular, LPS, and MLST genotypes (marked as capsular genotype: LPS genotype: MLST genotype) could help address the biological characteristics of P. multocida circulation in multiple hosts. However, a rapid, efficient, intelligent, and cost-saving tool to diagnose P. multocida according to this system is still lacking. Results: We have developed an intelligent genotyping and host tropism prediction tool, PmGT, for P. multocida strains according to their whole genome sequences by using machine learning and web 2.0 technologies. By using this tool, the capsular genotypes, LPS genotypes, and MLST genotypes as well as the main VFGs of P. multocida isolates in different host species were determined based on whole genome sequences. The results revealed a closer association between the genotypes and pasteurellosis than between genotypes and host species. Finally, we also used PmGT to predict the host species of P. multocida strains with the same capsular: lipopolysaccharide: MLST genotypes. Conclusions: With the advent of high-quality, inexpensive DNA sequencing, this platform represents a more efficient and cost-saving tool for P. multocida diagnosis in both epidemiological studies and clinical settings.


2019 ◽  
Author(s):  
DJ Darwin R. Bandoy ◽  
B Carol Huang ◽  
Bart C. Weimer

Taxonomic classification is an essential step in the analysis of microbiome data that depends on a reference database of whole genome sequences. Taxonomic classifiers are built on established reference species, such as those in the Human Microbiome Project database, which is growing rapidly. While constructing a population-wide pangenome of the bacterium Hungatella, we discovered that the Human Microbiome Project reference species Hungatella hathewayi (WAL 18680) was significantly different from other members of this genus. Specifically, the reference lacked the core genome as compared to the other members. Further analysis, using average nucleotide identity (ANI) and 16S rRNA comparisons, indicated that WAL 18680 was misclassified as Hungatella. The classification error is being amplified in the taxonomic classifiers and will have a compounding effect as microbiome analyses are done, resulting in inaccurate assignment of community members and leading to fallacious conclusions and possibly treatment. As automated genome homology assessment expands for microbiome analysis and outbreak detection, and as public health reliance on whole genomes increases, this issue will likely occur at an increasing rate. These observations highlight the need for developing reference-free methods for epidemiological investigation using whole genome sequences and the criticality of accurate reference databases.


2020 ◽  
Vol 117 (7) ◽  
pp. 3678-3686 ◽  
Author(s):  
JaeJin Choi ◽  
Sung-Hou Kim

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, the whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying the information theory-based feature frequency profile method, an “alignment-free” method, gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences on grouping and kinship among the groups at deep evolutionary levels. The ToL reveals 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning period near the root of the ToL: an explosive birth of life’s diversity.
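The abstract above names an alignment-free, information theory-based feature frequency profile method without detailing it. The following Python sketch shows one common way such a comparison can be set up: each sequence is reduced to the relative frequencies of its overlapping k-letter features, and two profiles are compared with the Jensen-Shannon divergence. The function names, the feature length `k = 2`, the toy sequences, and the choice of Jensen-Shannon divergence as the distance are assumptions made for illustration, not details taken from the abstract.

```python
from collections import Counter
from math import log2


def feature_frequency_profile(seq, k):
    """Relative frequency of every overlapping k-letter feature in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the feature length")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles."""
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mean = 0.5 * (pf + qf)
        # Terms with zero frequency contribute nothing to the sum.
        if pf > 0:
            divergence += 0.5 * pf * log2(pf / mean)
        if qf > 0:
            divergence += 0.5 * qf * log2(qf / mean)
    return divergence


# Toy protein-like strings (hypothetical data).
profile_a = feature_frequency_profile("MKVLAAGIVMKVLA", 2)
profile_b = feature_frequency_profile("MKVLSAGIVMKILA", 2)
profile_c = feature_frequency_profile("WWWWWWWW", 2)

distance_ab = js_divergence(profile_a, profile_b)
distance_ac = js_divergence(profile_a, profile_c)
```

With base-2 logarithms the divergence lies between 0 (identical profiles) and 1 (profiles with no shared features), so a matrix of pairwise values can be passed to any distance-based tree-building method.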


2019 ◽  
Vol 20 (5) ◽  
pp. 1215 ◽  
Author(s):  
Xavier Argemi ◽  
Yves Hansmann ◽  
Kevin Prola ◽  
Gilles Prévost

Coagulase-negative Staphylococci (CoNS) are skin commensal bacteria. Besides their role in maintaining homeostasis, CoNS have emerged as major pathogens in nosocomial settings. Several studies have investigated the molecular basis for this emergence and identified multiple putative virulence factors with regard to Staphylococcus aureus pathogenicity. In the last decade, numerous CoNS whole-genome sequences have been released, leading to the identification of numerous putative virulence factors. Koch’s postulates and the molecular rendition of these postulates, established by Stanley Falkow in 1988, do not explain the microbial pathogenicity of CoNS. However, whole-genome sequence data has shed new light on CoNS pathogenicity. In this review, we analyzed the contribution of genomics in defining CoNS virulence, focusing on the most frequent and pathogenic CoNS species: S. epidermidis, S. haemolyticus, S. saprophyticus, S. capitis, and S. lugdunensis.


2015 ◽  
Vol 3 (6) ◽  
Author(s):  
Phuong N. Tran ◽  
Nicholas E. H. Tan ◽  
Yin Peng Lee ◽  
Han Ming Gan ◽  
Steven J. Polter ◽  
...  

Here, we report the whole-genome sequences and annotation of 11 endophytic bacteria from poison ivy (Toxicodendron radicans) vine tissue. Five bacteria belong to the genus Pseudomonas, and six single members from other genera were found present in interior vine tissue of poison ivy.


2017 ◽  
Vol 92 (2) ◽  
Author(s):  
Ivan Borozan ◽  
Marc Zapatka ◽  
Lori Frappier ◽  
Vincent Ferretti

ABSTRACT Epstein-Barr virus (EBV) is a causative agent of a variety of lymphomas, nasopharyngeal carcinoma (NPC), and ∼9% of gastric carcinomas (GCs). An important question is whether particular EBV variants are more oncogenic than others, but conclusions are currently hampered by the lack of sequenced EBV genomes. Here, we contribute to this question by mining whole-genome sequences of 201 GCs to identify 13 EBV-positive GCs and by assembling 13 new EBV genome sequences, almost doubling the number of available GC-derived EBV genome sequences and providing the first non-Asian EBV genome sequences from GC. Whole-genome sequence comparisons of all EBV isolates sequenced to date (85 from tumors and 57 from healthy individuals) showed that most GC and NPC EBV isolates were closely related although American Caucasian GC samples were more distant, suggesting a geographical component. However, EBV GC isolates were found to contain some consistent changes in protein sequences regardless of geographical origin. In addition, transcriptome data available for eight of the EBV-positive GCs were analyzed to determine which EBV genes are expressed in GC. In addition to the expected latency proteins (EBNA1, LMP1, and LMP2A), specific subsets of lytic genes were consistently expressed that did not reflect a typical lytic or abortive lytic infection, suggesting a novel mechanism of EBV gene regulation in the context of GC. These results are consistent with a model in which a combination of specific latent and lytic EBV proteins promotes tumorigenesis.
IMPORTANCE Epstein-Barr virus (EBV) is a widespread virus that causes cancer, including gastric carcinoma (GC), in a small subset of individuals. An important question is whether particular EBV variants are more cancer associated than others, but more EBV sequences are required to address this question. Here, we have generated 13 new EBV genome sequences from GC, almost doubling the number of EBV sequences from GC isolates and providing the first EBV sequences from non-Asian GC. We further identify sequence changes in some EBV proteins common to GC isolates. In addition, gene expression analysis of eight of the EBV-positive GCs showed consistent expression of both the expected latency proteins and a subset of lytic proteins that was not consistent with typical lytic or abortive lytic expression. These results suggest that novel mechanisms activate expression of some EBV lytic proteins and that their expression may contribute to oncogenesis.


2013 ◽  
Vol 63 (Pt_7) ◽  
pp. 2742-2751 ◽  
Author(s):  
Henryk Urbanczyk ◽  
Yoshitoshi Ogura ◽  
Tetsuya Hayashi

Use of inadequate methods for classification of bacteria in the so-called Harveyi clade (family Vibrionaceae, Gammaproteobacteria) has led to incorrect assignment of strains and proliferation of synonymous species. In order to resolve taxonomic ambiguities within the Harveyi clade and to test the usefulness of whole genome sequence data for classification of Vibrionaceae, draft genome sequences of 12 strains were determined and analysed. The sequencing included type strains of seven species: Vibrio sagamiensis NBRC 104589T, Vibrio azureus NBRC 104587T, Vibrio harveyi NBRC 15634T, Vibrio rotiferianus LMG 21460T, Vibrio campbellii NBRC 15631T, Vibrio jasicida LMG 25398T, and Vibrio owensii LMG 25443T. Draft genome sequences of strain LMG 25430, previously designated the type strain of [Vibrio communis], and two strains (MWB 21 and 090810c) from the ‘beijerinckii’ lineage were also determined. Whole genomes of two additional strains (ATCC 25919 and 200612B) that previously could not be assigned to any Harveyi clade species were also sequenced. Analysis of the genome sequence data revealed a clear case of synonymy between V. owensii and [V. communis], confirming an earlier proposal to synonymize both species. Both strains from the ‘beijerinckii’ lineage were classified as V. jasicida, while the strains ATCC 25919 and 200612B were classified as V. owensii and V. campbellii, respectively. We also found that two strains, AND4 and Ex25, are closely related to Harveyi clade bacteria, but could not be assigned to any species of the family Vibrionaceae. The use of whole genome sequence data for the taxonomic classification of the Harveyi clade bacteria and other members of the family Vibrionaceae is also discussed.

