SUITE OF TOOLS FOR STATISTICAL N-GRAM LANGUAGE MODELING FOR PATTERN MINING IN WHOLE GENOME SEQUENCES

2012 ◽  
Vol 10 (06) ◽  
pp. 1250016 ◽  
Author(s):  
MADHAVI K. GANAPATHIRAJU ◽  
ASIA D. MITCHELL ◽  
MOHAMED THAHIR ◽  
KAMIYA MOTWANI ◽  
SESHAN ANANTHASUBRAMANIAN

Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most genomic sequence patterns. We extended the suffix-array-based Biological Language Modeling Toolkit to compute n-gram frequencies, as well as n-gram language-model-based perplexity, in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and its application to the analysis of the whole human genome sequence.
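The windowed perplexity idea in this abstract can be illustrated with a minimal sketch. It does not reproduce the Biological Language Modeling Toolkit: it uses plain dictionary counting instead of a suffix array, and the add-one smoothing, the function names `ngram_counts` and `window_perplexity`, and the window and step sizes are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def window_perplexity(seq, n=3, window=20, step=10, alphabet="ACGT"):
    """Return (start, perplexity) for each window of seq.

    The n-gram model is estimated on the whole sequence with add-one
    smoothing (an assumption, not the toolkit's documented method).
    """
    if window < n:
        raise ValueError("window must be at least n characters long")
    grams = ngram_counts(seq, n)
    # Context count = number of n-grams that begin with that (n-1)-gram.
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    vocab = len(alphabet)
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        log_sum = 0.0
        total = 0
        for i in range(len(chunk) - n + 1):
            gram = chunk[i:i + n]
            prob = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab)
            log_sum += math.log(prob)
            total += 1
        # Perplexity is the exponential of the mean negative log-likelihood.
        results.append((start, math.exp(-log_sum / total)))
    return results
```

Under these assumptions, windows that fall inside a repetitive region score a lower perplexity than windows over less predictable sequence, which is the kind of signal a genome-wide scan could flag for closer inspection.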

2020 ◽  
Vol 117 (7) ◽  
pp. 3678-3686 ◽  
Author(s):  
JaeJin Choi ◽  
Sung-Hou Kim

An organism tree of life (organism ToL) is a conceptual and metaphorical tree to capture a simplified narrative of the evolutionary course and kinship among the extant organisms. Such a tree cannot be experimentally validated but may be reconstructed based on characteristics associated with the organisms. Since the whole-genome sequence of an organism is, at present, the most comprehensive descriptor of the organism, a whole-genome sequence-based ToL can be an empirically derivable surrogate for the organism ToL. However, experimentally determining the whole-genome sequences of many diverse organisms was practically impossible until recently. We have constructed three types of ToLs for diversely sampled organisms using the sequences of whole genome, of whole transcriptome, and of whole proteome. Of the three, the whole-proteome sequence-based ToL (whole-proteome ToL), constructed by applying an information theory-based feature frequency profile method (an “alignment-free” method), gave the most topologically stable ToL. Here, we describe the main features of a whole-proteome ToL for 4,023 species with known complete or almost complete genome sequences, with respect to grouping and kinship among the groups at deep evolutionary levels. The ToL reveals that 1) all extant organisms of this study can be grouped into 2 “Supergroups,” 6 “Major Groups,” or 35+ “Groups”; 2) the order of emergence of the “founders” of all of the groups may be assigned on an evolutionary progression scale; and 3) all of the founders of the groups have emerged in a “deep burst” at the very beginning period near the root of the ToL, an explosive birth of life’s diversity.
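The alignment-free comparison mentioned in this abstract can be sketched as follows. The abstract does not name the divergence measure, so the use of Jensen-Shannon divergence here is an assumption, as are the function names and the feature length `k`. Each sequence is reduced to a profile of k-mer frequencies, and two profiles are compared directly, without aligning the sequences.

```python
import math
from collections import Counter


def feature_frequency_profile(seq, k):
    """Relative frequency of every overlapping k-mer (feature) in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than the feature length k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency profiles.

    Returns 0 for identical profiles and 1 for profiles with no shared
    features. Chosen here as an illustrative distance only.
    """
    divergence = 0.0
    for feature in set(p) | set(q):
        pf = p.get(feature, 0.0)
        qf = q.get(feature, 0.0)
        mean = 0.5 * (pf + qf)
        if pf > 0:
            divergence += 0.5 * pf * math.log2(pf / mean)
        if qf > 0:
            divergence += 0.5 * qf * math.log2(qf / mean)
    return divergence
```

A pairwise matrix of such divergences over all proteomes could then be passed to a standard distance-based tree-building method. That step is omitted here.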


2015 ◽  
Vol 3 (6) ◽  
Author(s):  
Phuong N. Tran ◽  
Nicholas E. H. Tan ◽  
Yin Peng Lee ◽  
Han Ming Gan ◽  
Steven J. Polter ◽  
...  

Here, we report the whole-genome sequences and annotation of 11 endophytic bacteria from poison ivy (Toxicodendron radicans) vine tissue. Five bacteria belong to the genus Pseudomonas, and six single members from other genera were found present in interior vine tissue of poison ivy.


2013 ◽  
Vol 63 (Pt_7) ◽  
pp. 2742-2751 ◽  
Author(s):  
Henryk Urbanczyk ◽  
Yoshitoshi Ogura ◽  
Tetsuya Hayashi

Use of inadequate methods for classification of bacteria in the so-called Harveyi clade (family Vibrionaceae, Gammaproteobacteria) has led to incorrect assignment of strains and proliferation of synonymous species. In order to resolve taxonomic ambiguities within the Harveyi clade and to test usefulness of whole genome sequence data for classification of Vibrionaceae, draft genome sequences of 12 strains were determined and analysed. The sequencing included type strains of seven species: Vibrio sagamiensis NBRC 104589T, Vibrio azureus NBRC 104587T, Vibrio harveyi NBRC 15634T, Vibrio rotiferianus LMG 21460T, Vibrio campbellii NBRC 15631T, Vibrio jasicida LMG 25398T, and Vibrio owensii LMG 25443T. Draft genome sequences of strain LMG 25430, previously designated the type strain of [Vibrio communis], and two strains (MWB 21 and 090810c) from the ‘beijerinckii’ lineage were also determined. Whole genomes of two additional strains (ATCC 25919 and 200612B) that previously could not be assigned to any Harveyi clade species were also sequenced. Analysis of the genome sequence data revealed a clear case of synonymy between V. owensii and [V. communis], confirming an earlier proposal to synonymize both species. Both strains from the ‘beijerinckii’ lineage were classified as V. jasicida, while the strains ATCC 25919 and 200612B were classified as V. owensii and V. campbellii, respectively. We also found that two strains, AND4 and Ex25, are closely related to Harveyi clade bacteria, but could not be assigned to any species of the family Vibrionaceae. The use of whole genome sequence data for the taxonomic classification of the Harveyi clade bacteria and other members of the family Vibrionaceae is also discussed.


2018 ◽  
Vol 6 (2) ◽  
Author(s):  
Xiaoan Cao ◽  
Zhaocai Li ◽  
Zhongzi Lou ◽  
Baoquan Fu ◽  
Yongsheng Liu ◽  
...  

The facultative intracellular Gram-negative bacterium Brucella melitensis causes brucellosis in domestic and wild mammals. Brucella melitensis QH61 was isolated from a yak suffering from abortion in 2015 in Qinghai, China. Here, we report the whole-genome sequence of B. melitensis strain QH61.


2020 ◽  
Author(s):  
Antonio Roberto Gomes de Farias ◽  
Wilson José da Silva Junior ◽  
José Bandeira do Nascimento Junior ◽  
Valdir de Queiroz Balbino ◽  
Ana Maria Benko-Iseppon ◽  
...  

Background: Xanthomonas citri pv. viticola causes one of the most critical grapevine diseases in the Northeast of Brazil, presenting a high risk to Brazilian and worldwide areas of grape production. The epithet X. citri pv. viticola was recently proposed to replace X. campestris pv. viticola based on multilocus sequence analysis and whole-genome sequences. Genomics has also revolutionized the field of bacteriology by combining genome sequencing with comparative in silico analyses such as DNA-DNA hybridization, average nucleotide identity, distance between genomes, pan-genomics, and phylogenomics, providing valuable insights into virulence factors and helping to clarify the taxonomic relationships of Xanthomonas and other prokaryotic species. Results: We used the whole-genome sequences of three Brazilian strains and the pathotype to characterize X. citri pv. viticola accessions, plus 124 whole-genome sequences of Xanthomonas species available in NCBI, comprising 13 species and 15 pathovars. The whole-genome sequence structure of X. citri pv. viticola was shown to be highly conserved relative to other X. citri species. Pan-genomic approaches, average nucleotide identity analysis, and in silico DNA-DNA hybridization were carried out, allowing characterization of X. citri pv. viticola and inferences on the phylogenetic relationships within Xanthomonas. In all approaches used, the analysis of the 128 genome sequences clustered the Xanthomonas strains into eight main groups, in agreement with the recently proposed classification. The analysis also revealed that X. hortorum and X. gardneri should be classified as a single species, and that strain 17 of X. campestris and strain XC01 of X. citri pv. mangiferaeindicae, both widely described in the literature, are misclassified. Conclusions: We performed the genomic characterization of three representative Brazilian strains of Xcv. The genomic approaches based on the pan-genome, average nucleotide identity, and in silico DNA-DNA hybridization support the proposed taxonomic position of X. citri pv. viticola and of the recently proposed Xanthomonas species and pathovars. In addition, we resolved the species delimitation of misclassified Xanthomonas strains that have been studied extensively in the literature.


2019 ◽  
Vol 8 (3) ◽  
Author(s):  
Ana Maria Bocsanczy ◽  
Andres S. Espindola ◽  
David J. Norman

Ralstonia solanacearum is the causal agent of bacterial wilt in numerous species of plants. Here, we report the whole-genome sequences of three phylogenetically diverse R. solanacearum strains, P816, P822, and P824, reported for the first time as causal agents of an emerging blueberry disease in Florida.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gehendra Bhattarai ◽  
Ainong Shi ◽  
Devi R. Kandel ◽  
Nora Solís-Gracia ◽  
Jorge Alberto da Silva ◽  
...  

The availability of well-assembled genome sequences and reduced sequencing costs have enabled the resequencing of many additional accessions in several crops, thus facilitating the rapid discovery and development of simple sequence repeat (SSR) markers. Although the genome sequence of inbred spinach line Sp75 is available, previous efforts have resulted in a limited number of useful SSR markers. Identification of additional polymorphic SSR markers will support genetics and breeding research in spinach. This study aimed to use the available genomic resources to mine and catalog a large number of polymorphic SSR markers. A search for SSR loci on the six chromosome sequences of the spinach line Sp75 reference genome using GMATA identified a total of 42,155 loci with repeat motifs of two to six nucleotides. Whole-genome sequences (30x) of 21 additional accessions were aligned against the chromosome sequences of the reference genome and genotyped in silico using the HipSTR program by comparing and counting repeat number variation across the SSR loci among the accessions. The SSR genotype data generated by HipSTR were filtered to remove monomorphic loci and loci with a high proportion of missing data, and a final set of 5,986 polymorphic SSR loci was identified. The polymorphic SSR loci were present at a density of 12.9 SSRs/Mb and were physically mapped. Of 36 SSR loci randomly selected for validation, two failed to amplify, while the remaining loci were all polymorphic in a set of 48 spinach accessions from 34 countries. Genetic diversity analysis performed using the SSR allele score data on the 48 spinach accessions showed three main population groups. This strategy to mine and develop polymorphic SSR markers by a comparative analysis of the genome sequences of multiple accessions and computational genotyping of the candidate SSR loci eliminates the need for laborious experimental screening. Our approach increased the efficiency of discovering a large set of novel polymorphic SSR markers, as demonstrated in this report.
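The SSR search described in this abstract, covering motifs of two to six nucleotides repeated in tandem, can be sketched with a regular expression. This is not how GMATA or HipSTR works internally. The function name `find_ssrs`, the `min_repeats` threshold, and the choice to discard single-base runs are assumptions made for illustration.

```python
import re


def find_ssrs(seq, min_repeats=3):
    """Find tandem repeats of 2-6 nt motifs in a DNA sequence.

    Returns a list of (start, motif, repeat_count) tuples. The lazy
    quantifier makes the regex prefer the shortest motif, so "ATATAT"
    is reported with motif "AT" rather than "ATAT".
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip single-base runs such as "AAAAAA"
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits
```

Running the same search on several accessions and comparing `repeat_count` at matching loci would give a rough picture of which loci vary. The study itself relied on read alignment and HipSTR genotyping for that step.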


2019 ◽  
Vol 8 (32) ◽  
Author(s):  
Sofia B. Mohamed ◽  
Mohamed Hassan ◽  
Abdalla Munir ◽  
Sumaya Kambal ◽  
Nusiba I. Abdalla ◽  
...  

Acinetobacter baumannii has emerged as an important pathogen leading to multiple nosocomial outbreaks. Here, we describe the genomic sequence of a multidrug-resistant Acinetobacter baumannii sequence type 164 (ST164) isolate from a hospital patient in Sudan. To our knowledge, this is the first reported draft genome of an A. baumannii strain isolated from Sudan.

