Large-scale k-mer-based analysis of the informational properties of genomes, comparative genomics and taxonomy

PLoS ONE ◽  
2021 ◽  
Vol 16 (10) ◽  
pp. e0258693
Author(s):  
Yuval Bussi ◽  
Ruti Kapon ◽  
Ziv Reich

Information-theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses on four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life, with clear boundaries for superkingdom domains and high subtree similarity for named taxa at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
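The two quantities named in this abstract, k-mer sequence space coverage and pairwise k-mer Jaccard similarity, can be illustrated with a minimal Python sketch. The function names are hypothetical, and "coverage" is assumed here to mean the fraction of the 4^k possible k-mers observed in a sequence. This is not the authors' pipeline, and strand handling (such as canonical k-mers) is deliberately left out.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq that contain only A, C, G or T."""
    seq = seq.upper()
    alphabet = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    }


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def kmer_jaccard(seq1, seq2, k):
    """Jaccard similarity between the k-mer sets of two sequences."""
    return jaccard(kmers(seq1, k), kmers(seq2, k))


def coverage(seq, k):
    """Fraction of the 4**k possible k-mers observed in seq (assumed definition)."""
    return len(kmers(seq, k)) / 4 ** k
```

For whole genomes at k = 21 or 31, exact sets of this kind become large, so a practical comparison would likely rely on a more compact representation than the plain Python sets used in this sketch.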

Author(s):  
Jianglin Feng ◽  
Nathan C Sheffield

Abstract Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability: https://github.com/databio/IGD
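The general idea of binning intervals for fast overlap search can be sketched as follows. This is a generic fixed-width binning illustration under stated assumptions, not IGD's actual data layout or API: the bin width, the function names and the single-chromosome, half-open `[start, end)` interval model are all hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # hypothetical bin width in base pairs


def build_index(intervals, bin_size=BIN_SIZE):
    """Map each bin number to the (start, end, label) intervals overlapping it.

    Intervals are half-open [start, end) with start < end.
    """
    index = defaultdict(list)
    for start, end, label in intervals:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            index[b].append((start, end, label))
    return index


def query(index, q_start, q_end, bin_size=BIN_SIZE):
    """Return the sorted intervals overlapping the half-open query [q_start, q_end)."""
    hits = set()
    for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
        for start, end, label in index.get(b, ()):
            # Two half-open intervals overlap when each starts before the other ends.
            if start < q_end and q_start < end:
                hits.add((start, end, label))
    return sorted(hits)
```

Only the bins touched by the query are scanned, so lookup cost depends on the local interval density rather than on the total number of stored intervals. An interval spanning several bins is stored once per bin, which is why the hits are deduplicated through a set.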


Author(s):  
Anna Lavecchia ◽  
Matteo Chiara ◽  
Caterina De Virgilio ◽  
Caterina Manzari ◽  
Carlo Pazzani ◽  
...  

Abstract Staphylococcus cohnii (SC), a coagulase-negative bacterium, was first isolated in 1975 from human skin. Early phenotypic analyses led to the delineation of two subspecies (subsp.), Staphylococcus cohnii subsp. cohnii (SCC) and Staphylococcus cohnii subsp. urealyticus (SCU). SCC was considered to be specific to humans whereas SCU apparently demonstrated a wider host range, from lower primates to humans. The type strains ATCC 29974 and ATCC 49330 have been designated for SCC and SCU, respectively. Comparative analysis of 66 complete genome sequences—including a novel SC isolate—revealed unexpected patterns within the SC complex, both in terms of genomic sequence identity and gene content, highlighting the presence of 3 phylogenetically distinct groups. Based on our observations, and on the current guidelines for taxonomic classification for bacterial species, we propose a revision of the SC species complex. We suggest that SCC and SCU should be regarded as two distinct species: SC and SU (Staphylococcus urealyticus), and that two distinct subspecies, SCC and SCB (SC subsp. barensis, represented by the novel strain isolated in Bari) should be recognized within SC. Furthermore, since large scale comparative genomics studies recurrently suggest inconsistencies or conflicts in taxonomic assignments of bacterial species, we believe that the approach proposed here might be considered for more general application.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
James M. Kunert-Graf ◽  
Nikita A. Sakhanenko ◽  
David J. Galas

Abstract Background: Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form, in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results: In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based on information theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples. Conclusions: The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
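For orientation, the naive baseline that this abstract contrasts against can be sketched in Python: the phenotype labels are shuffled and the count tables are rebuilt from scratch on every iteration. This is an illustrative sketch only, not the authors' method of transforming count tables directly, and not the code in their repository. The function names are hypothetical, and mutual information is used as one example of a measure computable from count tables.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) estimated from counts of paired discrete values."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in cxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * math.log2(c * n / (cx[x] * cy[y]))
    return mi


def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Naive permutation test: rebuild the count tables after every label shuffle."""
    rng = random.Random(seed)
    observed = mutual_information(genotypes, phenotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(genotypes, shuffled) >= observed:
            exceed += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)
```

A SNP–SNP pair can be passed as a list of `(snp1, snp2)` tuples in `genotypes`. Each iteration costs time proportional to the number of samples because the counters are recomputed, which is the bottleneck the abstract describes.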


2001 ◽  
Vol 2 (4) ◽  
pp. 243-251
Author(s):  
Jo Wixon

We bring you a report from the CSHL Genome Sequencing and Biology Meeting, which has a long and prestigious history. This year there were sessions on large-scale sequencing and analysis, polymorphisms (covering discovery and technologies and mapping and analysis), comparative genomics of mammalian and model organism genomes, functional genomics and bioinformatics.


Author(s):  
Shinyi Lee ◽  
Tan Yigitcanlar

Stormwater has been recognised as one of the main culprits of aquatic ecosystem pollution and as a significant threat to the goal of ecologically sustainable development. Water sensitive urban design is one of the key responses to the need to better manage urban stormwater runoff, the objectives of which go beyond rapid and efficient conveyance. Underpinned by the concepts of sustainable urban development, water sensitive urban design has proven to be an efficient and environmentally-friendly approach to urban stormwater management, with the necessary technical know-how and skills already available. However, large-scale implementation of water sensitive urban design is still lacking in Australia due to significant impediments and negative perceptions. Identification of the issues, barriers and drivers that affect sustainability outcomes of urban stormwater management is one of the first steps towards encouraging the wide-scale uptake of water sensitive urban design features which integrate sustainable urban stormwater management. This chapter investigates key water sensitive urban design perceptions, drivers and barriers in order to improve sustainable urban stormwater management efforts.


2020 ◽  
Author(s):  
Agata Motyka-Pomagruk ◽  
Sabina Zoledowska ◽  
Agnieszka Emilia Misztak ◽  
Wojciech Sledz ◽  
Alessio Mengoni ◽  
...  

Abstract Background: Dickeya solani is an important plant pathogenic bacterium causing severe losses in European potato production. This species draws a lot of attention due to its remarkable virulence, great devastating potential and easier spread in contrast to other Dickeya spp. In view of a high need for extensive studies on economically important soft rot Pectobacteriaceae, we performed a comparative genomics analysis on D. solani strains to search for genetic foundations that would explain the differences in the observed virulence levels within the D. solani population. Results: High quality assemblies of 8 de novo sequenced D. solani genomes have been obtained. Whole-sequence comparison, ANIb, ANIm, Tetra and pangenome-oriented analyses performed on these genomes and the sequences of 14 additional strains revealed an exceptionally high level of homogeneity among the studied genetic material of D. solani strains. With the use of 22 genomes, the pangenome of D. solani, comprising 84.7% core, 7.2% accessory and 8.1% unique genes, has been almost completely determined, suggesting the presence of a nearly closed pangenome structure. Attribution of the genes included in the D. solani pangenome fractions to functional COG categories showed that higher percentages of accessory and unique pangenome parts, in contrast to the core section, are encountered in phage/mobile element- and transcription-associated groups, with the genome of the RNS 05.1.2A strain having the most significant impact. Also, the first D. solani large-scale genome-wide phylogeny computed on concatenated core gene alignments is herein reported. Conclusions: The almost closed status of the D. solani pangenome achieved in this work points to the fact that the unique gene pool of this species should no longer expand. Such a feature is characteristic of taxa whose representatives either occupy isolated ecological niches or lack efficient mechanisms for gene exchange and recombination, which seems rational concerning a strictly pathogenic species with clonal population structure. Finally, no obvious correlations between the geographical origin of D. solani strains and their phylogeny were found, which might reflect the specificity of the international seed potato market.
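The core, accessory and unique fractions mentioned in this abstract can be illustrated with a small Python sketch. It assumes the common convention that a core gene family is present in every genome, a unique one in exactly one genome, and an accessory one in any other number of genomes. The abstract does not state the thresholds the authors used, so these definitions, the function name and the input format are all assumptions.

```python
from collections import Counter


def pangenome_fractions(genomes):
    """Split gene families into core, accessory and unique sets.

    genomes: dict mapping a genome name to the set of gene-family ids it carries.
    Returns (partition, fractions), each a dict keyed by category name.
    """
    n = len(genomes)
    presence = Counter(gene for genes in genomes.values() for gene in genes)
    partition = {"core": set(), "accessory": set(), "unique": set()}
    for gene, count in presence.items():
        if count == n:
            partition["core"].add(gene)       # present in every genome
        elif count == 1:
            partition["unique"].add(gene)     # present in exactly one genome
        else:
            partition["accessory"].add(gene)  # present in some but not all
    total = len(presence)
    fractions = {
        name: (len(genes) / total if total else 0.0)
        for name, genes in partition.items()
    }
    return partition, fractions
```

With a single input genome every family counts as core under these definitions, so the split is only informative for two or more genomes.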


2015 ◽  
Vol 40 ◽  
Author(s):  
Christy Danelle Di Frances

Robert Louis Stevenson is well known as a writer of popular Victorian adventures, yet much of his fiction is steeped in the cultural and historical preoccupations of Scotland. Texts such as Kidnapped (1886), The Master of Ballantrae (1889), and Catriona (1893) hinge upon culturally significant events such as the Jacobite Rising of 1745 and the Appin Murder. These works also allude to the Highland Clearances of the eighteenth and nineteenth centuries and the Battle of Culloden with its ensuing disarming acts, all occurrences which contributed to or comprised significant catalysts for the large-scale expulsion of Scots from their homeland. Certainly, themes of exile pervade Stevenson’s Scottish work and maintain a more liminal presence in his later South Seas fiction, and many of the author’s finest characters can be read as enactments of temporary or permanent expatriates whose real-life counterparts form a fascinating cross-section of the diasporic movement. This paper focuses on several of these characters, whose adventures are encoded into their corresponding texts as fictional re-constructions of a broader experience common to displaced Scots in the eighteenth and nineteenth centuries. Some are driven from Scotland as a direct result of economic hardship or domestic conflict, while others leave (at least temporarily) as a means of avoiding the political corruption and intrigue characteristic of the historical struggle for Scottish independence. Through characters like David Balfour, Alan Breck Stewart, James Durie, and Archie Weir, Stevenson explores the psychological ramifications of politically enforced and self-imposed exile, thus providing fictional extrapolations of the Scottish diasporic experience. These portrayals, infused with the author’s own experiences abroad, offer fascinating microcosms which gesture towards the collective experience of a wide-scale network of displaced Scots in the Victorian world.
An early version of this paper was presented at the NAVSA 2012 “Victorian Networks” conference hosted by the University of Wisconsin at Madison.


2015 ◽  
pp. 1175-1203
Author(s):  
Anjan Pakhira ◽  
Peter Andras

Testing is a critical phase in the software life-cycle. While small-scale component-wise testing is done routinely as part of the development and maintenance of large-scale software, system-level testing of the whole software is much more problematic, due to the low coverage of potential usage scenarios by test cases and the high costs associated with wide-scale testing of large software. Here, the authors investigate the use of cloud computing to facilitate the testing of large-scale software. They discuss the aspects of cloud-based testing and provide an example application of this. They describe the testing of the functional importance of methods of classes in the Google Chrome software. The methods tested are predicted to be functionally important with respect to a functionality of the software. The authors use network analysis applied to dynamic analysis data generated by the software to make these predictions. They check the validity of these predictions by mutation testing of a large number of mutated variants of Google Chrome. The chapter provides details of how to set up the testing process on the cloud and discusses relevant technical issues.


2020 ◽  
Vol 497 (4) ◽  
pp. 4077-4090 ◽  
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey

ABSTRACT A non-zero mutual information between the morphology of a galaxy and its large-scale environment is known to exist in the Sloan Digital Sky Survey (SDSS) up to a few tens of Mpc. It is important to test the statistical significance of this mutual information, if any. We propose three different methods to test the statistical significance of this non-zero mutual information and apply them to the SDSS and the Millennium Run simulation. We randomize the morphological information of SDSS galaxies without affecting their spatial distribution and compare the mutual information in the original and randomized data sets. We also divide the galaxy distribution into smaller subcubes and randomly shuffle them many times, keeping the morphological information of galaxies intact. We compare the mutual information in the original SDSS data and its shuffled realizations for different shuffling lengths. Using a t-test, we find that a small but statistically significant (at the 99.9 per cent confidence level) mutual information between morphology and environment exists across the entire length-scale probed. We also conduct another experiment using mock data sets from a semi-analytic galaxy catalogue, where we assign morphology to galaxies in a controlled manner based on the density at their locations. The experiment clearly demonstrates that mutual information can effectively capture the physical correlations between morphology and environment. Our analysis suggests that the physical association between morphology and environment may extend to much larger length-scales than currently believed, and the information theoretic framework presented here can serve as a sensitive and useful probe of the assembly bias and large-scale environmental dependence of galaxy properties.

