A New Genome-to-Genome Comparison Approach for Large-Scale Revisiting of Current Microbial Taxonomy

Ming-Hsin Tsai; Yen-Yi Liu; Von-Wun Soo; Chih-Chieh Chen

doi:10.3390/microorganisms7060161

Microorganisms ◽

10.3390/microorganisms7060161 ◽

2019 ◽

Vol 7 (6) ◽

pp. 161 ◽

Cited By ~ 1

Author(s):

Ming-Hsin Tsai ◽

Yen-Yi Liu ◽

Von-Wun Soo ◽

Chih-Chieh Chen

Keyword(s):

Microbial Diversity ◽

Large Scale ◽

Gene Selection ◽

Marker Gene ◽

Genome Comparison ◽

Marker Genes ◽

Species Classification ◽

Genome Wide ◽

A Genome ◽

Comparison Approach

Microbial diversity has always presented taxonomic challenges. With the popularity of next-generation sequencing technology, more unculturable bacteria have been sequenced, facilitating the discovery of additional new species and complicating current microbial classification. The major challenge is to assign appropriate taxonomic names. Hence, assessing the consistency between taxonomy and genomic relatedness is critical. We proposed a genome comparison approach and applied it in a large-scale survey to investigate the distribution of genomic differences among microorganisms. The approach applies a genome-wide criterion, homologous coverage ratio (HCR), for describing the homology between species. The survey included 7861 microbial genomes, excluding plasmids, and 1220 pairs of genera exhibited ambiguous classification. In this study, we also compared the performance of HCR and average nucleotide identity (ANI). The results indicated that HCR and ANI analyses yield comparable results, but a few examples suggested that HCR has a superior clustering effect. In addition, we used the Genome Taxonomy Database (GTDB), the gold standard for taxonomy, to validate our analysis. The GTDB offers 120 ubiquitous single-copy proteins as marker genes for species classification. We determined that analysis based on the GTDB still yields blurred classification boundaries between some genera and that the marker gene-based approach has limitations. Although the choice of marker genes has been quite rigorous, bias in marker gene selection remains unavoidable. Therefore, methods based on genomic alignment should be considered for species classification in order to avoid this bias. On the basis of our observations of microbial diversity, microbial classification should be re-examined using genome-wide comparisons.
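The abstract above does not define how HCR is computed. Purely as an illustration of an alignment-coverage style measure, the following sketch assumes that homologous regions are available as half-open intervals on one genome and reports the fraction of that genome they cover. The function name `coverage_ratio` and the interval representation are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def coverage_ratio(intervals: Iterable[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of a genome covered by the union of half-open [start, end) intervals.

    Assumption: this is a generic coverage measure, not the paper's HCR formula.
    Intervals are assumed to lie within [0, genome_length).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")

    covered = 0
    current_start = None
    current_end = None
    for start, end in sorted(intervals):
        if start >= end:
            continue  # skip empty or inverted intervals
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Two overlapping blocks (0-150) plus one separate block (300-400) on a 1000 bp genome.
ratio = coverage_ratio([(0, 100), (50, 150), (300, 400)], 1000)
```

Merging overlapping intervals before summing keeps regions hit by several alignments from being counted more than once.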

The mutL Gene as a Genome-Wide Taxonomic Marker for High Resolution Discrimination of Lactiplantibacillus plantarum and Its Closely Related Taxa

Microorganisms ◽

10.3390/microorganisms9081570 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1570

Author(s):

Chien-Hsun Huang ◽

Chih-Chieh Chen ◽

Yu-Chun Lin ◽

Chia-Hsuan Chen ◽

Ai-Yun Lee ◽

...

Keyword(s):

16S rRNA ◽

16S rRNA Gene ◽

Target Genes ◽

Marker Genes ◽

rRNA Gene ◽

Accurate Identification ◽

Discrimination Power ◽

Sequence Identity ◽

Genome Wide ◽

A Genome

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that cannot be distinguished from one another by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out to explore highly discriminative target genes whose interspecific sequence identity is significantly lower than that of the 16S rRNA gene or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that the csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and its discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited clearly lower interspecific sequence identity (61.6–85.6%; average: 66.6%), and hence greater discrimination power, than the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), and could be used to separate the tested strains into species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only relatively low mutL sequence identities (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, signifying that it exhibits potential for reclassification as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.
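The sequence identity values quoted above are percentages computed over aligned sequences. A minimal sketch of one common way to compute such a value follows, assuming two sequences that have already been aligned to equal length with "-" as the gap character. The function name `percent_identity` is hypothetical, and ignoring gap columns is an assumption here; tools differ in how they choose the denominator, and the abstract does not say which convention was used.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # gap columns are excluded from the comparison
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignment: five ungapped columns, four of which match.
identity = percent_identity("ACGT-A", "ACCTTA")
```

Under this convention, a marker with lower interspecific identity separates species more clearly, which is the sense in which the abstract compares mutL with the 16S rRNA gene.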

Multi-view feature selection for identifying gene markers: a diversified biological data driven approach

BMC Bioinformatics ◽

10.1186/s12859-020-03810-0 ◽

2020 ◽

Vol 21 (S18) ◽

Author(s):

Sudipta Acharya ◽

Laizhong Cui ◽

Yi Pan

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Gene Selection ◽

Marker Gene ◽

Biological Data ◽

Protein Interaction Data ◽

Marker Genes ◽

Data Sets ◽

Gene Markers ◽

Multi Objective

Background: In recent years, the use of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature or gene selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm that exploits knowledge from multiple biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in specific epidemiology for a particular population. Results: In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. To that end, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, protein sequence) along with gene expression values are collectively utilized to design two different views. UMVMO-select aims to reduce the gene space with no or minimal compromise to sample classification efficiency and determines relevant and non-redundant gene markers from three cancer gene expression benchmark data sets. Conclusion: A thorough comparative analysis has been performed with five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The results obtained show that the proposed method outperforms the compared methods. The reported results are also validated through a biological significance test and heatmap plotting.

Phenotypic Screen and Transcriptomics Approach Complement Each Other in Functional Genomics of Defensive Stink Gland Physiology

10.21203/rs.3.rs-1117784/v1 ◽

2021 ◽

Author(s):

Sabrina Lehmann ◽

Bibi Atika ◽

Daniela Grossmann ◽

Christian Schmitt-Engel ◽

Nadi Strohlein ◽

...

Keyword(s):

Functional Genomics ◽

Large Scale ◽

Reverse Genetics ◽

Expression Profiles ◽

Forward Genetics ◽

Large Set ◽

Knock Down ◽

Genome Wide ◽

Phenotypic Screen ◽

A Genome

Background: Functional genomics uses unbiased systematic genome-wide gene disruption or analyzes natural variations such as gene expression profiles of different tissues from multicellular organisms to link gene functions to particular phenotypes. Functional genomics approaches are of particular importance to identify large sets of genes that are specifically important for a particular biological process beyond known candidate genes, or when the process has not been studied with genetic methods before. Results: Here, we present a large set of genes whose disruption interferes with the function of the odoriferous defensive stink glands of the red flour beetle Tribolium castaneum. This gene set is the result of a large-scale systematic phenotypic screen using a reverse genetics strategy based on RNA interference applied in a genome-wide forward genetics manner. In this first-pass screen, 130 genes were identified, of which 69 genes could be confirmed to cause knock-down gland phenotypes, which vary from necrotic tissue and irregular reservoir size to irregular color or separation of the secreted gland compounds. The knock-down of 13 genes specifically caused a strong reduction of para-benzoquinones, suggesting a specific function in the synthesis of these toxic compounds. Only 14 of the 69 confirmed gland genes are differentially overexpressed in stink gland tissue and thus could have been detected in a transcriptome-based analysis. Moreover, of the 29 previously transcriptomics-identified genes causing a gland phenotype, only one gene was recognized by this phenotypic screen despite the fact that 13 of them were covered by the screen. Conclusion: Our results indicate the importance of combining diverse and independent methodologies to identify genes necessary for the function of a certain biological tissue, as the different approaches do not deliver redundant results but rather complement each other. The presented phenotypic screen together with a transcriptomics approach now provide a set of close to a hundred genes important for odoriferous defensive stink gland physiology in beetles.

A Comprehensive Survey on the Terpene Synthase Gene Family Provides New Insight into Its Evolutionary Patterns

Genome Biology and Evolution ◽

10.1093/gbe/evz142 ◽

2019 ◽

Vol 11 (8) ◽

pp. 2078-2098 ◽

Cited By ~ 8

Author(s):

Shu-Ye Jiang ◽

Jingjing Jin ◽

Rajani Sarojam ◽

Srinivasan Ramachandran

Keyword(s):

Gene Family ◽

Large Scale ◽

Family Members ◽

Terpene Synthase ◽

Limited Information ◽

Terpene Synthases ◽

Genome Wide ◽

A Genome ◽

Family Expansion ◽

Insight Into

Terpenes are organic compounds that play important roles in plant growth and development as well as in mediating interactions of plants with the environment. Terpene synthases (TPSs) are the key enzymes responsible for the biosynthesis of terpenes. Although the TPS family has been identified and characterized genome-wide in some species, limited information is available regarding the evolution, expansion, and retention mechanisms occurring in this gene family. We performed a genome-wide identification of the TPS family members in 50 sequenced genomes. We also characterized the TPS family from aromatic spearmint and basil plants using RNA-Seq data. No TPSs were identified in algae genomes, but the remaining plant species encoded various numbers of family members, ranging from 2 to 79 full-length TPSs. Some species showed lineage-specific expansion of certain subfamilies, which might have contributed toward species or ecotype divergence or environmental adaptation. A large-scale family expansion was observed mainly in dicot and monocot plants, which was accompanied by frequent domain loss. Both tandem and segmental duplication significantly contributed toward family expansion and expression divergence and played important roles in the survival of these expanded genes. Our data provide new insight into the TPS family expansion and evolution and suggest that TPSs might have originated from isoprenyl diphosphate synthase genes.

Large-scale GWAS reveals insights into the genetic architecture of same-sex sexual behavior

Science ◽

10.1126/science.aat7693 ◽

2019 ◽

Vol 365 (6456) ◽

pp. eaat7693 ◽

Cited By ~ 53

Author(s):

Andrea Ganna ◽

Karin J. H. Verweij ◽

Michel G. Nivard ◽

Robert Maier ◽

Robbee Wedow ◽

...

Keyword(s):

Sexual Behavior ◽

Genetic Architecture ◽

Large Scale ◽

Genome Wide Association Study ◽

Same Sex ◽

Genome Wide ◽

A Genome ◽

Number Of Sexual Partners ◽

Opposite Sex ◽

Males And Females

Twin and family studies have shown that same-sex sexual behavior is partly genetically influenced, but previous searches for specific genes involved have been underpowered. We performed a genome-wide association study (GWAS) on 477,522 individuals, revealing five loci significantly associated with same-sex sexual behavior. In aggregate, all tested genetic variants accounted for 8 to 25% of variation in same-sex sexual behavior, only partially overlapped between males and females, and do not allow meaningful prediction of an individual’s sexual behavior. Comparing these GWAS results with those for the proportion of same-sex to total number of sexual partners among nonheterosexuals suggests that there is no single continuum from opposite-sex to same-sex sexual behavior. Overall, our findings provide insights into the genetics underlying same-sex sexual behavior and underscore the complexity of sexuality.

Genome-Wide Identification and Analysis of VQ Motif-containing Gene Family in Brassica napus and Functional Characterization of BnMKS1 in Response to Leptosphaeria maculans

Phytopathology ◽

10.1094/phyto-04-20-0134-r ◽

2020 ◽

Cited By ~ 1

Author(s):

Zhongwei Zou ◽

Fei Liu ◽

Shuanglong Huang ◽

Dilantha Gerard Fernando

Keyword(s):

Brassica Napus ◽

Gene Family ◽

Functional Characterization ◽

Leptosphaeria Maculans ◽

Defense Responses ◽

Marker Genes ◽

Blackleg Disease ◽

Genome Wide ◽

A Genome

Proteins containing valine-glutamine (VQ) motifs play important roles in plant growth and development, as well as in defense responses to both abiotic and biotic stresses. Blackleg disease, which is caused by Leptosphaeria maculans, is the most important disease in canola (Brassica napus L.) worldwide; however, the identification of B. napus VQs and their functions in response to blackleg disease have not yet been reported. In this study, we conducted a genome-wide identification and characterization of the VQ gene family in B. napus, including chromosome location, phylogenetic relations, gene structure, motif domain, synteny analysis, and cis-elements categorization of their promoter regions. To understand B. napus VQ gene function in response to blackleg disease, we overexpressed BnVQ7 (BnaA01g36880D, also known as the mitogen-activated protein kinase4 substrate1 (MKS1) gene) in a blackleg-susceptible canola variety Westar. The overexpression of BnMKS1 in canola did not improve its resistance to blackleg disease at the seedling stage; however, transgenic canola plants overexpressing BnMKS1 displayed an enhanced resistance to L. maculans infection at the adult plant stage. Expression levels of downstream and defense marker genes in cotyledons increased significantly at the necrotrophic stage of L. maculans infection in the overexpression line of BnMKS1, suggesting that the salicylic acid (SA)- and jasmonic acid (JA)-mediated signaling pathways were both involved in the defense responses. Together, these results suggest that BnMKS1 might play an important role in the defense against L. maculans.

Exploiting regulatory heterogeneity to systematically identify enhancers with high accuracy

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1808833115 ◽

2018 ◽

Vol 116 (3) ◽

pp. 900-908 ◽

Cited By ~ 4

Author(s):

Hamutal Arbel ◽

Sumanta Basu ◽

William W. Fisher ◽

Ann S. Hammonds ◽

Kenneth H. Wan ◽

...

Keyword(s):

Large Scale ◽

Scale Validation ◽

Expression Patterns ◽

Prediction Method ◽

High Accuracy ◽

Genome Wide ◽

Rank List ◽

A Genome ◽

Improved Accuracy ◽

Genome Wide Scan

Identifying functional enhancer elements in metazoan systems is a major challenge. Large-scale validation of enhancers predicted by ENCODE reveals false-positive rates of at least 70%. We used the pregastrula-patterning network of Drosophila melanogaster to demonstrate that loss in accuracy in held-out data results from heterogeneity of functional signatures in enhancer elements. We show that at least two classes of enhancers are active during early Drosophila embryogenesis and that by focusing on a single, relatively homogeneous class of elements, greater than 98% prediction accuracy can be achieved in a balanced, completely held-out test set. The class of well-predicted elements is composed predominantly of enhancers driving multistage segmentation patterns, which we designate segmentation driving enhancers (SDE). Prediction is driven by the DNA occupancy of early developmental transcription factors, with almost no additional power derived from histone modifications. We further show that improved accuracy is not a property of a particular prediction method: after conditioning on the SDE set, naïve Bayes and logistic regression perform as well as more sophisticated tools. Applying this method to a genome-wide scan, we predict 1,640 SDEs that cover 1.6% of the genome. An analysis of 32 SDEs, chosen throughout our prediction rank-list, using whole-mount embryonic imaging of stably integrated reporter constructs showed that >90% drove expression patterns. We achieved 86.7% precision on a genome-wide scan, with an estimated recall of at least 98%, indicating high accuracy and completeness in annotating this class of functional elements.
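The precision and recall figures quoted above follow the standard definitions: precision is the share of predicted elements that are true positives, and recall is the share of true elements that are recovered. The sketch below shows that arithmetic only. The function name `precision_recall` and the counts in the usage line are hypothetical and are not the study's data.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive, and false-negative counts."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when nothing was predicted positive")
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    precision = tp / (tp + fp)  # fraction of predictions that are correct
    recall = tp / (tp + fn)     # fraction of true elements that were found
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(tp=9, fp=1, fn=3)  # → (0.9, 0.75)
```

A false-positive rate among predictions of 70%, as reported for the earlier large-scale validation, corresponds to a precision of 30% under these definitions.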

Genomic variation across a clinical Cryptococcus population linked to disease outcome

10.1101/2021.11.22.469645 ◽

2021 ◽

Author(s):

Poppy Channa Sakti Sephton-Clark ◽

Jennifer Tenor ◽

Dena Toffaletti ◽

Nancy Meyers ◽

Charles Giamberardino ◽

...

Keyword(s):

Patient Outcomes ◽

Large Scale ◽

Genome Wide Association Study ◽

Sugar Transport ◽

Genomic Variation ◽

Clinical Isolates ◽

Genome Wide ◽

A Genome ◽

GWAS Approach

Cryptococcus neoformans is the causative agent of cryptococcosis, a disease with poor patient outcomes, accounting for approximately 180,000 deaths each year. Patient outcomes may be impacted by the underlying genetics of the infecting isolate; however, our current understanding of how genetic diversity contributes to clinical outcomes is limited. Here, we leverage clinical, in vitro growth, and genomic data for 284 C. neoformans isolates to identify clinically relevant pathogen variants within a population of clinical isolates from patients with HIV-associated cryptococcosis in Malawi. Through a genome-wide association study (GWAS) approach, we identify variants associated with fungal burden and growth rate. We also find both small and large-scale variation, including aneuploidy, associated with alternate growth phenotypes, which may impact the course of infection. Genes impacted by these variants are involved in transcriptional regulation, signal transduction, glycolysis, sugar transport, and glycosylation. When combined with clinical data, we show that growth within the CNS is reliant upon glycolysis in an animal model, and likely impacts patient mortality, as CNS burden modulates patient outcome. Additionally, we find genes with roles in sugar transport are under selection in the majority of these clinical isolates. Further, we demonstrate that two hypothetical proteins identified by GWAS impact virulence in animal models. Our approach illustrates links between genetic variation and clinically relevant phenotypes, shedding light on survival mechanisms within the CNS and pathways involved in this persistence.

Cophylogenetic Analyses of Trachymyrmex Ant-Fungal Specificity: ‘One to One with Some Exceptions’

10.22541/au.162211773.35216786/v1 ◽

2021 ◽

Author(s):

Katherine Beigel ◽

Alix Matthews ◽

Katrin Kellner ◽

Christine Pawlik ◽

Matthew Greenwold ◽

...

Keyword(s):

Large Scale ◽

Phylogenetic Analyses ◽

Marker Genes ◽

SNP Analysis ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Genome Wide ◽

Native Populations ◽

Symbiotic Fungi ◽

Fungal Specificity

Over the past few decades, large-scale phylogenetic analyses of fungus-gardening ants and their symbiotic fungi have depicted strong concordance among major clades of ants and their symbiotic fungi, yet within clades, fungus sharing is somewhat widespread among unrelated ant lineages. These symbioses are thought to be explained by a diffuse coevolution model within major clades. Understanding horizontal exchange within clades has been limited by conventional genetic markers that lack both interspecific and geographic variation. To examine whether reports of horizontal exchange were indeed symbiont sharing or an artifact of employing relatively uninformative molecular markers, samples of Trachymyrmex arizonensis and Trachymyrmex pomonae and their fungi were collected from native populations in Arizona and genotyped using conventional marker genes and genome-wide single nucleotide polymorphisms (SNPs). Conventional markers of the fungal symbionts generally exhibited cophylogenetic patterns that were consistent with some symbiont sharing, but most fungal clades had low support. SNP analysis, in contrast, indicated that each ant species exhibited fidelity to its own fungal subclade, with only one instance of a colony growing a fungus that was otherwise associated with a different ant species. This evidence supports a pattern of codivergence between Trachymyrmex species and their fungi, and thus a diffuse coevolutionary model may not accurately predict symbiont exchange. These results suggest that fungal sharing across host species in these symbioses may be less extensive than previously thought.

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

10.7287/peerj.preprints.554v2 ◽

2015 ◽

Cited By ~ 5

Author(s):

Donovan H Parks ◽

Michael Imelfort ◽

Connor T Skennerton ◽

Philip Hugenholtz ◽

Gene W Tyson

Keyword(s):

Large Scale ◽

Single Cells ◽

Metagenomic Data ◽

Marker Genes ◽

Specific Gene ◽

Diverse Range ◽

Automated Method ◽

A Genome ◽

Wide Range

Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. While this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of ‘marker’ genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate, single cell and metagenome derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination, and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
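The idea of estimating completeness and contamination from marker genes can be illustrated with a deliberately simplified sketch. It assumes a flat set of markers that are each expected exactly once per genome: completeness is the share of markers found at least once, and contamination is the number of surplus copies relative to the marker count. This is not CheckM's implementation or API; the abstract describes lineage-specific marker sets and collocation information, which are omitted here. The names `estimate_quality`, `observed_markers`, and `marker_set` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def estimate_quality(observed_markers: Iterable[str], marker_set: Set[str]) -> Tuple[float, float]:
    """Return (completeness %, contamination %) under a single-copy marker assumption."""
    if not marker_set:
        raise ValueError("marker_set must not be empty")

    # Count only hits that belong to the expected marker set.
    counts = Counter(m for m in observed_markers if m in marker_set)

    present = len(counts)                               # markers seen at least once
    extra_copies = sum(c - 1 for c in counts.values())  # copies beyond the expected one

    completeness = 100.0 * present / len(marker_set)
    contamination = 100.0 * extra_copies / len(marker_set)
    return completeness, contamination


# Toy example: 3 of 4 markers found, one of them twice; "x9" is not a marker and is ignored.
markers = {"m1", "m2", "m3", "m4"}
completeness, contamination = estimate_quality(["m1", "m2", "m2", "m3", "x9"], markers)
```

Because each marker is weighted equally and independently here, the estimate is cruder than one that groups collocated markers, which is the refinement the abstract attributes to CheckM.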