Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial Genomes

mSystems ◽  
2020 ◽  
Vol 5 (4) ◽  
Author(s):  
Robert A. Petit ◽  
Timothy D. Read

ABSTRACT Sequencing of bacterial genomes using Illumina technology has become such a standard procedure that often data are generated faster than can be conveniently analyzed. We created a new series of pipelines called Bactopia, built using Nextflow workflow software, to provide efficient comparative genomic analyses for bacterial species or genera. Bactopia consists of a data set setup step (Bactopia Data Sets [BaDs]), which creates a series of customizable data sets for the species of interest, the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions based on the available data sets and outputs the processed data to a structured directory format, and a series of Bactopia Tools (BaTs) that perform specific postprocessing on some or all of the processed data. BaTs include pan-genome analysis, computing average nucleotide identity between samples, extracting and profiling the 16S genes, and taxonomic classification using highly conserved genes. It is expected that the number of BaTs will increase to fill specific applications in the future. As a demonstration, we performed an analysis of 1,664 public Lactobacillus genomes, focusing on Lactobacillus crispatus, a species that is a common part of the human vaginal microbiome. Bactopia is an open source system that can scale from projects as small as one bacterial genome to ones including thousands of genomes and that allows for great flexibility in choosing comparison data sets and options for downstream analysis. Bactopia code can be accessed at https://www.github.com/bactopia/bactopia. IMPORTANCE It is now relatively easy to obtain a high-quality draft genome sequence of a bacterium, but bioinformatic analysis requires organization and optimization of multiple open source software tools. We present Bactopia, a pipeline for bacterial genome analysis, as an option for processing bacterial genome data. 
Bactopia also automates downloading of data from multiple public sources and species-specific customization. Because the pipeline is written in the Nextflow language, analyses can be scaled from individual genomes on a local computer to thousands of genomes using cloud resources. As a usage example, we processed 1,664 Lactobacillus genomes from public sources and used comparative analysis workflows (Bactopia Tools) to identify and analyze members of the L. crispatus species.


mSystems ◽  
2020 ◽  
Vol 5 (1) ◽  
Author(s):  
Matthew R. Olm ◽  
Alexander Crits-Christoph ◽  
Spencer Diamond ◽  
Adi Lavy ◽  
Paula B. Matheus Carnevali ◽  
...  

ABSTRACT Longstanding questions relate to the existence of naturally distinct bacterial species and genetic approaches to distinguish them. Bacterial genomes in public databases form distinct groups, but these databases are subject to isolation and deposition biases. To avoid these biases, we compared 5,203 bacterial genomes from 1,457 environmental metagenomic samples to test for distinct clouds of diversity and evaluated metrics that could be used to define the species boundary. Bacterial genomes from the human gut, soil, and the ocean all exhibited gaps in whole-genome average nucleotide identities (ANI) near the previously suggested species threshold of 95% ANI. While genome-wide ratios of nonsynonymous and synonymous nucleotide differences (dN/dS) decrease until ANI values approach ∼98%, two methods for estimating homologous recombination approached zero at ∼95% ANI, supporting breakdown of recombination due to sequence divergence as a species-forming force. We evaluated 107 genome-based metrics for their ability to distinguish species when full genomes are not recovered. Full-length 16S rRNA genes were least useful, in part because they were underrecovered from metagenomes. However, many ribosomal proteins displayed both high metagenomic recoverability and species discrimination power. Taken together, our results verify the existence of sequence-discrete microbial species in metagenome-derived genomes and highlight the usefulness of ribosomal genes for gene-level species discrimination. IMPORTANCE There is controversy about whether bacterial diversity is clustered into distinct species groups or exists as a continuum. To address this issue, we analyzed bacterial genome databases and reports from several previous large-scale environment studies and identified clear discrete groups of species-level bacterial diversity in all cases. 
Genetic analysis further revealed that quasi-sexual reproduction via horizontal gene transfer is likely a key evolutionary force that maintains bacterial species integrity. We next benchmarked over 100 metrics to distinguish these bacterial species from each other and identified several genes encoding ribosomal proteins with high species discrimination power. Overall, the results from this study provide best practices for bacterial species delineation based on genome content and insight into the nature of bacterial species population genetics.
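The 95% ANI gap reported above implies a simple operational species definition: single-linkage clustering of genomes at that threshold. A minimal sketch with made-up genome names and ANI values (the study's own pipeline is far more involved):

```python
# Group genomes into species-level clusters by single-linkage at 95% ANI,
# the boundary supported by the study. All names and values are illustrative.

def ani_clusters(genomes, ani, threshold=95.0):
    """Union genomes whose pairwise ANI meets the threshold.

    genomes: list of genome identifiers
    ani: dict mapping frozenset({a, b}) -> average nucleotide identity (%)
    """
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for a in genomes:
        for b in genomes:
            if a < b and ani.get(frozenset((a, b)), 0.0) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), set()).add(g)
    return list(clusters.values())

# Illustrative ANI values: g1/g2 are conspecific (>95%), g3 is not.
genomes = ["g1", "g2", "g3"]
ani = {frozenset(("g1", "g2")): 98.7,
       frozenset(("g1", "g3")): 82.1,
       frozenset(("g2", "g3")): 81.5}
clusters = ani_clusters(genomes, ani)
```

Single-linkage is a deliberate simplification here; with real metagenome-derived genomes, incomplete assemblies and chimeras complicate the picture considerably.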


mBio ◽  
2012 ◽  
Vol 3 (5) ◽  
Author(s):  
Peter Jorth ◽  
Marvin Whiteley

ABSTRACT Natural transformation by competent bacteria is a primary means of horizontal gene transfer; however, evidence that competence drives bacterial diversity and evolution has remained elusive. To test this theory, we used a retrospective comparative genomic approach to analyze the evolutionary history of Aggregatibacter actinomycetemcomitans, a bacterial species with both competent and noncompetent sister strains. Through comparative genomic analyses, we reveal that competence is evolutionarily linked to genomic diversity and speciation. Competence loss occurs frequently during evolution and is followed by the loss of clustered regularly interspaced short palindromic repeats (CRISPRs), bacterial adaptive immune systems that protect against parasitic DNA. Relative to noncompetent strains, competent bacteria have larger genomes containing multiple rearrangements. In contrast, noncompetent bacterial genomes are extremely stable but paradoxically susceptible to infective DNA elements, which contribute to noncompetent strain genetic diversity. Moreover, incomplete noncompetent strain CRISPR immune systems are enriched for self-targeting elements, which suggests that the CRISPRs have been co-opted for bacterial gene regulation, similar to eukaryotic microRNAs derived from the antiviral RNA interference pathway. IMPORTANCE The human microbiome is rich with thousands of diverse bacterial species. One mechanism driving this diversity is horizontal gene transfer by natural transformation, whereby naturally competent bacteria take up environmental DNA and incorporate new genes into their genomes. Competence is theorized to accelerate evolution; however, attempts to test this theory have proved difficult. 
Through genetic analyses of the human periodontal pathogen Aggregatibacter actinomycetemcomitans, we have discovered an evolutionary connection between competence systems promoting gene acquisition and CRISPRs (clustered regularly interspaced short palindromic repeats), adaptive immune systems that protect bacteria against genetic parasites. We show that competent A. actinomycetemcomitans strains have numerous redundant CRISPR immune systems, while noncompetent bacteria have lost their CRISPR immune systems because of inactivating mutations. Together, the data linking the evolution of competence and CRISPRs reveal unique mechanisms promoting genetic heterogeneity and the rise of new bacterial species, providing insight into complex mechanisms underlying bacterial diversity in the human body.
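The self-targeting spacers mentioned above can, in principle, be found by checking whether each spacer matches the chromosome anywhere outside its own CRISPR array. A toy sketch with invented sequences and coordinates; real analyses allow mismatches, both strands, and multiple arrays:

```python
# A spacer is "self-targeting" when it hits the host chromosome outside the
# CRISPR array it belongs to. Sequences and coordinates below are made up.

def self_targeting(spacers, genome, array_span):
    """Return spacers that match the genome outside the CRISPR array.

    spacers: list of spacer sequences
    genome: chromosome sequence (one strand only, for illustration)
    array_span: (start, end) coordinates of the CRISPR array itself
    """
    hits = []
    a_start, a_end = array_span
    for sp in spacers:
        pos = genome.find(sp)
        while pos != -1:
            # Ignore the copy stored inside the array; count hits elsewhere.
            if not (a_start <= pos < a_end):
                hits.append(sp)
                break
            pos = genome.find(sp, pos + 1)
    return hits

# Protospacer at position 0, array copy of the same spacer at position 15.
genome = "GATTACA" + "CCCCCCCC" + "GATTACA"
hits = self_targeting(["GATTACA", "ACGTACGT"], genome, (15, 22))
```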


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Tressy Thomas ◽  
Enayat Rajabi

Purpose The primary aim of this study is to review studies of novel approaches proposed for data imputation, particularly in the machine learning (ML) area, along several dimensions, including the type of method, the experimentation setup and the evaluation metrics used. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010-2020? (2) How were the experimentation setup, the characteristics of the data sets and the missingness employed in these studies? (3) What metrics were used to evaluate the imputation methods? Design/methodology/approach The review went through the standard identification, screening and selection process. The initial search of electronic databases for missing value imputation (MVI) based on ML algorithms returned 2,883 papers, most of which did not describe an MVI technique relevant to this study. Titles were first scanned for relevance, and 306 papers were identified as appropriate. After review of the abstracts, 151 papers not eligible for this study were dropped, leaving 155 research papers for full-text review. Of these, 117 papers were used to assess the review questions. Findings This study shows that clustering- and instance-based algorithms are the most frequently proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available repositories. A common approach is to treat the complete data set as the baseline and to evaluate the effectiveness of imputation on test data sets with artificially induced missingness.
The data set sizes and missingness ratios varied across the experiments, while the missing-data type and mechanism bear on the capability of the imputation. Computational expense is a concern, and experimentation with large data sets appears to be a challenge. Originality/value It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches handle missingness well depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of formulating and implementing the algorithms. Imputations based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
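The evaluation protocol described in the Findings (complete data set as baseline, artificially induced missingness, RMSE scoring) can be sketched with a simple kNN imputer. The function and parameters here are illustrative, not taken from any reviewed paper:

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest fully observed rows
    (Euclidean distance over the columns observed in the target row)."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]          # fully observed rows
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        d = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
        nearest = complete[np.argsort(d)[:k]]
        X[i, miss] = nearest[:, miss].mean(axis=0)  # impute missing cells
    return X

# Baseline evaluation: start from a complete data set, induce missingness,
# impute, then score against the held-back true values with RMSE.
rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 3))
data = truth.copy()
data[0, 1] = np.nan                                 # artificial missing value
imputed = knn_impute(data, k=3)
rmse = np.sqrt(np.mean((imputed[0, 1] - truth[0, 1]) ** 2))
```

The review's point about simplicity is visible here: the whole scheme is a distance computation and a mean, which is much of why kNN imputation travels well across domains.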


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural networks involve heavy computation, high storage costs and low accuracy for tiny defect detection, which is at odds with the high real-time performance and accuracy, and the limited computing and storage resources, required by industrial applications. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve these problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression of the YOLOv4 feature extraction network to simplify the model and improves the model's feature extraction ability through knowledge distillation. On the other hand, a prediction scale with a more detailed receptive field is added to optimize the model structure, which improves detection performance for tiny defects. Findings The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007, and on a steel ingot data set collected in an actual industrial setting. The experimental results demonstrate that the proposed YOLOv4-Defect method greatly improves recognition efficiency and accuracy while reducing the size and computation cost of the model. Originality/value This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection, which is conducive to application in industrial scenarios with limited storage and computing resources and meets requirements for high real-time performance and precision.
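The abstract does not detail the paper's distillation setup; as background, here is a framework-free sketch of the classic softened-softmax distillation loss, which detection models adapt in more involved ways. The temperature and weighting values are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * soft-target KL (scaled by T^2) + (1 - alpha) * hard-label CE."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * (T ** 2) * kl + (1 - alpha) * ce))

logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])
```

The T^2 factor keeps the gradient magnitudes of the soft-target term comparable across temperatures, which is what lets a compressed student absorb the teacher's inter-class similarity structure.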


2017 ◽  
Vol 24 (4) ◽  
pp. 1052-1064 ◽  
Author(s):  
Yong Joo Lee ◽  
Seong-Jong Joo ◽  
Hong Gyun Park

Purpose The purpose of this paper is to measure the comparative efficiency of 18 Korean commercial banks in the presence of negative observations and to examine performance differences among them by grouping them according to their market conditions. Design/methodology/approach The authors employ two data envelopment analysis (DEA) models that can handle negative data: a Banker, Charnes, and Cooper (BCC) model and a modified slacks-based measure of efficiency (MSBM) model. The BCC model is proven to be translation invariant for inputs or outputs, depending on output or input orientation, while the MSBM model is unit invariant in addition to translation invariant. The authors compare results from both models and choose one for interpreting the results. Findings Most Korean banks recovered from their worst performance in 2011 and showed similar performance in recent years. Among the three groups (national banks, regional banks, and special banks), the special banks demonstrated superior performance across models and years. In particular, the performance difference between the special banks and the regional banks was statistically significant. The authors conclude that the high performance of the special banks was due to their nationwide market access and ownership type. Practical implications This study demonstrates how to analyze and measure the efficiency of entities when variables contain negative observations, using a data set for Korean banks. The authors tried two major DEA models that are able to handle negative data and propose a practical direction for future studies. Originality/value Although there are research papers measuring the performance of banks in Korea, all of the papers on the topic have studied efficiency or productivity using positive data sets. However, variables such as net incomes and growth rates frequently include negative observations in bank data sets. 
This is the first paper to investigate the efficiency of bank operations in the presence of negative data in Korea.
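The input-oriented BCC envelopment model can be written directly as a linear program; because this form is translation invariant in outputs, a negative output such as net income can be shifted to be positive before solving, which is the property the study relies on. The data and the shift below are made up:

```python
import numpy as np
from scipy.optimize import linprog

def bcc_input_efficiency(X, Y, o):
    """Input-oriented BCC efficiency of DMU o.
    X: (n, m) inputs; Y: (n, s) outputs. LP variables: [theta, lambda_1..n]."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.zeros(1 + n)
    c[0] = 1.0                                        # minimize theta
    # inputs:  sum_j lam_j * x_ij - theta * x_io <= 0
    A_in = np.hstack([-X[o].reshape(m, 1), X.T])
    # outputs: -sum_j lam_j * y_rj <= -y_ro
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.concatenate([np.zeros(m), -Y[o]])
    A_eq = np.hstack([[0.0], np.ones(n)]).reshape(1, -1)   # convexity: sum lam = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (1 + n))
    return res.x[0]

# One input, one output; the output includes a negative observation
# (e.g. a net loss), so translate it before solving.
X = np.array([[2.0], [4.0], [4.0]])
Y = np.array([[1.0], [3.0], [-1.0]])
Y_shift = Y - Y.min() + 1.0                           # all outputs now positive
scores = [bcc_input_efficiency(X, Y_shift, o) for o in range(3)]
```

In this tiny example the first two units sit on the BCC frontier (score 1), while the third can reach its output level with half its input, so its score is 0.5; the shift does not change these scores.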


Author(s):  
Yara Elahi ◽  
Ramin Mazaheri Nezhad Fard ◽  
Arash Seifi ◽  
Saeideh Mahfouzi ◽  
Ali Akbar Saboor Yaraghi

Background: Bacteriophages are viruses that infect bacteria and are widely distributed in various environments. The prevalence of bacteriophages in water sources, especially wastewaters, is naturally high, and these viruses affect the evolution of most bacterial species. Bacteriophages are able to integrate their genomes into the chromosomes of their hosts as prophages and hence transfer resistance genes to bacterial genomes. Enterococci are commensal bacteria that show high resistance to common antibiotics; for example, the prevalence of vancomycin-resistant enterococci has increased within the last decades. Methods: Enterococcal strains were isolated from clinical samples, and morphological, phenotypical, biochemical, and molecular methods were used to identify and confirm their identity. Bacteriophages extracted from water sources were then applied to an isolated Enterococcus faecium (E. faecium) strain. In the next step, the bacterial genome was completely sequenced and the prophage present in the genome was analyzed. Results: In this study, E. faecium EntfacYE was isolated from a clinical sample. The EntfacYE genome was analyzed and 88 prophage genes were identified. The prophage content included four housekeeping genes, 29 genes related to replication and regulation, 25 genes related to structure and packaging, and four genes associated with lysis. Moreover, 26 genes of unknown function were identified. Conclusion: In conclusion, genome analysis of prophages can lead to a better understanding of their roles in the rapid evolution of bacteria.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hendri Murfi

Purpose The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection. Design/methodology/approach The eigenspace-based fuzzy c-means (EFCM) method combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroids of the clusters. The topics are obtained by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability using two approaches, single-pass and online processing. We call the resulting topic detection methods spEFCM and oEFCM. Findings Our simulations show that both oEFCM and spEFCM provide faster running times than EFCM for data sets that do not fit in memory, at the cost of a decrease in the average coherence score. For data sets that both fit and do not fit in memory, oEFCM offers a trade-off between running time and coherence score that is better than spEFCM's. Originality/value This research produces a scalable topic detection method. Besides this scalability, the developed method also provides a faster running time for data sets that fit in memory.
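The EFCM pipeline described above (truncated SVD to an eigenspace, fuzzy c-means there, centroids mapped back and clipped to the nonnegative term space) can be sketched as follows. The dimensions, the fuzzifier m and the fixed iteration count are illustrative choices, not the paper's:

```python
import numpy as np

def fuzzy_c_means(Z, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means: returns (centroids, membership matrix)."""
    rng = np.random.default_rng(seed)
    U = rng.random((Z.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships, rows sum to 1
    for _ in range(iters):
        W = U ** m
        centroids = (W.T @ Z) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(Z[:, None, :] - centroids[None], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))           # standard membership update
        U /= U.sum(axis=1, keepdims=True)
    return centroids, U

def efcm_topics(A, c, k):
    """A: documents x terms matrix; c topics; k-dimensional eigenspace."""
    U_svd, s, Vt = np.linalg.svd(A, full_matrices=False)
    Z = U_svd[:, :k] * s[:k]                     # truncated-SVD projection
    centroids, memberships = fuzzy_c_means(Z, c)
    topics = centroids @ Vt[:k]                  # back to term space
    return np.clip(topics, 0, None), memberships # keep nonnegative part

# Toy corpus: two obvious term clusters.
A = np.array([[1., 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]])
topics, M = efcm_topics(A, c=2, k=2)
```

The single-pass and online variants would feed document chunks through this core one batch at a time, which is where the memory savings described in the Findings come from.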


Kybernetes ◽  
2019 ◽  
Vol 48 (9) ◽  
pp. 2006-2029
Author(s):  
Hongshan Xiao ◽  
Yu Wang

Purpose Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decisions, credit scoring and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance. Design/methodology/approach A measurement is first developed for measuring and identifying any significant heterogeneity in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For a data set with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification. Findings The proposed approach has two main advantages over previous methods. The first lies in feature transformation using orthogonal factor analysis, which results in new features without redundancy or irrelevance. The second rests on partitioning samples to capture the feature space heterogeneity reflected in differences of factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmark data sets. Research limitations/implications The measurement should be used to guide the heterogeneity elimination process, which is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue. Practical implications Measuring and eliminating any feature space heterogeneity in the data are important for accurate classification. This study provides a systematic approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques to real-world problems. 
Originality/value A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with the feature space heterogeneity and improve the classification accuracy.
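The abstract says the heterogeneity measurement is derived from meta-analysis but gives no formula. Cochran's Q and I² are the standard meta-analytic heterogeneity statistics, so this sketch uses them as a stand-in: treat each candidate subgroup's effect estimate like a study in a meta-analysis and quantify between-subgroup heterogeneity. This is an assumption for illustration, not a reproduction of the paper's measure:

```python
import numpy as np

def cochran_q_i2(effects, variances):
    """Cochran's Q and I^2, the percentage of variation due to heterogeneity."""
    w = 1.0 / np.asarray(variances, dtype=float)  # inverse-variance weights
    e = np.asarray(effects, dtype=float)
    pooled = np.sum(w * e) / np.sum(w)            # fixed-effect pooled estimate
    q = np.sum(w * (e - pooled) ** 2)             # Cochran's Q
    df = len(e) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return float(q), float(i2)
```

Under this reading, a large I² across subgroups would flag significant feature space heterogeneity and trigger the factor-analysis-plus-clustering treatment described above.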


2019 ◽  
Vol 201 (22) ◽  
Author(s):  
Jiuxin Qu ◽  
Neha K. Prasad ◽  
Michelle A. Yu ◽  
Shuyan Chen ◽  
Amy Lyden ◽  
...  

ABSTRACT Conditionally essential (CE) genes are required by pathogenic bacteria to establish and maintain infections. CE genes encode virulence factors, such as secretion systems and effector proteins, as well as biosynthetic enzymes that produce metabolites not found in the host environment. Due to their outsized importance in pathogenesis, CE gene products are attractive targets for the next generation of antimicrobials. However, the precise manipulation of CE gene expression in the context of infection is technically challenging, limiting our ability to understand the roles of CE genes in pathogenesis and accordingly design effective inhibitors. We previously developed a suite of CRISPR interference-based gene knockdown tools, called Mobile-CRISPRi, that are transferred by conjugation and stably integrate into bacterial genomes. Here, we show the efficacy of Mobile-CRISPRi in controlling CE gene expression in an animal infection model. We optimize Mobile-CRISPRi in Pseudomonas aeruginosa for use in a murine model of pneumonia by tuning the expression of CRISPRi components to avoid nonspecific toxicity. As a proof of principle, we demonstrate that knockdown of a CE gene encoding the type III secretion system (T3SS) activator ExsA blocks effector protein secretion in culture and attenuates virulence in mice. We anticipate that Mobile-CRISPRi will be a valuable tool to probe the function of CE genes across many bacterial species and pathogenesis models. IMPORTANCE Antibiotic resistance is a growing threat to global health. To optimize the use of our existing antibiotics and identify new targets for future inhibitors, understanding the fundamental drivers of bacterial growth in the context of the host immune response is paramount. Historically, these genetic drivers have been difficult to manipulate precisely, as they are requisite for pathogen survival. 
Here, we provide the first application of Mobile-CRISPRi to study conditionally essential virulence genes in mouse models of lung infection through partial gene perturbation. We envision the use of Mobile-CRISPRi in future pathogenesis models and antibiotic target discovery efforts.

