Identification of gene fusion events in Mycobacterium tuberculosis that encode chimeric proteins

Mycobacterium tuberculosis is a facultative intracellular pathogen that causes tuberculosis. The harsh environment in which M. tuberculosis survives requires this pathogen to adapt continuously in order to maintain an evolutionary advantage. However, the apparent absence of horizontal gene transfer in M. tuberculosis restricts the ways in which evolution can occur. Large-scale changes in the genome can be introduced through genome reduction, recombination events and structural variation. Here, we identify a functional chimeric protein in the ppe38–71 locus, the absence of which is known to affect protein secretion and virulence. To examine whether this pathogen uses gene fusion more widely, we developed software that detects potential gene fusion events arising from multigene deletions in whole genome sequencing data. With this software we identified a number of other putative gene fusion events within the genomes of M. tuberculosis isolates, and we demonstrated the expression of one of these gene fusions at the protein level using mass spectrometry. Gene fusions may therefore provide an additional means of evolution for M. tuberculosis in its natural environment, whereby novel chimeric proteins and functions can arise.
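
The idea of flagging a multigene deletion as a potential gene fusion can be illustrated with a minimal sketch. This is not the software described in the abstract: the Gene record, the fusion_candidate helper and the example coordinates are hypothetical. The sketch assumes that a deletion is a candidate when it starts inside one gene and ends inside a different gene on the same strand, and that the chimera stays in frame when the retained bases of both genes sum to a multiple of three (which holds only if each gene length is itself a multiple of three).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Gene:
    # Hypothetical annotation record; coordinates are 1-based and inclusive.
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def gene_at(genes: List[Gene], pos: int) -> Optional[Gene]:
    """Return the first gene covering a genome position, or None."""
    for gene in genes:
        if gene.start <= pos <= gene.end:
            return gene
    return None


def fusion_candidate(
    genes: List[Gene], del_start: int, del_end: int
) -> Optional[Tuple[str, str, bool]]:
    """Classify a deletion that removes bases del_start..del_end (inclusive).

    Returns (left_gene, right_gene, in_frame) when both breakpoints fall
    inside different genes on the same strand, otherwise None.
    """
    left = gene_at(genes, del_start - 1)   # last retained base before the deletion
    right = gene_at(genes, del_end + 1)    # first retained base after the deletion
    if left is None or right is None or left == right:
        return None
    if left.strand != right.strand:
        return None
    kept_left = del_start - left.start     # retained bases of the left gene
    kept_right = right.end - del_end       # retained bases of the right gene
    # Assumption: both original gene lengths are multiples of three.
    in_frame = (kept_left + kept_right) % 3 == 0
    return (left.name, right.name, in_frame)


# Hypothetical annotation with two neighbouring genes on the forward strand.
example_genes = [Gene("geneA", 1, 300, "+"), Gene("geneB", 401, 700, "+")]
```

With the example annotation, fusion_candidate(example_genes, 151, 550) reports an in-frame geneA/geneB candidate, while a deletion confined to the intergenic region returns None. A real detector would also need to handle breakpoint uncertainty and repetitive loci, which this sketch ignores.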

A large scale evaluation of TBProfiler and Mykrobe for antibiotic resistance prediction in Mycobacterium tuberculosis

PeerJ ◽ 10.7717/peerj.6857 ◽ 2019 ◽ Vol 7 ◽ pp. e6857 ◽ Cited By ~ 3

Author(s): Pierre Mahé ◽ Meriem El Azami ◽ Philippine Barlas ◽ Maud Tournoud

Keyword(s): Antibiotic Resistance ◽ Mycobacterium Tuberculosis ◽ Large Scale ◽ Predictive Power ◽ Predictive Performance ◽ Systematic Evaluation ◽ Whole Genome Sequencing Data ◽ Attractive Alternative ◽ Sequencing Data ◽ Trade Offs

Recent years have seen growing interest in predicting antibiotic resistance from whole-genome sequencing data, with promising results obtained for Staphylococcus aureus and Mycobacterium tuberculosis. In this work, we gathered 6,574 sequencing read datasets of M. tuberculosis public genomes with associated antibiotic resistance profiles for both first- and second-line antibiotics. We performed a systematic evaluation of TBProfiler and Mykrobe, two widely recognized software tools that predict resistance in M. tuberculosis. The size of the dataset allowed us to obtain confident estimates of their overall predictive performance, to assess precisely the individual predictive power of the markers they rely on, and to study how these tools behave across the major M. tuberculosis lineages. While this study confirmed the overall good performance of these tools, it revealed that a substantial fraction of the catalog of mutations they embed is of limited predictive power. It also revealed that these tools offer different sensitivity/specificity trade-offs, mainly because of the different sets of mutations they embed but also because of their underlying genotyping pipelines. More importantly, it showed that their level of predictive performance varies greatly across lineages for some antibiotics, suggesting that the confidence placed in their predictions should depend on the lineage inferred and on the predictive performance of the marker(s) actually detected. Finally, we evaluated the relevance of machine learning approaches operating on the set of markers detected by these tools and show that they present an attractive alternative strategy, achieving better performance for several drugs while significantly reducing the number of candidate mutations to consider.
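
The per-lineage comparison of sensitivity and specificity mentioned above can be sketched as follows. This is an illustration only, not the evaluation code of the study: the sens_spec_by_group function, the record layout (lineage, predicted resistant, phenotypically resistant) and the example values are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, bool, bool]  # (lineage, predicted_resistant, phenotype_resistant)


def sens_spec_by_group(
    records: Iterable[Record],
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Compute (sensitivity, specificity) per lineage for one drug.

    A value is None when the lineage has no resistant (for sensitivity)
    or no susceptible (for specificity) isolates.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for lineage, predicted, truth in records:
        # "t"/"f": prediction agrees with the phenotype or not;
        # "p"/"n": the prediction itself is resistant or susceptible.
        key = ("t" if predicted == truth else "f") + ("p" if predicted else "n")
        counts[lineage][key] += 1

    result = {}
    for lineage, c in counts.items():
        resistant = c["tp"] + c["fn"]
        susceptible = c["tn"] + c["fp"]
        sensitivity = c["tp"] / resistant if resistant else None
        specificity = c["tn"] / susceptible if susceptible else None
        result[lineage] = (sensitivity, specificity)
    return result


# Made-up example records for two lineages.
example_records = [
    ("L2", True, True), ("L2", True, True), ("L2", False, True),
    ("L2", False, False), ("L2", True, False),
    ("L4", True, True), ("L4", False, False), ("L4", False, False),
]
example_result = sens_spec_by_group(example_records)
```

Stratifying the confusion matrix by lineage in this way is what makes lineage-dependent differences in predictive performance visible; pooled metrics would hide them.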

Plasmids or no plasmids? A comparison between the Agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽ 2016

Author(s): Sarah Alexander

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Large Scale ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Sequencing Project

Whole genome sequencing data of 1110 Mycobacterium tuberculosis isolates identifies insertions and deletions associated with drug resistance

BMC Genomics ◽ 10.1186/s12864-018-4734-6 ◽ 2018 ◽ Vol 19 (1) ◽ Cited By ~ 3

Author(s): Xi Zeng ◽ Jamie Sui-Lam Kwok ◽ Kevin Yi Yang ◽ Kenneth Siu-Sing Leung ◽ Mai Shi ◽ ...

Keyword(s): Drug Resistance ◽ Mycobacterium Tuberculosis ◽ Whole Genome Sequencing ◽ Genome Sequencing ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Insertions And Deletions

MTBseq: a comprehensive pipeline for whole genome sequence analysis of Mycobacterium tuberculosis complex isolates

PeerJ ◽ 10.7717/peerj.5895 ◽ 2018 ◽ Vol 6 ◽ pp. e5895 ◽ Cited By ~ 35

Author(s): Thomas Andreas Kohl ◽ Christian Utpatel ◽ Viola Schleusener ◽ Maria Rosaria De Filippo ◽ Patrick Beckert ◽ ...

Keyword(s): Antibiotic Resistance ◽ Mycobacterium Tuberculosis ◽ Genome Sequence ◽ Sequence Data ◽ Whole Genome Sequence ◽ Whole Genome Sequencing Data ◽ Phylogenomic Analysis ◽ Whole Genome ◽ Sequencing Data ◽ Desktop Computer

Analyzing whole-genome sequencing data of Mycobacterium tuberculosis complex (MTBC) isolates in a standardized workflow enables both comprehensive antibiotic resistance profiling and outbreak surveillance at the highest resolution, up to the identification of recent transmission chains. Here, we present MTBseq, a bioinformatics pipeline for next-generation genome sequence data analysis of MTBC isolates. Employing a reference-mapping-based workflow, MTBseq reports detected variant positions annotated with known associations to antibiotic resistance and performs a lineage classification based on phylogenetic single nucleotide polymorphisms (SNPs). When comparing multiple datasets, MTBseq provides a joint list of variants and a FASTA alignment of SNP positions for use in phylogenomic analysis, and identifies groups of related isolates. The pipeline is customizable, expandable and can be used on a desktop computer or laptop without any internet connection, ensuring mobile usage and data security. MTBseq and accompanying documentation are available from https://github.com/ngs-fzb/MTBseq_source.
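
One generic way to turn an alignment of SNP positions into groups of related isolates is single-linkage grouping on pairwise SNP distance, sketched below. This is not taken from the MTBseq source: the function names, the max_distance threshold and the toy sequences are assumptions, and the sketch counts every differing character as a SNP without special handling of gaps or ambiguity codes.

```python
from itertools import combinations
from typing import Dict, List


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(1 for x, y in zip(a, b) if x != y)


def group_isolates(seqs: Dict[str, str], max_distance: int) -> List[List[str]]:
    """Single-linkage grouping: isolates end up in the same group when a chain
    of pairwise distances <= max_distance connects them."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Union-find lookup with path compression.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy alignment of SNP positions for four hypothetical isolates.
example_alignment = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGA",
    "iso3": "ACGTACAA",
    "iso4": "TTTTTTTT",
}
example_groups = group_isolates(example_alignment, max_distance=1)
```

In the toy alignment, iso1 and iso3 differ at two positions but are still grouped together because each is within one SNP of iso2; this chaining behaviour is characteristic of single-linkage grouping and should be kept in mind when choosing a threshold.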

gplas: a comprehensive tool for plasmid analysis using short-read graphs

Bioinformatics ◽ 10.1093/bioinformatics/btaa233 ◽ 2020 ◽ Vol 36 (12) ◽ pp. 3874-3876 ◽ Cited By ~ 1

Author(s): Sergio Arredondo-Alonso ◽ Martin Bootsma ◽ Yaïr Hein ◽ Malbert R C Rogers ◽ Jukka Corander ◽ ...

Keyword(s): Large Scale ◽ Sequence Data ◽ Bacterial Genome ◽ Workflow Management ◽ Supplementary Information ◽ Whole Genome Sequencing Data ◽ Network Partitioning ◽ Sequencing Data ◽ Genetic Traits ◽ Short Read

Summary: Plasmids can horizontally transmit genetic traits, enabling rapid bacterial adaptation to new environments and hosts. Short-read whole-genome sequencing data are often applied to large-scale bacterial comparative genomics projects but the reconstruction of plasmids from these data is facing severe limitations, such as the inability to distinguish plasmids from each other in a bacterial genome. We developed gplas, a new approach to reliably separate plasmid contigs into discrete components using sequence composition, coverage, assembly graph information and network partitioning based on a pruned network of plasmid unitigs. Gplas facilitates the analysis of large numbers of bacterial isolates and allows a detailed analysis of plasmid epidemiology based solely on short-read sequence data.

Availability and implementation: Gplas is written in R, Bash and uses a Snakemake pipeline as a workflow management system. Gplas is available under the GNU General Public License v3.0 at https://gitlab.com/sirarredondo/gplas.git.

Supplementary information: Supplementary data are available at Bioinformatics online.

SplitStrains, a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGS data

10.1101/2021.02.07.21250981 ◽ 2021

Author(s): Einar Gabbasov ◽ Miguel Moreno-Molina ◽ Iñaki Comas ◽ Maxwell Libbrecht ◽ Leonid Chindelevitch

Keyword(s): Public Health ◽ Mycobacterium Tuberculosis ◽ Whole Genome Sequencing ◽ Genome Sequencing ◽ Bacterial Pathogens ◽ Superior Performance ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Multiple Strains

Abstract: The occurrence of multiple strains of a bacterial pathogen such as M. tuberculosis or C. difficile within a single human host, referred to as a mixed infection, has important implications for both healthcare and public health. However, methods for detecting it, and especially determining the proportion and identities of the underlying strains, from WGS (whole-genome sequencing) data, have been limited.

In this paper we introduce SplitStrains, a novel method for addressing these challenges. Grounded in a rigorous statistical model, SplitStrains not only demonstrates superior performance in proportion estimation to other existing methods on both simulated as well as real M. tuberculosis data, but also successfully determines the identity of the underlying strains.

We conclude that SplitStrains is a powerful addition to the existing toolkit of analytical methods for data coming from bacterial pathogens, and holds the promise of enabling previously inaccessible conclusions to be drawn in the realm of public health microbiology.

Author summary: When multiple strains of a pathogenic organism are present in a patient, it may be necessary to not only detect this, but also to identify the individual strains. However, this problem has not yet been solved for bacterial pathogens processed via whole-genome sequencing. In this paper, we propose the SplitStrains algorithm for detecting multiple strains in a sample, identifying their proportions, and inferring their sequences, in the case of Mycobacterium tuberculosis. We test it on both simulated and real data, with encouraging results. We believe that our work opens new horizons in public health microbiology by allowing a more precise detection, identification and quantification of multiple infecting strains within a sample.
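
To make the proportion-estimation problem concrete, the sketch below shows a deliberately naive baseline: it takes the median minor-allele fraction across sites that look heterozygous in the read data. This is not the SplitStrains statistical model, which the abstract does not detail; the function name, the input layout (reference and alternate read counts per site) and both thresholds are assumptions chosen for illustration.

```python
from statistics import median
from typing import Iterable, Optional, Tuple


def naive_minor_proportion(
    site_counts: Iterable[Tuple[int, int]],
    min_depth: int = 20,
    min_minor_frac: float = 0.05,
) -> Optional[float]:
    """Naive estimate of the minor strain proportion in a two-strain mixture.

    site_counts holds (reference_reads, alternate_reads) per variant site.
    Sites with low depth, or with a minor-allele fraction below
    min_minor_frac (treated here as sequencing noise), are ignored.
    Returns None when no site passes the filters.
    """
    fractions = []
    for ref_reads, alt_reads in site_counts:
        depth = ref_reads + alt_reads
        if depth < min_depth:
            continue
        minor_frac = min(ref_reads, alt_reads) / depth
        if minor_frac >= min_minor_frac:
            fractions.append(minor_frac)
    return median(fractions) if fractions else None


# Made-up read counts: three informative sites, one fixed site, one shallow site.
example_sites = [(80, 20), (70, 30), (25, 75), (100, 0), (5, 5)]
example_estimate = naive_minor_proportion(example_sites)
```

A baseline like this cannot tell which allele belongs to which strain and breaks down near 50:50 mixtures, which is part of why a model-based method is needed for the identification task described above.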

Population-level genome-wide STR typing in Plasmodium species reveals higher resolution population structure and genetic diversity relative to SNP typing

10.1101/2021.05.19.444768 ◽ 2021

Author(s): Jiru Han ◽ Jacob E Munro ◽ Anthony Kocoski ◽ Alyssa E Barry ◽ Melanie Bahlo

Keyword(s): Genetic Diversity ◽ Large Scale ◽ Tandem Repeats ◽ Plasmodium Species ◽ Whole Genome Sequence ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Genome Wide ◽ Field Samples

Short tandem repeats (STRs) are highly informative genetic markers that have been used extensively in population genetics analysis. They are an important source of genetic diversity and can also have functional impact. Despite the availability of bioinformatic methods that permit large-scale genome-wide genotyping of STRs from whole genome sequencing data, they have not previously been applied to sequencing data from large collections of malaria parasite field samples. Here, we have genotyped STRs using HipSTR in published whole-genome sequence data from more than 3,000 Plasmodium falciparum and 174 Plasmodium vivax samples collected across the globe. High levels of noise and variability in the resultant callset necessitated the development of a novel method for quality control of STR genotype calls. A set of high-quality STR loci (6,768 from P. falciparum and 3,496 from P. vivax) was used to study Plasmodium genetic diversity, population structures and genomic signatures of selection, and these were compared to genome-wide single nucleotide polymorphism (SNP) genotyping data. In addition, the genome-wide information about genetic variation and other characteristics of STRs in P. falciparum and P. vivax has been made available in an interactive web-based R Shiny application, PlasmoSTR (https://github.com/bahlolab/PlasmoSTR).

Genetic composition and evolution of the prevalent Mycobacterium tuberculosis lineages 2 and 4 in the Chinese and Zhejiang Province populations

Cell & Bioscience ◽ 10.1186/s13578-021-00673-7 ◽ 2021 ◽ Vol 11 (1)

Author(s): Beibei Wu ◽ Wenlong Zhu ◽ Yue Wang ◽ Qi Wang ◽ Lin Zhou ◽ ...

Keyword(s): Mycobacterium Tuberculosis ◽ T Cell ◽ Zhejiang Province ◽ Whole Genome Sequencing Data ◽ Evolutionary Analysis ◽ T Cell Epitopes ◽ Sequencing Data ◽ Human T Cell ◽ Lineage 2 ◽ Highest Posterior Density

Background: There are seven human-adapted lineages of Mycobacterium tuberculosis (Mtb). Tuberculosis (TB) dissemination is strongly influenced by human movements and host genetics. The detailed evolution of the Mtb lineage distribution in Zhejiang Province is unknown. We aim to determine how different sub-lineages are transmitted and distributed within China and Zhejiang Province.

Methods: We analysed whole-genome sequencing data for a worldwide collection of 1154 isolates and a provincial collection of 1296 isolates, and constructed the best-scoring maximum likelihood phylogenetic tree. Bayesian evolutionary analysis was used to date the most recent common ancestor of lineages 2 and 4. The antigenic diversity of human T cell epitopes was evaluated by calculating the pairwise dN/dS ratios.

Results: Of the Zhejiang isolates, 964 (74.38%) belonged to lineage 2 and 332 (25.62%) belonged to lineage 4. The distributions of the sub-lineages varied across the geographic regions of Zhejiang Province. L2.2 is the most ancient sub-lineage in Zhejiang, first appearing approximately 6897 years ago (95% highest posterior density interval (HDI): 6513–7298). L4.4 is the most modern sub-lineage, first appearing approximately 2217 years ago (95% HDI: 1864–2581). The dN/dS ratios showed that the epitope and non-epitope regions of lineage 2 strains were significantly (P < 0.001) more conserved than those of lineage 4.

Conclusions: An increase in the frequency of lineage 4 may reflect its successful transmission over the last 20 years. The recent common ancestors of the sub-lineages and their transmission routes are relevant to the entry of humans into China and Zhejiang Province. Diversity in T cell epitopes may prevent Mycobacterium tuberculosis from being recognized by the immune system.

Whole genome sequencing data and analysis of a rifampicin-resistant Mycobacterium tuberculosis strain SBH162 from Sabah, Malaysia

Data in Brief ◽ 10.1016/j.dib.2019.104445 ◽ 2019 ◽ Vol 26 ◽ pp. 104445 ◽ Cited By ~ 1

Author(s): Jaeyres Jani ◽ Zainal Arifin Mustapha ◽ Norfazirah Binti Jamal ◽ Cheronie Shely Stanis ◽ Chin Kai Ling ◽ ...

Keyword(s): Mycobacterium Tuberculosis ◽ Whole Genome Sequencing ◽ Genome Sequencing ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Tuberculosis Strain ◽ Mycobacterium Tuberculosis Strain

Evaluation of Single-Molecule Sequencing Technologies for Structural Variant Detection in Two Swedish Human Genomes

Genes ◽ 10.3390/genes11121444 ◽ 2020 ◽ Vol 11 (12) ◽ pp. 1444

Author(s): Nazeefa Fatima ◽ Anna Petri ◽ Ulf Gyllensten ◽ Lars Feuk ◽ Adam Ameur

Keyword(s): Whole Genome Sequencing ◽ Genome Sequencing ◽ Single Molecule ◽ Large Scale ◽ Whole Genome Sequencing Data ◽ Whole Genome ◽ Sequencing Data ◽ Structural Variations ◽ Single Molecule Sequencing ◽ Human Samples

Long-read single molecule sequencing is increasingly used in human genomics research, as it allows accurate, high-resolution detection of large-scale DNA rearrangements such as structural variations (SVs). However, few studies have evaluated the performance of different single molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find a higher concordance between ONT and PacBio SVs detected in the same individual than between SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved overlap with the ONT SVs. Our results suggest that ONT and PacBio perform similarly for SV detection in human whole genome sequencing data, and that both technologies are feasible for population-scale studies.
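
Measuring concordance between two SV call sets requires a rule for when two calls count as the same event. The abstract does not state the rule used in the study; the sketch below uses reciprocal overlap, a common convention, with a 50% threshold chosen purely as an assumption. Function names, the (chromosome, start, end) tuple layout with half-open coordinates, and the example calls are hypothetical.

```python
from typing import List, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end), half-open coordinates


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two overlap fractions (overlap / length of each call)."""
    a_chrom, a_start, a_end = a
    b_chrom, b_start, b_end = b
    if a_chrom != b_chrom:
        return 0.0
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def concordant_fraction(
    calls_a: List[Call], calls_b: List[Call], min_ro: float = 0.5
) -> float:
    """Fraction of calls in calls_a matched by at least one call in calls_b."""
    if not calls_a:
        return 0.0
    matched = sum(
        1
        for a in calls_a
        if any(reciprocal_overlap(a, b) >= min_ro for b in calls_b)
    )
    return matched / len(calls_a)


# Made-up call sets standing in for two platforms.
example_calls_a = [("chr1", 100, 200), ("chr1", 1000, 1100), ("chr2", 100, 200)]
example_calls_b = [("chr1", 150, 250), ("chr1", 1090, 2000)]
example_fraction = concordant_fraction(example_calls_a, example_calls_b)
```

Interval overlap suits deletions and duplications; insertions, which have no reference span, are usually matched by breakpoint distance and size similarity instead, which this sketch does not cover.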
