The molecular clock of Mycobacterium tuberculosis

2019 ◽  
Author(s):  
F. Menardo ◽  
S. Duchêne ◽  
D. Brites ◽  
S. Gagneux

Abstract The molecular clock and its phylogenetic applications to genomic data have changed how we study and understand one of the major human pathogens, Mycobacterium tuberculosis (MTB), the causal agent of tuberculosis. Genome sequences of MTB strains sampled at different times are increasingly used to infer when a particular outbreak began, when a drug-resistant clone appeared and expanded, or when a strain was introduced into a specific region. Despite the growing importance of the molecular clock in tuberculosis research, there is a lack of consensus as to whether MTB displays clocklike behavior and about its rate of evolution. Here we performed a systematic study of the MTB molecular clock on a large genomic data set (6,285 strains) covering most of the global MTB diversity and representing different epidemiological settings. We found wide variation in the degree of clocklike structure among data sets, indicating that sampling times are sometimes insufficient to calibrate the clock of MTB. For data sets with temporal structure, we found that MTB genomes accumulate between 1×10^−8 and 5×10^−7 nucleotide changes per site per year, which corresponds to 0.04–2.2 SNPs per genome per year. Contrary to expectations, these estimates did not depend on the time of the calibration points: they did not change significantly whether we used epidemiological isolates (sampled in the last 40 years) or ancient DNA samples (about 1,000 years old) to calibrate the tree. Additionally, the uncertainty and the discrepancies in the results of different methods were often large, highlighting the importance of using different methods and of carefully considering their assumptions and limitations. Significance Statement: One of the major recent advancements in evolutionary biology is the development of statistical methods to infer the past evolutionary history of species and populations from genomic data. In the last five years, many researchers have used the molecular clock to study the evolution of Mycobacterium tuberculosis, a bacterial pathogen that causes tuberculosis and is responsible for millions of human deaths every year. The application of the molecular clock to tuberculosis is extremely useful for understanding the evolution of drug resistance, the spread of different strains and the origin of the disease. Since some of these studies found contrasting results, we performed a systematic analysis of the molecular clock of MTB. This study provides an important guideline for future analyses of tuberculosis and other organisms.
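As a quick sanity check on these numbers, the per-site rates can be scaled to whole-genome SNP rates. The minimal sketch below assumes an MTB genome length of roughly 4.4 Mb (the approximate size of the H37Rv reference); the exact alignment length used in the study may differ.

```python
# Hedged sketch: scale per-site substitution rates to SNPs per genome per year.
# The genome length is an assumption (~4.4 Mb, approximate H37Rv size), not a
# value taken from the study itself.
GENOME_LENGTH = 4.4e6  # sites (assumed)

def snps_per_genome_per_year(rate_per_site_per_year, genome_length=GENOME_LENGTH):
    """Scale a per-site clock rate to an expected whole-genome rate."""
    return rate_per_site_per_year * genome_length

for rate in (1e-8, 5e-7):  # range of clock rates reported in the abstract
    print(f"{rate:.0e} subs/site/year -> {snps_per_genome_per_year(rate):.2f} SNPs/genome/year")
```

Under these assumptions the two endpoints come out to roughly 0.04 and 2.2 SNPs per genome per year, matching the range quoted above.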

BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Da Xu ◽  
Jialin Zhang ◽  
Hanxiao Xu ◽  
Yusen Zhang ◽  
Wei Chen ◽  
...  

Abstract Background The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory due to their reliance on unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data bring great challenges for the identification of biomarkers and therapeutic targets, and current feature selection methods in this field suffer from low sensitivity and specificity. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. Comparisons with seven benchmark and six state-of-the-art supervised methods on eight data sets demonstrated that MCBFS is robust and effective. The visualization results and statistical tests showed that MCBFS can capture informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW that uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm and a feature selection wrapper. McbfsNW was applied to lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that the identified biomarkers attain higher prediction performance on the independent LUAD data set, and we also constructed a drug-target network that may be useful for LUAD therapy. Conclusions The proposed feature selection method is robust and effective for gene selection, classification, and visualization. The McbfsNW framework is practical and helpful for the identification of biomarkers and targets in genomic data. We believe the same methods and principles are extensible and applicable to other kinds of data sets.
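The abstract does not spell out the algorithm, but the general idea of clustering-based feature selection can be pictured with a short sketch: group correlated genes into clusters and keep one representative per cluster scored against the class labels. This is a generic illustration under those assumptions, not the published MCBFS method.

```python
# Generic clustering-based feature selection sketch (NOT the published MCBFS
# algorithm): cluster genes by expression profile, then retain the gene with
# the highest supervised score from each cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import f_classif

def cluster_based_selection(X, y, n_clusters=50):
    """X: samples x genes expression matrix; y: class labels per sample."""
    gene_clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X.T)
    scores, _ = f_classif(X, y)  # ANOVA F-score of each gene against the labels
    selected = [np.where(gene_clusters == c)[0][np.argmax(scores[gene_clusters == c])]
                for c in range(n_clusters)]
    return np.array(selected)
```

The selected gene indices could then be passed to any downstream classifier or visualization step.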


2021 ◽  
Vol 7 (12) ◽  
Author(s):  
Arthur K. Turner ◽  
Muhammad Yasir ◽  
Sarah Bastkowski ◽  
Andrea Telatin ◽  
Andrew Page ◽  
...  

Trimethoprim and sulfamethoxazole are commonly used together as cotrimoxazole for the treatment of urinary tract and other infections. The evolution of resistance to these and other antibacterials threatens therapeutic options for clinicians. We generated and analysed a whole-genome chemical-biology data set to predict new targets for antibacterial combinations with trimethoprim and sulfamethoxazole. For this, we used a large transposon mutant library in Escherichia coli BW25113 in which an outward-transcribing inducible promoter was engineered into one end of the transposon. This approach allows regulated expression of adjacent genes in addition to gene inactivation at transposon insertion sites, a methodology that has been called TraDIS-Xpress. These chemical-genomic data sets identified mechanisms for both reduced and increased susceptibility to trimethoprim and sulfamethoxazole. The data showed that over-expression of FolA, a known mechanism of reduced susceptibility, decreased trimethoprim susceptibility. In addition, transposon insertions into the genes tdk, deoR, ybbC, hha, ldcA, wbbK and waaS increased susceptibility to trimethoprim, as did insertions into rsmH, fadR, ddlB, nlpI and prc for sulfamethoxazole, while insertions in ispD, uspC, minC, minD, yebK, truD and umpG increased susceptibility to both of these antibiotics. Two of these genes' products, Tdk and IspD, are inhibited by AZT and fosmidomycin respectively, antibiotics that are known to synergise with trimethoprim. Thus, the data identified two known targets and several new target candidates for the development of co-drugs that synergise with trimethoprim, sulfamethoxazole or cotrimoxazole. We demonstrate that the TraDIS-Xpress technology can be used to generate information-rich chemical-genomic data sets that can be used for antibacterial development.
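The abstract summarises the TraDIS-Xpress readout at a high level; one way to picture the kind of comparison involved is a simple per-gene fold-change of insertion read counts between drug-exposed and control libraries. The sketch below is a generic illustration with made-up counts, not the published TraDIS-Xpress analysis pipeline.

```python
# Generic illustration (not the published TraDIS-Xpress analysis): genes whose
# insertion-mutant reads drop sharply under drug exposure are candidates for
# increased susceptibility when inactivated; genes whose reads rise may reduce
# susceptibility when disrupted or over-expressed.
import math

def insertion_log2_fold_changes(control_counts, treated_counts, pseudocount=1):
    """Inputs: dicts mapping gene name -> summed insertion read count."""
    return {gene: math.log2((treated_counts.get(gene, 0) + pseudocount) /
                            (control_counts[gene] + pseudocount))
            for gene in control_counts}

# Made-up example counts, purely illustrative.
ctrl = {"tdk": 950, "ispD": 800, "folA": 400}
drug = {"tdk": 40, "ispD": 30, "folA": 2600}
print(insertion_log2_fold_changes(ctrl, drug))
```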


2017 ◽  
Author(s):  
Raúl Amado Cattáneo ◽  
Luis Diambra ◽  
Andrés Norman McCarthy

Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, approaches that consider large genomic data sets are of growing importance for elucidating evolutionary relationships among species. Among these, assembly- and alignment-free methods, which allow efficient distance computation and phylogeny reconstruction, are of particular importance. However, it is not yet clear under what conditions of data quality and abundance such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole-genome data in the elucidation of tomato chloroplast phylogenetics using short-read sequences. We find that this assembly- and alignment-free method is capable of reproducing previous results under conditions of high coverage, provided that low-frequency k-mers (i.e. error-prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data-quality candidates among the recently published 360 tomato genomes.
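The abstract does not reproduce the distance formula; a minimal sketch of the general assembly- and alignment-free workflow (count k-mers from reads, discard low-frequency k-mers as likely errors, compare samples by shared k-mer content) is given below. The Jaccard distance used here is an illustrative stand-in, not the exact distance of Fan et al.

```python
# Minimal assembly- and alignment-free sketch: k-mer counting with a frequency
# filter, followed by pairwise Jaccard distances between samples. Parameter
# values (k=21, min_count=3) are illustrative assumptions.
from collections import Counter
from itertools import combinations

def kmer_set(reads, k=21, min_count=3):
    counts = Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))
    # Dropping rare k-mers removes most sequencing-error k-mers at high coverage.
    return {kmer for kmer, c in counts.items() if c >= min_count}

def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(samples, k=21, min_count=3):
    """samples: dict mapping sample name -> iterable of read strings."""
    sets = {name: kmer_set(reads, k, min_count) for name, reads in samples.items()}
    return {(x, y): jaccard_distance(sets[x], sets[y])
            for x, y in combinations(sorted(sets), 2)}
```

The resulting distance matrix can then be handed to any standard distance-based tree-building method such as neighbour joining.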


2019 ◽  
Author(s):  
Thomas A. Maigret ◽  
John J. Cox ◽  
David W. Weisrock

Abstract The resolution offered by genomic data sets, coupled with recently developed spatially informed analyses, is allowing researchers to quantify population structure at increasingly fine temporal and spatial scales. However, uncertainties regarding data set size and quality thresholds, and the time scale at which barriers to gene flow become detectable, have limited both empirical research and conservation measures. Here, we used restriction site-associated DNA sequencing to generate a large SNP data set for the copperhead snake (Agkistrodon contortrix) and address the population genomic impacts of recent and widespread landscape modification across an approximately 1,000 km² region of eastern Kentucky. Nonspatial population-based assignment and clustering methods supported little to no population structure. However, using individual-based spatial autocorrelation approaches, we found evidence for genetic structuring that closely follows the path of a historic highway which experienced high traffic volumes from ca. 1920 to 1970. We found no similar spatial genomic signatures associated with more recently constructed highways or surface mining activity, though a time lag effect may be responsible for the lack of any emergent spatial genetic patterns. Subsampling of our SNP data set suggested that similar results could be obtained with as few as 250 SNPs, and thresholds for missing data had limited impacts on the spatial patterns we detected, outside of very strict or permissive extremes. Our findings highlight the importance of temporal factors in landscape genetics approaches, and suggest the potential advantages of large genomic data sets and fine-scale, spatially informed approaches for quantifying subtle genetic patterns in temporally complex landscapes.
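The individual-based spatial approaches are described only in outline here; one simple member of that family is a Mantel test correlating pairwise genetic and geographic distances. The sketch below illustrates that generic test, assuming precomputed distance matrices; it is not a reproduction of the analyses used in the study.

```python
# Generic Mantel test sketch (illustrative, not the study's exact analysis):
# correlate pairwise genetic and geographic distances and assess significance
# by permuting individuals.
import numpy as np

def mantel_test(genetic_dist, geographic_dist, permutations=999, seed=0):
    """Inputs: square, symmetric distance matrices over the same individuals."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(genetic_dist, k=1)
    x, y = genetic_dist[iu], geographic_dist[iu]
    observed = np.corrcoef(x, y)[0, 1]
    n = genetic_dist.shape[0]
    hits = 0
    for _ in range(permutations):
        p = rng.permutation(n)
        permuted = genetic_dist[np.ix_(p, p)]  # permute rows and columns together
        if np.corrcoef(permuted[iu], y)[0, 1] >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```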


2016 ◽  
Author(s):  
K. Jun Tong ◽  
Nathan Lo ◽  
Simon Y W Ho

Reconstructing the timescale of the Tree of Life is one of the principal aims of evolutionary biology. This has been greatly aided by the development of the molecular clock, which enables evolutionary timescales to be estimated from genetic data. In recent years, high-throughput sequencing technology has led to an increase in the feasibility and availability of genome-scale data sets. These represent a rich source of biological information, but they also bring a set of analytical challenges. In this review, we provide an overview of phylogenomic dating and describe the challenges associated with analysing genome-scale data. We also report on recent phylogenomic estimates of the evolutionary timescales of mammals, birds, and insects.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Irena Fischer-Hwang ◽  
Idoia Ochoa ◽  
Tsachy Weissman ◽  
Mikel Hernaez

Abstract Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step in many of these pipelines and is increasingly being used in clinical settings to aid medical practice. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole-genome sequencing (WGS) data sets. In the WGS data set, denoising led to the identification of almost 2,000 additional true variants and the elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.
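The variant-level gains and losses quoted above amount to comparing call sets before and after denoising against a truth set. A minimal sketch of that bookkeeping, assuming variants are represented as (chrom, pos, ref, alt) tuples and leaving out VCF parsing and normalisation, is shown below; it is not part of SAMDUDE itself.

```python
# Hedged evaluation sketch (not part of SAMDUDE): count variants gained or lost
# by denoising, split by whether they appear in a truth set.
def call_set_delta(truth, before, after):
    """Each argument: set of (chrom, pos, ref, alt) tuples."""
    gained, lost = after - before, before - after
    return {
        "new_true_variants": len(gained & truth),   # true variants gained by denoising
        "removed_false_calls": len(lost - truth),   # erroneous calls eliminated
        "lost_true_variants": len(lost & truth),    # true variants lost by denoising
        "new_false_calls": len(gained - truth),     # erroneous calls introduced
    }
```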


Author(s):  
Xin Li

In this paper, the authors present a new approach to performing principal component analysis (PCA)-based gene clustering on genomic data distributed across multiple sites (horizontal partitions) with privacy protection. This approach allows data providers to collaborate in identifying gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. The authors developed a framework for privacy-preserving PCA-based gene clustering that includes two types of participants: data providers and a trusted central site. Within this mechanism, distributed horizontal partitions of genomic data can be clustered globally with privacy preservation. Compared to results from the centralized scenario, the results generated from distributed partitions achieve 100% accuracy with this approach. An experiment on a real genomic data set was conducted, and the results show that the proposed framework produces exactly the same cluster formation as the centralized data set.
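The abstract describes the protocol only at a high level. A minimal sketch of the general idea, in which each data provider shares only aggregate statistics (sample counts, column sums and a gene-by-gene scatter matrix) and the central site assembles the global covariance for PCA-based gene clustering, is given below. It assumes every site holds the same genes over different samples and is not necessarily the authors' exact mechanism.

```python
# Sketch of privacy-preserving PCA-based gene clustering over horizontal
# partitions: sites send only aggregates, never per-sample rows. This is a
# generic illustration, not necessarily the paper's exact protocol.
import numpy as np
from sklearn.cluster import KMeans

def site_summary(X_local):
    """Run at each data provider. X_local: local samples x genes matrix."""
    return X_local.shape[0], X_local.sum(axis=0), X_local.T @ X_local

def central_gene_clustering(summaries, n_components=2, n_clusters=3):
    """Run at the trusted central site on the collected summaries."""
    n = sum(s[0] for s in summaries)
    col_sum = sum(s[1] for s in summaries)
    scatter = sum(s[2] for s in summaries)
    mean = col_sum / n
    cov = scatter / n - np.outer(mean, mean)       # global gene x gene covariance
    _, eigvecs = np.linalg.eigh(cov)
    loadings = eigvecs[:, ::-1][:, :n_components]  # each gene's loadings on the top PCs
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(loadings)
```

Because the summed scatter matrices equal those of the pooled data, the central site recovers the same covariance, and hence the same PCA-based clustering, it would obtain from centralized data without ever seeing individual samples, which is consistent with the identical cluster formation the abstract reports.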

