A scalable algorithm for clonal reconstruction from sparse time course genomic sequencing data

Mapping Intimacies ◽

10.1101/2021.08.19.457037 ◽

2021 ◽

Author(s):

Wazim Mohammed Ismail ◽

Haixu Tang

Keyword(s):

Maximum Likelihood ◽

Time Course ◽

Bacterial Population ◽

Greedy Algorithms ◽

Genomic Sequencing ◽

Sampling Variance ◽

Sequencing Data ◽

Novel Mutations ◽

Scalable Algorithm ◽

Time Course Data

Long-term evolution experiments (LTEEs) reveal the dynamics of clonal compositions in an evolving bacterial population over time. Accurately inferring the haplotypes (the sets of mutations that identify each clone), as well as the clonal frequencies and evolutionary history in a bacterial population, is useful for characterizing the evolutionary pressure on multiple correlated mutations rather than on individual mutations. Here, we study the computational problem of reconstructing the haplotypes of bacterial clones from the variant allele frequencies (VAFs) observed during a time course in an LTEE. Previously, we formulated the problem using a maximum likelihood approach under the assumption that mutations occur spontaneously, and thus the likelihood of a mutation occurring in a specific clone is proportional to the frequency of that clone in the population when the mutation occurs. We also developed several heuristic greedy algorithms to solve the problem, which were shown to report accurate clonal reconstructions on simulated and real time course genomic sequencing data from LTEEs. However, these algorithms are too slow to handle sparse time course data when the number of novel mutations occurring during the time course is much greater than the number of time points sampled. In this paper, we present a novel scalable algorithm for clonal reconstruction from sparse time course data. We employed a statistical method to estimate the sampling variance of VAFs derived from low coverage sequencing data and incorporated it into the maximum likelihood framework for clonal reconstruction on noisy sequencing data. We implemented the algorithm (named ClonalTREE2) and tested it using simulated and real sparse time course genomic sequencing data. The results showed that the algorithm was fast and achieved near-optimal accuracy under the maximum likelihood framework for time course data involving hundreds of novel mutations at each time point.
The source code of ClonalTREE2 is available at https://github.com/COL-IU/ClonalTREE2.
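
To illustrate why low coverage makes VAFs noisy, the following minimal sketch estimates a VAF and its sampling variance under a plain binomial read-sampling model. The binomial model and the function name `vaf_sampling_variance` are assumptions made for illustration; the abstract does not specify which statistical method ClonalTREE2 uses.

```python
def vaf_sampling_variance(alt_reads: int, depth: int) -> tuple[float, float]:
    """Return (VAF estimate, sampling variance) for a single variant site.

    Assumes reads are drawn independently, so the number of reads carrying
    the variant is binomial with `depth` trials (an illustrative assumption).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    vaf = alt_reads / depth
    # Variance of a binomial proportion: p * (1 - p) / n.
    variance = vaf * (1.0 - vaf) / depth
    return vaf, variance


# Same observed VAF, different coverage: the low-coverage estimate is noisier.
low_cov = vaf_sampling_variance(5, 10)
high_cov = vaf_sampling_variance(50, 100)
print(low_cov, high_cov)
```

Under this model the variance shrinks in proportion to the read depth, which is one way to see why sparse, low coverage data call for an explicit noise term in the likelihood.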

Clonal reconstruction from time course genomic sequencing data

BMC Genomics ◽

10.1186/s12864-019-6328-3 ◽

2019 ◽

Vol 20 (S12) ◽

Author(s):

Wazim Mohammed Ismail ◽

Haixu Tang

Keyword(s):

Maximum Likelihood ◽

Time Course ◽

Bacterial Population ◽

Heuristic Algorithms ◽

Long Term Evolution ◽

Genomic Sequencing ◽

Multiple Time ◽

Sequencing Data ◽

Term Evolution

Abstract Background Bacterial cells during many replication cycles accumulate spontaneous mutations, which result in the birth of novel clones. As a result of this clonal expansion, an evolving bacterial population has different clonal composition over time, as revealed in the long-term evolution experiments (LTEEs). Accurately inferring the haplotypes of novel clones as well as the clonal frequencies and the clonal evolutionary history in a bacterial population is useful for the characterization of the evolutionary pressure on multiple correlated mutations instead of that on individual mutations. Results In this paper, we study the computational problem of reconstructing the haplotypes of bacterial clones from the variant allele frequencies observed from an evolving bacterial population at multiple time points. We formalize the problem using a maximum likelihood function, which is defined under the assumption that mutations occur spontaneously, and thus the likelihood of a mutation occurring in a specific clone is proportional to the frequency of the clone in the population when the mutation occurs. We develop a series of heuristic algorithms to address the maximum likelihood inference, and show through simulation experiments that the algorithms are fast and achieve near optimal accuracy that is practically plausible under the maximum likelihood framework. We also validate our method using experimental data obtained from a recent study on long-term evolution of Escherichia coli. Conclusion We developed efficient algorithms to reconstruct the clonal evolution history from time course genomic sequencing data. Our algorithm can also incorporate clonal sequencing data to improve the reconstruction results when they are available. Based on the evaluation on both simulated and experimental sequencing data, our algorithms can achieve satisfactory results on the genome sequencing data from long-term evolution experiments. 
Availability The program (ClonalTREE) is available as open-source software on GitHub at https://github.com/COL-IU/ClonalTREE.
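
The likelihood assumption stated in the abstract can be sketched as follows: each novel clone contributes the frequency of its parent clone at the time its mutation arose, and the tree likelihood is the product of these terms. This is a simplified illustration, not the ClonalTREE implementation; the function name `tree_log_likelihood` and the input layout are hypothetical.

```python
import math


def tree_log_likelihood(parents, freq_at_birth):
    """Log-likelihood of a clonal tree under the spontaneous-mutation assumption.

    parents[i] is the index of the parent of clone i (None for the founder).
    freq_at_birth[i][j] is the population frequency of clone j at the time
    the mutation defining clone i arose (hypothetical input layout).
    """
    total = 0.0
    for child, parent in enumerate(parents):
        if parent is None:
            continue  # the founder clone has no parent term
        f = freq_at_birth[child][parent]
        if f <= 0.0:
            return float("-inf")  # a mutation cannot arise in an absent clone
        total += math.log(f)
    return total


# Three clones: clone 0 is the founder; clones 1 and 2 arise later.
freqs = [
    [1.0, 0.0, 0.0],  # frequencies when clone 0 is founded
    [1.0, 0.0, 0.0],  # frequencies when clone 1 arises
    [0.7, 0.3, 0.0],  # frequencies when clone 2 arises
]
star_tree = tree_log_likelihood([None, 0, 0], freqs)   # clone 2 descends from clone 0
chain_tree = tree_log_likelihood([None, 0, 1], freqs)  # clone 2 descends from clone 1
print(star_tree, chain_tree)
```

In this toy example the star tree scores higher because clone 0 was more abundant than clone 1 when clone 2 arose. A real reconstruction would also have to respect constraints linking the observed variant allele frequencies to the tree, which this sketch omits.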

Clonal reconstruction from time course genomic sequencing data

10.1101/832063 ◽

2019 ◽

Cited By ~ 1

Author(s):

Wazim Mohammed Ismail ◽

Haixu Tang

Keyword(s):

Maximum Likelihood ◽

Time Course ◽

Bacterial Population ◽

Heuristic Algorithms ◽

Long Term Evolution ◽

Genomic Sequencing ◽

Multiple Time ◽

Sequencing Data ◽

Term Evolution

Abstract Background Bacterial cells during many replication cycles accumulate spontaneous mutations, which result in the birth of novel clones. As a result of this clonal expansion, an evolving bacterial population has different clonal composition over time, as revealed in the long-term evolution experiments (LTEEs). Accurately inferring the haplotypes of novel clones as well as the clonal frequencies and the clonal evolutionary history in a bacterial population is useful for the characterization of the evolutionary pressure on multiple correlated mutations instead of that on individual mutations. Results In this paper, we study the computational problem of reconstructing the haplotypes of bacterial clones from the variant allele frequencies observed from an evolving bacterial population at multiple time points. We formalize the problem using a maximum likelihood function, which is defined under the assumption that mutations occur spontaneously, and thus the likelihood of a mutation occurring in a specific clone is proportional to the frequency of the clone in the population when the mutation occurs. We develop a series of heuristic algorithms to address the maximum likelihood inference, and show through simulation experiments that the algorithms are fast and achieve near optimal accuracy that is practically plausible under the maximum likelihood framework. We also validate our method using experimental data obtained from a recent study on long-term evolution of Escherichia coli. Conclusion We developed efficient algorithms to reconstruct the clonal evolution history from time course genomic sequencing data. Our algorithm can also incorporate clonal sequencing data to improve the reconstruction results when they are available. Based on the evaluation on both simulated and experimental sequencing data, our algorithms can achieve satisfactory results on the genome sequencing data from long-term evolution experiments.
Availability The program (ClonalTREE) is available as open-source software on GitHub at https://github.com/COL-IU/ClonalTREE.

Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases

Genetics in Medicine ◽

10.1038/s41436-020-01084-8 ◽

2021 ◽

Author(s):

Shilpa Nadimpalli Kobren ◽

Dustin Baldridge ◽

Matt Velinder ◽

Joel B. Krier ◽

...

Keyword(s):

Online Survey ◽

Variant Calling ◽

Theoretical Method ◽

The United States ◽

Genomic Sequencing ◽

Biomedical Data ◽

Sequencing Data ◽

Multimodal Data ◽

Undiagnosed Diseases ◽

Undiagnosed Diseases Network

Abstract Purpose Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful. Methods We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols. Results We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases. Conclusion The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.

Biallelic novel mutations of the COL27A1 gene in a patient with Steel syndrome

Human Genome Variation ◽

10.1038/s41439-021-00149-7 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Jong Seop Kim ◽

Hyoungseok Jeon ◽

Hyeran Lee ◽

Jung Min Ko ◽

Yonghwan Kim ◽

...

Keyword(s):

Hip Dysplasia ◽

Large Deletion ◽

Compound Heterozygous ◽

Radial Head Dislocation ◽

Sequencing Data ◽

Novel Mutations ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data ◽

Carpal Coalition

Abstract An 11-year-old Korean boy presented with short stature, hip dysplasia, radial head dislocation, carpal coalition, genu valgum, and fixed patellar dislocation and was clinically diagnosed with Steel syndrome. Scrutinizing the trio whole-exome sequencing data revealed novel compound heterozygous mutations of COL27A1 (c.[4229_4233dup]; [3718_5436del], p.[Gly1412Argfs*157];[Gly1240_Lys1812del]) in the proband, which were inherited from heterozygous parents. The maternal mutation was a large deletion encompassing exons 38–60, which was challenging to detect.

Bayesian approach for predicting responses to therapy from high-dimensional time-course gene expression profiles

BMC Bioinformatics ◽

10.1186/s12859-021-04052-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Arika Fukushima ◽

Masahiro Sugimoto ◽

Satoru Hiwa ◽

Tomoyuki Hiroyasu

Keyword(s):

Gene Expression ◽

Time Course ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Response To Therapy ◽

Hcv Infection ◽

Entire Treatment Period ◽

Time Points ◽

Time Course Data ◽

Conventional Methods

Abstract Background Historical and updated information provided by time-course data collected during an entire treatment period proves to be more useful than information provided by single-point data. Accurate predictions made using time-course data on multiple biomarkers that indicate a patient’s response to therapy contribute positively to the decision-making process associated with designing effective treatment programs for various diseases. Therefore, the development of prediction methods incorporating time-course data on multiple markers is necessary. Results We proposed new methods that may be used for prediction and gene selection via time-course gene expression profiles. Our prediction method consolidated multiple probabilities calculated using gene expression profiles collected over a series of time points to predict therapy response. Using two data sets collected from patients with hepatitis C virus (HCV) infection and multiple sclerosis (MS), we performed numerical experiments that predicted response to therapy and evaluated their accuracies. Our methods were more accurate than conventional methods and successfully selected genes, the functions of which were associated with the pathology of HCV infection and MS. Conclusions The proposed method accurately predicted response to therapy using data at multiple time points. It showed higher accuracies at early time points compared to those of conventional methods. Furthermore, this method successfully selected genes that were directly associated with diseases.

Identification of the macrophage-specific promoter signature in FANTOM5 mouse embryo developmental time course data

Journal of Leukocyte Biology ◽

10.1189/jlb.1a0417-150rr ◽

2017 ◽

Vol 102 (4) ◽

pp. 1081-1092 ◽

Cited By ~ 17

Author(s):

Kim M. Summers ◽

David A. Hume

Keyword(s):

Mouse Embryo ◽

Time Course ◽

Developmental Time ◽

Time Course Data

An empirical Bayes change-point model for transcriptome time-course data

The Annals of Applied Statistics ◽

10.1214/20-aoas1403 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Tian Tian ◽

Ruihua Cheng ◽

Zhi Wei

Keyword(s):

Change Point ◽

Empirical Bayes ◽

Time Course ◽

Point Model ◽

Time Course Data ◽

Change Point Model

Generalized Correlation Coefficient for Non-Parametric Analysis of Microarray Time-Course Data

Journal of Integrative Bioinformatics ◽

10.1515/jib-2017-0011 ◽

2017 ◽

Vol 14 (2) ◽

Author(s):

Qihua Tan ◽

Mads Thomassen ◽

Mark Burton ◽

Kristian Fredløv Mose ◽

Klaus Ejner Andersen ◽

...

Keyword(s):

Gene Expression ◽

Correlation Coefficient ◽

Parametric Analysis ◽

Time Course ◽

Expression Patterns ◽

Gene Expression Patterns ◽

Microarray Time Course ◽

Time Course Data ◽

Generalized Correlation ◽

Non Parametric

Abstract Modeling complex time-course patterns is a challenging issue in microarray studies because gene expression responds in complex ways over the course of a time-course experiment. We introduce the generalized correlation coefficient and propose a combinatory approach for detecting, testing and clustering heterogeneous time-course gene expression patterns. Application of the method identified nonlinear time-course patterns in high agreement with parametric analysis. We conclude that, owing to its non-parametric nature, generalized correlation analysis could be a useful and efficient tool for analyzing microarray time-course data and for exploring complex relationships in omics data and their association with disease and health.
