Joint detection of germline and somatic copy number events in matched tumor–normal sample pairs

2019 ◽  
Vol 35 (23) ◽  
pp. 4955-4961
Author(s):  
Yongzhuang Liu ◽  
Jian Liu ◽  
Yadong Wang

Abstract Motivation Whole-genome sequencing (WGS) of tumor–normal sample pairs is a powerful approach for comprehensively characterizing germline copy number variations (CNVs) and somatic copy number alterations (SCNAs) in cancer research and clinical practice. Existing computational approaches for detecting copy number events cannot detect germline CNVs and SCNAs simultaneously, and yield low accuracy for SCNAs. Results In this study, we developed TumorCNV, a novel approach for jointly detecting germline CNVs and SCNAs from WGS data of a matched tumor–normal sample pair. We compared TumorCNV with existing copy number event detection approaches using simulated data and real data from the COLO-829 melanoma cell line. The experimental results showed that TumorCNV outperformed existing approaches. Availability and implementation The software TumorCNV is implemented using a combination of Java and R, and it is freely available at https://github.com/yongzhuang/TumorCNV. Supplementary information Supplementary data are available at Bioinformatics online.

2010 ◽  
Vol 08 (02) ◽  
pp. 295-314 ◽  
Author(s):  
XIAO-LIN YIN ◽  
JING LI

Array comparative genomic hybridization (aCGH) allows identification of copy number alterations across genomes. The key computational challenge in analyzing copy number variations (CNVs) using aCGH data, or similar data generated by other array technologies, is the detection of segment boundaries of copy number changes and inference of the copy number state for each segment. We have developed a novel statistical model based on the framework of conditional random fields (CRFs) that effectively combines data smoothing, segmentation and copy number state decoding into one unified framework. Our approach (termed CRF-CNV) provides great flexibility in defining meaningful feature functions, so it can effectively integrate local spatial information of arbitrary sizes into the model. For model parameter estimation, we have adopted the conjugate gradient (CG) method for likelihood optimization and developed efficient forward/backward algorithms within the CG framework. The method is evaluated using real data with known copy numbers as well as simulated data with realistic assumptions, and compared with two popular publicly available programs. Experimental results demonstrate that CRF-CNV outperforms a Bayesian Hidden Markov Model-based approach on both datasets in terms of copy number assignments. Compared to a non-parametric approach, CRF-CNV achieves much greater precision while maintaining the same level of recall on the real data, and the two methods perform comparably on the simulated data.


2020 ◽  
Vol 36 (12) ◽  
pp. 3890-3891
Author(s):  
Linjie Wu ◽  
Han Wang ◽  
Yuchao Xia ◽  
Ruibin Xi

Abstract Motivation Whole-genome sequencing (WGS) is widely used for copy number variation (CNV) detection. However, for most bacteria, their circular genome structure and high replication rate make reads more enriched near the replication origin. CNV detection based on read depth could be seriously influenced by such replication bias. Results We show that the replication bias is widespread using ∼200 bacterial WGS data. We develop CNV-BAC (CNV-Bacteria) that can properly normalize the replication bias and other known biases in bacterial WGS data and can accurately detect CNVs. Simulation and real data analysis show that CNV-BAC achieves the best performance in CNV detection compared with available algorithms. Availability and implementation CNV-BAC is available at https://github.com/XiDsLab/CNV-BAC. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Viraj Shah ◽  
Chinmay Hegde

Abstract We consider the problem of reconstructing a signal from under-determined modulo observations (or measurements). This observation model is inspired by a (relatively) less well-known imaging mechanism called modulo imaging, which can be used to extend the dynamic range of imaging systems; variations of this model have also been studied under the category of phase unwrapping. Signal reconstruction in the under-determined regime with modulo observations is a challenging ill-posed problem, and existing reconstruction methods cannot be used directly. In this paper, we propose a novel approach to solving the inverse problem limited to two modulo periods, inspired by recent advances in algorithms for phase retrieval under sparsity constraints. We show that given a sufficient number of measurements, our algorithm perfectly recovers the underlying signal and provides improved performance over other existing algorithms. We also provide experiments validating our approach on both synthetic and real data to depict its superior performance.
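The observation model described in the abstract above can be illustrated with a minimal sketch. It only generates under-determined modulo measurements of a sparse signal; it does not implement the authors' recovery algorithm. The function name `modulo_measurements`, the Gaussian measurement matrix, the period `R = 1.0`, and all dimensions are assumptions chosen for illustration.

```python
import random


def modulo_measurements(x, A, R):
    """Return y_i = (a_i . x) mod R for each row a_i of A (hypothetical helper)."""
    return [sum(a * v for a, v in zip(row, x)) % R for row in A]


random.seed(0)

# Illustrative sizes (assumptions): signal length n, m < n measurements,
# s nonzero entries, and modulo period R.
n, m, s, R = 20, 10, 3, 1.0

# Sparse signal: s randomly placed Gaussian entries, zeros elsewhere.
x = [0.0] * n
for idx in random.sample(range(n), s):
    x[idx] = random.gauss(0.0, 1.0)

# Random Gaussian measurement matrix with fewer rows than columns.
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

# Each measurement is folded back into one modulo period.
y = modulo_measurements(x, A, R)
```

Each `y[i]` differs from the unobserved linear measurement by an integer multiple of `R`, which is what makes recovery harder than ordinary sparse recovery.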


2019 ◽  
Vol 35 (14) ◽  
pp. i408-i416 ◽  
Author(s):  
Nuraini Aguse ◽  
Yuanyuan Qi ◽  
Mohammed El-Kebir

Abstract Motivation Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees. Results We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data at a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T. Availability and implementation https://github.com/elkebir-group/MCT. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Xizhi Luo ◽  
Fei Qin ◽  
Guoshuai Cai ◽  
Feifei Xiao

Abstract Motivation Copy number variation plays important roles in human complex diseases. Detecting copy number variants (CNVs) relies on identifying mean shifts in genetic intensities to locate chromosomal breakpoints, a step referred to as chromosomal segmentation. Many segmentation algorithms have been developed under the strong assumption that observations at genetic loci are independent and that each locus has an equal chance of being a breakpoint (i.e. a CNV boundary). However, this assumption is violated from a genetics perspective because of correlation among genomic positions, such as linkage disequilibrium (LD). Our study showed that the LD structure is related to the location distribution of CNVs, which indeed presents a non-random pattern on the genome. To detect CNVs more accurately, we proposed a novel algorithm, LDcnv, that models CNV data together with its biological characteristics relating to genetic dependence structure (i.e. LD). Results We theoretically demonstrated the correlation structure of CNV data in SNP arrays, which further supports the necessity of integrating biological structure into statistical methods for CNV detection. We therefore developed LDcnv, which integrates the genomic correlation structure with a local search strategy into statistical modeling of the CNV intensities. To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. We showed that LDcnv presented high accuracy, stability and robustness in CNV detection and higher precision in detecting short CNVs compared to existing methods. This new segmentation algorithm has a wide scope of potential application with data from various high-throughput technology platforms. Availability and implementation https://github.com/FeifeiXiaoUSC/LDcnv. Supplementary information Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Xi Chen ◽  
Jianhua Xuan

Abstract In this paper, we propose a novel approach, MSIGNET, to identify subnetworks with significantly expressed genes by integrating context-specific gene expression and protein-protein interaction (PPI) data. Specifically, we integrate the differential expression of each gene and the mutual information of gene pairs in a Bayesian framework and use Metropolis sampling to identify functional interactions. During the sampling process, a conditional probability is calculated given a randomly selected gene to control the network state transition. Our method provides global statistics of all genes and their interactions, and finally achieves a globally optimal subnetwork. We apply MSIGNET to simulated data and demonstrate its superior performance over comparable network identification tools. Using a validated Parkinson data set, we show that the network identified using MSIGNET is consistent with previously reported results but provides a more biologically meaningful interpretation of Parkinson’s disease. Finally, to study networks related to ovarian cancer recurrence, we investigate two patient data sets. Networks identified from the independent data sets show functional consistency, and their common genes and interactions are well supported by current biological knowledge.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Luísa Esteves ◽  
Francisco Caramelo ◽  
Ilda Patrícia Ribeiro ◽  
Isabel M. Carreira ◽  
Joana Barbosa de Melo

Abstract Copy number alterations (CNAs) comprise deletions or amplifications of fragments of genomic material that are particularly common in cancer and play a major role in its development and progression. High-resolution microarray-based genome-wide technologies have been widely used to detect CNAs, generating complex datasets that require further processing before meaningful results can be determined. In this work, we propose a methodology to determine common regions of CNAs from these datasets, which in turn are used to infer the probability distribution of disease profiles in the population. This methodology was validated using simulated data and assessed using real data from Head and Neck Squamous Cell Carcinoma and Lung Adenocarcinoma, from the TCGA platform. Probability distribution profiles were produced, allowing for the distinction between different phenotypic groups established within that cohort. This method may be used to distinguish between groups in the diseased population, within well-established degrees of confidence. The application of such methods may be of great value in the clinical context, both as a diagnostic or prognostic tool and as a way to help establish the most adequate treatment and care plans.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, and they often show poor or undemonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation The SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Adrian L Hauber ◽  
Raphael Engesser ◽  
Joep Vanlier ◽  
Jens Timmer

Abstract Motivation Apparent time delays in partly observed biochemical reaction networks can be modeled by lumping a more complex reaction into a series of linear reactions, often referred to as the linear chain trick. Since most delays in biochemical reactions are not true, hard delays but a consequence of complex unobserved processes, this approach often represents the true system more closely than delay differential equations do. In this paper, we address the question of how to select the optimal number of additional equations, i.e. the chain length. Results We derive a criterion based on parameter identifiability to infer chain lengths and compare this method to choosing the model with a chain length that leads to the best fit in a maximum likelihood sense, which corresponds to optimising the Bayesian information criterion. We evaluate performance with simulated data as well as with measured biological data for a model of JAK2/STAT5 signalling and assess the influence of different model structures and data characteristics. Our analysis revealed that the proposed method performs better on biological models and data than choosing the model that maximises the likelihood. Availability Models and data used for simulations are available at https://github.com/Data2Dynamics/d2d and http://jeti.uni-freiburg.de/PNAS_Swameye_Data. Supplementary information Supplementary data are available at Bioinformatics online.
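The linear chain trick mentioned in the abstract above can be illustrated with a small simulation. This is a generic sketch, not the authors' model or selection criterion: it assumes a unit step input, an identical rate `k = n / tau` for every link (so the mean delay stays at `tau` for any chain length `n`), and simple explicit Euler integration. The function name and parameters are hypothetical.

```python
def linear_chain_step_response(n, tau, t_end, dt=1e-3):
    """Explicit-Euler simulation of n chained first-order linear reactions.

    Assumptions for illustration: the input is a unit step at t = 0 and every
    link has the same rate k = n / tau, giving a mean delay of tau.
    Returns the value of the last chain state at time t_end.
    """
    k = n / tau
    x = [0.0] * n
    steps = int(round(t_end / dt))
    for _ in range(steps):
        upstream = 1.0  # unit step drives the first link
        new_x = []
        for xi in x:
            # dx_i/dt = k * (x_{i-1} - x_i), using values from the previous step
            new_x.append(xi + dt * k * (upstream - xi))
            upstream = xi
        x = new_x
    return x[-1]


tau = 1.0
# Output halfway through the nominal delay, for a short and a long chain.
early_short = linear_chain_step_response(1, tau, 0.5 * tau)
early_long = linear_chain_step_response(10, tau, 0.5 * tau)
```

A longer chain keeps the output closer to zero before `tau` and therefore looks more like a hard delay, which is why the chain length is a modelling choice worth selecting carefully.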


2015 ◽  
Vol 27 (9) ◽  
pp. 1983-2010 ◽  
Author(s):  
Antonio Soriano ◽  
Luis Vergara ◽  
Bouziane Ahmed ◽  
Addisson Salazar

We present a new method for fusing scores corresponding to different detectors (two-hypotheses case). It is based on alpha integration, which we have adapted to the detection context. Three optimization methods are presented: least mean square error, maximization of the area under the ROC curve, and minimization of the probability of error. Gradient algorithms are proposed for the three methods. Different experiments with simulated and real data are included. The simulated data consider the two-detector case to illustrate the factors influencing alpha integration and demonstrate the improvements obtained by score fusion with respect to individual detector performance. Two real data cases have been considered. In the first, multimodal biometric data have been processed. This case is representative of scenarios in which the probability of detection is to be maximized for a given probability of false alarm. The second case is the automatic analysis of electroencephalogram and electrocardiogram records with the aim of reproducing the medical expert's detections of arousal during sleep. This case is representative of scenarios in which the probability of error is to be minimized. The generally superior performance of alpha integration confirms the value of optimizing the fusion parameters.
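For readers unfamiliar with alpha integration, the sketch below implements the standard weighted alpha-mean of positive scores. It is an assumption that this general form is the starting point; the detection-specific adaptation and the three optimization criteria described in the abstract are not reproduced here. The function name `alpha_integration` is hypothetical.

```python
import math


def alpha_integration(scores, weights, alpha):
    """Weighted alpha-mean of positive scores (weights assumed to sum to 1).

    Uses f(s) = s ** ((1 - alpha) / 2) for alpha != 1 and f(s) = log(s) for
    alpha == 1, then applies the inverse of f to the weighted sum of f(s_i).
    """
    if alpha == 1:
        return math.exp(sum(w * math.log(s) for w, s in zip(weights, scores)))
    p = (1.0 - alpha) / 2.0
    return sum(w * s ** p for w, s in zip(weights, scores)) ** (1.0 / p)


scores = [0.2, 0.8]    # outputs of two detectors (illustrative values)
weights = [0.5, 0.5]

arithmetic = alpha_integration(scores, weights, -1)  # alpha = -1: arithmetic mean
geometric = alpha_integration(scores, weights, 1)    # alpha = 1: geometric mean
harmonic = alpha_integration(scores, weights, 3)     # alpha = 3: harmonic mean
```

Varying `alpha` moves the fused score continuously between familiar means, so `alpha` and the weights are natural parameters to tune against a detection criterion.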

