HIVID2: an accurate tool to detect virus integrations in the host genome

Author(s):  
Xi Zeng ◽  
Linghao Zhao ◽  
Chenhang Shen ◽  
Yi Zhou ◽  
Guoliang Li ◽  
...  

Abstract Motivation Virus integration in the host genome is frequently reported to be closely associated with many human diseases, and the detection of virus integration is a critically challenging task. However, most existing tools show limited specificity and sensitivity. Therefore, the objective of this study is to develop a method for accurate detection of virus integration into host genomes. Results Herein, we report a novel method termed HIVID2 that is a significant upgrade of HIVID. HIVID2 performs a paired-end combination (PE-combination) for potentially integrated reads. The resulting sequences are then remapped onto the reference genomes, and both split and discordant chimeric reads are used to identify accurate integration breakpoints with high confidence. HIVID2 represents a great improvement in specificity and sensitivity, and predicts breakpoints closer to the real integrations, compared with existing methods. The advantage of our method was demonstrated using both simulated and real datasets. HIVID2 uncovered novel integration breakpoints in well-known cervical cancer-related genes, including FHIT and LRP1B, which was verified using protein expression data. In addition, HIVID2 allows the user to decide whether to automatically perform advanced analysis using the identified virus integrations. By analyzing the simulated data and real data tests, we demonstrated that HIVID2 is not only more accurate than HIVID but also better than other existing programs with respect to both sensitivity and specificity. We believe that HIVID2 will help in enhancing future research associated with virus integration. Availability and implementation HIVID2 can be accessed at https://github.com/zengxi-hada/HIVID2/. Supplementary information Supplementary data are available at Bioinformatics online.

Author(s):  
Krzysztof J Szkop ◽  
David S Moss ◽  
Irene Nobeli

Abstract Motivation We present flexible Modeling of Alternative PolyAdenylation (flexiMAP), a new beta-regression-based method implemented in R, for discovering differential alternative polyadenylation events in standard RNA-seq data. Results We show, using both simulated and real data, that flexiMAP exhibits a good balance between specificity and sensitivity and compares favourably to existing methods, especially at low fold changes. In addition, the tests on simulated data reveal some hitherto unrecognized caveats of existing methods. Importantly, flexiMAP allows modeling of multiple known covariates that often confound the results of RNA-seq data analysis. Availability and implementation The flexiMAP R package is available at: https://github.com/kszkop/flexiMAP. Scripts and data to reproduce the analysis in this paper are available at: https://doi.org/10.5281/zenodo.3689788. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 11 (2) ◽  
pp. 582
Author(s):  
Zean Bu ◽  
Changku Sun ◽  
Peng Wang ◽  
Hang Dong

Calibration between multiple sensors is a fundamental procedure for data fusion. To address the problems of large errors and tedious operation, we present a novel method to conduct the calibration between light detection and ranging (LiDAR) and camera. We invent a calibration target, which is an arbitrary triangular pyramid with three chessboard patterns on its three planes. The target contains both 3D information and 2D information, which can be utilized to obtain intrinsic parameters of the camera and extrinsic parameters of the system. In the proposed method, the world coordinate system is established through the triangular pyramid. We extract the equations of triangular pyramid planes to find the relative transformation between two sensors. One capture of camera and LiDAR is sufficient for calibration, and errors are reduced by minimizing the distance between points and planes. Furthermore, the accuracy can be increased by more captures. We carried out experiments on simulated data with varying degrees of noise and numbers of frames. Finally, the calibration results were verified by real data through incremental validation and analyzing the root mean square error (RMSE), demonstrating that our calibration method is robust and provides state-of-the-art performance.
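The point-to-plane error that this abstract describes minimizing, and the RMSE used for validation, can be illustrated with a minimal sketch. The plane representation (a normal vector and an offset satisfying normal · p + offset = 0), the function names, and the sample points are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def point_to_plane_distance(point, normal, offset):
    """Signed distance from a 3D point to the plane normal . p + offset = 0."""
    norm = math.sqrt(sum(c * c for c in normal))
    if norm == 0:
        raise ValueError("plane normal must be non-zero")
    return (sum(n * p for n, p in zip(normal, point)) + offset) / norm

def plane_rmse(points, normal, offset):
    """Root mean square of the point-to-plane distances for a set of points."""
    if not points:
        raise ValueError("need at least one point")
    squared = [point_to_plane_distance(p, normal, offset) ** 2 for p in points]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical LiDAR points scattered around the plane z = 1,
# written as normal (0, 0, 1) and offset -1.
lidar_points = [(0.0, 0.0, 1.1), (1.0, 0.0, 0.9), (0.0, 1.0, 1.0)]
rmse = plane_rmse(lidar_points, (0.0, 0.0, 1.0), -1.0)
```

In a full calibration this residual would be evaluated after applying a candidate LiDAR-to-camera transform to the points, and an optimizer would adjust the transform to reduce it; that step is omitted here.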


2019 ◽  
Vol 35 (14) ◽  
pp. i408-i416 ◽  
Author(s):  
Nuraini Aguse ◽  
Yuanyuan Qi ◽  
Mohammed El-Kebir

Abstract Motivation Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees. Results We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data at a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T. Availability and implementation https://github.com/elkebir-group/MCT. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
J. Cobb Scott ◽  
Tyler M. Moore ◽  
David R Roalf ◽  
Theodore D. Satterthwaite ◽  
Daniel H. Wolf ◽  
...  

Objective: Data from neurocognitive assessments may not be accurate in the context of factors impacting validity, such as disengagement, unmotivated responding, or intentional underperformance. Performance validity tests (PVTs) were developed to address these phenomena and assess underperformance on neurocognitive tests. However, PVTs can be burdensome, rely on cutoff scores that reduce information, do not examine potential variations in task engagement across a battery, and are typically not well-suited to acquisition of large cognitive datasets. Here we describe the development of novel performance validity measures that could address some of these limitations by leveraging psychometric modeling from data embedded within the Penn Computerized Neurocognitive Battery (PennCNB). Method: We first developed these validity measures using simulations of invalid response patterns with parameters drawn from real data. Next, we examined their application in two large, independent samples: 1) children and adolescents from the Philadelphia Neurodevelopmental Cohort (n=9,498); and 2) adult servicemembers from the Marine Resiliency Study-II (n=1,444). Results: Our performance validity metrics detected patterns of invalid responding in simulated data, even at subtle levels. Furthermore, a combination of these metrics significantly predicted previously established validity rules for these tests in both developmental and adult datasets. Moreover, most clinical diagnostic groups did not show reduced validity estimates. Conclusion: These results provide proof-of-concept evidence for multivariate, data-driven performance validity metrics. These metrics offer a novel method for determining the performance validity for individual neurocognitive tests that is scalable, applicable across different tests, less burdensome, and dimensional. However, more research is needed into their application.


2020 ◽  
Vol 60 (8) ◽  
pp. 999
Author(s):  
Lianjie Hou ◽  
Wenshuai Liang ◽  
Guli Xu ◽  
Bo Huang ◽  
Xiquan Zhang ◽  
...  

A low-density single-nucleotide polymorphism (LD-SNP) panel is one effective way to reduce the cost of genomic selection in animal breeding. The present study proposes a new type of LD-SNP panel called the mixed low-density (MLD) panel, which simultaneously considers SNPs with a substantial effect, estimated by Bayes method B (BayesB) from many traits, and an evenly spaced distribution. Simulated and real data were used to compare the imputation accuracy and genomic-selection accuracy of the two types of LD-SNP panels. The result of genotype imputation for simulated data showed that the number of quantitative trait loci (QTL) had limited influence on the imputation accuracy only for MLD panels. The evenly spaced (ELD) panel was not affected by QTL. For real data, ELD performed slightly better than MLD when the panel contained 500 or 1000 SNPs. However, this advantage vanished quickly as the density increased. The result of genomic selection for simulated data using BayesB showed that MLD performed much better than ELD when the number of QTL was 100. For real data, MLD also outperformed ELD in growth and carcass traits when using BayesB. In conclusion, the MLD strategy is superior to ELD in genomic selection under most situations.
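The difference between an evenly spaced panel and a mixed panel can be sketched as follows. The abstract does not specify how the two criteria are combined, so the selection rule here (take the SNPs with the largest absolute effects first, then fill the rest evenly by rank along the map), the function names, and the toy data are all assumptions for illustration.

```python
def evenly_spaced_panel(positions, panel_size):
    """Pick panel_size SNP indices spread evenly by rank along the sorted map."""
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    if panel_size <= 0:
        return []
    if panel_size >= len(order):
        return order
    if panel_size == 1:
        return [order[len(order) // 2]]
    last = len(order) - 1
    # Integer arithmetic keeps the chosen ranks distinct and includes both ends.
    return [order[(k * last) // (panel_size - 1)] for k in range(panel_size)]

def mixed_panel(positions, effects, n_top, n_even):
    """Combine the n_top largest-|effect| SNPs with n_even evenly spaced SNPs."""
    by_effect = sorted(range(len(effects)), key=lambda i: abs(effects[i]), reverse=True)
    chosen = set(by_effect[:n_top])
    remaining = [i for i in range(len(positions)) if i not in chosen]
    even = evenly_spaced_panel([positions[i] for i in remaining], n_even)
    chosen.update(remaining[j] for j in even)
    return sorted(chosen, key=lambda i: positions[i])

# Hypothetical map positions and estimated SNP effects for ten SNPs.
positions = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
effects = [0.0, 0.1, 0.9, 0.0, 0.0, 0.2, 0.0, 0.8, 0.0, 0.0]
eld = evenly_spaced_panel(positions, 4)
panel = mixed_panel(positions, effects, n_top=2, n_even=3)
```

Here `eld` ignores the effects entirely, while `panel` always retains the two large-effect SNPs (indices 2 and 7) and spreads the remaining picks across the map.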


2019 ◽  
Vol 11 (11) ◽  
pp. 1297 ◽  
Author(s):  
Mingyang Shang ◽  
Xiaolan Qiu ◽  
Bing Han ◽  
Chibiao Ding ◽  
Yuxin Hu

Azimuth multichannel (AMC) synthetic aperture radar (SAR), which contains multiple receiving antennas along the azimuth, can prevent the minimum antenna area constraint and provide high-resolution and wide-swath (HRWS) SAR images. Channel calibration and along-track baseline estimation are important topics in an AMC SAR system, since they have a great impact on image quality. Based on the signal model for stationary target of AMC SAR, this paper first analyses the influence of the along-track baseline and channel imbalances on SAR images by simulation. Then, a novel method to simultaneously estimate the along-track baseline, phase imbalance and range sample time imbalance (RSTI) based on the azimuth cross-correlation in the two-dimensional frequency domain is addressed. In addition, with the help of simulations and real data acquired by Gaofen-3 (GF-3), the effectiveness of this method is verified by comparing with some existing methods. Finally, this paper analyzes the estimation accuracy of this method under different scenarios and signal-to-noise ratios (SNRs), and points out the direction for future research.


Author(s):  
Giacomo Baruzzo ◽  
Ilaria Patuzzi ◽  
Barbara Di Camillo

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (23) ◽  
pp. 4955-4961
Author(s):  
Yongzhuang Liu ◽  
Jian Liu ◽  
Yadong Wang

Abstract Motivation Whole-genome sequencing (WGS) of tumor–normal sample pairs is a powerful approach for comprehensively characterizing germline copy number variations (CNVs) and somatic copy number alterations (SCNAs) in cancer research and clinical practice. Existing computational approaches for detecting copy number events cannot detect germline CNVs and SCNAs simultaneously, and yield low accuracy for SCNAs. Results In this study, we developed TumorCNV, a novel approach for jointly detecting germline CNVs and SCNAs from WGS data of the matched tumor–normal sample pair. We compared TumorCNV with existing copy number event detection approaches using the simulated data and real data for the COLO-829 melanoma cell line. The experimental results showed that TumorCNV outperformed existing approaches. Availability and implementation The software TumorCNV is implemented using a combination of Java and R, and it is freely available from the website at https://github.com/yongzhuang/TumorCNV. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Achmad Syahrul Choir ◽  
Nur Iriawan ◽  
Brodjol Sutijo Suprih Ulama ◽  
Mohammad Dokhi

The MSNBurr and MSTBurr distributions have been developed as Neo-Normal distributions that represent a relaxation of normality. The difference between them is that the MSTBurr's peak is below MSNBurr's. In this paper, we propose the MSEPBurr distribution, whose peak can be not only lower but also higher than that of MSNBurr. Furthermore, we study several properties of MSEPBurr, such as mean, variance, skewness, kurtosis, and quantile. The MSEPBurr parameters are estimated by using the Bayesian approach with the BUGS language implementation for its computation. We employ a simulation study and use existing data to illustrate the application of the regression model. In real data, we notice that MSEPBurr performs similarly to MSNBurr and MSTBurr, in that they outperform the Normal and Student-t distributions in Australian athlete data because their skewness can accommodate a long left tail well. However, their performance is worse than that of the Student-t model in chemical reaction rate data because their skewness cannot fully accommodate a long right tail. Although in general their performance is the same, we observe that the MSEPBurr performs better than the MSNBurr and the MSTBurr in some simulated data.


2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i362-i370
Author(s):  
Palash Sashittal ◽  
Mohammed El-Kebir

Abstract Motivation The combination of genomic and epidemiological data holds the potential to enable accurate pathogen transmission history inference. However, the inference of outbreak transmission histories remains challenging due to various factors such as within-host pathogen diversity and multi-strain infections. Current computational methods ignore within-host diversity and/or multi-strain infections, often failing to accurately infer the transmission history. Thus, there is a need for efficient computational methods for transmission tree inference that accommodate the complexities of real data. Results We formulate the direct transmission inference (DTI) problem for inferring transmission trees that support multi-strain infections given a timed phylogeny and additional epidemiological data. We establish hardness for the decision and counting version of the DTI problem. We introduce Transmission Tree Uniform Sampler (TiTUS), a method that uses SATISFIABILITY to almost uniformly sample from the space of transmission trees. We introduce criteria that prioritize parsimonious transmission trees that we subsequently summarize using a novel consensus tree approach. We demonstrate TiTUS’s ability to accurately reconstruct transmission trees on simulated data as well as a documented HIV transmission chain. Availability and implementation https://github.com/elkebir-group/TiTUS. Supplementary information Supplementary data are available at Bioinformatics online.

