A new statistic for efficient detection of repetitive sequences

2019 ◽  
Vol 35 (22) ◽  
pp. 4596-4606 ◽  
Author(s):  
Sijie Chen ◽  
Yixin Chen ◽  
Fengzhu Sun ◽  
Michael S Waterman ◽  
Xuegong Zhang

Abstract Motivation Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. Results Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. Availability and implementation The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. Supplementary information Supplementary data are available at Bioinformatics online.
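For orientation, the classic D2 statistic that the abstract cites as inspiration can be sketched in a few lines. D2 counts shared k-mers between two sequences as the sum, over all words of length k, of the product of their occurrence counts. This is a minimal illustration of the D2 family only; it is not the D2R statistic or the linear-time algorithm of the paper, whose definitions are not given in the abstract. The function names `kmer_counts` and `d2` are hypothetical and chosen for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Classic D2: sum over k-mers of (count in seq_a) * (count in seq_b).

    Illustrative only; this is NOT the D2R statistic from the paper.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only k-mers present in both sequences contribute a non-zero product.
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)
```

Counting k-mers with a hash table takes a single pass over each sequence, which is one reason statistics of this family are attractive when time and memory must stay linear in the sequence length.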

2018 ◽  
Author(s):  
Sijie Chen ◽  
Fengzhu Sun ◽  
Michael S. Waterman ◽  
Xuegong Zhang

Abstract Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting all types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic that can efficiently discriminate sequences with or without repetitive regions. Using the statistic, we developed an algorithm of linear complexity in both computation time and memory usage for detecting all types of repetitive sequences in multiple scenarios, including finding candidate CRISPR regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments showed that the method works well on both assembled sequences and unassembled short reads.


Author(s):  
Julia Yan ◽  
Nick Patterson ◽  
Vagheesh M Narasimhan

Abstract Summary Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this article, we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths and admixture proportions simultaneously. Through applications of miqoGraph to both simulated and real data, we show that integer optimization can greatly speed up and automate what is usually an arduous manual process. Availability and implementation https://github.com/juliayyan/PhylogeneticTrees.jl. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Jorge Alvarez-Jarreta ◽  
Patricia R S Rodrigues ◽  
Eoin Fahy ◽  
Anne O’Connor ◽  
Anna Price ◽  
...  

Abstract Summary We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate method, utilizing a target–decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, together with a lipid category scatter plot with a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools. Availability and implementation LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/lipidfinder. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Gustavo de los Campos ◽  
Torsten Pook ◽  
Agustin Gonzalez-Raymundez ◽  
Henner Simianer ◽  
George Mias ◽  
...  

Abstract Motivation Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns. Results We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each of the methods, and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate the insight that can be obtained with the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number-variants in data from breast cancer tumors. Availability The Supplementary data includes an R-implementation of each of the proposed methods as well as the scripts used in simulations and in the real-data [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Simone Ciccolella ◽  
Giulia Bernardini ◽  
Luca Denti ◽  
Paola Bonizzoni ◽  
Marco Previtali ◽  
...  

Abstract Motivation The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. Results To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to correctly classify similar and dissimilar trees, both on simulated and on real data. Availability and implementation An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. Supplementary information Supplementary data are available at Bioinformatics online.


2006 ◽  
Vol 8 ◽  
Author(s):  
Stavros D. Nikolopoulos ◽  
Leonidas Palios

In this paper, we consider the recognition problem on three classes of perfectly orderable graphs, namely, the HH-free, the HHD-free, and the Welsh-Powell opposition graphs (or WPO-graphs). In particular, we prove properties of the chordal completion of a graph and show that a modified version of the classic linear-time algorithm for testing for a perfect elimination ordering can be efficiently used to determine in O(n min{m α(n,n), m + n^2 log n}) time whether a given graph G on n vertices and m edges contains a house or a hole; this implies an O(n min{m α(n,n), m + n^2 log n})-time and O(n+m)-space algorithm for recognizing HH-free graphs, and in turn leads to an HHD-free graph recognition algorithm exhibiting the same time and space complexity. We also show that determining whether the complement \overline{G} of the graph G is HH-free can be efficiently resolved in O(n m) time using O(n^2) space, which leads to an O(n m)-time and O(n^2)-space algorithm for recognizing WPO-graphs. The previously best algorithms for recognizing HH-free, HHD-free, and WPO-graphs required O(n^3) time and O(n^2) space.
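The notion of a perfect elimination ordering mentioned in this abstract can be made concrete with a short sketch. An ordering of the vertices is a perfect elimination ordering if, for every vertex, its neighbours that come later in the ordering are pairwise adjacent. The check below is a deliberately naive version written for clarity; it is neither the classic linear-time test nor the modified version developed in the paper, and the function name and the adjacency-dictionary representation are assumptions made for this example.

```python
def is_perfect_elimination_ordering(adj, order):
    """Return True if `order` is a perfect elimination ordering.

    adj:   dict mapping each vertex to the set of its neighbours
           (undirected graph, so adjacency must be symmetric).
    order: list containing every vertex exactly once.

    Naive check for illustration; not the linear-time algorithm.
    """
    position = {v: i for i, v in enumerate(order)}
    for v in order:
        # Neighbours of v that appear after v in the ordering.
        later = [u for u in adj[v] if position[u] > position[v]]
        # They must form a clique: every pair must be adjacent.
        for i, a in enumerate(later):
            for b in later[i + 1:]:
                if b not in adj[a]:
                    return False
    return True
```

A graph admits such an ordering exactly when it is chordal, so a chordless cycle on four vertices fails the check for every ordering, while a simple path passes when its vertices are listed from one end to the other.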


2020 ◽  
Vol 36 (10) ◽  
pp. 3260-3262 ◽  
Author(s):  
Vladimir Perovic ◽  
Jeremy Y Leclercq ◽  
Neven Sumonja ◽  
Francois D Richard ◽  
Nevena Veljkovic ◽  
...  

Abstract Motivation Proteins containing tandem repeats (TRs) are abundant, frequently fold into elongated non-globular structures and perform vital functions. A number of computational tools have been developed to detect TRs in protein sequences. The blurred boundary between imperfect TR motifs and non-repetitive sequences has created the need to validate the detected TRs. Results Tally-2.0 is a scoring tool based on a machine learning (ML) approach, which allows validation of the results of TR detection. It was upgraded by using improved training datasets and additional ML features. Tally-2.0 performs at a level of 93% sensitivity, 83% specificity and an area under the receiver operating characteristic curve of 95%. Availability and implementation Tally-2.0 is available, as a web tool and as a standalone application published under Apache License 2.0, at https://bioinfo.crbm.cnrs.fr/index.php?route=tools&tool=27. It is supported on Linux. Source code is available upon request. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Jorge Alvarez-Jarreta ◽  
Patricia R.S. Rodrigues ◽  
Eoin Fahy ◽  
Anne O’Connor ◽  
Anna Price ◽  
...  

Abstract We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate (FDR) method, utilizing a target-decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, together with a lipid category scatter plot with a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools. Availability LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

