A new statistic for efficient detection of repetitive sequences

Sijie Chen; Yixin Chen; Fengzhu Sun; Michael S Waterman; Xuegong Zhang

doi:10.1093/bioinformatics/btz262

A new statistic for efficient detection of repetitive sequences

Bioinformatics ◽

10.1093/bioinformatics/btz262 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4596-4606 ◽

Cited By ~ 1

Author(s):

Sijie Chen ◽

Yixin Chen ◽

Fengzhu Sun ◽

Michael S Waterman ◽

Xuegong Zhang

Keyword(s):

Linear Time ◽

Repetitive Sequences ◽

Real Data ◽

Space Complexity ◽

Supplementary Information ◽

Supplementary Data ◽

Efficient Detection ◽

Time And Space Complexity ◽

Multiple Scenarios ◽

Repeat Detection

Abstract

Motivation: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic, D2R, that can efficiently discriminate sequences with or without repetitive regions.

Results: Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats (CRISPR) regions from bacterial genomic or metagenomic sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads.

Availability and implementation: The code is available at https://github.com/XuegongLab/D2R_codes under the GPL 3.0 license.

Supplementary information: Supplementary data are available at Bioinformatics online.
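The abstract does not define D2R, so the following is only a minimal sketch of the idea it builds on: the classic D2 statistic is the inner product of two sequences' k-mer count vectors, and k-mer counting takes a single pass over the sequence (linear time for a fixed k). The function `repeat_score` is a hypothetical stand-in for illustration, not the paper's D2R; the names `kmer_counts`, `d2` and `repeat_score` and the choice of k are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one left-to-right pass."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Classic D2: inner product of the two k-mer count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers that are absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def repeat_score(seq, k):
    """Hypothetical self-comparison score (NOT the paper's D2R).

    Counts ordered pairs of distinct positions that share a k-mer,
    divided by the number of k-mers. It is 0 when every k-mer is unique
    and grows as k-mers recur.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n * (n - 1) for n in counts.values()) / total


if __name__ == "__main__":
    print(repeat_score("ACGTACGTACGT", 4))  # repeated unit, score above zero
    print(repeat_score("ACGTTGCA", 4))      # all 4-mers distinct, score of zero
```

A sequence made of a repeated unit scores higher than one whose k-mers are all distinct, which is the kind of discrimination the abstract describes; the actual statistic and its thresholds are in the paper and the linked repository.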


A new statistic for efficient detection of repetitive sequences

10.1101/420745 ◽

2018 ◽

Author(s):

Sijie Chen ◽

Fengzhu Sun ◽

Michael S. Waterman ◽

Xuegong Zhang

Keyword(s):

Linear Complexity ◽

Repetitive Sequences ◽

Computation Time ◽

Real Data ◽

Genomic Sequences ◽

Memory Usage ◽

Short Reads ◽

Efficient Detection ◽

Multiple Scenarios ◽

Repeat Detection

Abstract

Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting all types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic that can efficiently discriminate sequences with or without repetitive regions. Using the statistic, we developed an algorithm of linear complexity in both computation time and memory usage for detecting all types of repetitive sequences in multiple scenarios, including finding candidate CRISPR regions from bacterial genomic or metagenomic sequences. Simulation and real data experiments showed that the method works well on both assembled sequences and unassembled short reads.


miqoGraph: fitting admixture graphs using mixed-integer quadratic optimization

Bioinformatics ◽

10.1093/bioinformatics/btaa988 ◽

2020 ◽

Author(s):

Julia Yan ◽

Nick Patterson ◽

Vagheesh M Narasimhan

Keyword(s):

Genetic Relationship ◽

Real Data ◽

Quadratic Optimization ◽

Supplementary Information ◽

Mixed Integer ◽

Supplementary Data ◽

Integer Optimization ◽

Speed Up

Abstract

Summary: Admixture graphs represent the genetic relationship between a set of populations through splits, drift and admixture. In this article, we present the Julia package miqoGraph, which uses mixed-integer quadratic optimization to fit topology, drift lengths and admixture proportions simultaneously. Through applications of miqoGraph to both simulated and real data, we show that integer optimization can greatly speed up and automate what is usually an arduous manual process.

Availability and implementation: https://github.com/juliayyan/PhylogeneticTrees.jl.

Supplementary information: Supplementary data are available at Bioinformatics online.


LipidFinder 2.0: advanced informatics pipeline for lipidomics discovery applications

Bioinformatics ◽

10.1093/bioinformatics/btaa856 ◽

2020 ◽

Author(s):

Jorge Alvarez-Jarreta ◽

Patricia R S Rodrigues ◽

Eoin Fahy ◽

Anne O’Connor ◽

Anna Price ◽

...

Keyword(s):

Real Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scatter Plot ◽

Lipid Profiling ◽

False Discovery ◽

False Discovery Rate Method ◽

Rate Method ◽

Assess Data Quality ◽

Lipid Structures

Abstract

Summary: We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate method, utilizing a target–decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, together with a lipid category scatter plot that uses a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools.

Availability and implementation: LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/lipidfinder.

Contact: [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.


Analysis of variance when both input and output sets are high-dimensional

10.1101/2020.02.15.950949 ◽

2020 ◽

Author(s):

Gustavo de los Campos ◽

Torsten Pook ◽

Agustin Gonzalez-Raymundez ◽

Henner Simianer ◽

George Mias ◽

...

Keyword(s):

Gene Expression ◽

Linear Span ◽

Copy Number Variants ◽

Real Data ◽

Supplementary Information ◽

High Dimensional ◽

Supplementary Data ◽

Random Effects Models ◽

Input And Output ◽

Data Layers

Abstract

Motivation: Modern genomic data sets often involve multiple data-layers (e.g., DNA-sequence, gene expression), each of which itself can be high-dimensional. The biological processes underlying these data-layers can lead to intricate multivariate association patterns.

Results: We propose and evaluate two methods for analysis of variance when both input and output sets are high-dimensional. Our approach uses random effects models to estimate the proportion of variance of vectors in the linear span of the output set that can be explained by regression on the input set. We consider a method based on an orthogonal basis (Eigen-ANOVA) and one that uses random vectors (Monte Carlo ANOVA, MC-ANOVA) in the linear span of the output set. We used simulations to assess the bias and variance of each of the methods, and to compare them with those of Partial Least Squares (PLS), an approach commonly used in multivariate high-dimensional regressions. The MC-ANOVA method gave nearly unbiased estimates in all the simulation scenarios considered. Estimates produced by Eigen-ANOVA and PLS had noticeable biases. Finally, we demonstrate the insight that can be obtained with the use of MC-ANOVA and Eigen-ANOVA by applying these two methods to the study of multi-locus linkage disequilibrium in chicken genomes and to the assessment of inter-dependencies between gene expression, methylation and copy-number variants in data from breast cancer tumors.

Availability: The Supplementary data includes an R implementation of each of the proposed methods as well as the scripts used in simulations and in the real-data [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.


Triplet-based similarity score for fully multilabeled trees with poly-occurring labels

Bioinformatics ◽

10.1093/bioinformatics/btaa676 ◽

2020 ◽

Author(s):

Simone Ciccolella ◽

Giulia Bernardini ◽

Luca Denti ◽

Paola Bonizzoni ◽

Marco Previtali ◽

...

Keyword(s):

Open Source ◽

Evolutionary History ◽

Similarity Measures ◽

Real Data ◽

Similarity Score ◽

Supplementary Information ◽

Supplementary Data ◽

Wide Range ◽

Golden Standard ◽

History Of

Abstract

Motivation: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases.

Results: To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data.

Availability and implementation: An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim.

Supplementary information: Supplementary data are available at Bioinformatics online.


FLEXMAP-a neural network for the traveling salesman problem with linear time and space complexity

[Proceedings] 1991 IEEE International Joint Conference on Neural Networks ◽

10.1109/ijcnn.1991.170519 ◽

1991 ◽

Cited By ~ 11

Author(s):

B. Fritzke ◽

P. Wilke

Keyword(s):

Neural Network ◽

Traveling Salesman Problem ◽

Linear Time ◽

Traveling Salesman ◽

Space Complexity ◽

Time And Space ◽

The Traveling Salesman Problem ◽

Time And Space Complexity


Recognizing HH-free, HHD-free, and Welsh-Powell Opposition Graphs

Discrete Mathematics & Theoretical Computer Science ◽

10.46298/dmtcs.370 ◽

2006 ◽

Vol 8 ◽

Author(s):

Stavros D. Nikolopoulos ◽

Leonidas Palios

Keyword(s):

Linear Time ◽

Time Algorithm ◽

Recognition Algorithm ◽

Space Complexity ◽

Linear Time Algorithm ◽

Free Graph ◽

Graph Recognition ◽

International Audience ◽

Time And Space Complexity ◽

Perfect Elimination Ordering

In this paper, we consider the recognition problem on three classes of perfectly orderable graphs, namely, the HH-free, the HHD-free, and the Welsh-Powell opposition graphs (or WPO-graphs). In particular, we prove properties of the chordal completion of a graph and show that a modified version of the classic linear-time algorithm for testing for a perfect elimination ordering can be efficiently used to determine in $O(n \min\{m\,\alpha(n,n),\ m + n^2 \log n\})$ time whether a given graph $G$ on $n$ vertices and $m$ edges contains a house or a hole; this implies an $O(n \min\{m\,\alpha(n,n),\ m + n^2 \log n\})$-time and $O(n+m)$-space algorithm for recognizing HH-free graphs, and in turn leads to an HHD-free graph recognition algorithm exhibiting the same time and space complexity. We also show that determining whether the complement $\overline{G}$ of the graph $G$ is HH-free can be efficiently resolved in $O(nm)$ time using $O(n^2)$ space, which leads to an $O(nm)$-time and $O(n^2)$-space algorithm for recognizing WPO-graphs. The previously best algorithms for recognizing HH-free, HHD-free, and WPO-graphs required $O(n^3)$ time and $O(n^2)$ space.


Tally-2.0: upgraded validator of tandem repeat detection in protein sequences

Bioinformatics ◽

10.1093/bioinformatics/btaa121 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3260-3262 ◽

Cited By ~ 1

Author(s):

Vladimir Perovic ◽

Jeremy Y Leclercq ◽

Neven Sumonja ◽

Francois D Richard ◽

Nevena Veljkovic ◽

...

Keyword(s):

Operating Characteristic ◽

Tandem Repeats ◽

Characteristic Curve ◽

Repetitive Sequences ◽

Protein Sequences ◽

Supplementary Information ◽

Web Tool ◽

Vital Functions ◽

Scoring Tool ◽

Repeat Detection

Abstract

Motivation: Proteins containing tandem repeats (TRs) are abundant, frequently fold into elongated non-globular structures and perform vital functions. A number of computational tools have been developed to detect TRs in protein sequences. The blurred boundary between imperfect TR motifs and non-repetitive sequences has created the need to validate the detected TRs.

Results: Tally-2.0 is a scoring tool based on a machine learning (ML) approach that validates the results of TR detection. It was upgraded by using improved training datasets and additional ML features. Tally-2.0 performs at a level of 93% sensitivity, 83% specificity and an area under the receiver operating characteristic curve of 95%.

Availability and implementation: Tally-2.0 is available as a web tool and as a standalone application published under Apache License 2.0 at https://bioinfo.crbm.cnrs.fr/index.php?route=tools&tool=27. It is supported on Linux. Source code is available upon request.

Supplementary information: Supplementary data are available at Bioinformatics online.
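For readers less familiar with the reported metrics, the short sketch below shows how sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical and chosen only to reproduce the reported percentages; they are not the Tally-2.0 evaluation data, and the function names are assumptions.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of genuine tandem repeats that the validator accepts."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-repetitive sequences that the validator rejects."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts per 100 positives and 100 negatives.
    print(f"sensitivity = {sensitivity(93, 7):.2f}")
    print(f"specificity = {specificity(83, 17):.2f}")
```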


A Simple Modification in CMA-ES Achieving Linear Time and Space Complexity

Parallel Problem Solving from Nature – PPSN X - Lecture Notes in Computer Science ◽

10.1007/978-3-540-87700-4_30 ◽

2008 ◽

pp. 296-305 ◽

Cited By ~ 79

Author(s):

Raymond Ros ◽

Nikolaus Hansen

Keyword(s):

Linear Time ◽

Space Complexity ◽

Time And Space ◽

Simple Modification ◽

Time And Space Complexity


LipidFinder 2.0: advanced informatics pipeline for lipidomics discovery applications

10.1101/2020.08.16.250878 ◽

2020 ◽

Author(s):

Jorge Alvarez-Jarreta ◽

Patricia R.S. Rodrigues ◽

Eoin Fahy ◽

Anne O’Connor ◽

Anna Price ◽

...

Keyword(s):

Open Access ◽

Real Data ◽

Supplementary Information ◽

Supplementary Data ◽

Scatter Plot ◽

Lipid Profiling ◽

Link Type ◽

False Discovery ◽

Assess Data Quality ◽

Lipid Structures

Abstract

We present LipidFinder 2.0, incorporating four new modules that apply artefact filters, remove lipid and contaminant stacks, in-source fragments and salt clusters, and a new isotope deletion method which is significantly more sensitive than available open-access alternatives. We also incorporate a novel false discovery rate (FDR) method, utilizing a target-decoy strategy, which allows users to assess data quality. A renewed lipid profiling method is introduced which searches three different databases from LIPID MAPS and returns bulk lipid structures only, together with a lipid category scatter plot that uses a color-blind-friendly palette. An API interface with XCMS Online is made available on LipidFinder’s online version. We show using real data that LipidFinder 2.0 provides a significant improvement over non-lipid metabolite filtering and lipid profiling, compared to available tools.

Availability: LipidFinder 2.0 is freely available at https://github.com/ODonnell-Lipidomics/LipidFinder and http://lipidmaps.org/resources/tools/[email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
