A Fast Technique for Deriving Frequent Structured Patterns from Biological Data Sets

In recent years, the completion of human genome sequencing has raised a wide range of challenging new issues in raw data analysis. In particular, the discovery of information implicitly encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually represented by patterns that occur frequently in the sequences. Biological observations have made one specific class of patterns particularly interesting: frequent structured patterns. In this respect, it is biologically meaningful to look at both "exact" and "approximate" repetitions of patterns within the available sequences. This paper contributes to this setting by providing algorithms that discover frequent structured patterns, in both "exact" and "approximate" form, in a collection of input biological sequences.
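To make the notions of "structured", "exact" and "approximate" concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper, which the abstract does not describe. It assumes that a structured pattern is a pair of equal-length boxes separated by a gap within a given range, that "approximate" means at most `max_mismatch` substitutions per box (Hamming distance), and that "frequent" means occurring in at least `quorum` sequences. All function and parameter names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(seq, box1, box2, gap_range, max_mismatch):
    """True if box1, a gap in gap_range, then box2 occurs in seq,
    allowing up to max_mismatch substitutions in each box."""
    k1, k2 = len(box1), len(box2)
    for i in range(len(seq) - k1 + 1):
        if hamming(seq[i:i + k1], box1) > max_mismatch:
            continue
        for gap in gap_range:
            j = i + k1 + gap
            if j + k2 <= len(seq) and hamming(seq[j:j + k2], box2) <= max_mismatch:
                return True
    return False


def frequent_structured_patterns(seqs, k, gap_range, quorum, max_mismatch=0):
    """Return {(box1, box2): support} for two-box patterns found in at
    least `quorum` sequences. Candidates are taken from the data itself,
    so patterns that never occur exactly are not enumerated (an
    assumption made to keep the sketch short)."""
    candidates = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            for gap in gap_range:
                j = i + k + gap
                if j + k <= len(s):
                    candidates.add((s[i:i + k], s[j:j + k]))
    result = {}
    for box1, box2 in candidates:
        support = sum(occurs(s, box1, box2, gap_range, max_mismatch) for s in seqs)
        if support >= quorum:
            result[(box1, box2)] = support
    return result
```

Calling `frequent_structured_patterns(seqs, k=3, gap_range=range(2, 4), quorum=3)` searches for exact patterns; passing `max_mismatch=1` switches to the approximate setting. The brute-force enumeration is quadratic in the number of candidates and sequences, so it only illustrates the problem statement and is not a fast technique.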


Denoising large-scale biological data using network filters

10.21203/rs.3.rs-66071/v2 ◽

2021 ◽

Author(s):

Andrew J Kavran ◽

Aaron Clauset

Keyword(s):

Large Scale ◽

Synthetic Data ◽

Interaction Network ◽

Learning Task ◽

Biological Data ◽

Data Sets ◽

Proteomics Data ◽

Life History Variation ◽

Wide Range ◽

Underlying Processes

Abstract Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy up to 43% compared to using unfiltered data. Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogenous data and correlation patterns, and this approach outperforms existing diffusion based methods. Our results on proteomics data indicate the broad potential utility of network filters to applications in systems biology.


Pattern Discovery in Biosequences

Data Mining Patterns ◽

10.4018/978-1-59904-162-9.ch004 ◽

2011 ◽

pp. 85-105 ◽

Cited By ~ 1

Author(s):

Simona Este Rombo ◽

Luigi Palopoli

Keyword(s):

Pattern Discovery ◽

Protein Sequences ◽

Biological Data ◽

Data Sets ◽

Biological Sequences ◽

Biological Functions ◽

Pattern Extraction ◽

New Methods

In recent years, the information stored in biological data sets has grown exponentially, and new methods and tools have been proposed to interpret and retrieve useful information from such data. Most biological data sets contain biological sequences (e.g., DNA and protein sequences). It is therefore important to have techniques capable of mining patterns from such sequences in order to discover interesting information. For instance, singling out common or similar sub-sequences in sets of bio-sequences is sensible, as these are usually associated with similar biological functions expressed by the corresponding macromolecules. The aim of this chapter is to explain how pattern discovery can be applied to such important biological problems, and to describe a number of relevant techniques proposed in the literature. A simple formalization of the problem is given and specialized for each of the presented approaches. This formalization should make the material easier to read and understand by providing a simple-to-follow roadmap through the diverse pattern extraction methods illustrated.


Overview of Big Data in Healthcare

Advances in Healthcare Information Systems and Administration - Handbook of Research on Data Science for Effective Healthcare Practice and Administration ◽

10.4018/978-1-5225-2515-8.ch016 ◽

2017 ◽

pp. 360-384

Author(s):

Mohammad Hossein Fazel Zarandi ◽

Reyhaneh Gamasaee

Keyword(s):

Big Data ◽

Data Analysis ◽

Best Practice ◽

Complex Structure ◽

Big Data Analysis ◽

Data Sets ◽

Exciting Field ◽

Wide Range ◽

Development Outcomes ◽

Practice Development

Big Data is a new, ubiquitous term for massive data sets with a large, varied and complex structure, which are difficult to store, analyze and visualize for further processing or results. The use of Big Data in health is a new and exciting field. A wide range of use cases for Big Data and analytics in healthcare will benefit best practice development, outcomes analysis, prediction, and surveillance. Consequently, the aim of this chapter is to provide an overview of Big Data in healthcare systems, including two applications of Big Data analysis in healthcare. The first is understanding disease outcomes by analyzing Big Data, and the second is the application of Big Data in the genetic, biological, and molecular fields. Moreover, the characteristics and challenges of healthcare Big Data analysis, as well as the technologies and software used for Big Data analysis, are reviewed.


Disentangling Multidimensional Spatio-Temporal Data into their Common and Aberrant Responses

10.1101/004259 ◽

2014 ◽

Author(s):

Young Hwan Chang ◽

Jim Korkola ◽

Dhara N. Amin ◽

Mark M. Moasser ◽

Jose M. Carmena ◽

...

Keyword(s):

Gene Expression ◽

Time Series ◽

Cell Lines ◽

Biological Data ◽

Series Data ◽

State Transitions ◽

Data Sets ◽

Wide Range ◽

Spatio Temporal ◽

Experimental Trials

With the advent of high-throughput measurement techniques, scientists and engineers are starting to grapple with massive data sets and are encountering challenges in how to organize and process them and extract information into meaningful structures. Multidimensional spatio-temporal biological data sets, such as time series gene expression under various perturbations across different cell lines, or neural spike data sets across many experimental trials, have the potential to yield insight across multiple dimensions. For this potential to be realized, we need a suitable representation to turn data into insight. Since the wide range of experiments and the (unknown) complexity of the underlying system make biological data more heterogeneous than data in other fields, we propose a method based on Robust Principal Component Analysis (RPCA), which is well suited for extracting principal components from corrupted observations. The proposed method provides a new representation of these data sets that consists of their common and aberrant responses. This representation might help users gain new insight from data. For example, identifying common event-related neural features across many experimental trials can be used as a signature to detect discrete events or state transitions. The proposed method can also be useful to biologists in clustering and analyzing gene expression time series data from a new perspective: for example, it can not only extract the canonical cell signaling response but also give insight into the heterogeneity of responses across different cell lines.
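For readers unfamiliar with RPCA, the decomposition it computes is commonly posed as the convex program known as principal component pursuit. The abstract does not say which formulation the authors use, so the symbols below are illustrative assumptions: $M$ is the data matrix, $L$ is a low-rank part that would play the role of the common response, and $S$ is a sparse part that would capture the aberrant responses.

```latex
\min_{L,\,S} \; \|L\|_{*} + \lambda \,\|S\|_{1}
\quad \text{subject to} \quad M = L + S
```

Here $\|L\|_{*}$ is the nuclear norm (the sum of the singular values of $L$) and $\|S\|_{1}$ is the entrywise $\ell_1$ norm. A commonly used default for an $n_1 \times n_2$ matrix is $\lambda = 1/\sqrt{\max(n_1, n_2)}$.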


Denoising Large-Scale Biological Data Using Network Filters

10.21203/rs.3.rs-66071/v1 ◽

2020 ◽

Author(s):

Andrew J Kavran ◽

Aaron Clauset

Keyword(s):

Large Scale ◽

Synthetic Data ◽

Interaction Network ◽

Learning Task ◽

Biological Data ◽

Data Sets ◽

Proteomics Data ◽

Life History Variation ◽

Wide Range ◽

Underlying Processes

Abstract Background: Large-scale biological data sets, e.g., transcriptomic, proteomic, or ecological, are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy up to 58% compared to using unfiltered data. Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogenous data and correlation patterns, and this approach outperforms existing diffusion based methods. Our results on proteomics data indicate the broad potential utility of network filters to applications in systems biology.


Denoising large-scale biological data using network filters

BMC Bioinformatics ◽

10.1186/s12859-021-04075-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Andrew J. Kavran ◽

Aaron Clauset

Keyword(s):

Large Scale ◽

Synthetic Data ◽

Interaction Network ◽

Learning Task ◽

Biological Data ◽

Data Sets ◽

Proteomics Data ◽

Life History Variation ◽

Wide Range ◽

Underlying Processes

Abstract Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy up to 43% compared to using unfiltered data. Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogenous data and correlation patterns, and this approach outperforms existing diffusion based methods. Our results on proteomics data indicate the broad potential utility of network filters to applications in systems biology.
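The idea of combining a measurement with those of its network neighbors can be sketched in a few lines of Python. This is an illustrative mean-smoothing filter written under stated assumptions, not the authors' implementation: the abstract does not give the filter definitions, so the averaging rule, the function name `mean_network_filter` and the `modules` argument are all hypothetical. The sketch covers only the correlated case and does not model anti-correlation.

```python
def mean_network_filter(values, neighbors, modules=None):
    """Replace each node's value by the mean of itself and its neighbors.

    values:    {node: noisy measurement}
    neighbors: {node: list of adjacent nodes in the interaction network}
    modules:   optional {node: module label}; when given, a node is only
               averaged with neighbors in its own module, mimicking a
               filter applied separately to each module (assumption).
    """
    filtered = {}
    for node, x in values.items():
        nbrs = neighbors.get(node, [])
        if modules is not None:
            nbrs = [n for n in nbrs if modules[n] == modules[node]]
        # Average over the node itself plus the retained neighbors.
        filtered[node] = (x + sum(values[n] for n in nbrs)) / (1 + len(nbrs))
    return filtered


# Path graph a - b - c with one outlying measurement on c.
values = {"a": 1.0, "b": 2.0, "c": 6.0}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
whole = mean_network_filter(values, neighbors)
by_module = mean_network_filter(values, neighbors, modules={"a": 0, "b": 0, "c": 1})
```

In the module-restricted call, node `c` sits alone in its module and keeps its original value, while `a` and `b` are averaged only with each other. This is the kind of behavior one would want when different parts of a network carry different signals.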


Interactive and coordinated visualization approaches for biological data analysis

Briefings in Bioinformatics ◽

10.1093/bib/bby019 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1513-1523 ◽

Cited By ~ 4

Author(s):

António Cruz ◽

Joel P Arrais ◽

Penousal Machado

Keyword(s):

Data Analysis ◽

Biological Data ◽

Data Sets ◽

Protein Protein Interaction ◽

Biological Data Analysis ◽

Time Series Gene Expression ◽

Protein Protein Interaction Networks ◽

Complex Relationships ◽

Meaningful Relationships ◽

Different Sources

Abstract: The field of computational biology has become largely dependent on data visualization tools to analyze the increasing quantities of data gathered through the use of new and growing technologies. Aside from the volume, which often results in large amounts of noise and complex relationships with no clear structure, the visualization of biological data sets is hindered by their heterogeneity, as data are obtained from different sources and contain a wide variety of attributes, including spatial and temporal information. This requires visualization approaches that are able to not only represent various data structures simultaneously but also provide exploratory methods that allow the identification of meaningful relationships that would not be perceptible through data analysis algorithms alone. In this article, we present a survey of visualization approaches applied to the analysis of biological data. We focus on graph-based visualizations and tools that use coordinated multiple views to represent high-dimensional multivariate data, in particular time series gene expression, protein–protein interaction networks and biological pathways. We then discuss how these methods can be used to help solve the current challenges surrounding the visualization of complex biological data sets.


mixOmics: an R package for ‘omics feature selection and multiple data integration

10.1101/108597 ◽

2017 ◽

Cited By ~ 19

Author(s):

Florian Rohart ◽

Benoît Gautier ◽

Amrit Singh ◽

Kim-Anh Lê Cao

Keyword(s):

Data Integration ◽

Large Scale ◽

Relevant Information ◽

R Package ◽

Biological Data ◽

Molecular Signature ◽

Single Type ◽

Data Sets ◽

Omics Data ◽

Wide Range

Abstract: The advent of high throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a system biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of ‘omics data available from the package.


Symmetries among Multivariate Information Measures Explored Using Möbius Operators

10.20944/preprints201811.0234.v1 ◽

2018 ◽

Author(s):

David J. Galas ◽

Nikita A. Sakhanenko

Keyword(s):

Data Analysis ◽

Operator Algebra ◽

Full Range ◽

Direct Consequence ◽

Biological Data ◽

Diverse Range ◽

Common Information ◽

Information Measures ◽

Wide Range ◽

Range Of Functions

Information-related measures are useful tools for multi-variable data analysis, as measures of dependence among variables, and as descriptions of order and disorder in biological and physical systems. Measures like marginal entropies and mutual, interaction and multi-information have long been used in a number of fields, including descriptions of systems complexity and biological data analysis. The mathematical relationships among these measures are therefore of significant inherent interest. Relations between common information measures include the duality relations based on Möbius inversion on lattices. These are the direct consequence of the symmetries of the lattices of the sets of variables (subsets ordered by inclusion). While these relationships are of significant interest there has been, to our knowledge, no systematic examination of the full range of relationships of this diverse range of functions into a unifying formalism as we do here. In this paper we define operators on functions on these lattices based on the Möbius inversions that map functions into one another (Möbius operators). We show that these operators form a simple group isomorphic to the symmetric group S3. Relations among the set of functions on the lattice are transparently expressed in terms of the operator algebra, and, applied to the information measures, can be used to derive a wide range of relationships among diverse information measures. The Möbius operator algebra is naturally generalized, which yields extensive new relationships. This formalism now provides a fundamental unification of information-related measures, and the isomorphism of all distributive lattices with the subset lattice implies an even broader application of these results.
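The Möbius inversion on the subset lattice that underlies these relations can be illustrated with a short Python sketch. It shows only the standard inversion pair on subsets ordered by inclusion, g(S) = Σ_{T⊆S} f(T) and f(S) = Σ_{T⊆S} (−1)^{|S|−|T|} g(T). It does not reproduce the operators or the group structure defined in the paper, and the function names are hypothetical.

```python
from itertools import combinations


def subsets(s):
    """Yield every subset of s as a frozenset, smallest first."""
    items = sorted(s)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def sum_over_subsets(f, universe):
    """g(S) = sum of f(T) over all T contained in S."""
    return {S: sum(f[T] for T in subsets(S)) for S in subsets(universe)}


def mobius_inverse(g, universe):
    """f(S) = sum over T contained in S of (-1)^(|S| - |T|) * g(T)."""
    return {
        S: sum((-1) ** (len(S) - len(T)) * g[T] for T in subsets(S))
        for S in subsets(universe)
    }


# Arbitrary integer-valued function on the subsets of {x, y}.
universe = frozenset({"x", "y"})
f = {
    frozenset(): 1,
    frozenset({"x"}): 2,
    frozenset({"y"}): 3,
    frozenset({"x", "y"}): 4,
}
g = sum_over_subsets(f, universe)
recovered = mobius_inverse(g, universe)
```

Applying `mobius_inverse` to `g` returns the original `f`, which is the duality the abstract refers to. With entropies of variable subsets in place of `g`, the same alternating sum is the kind of computation that relates joint entropies to interaction-type information measures, up to sign conventions that differ between authors.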
