Denoising large-scale biological data using network filters

2020 ◽  
Author(s):  
Andrew J Kavran ◽  
Aaron Clauset

Abstract Background: Large-scale biological data sets, e.g., transcriptomic, proteomic, or ecological, are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy up to 58% compared to using unfiltered data. Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters for applications in systems biology.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Andrew J. Kavran ◽  
Aaron Clauset

Abstract Background Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Results We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy up to 43% compared to using unfiltered data. Conclusions Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters for applications in systems biology.
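To make the filtering idea concrete, below is a minimal sketch of a mean-value network filter in Python, assuming networkx for the interaction network. The function name, the alpha mixing weight, and the reflection rule for anti-correlated neighbors are illustrative assumptions, not the authors' implementation.

import networkx as nx
import numpy as np

def network_filter(G, values, signs=None, alpha=0.5):
    # G: networkx.Graph over the measured entities (e.g., proteins).
    # values: dict mapping node -> noisy measurement.
    # signs: optional dict mapping an edge (u, v) to +1 (correlated)
    #   or -1 (anti-correlated); +1 is assumed when absent.
    # alpha: mixing weight between a node's own value and the
    #   neighborhood estimate (an illustrative choice, not tuned).
    filtered = {}
    for node in G:
        nbrs = list(G.neighbors(node))
        if not nbrs:
            filtered[node] = values[node]
            continue
        estimates = []
        for nbr in nbrs:
            s = 1 if signs is None else signs.get((node, nbr), signs.get((nbr, node), 1))
            # Reflect anti-correlated neighbors about the node's own value
            # so they still vote for the same underlying signal.
            estimates.append(values[nbr] if s > 0 else 2 * values[node] - values[nbr])
        filtered[node] = (1 - alpha) * values[node] + alpha * np.mean(estimates)
    return filtered

A module-wise variant, mirroring the decomposition step the abstract describes, would first partition the graph (e.g., with a community detection routine) and apply a separate filter within each module.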


2020 ◽  
Author(s):  
Andrew J. Kavran ◽  
Aaron Clauset

Large-scale biological data sets, e.g., transcriptomic, proteomic, or ecological, are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation. Here we describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may be first decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy up to 58% compared to using unfiltered data. These results indicate the broad potential utility of network-based filters to applications in systems biology.

Author Summary: System-wide measurements of many biological signals, whether derived from molecules, cells, or entire organisms, are often noisy. Removing or mitigating this noise prior to analysis can improve our understanding and predictions of biological phenomena. We describe a general way to denoise biological data that can account for both correlation and anti-correlation between different measurements. These “network filters” take as input a set of biological measurements, e.g., metabolite concentration, animal traits, neuron activity, or gene expression, and a network of how those measurements are biologically related, e.g., a metabolic network, food web, brain connectome, or protein-protein interaction network. Measurements are then “filtered” for correlated or anti-correlated noise using a set of other measurements that are identified using the network. We investigate the accuracy of these filters in synthetic and real-world data sets, and find that they can substantially reduce noise of different levels and structure. By denoising large-scale biological data sets, network filters have the potential to improve the analysis of many types of biological data.


2017 ◽  
Author(s):  
Florian Rohart ◽  
Benoît Gautier ◽  
Amrit Singh ◽  
Kim-Anh Lê Cao

Abstract The advent of high-throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, and metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have focused on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of ‘omics data available from the package.
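mixOmics itself is an R package; as a language-neutral illustration of the underlying PLS idea, here is a sketch using scikit-learn's PLS implementation in Python. The data shapes and variable names are assumptions for the example, and this is not the mixOmics API.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X_transcripts = rng.normal(size=(50, 200))  # 50 samples x 200 transcripts
Y_proteins = rng.normal(size=(50, 80))      # 50 samples x 80 proteins

# Project both 'omics blocks onto a few shared latent components.
pls = PLSRegression(n_components=2)
pls.fit(X_transcripts, Y_proteins)
X_scores, Y_scores = pls.transform(X_transcripts, Y_proteins)

# Features with large absolute weight on a component are candidate
# members of a molecular signature.
top_transcripts = np.argsort(np.abs(pls.x_weights_[:, 0]))[::-1][:10]
print(top_transcripts)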


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 741 ◽  
Author(s):  
Kevin Rue-Albrecht ◽  
Federico Marini ◽  
Charlotte Soneson ◽  
Aaron T.L. Lun

Data exploration is critical to the comprehension of large biological data sets generated by high-throughput assays such as sequencing. However, most existing tools for interactive visualisation are limited to specific assays or analyses. Here, we present the iSEE (Interactive SummarizedExperiment Explorer) software package, which provides a general visual interface for exploring data in a SummarizedExperiment object. iSEE is directly compatible with many existing R/Bioconductor packages for analysing high-throughput biological data, and provides useful features such as simultaneous examination of (meta)data and analysis results, dynamic linking between plots, and code tracking for reproducibility. We demonstrate the utility and flexibility of iSEE by applying it to explore a range of real transcriptomics and proteomics data sets.


2017 ◽  
Author(s):  
Ross Mounce

In this thesis I attempt to gather together a wide range of cladistic analyses of fossil and extant taxa representing a diverse array of phylogenetic groups. I use these data to quantitatively compare the effect of fossil taxa relative to extant taxa in terms of support for relationships, the number of most parsimonious trees (MPTs), and leaf stability. In line with previous studies, I find that the effects of fossil taxa are seldom different from those of extant taxa, although I highlight some interesting exceptions. I also use these data to compare the phylogenetic signal within vertebrate morphological data sets, by comparing cranial data to postcranial data. Comparisons between molecular data and morphological data have previously been well explored, as have signals between different molecular loci, but comparative signal within morphological data sets is much less commonly characterized, and certainly not across a wide array of clades. With this analysis I show that there are many studies in which the evidence provided by cranial data appears to be significantly incongruent with the postcranial data, more than one would expect from the effects of chance and noise alone. I devise and implement a modification to a rarely used measure of homoplasy that will hopefully encourage its wider usage; previously it had an undesirable bias associated with the distribution of missing data in a data set, but my modification controls for this. I also undertake an in-depth and extensive review of the ILD test, noting that it is often misused or reported poorly, even in recent studies. Finally, in attempting to collect data and metadata on a large scale, I uncovered inefficiencies in the research publication system that obstruct the re-use of data and scientific progress. I highlight the importance of replication and reproducibility; even simple reanalysis of high-profile papers can turn up very different results. Data are highly valuable and thus must be retained and made available for further re-use to maximize the overall return on research investment.


2021 ◽  
Vol 12 ◽  
Author(s):  
Genís Calderer ◽  
Marieke L. Kuijjer

Networks are useful tools to represent and analyze interactions on a large, or genome-wide scale and have therefore been widely used in biology. Many biological networks—such as those that represent regulatory interactions, drug-gene, or gene-disease associations—are of a bipartite nature, meaning they consist of two different types of nodes, with connections only forming between the different node sets. Analysis of such networks requires methodologies that are specifically designed to handle their bipartite nature. Community structure detection is a method used to identify clusters of nodes in a network. This approach is especially helpful in large-scale biological network analysis, as it can find structure in networks that often resemble a “hairball” of interactions in visualizations. Often, the communities identified in biological networks are enriched for specific biological processes and thus allow one to assign drugs, regulatory molecules, or diseases to such processes. In addition, comparison of community structures between different biological conditions can help to identify how network rewiring may lead to tissue development or disease, for example. In this mini review, we give a theoretical basis of different methods that can be applied to detect communities in bipartite biological networks. We introduce and discuss different scores that can be used to assess the quality of these community structures. We then apply a wide range of methods to a drug-gene interaction network to highlight the strengths and weaknesses of these methods in their application to large-scale, bipartite biological networks.
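As a small illustration of one strategy the review covers, the sketch below builds a toy drug-gene bipartite network in Python with networkx, projects it onto one node set, and clusters the projection with a standard unipartite method. All node names are illustrative; projection discards information, which is one reason bipartite-specific null models (such as Barber's bipartite modularity) are often preferable.

import networkx as nx
from networkx.algorithms import bipartite, community

# Toy drug-gene interaction network; all names are illustrative.
B = nx.Graph()
drugs = ["drugA", "drugB", "drugC"]
genes = ["gene1", "gene2", "gene3", "gene4"]
B.add_nodes_from(drugs, bipartite=0)
B.add_nodes_from(genes, bipartite=1)
B.add_edges_from([("drugA", "gene1"), ("drugA", "gene2"),
                  ("drugB", "gene2"), ("drugB", "gene3"),
                  ("drugC", "gene3"), ("drugC", "gene4")])

# Project onto the gene set, weighting edges by shared drug neighbors,
# then apply a standard unipartite community detection method.
gene_proj = bipartite.weighted_projected_graph(B, genes)
gene_comms = community.greedy_modularity_communities(gene_proj, weight="weight")
print([sorted(c) for c in gene_comms])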


Author(s):  
Sacha J. van Albada ◽  
Jari Pronold ◽  
Alexander van Meegen ◽  
Markus Diesmann

Abstract We are entering an age of ‘big’ computational neuroscience, in which neural network models are increasing in size and in numbers of underlying data sets. Consolidating the zoo of models into large-scale models simultaneously consistent with a wide range of data is only possible through the effort of large teams, which can be spread across multiple research institutions. To ensure that computational neuroscientists can build on each other’s work, it is important to make models publicly available as well-documented code. This chapter describes such an open-source model, which relates the connectivity structure of all vision-related cortical areas of the macaque monkey with their resting-state dynamics. We give a brief overview of how to use the executable model specification, which employs NEST as simulation engine, and show its runtime scaling. The solutions found serve as an example for organizing the workflow of future models from the raw experimental data to the visualization of the results, expose the challenges, and give guidance for the construction of an ICT infrastructure for neuroscience.
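The multi-area model itself is far larger, but the PyNEST workflow it builds on can be sketched in a few lines, assuming NEST 3.x (where the recording device is called spike_recorder). Population sizes, rates, and weights below are arbitrary toy values, not parameters of the chapter's model.

import nest  # PyNEST interface, assuming NEST 3.x

nest.ResetKernel()

# Two small populations of leaky integrate-and-fire neurons.
exc = nest.Create("iaf_psc_exp", 80)
inh = nest.Create("iaf_psc_exp", 20)

# Poisson background drive and a spike recorder.
noise = nest.Create("poisson_generator", params={"rate": 8000.0})
rec = nest.Create("spike_recorder")

nest.Connect(noise, exc + inh, syn_spec={"weight": 20.0})
nest.Connect(exc, exc + inh,
             conn_spec={"rule": "fixed_indegree", "indegree": 8},
             syn_spec={"weight": 15.0})
nest.Connect(inh, exc + inh,
             conn_spec={"rule": "fixed_indegree", "indegree": 2},
             syn_spec={"weight": -60.0})
nest.Connect(exc, rec)

nest.Simulate(1000.0)  # one second of biological time, in milliseconds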


2017 ◽  
Vol 44 (2) ◽  
pp. 203-229 ◽  
Author(s):  
Javier D Fernández ◽  
Miguel A Martínez-Prieto ◽  
Pablo de la Fuente Redondo ◽  
Claudio Gutiérrez

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.
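The paper defines its own metrics; as a toy illustration of computing structural indicators over RDF in Python with rdflib, the snippet below reports two simple stand-ins (mean subject out-degree and predicate reuse), which are assumptions for the example rather than the authors' definitions. The file path is a placeholder.

from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")  # placeholder path

triples = len(g)
subjects = set(g.subjects())
predicates = set(g.predicates())

# Simple structural indicators: how many triples each subject
# contributes on average, and how heavily predicates are reused.
avg_out_degree = triples / len(subjects) if subjects else 0.0
predicate_reuse = triples / len(predicates) if predicates else 0.0
print(f"{triples} triples; mean subject out-degree {avg_out_degree:.1f}; "
      f"mean predicate reuse {predicate_reuse:.1f}")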


Sensors ◽  
2019 ◽  
Vol 19 (14) ◽  
pp. 3158
Author(s):  
Jian Yang ◽  
Xiaojuan Ban ◽  
Chunxiao Xing

With the rapid development of mobile networks and smart terminals, mobile crowdsourcing has aroused the interest of relevant scholars and industries. In this paper, we propose a new solution to the problem of user selection in mobile crowdsourcing systems. Existing user selection schemes mainly either (1) find a subset of users that maximizes crowdsourcing quality under a given budget constraint, or (2) find a subset of users that minimizes cost while meeting a minimum crowdsourcing quality requirement. However, these solutions fall short of simultaneously maximizing the task's quality of service and minimizing its cost. Inspired by the marginalism principle in economics, we select a new user only when the marginal gain of the newly joined user is higher than the cost of payment and the marginal cost associated with integration. We model the scheme as a marginalism problem of mobile crowdsourcing user selection (MCUS-marginalism), rigorously prove the MCUS-marginalism problem to be NP-hard, and propose a greedy random adaptive procedure with annealing randomness (GRASP-AR) to maximize the gain and minimize the cost of the task. The effectiveness and efficiency of our proposed approach are clearly verified by large-scale experimental evaluations on both real-world and synthetic data sets.
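The selection rule itself is easy to sketch. Below is a plain greedy version of the marginalism criterion in Python: add a user only while the marginal gain exceeds payment plus integration cost. The authors' GRASP-AR adds randomized construction and annealing on top of this idea; none of that is reproduced here, and all names and values are illustrative.

def select_users(candidates, gain, cost, integration_cost):
    # candidates: iterable of user ids.
    # gain(S): crowdsourcing quality achieved by a set S of users.
    # cost(u): payment demanded by user u.
    # integration_cost: fixed marginal cost of integrating one more user.
    selected = set()
    while True:
        best_user, best_net = None, 0.0
        for u in set(candidates) - selected:
            marginal = gain(selected | {u}) - gain(selected)
            net = marginal - cost(u) - integration_cost
            if net > best_net:
                best_user, best_net = u, net
        if best_user is None:  # no remaining user adds more value than they cost
            return selected
        selected.add(best_user)

# Toy usage: quality is the number of distinct regions covered.
coverage = {"u1": {"a", "b"}, "u2": {"b"}, "u3": {"c", "d"}}
users = select_users(
    coverage,
    gain=lambda S: len(set().union(*(coverage[u] for u in S)) if S else set()),
    cost=lambda u: 0.5,
    integration_cost=0.2,
)
print(users)  # {'u1', 'u3'}: u2 adds no new coverage once u1 is selected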

